Instructor Notes


Introduction


Intention: Step into the narrative

Set up narrative:

  • Important upcoming conference presentation
  • Time is ticking, the deadline is approaching way too fast
  • The talk is almost done, but, critically, we’re missing a picture for the title slide
  • It should contain three snowmen, and we’ve exhausted our credits for all generative AI models in previous chats with colleagues
  • => Ray tracing a scene to the rescue!
  • Issue: we need to try many different iterations of the scene to find exactly the right picture. How can we maximise the number of raytraced snowman images before our conference deadline?
  • Ray tracing is expensive, but luckily we have access to an HPC system

What we’re doing here:

  • Run the workflow example for the first time
  • Simple time measurement to get started (a minimal sketch follows this list)
  • Introduce different perspectives on efficiency
  • Core-hours and their correlation to costs in energy and money
  • Either set up the first Slurm job here or in the next episode
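
A minimal sketch of both steps, assuming a generic render command (./render_snowmen is a placeholder for whatever executable the lesson uses):

    # Measure wall-clock time of a single render on the login node or in an interactive allocation
    time ./render_snowmen

    # Or as a first Slurm job: one task, one core, a conservative time limit
    # (timing the blocking srun call also includes the time spent waiting in the queue)
    time srun --ntasks=1 --cpus-per-task=1 --time=00:10:00 ./render_snowmen

Core-hours then follow directly from the measurement: allocated cores times elapsed wall-clock time, e.g. 4 cores busy for 30 minutes cost 2 core-hours.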


TODO: Possible to highlight individual benefits of efficient jobs more?

It may also be good to address the “why should I care” perspective: you get more out of your fair share, and shorter iteration times mean more and better insight …



Instructor Note

TODO: Can we use time and date to find the issue with the subshells?

It is better to teach a way to find the issue than to stare at the script and think about it.
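
A minimal sketch of such an approach (script and command names are hypothetical):

    # Bracket suspect sections of the job script with timestamps
    date +"%T  before the render loop"
    ./render_snowmen &        # a background/subshell launch that may never be waited for
    date +"%T  after the render loop"

    # Or trace the whole script line by line with timestamps
    PS4='+ $(date +%T) ' bash -x ./run_renders.sh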



TODO: Maybe move this discussion elsewhere?

The quality and necessity of calculations are important factors in efficiency; redundant calculations, for example, are inefficient. This section may still be too much of a detour from the introduction, at least in its current form. It may also be a chance to shorten the episode.



TODO: Add actions / live-coding to sections below?

Maybe too much info vs. too little activity, currently?



Resource Requirements


Instructor Note

This discussion depends heavily on the management philosophy of the cluster available to the learners. Some examples (a sketch for inspecting the partition layout follows the list):

  • A partition with a high number of cores and a large amount of memory per node is probably intended for SMP calculations.
  • A partition with a lot of nodes that each have only a (relatively) small number of cores and memory is probably intended for MPI calculations.
  • A partition with powerful GPUs but only a small number of CPU cores is likely intended for jobs where the majority of the work is offloaded to the GPUs.
  • A partition with less powerful GPUs but more CPU cores and memory is likely intended for hybrid workloads.
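
One way to ground this, assuming a standard Slurm setup, is to let learners inspect the partition layout themselves:

    # Partitions with node count, cores per node, memory per node, and generic resources (e.g. GPUs)
    sinfo -o "%P %D %c %m %G"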


Instructor Note

The learners should realize that the per-user average they calculate here is very synthetic:

  • Many users do not use their full share of resources, which leaves room for others to use more.
  • The average we calculate is only an average over long periods of time. Short term you can usually use much more.
  • Not all users are equal. For example, if some research groups have contributed to the funding of the cluster, they should also get more resources than those who did not.
  • The world is not perfectly fair. Especially on larger clusters, HPC resources have to be requested via project proposals. Those who write more / better proposals can use more resources.
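
A worked toy example (all numbers are purely hypothetical): a cluster with 20,000 cores and 500 active users gives an average of 40 cores per user, i.e. roughly 350,000 core-hours per user over a year of 8,760 hours. The points above explain why actual usage rarely looks like this.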


Instructor Note

For the next section, the exact memory requirements depend on the cluster configuration (e.g., the MPI backends used). You might have to adapt these numbers for your local cluster to see the out-of-memory behavior.



Instructor Note

At this point you might want to point out to your audience that for certain applications it can be disastrous for performance to set the memory constraint too tightly. The reason is that the memory limit enforced by Slurm affects not only the resident set size of all processes in the job allocation, but also the memory used for caching (e.g., file pages). If the allocation runs out of memory for the cache, it will have to evict memory pages to disk, which can cause I/O operations and new memory allocations to block for longer than usual. If the application makes heavy use of this cache (e.g., repeated read and/or write operations on the same file) and the memory pressure in the allocation is high, you can even run into a cache-thrashing situation, where the job spends the majority of its time swapping data in and out of system memory and thus slows down to a crawl.



Instructor Note

This error message was generated with OpenMPI. Other MPI implementations might produce different messages.



Instructor Note

At this point you can present some scheduling strategies specific to your cluster. For the sake of time, you have likely reserved some resources for the course participants such that their jobs start instantly. Now would be a good time to show them the harsh reality of HPC scheduling on a contested partition and demonstrate that a major part of using an HPC cluster is waiting for your jobs to start.



Scheduler Tools


Intention: Introduce more basic performance metrics

Narrative:

  • Okay, so the first couple of jobs ran, but were they “quick enough”?
  • How many renders could I generate per minute/hour/day at the current utilization?
  • Our cluster uses certain hardware; maybe we didn’t use it as much as we could have?
  • But I couldn’t see all metrics (may be cluster dependent): energy, disk I/O, network I/O?

What we’re doing here:

  • What seff and sacct have to offer (a minimal sketch follows this list)
  • Introduce the simple relation to hardware: what do RSS, CPU, disk read/write, and their utilization mean?
  • Point out what’s missing from a complete picture
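
A minimal sketch of both commands, assuming a completed job with ID <jobid>:

    # Post-mortem summary: CPU efficiency, memory efficiency, wall-clock time
    seff <jobid>

    # Raw accounting data; adapt the field list to what the discussion needs
    sacct -j <jobid> --format=JobID,JobName,Elapsed,AllocCPUS,State,MaxRSS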

Note:

  • seff is an optional Slurm tool; it is not included in every Slurm installation by default. Therefore, make sure beforehand that this tool is available for the students.


Todo: give a clear recommendation of what to aim for?

Maybe 80% of job time?



Todo: potential issue?

Running this on our cluster with an added module load command resulted in 600 MB of required memory. My guess is that this is due to cgroups_v2 counting page caches towards the job as well, so loading the modules might spike the resource requirements, too.

Maybe we should play it safe and use a larger value in the following exercise. But we also want to teach not to overdo it, so it would be good to find a useful but generic compromise here.



Instructor Note

Note that the information sacct can provide depends on the information that Slurm stores on a given machine. By default this includes Billing, CPU, Energy, Memory, Node, FS/Disk, Pages, and VMem. Additional information is available only when Slurm is configured to collect it. These additional trackable resources are listed in AccountingStorageTRES. For I/O, fs/lustre is commonly useful, and for interconnect communication, ic/ofed is required. The setting AccountingStorageTRES is found in slurm.conf. Unfortunately, there doesn’t seem to be a way to get sacct to print the optional trackable resources.



Todo: extend the following list and examples to include CPU

To reconstruct the CPU utilization reported by seff:

  • TotalCPU/CPUTime should give the percentage (a sketch follows below)
  • Could also mention UserCPU and SystemCPU and discuss the difference? Together they add up to TotalCPU.

Maybe remove AveCPUFreq instead, or do we try to teach something specific about it?

Don’t forget to change the example output of all sacct calls in the following examples/challenges!
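
A minimal sketch of reconstructing the CPU utilization from sacct fields (<jobid> is a placeholder):

    # TotalCPU = UserCPU + SystemCPU; CPUTime = Elapsed * AllocCPUS
    sacct -j <jobid> --format=JobID,Elapsed,AllocCPUS,CPUTime,TotalCPU,UserCPU,SystemCPU
    # CPU utilization is then TotalCPU / CPUTime, e.g. 01:30:00 / 02:00:00 = 75%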



Give more insight into the collected sacct metrics (a minimal sacct call covering these fields follows the list)

  • AllocCPUS: number of CPU cores we requested for the job
  • MaxRSS = AveRSS: low fluctuation in memory; data is held throughout the whole job
  • MaxPages & AvePages: maximum and average number of page faults (pages loaded into memory on demand)
  • MaxDiskRead: data read from disk by the application, including what was read to start the application
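
A minimal sacct call covering these fields, as a sketch (<jobid> is a placeholder):

    sacct -j <jobid> --format=JobID,AllocCPUS,MaxRSS,AveRSS,MaxPages,AvePages,MaxDiskRead,MaxDiskWrite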


Scaling Study


Intention: Introduce/Recollect concept of Speedup and do a simple scaling study

Narrative:

  • We panic: maybe we need more resources to meet the deadline for our title picture!
  • Requesting resources on bigger systems requires a project proposal with an estimate of the resource demand
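
A minimal sketch of such a scaling study, assuming a job script (render_job.sh is a placeholder) that runs the render with the allocated number of tasks:

    # Submit the same render with increasing core counts and compare the elapsed times afterwards
    for n in 1 2 4 8 16; do
        sbatch --ntasks=$n --job-name=scale_$n render_job.sh
    done
    # Speedup on n cores: S(n) = T(1) / T(n); parallel efficiency: E(n) = S(n) / n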


Slurm Reservation and specific Hardware?

You may need to reserve a set of resources for the course, such that enough resources for the following exercises are available. This is especially important for --exclusive access.

In that case, show how to use --reservation=reservationname to submit jobs.
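
For example, assuming a reservation was set up for the course (the name is an example):

    # Submit into the course reservation
    sbatch --reservation=<reservationname> render_job.sh

    # Inspect the existing reservations
    scontrol show reservation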

It may be a good idea to point out the particular hardware of your cluster / partition to emphasize how many cores are available on a single node and when the scaling study goes beyond a single node.



Todo: show, don’t tell

There is an info dump below in this section.

Maybe be more specific about which overheads and how we can see them?



Performance Overview


Intention: Introduce third party tools for performance reports

Narrative:

  • The scaling study is done, scheduler tools are in use, and the project proposal is written and handed in
  • Maybe I can squeeze more out of my current system by trying to better understand how it behaves
  • Another colleague told us about performance measurement tools
  • We are learning more about our application
  • Aha, there IS room to optimize! Compile with vectorization (a hedged sketch follows this list)
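
A hedged sketch of what “compile with vectorization” could look like, assuming a GCC toolchain and a hypothetical source file:

    # -O3 enables auto-vectorization, -march=native targets the local CPU's SIMD units,
    # -fopt-info-vec reports which loops were vectorized
    gcc -O3 -march=native -fopt-info-vec -o render_snowmen render_snowmen.c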

What we’re doing here:

  • Get a complete picture
  • Introduce missing metrics / definitions, and popular representations of data, e.g. Roofline
  • Relate to hardware on the same level of detail


Pick a main tool

We go with three alternatives here; pick one and stick to it throughout your course, but highlight that there are alternatives and that learners may not have access to certain tools on every cluster.



ToDo: Connect Hardware to Performance Measurements

Introduce hardware at the same level of detail and with the same terms as the performance reports by ClusterCockpit, LinaroForge, etc., as soon as they appear. Only introduce what we need, to avoid an info dump. But point to additional information that gives a complete overview -> hpc-wiki!



ToDo: Clarify relation to hardware in this course

Maybe we should either focus on components (CPUs, memory, disk, accelerators, network cards) or functional entities (compute, data hierarchy, bandwidth, latency, parallel timelines)

We shouldn’t go into too much detail here. Define broad categories where performance can be good or bad. (calculations, data transfers, application logic, research objective (is the calculation meaningful?))

Reuse categories in the same order and fashion throughout the course, i.e. point out in what area a discovered inefficiency occurs.

Introduce detail about hardware later where it is needed, e.g. NUMA for pinning and hints.



Pinning


Intention: Go deeper in performance and hardware relationship

Narrative:

  • We get the feeling that hardware has a lot to offer, but the rabbit hole is deep!
  • What are the “dimensions” in which we can optimize the throughput of snowman pictures per hour?
  • Can we improve how the work maps to certain CPUs / Memory regions?

What we’re doing here:

  • Introduce pinning and Slurm hint options (a minimal sketch follows this list)
  • Relate to hardware effects
  • Use third party performance tools to observe effects!
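
A minimal sketch of the simple options, assuming a hybrid MPI+OpenMP run (process and thread counts are examples):

    # Bind each MPI rank to its own set of physical cores and keep OpenMP threads close together
    export OMP_PLACES=cores
    export OMP_PROC_BIND=close
    srun --ntasks=4 --cpus-per-task=12 --cpu-bind=cores --hint=nomultithread ./render_snowmen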


ToDo: Extract episode about pinning

Stick to simple options here. Put more complex options for pinning, hints, etc. into their own episode somewhere later in the course.

Pinning is an important part of job optimization, but requires some knowledge, e.g. about the hardware hierarchies in a cluster, NUMA, etc. So it should be done after we’ve introduced different performance reports and their perspective on hardware

Maybe point to the JSC pinning simulator and include similar diagrams as an independent “offline” version in this course.



Note: Login to the compute job

This is cluster specific. It can possibly be done in two ways:

  1. srun --pty --overlap --jobid=<jobid> /bin/bash
  2. Check which node the job runs on and log in to that node via SSH (a short sketch follows).
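
A short sketch of the second variant (<jobid> is a placeholder; SSH access to compute nodes may be restricted on your cluster):

    # Find the node(s) the job is running on, then log in
    squeue -j <jobid> -o "%N"
    ssh <nodename>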



TODO: Show an animation

  • Current behavior with overlapping threads on the same core.
  • Expected behavior when threads are pinned to separate cores.


Note

  • This exercise assumes the following hardware setup:
    • Dual-socket system (2 sockets, 48 cores per socket, 8 NUMA regions, 96 cores total).
    • Each MPI process can use multiple threads (-threads) for parallel execution.
  • The idea is to demonstrate oversubscription by launching more MPI processes than there are sockets or NUMA regions, or by over-allocating threads per domain.
  • You are free to adjust -n and -threads based on your cluster (a hedged example follows this list).
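
A hedged example of such an oversubscribed run on the setup above (-threads is the example application’s own option; the numbers are illustrative):

    # 16 MPI ranks on 8 NUMA regions, each rank running 12 threads on 6 allocated cores:
    # 192 threads compete for 96 cores
    srun --ntasks=16 --cpus-per-task=6 ./render_snowmen -threads 12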


How to identify a bottleneck?


Intention: Uncover one or two issues in the application

Narrative:

  • Okay, what’s slowest with creating snowman pictures?
  • Where does our system choke?

What we’re doing here:

  • What’s a bottleneck?
  • How can we identify a bottleneck?
  • “Online” and “after the fact” workflows of performance measurements (trace, accumulated results, attached to the process (live), or after it ran)
  • Point to additional resources of common performance/bottleneck issues, e.g. on hpc-wiki

Maybe something like this already occurred before in 4. Scaling Study, or 5. Performance Overview



Performance of Accelerators


Intention: Jump onto accelerator with the example application

Narrative:

  • The deadline is creeping up, only a few ways to go!
  • Hey, we have a GPU partition! Maybe this will help us speed up the process!

What we’re doing here:

  • What changes?
  • New metrics
  • Transfer to/from accelerator
  • Different options/requirements for the scheduler & performance measurement tools (a minimal GPU job sketch follows this list)
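
A minimal sketch of what changes on the scheduler side, assuming a partition named gpu and a job script render_gpu_job.sh (both names are examples):

    # Request one GPU in addition to CPU cores
    sbatch --partition=gpu --gres=gpu:1 --ntasks=1 --cpus-per-task=8 render_gpu_job.sh

    # Watch GPU utilization and memory while the job runs (NVIDIA GPUs)
    nvidia-smi --loop=5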


ToDo

Don’t mention FPGAs too much; maybe just add a note on what accelerators could be besides GPUs. The goal is to keep it simple and accessible and to focus on what’s common in most HPC systems these days.



ToDo

Explain how to decide where to run something: CPU vs. small GPU vs. high-end GPU. This touches on transfer overhead, etc.



Next Steps


Intention: Provide a roadmap learners could follow

Most important: enable users to translate from the example workload to their own code! Provide guidance on how to translate the learning goals and key points to their situation. Additionally, provide some info on where and how to dig deeper, if there is interest (application profiling, etc.)

All ideas in this episode may need to be reworked, since they were made with the outlook in mind, not so much to help learners transfer insight

Narrative:

  • Start with picture of beautiful title slide of the talk with the snowman picture
  • Next time we want to tackle the issue way in advance
  • Approach our raytracing application more systematically, such that we can get the title slide done much quicker
  • What could we do to dive deeper in optimizing the raytracer?
  • Where can we go from here?

What we’re doing here:

  • Learning important programming concepts (parallel programming on many levels)
  • Deeper application profiling & tools to use