Introduction
- Using a stopwatch like `time` gives you a first tool to log actual versus expected runtimes; it is also useful for carrying out runtime comparisons (see the example after this list).
- Which hardware component (CPU, memory/RAM, disk, network, etc.) poses the limiting factor depends on the nature of a particular application.
- Large-scale computing is power hungry, so we want to use the energy wisely. As shown in the next episodes, you have more control over job efficiency, and thus over the overall energy footprint, than you might expect.
- Computing job efficiency goes beyond individual gain in runtime as shared resources are used more effectively, that is, the ratio \(\frac{\text{useful work}}{\text{total energy expended}} \sim \frac{\text{number of users}}{\text{total energy expended}}\) improves.
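As a first stopwatch, the shell `time` keyword (or GNU `/usr/bin/time`) already gives useful numbers; `./my_app` below is a placeholder for your own program.

```bash
# Report real (wall-clock), user, and system time of a single run
time ./my_app

# GNU time allows custom output, e.g. only the elapsed seconds
/usr/bin/time -f "%e s" ./my_app
```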
Resource Requirements
- Your cluster might seem to have an enormous amount of computing resources, but these resources are a shared good. You should only use as much as you need.
- Resource requests are a promise to the scheduler to not use more than a specific amount of resources. If you break your promise to the scheduler and try to use more resources, terrible things will happen.
- Overstepping memory or time allocations will result in your job being terminated.
- Oversubscribing CPU cores will at best do nothing and at worst diminish performance.
- Finding the minimal resource requirements takes a bit of trial and error. Slurm collects a lot of useful metrics to aid you in this.
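A minimal Slurm batch-script sketch is shown below; the job name, resource values, and application are placeholders that you should adjust to what your job actually needs.

```bash
#!/bin/bash
#SBATCH --job-name=my_job       # hypothetical job name
#SBATCH --time=00:30:00         # wall-time request: the job is terminated beyond this
#SBATCH --mem=2G                # memory request: exceeding it terminates the job
#SBATCH --cpus-per-task=4       # request only as many cores as your program can use

srun ./my_app                   # placeholder application
```

Start with a generous but realistic request, then tighten it based on the metrics Slurm records for the finished job.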
Scheduler Tools
- Schedulers provide tools for a high-level view on our jobs, e.g. `sacct` and `seff` (see the examples after this list)
- Important basic performance metrics we can gather this way are:
  - CPU utilization, often as the fraction of time where the CPU was active over the elapsed time of the job
  - Memory utilization, often measured as Resident Set Size (RSS) and number of pages
- `sacct` can also provide metrics about disk I/O and energy consumption
- Metrics through `sacct` are accumulated for the whole job runtime and may be too broad for more specific insight
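As a sketch, both tools can be queried after a job has finished; `123456` is a placeholder job ID.

```bash
# Compact efficiency summary (CPU efficiency, memory efficiency, elapsed time)
seff 123456

# Selected accounting fields for the same job
sacct -j 123456 --format=JobID,Elapsed,TotalCPU,MaxRSS,State
```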
Scaling Study
- Jobs behave differently with increasing parallel resources and fixed or scaling workloads
- Scaling studies can help to quantitatively grasp this changing behavior
- Good working points are defined by configurations where more cores still provide sufficient speedup or improve quality through increasing workloads
- Amdahl’s law: speedup is limited by the serial fraction of a program
- Gustafson’s law: more resources for parallel processing still help if larger workloads can meaningfully contribute to project results (see the formulas below)
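For reference, with a serial fraction \(s\) and \(N\) parallel workers, Amdahl's law gives the speedup \(S(N) = \frac{1}{s + (1-s)/N}\), which is bounded by \(1/s\) no matter how large \(N\) becomes, while Gustafson's law gives the scaled speedup \(S(N) = s + (1-s)\,N\) for a workload that grows with \(N\).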
Performance Overview
- First things first, second things second, …
- Profiling, tracing: two ways of recording performance data
- Sampling, summation: two ways of collecting and aggregating measurements
- Different HPC centers may provide different approaches to this workflow
- Performance reports offer more insight into the job and application behavior
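The details differ between centers and toolchains; as one generic possibility on Linux systems, `perf stat` prints a short counter-based summary (cycles, instructions, cache behavior) for a run, with `./my_app` standing in for your own application.

```bash
# Quick hardware-counter overview of a single run
perf stat ./my_app
```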
Pinning
- Always check how pinning works: use verbose reporting (e.g. `--report-bindings`) to see how MPI processes and threads are mapped to cores and sockets.
- Documentation is your friend: for OpenMPI with `mpirun`, consult the manual: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php
- Know your hardware: understanding the number of sockets, cores per socket, and NUMA regions on your cluster helps you make effective binding decisions.
- Avoid oversubscription: assigning more threads or processes than available cores hurts performance, since it causes contention and idle waits.
- Recommended practice for OpenMPI: use `--bind-to core` along with `--map-by` to control placement and threads per process to maximize throughput (see the example below).
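A hypothetical OpenMPI launch combining these recommendations might look as follows; the process count and `./my_app` are placeholders.

```bash
# Bind each rank to a core, distribute ranks across sockets,
# and print the resulting binding so the placement can be verified
mpirun --report-bindings --bind-to core --map-by socket -np 4 ./my_app
```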
How to identify a bottleneck?
- General advice on the workflow
- Performance reports may provide an automated summary with recommendations
- Performance metrics can be categorized by the underlying hardware, e.g. CPU, memory, I/O, accelerators.
- Bottlenecks can show up directly, as a metric saturating at the physical limit of the hardware, or indirectly, as other metrics staying far below their physical limits.
- Interpreting bottlenecks is closely related to what the application is supposed to do.
- Relative measurements (comparing a baseline against a change)
- Make sure the system is quiescent, fix the CPU frequency and affinity, include warm-up runs, … (see the sketch after this list)
- Reproducibility: keep measurement scripts and job configurations under version control (e.g. with git)
- Scanning results for smoking guns
- Any best practices etc.
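A sketch of such a relative measurement is shown below, assuming two placeholder binaries `./app_baseline` and `./app_change` and an otherwise idle node.

```bash
# Compare a baseline against a change with pinned affinity,
# a warm-up run, and repeated timed measurements
for exe in ./app_baseline ./app_change; do
    taskset -c 0 "$exe" > /dev/null                           # warm-up run
    for run in 1 2 3; do                                      # repeated measurements
        /usr/bin/time -f "$exe: %e s" taskset -c 0 "$exe" > /dev/null
    done
done
```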
Performance of Accelerators
- Tools to measure GPU/FPGA performance of a job
- Common symptoms of GPU/FPGA problems
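For NVIDIA GPUs, a simple starting point is to sample utilization while the job runs; FPGA metrics require vendor-specific tooling instead.

```bash
# Print GPU utilization and memory use every 5 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5
```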
Next Steps
- There are many profilers; some are language-specific, others are vendor-specific, …
- Simple profile with exclusive resources
- Repeated measurements for reliability
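As a sketch of both points, a node can be reserved exclusively and a simple timed run repeated to check run-to-run variation; `./my_app` and the limits are placeholders.

```bash
# Reserve a node exclusively and repeat a timed run three times
sbatch --exclusive --time=00:15:00 --wrap \
  'for i in 1 2 3; do /usr/bin/time -f "run %e s" ./my_app; done'
```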