Introduction


  • Using a stopwatch like time gives you a first tool to log actual versus expected runtimes; it is also useful for runtime comparisons (see the timing sketch after this list).
  • Which hardware component (CPU, memory/RAM, disk, network, etc.) is the limiting factor depends on the nature of the particular application.
  • Large-scale computing is power hungry, so we want to use the energy wisely. As the next episodes show, you have more control over job efficiency, and thus over the overall energy footprint, than you might expect.
  • Improving job efficiency goes beyond an individual gain in runtime: shared resources are used more effectively, that is, the ratio \(\frac{useful\;work}{total\;energy\;expended}\) improves, and with it the number of users that can be served per unit of energy.
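
  A minimal timing sketch, assuming a hypothetical executable ./my_application; the shell built-in time reports wall-clock (real), user, and system time, and GNU time (if installed as /usr/bin/time) adds peak memory with -v:

      # Wall-clock, user, and system time of one run
      time ./my_application
      # More detail, including peak memory (maximum resident set size)
      /usr/bin/time -v ./my_application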

Resource Requirements


  • Your cluster might seem to have an enormous amount of computing resources, but these resources are a shared good. You should only use as much as you need.
  • Resource requests are a promise to the scheduler not to use more than a specific amount of resources (see the request sketch after this list). If you break that promise and try to use more, unpleasant things will happen.
    • Overstepping memory or time allocations will result in your job being terminated.
    • Oversubscribing CPU cores will at best do nothing and at worst diminish performance.
  • Finding the minimal resource requirements takes a bit of trial and error. Slurm collects a lot of useful metrics to aid you in this.
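
  A minimal request sketch, assuming a hypothetical executable ./my_application; the directives promise the scheduler an upper bound on runtime, cores, and memory:

      #!/bin/bash
      #SBATCH --time=00:10:00       # wall-time limit; the job is terminated if it runs longer
      #SBATCH --cpus-per-task=4     # CPU cores per task
      #SBATCH --mem=2G              # memory per node; exceeding it terminates the job
      srun ./my_application

  Start with a generous but realistic request, then use the collected metrics to trim it down.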

Scheduler Tools


  • Schedulers provide tools for a high-level view of our jobs, e.g. sacct and seff (see the examples after this list)
  • Important basic performance metrics we can gather this way are:
    • CPU utilization, often reported as the ratio of CPU-active time to the elapsed time of the job
    • Memory utilization, often measured as the resident set size (RSS) and the number of pages
  • sacct can also provide metrics about disk I/O and energy consumption
  • Metrics from sacct are accumulated over the whole job runtime and may be too coarse for more specific insight
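
  A minimal sketch of both tools, assuming a hypothetical job ID 123456; the chosen sacct fields are examples, not a complete list:

      # Per-step accounting: elapsed time, CPU time, and peak memory (MaxRSS)
      sacct -j 123456 --format=JobID,Elapsed,TotalCPU,MaxRSS,State
      # Condensed efficiency summary (CPU and memory utilization) for the finished job
      seff 123456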

Scaling Study


  • Jobs behave differently with increasing parallel resources and fixed or scaling workloads
  • Scaling studies can help to quantitatively grasp this changing behavior
  • Good operating points are configurations where additional cores still provide sufficient speedup or improve result quality through larger workloads
  • Amdahl’s law: speedup is limited by the serial fraction of a program
  • Gustafson’s law: more resources for parallel processing still help if larger workloads can meaningfully contribute to project results (see the formulas after this list)
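
  In formulas, with \(p\) the parallel fraction of the program and \(N\) the number of cores, the two laws read (Gustafson shown in its common scaled-speedup form):

  \[ S_{Amdahl}(N) = \frac{1}{(1-p) + \frac{p}{N}}, \qquad S_{Gustafson}(N) = (1-p) + pN \]

  For example, with \(p = 0.9\) Amdahl's law caps the speedup at \(1/(1-p) = 10\) no matter how many cores are added, whereas Gustafson's view allows further gains if the workload grows with \(N\).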

Performance Overview


  • First things first, second things second, …
  • Profiling and tracing are the two main ways to present measurements: aggregated summaries versus time-resolved event logs
  • Measurements are collected either by sampling or by summation of counters (see the sketch after this list)
  • Different HPC centers may provide different approaches to this workflow
  • Performance reports offer more insight into the job and application behavior
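
  A minimal sketch of the two collection styles, assuming Linux perf is available and a hypothetical executable ./my_application; your HPC center may provide different profilers with equivalent functionality:

      # Summation/counting: aggregate hardware counters over the whole run
      perf stat ./my_application
      # Sampling: periodically record call stacks, then inspect the hot spots
      perf record -g ./my_application
      perf report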

Pinning


  • Always check how pinning works
    Use verbose reporting (e.g., --report-bindings) to see how MPI processes and threads are mapped to cores and sockets.

  • Documentation is your friend
    For OpenMPI with mpirun, consult the manual: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php

  • Know your hardware
    Understanding the number of sockets, cores per socket, and NUMA regions on your cluster helps you make effective binding decisions.

  • Avoid oversubscription
    Assigning more threads or processes than available cores hurts performance — it causes contention and idle waits.

  • Recommended practice for OpenMPI
    Use --bind-to core along with --map-by to control rank placement and the number of cores per process, and choose the threads per process to maximize throughput (see the sketch after this list).
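
  A minimal placement sketch for a hybrid MPI+OpenMP job with Open MPI's mpirun, assuming a node with 2 sockets of 8 cores each and a hypothetical executable ./my_mpi_app; adjust ranks, threads, and the mapping to your hardware:

      export OMP_NUM_THREADS=4                          # threads per MPI rank
      mpirun -np 4 --map-by ppr:2:socket:PE=4 \
             --bind-to core --report-bindings ./my_mpi_app

  --report-bindings prints the resulting core masks so you can verify that ranks and threads land where you expect.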

How to identify a bottleneck?


  • General advice on the workflow
  • Performance reports may provide an automated summary with recommendations
  • Performance metrics can be categorized by the underlying hardware, e.g. CPU, memory, I/O, accelerators.
  • Bottlenecks can show up directly, as metrics saturating at the physical limits of the hardware, or indirectly, as other metrics staying far below those limits.
  • Interpreting bottlenecks is closely related to what the application is supposed to do.
  • Relative measurements (baseline vs. change) are the most robust approach (see the comparison sketch after this list)
    • Keep conditions identical: quiescent system, fixed CPU frequency and affinity, warm-up runs, …
    • Reproducibility: keep job scripts, parameters, and software versions under version control
  • Scan the results for smoking guns, i.e. metrics that are saturated or far from expectation
  • Follow any site- or tool-specific best practices
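
  A minimal comparison sketch under the assumptions above (quiescent node, fixed affinity), using hypothetical binaries ./app_baseline and ./app_change; the first iteration serves as a warm-up and can be discarded:

      for run in 1 2 3 4; do
          time taskset -c 0-3 ./app_baseline > /dev/null   # pin to the same cores every run
      done
      for run in 1 2 3 4; do
          time taskset -c 0-3 ./app_change > /dev/null
      done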

Performance of Accelerators


  • Tools to measure the GPU/FPGA performance of a job (see the sampling sketch after this list)
  • Common symptoms of GPU/FPGA problems
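
  A minimal monitoring sketch, assuming NVIDIA GPUs with nvidia-smi available on the compute node and a hypothetical executable ./my_gpu_app; FPGAs require vendor-specific tooling instead:

      # Sample GPU utilization and memory every 5 seconds in the background
      nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used \
                 --format=csv -l 5 > gpu_usage.csv &
      ./my_gpu_app
      kill %1        # stop the background sampling once the application has finished

  Low GPU utilization combined with high CPU activity often points to a host-side bottleneck such as data transfer or preprocessing.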

Next Steps


  • There are many profilers; some are language-specific, others are vendor-specific, …
  • Start with a simple profile on exclusive resources (see the sketch after this list)
  • Repeat measurements to judge their reliability
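
  A minimal sketch combining both points, assuming Slurm and a hypothetical executable ./my_application; --exclusive reserves the node so other jobs do not perturb the measurement, and your profiler of choice can wrap the command inside the loop:

      #!/bin/bash
      #SBATCH --exclusive            # no other jobs on the node during the measurement
      #SBATCH --time=00:30:00
      for run in 1 2 3 4 5; do       # repeat to judge run-to-run variation
          srun ./my_application
      done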