Introduction


  • Using a stopwatch like time gives you a first tool to log actual versus expected runtimes; it is also useful for runtime comparisons (see the timing sketch after this list).
  • Which hardware component (CPU, memory/RAM, disk, network, etc.) is the limiting factor depends on the nature of the particular application.
  • Large-scale computing is power hungry, so we want to use the energy wisely. As the next episodes show, you have more control over job efficiency, and thus over the overall energy footprint, than you might expect.
  • Improving job efficiency goes beyond an individual gain in runtime: shared resources are used more effectively, that is, the ratio \(\frac{useful\;work}{total\;energy\;expended}\) improves, and with it the number of users that can be served per unit of energy.
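
  A minimal timing sketch, assuming a hypothetical executable ./my_application; the shell built-in time reports wall-clock (real), user, and system time, and GNU time (if installed as /usr/bin/time) adds peak memory with -v:

      # Wall-clock, user, and system time of one run
      time ./my_application
      # More detail, including peak memory (maximum resident set size)
      /usr/bin/time -v ./my_application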

Resource Requirements


  • Your cluster might seem to have an enormous amount of computing resources, but these resources are a shared good. You should only use as much as you need.
  • Resource requests are a promise to the scheduler not to use more than a specific amount of resources (see the request sketch after this list). If you break that promise and try to use more, unpleasant things will happen.
    • Overstepping memory or time allocations will result in your job being terminated.
    • Oversubscribing CPU cores will at best do nothing and at worst diminish performance.
  • Finding the minimal resource requirements takes a bit of trial and error. Slurm collects a lot of useful metrics to aid you in this.
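
  A minimal request sketch, assuming a hypothetical executable ./my_application; the directives promise the scheduler an upper bound on runtime, cores, and memory:

      #!/bin/bash
      #SBATCH --time=00:10:00       # wall-time limit; the job is terminated if it runs longer
      #SBATCH --cpus-per-task=4     # CPU cores per task
      #SBATCH --mem=2G              # memory per node; exceeding it terminates the job
      srun ./my_application

  Start with a generous but realistic request, then use the collected metrics to trim it down.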

Scheduler Tools


  • Schedulers provide tools for a high-level view of our jobs, e.g. sacct and seff (see the examples after this list)
  • Important basic performance metrics we can gather this way are:
    • CPU utilization, often reported as the ratio of CPU-active time to the elapsed time of the job
    • Memory utilization, often measured as the resident set size (RSS) and the number of pages
  • sacct can also provide metrics about disk I/O and energy consumption
  • Metrics from sacct are accumulated over the whole job runtime and may be too coarse for more specific insight
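
  A minimal sketch of both tools, assuming a hypothetical job ID 123456; the chosen sacct fields are examples, not a complete list:

      # Per-step accounting: elapsed time, CPU time, and peak memory (MaxRSS)
      sacct -j 123456 --format=JobID,Elapsed,TotalCPU,MaxRSS,State
      # Condensed efficiency summary (CPU and memory utilization) for the finished job
      seff 123456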

Scaling Study


  • Jobs behave differently with increasing parallel resources and fixed or scaling workloads
  • Scaling studies can help to quantitatively grasp this changing behavior
  • Good operating points are configurations where additional cores still provide sufficient speedup or improve result quality through larger workloads
  • Amdahl’s law: speedup is limited by the serial fraction of a program
  • Gustafson’s law: more resources for parallel processing still help if larger workloads can meaningfully contribute to project results (see the formulas after this list)
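
  In formulas, with \(p\) the parallel fraction of the program and \(N\) the number of cores, the two laws read (Gustafson shown in its common scaled-speedup form):

  \[ S_{Amdahl}(N) = \frac{1}{(1-p) + \frac{p}{N}}, \qquad S_{Gustafson}(N) = (1-p) + pN \]

  For example, with \(p = 0.9\) Amdahl's law caps the speedup at \(1/(1-p) = 10\) no matter how many cores are added, whereas Gustafson's view allows further gains if the workload grows with \(N\).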

Performance Overview


  • First things first, second things second, …
  • Profiling and tracing are the two main ways to present measurements: aggregated summaries versus time-resolved event logs
  • Measurements are collected either by sampling or by summation of counters (see the sketch after this list)
  • Different HPC centers may provide different approaches to this workflow
  • Performance reports offer more insight into the job and application behavior
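
  A minimal sketch of the two collection styles, assuming Linux perf is available and a hypothetical executable ./my_application; your HPC center may provide different profilers with equivalent functionality:

      # Summation/counting: aggregate hardware counters over the whole run
      perf stat ./my_application
      # Sampling: periodically record call stacks, then inspect the hot spots
      perf record -g ./my_application
      perf report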

Pinning


  • Always check how pinning works
    Use verbose reporting (e.g., --report-bindings) to see how MPI processes and threads are mapped to cores and sockets.

  • Documentation is your friend
    For OpenMPI with mpirun, consult the manual: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php

  • Know your hardware
    Understanding the number of sockets, cores per socket, and NUMA regions on your cluster helps you make effective binding decisions.

  • Avoid oversubscription
    Assigning more threads or processes than available cores hurts performance — it causes contention and idle waits.

  • Recommended practice for OpenMPI
    Use --bind-to core along with --map-by to control rank placement and the number of cores per process, and choose the threads per process to maximize throughput (see the sketch after this list).
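
  A minimal placement sketch for a hybrid MPI+OpenMP job with Open MPI's mpirun, assuming a node with 2 sockets of 8 cores each and a hypothetical executable ./my_mpi_app; adjust ranks, threads, and the mapping to your hardware:

      export OMP_NUM_THREADS=4                          # threads per MPI rank
      mpirun -np 4 --map-by ppr:2:socket:PE=4 \
             --bind-to core --report-bindings ./my_mpi_app

  --report-bindings prints the resulting core masks so you can verify that ranks and threads land where you expect.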

How to identify a bottleneck?


  • General advice on the workflow
  • Performance reports may provide an automated summary with recommendations
  • Performance metrics can be categorized by the underlying hardware, e.g. CPU, memory, I/O, accelerators.
  • Bottlenecks can show up directly, as metrics saturating at the physical limits of the hardware, or indirectly, as other metrics staying far below those limits.
  • Interpreting bottlenecks is closely related to what the application is supposed to do.
  • Relative measurements (baseline vs. change) are the most robust approach (see the comparison sketch after this list)
    • Keep conditions identical: quiescent system, fixed CPU frequency and affinity, warm-up runs, …
    • Reproducibility: keep job scripts, parameters, and software versions under version control
  • Scan the results for smoking guns, i.e. metrics that are saturated or far from expectation
  • Follow any site- or tool-specific best practices
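
  A minimal comparison sketch under the assumptions above (quiescent node, fixed affinity), using hypothetical binaries ./app_baseline and ./app_change; the first iteration serves as a warm-up and can be discarded:

      for run in 1 2 3 4; do
          time taskset -c 0-3 ./app_baseline > /dev/null   # pin to the same cores every run
      done
      for run in 1 2 3 4; do
          time taskset -c 0-3 ./app_change > /dev/null
      done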

Performance of Accelerators


  • Tools to measure the GPU/FPGA performance of a job (see the sampling sketch after this list)
  • Common symptoms of GPU/FPGA problems
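
  A minimal monitoring sketch, assuming NVIDIA GPUs with nvidia-smi available on the compute node and a hypothetical executable ./my_gpu_app; FPGAs require vendor-specific tooling instead:

      # Sample GPU utilization and memory every 5 seconds in the background
      nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used \
                 --format=csv -l 5 > gpu_usage.csv &
      ./my_gpu_app
      kill %1        # stop the background sampling once the application has finished

  Low GPU utilization combined with high CPU activity often points to a host-side bottleneck such as data transfer or preprocessing.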

Next Steps


  • There are many profilers; some are language-specific, others are vendor-specific, …
  • Start with a simple profile on exclusive resources (see the sketch after this list)
  • Repeat measurements to judge their reliability
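
  A minimal sketch combining both points, assuming Slurm and a hypothetical executable ./my_application; --exclusive reserves the node so other jobs do not perturb the measurement, and your profiler of choice can wrap the command inside the loop:

      #!/bin/bash
      #SBATCH --exclusive            # no other jobs on the node during the measurement
      #SBATCH --time=00:30:00
      for run in 1 2 3 4 5; do       # repeat to judge run-to-run variation
          srun ./my_application
      done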