Introduction
- Using a stopwatch like `time` gives you a first tool to log actual versus expected runtimes; it is also useful for carrying out runtime comparisons (see the example after this list).
- Which hardware component (CPU, memory/RAM, disk, network, etc.) poses the limiting factor depends on the nature of a particular application.
- Large-scale computing is power hungry, so we want to use the energy wisely. As shown in the next episodes, you have more control over job efficiency, and thus over the overall energy footprint, than you might expect.
- Computing job efficiency goes beyond individual gain in runtime as shared resources are used more effectively, that is, the ratio \(\frac{\text{useful work}}{\text{total energy expended}} \sim \frac{\text{number of users}}{\text{total energy expended}}\) improves.
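As a first stopwatch, the shell `time` keyword (or GNU `/usr/bin/time`) already gives useful numbers; `./my_app` below is a placeholder for your own program.

```bash
# Report real (wall-clock), user, and system time of a single run
time ./my_app

# GNU time allows custom output, e.g. only the elapsed seconds
/usr/bin/time -f "%e s" ./my_app
```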
Resource Requirements
- Your cluster might seem to have an enormous amount of computing resources, but these resources are a shared good. You should only use as much as you need.
- Resource requests are a promise to the scheduler to not use more than a specific amount of resources. If you break your promise to the scheduler and try to use more resources, terrible things will happen.
- Overstepping memory or time allocations will result in your job being terminated.
- Oversubscribing CPU cores will at best do nothing and at worst diminish performance.
- Finding the minimal resource requirements takes a bit of trial and error. Slurm collects a lot of useful metrics to aid you in this.
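A minimal Slurm batch-script sketch is shown below; the job name, resource values, and application are placeholders that you should adjust to what your job actually needs.

```bash
#!/bin/bash
#SBATCH --job-name=my_job       # hypothetical job name
#SBATCH --time=00:30:00         # wall-time request: the job is terminated beyond this
#SBATCH --mem=2G                # memory request: exceeding it terminates the job
#SBATCH --cpus-per-task=4       # request only as many cores as your program can use

srun ./my_app                   # placeholder application
```

Start with a generous but realistic request, then tighten it based on the metrics Slurm records for the finished job.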
Scheduler Tools
- Schedulers provide tools for a high-level view on our jobs, e.g. `sacct` and `seff` (see the examples after this list)
- Important basic performance metrics we can gather this way are:
  - CPU utilization, often as the fraction of time where the CPU was active over the elapsed time of the job
  - Memory utilization, often measured as Resident Set Size (RSS) and number of pages
- `sacct` can also provide metrics about disk I/O and energy consumption
- Metrics through `sacct` are accumulated for the whole job runtime and may be too broad for more specific insight
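As a sketch, both tools can be queried after a job has finished; `123456` is a placeholder job ID.

```bash
# Compact efficiency summary (CPU efficiency, memory efficiency, elapsed time)
seff 123456

# Selected accounting fields for the same job
sacct -j 123456 --format=JobID,Elapsed,TotalCPU,MaxRSS,State
```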
Scaling Study
- Jobs behave differently with increasing parallel resources and fixed or scaling workloads
- Scaling studies can help to quantitatively grasp this changing behavior
- Good working points are defined by configurations where more cores still provide sufficient speedup or improve quality through increasing workloads
- Amdahl’s law: speedup is limited by the serial fraction of a program
- Gustafson’s law: more resources for parallel processing still help if larger workloads can meaningfully contribute to project results (see the formulas below)
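For reference, with a serial fraction \(s\) and \(N\) parallel workers, Amdahl's law gives the speedup \(S(N) = \frac{1}{s + (1-s)/N}\), which is bounded by \(1/s\) no matter how large \(N\) becomes, while Gustafson's law gives the scaled speedup \(S(N) = s + (1-s)\,N\) for a workload that grows with \(N\).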
Performance Overview
- First things first, second things second, …
- Profiling, tracing: two ways of recording performance data
- Sampling, summation: two ways of collecting and aggregating measurements
- Different HPC centers may provide different approaches to this workflow
- Performance reports offer more insight into the job and application behavior
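The details differ between centers and toolchains; as one generic possibility on Linux systems, `perf stat` prints a short counter-based summary (cycles, instructions, cache behavior) for a run, with `./my_app` standing in for your own application.

```bash
# Quick hardware-counter overview of a single run
perf stat ./my_app
```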
Pinning
- Always check how pinning works: use verbose reporting (e.g. `--report-bindings`) to see how MPI processes and threads are mapped to cores and sockets.
- Documentation is your friend: for OpenMPI with `mpirun`, consult the manual: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php
- Know your hardware: understanding the number of sockets, cores per socket, and NUMA regions on your cluster helps you make effective binding decisions.
- Avoid oversubscription: assigning more threads or processes than available cores hurts performance, since it causes contention and idle waits.
- Recommended practice for OpenMPI: use `--bind-to core` along with `--map-by` to control placement and threads per process to maximize throughput (see the example below).
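A hypothetical OpenMPI launch combining these recommendations might look as follows; the process count and `./my_app` are placeholders.

```bash
# Bind each rank to a core, distribute ranks across sockets,
# and print the resulting binding so the placement can be verified
mpirun --report-bindings --bind-to core --map-by socket -np 4 ./my_app
```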
How to identify a bottleneck?
- General advice on the workflow
- Performance reports may provide an automated summary with recommendations
- Performance metrics can be categorized by the underlying hardware, e.g. CPU, memory, I/O, accelerators.
- Bottlenecks can show up directly, as a metric saturating at the physical limit of the hardware, or indirectly, as other metrics staying far below their physical limits.
- Interpreting bottlenecks is closely related to what the application is supposed to do.
- Relative measurements (comparing a baseline against a change)
- Make sure the system is quiescent, fix the CPU frequency and affinity, include warm-up runs, … (see the sketch after this list)
- Reproducibility: keep measurement scripts and job configurations under version control (e.g. with git)
- Scanning results for smoking guns
- Any best practices etc.
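A sketch of such a relative measurement is shown below, assuming two placeholder binaries `./app_baseline` and `./app_change` and an otherwise idle node.

```bash
# Compare a baseline against a change with pinned affinity,
# a warm-up run, and repeated timed measurements
for exe in ./app_baseline ./app_change; do
    taskset -c 0 "$exe" > /dev/null                           # warm-up run
    for run in 1 2 3; do                                      # repeated measurements
        /usr/bin/time -f "$exe: %e s" taskset -c 0 "$exe" > /dev/null
    done
done
```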
Performance of Accelerators
- Tools to measure GPU/FPGA performance of a job
- Common symptoms of GPU/FPGA problems
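For NVIDIA GPUs, a simple starting point is to sample utilization while the job runs; FPGA metrics require vendor-specific tooling instead.

```bash
# Print GPU utilization and memory use every 5 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5
```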
Next Steps
- There are many profilers; some are language-specific, others are vendor-specific, …
- Simple profile with exclusive resources
- Repeated measurements for reliability
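As a sketch of both points, a node can be reserved exclusively and a simple timed run repeated to check run-to-run variation; `./my_app` and the limits are placeholders.

```bash
# Reserve a node exclusively and repeat a timed run three times
sbatch --exclusive --time=00:15:00 --wrap \
  'for i in 1 2 3; do /usr/bin/time -f "run %e s" ./my_app; done'
```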