Content from Introduction
Overview
Questions
- Why should I care about my job's performance?
- How is efficiency defined?
- How do I start measuring?
Objectives
After completing this episode, participants should be able to …
- Understand the benefits of efficient jobs.
- Identify which areas of computer hardware may affect performance.
- Use the `time` command for a first measurement.
Why Care About Performance?
Reasons from the perspective of learners (see profiles)
- Faster output, shorter iteration-/turn-around-time
- More research per time
- Opportunity costs when “accepting” worse performance
- Trade-off between time spent on optimizing vs. doing actual research
- Potentially less wasted energy
- Core-h / device-h directly correlate to wattage
- Production of hardware and its operation costs energy (even when idle)
- => Buy as little hardware as possible and use it as much as you can, if you have meaningful computations
- Applying for HPC resources in a larger center
- Need estimate for expected resources
- Jobs need to be sufficiently efficient
- Is provided hardware a good fit for the applied computational workload?
Exercise: Why care about performance?
Maybe true/false statements as a warm-up exercise? E.g. something like:
- Better performance allows for more research
- Application performance matters less on new computer hardware
- Computations directly correlate to energy consumption
- Good performance does not matter on my own hardware
- True, shorter turn-around times, more results per time, more Nobel Prizes per second!
- False, new hardware might make performance issues less pressing, but it is still important (opportunity costs, wasted energy, shared resources)
- True, device-hours consume energy (variable depending on utilized features, amount of communication, etc.), but there is a direct correlation to wattage
- False, performance is especially important on shared systems, but energy and opportunity costs also affect researchers on their own hardware and exclusive allocations.
Discussion: How important is performance?
- Did you change your opinion about the importance of good performance?
- How much time do you want to/can you spend on assessing your job's performance?
What is Efficient?
Challenge: Many perspectives on Efficiency
Write down your current definition or understanding of efficiency with respect to HPC jobs. (Shared document?)
(Exercise as think, pair, share?)
E.g. shortest time from submission to job completion.
Many definitions of efficiency (see below)
Discussion: Which definition should we take?
Are these perspectives equally useful? Is one particularly suited to our discussion?
Many definitions of efficiency (to be ordered and discussed):
- Minimal wall-/human-time of the job
- Minimal compute-time
- Minimal time-to-solution (like wall-time above, but including queue wait times and potentially multiple jobs for combined results)
- Minimal cost in terms of energy/someones money
- With regard to opportunity costs: amount of research per job (including waiting times, computation time, and slowdown through longer iteration cycles (turn-around times))
Assuming only “useful” computations, no redundancies.
Which definition do we refer to by default in the following episodes? (Do we need a default?)
How Does Performance Relate to Hardware?
(Following this structure throughout the course, trying to understand the performance in these terms)
Broad dimensions of performance:
- CPU (Front- and Backend, FLOPS)
- Frontend: decoding instructions, branch prediction, pipeline
- Backend: getting data from memory, cache hierarchy & alignment
- Raw calculations
- Vectorization
- Out-of-order execution
- Accelerators (e.g. GPUs)
- More calculations
- Offloading
- Memory & communication models
- Memory (data hierarchy)
- Working memory, reading data from/to disk
- Bandwidth of data
- I/O (broader data hierarchy: disk, network)
- Stored data
- Local disk (caching)
- Parallel fs (cluster-wide)
- MPI communication
- Parallel timeline (synchronization, etc.)
- Application logic
Exercise: Match application behavior to hardware
Which part of the computer hardware may become an issue for the following application patterns:
- Calculating matrix multiplications
- Reading data from processes on other computers
- Calling many different functions from many equally likely if/else branches
- Writing very large files (TB)
- Comparing many different strings to see if they match
- Constructing a large simulation model
- Reading thousands of small files for each iteration
Maybe not the best questions, also missing something for accelerators.
- CPU (FLOPS) and/or Parallel timeline
- I/O (network)
- CPU (Front-End)
- I/O (disk)
- (?) CPU-Backend, getting strings through the cache?
- Memory (size)
- I/O (disk)
Setting the Baseline
Absolute performance is hard to determine:
- In comparison to current hardware (theoretical limits vs. real usage)
- Still important if performance is a long way from theoretical limits
- Always limited by something
During optimization, performance is often expressed relative to a baseline measurement. Define "baseline": a comparison between before and after a change.
Exercise: Baseline Measurement with `time`
Simple measurement with `time` of an example application.
Maybe also with `hyperfine`?
Observe system, user, and wall time.
Repeat measurements around 3-10 times to reduce noise:
- Average time
- Minimum (observed best case)
Maybe make a simple/obvious change to compare against the baseline. How much relative improvement?
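A minimal sketch of such a repeated measurement, assuming a hypothetical example program `./my_app` and GNU time at `/usr/bin/time`:

```bash
# Run the example application several times and record the timings.
# `./my_app` is a placeholder for the actual example workload.
for i in 1 2 3 4 5; do
    /usr/bin/time -f "run $i: real %e s, user %U s, sys %S s" ./my_app
done
```

The shell built-in `time ./my_app` reports the same three numbers as real, user, and sys.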
Discuss meaning of system, user, wall-time. Relate to efficiencies (minimal wall-time vs. minimal compute-time)
Define core-h. Device usage for X seconds correlates to estimated power draw. Real power usage depends on:
- Utilized features of the device (some more power-hungry than others)
- Amount of data movement through memory, disk, network
- Cooling (rule of thumb: factor \(\times 2\))
Exercise: Core-h and Energy consumption
- Figure out your current hardware (documentation, cpuinfo, web search, LLM)
- Calculate core-h for above test (either including or excluding repetitions)
- Estimate power usage with TDP
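A back-of-the-envelope sketch with made-up numbers (8 cores for 120 s on a hypothetical 16-core CPU with a 125 W TDP):

```bash
cores=8; seconds=120              # hypothetical measurement from the exercise above
tdp_watts=125; cores_per_cpu=16   # hypothetical CPU specs; look up your own
core_h=$(echo "$cores * $seconds / 3600" | bc -l)
# Rough estimate: assume every core draws an equal share of the TDP.
energy_wh=$(echo "$core_h * $tdp_watts / $cores_per_cpu" | bc -l)
echo "core-h: $core_h, estimated energy: $energy_wh Wh"
```

With the cooling rule of thumb from above, double the result for a whole-facility estimate.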
Summary
Exercise: Recollecting efficiency
Exercise to raise the question whether the example workload is efficient or not. Do we know yet? -> No, we can only tell how long it takes, estimate how much time/resources it consumes, and whether a change yields a relative improvement.
Key Points
- Job performance affects you as a user
- Different perspectives on efficiency
- Definitions: wall/human-time, compute-time, time-to-solution, energy (costs/environment), money, opportunity cost (less research output)
- Relationship between performance and computer hardware
- Absolute vs. relative performance measurements
- `time` to establish a baseline
- Estimating energy consumption
Content from Resource Requirements
Overview
Questions
- How many resources should I request initially?
- What options does the scheduler give to request resources?
- How do I know if they are used well?
- How large is my HPC cluster?
Objectives
After completing this episode, participants should be able to …
- Identify the size of their jobs in relation to the HPC system.
- Request the right amount of resources from the scheduler.
- Change the parameters if the application's resource requirements change.
Starting Somewhere
Didactic path: I have no idea how many resources to ask for -> just guess and start with some combinations. Next, identify slow or failed runs (OOM, time limit) and choose the best. What does that say about efficiency?
Exercise: Starting Somewhere
- Run job with a time limit of 1 minute -> Trigger the time limit. What's a good time limit for our task?
- Run job with few cores, but too much memory/core -> Trigger OOM. What’s a good memory limit for our task?
- Run job with requesting way too many cores -> Endless waiting or not accepted due to account limits. What’s a good CPU limit for our task?
- `squeue` to learn about scheduling issues/reasons
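A sketch of a deliberately under-provisioned batch script for this exercise; `job.sh` and `./my_app` are hypothetical placeholders:

```bash
#!/bin/bash
#SBATCH --time=00:01:00       # deliberately short: should hit the time limit
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=100M    # deliberately small: may trigger an OOM kill
srun ./my_app
```

After submission, `squeue -u $USER` shows the job state and, for pending jobs, the reason code.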
Summarize dimensions in which a job has to be sized correctly (time, cores, memory, GPUs, …).
Compared to the HPC System
- What's the relationship between your job and the existing hardware of the system?
- What hardware does your HPC system offer?
- Documentation and Slurm commands
- Is my job large or small?
- What’s considered large, medium, small? Maybe as percentage of whole system?
- Issues of large jobs: long waiting times
- Issues of many (thousands of) small jobs: scheduler overhead, per-job startup costs
- How many resources are currently free?
- How long do I have to wait? (looking up scheduler estimate + apply common sense)
Exercise: Comparing to the system
- `sinfo` to learn about partitions and free resources
- `scontrol` to learn about nodes in those partitions
- `lscpu` and `cat /proc/cpuinfo`
- Submit a job with a reasonable number of resources and use `squeue` and/or `scontrol show job` to learn about Slurm's estimated start time
Answer questions about number and type of CPUs, HT/SMT, memory/core, time limits.
Summarize with a well-sized job that's a good start for the example.
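A possible set of commands for the exercise above; `<nodename>` and `job.sh` are placeholders:

```bash
sinfo -o "%P %D %c %m %l"      # partitions: node count, cores/node, memory, time limit
scontrol show node <nodename>  # details of a single node
lscpu                          # CPU model, core count, threads per core (HT/SMT)
sbatch job.sh                  # submit, then check the estimated start time:
squeue -u "$USER" --start
```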
Requesting Resources
More detail about what Slurm provides (among others):
- `-t, --time=<time>`: Time limit of the job
- `-N, --nodes`: Number of nodes
- `-n, --ntasks`: Number of tasks/processes
- `-c, --cpus-per-task`: Number of CPUs per task/process
- `--threads-per-core=<threads>`: Select nodes with at least the given number of threads per core
- `--mem=<size>[units]`: Memory, but can also be given as `--mem-per-cpu`, …
- `-G, --gpus`: Number of GPUs
- `--exclusive`

Binding:
- `--mem-bind=[{quiet|verbose},]<type>`
- `-m, --distribution={*|block|cyclic|arbitrary|plane=<size>}[:{*|block|cyclic|fcyclic}[:{*|block|cyclic|fcyclic}]][,{Pack|NoPack}]`
- `--hint=`: Hints for CPU-bound (`compute_bound`) and memory-bound (`memory_bound`) applications, but also `multithread`, `nomultithread`
- `--cpu-bind=[{quiet|verbose},]<type>` (`srun`)
- Mapping of application <-> job resources
Maybe discuss:
- Minimizing/maximizing involved number of nodes
- Shared nodes: longer waiting times until a whole node is empty
- Minimizing/maximizing the number of nodes minimizes/maximizes inter-node communication
- Different wait times for certain configurations
- Few tasks on many shared nodes might schedule faster than many tasks on few exclusive nodes.
- What is a task / process – Difference?
- Requesting more memory per core than the node's memory/core ratio -> idle cores (see the sketch below)
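A worked example of that last point, using made-up node specs:

```bash
# Hypothetical node: 64 cores and 256 GB of memory => 4 GB per core.
# Requesting 16 GB per CPU exceeds that ratio, so Slurm may reserve
# additional cores per task just to back the memory request,
# leaving those cores idle.
sbatch --ntasks=4 --mem-per-cpu=16G job.sh
```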
Changing Requirements
- Motivate why requirements might change (resolution in simulation, more data, more complex model, …)
- How to change requested resources if application should run differently? (e.g. more processes)
- Considerations & estimates for
- changing compute-time (more/less workload)
- changing memory requirements (smaller/larger model)
- changing number of processes / nodes
- changing I/O -> more/less or larger/smaller files
Exercise: Changing requirements
- Walk through how to estimate increase in CPU cores / memory, etc.
- Run previous job with larger workload
- Check if and how it behaves differently than the smaller job
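A sketch of resubmitting with scaled-up requests; all numbers are hypothetical:

```bash
# Previous run: 8 tasks, 2 GB per CPU, 30 minutes.
# Larger workload: scale the requests and compare with the smaller job.
sbatch --ntasks=16 --mem-per-cpu=4G --time=01:00:00 job.sh
```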
Summary
Discussion: Recollection
Circle back to efficiency. What’s considered good/efficient in context of job requirements and parameters?
Key Points
- Estimate resource requirements and request them in terms the scheduler understands
- Be aware of your job in relation to the whole system (available hardware, size)
- Aim for a good match between requested and utilized resources
- Optimal time-to-solution by minimizing batch queue times and maximizing parallelism
Content from Scaling Study
Overview
Questions
- How can I decide the amount of resources I should request for my job?
- How do I know how my application behaves at different scales?
Objectives
After completing this episode, participants should be able to …
- Perform a simple scaling study for a given application.
- Identify good working points for the job configuration.
What do we look at?
- Amdahl's vs. Gustafson's law / strong and weak scaling
- Wall time, speedup, efficiency
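For reference, the standard textbook definitions: with \(T(n)\) the wall time on \(n\) cores and \(p\) the parallelizable fraction of the program,

\[ S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}, \qquad S_{\mathrm{Amdahl}}(n) = \frac{1}{(1 - p) + p/n} \]

Strong scaling varies \(n\) at a fixed problem size (Amdahl); weak scaling grows the problem with \(n\) (Gustafson).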
Discussion: What dimensions can we look at?
- CPUs
- Nodes
- Workload/problem size
- Define example payload
- Long enough to be significant
- Short enough to be feasible for a quick study
- Identify dimension for scaling study, e.g.
- number of processes (on a single node)
- number of processes (across nodes)
- number of nodes involved (network-communication boundary)
- size of workload
- Decide on number of processes across nodes, fixed workload size
- Choose limits (e.g. 1, 2, 4, … cores), within reasonable size for given Cluster
- Beyond nodes? Set to one node?
Parameter Scan
- Take measurements
- Use `time` and repeat measurements (something like 3 or 10)
- Vary the scaling parameter
Exercise: Run the Example with different `-n`
- 1, 2, 4, 8, 16, 32, … cores and same workload
- Take `time` measurements (ideally multiple and with `--exclusive`)
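One possible way to script the scan; the application and output naming are placeholders, and `/usr/bin/time` is assumed to be GNU time:

```bash
# Submit one job per core count; each writes its timing to its own file.
for n in 1 2 4 8 16 32; do
    sbatch --exclusive --ntasks="$n" --output="scaling_${n}.out" \
           --wrap "/usr/bin/time -f '%e s elapsed' srun ./my_app"
done
```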
Analyzing results
Exercise: Plot the scaling
- Plot the measured `time` against the number of cores
- Calculate speedup with respect to the baseline with 1 core
- What's a good working point? How do we identify it?
- Overhead
- Efficiency: not wasting cores if adding them doesn't do much
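A minimal sketch for computing speedup and efficiency from the collected timings, assuming a hypothetical file `results.txt` with lines of the form `<cores> <seconds>`, sorted by core count:

```bash
# First line (1 core) is the baseline: speedup = T(1)/T(n), efficiency = speedup/n.
awk 'NR == 1 { base = $2 }
     { printf "%3d cores: speedup %5.2f, efficiency %4.2f\n", $1, base/$2, base/($2*$1) }' results.txt
```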
Summary
What’s a good working point for our example (at a given workload)?
Key Points
- Jobs behave differently with varying resources and workloads
- A scaling study is necessary to prove a certain behavior of the application
- Good working points are defined by regions where additional cores still provide sufficient speedup and overhead costs do not yet dominate
Content from Scheduler Tools
Overview
Questions
- What information can the scheduler provide about my job's performance?
- What’s the meaning of the collected metrics?
Objectives
After completing this episode, participants should be able to …
- Explain basic performance metrics.
- Use tools provided by the scheduler to collect basic performance metrics of their jobs.
Scheduler Tools
- `sacct`
  - `MaxRSS`, `AvgRSS`
  - `MaxPages`, `AvgPages`
  - `AvgCPU`, `AllocCPUS`
  - `Elapsed`
  - `MaxDiskRead`, `AvgDiskRead`
  - `MaxDiskWrite`, `AvgDiskWrite`
  - energy
- `seff`
  - Utilization of time allocation
  - Utilization of allocated CPUs (is 100% <=> efficient? Not if calculations are redundant etc.!)
  - Utilization of allocated memory
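An example query for a finished job; `<jobid>` is a placeholder:

```bash
# A selection of the fields listed above:
sacct -j <jobid> --format=JobID,Elapsed,AllocCPUS,AvgCPU,MaxRSS,MaxDiskRead,MaxDiskWrite
seff <jobid>   # condensed CPU and memory efficiency summary
```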
Shortcomings
- Not enough info about e.g. I/O, no timeline of metrics during job execution, …
- I/O may be available, but likely only for local disks
- => no parallel FS
- => no network
- Energy demand may be missing or wrong
- Depends on available features
- Doesn’t estimate energy for network switches, cooling, etc.
- => try other tools! (motivation for subsequent episodes)
Summary
Key Points
- `sacct` and `seff` for first results
- Small scaling study, maximum of X% overhead is "still good" (larger resource req. vs. speedup)
- Getting a feel for scale of the HPC system, e.g. “is 64 cores a lot?”, how large is my job in comparison?
- CPU and Memory Utilization
- Core-h and relationship to power efficiency
Content from Workflow of Performance Measurements
Overview
Questions
- Why are simple tools like `seff` and `sacct` not enough?
- What steps can I take to assess a job's performance?
- What popular types of reports exist? (e.g. Roofline)
Objectives
After completing this episode, participants should be able to …
- Explain different approaches to performance measurements.
- Understand common terms and concepts in performance analyses.
- Create a performance report through a third-party tool.
- Describe what a performance report is meant for (establish baseline, documentation of issues and improvements through optimization, publication of results, finding the next thread to pull in a quest for optimization)
- Measure the performance of central components of underlying hardware (CPU, Memory, I/O, …) (split episode?)
Workflow
- Define sampling and tracing
- Describe common approaches
Tools
General report
- General reports show direction in which to continue
- Specialized tools may be necessary
Key Points
- First things first, second things second, …
- Profiling, tracing
- Sampling, summation
- Different HPC centers may provide different approaches to this workflow
- Performance reports offer more insight into the job and application behavior
Content from How to identify a bottleneck?
Overview
Questions
- How can I find the bottlenecks in a job at hand?
Objectives
After completing this episode, participants should be able to …
- Name typical performance issues.
- Determine if their job is affected by one of these issues.
How to identify a bottleneck?
Key Points
- General advice on the workflow
- Performance reports may provide an automated summary with recommendations
- Performance metrics can be categorized by the underlying hardware, e.g. CPU, memory, I/O, accelerators.
- Bottlenecks can appear as metrics saturating at the physical limits of the hardware, or indirectly as other metrics staying far below those limits.
- Interpreting bottlenecks is closely related to what the application is supposed to do.
- Relative measurements (baseline vs. change)
- system is quiescent, fixed CPU freq + affinity, warmups, …
- Reproducibility -> link to git course?
- Scanning results for smoking guns
- Any best practices etc.
Content from Special Aspects of Accelerators
Overview
Questions
- What are accelerators?
- How do they affect my job's performance?
Objectives
After completing this episode, participants should be able to …
- Understand how performance measurements on accelerators (GPUs, FPGAs) differ from those on CPUs.
- Understand how batch systems and performance measurement tools treat accelerators.
Introduction
Run the same example workload on GPU and compare.
Key Points
- Tools to measure GPU/FPGA performance of a job
- Common symptoms of GPU/FPGA problems
Content from Next Steps
Overview
Questions
- Are there common patterns of “pathological” performance?
- How can I evaluate the performance of my application in greater detail?
Objectives
After completing this episode, participants should be able to …
- Find a collection of performance patterns on hpc-wiki.info
- Identify next steps to take with regard to performance optimization.
Next Steps
hpc-wiki.info:
- I/O
- CPU Front End
- CPU Back End
- Memory leak
- Oversubscription
- Underutilization
Key Points
- There are many profilers, some are language-specific, others are vendor-related, …
- Simple profile with exclusive resources
- Repeated measurements for reliability