Content from Introduction


Last updated on 2025-07-16

Estimated time: 10 minutes

Overview

Questions

  • Why should I care about my job's performance?
  • How is efficiency defined?
  • How do I start measuring?

Objectives

After completing this episode, participants should be able to …

  • Use the time command for a first measurement.
  • Understand the benefits of efficient jobs.
  • Roughly estimate a job's energy consumption based on core-h.
  • Identify which general areas of computer hardware may affect performance.
  • Needs a specific example job.
  • Gradual improvement would be great
  • Start with baseline measurement at the very beginning?
    • time can raise questions of efficiency, “what’s good?”, etc.

Setting the Baseline


Absolute performance is hard to determine:

  • In comparison to current hardware (theoretical limits vs. real usage)
  • Still important if we are a long way from the theoretical limits
  • Always limited by something (one optimization just shifts to the next saturated bottleneck)

During optimization, performance is often expressed relative to a baseline measurement. Define “baseline”: the reference measurement used to compare before and after a change.

Exercise: Baseline Measurement with time

Simple measurement of the example application with time. Maybe also with hyperfine?

Observe system, user, and wall time.

Repeat measurements 3-10 times to reduce noise:

  • Average time
  • Minimum (observed best case)

Maybe make a simple/obvious change and compare it to the baseline. How much relative improvement?

Example of how to run it and what the result looks like
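A minimal sketch of what this could look like, assuming a hypothetical executable ./simulate standing in for the example job (all numbers in the output are purely illustrative):

# single run, using the shell's built-in time (bash)
time ./simulate

# typical output:
#   real    1m42.3s   <- wall time, from start to finish
#   user    1m38.1s   <- CPU time spent in user space
#   sys     0m2.9s    <- CPU time spent in the kernel on behalf of the job

# repeat a few times to reduce noise (GNU time prints elapsed seconds with -f "%e")
for i in $(seq 5); do
    /usr/bin/time -f "%e s" ./simulate
done

# hyperfine, if installed, automates warm-up and repetition and reports mean/min/max
hyperfine --warmup 1 --runs 5 './simulate'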

Discuss the meaning of system, user, and wall time. Relate to efficiencies (minimal wall time vs. minimal compute time).

Why Care About Performance?


Reasons from the perspective of learners (see profiles)

  • Faster output, shorter iteration-/turn-around-time
    • More research per time
    • Opportunity costs when “accepting” worse performance
    • Trade-off between time spent on optimizing vs. doing actual research
  • Potentially less wasted energy
    • Core-h / device-h correlate directly with energy consumption
    • Production of hardware and its operation costs energy (even when idle)
    • => Buy as little hardware as possible and use it as much as you can, if you have meaningful computations
  • Applying for HPC resources in a larger center
    • Need estimate for expected resources
    • Jobs need to be sufficiently efficient
    • Is provided hardware a good fit for the applied computational workload?

Exercise: Why care about performance?

Maybe true/false statements as a warm-up exercise? E.g. something like:

  • Better performance allows for more research
  • Application performance matters less on new computer hardware
  • Computations directly correlate to energy consumption
  • Good performance does not matter on my own hardware

All statements should be connected to the example job & narrative!

  • True: shorter turn-around times, more results per time, more Nobel Prizes per second!
  • False: new hardware might make performance issues less pressing, but it is still important (opportunity costs, wasted energy, shared resources).
  • True: device-hours consume energy (variable depending on utilized features, amount of communication, etc.), but there is a direct correlation to the power drawn.
  • False: performance is especially important on shared systems, but energy and opportunity costs also affect researchers on their own hardware and exclusive allocations.

Core-h and Energy


Define core-h (number of allocated cores × wall-clock hours of usage). Device usage for X seconds correlates with an estimated power draw. Real power usage depends on:

  • Utilized features of the device (some more power-hungry than others)
  • Amount of data movement through memory, storage, and network
  • Cooling (rule of thumb: multiply by a factor of ~2)

Looking at energy is one perspective on “efficiency”.

Exercise: Core-h and Energy consumption

  • Figure out your current hardware (documentation, /proc/cpuinfo, web search, LLM)
  • Calculate core-h for above test (either including or excluding repetitions)
  • Estimate power usage with TDP
  • Keep it simple, back of the envelope calculations

Example for an existing cluster. Stick to CPU TDP, maybe a rough number for the whole node from somewhere, multiply by a factor of 2 for cooling, and mention that network and storage infrastructure are not covered.
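A back-of-the-envelope sketch with made-up example numbers (a 16-core job running for 2 hours on a 64-core CPU with a TDP of 150 W; substitute the values for your own cluster and job):

cores=16             # allocated cores (example value)
hours=2              # wall-clock hours (example value)
tdp_watts=150        # CPU TDP from the spec sheet (example value)
cores_per_cpu=64     # physical cores of that CPU (example value)

core_hours=$(( cores * hours ))
# attribute a share of the TDP to the job: TDP * (cores used / cores on the CPU) * hours
# (integer arithmetic is good enough for a rough estimate)
energy_wh=$(( tdp_watts * cores * hours / cores_per_cpu ))
# rule of thumb: double it to account for cooling and other infrastructure
total_wh=$(( 2 * energy_wh ))
echo "${core_hours} core-h, ~${energy_wh} Wh at the CPU, ~${total_wh} Wh including cooling"
# -> 32 core-h, ~75 Wh at the CPU, ~150 Wh including cooling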

What is Efficient?


Challenge: Many perspectives on Efficiency

Write down your current definition or understanding of efficiency with respect to HPC jobs. (Shared document?)

(Exercise as think, pair, share?)

E.g. shortest time from submission to job completion.

Many definitions of efficiency (see below)

Discussion: Which definition should we take?

Are these perspectives equally useful? Is one particularly suited to our discussion?

Many definitions of efficiency (to be ordered and discussed):

  1. Minimal wall-/human-time of the job
  2. Minimal compute-time
  3. Minimal time-to-solution (like 1, including queue wait times, potentially multiple jobs for combined results)
  4. Minimal cost in terms of energy / someone's money
  5. With regard to opportunity costs: amount of research per job (including waiting times, computation time, and slowdown through longer iteration cycles / turn-around times)

Assuming only “useful” computations, no redundancies.

Which definition do we refer to by default in the following episodes? (Do we need a default?)

How Does Performance Relate to Hardware?


(Following this structure throughout the course, trying to understand the performance in these terms)

Broad dimensions of performance:

  • CPU (Front- and Backend, FLOPS)
    • Frontend: decoding instructions, branch prediction, pipeline
    • Backend: getting data from memory, cache hierarchy & alignment
    • Raw calculations
    • Vectorization
    • Out-of-order execution
  • Accelerators (e.g. GPUs)
    • More calculations
    • Offloading
    • Memory & communication models
  • Memory (data hierarchy)
    • Working memory, reading data from/to disk
    • Bandwidth of data
  • I/O (broader data hierarchy: disk, network)
    • Stored data
    • Local disk (caching)
    • Parallel fs (cluster-wide)
    • MPI communication
  • Parallel timeline (synchronization, etc.)
    • Application logic

Maybe we should either focus on components (CPUs, memory, disk, accelerators, network cards) or functional entities (compute, data hierarchy, bandwidth, latency, parallel timelines)

We shouldn’t go into too much detail here. Define broad categories where performance can be good or bad. (calculations, data transfers, application logic, research objective (is the calculation meaningful?))

Reuse categories in the same order and fashion throughout the course, i.e. point out in what area a discovered inefficiency occurs.

Introduce detail about hardware later where it is needed, e.g. NUMA for pinning and hints.

Hardware

Exercise: Match application behavior to hardware

Which part of the computer hardware may become an issue for the following application patterns:

  1. Calculating matrix multiplications
  2. Reading data from processes on other computers
  3. Calling many different functions from many equally likely if/else branches
  4. Writing very large files (TB)
  5. Comparing many different strings to check whether they match
  6. Constructing a large simulation model
  7. Reading thousands of small files for each iteration

Maybe not the best questions, also missing something for accelerators.

  1. CPU (FLOPS) and/or Parallel timeline
  2. I/O (network)
  3. CPU (Front-End)
  4. I/O (disk)
  5. (?) CPU-Backend, getting strings through the cache?
  6. Memory (size)
  7. I/O (disk)

Summary


Exercise: Recollecting efficiency

Exercise to raise the question whether the example workload is efficient or not. Do we know yet? -> No, we can only tell how long it takes, estimate how much time/resources it consumes, and whether there is a relative improvement after a change.

Key Points

  • Absolute vs. relative performance measurements
    • time to establish a baseline
    • Estimating energy consumption
  • Job performance affects you as a user
  • Core-h and very rough energy estimate
  • Different perspectives on efficiency
    • Definitions: wall/human-time, compute-time, time-to-solution, energy (costs / environment), Money, opportunity cost (less research output)
  • Relationship between performance and computer hardware

Content from Resource Requirements


Last updated on 2025-07-16

Estimated time: 10 minutes

Overview

Questions

  • How many resources should I request initially?
  • What options does the scheduler give to request resources?
  • How do I know if they are used well?
  • How large is my HPC cluster?

Objectives

After completing this episode, participants should be able to …

  • Identify the size of their jobs in relation to the HPC system.
  • Request the right amount of resources from the scheduler.
  • Change the parameters if the application's resource requirements change.

Starting Somewhere


Didactic path: I have no idea how many resources to ask for -> just guess and start with some combinations. Next, identify the slower or failed runs (OOM, time limit) and choose the best. What does that say about efficiency?

Exercise: Starting Somewhere

  • Run job with a time limit of 1 minute -> trigger the time limit. What's a good time limit for our task?
  • Run job with few cores and too little memory per core -> trigger OOM. What's a good memory limit for our task?
  • Run job requesting way too many cores -> endless waiting, or rejection due to account limits. What's a good CPU count for our task?
  • squeue to learn about scheduling issues / reasons (a sketch of such trial submissions follows below)
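A sketch of what such trial submissions could look like; the Slurm options are real, while the ./simulate payload and the chosen values are placeholders:

#!/bin/bash
#SBATCH --job-name=trial-run
#SBATCH --ntasks=4
#SBATCH --time=00:01:00        # deliberately short: triggers the time limit
#SBATCH --mem-per-cpu=100M     # deliberately small: may provoke an out-of-memory kill

srun ./simulate

# after submitting with `sbatch trial.sh`, check why a job is (still) pending:
squeue -u $USER                # pending jobs show a reason, e.g. (Priority), (Resources), ...
scontrol show job <jobid>      # more detail on a single pending or running job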

Summarize the dimensions in which a job has to be sized correctly (time, cores, memory, GPUs, …).

Compared to the HPC System


  • What’s the relationship between your job and existing hardware of the system?
    • What hardware does your HPC system offer?
    • Documentation and Slurm commands
  • Is my job large or small?
    • What’s considered large, medium, small? Maybe as percentage of whole system?
    • Issues of large jobs: long waiting times
    • Issues of many (thousands of) small jobs: scheduling and startup overhead per job
  • How many resources are currently free?
  • How long do I have to wait? (looking up scheduler estimate + apply common sense)

Exercise: Comparing to the system

  • sinfo to learn about partitions and free resources
  • scontrol to learn about nodes in those partitions
  • lscpu and cat /proc/cpuinfo
  • Submit a job with a reasonable number of resources and use squeue and/or scontrol show job to learn about Slurm's estimated start time (example commands follow below)
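A sketch of the commands involved (partition and node names differ per cluster and are placeholders here):

sinfo                                   # partitions, their state, and node counts
sinfo -o "%P %l %D %c %m"               # partition, time limit, nodes, CPUs/node, memory/node
scontrol show partition <partition>     # limits and defaults of one partition
scontrol show node <nodename>           # sockets, cores, threads, and memory of one node
lscpu                                   # CPU model, core/thread counts, NUMA layout
cat /proc/cpuinfo                       # raw per-core details

sbatch job.sh                           # submit the example job
squeue -u $USER                         # job state and pending reason
scontrol show job <jobid> | grep -i starttime   # Slurm's estimated start time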

Answer questions about the number and type of CPUs, HT/SMT, memory per core, and time limits.

Summarize with a well-sized job that's a good start for the example.

Requesting Resources


More detail about what Slurm provides (among others); a minimal batch script using a few of these options follows the list:

  • -t, --time=<time>: Time limit of the job
  • -N, --nodes: Number of nodes
  • -n, --ntasks: Number of tasks/processes
  • -c, --cpus-per-task: Number of CPUs per task/process
  • --threads-per-core=<threads>: Select nodes with at least the given number of threads per core
  • --mem=<size>[units]: Memory per node; variants such as --mem-per-cpu exist, …
  • -G, --gpus: Number of GPUs
  • --exclusive: Allocate nodes exclusively (do not share them with other jobs)
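A minimal sketch of a batch script using a few of these options; the values are placeholders for the example job, not recommendations:

#!/bin/bash
#SBATCH --time=00:30:00        # -t: wall-time limit
#SBATCH --nodes=1              # -N: number of nodes
#SBATCH --ntasks=4             # -n: number of tasks/processes
#SBATCH --cpus-per-task=2      # -c: CPUs per task, e.g. for multi-threaded tasks
#SBATCH --mem-per-cpu=2G       # or --mem=<size> for a per-node limit
##SBATCH --gpus=1              # -G: only if the application uses a GPU
##SBATCH --exclusive           # only if node sharing must be avoided

srun ./simulate input.dat      # placeholder application and input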

Stick to simple options here. Put more complex options for pinning, hints, etc. into their own episode later in the course.

Pinning is an important part of job optimization, but requires some knowledge, e.g. about the hardware hierarchies in a cluster, NUMA, etc.

Binding / pinning:

  • --mem-bind=[{quiet|verbose},]<type>
  • -m, --distribution={*|block|cyclic|arbitrary|plane=<size>}[:{*|block|cyclic|fcyclic}[:{*|block|cyclic|fcyclic}]][,{Pack|NoPack}]
  • --hint=: Hints for compute-bound (compute_bound) and memory-bound (memory_bound) jobs, but also multithread / nomultithread
  • --cpu-bind=[{quiet|verbose},]<type> (srun)
  • Mapping of application <-> job resources

Maybe discuss:

  • Minimizing/maximizing involved number of nodes
    • Shared nodes: longer waiting times until a whole node is empty
    • Minimizing/maximizing the number of nodes minimizes/maximizes inter-node communication
  • Different wait times for certain configurations
    • Few tasks on many shared nodes might schedule faster than many tasks on few exclusive nodes.
  • What is a task / process? Is there a difference?
  • Requesting more memory than the per-core share -> idle cores

This section is just an info dump; how do we make it useful and approachable? What's a useful exercise? Maybe move this info into other sections?

Changing requirements


  • Motivate why requirements might change (resolution in simulation, more data, more complex model, …)
  • How to change requested resources if application should run differently? (e.g. more processes)
  • Considerations & estimates for
    • changing compute-time (more/less workload)
    • changing memory requirements (smaller/larger model)
    • changing number of processes / nodes
    • changing I/O -> more/less or larger/smaller files

Exercise: Changing requirements

  • Walk through how to estimate the increase in CPU cores, memory, etc. (a sketch follows this list)
  • Run previous job with larger workload
  • Check if and how it behaves differently than the smaller job
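A back-of-the-envelope sketch of such an adjustment, with made-up numbers (the actual scaling factors depend entirely on the application):

# Baseline job: 4 tasks, 2 GB per task, 30 min wall time.
# Suppose the workload doubles (e.g. twice as many time steps) and the
# model grows by ~50% in memory. A first guess for the new request:
#
#   time:          30 min * 2   -> 60 min, plus a safety margin
#   memory/task:   2 GB * 1.5   -> 3 GB
#   tasks:         unchanged, until a scaling study says otherwise
#
#SBATCH --time=01:15:00
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=3G

# afterwards, compare against the smaller job, e.g.:
sacct -j <jobid> -o JobID,Elapsed,MaxRSS,AllocCPUS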

Summary


Discussion: Recollection

Circle back to efficiency. What’s considered good/efficient in context of job requirements and parameters?

Key Points

  • Estimate resource requirements and request them in terms the scheduler understands
  • Be aware of your job in relation to the whole system (available hardware, size)
  • Aim for a good match between requested and utilized resources
  • Optimal time-to-solution by minimizing batch queue times and maximizing parallelism

Content from Scaling Study


Last updated on 2025-07-16

Estimated time: 10 minutes

Overview

Questions

  • How can I decide the amount of resources I should request for my job?
  • How do I know how my application behaves at different scales?

Objectives

After completing this episode, participants should be able to …

  • Perform a simple scaling study for a given application.
  • Identify good working points for the job configuration.

What do we look at?


  • Amdahl's vs. Gustafson's law / strong and weak scaling
  • Wall time, speedup, efficiency

Discussion: What dimensions can we look at?

  • CPUs
  • Nodes
  • Workload/problem size
  • Define example payload
    • Long enough to be significant
    • Short enough to be feasible for a quick study
  • Identify dimension for scaling study, e.g.
    • number of processes (on a single node)
    • number of processes (across nodes)
    • number of nodes involved (network-communication boundary)
    • size of workload
    • Decision: number of processes across nodes, with a fixed workload size
  • Choose limits (e.g. 1, 2, 4, … cores) within a reasonable size for the given cluster
  • Scale beyond a single node? Or restrict the scan to one node?

Parameter Scan


  • Take measurements
    • Use time and repeat the measurements (something like 3 or 10 times)
    • Vary scaling parameter

Exercise: Run the Example with different -n

  • 1, 2, 4, 8, 16, 32, … cores and same workload
  • Take time measurements (ideally multiple, and with --exclusive); a sketch follows below
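A sketch of such a scan as a submission loop, assuming the hypothetical ./simulate payload fits on a single node for all chosen core counts:

# submit one job per core count, three repetitions each
for n in 1 2 4 8 16 32; do
  for rep in 1 2 3; do
    sbatch --exclusive --nodes=1 --ntasks="$n" --time=00:30:00 \
           --output="scale-n${n}-r${rep}.out" \
           --wrap "/usr/bin/time -p srun ./simulate"
  done
done

# collect the wall times afterwards (-p prints real/user/sys in seconds)
grep real scale-n*-r*.out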

Analyzing results


Exercise: Plot the scaling

  • Plot it against time
  • Calculate speedup with respect to the baseline with 1 core (a calculation sketch follows this list)
  • What's a good working point? How do we identify it?
  • Overhead
  • Efficiency: not wasting cores if adding them doesn’t do much
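A small sketch for turning the measured times into speedup and parallel efficiency, assuming a file scaling.dat with two columns (cores, best time in seconds) where the first line is the 1-core baseline:

# speedup S(n) = T(1) / T(n); parallel efficiency E(n) = S(n) / n
awk 'NR == 1 { t1 = $2 }
     { s = t1 / $2
       printf "%4d cores  %8.1f s  speedup %5.2f  efficiency %3.0f%%\n",
              $1, $2, s, 100 * s / $1 }' scaling.dat

The resulting table can be plotted with any tool at hand (gnuplot, a spreadsheet, matplotlib), ideally next to the ideal speedup S(n) = n.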

Summary


What’s a good working point for our example (at a given workload)?

Add a note on applications/proposals for compute time that need an estimate of the required compute resources, and touch on scaling behavior here? Could be important for one type of learner, if this is given in a context like HPC.NRW. Optional for many others, but maybe interesting.

Key Points

  • Jobs behave differently with varying resources and workloads
  • A scaling study is necessary to demonstrate (prove) how the application behaves at different scales
  • Good working points are defined by regions where more cores still provide sufficient speedup, before costs due to overhead etc. occur

Content from Scheduler Tools


Last updated on 2025-07-16

Estimated time: 10 minutes

Overview

Questions

  • What information can the scheduler provide about my job's performance?
  • What’s the meaning of the collected metrics?

Objectives

After completing this episode, participants should be able to …

  • Explain basic performance metrics.
  • Use tools provided by the scheduler to collect basic performance metrics of their jobs.

Scheduler Tools


  • sacct
    • MaxRSS, AvgRSS
    • MaxPages, AvgPages
    • AvgCPU, AllocCPUS
    • Elapsed
    • MaxDiskRead, AvgDiskRead
    • MaxDiskWrite, AvgDiskWrite
    • ConsumedEnergy (if energy accounting is configured)
  • seff (example calls for both tools follow this list)
    • Utilization of time allocation
    • Utilization of allocated CPUs (is 100% <=> efficient? Not if calculations are redundant etc.!)
    • Utilization of allocated memory
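A sketch of how both tools could be called for a finished job; the job ID is a placeholder, and which fields are populated depends on the site's Slurm configuration:

# selected accounting fields for job 123456
sacct -j 123456 --format=JobID,Elapsed,AllocCPUS,AvgCPU,MaxRSS,MaxDiskRead,MaxDiskWrite,ConsumedEnergy

# compact efficiency summary (CPU, memory, and time utilization vs. allocation)
seff 123456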

Shortcomings


  • Not enough info about e.g. I/O, no timeline of metrics during job execution, …
    • I/O may be available, but likely only for local disks
    • => no parallel FS
    • => no network
  • Energy demand may be missing or wrong
    • Depends on available features
    • Doesn’t estimate energy for network switches, cooling, etc.
  • => Try other tools! (motivation for subsequent episodes)

Can / should we cover I/O and energy metrics at this point?

E.g. use something like beegfs-ctl to get a rough estimate of parallel FS performance. Use pidstat etc. to get numbers on node-local I/O (and much more)

Summary


Key Points

  • sacct and seff for first results
  • Small scaling study, maximum of X% overhead is “still good” (larger resource req. vs. speedup)
  • Getting a feel for the scale of the HPC system, e.g. “is 64 cores a lot?” How large is my job in comparison?
  • CPU and Memory Utilization
  • Core-h and relationship to power efficiency

Content from Workflow of Performance Measurements


Last updated on 2025-07-16

Estimated time: 10 minutes

Overview

Questions

  • Why are simple tools like seff and sacct not enough?
  • What steps can I take to assess a job's performance?
  • What popular types of reports exist? (e.g. Roofline)

Objectives

After completing this episode, participants should be able to …

  • Explain different approaches to performance measurements.
  • Understand common terms and concepts in performance analyses.
  • Create a performance report through a third-party tool.
  • Describe what a performance report is meant for (establish baseline, documentation of issues and improvements through optimization, publication of results, finding the next thread to pull in a quest for optimization)
  • Measure the performance of central components of underlying hardware (CPU, Memory, I/O, …) (split episode?)

Workflow


  • Define sampling and tracing
  • Describe common approaches

Tools


Performance counters and permissions: access may require --exclusive and depends on the system! Check the documentation / talk to your administrators or support.

cap_perfmon,cap_sys_ptrace,cap_syslog=ep
kernel.perf_event_paranoid
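A sketch of how to check and use this on a compute node; whether perf is available and which paranoid level is allowed is entirely system-dependent:

# lower values allow more; per-process profiling typically needs a value <= 2
cat /proc/sys/kernel/perf_event_paranoid

# quick counter-based summary of the example run (if permitted)
perf stat -d ./simulate

# file capabilities granted to the perf binary, if the site uses them
getcap "$(command -v perf)"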

General report


  • General reports show direction in which to continue
    • Specialized tools may be necessary

Key Points

  • First things first, second things second, …
  • Profiling, tracing
  • Sampling, summation
  • Different HPC centers may provide different approaches to this workflow
  • Performance reports offer more insight into the job and application behavior

Content from How to identify a bottleneck?


Last updated on 2025-06-26

Estimated time: 10 minutes

Overview

Questions

  • How can I find the bottlenecks in a job at hand?

Objectives

After completing this episode, participants should be able to …

  • Name typical performance issues.
  • Determine if their job is affected by one of these issues.

How to identify a bottleneck?


Key Points

  • General advice on the workflow
  • Performance reports may provide an automated summary with recommendations
  • Performance metrics can be categorized by the underlying hardware, e.g. CPU, memory, I/O, accelerators.
  • Bottlenecks can appear directly, when metrics saturate at the physical limits of the hardware, or indirectly, when other metrics stay far below those limits.
  • Interpreting bottlenecks is closely related to what the application is supposed to do.
  • Relative measurements (baseline vs. change)
    • system is quiescent, fixed CPU freq + affinity, warmups, …
    • Reproducibility -> link to git course?
  • Scanning results for smoking guns
  • Any best practices etc.

Content from Special Aspects of Accelerators


Last updated on 2025-07-16

Estimated time: 10 minutes

Overview

Questions

  • What are accelerators?
  • How do they affect my job's performance?

Objectives

After completing this episode, participants should be able to …

  • Understand how performance measurements on accelerators (GPUs, FPGAs) differ from those on CPUs.
  • Understand how batch systems and performance measurement tools treat accelerators.

Introduction


Run the same example workload on GPU and compare.

Don't mention FPGAs too much, maybe just a note on what other accelerators exist besides GPUs. The goal is to keep it simple and accessible, focusing on what's common in most HPC systems these days.

Explain how to decide where to run something. CPU vs. small GPU vs. high-end GPUs. Touches on transfer overhead etc.
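A sketch of requesting and observing a GPU job; the payload and its --use-gpu flag are placeholders, and the monitoring commands assume NVIDIA GPUs:

# request one GPU alongside a few CPU cores
sbatch --gpus=1 --ntasks=1 --cpus-per-task=8 --time=00:30:00 \
       --wrap "srun ./simulate --use-gpu"

# while the job runs, attach to it (e.g. srun --jobid=<jobid> --overlap --pty bash) and check:
nvidia-smi               # GPU utilization, memory use, power draw
nvidia-smi dmon -s pucm  # rolling per-second power, utilization, clock, and memory metrics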

Key Points

  • Tools to measure GPU/FPGA performance of a job
  • Common symptoms of GPU/FPGA problems

Content from Next Steps


Last updated on 2025-06-26

Estimated time: 10 minutes

Overview

Questions

  • Are there common patterns of “pathological” performance?
  • How can I evaluate the performance of my application in greater detail?

Objectives

After completing this episode, participants should be able to …

  • Find collection of performance patterns on hpc-wiki.info
  • Identify next steps to take with regard to performance optimization.

Next Steps


Performance patterns on hpc-wiki.info:

  • I/O
  • CPU Front End
  • CPU Back End
  • Memory leak
  • Oversubscription
  • Underutilization

Key Points

  • There are many profilers, some are language-specific, others are vendor-related, …
  • Simple profile with exclusive resources
  • Repeated measurements for reliability