Content from Introduction


Last updated on 2025-09-24

Overview

Questions

  • Why should I care about job performance?
  • How is efficiency defined?
  • How do I start measuring?
  • Is my job fast enough?

Objectives

After completing this episode, participants should be able to …

  • Use the time command for a first measurement.
  • Understand the benefits of efficient jobs.
  • Roughly estimate a job's energy consumption based on core-h.

Setting the Baseline


Absolute performance is hard to determine:

  • In comparison to current hardware (theoretical limits vs. real usage)
  • Still important, if long way from theoretical limits
  • Always limited by something (one optimization just shifts to the next saturated bottleneck)

During optimization, performance is often expressed relative to a baseline measurement. Define “baseline”: a comparison between before and after a change.

Challenge

Exercise: Baseline Measurement with time

Simple measurement with time of the example application. Maybe also with hyperfine?

Observe system, user, and wall time.

Repeat measurements 3-10 times to reduce noise:

  • Average time
  • Minimum (observed best case)

Maybe make a simple/obvious change and compare against the baseline. How much relative improvement?

Example of how to run it and what the result looks like

Discuss meaning of system, user, wall-time. Relate to efficiencies (minimal wall-time vs. minimal compute-time)
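The repetition idea above can be sketched in a few lines of Python; the `sleep 0.1` command is a stand-in for the example application, which is not fixed in these notes.

```python
import statistics
import subprocess
import time

def measure(cmd, repetitions=5):
    """Run `cmd` several times and return the wall-clock times in seconds."""
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return times

# Placeholder workload; substitute the course's example application.
times = measure(["sleep", "0.1"], repetitions=3)
print(f"min: {min(times):.3f} s  mean: {statistics.mean(times):.3f} s")
```

Reporting both the minimum (observed best case) and the mean mirrors the two summary values suggested above.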

Why Care About Performance?


Reasons from the perspective of learners (see profiles)

  • Faster output, shorter iteration-/turn-around-time
    • More research per time
    • Opportunity costs when “accepting” worse performance
    • Trade-off between time spent on optimizing vs. doing actual research
  • Potentially less wasted energy
    • Core-h / device-h directly correlate to wattage
    • Production of hardware and its operation costs energy (even when idle)
    • => Buy as little hardware as possible and use it as much as you can, if you have meaningful computations
  • Applying for HPC resources in a larger center
    • Need estimate for expected resources
    • Jobs need to be sufficiently efficient
    • Is provided hardware a good fit for the applied computational workload?
Challenge

Exercise: Why care about performance?

maybe true-false statements as warmup exercise? E.g. something like

  • Better performance allows for more research
  • Application performance matters less on new computer hardware
  • Computations directly correlate to energy consumption
  • Good performance does not matter on my own hardware

All statements should be connected to the example job & narrative!

  • True, shorter turn-around times, more results per time, more Nobel Prizes per second!
  • False, new hardware might make performance issues less pressing, but it is still important (opportunity costs, wasted energy, shared resources)
  • True, device-hours consume energy (variable depending on utilized features, amount of communication, etc.), but there is a direct correlation to W
  • False, performance is especially important on shared systems, but energy and opportunity costs also affect researchers on their own hardware and exclusive allocations.

Core-h and Energy


Define core-h. Device usage for X seconds correlates to estimated power draw. Real power usage depends on:

  • Utilized features of the device (some more power-hungry than others)
  • Amount of data movement through memory, storage, and network
  • Cooling (rule of thumb: factor ×2)

Looking at energy is one perspective on “efficiency”.

Challenge

Exercise: Core-h and Energy consumption

  • Figure out your current hardware (docu, cpuinfo, websearch, LLM)
  • Calculate core-h for above test (either including or excluding repetitions)
  • Estimate power usage with TDP
  • Keep it simple, back of the envelope calculations

Example for an existing cluster. Stick to CPU TDP, maybe rough number for whole node from somewhere, multiply factor 2 for cooling, mention not-covered network and storage infrastructure
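A back-of-the-envelope sketch of the calculation above; every number here (core count, runtime, TDP) is an assumed placeholder, to be replaced with values for the actual cluster.

```python
# Core-h and rough energy estimate from CPU TDP.
# All numbers are assumptions; look up the TDP of your actual CPU model.
cores = 16
runtime_hours = 2.0
core_hours = cores * runtime_hours          # 32 core-h

tdp_watts = 200                             # assumed TDP of the whole package
cores_per_cpu = 32
watts_per_core = tdp_watts / cores_per_cpu  # 6.25 W per core

energy_kwh = core_hours * watts_per_core / 1000  # compute only
energy_kwh_cooled = energy_kwh * 2               # rule-of-thumb cooling factor

print(f"{core_hours} core-h ≈ {energy_kwh_cooled:.2f} kWh incl. cooling")
```

Network and storage infrastructure are deliberately left out, matching the scope stated above.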

What is Efficient?


Challenge

Challenge: Many perspectives on Efficiency

Write down your current definition or understanding of efficiency with respect to HPC jobs. (Shared document?)

(Exercise as think, pair, share?)

E.g. shortest time from submission to job completion.

Many definitions of efficiency (see below)

Discussion

Discussion: Which definition should we take?

Are these perspectives equally useful? Is one particularly suited to our discussion?

Many definitions of efficiency (to be ordered and discussed):

  1. Minimal wall-/human-time of the job
  2. Minimal compute-time
  3. Minimal time-to-solution (like 1, including queue wait times, potentially multiple jobs for combined results)
  4. Minimal cost in terms of energy/someones money
  5. With regards to opportunity costs. Amount of research per job (including waiting times, computation time, slowdown through larger iteration cycles (turn around times))

Assuming only “useful” computations, no redundancies.

Which definition do we refer to by default in the following episodes? (Do we need a default?)

Summary


Discussion

Exercise: Recollecting efficiency

Exercise to raise the question if example workload is efficient or not. Do we know yet? -> No, we can only tell how long it takes, estimate how much time/resources it consumes, and if there is a relative improvement on a change

Leading question: Single baseline measurement doesn’t say much about the application performance, how can I get an understanding of performance? -> Vary a parameter in the next episode and touch on Slurm options

Key Points
  • Absolute vs. relative performance measurements
    • time to establish a baseline
    • Estimating energy consumption
  • Job performance affects you as a user
  • Core-h and very rough energy estimate
  • Different perspectives on efficiency
    • Definitions: wall/human-time, compute-time, time-to-solution, energy (costs / environment), Money, opportunity cost (less research output)
  • Relationship between performance and computer hardware

Content from Resource Requirements



Overview

Questions

  • How many resources should I request initially?
  • What scheduler options exist to request resources?
  • How do I know if they are used well?
  • How large is my HPC cluster?

Objectives

After completing this episode, participants should be able to …

  • Identify the size of their jobs in relation to the HPC system.
  • Request a good amount of resources from the scheduler.
  • Change the parameters to see how the execution time changes.

Starting Somewhere


Didactic path: I have no idea how many resources to ask for -> just guess and start with some combinations. Next, identify slower or failed runs (OOM, timelimit) and choose the best. What does that say about efficiency?

Discussion

Exercise: Starting Somewhere

  • Run job with a timelimit of 1 minute -> Trigger timelimit. What’s a good timelimit for our task?
  • Run job with few cores, but too much memory/core -> Trigger OOM. What’s a good memory limit for our task?
  • Run job with requesting way too many cores -> Endless waiting or not accepted due to account limits. What’s a good CPU limit for our task?
  • squeue to learn about scheduling issues / reasons

Summarize dimensions in which a job has to be sized correctly (time, cores, memory, gpus, …).

Compared to the HPC System


  • What’s the relationship between your job and existing hardware of the system?
    • What hardware does your HPC system offer?
    • Documentation and Slurm commands
  • Is my job large or small?
    • What’s considered large, medium, small? Maybe as percentage of whole system?
    • Issues of large jobs: long waiting times
  • Issues of many (thousands of) small jobs: e.g. scheduler overhead
  • How many resources are currently free?
  • How long do I have to wait? (looking up scheduler estimate + apply common sense)
Discussion

Exercise: Comparing to the system

  • sinfo to learn about partitions and free resources
  • scontrol to learn about nodes in those partitions
  • lscpu and cat /proc/cpuinfo
  • Submit a job with a reasonable number of resources and use squeue and/or scontrol show job to learn about Slurm's estimated start time

Answer questions about number and type of CPUs, HT/SMT, memory/core, timelimits.

Summarize with a well sized job that’s a good start for the example.

Requesting Resources


More detail about what Slurm provides (among others):

  • -t, --time=<time>: Time limit of the job
  • -N, --nodes: Number of nodes
  • -n, --ntasks: Number of tasks/processes
  • -c, --cpus-per-task: Number of CPUs per task/process
  • --threads-per-core=<threads>: Select nodes with at least the given number of threads per core
  • --mem=<size>[units]: Memory per node; can also be given as --mem-per-cpu, …
  • -G, --gpus: Number of GPUs
  • --exclusive

Maybe discuss:

  • Minimizing/maximizing involved number of nodes
    • Shared nodes: longer waiting times until a whole node is empty
    • Minimizing the number of nodes minimizes inter-node communication; maximizing it does the opposite
  • Different wait times for certain configurations
    • Few tasks on many shared nodes might schedule faster than many tasks on few exclusive nodes.
  • What is a task / process – Difference?
  • Requesting memory, more than mem/core -> idle cores
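The last bullet can be made concrete with a small calculation; node sizes below are assumptions, not a real cluster configuration.

```python
import math

# If a job requests more memory per CPU than the node provides,
# the scheduler must block extra CPUs that then sit idle.
# Node numbers are assumptions; check your cluster's documentation.
node_mem_gb = 256
node_cores = 64
mem_per_core_gb = node_mem_gb / node_cores   # 4 GB per core

requested_cores = 4
requested_mem_gb = 64                        # 16 GB per requested core

# CPUs effectively blocked to satisfy the memory request:
blocked_cores = math.ceil(requested_mem_gb / mem_per_core_gb)
idle_cores = max(0, blocked_cores - requested_cores)
print(f"{idle_cores} allocated cores may sit idle")
```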

Changing requirements


  • Motivate why requirements might change (resolution in simulation, more data, more complex model, …)
  • How to change requested resources if application should run differently? (e.g. more processes)
  • Considerations & estimates for
    • changing compute-time (more/less workload)
    • changing memory requirements (smaller/larger model)
    • changing number of processes / nodes
    • changing I/O -> more/less or larger/smaller files
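A minimal sketch of such an estimate, assuming a 3D-grid simulation; the exponents and all numbers are illustrative and depend entirely on the actual application.

```python
# Rough scaling estimate when the workload changes.
# Assumption: a 3D grid where memory and compute grow with grid volume.
base_resolution = 100      # grid points per dimension
base_mem_gb = 8
base_hours = 1.0

factor = 2                 # doubling the resolution
new_resolution = base_resolution * factor
new_mem_gb = base_mem_gb * factor**3   # memory grows with grid volume
new_hours = base_hours * factor**3     # naive: compute grows with volume too

print(f"{new_resolution}^3 grid: {new_mem_gb} GB memory, {new_hours} h runtime")
```

Doubling one dimension of a 3D problem multiplies volume-bound requirements by eight, which is why "just a bit more resolution" often breaks previously well-sized jobs.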
Discussion

Exercise: Changing requirements

  • Walk through how to estimate increase in CPU cores / memory, etc.
  • Run previous job with larger workload
  • Check if and how it behaves differently than the smaller job

Summary


Discussion

Discussion: Recollection

Circle back to efficiency. What’s considered good/efficient in context of job requirements and parameters?

Leading question: time doesn’t give much information, is there an easy way to get more? -> See what Slurm tools can tell about our previous jobs

Key Points
  • Estimate resource requirements and request them in terms the scheduler understands
  • Be aware of your job in relation to the whole system (available hardware, size)
  • Aim for a good match between requested and utilized resources
  • Optimal time-to-solution by minimizing batch queue times and maximizing parallelism

Content from Scheduler Tools



Overview

Questions

  • What can the scheduler tell about job performance?
  • What’s the meaning of collected metrics?

Objectives

After completing this episode, participants should be able to …

  • Explain basic performance metrics.
  • Use tools provided by the scheduler to collect basic performance metrics of their jobs.

Scheduler Tools


  • sacct
    • MaxRSS, AvgRSS
    • MaxPages, AvgPages
    • AvgCPU, AllocCPUS
    • Elapsed
    • MaxDiskRead, AvgDiskRead
    • MaxDiskWrite, AvgDiskWrite
    • ConsumedEnergy
  • seff
    • Utilization of time allocation
    • Utilization of allocated CPUs (is 100% <=> efficient? Not if calculations are redundant etc.!)
    • Utilization of allocated memory
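The CPU-utilization figure that seff reports can be reproduced from sacct fields by hand; the values below are made up for illustration.

```python
# seff-style CPU efficiency from sacct fields (values are made up).
# TotalCPU is the summed CPU time over all cores; Elapsed is wall time.
alloc_cpus = 8
elapsed_s = 3600        # Elapsed: 01:00:00
total_cpu_s = 14400     # TotalCPU: 04:00:00

cpu_efficiency = total_cpu_s / (elapsed_s * alloc_cpus)
print(f"CPU efficiency: {cpu_efficiency:.0%}")
```

Note that 100% here only means the allocated cores were busy, not that the computation itself was useful (see the redundancy caveat above).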

Shortcomings


  • Not enough info about e.g. I/O, no timeline of metrics during job execution, …
    • I/O may be available, but likely only for local disks
    • => no parallel FS
    • => no network
  • Energy demand may be missing or wrong
    • Depends on available features
    • Doesn’t estimate energy for network switches, cooling, etc.
  • => trying other tools! (motivation for subsequent episodes)

Summary


Leading question: Is there a systematic approach to study a job's performance at different scales? -> Scaling study

Key Points
  • sacct and seff for first results
  • Small scaling study, maximum of X% overhead is “still good” (larger resource req. vs. speedup)
  • Getting a feel for scale of the HPC system, e.g. “is 64 cores a lot?”, how large is my job in comparison?
  • CPU and Memory Utilization
  • Core-h and relationship to power efficiency

Content from Scaling Study



Overview

Questions

  • How to decide the amount of resources for a job?
  • How does my application behave at different scales?

Objectives

After completing this episode, participants should be able to …

  • Perform a simple scaling study for a given application.
  • Identify good working points for the job configuration.

What do we look at?


  • Amdahl’s vs. Gustafson’s law / strong and weak scaling
  • Walltime, Speedup, efficiency
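Both laws can be written down in a few lines; this is a standard textbook formulation, not tied to the example application.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Strong scaling: fixed problem size (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    """Weak scaling: problem size grows with workers (Gustafson's law)."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * workers

# A 99% parallel code: Amdahl caps the speedup at 1/0.01 = 100,
# no matter how many workers are added.
print(amdahl_speedup(0.99, 1_000_000))
print(gustafson_speedup(0.99, 100))
```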
Discussion

Discussion: What dimensions can we look at?

  • CPUs
  • Nodes
  • Workload/problem size
Discussion

Exercise: Factors affecting scaling

  • How does the serial portion of the code affect the scaling? (A numerical example may help.)
  • If an infinite number of workers runs a highly parallel code that is 99% parallelized but 1% serial, the speedup is limited to 100. What is the ideal limit to the speedup in general?
  • How does communication affect the scaling?
  • Define example payload
    • Long enough to be significant
    • Short enough to be feasible for a quick study
  • Identify dimension for scaling study, e.g.
    • number of processes (on a single node)
    • number of processes (across nodes)
    • number of nodes involved (network-communication boundary)
    • size of workload
    • Decide on number of processes across nodes, fixed workload size
  • Choose limits (e.g. 1, 2, 4, … cores), within reasonable size for given Cluster
  • Beyond nodes? Set to one node?

Parameter Scan


  • Take measurements
    • Use time and repeating measurements (something like 3 or 10)
    • Vary scaling parameter
Discussion

Exercise: Run the Example with different -n

  • 1, 2, 4, 8, 16, 32, … cores and same workload
  • Take time measurements (ideally multiple and with --exclusive)

Analyzing results


Discussion

Exercise: Plot the scaling

  • Plot measured time against the number of cores
  • Calculate speedup with respect to baseline with 1 core
  • What’s a good working point? How do you identify it?
  • Overhead
  • Efficiency: not wasting cores if adding them doesn’t do much
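The analysis reduces to two derived quantities per data point; the timing values below are placeholders for the learners' own measurements.

```python
# Speedup and parallel efficiency relative to the 1-core baseline.
# The times are placeholders; use your own measurements.
cores = [1, 2, 4, 8, 16]
seconds = [100.0, 52.0, 27.0, 15.0, 10.0]

baseline = seconds[0]
speedups = [baseline / t for t in seconds]
efficiencies = [s / n for s, n in zip(speedups, cores)]

for n, s, e in zip(cores, speedups, efficiencies):
    print(f"{n:3d} cores: speedup {s:5.2f}, efficiency {e:.0%}")
```

A common (site-dependent) rule of thumb is to stop adding cores once efficiency drops below some threshold, since the remaining speedup no longer justifies the extra core-h.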

Summary


What’s a good working point for our example (at a given workload)?

Leading question: time and scheduler tools still don’t provide a complete picture, what other ways are there? -> Introduce third party tools to get a good performance overview

Key Points
  • Jobs behave differently with varying resources and workloads
  • A scaling study is necessary to prove a certain behavior of the application
  • Good working points lie in regions where additional cores still provide sufficient speedup and no significant overhead costs occur

Content from Performance Overview



Overview

Questions

  • Why are tools like seff and sacct not enough?
  • What steps can I take to assess a job's performance?
  • What popular types of reports exist? (e.g. Roofline)

Objectives

After completing this episode, participants should be able to …

  • Explain different approaches to performance measurements.
  • Understand common terms and concepts in performance analyses.
  • Create a performance report through a third-party tool.
  • Describe what a performance report is meant for (establish baseline, documentation of issues and improvements through optimization, publication of results, finding the next thread to pull in a quest for optimization)
  • Measure the performance of central components of underlying hardware (CPU, Memory, I/O, …) (split episode?)
  • Identify which general areas of computer hardware may affect performance.

Workflow


  • Define sampling and tracing
  • Describe common approaches

Tools


Performance counters and permissions may require --exclusive; this depends on the system! Look at the documentation / talk to your administrators / support.

cap_perfmon,cap_sys_ptrace,cap_syslog=ep
kernel.perf_event_paranoid
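A quick way to check the second setting from within a job; the interpretation of the levels is a common convention (lower values allow more unprivileged profiling), but the exact policy varies between distributions and sites.

```python
# Check whether unprivileged performance counters are likely available.
# kernel.perf_event_paranoid <= 2 usually permits user-space profiling;
# the precise meaning of each level is distribution-dependent.
from pathlib import Path

def perf_paranoid_level():
    p = Path("/proc/sys/kernel/perf_event_paranoid")
    return int(p.read_text()) if p.exists() else None  # None: not Linux

level = perf_paranoid_level()
print("perf_event_paranoid:", level)
```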

General report


  • General reports show direction in which to continue
    • Specialized tools may be necessary

How Does Performance Relate to Hardware?


(Following this structure throughout the course, trying to understand the performance in these terms)

Broad dimensions of performance:

  • CPU (Front- and Backend, FLOPS)
    • Frontend: decoding instructions, branch prediction, pipeline
    • Backend: getting data from memory, cache hierarchy & alignment
    • Raw calculations
    • Vectorization
    • Out-of-order execution
  • Accelerators (e.g. GPUs)
    • More calculations
    • Offloading
    • Memory & communication models
  • Memory (data hierarchy)
    • Working memory, reading data from/to disk
    • Bandwidth of data
  • I/O (broader data hierarchy: disk, network)
    • Stored data
    • Local disk (caching)
    • Parallel fs (cluster-wide)
    • MPI communication
  • Parallel timeline (synchronization, etc.)
    • Application logic
Challenge

Exercise: Match application behavior to hardware

Which part of the computer hardware may become an issue for the following application patterns:

  1. Calculating matrix multiplications
  2. Reading data from processes on other computers
  3. Calling many different functions from many equally likely if/else branches
  4. Writing very large files (TB)
  5. Comparing many different strings if they match
  6. Constructing a large simulation model
  7. Reading thousands of small files for each iteration

Maybe not the best questions, also missing something for accelerators.

  1. CPU (FLOPS) and/or Parallel timeline
  2. I/O (network)
  3. CPU (Front-End)
  4. I/O (disk)
  5. (?) CPU-Backend, getting strings through the cache?
  6. Memory (size)
  7. I/O (disk)

Summary


Leading question: Connection to hardware is quite deep, why does it matter? -> Drill deeper, e.g. on NUMA & pinning

Key Points
  • First things first, second things second, …
  • Profiling, tracing
  • Sampling, summation
  • Different HPC centers may provide different approaches to this workflow
  • Performance reports offer more insight into the job and application behavior

Content from Pinning



Overview

Questions

  • What is “pinning” of job resources?
  • How can pinning improve the performance?
  • How can I see, if pinning resources would help?
  • What requirement hints can I give to the scheduler?

Objectives

After completing this episode, participants should be able to …

  • Define the concept of “pinning” and how it can affect job performance.
  • Name Slurm's options for memory and CPU binding.
  • Use hints to tell Slurm how to optimize their job allocation.

Binding / pinning:

  • --mem-bind=[{quiet|verbose},]<type>
  • -m, --distribution={*|block|cyclic|arbitrary|plane=<size>}[:{*|block|cyclic|fcyclic}[:{*|block|cyclic|fcyclic}]][,{Pack|NoPack}]
  • --hint=: Hints for CPU- (compute_bound) and memory-bound (memory_bound), but also multithread, nomultithread
  • --cpu-bind=[{quiet|verbose},]<type> (srun)
  • Mapping of application <-> job resources
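One way to make the mapping visible from inside a job is to inspect the process's CPU affinity mask; a minimal sketch (Linux-only, since `os.sched_getaffinity` is not available elsewhere):

```python
# Inspect which CPUs the current process may run on (Linux only).
# Inside a Slurm job step this reflects the effective --cpu-bind setting.
import os

affinity = os.sched_getaffinity(0)   # CPU IDs available to this process
print(f"bound to {len(affinity)} CPUs: {sorted(affinity)}")
```

Running this under srun with different --cpu-bind values shows directly how the binding options change the set of allowed CPUs.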

Why what how?


B

Summary


Leading question: Pinning is very specific, but was it really limiting the performance of our application? How can I identify the biggest issue?

Key Points
  • C

Content from How to identify a bottleneck?



Overview

Questions

  • How can I find the bottlenecks in a given job?
  • What are common workflows to evaluate performance?
  • What are some common types of bottlenecks?

Objectives

After completing this episode, participants should be able to …

  • Choose between multiple workflows to evaluate job performance.
  • Name typical performance issues.
  • Determine if their job is affected by one of these issues.

How to identify a bottleneck?


Summary


Leading question: We were looking at a standard configuration with CPU, Memory, Disks, Network, so far. What about GPU applications, which are very common these days?

Key Points
  • General advice on the workflow
  • Performance reports may provide an automated summary with recommendations
  • Performance metrics can be categorized by the underlying hardware, e.g. CPU, memory, I/O, accelerators.
  • Bottlenecks can appear by metrics being saturated at the physical limits of the hardware or indirectly by other metrics being far from what the physical limits are.
  • Interpreting bottlenecks is closely related to what the application is supposed to do.
  • Relative measurements (baseline vs. change)
    • system is quiescent, fixed CPU freq + affinity, warmups, …
    • Reproducibility -> link to git course?
  • Scanning results for smoking guns
  • Any best practices etc.

Content from Performance of Accelerators



Overview

Questions

  • What are accelerators?
  • How do they affect my job's performance?
  • How can I measure accelerator utilization?

Objectives

After completing this episode, participants should be able to …

  • Understand differences between performance measurements on accelerators (GPUs, FPGAs) and on CPUs.
  • Understand how batch systems and performance measurement tools treat accelerators.

Introduction


Run the same example workload on GPU and compare.

Summary


Leading question: Performance optimization is a deep topic and we are not done learning. How could I continue exploring the topic?

Key Points
  • Tools to measure GPU/FPGA performance of a job
  • Common symptoms of GPU/FPGA problems

Content from Next Steps



Overview

Questions

  • What are other patterns of performance bottlenecks?
  • How to evaluate an application in more detail?

Objectives

After completing this episode, participants should be able to …

  • Find collection of performance patterns on hpc-wiki.info
  • Identify next steps to take with regard to performance optimization.

Next Steps


hpc-wiki.info:

  • I/O
  • CPU Front End
  • CPU Back End
  • Memory leak
  • Oversubscription
  • Underutilization

Summary


Key Points
  • There are many profilers, some are language-specific, others are vendor-related, …
  • Simple profile with exclusive resources
  • Repeated measurements for reliability