Summary and Schedule
Outlining the course
- Targeted audience (see learner profiles: New HPC users, Research Software Engineers with users on HPC systems, researchers in HPC.NRW)
- Estimated length and recommended formats (e.g. X full days, X * 2 half days, in-person/online, live-coding)
- Course intentions (focus on the learners’ perspective!):
- Speed up research (efficient computations, more results per unit of time, shorter iteration times, “less in the way”)
- Convey intuition about job sizes: what is considered large, and what small?
- Improve batch utilization by matching application requirements to the requested hardware (minimal resource requirements, maximum resource utilization)
- Sharpen awareness of the importance of not wasting time and energy on a shared system
- Teach common concepts and terms of performance
- First steps into performance optimization (cluster, node, and application level)
- Course context for learners:
- Working on HPC Systems (Batch system, shared file systems, software modules, …)
- Performance of scheduled batch jobs
- Application performance is touched upon (as it relates to job efficiency), but an in-depth treatment is outside the scope. Next steps point towards deeper performance analyses, e.g. with tracers and profilers
| Duration | Episode | Questions |
| --- | --- | --- |
| Setup Instructions | Download files required for the lesson | |
| Duration: 00h 00m | 1. Introduction | Why should I care about job performance? How is efficiency defined? How do I start measuring? Is my job fast enough? |
| Duration: 00h 10m | 2. Resource Requirements | How many resources should I request initially? What scheduler options exist to request resources? How do I know if they are used well? How large is my HPC cluster? |
| Duration: 00h 20m | 3. Scheduler Tools | What can the scheduler tell me about job performance? What is the meaning of the collected metrics? |
| Duration: 00h 30m | 4. Scaling Study | How do I decide the amount of resources for a job? How does my application behave at different scales? |
| Duration: 00h 40m | 5. Performance Overview | Why are tools like `seff` and `sacct` not enough? What steps can I take to assess a job’s performance? What popular types of reports exist (e.g. Roofline)? |
| Duration: 00h 50m | 6. Pinning | What is “pinning” of job resources? How can pinning improve performance? How can I see if pinning resources would help? What requirement hints can I give to the scheduler? |
| Duration: 01h 00m | 7. How to identify a bottleneck? | How can I find the bottlenecks in a given job? What are common workflows to evaluate performance? What are some common types of bottlenecks? |
| Duration: 01h 10m | 8. Performance of Accelerators | What are accelerators? How do they affect my job’s performance? How can I measure accelerator utilization? |
| Duration: 01h 20m | 9. Next Steps | What are other patterns of performance bottlenecks? How do I evaluate an application in more detail? |
| Duration: 01h 30m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Learning Objectives
After attending this training, participants will be able to:
- Explain efficiency in the context of HPC systems
- Use batch system tools and third-party tools to measure job efficiency (see the sketch below)
- Distinguish between poorly and well performing jobs
- Describe common concepts and terms related to performance on HPC systems
- Identify hardware components involved in performance considerations
- Achieve first results in performance optimization of their application
- Recall next steps to take towards learning performance optimization
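For example, the batch system tools referred to above include Slurm’s `sacct` accounting command and the `seff` efficiency summary (where your site provides it). A minimal sketch, with `<jobid>` as a placeholder:

```bash
# Inspect a completed job with Slurm's accounting tools.
# <jobid> is a placeholder; seff is a contributed Slurm script
# and may not be installed on every cluster.
seff <jobid>

# Query runtime, consumed CPU time and peak memory from the accounting database.
sacct -j <jobid> --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
```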
Prerequisites
- Access to an HPC system
- Example workload setup
- Basic knowledge of HPC systems (batch systems, parallel file systems, modules) – being able to submit a simple job and understand what happens in broad terms
- Knowledge of tools to work with HPC systems:
- Bash shell & scripting
- ssh & scp
- Simple Slurm job scripts and commands like `srun`, `sbatch`, `squeue`, `scancel` (see the sketch after this list)
- git
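As a rough gauge of the expected level, a minimal job script like the following sketch should look familiar (the partition name is a placeholder and differs between sites):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=hello          # name shown by squeue
#SBATCH --ntasks=1                # a single task
#SBATCH --time=00:05:00           # short walltime limit
#SBATCH --partition=<partition>   # placeholder: site-specific partition

# Run a trivial command on the allocated compute node.
srun hostname
```

Such a script would be submitted with `sbatch hello.sh`, monitored with `squeue -u $USER`, and, if necessary, cancelled with `scancel <jobid>`.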
Link to external resources in prerequisites:
- HPC Intro
- HPC Shell
- HPC.NRW
- Amount of knowledge about MPI, OpenMP, CUDA, etc.?
- Don’t require in-depth MPI knowledge, but some basic understanding might be necessary?
Maybe make sure required definitions / concepts are available in the hpc-wiki and link to those? But this course should be somewhat self-contained. “Jargon buster” similar to HPC intro?
Maybe add some form of self-test, e.g. like the PC2 HPC and Linux self-test? Or as an exercise in the setup / prerequisites sections?
The self-test should help to answer “Is this course for me?”: answers on the prerequisites should be mostly green, answers on the course material mostly red
HPC Access
- Do they need to apply somewhere?
- Are they eligible to request access to another system?
- Are they expected to already have an account?
- Could they try to log in in advance?
- Is there maybe some test cluster in the cloud?
You will need access to an HPC cluster to run the examples in this lesson. Discuss how to find out where to apply for access as a researcher (in general, in EU, in Germany, in NRW?). Refer to the HPC Introduction lessons to learn how to access and use a compute cluster of that scale.
- Executive summary of typical HPC workflow? Or refer to other HPCC courses that cover this
- “HPC etiquette”
- E.g. don’t run benchmarks on the login node (see the interactive-allocation sketch below)
- Don’t disturb jobs on shared nodes
- Setup of example for performance studies
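One way to respect this etiquette is to move test runs and benchmarks onto a compute node via a short interactive allocation instead of running them on the login node. A minimal Slurm sketch with placeholder values:

```bash
# Request a short interactive allocation on a compute node
# (resource and time values are placeholders; adjust to your site's limits).
salloc --ntasks=1 --cpus-per-task=4 --time=00:30:00

# Inside the allocation, launch the workload through srun so it runs
# on the allocated compute node rather than on the login node.
srun ./my_benchmark   # hypothetical benchmark binary
```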
Common Software on HPC Systems
Working on an HPC system commonly involves a
- batch system to schedule jobs (e.g. Slurm, PBS Pro, HTCondor, …), a
- module system to load certain versions of centrally provided software and a
- way to log in to a login node of the cluster.
To log in via `ssh`, you can use one of the following (remove this since it’s discussed in the HPC introduction?):

- PuTTY
- `ssh` in PowerShell
- `ssh` in Terminal.app
- `ssh` in Terminal
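As a brief reminder (covered in depth in the HPC Introduction lesson), a typical first contact with a cluster looks roughly like the following sketch; the host name is a placeholder and module names are site-specific:

```bash
# Log in to a login node of the cluster (placeholder host name).
ssh username@login.cluster.example.org

# List centrally provided software and load a specific compiler version
# (module names and versions differ between sites).
module avail
module load GCC/13.2.0
```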
Example Workload: Snowman Raytracer
- Needs a specific example job.
- Gradual improvement throughout the course
- Introduce only topics that are directly observed/experienced with the example
- Point to additional information/overview in hpc-wiki where useful
- Maybe close every episode with the same metric? (snowman pictures per hour at a given energy?)
- Could start with “?” in the first episodes, before we have learned how to measure it
- Motivates the discovery of certain metrics, tools, etc.
Get the code:
```bash
git clone --recursive git@github.com:HellmannM/raytracer-vectorization-example.git
cd raytracer-vectorization-example
git checkout CUDA_snowman
```
CPU Build
Prepare the out-of-source build:
```bash
cd ..
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=OFF ../raytracer-vectorization-example
```
To build the example, you need to provide the following dependencies:
- Compiler, e.g. GCC
- MPI, e.g. OpenMPI
- CMake
- Boost
- libpng
On HPC systems, this often happens by loading software modules. Exactly how the modules are named and what has to be loaded depends heavily on the specific configuration of your cluster. In this case it looks like this:
```bash
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 buildenv/default Boost/1.83.0 CMake/3.27.6 libpng/1.6.40
```
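If the module names on your cluster differ, you can usually search for them first; `module avail <pattern>` works with most module systems, while `module spider` is specific to Lmod:

```bash
# Search for available Boost and OpenMPI modules
# (output format and module hierarchy depend on your site's module system).
module avail Boost
module spider OpenMPI   # Lmod only
```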
Finally, build and run the code:

```bash
cmake --build . --parallel
mpirun -n 4 ./raytracer -width=512 -height=512 -spp=128 -threads=1 -png=snowman.png
```
This is

- starting the raytracer with a prepared scene,
- calculating the raytraced picture with \(N = 4\) MPI processes, each using a single thread (`-threads=1`),
- calculating \(128 / N = 32\) samples per pixel (`-spp=128`) in each MPI process,
- setting `height` and `width` of the resulting picture to \(512\) pixels, and finally
- storing the picture as `snowman.png`.
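Since this course focuses on scheduled batch jobs, the same run would normally be submitted through the scheduler rather than started interactively. A minimal Slurm sketch, assuming it is submitted from within the `build` directory; the partition name and walltime are placeholders:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=snowman
#SBATCH --ntasks=4                # one task per MPI process
#SBATCH --cpus-per-task=1         # matches -threads=1
#SBATCH --time=00:10:00           # placeholder walltime
#SBATCH --partition=<partition>   # placeholder: site-specific partition

# Load the same modules that were used for building (names are site-specific).
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 buildenv/default Boost/1.83.0 CMake/3.27.6 libpng/1.6.40

# Let the scheduler start the MPI processes; depending on your MPI setup
# you may need mpirun -n $SLURM_NTASKS instead of srun.
srun ./raytracer -width=512 -height=512 -spp=128 -threads=1 -png=snowman.png
```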
More on what a raytracer is and how it works. How does it parallelize?
CUDA Build
Prepare the out-of-source build:
```bash
cd ..
mkdir build_gpu && cd build_gpu
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=ON ../raytracer-vectorization-example
```
In addition to the above dependencies, this relies on CUDA and the corresponding modules of your site. The application is still run with MPI, but mostly to manage multiple processes, e.g. one per GPU.
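A sketch of such a run, assuming a CUDA module is available under that name at your site, that the binary again ends up directly in the build directory, and that two GPUs (one MPI process each) are used:

```bash
# Load a CUDA module in addition to the build dependencies
# (the exact module name and version are site-specific).
module load CUDA

# Build the CUDA variant.
cmake --build . --parallel

# Start one MPI process per GPU (two GPUs assumed here);
# the remaining options mirror the CPU run above.
mpirun -n 2 ./raytracer -width=512 -height=512 -spp=128 -threads=1 -png=snowman.png
```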
Acknowledgements
Course created in the context of HPC.NRW.