Summary and Schedule
Outlining the course
- Targeted audience (see learner profiles: New HPC users, Research Software Engineers with users on HPC systems, researchers in HPC.NRW)
- Estimated length and recommended formats (e.g. X full days, X * 2 half days, in-person/online, live-coding)
- Course intentions (focus on the learners’ perspective!):
- Speed up research (efficient computations, more results per unit of time, shorter iteration times, “less in the way”)
- Convey intuition about job sizes: what is considered large, and what small?
- Improve batch utilization by matching application requirements to the requested hardware (minimal resource requirements, maximum resource utilization)
- Sharpen awareness of the importance of not wasting time and energy on a shared system
- Teach common concepts and terms of performance
- First steps into performance optimization (cluster, node, and application level)
- Course context for learners:
- Working on HPC Systems (Batch system, shared file systems, software modules, …)
- Performance of scheduled batch jobs
- Application performance is touched upon (as it relates to job efficiency), but an in-depth treatment is outside the scope. Next steps point towards deeper performance analyses, e.g. with tracers and profilers
| Duration | Episode | Questions |
| --- | --- | --- |
| Setup Instructions | Download files required for the lesson | |
| Duration: 00h 00m | 1. Introduction | Why should I care about job performance? How is efficiency defined? How do I start measuring? Is my job fast enough? |
| Duration: 00h 10m | 2. Resource Requirements | How many resources should I request initially? What scheduler options exist to request resources? How do I know if they are used well? How large is my HPC cluster? |
| Duration: 00h 20m | 3. Scheduler Tools | What can the scheduler tell me about job performance? What is the meaning of the collected metrics? |
| Duration: 00h 30m | 4. Scaling Study | How do I decide the amount of resources for a job? How does my application behave at different scales? |
| Duration: 00h 40m | 5. Performance Overview | Why are tools like `seff` and `sacct` not enough? What steps can I take to assess a job’s performance? What popular types of reports exist (e.g. Roofline)? |
| Duration: 00h 50m | 6. Pinning | What is “pinning” of job resources? How can pinning improve performance? How can I see if pinning resources would help? What requirement hints can I give to the scheduler? |
| Duration: 01h 00m | 7. How to identify a bottleneck? | How can I find the bottlenecks in a given job? What are common workflows to evaluate performance? What are some common types of bottlenecks? |
| Duration: 01h 10m | 8. Performance of Accelerators | What are accelerators? How do they affect my job’s performance? How can I measure accelerator utilization? |
| Duration: 01h 20m | 9. Next Steps | What are other patterns of performance bottlenecks? How do I evaluate an application in more detail? |
| Duration: 01h 30m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Learning Objectives
After attending this training, participants will be able to:
- Explain efficiency in the context of HPC systems
- Use batch system tools and third-party tools to measure job efficiency (see the sketch below)
- Distinguish between poorly and well performing jobs
- Describe common concepts and terms related to performance on HPC systems
- Identify hardware components involved in performance considerations
- Achieve first results in performance optimization of their application
- Recall next steps to take towards learning performance optimization
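For example, the batch system tools referred to above include Slurm’s `sacct` accounting command and the `seff` efficiency summary (where your site provides it). A minimal sketch, with `<jobid>` as a placeholder:

```bash
# Inspect a completed job with Slurm's accounting tools.
# <jobid> is a placeholder; seff is a contributed Slurm script
# and may not be installed on every cluster.
seff <jobid>

# Query runtime, consumed CPU time and peak memory from the accounting database.
sacct -j <jobid> --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
```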
Prerequisites
- Access to an HPC system
- Example workload setup
- Basic knowledge of HPC systems (batch systems, parallel file systems, modules) – being able to submit a simple job and understand what happens in broad terms
- Knowledge of tools to work with HPC systems:
- Bash shell & scripting
- ssh & scp
- Simple Slurm job scripts and commands like `srun`, `sbatch`, `squeue`, `scancel` (see the sketch after this list)
- git
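As a rough gauge of the expected level, a minimal job script like the following sketch should look familiar (the partition name is a placeholder and differs between sites):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=hello          # name shown by squeue
#SBATCH --ntasks=1                # a single task
#SBATCH --time=00:05:00           # short walltime limit
#SBATCH --partition=<partition>   # placeholder: site-specific partition

# Run a trivial command on the allocated compute node.
srun hostname
```

Such a script would be submitted with `sbatch hello.sh`, monitored with `squeue -u $USER`, and, if necessary, cancelled with `scancel <jobid>`.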
Link to external resources in prerequisites:
- HPC Intro
- HPC Shell
- HPC.NRW
- Amount of knowledge about MPI, OpenMP, CUDA, etc.?
- Don’t require in-depth MPI knowledge, but some basic understanding might be necessary?
Maybe make sure required definitions / concepts are available in the hpc-wiki and link to those? But this course should be somewhat self-contained. “Jargon buster” similar to HPC intro?
Maybe add some form of self-test, e.g. like the PC2 HPC and Linux self-test? Or as an exercise in the setup / prerequisites sections?
The self-test should help to answer “Is this course for me?”: answers on the prerequisites should be mostly green, answers on the course material mostly red
HPC Access
- Do they need to apply somewhere?
- Are they eligible to request access to another system?
- Are they expected to already have an account?
- Could they try to log in in advance?
- Is there maybe some test cluster in the cloud?
You will need access to an HPC cluster to run the examples in this lesson. Discuss how to find out where to apply for access as a researcher (in general, in EU, in Germany, in NRW?). Refer to the HPC Introduction lessons to learn how to access and use a compute cluster of that scale.
- Executive summary of typical HPC workflow? Or refer to other HPCC courses that cover this
- “HPC etiquette”
- E.g. don’t run benchmarks on the login node (see the interactive-allocation sketch below)
- Don’t disturb jobs on shared nodes
- Setup of example for performance studies
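One way to respect this etiquette is to move test runs and benchmarks onto a compute node via a short interactive allocation instead of running them on the login node. A minimal Slurm sketch with placeholder values:

```bash
# Request a short interactive allocation on a compute node
# (resource and time values are placeholders; adjust to your site's limits).
salloc --ntasks=1 --cpus-per-task=4 --time=00:30:00

# Inside the allocation, launch the workload through srun so it runs
# on the allocated compute node rather than on the login node.
srun ./my_benchmark   # hypothetical benchmark binary
```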
Common Software on HPC Systems
Working on an HPC system commonly involves a
- batch system to schedule jobs (e.g. Slurm, PBS Pro, HTCondor, …), a
- module system to load certain versions of centrally provided software and a
- way to log in to a login node of the cluster.
To log in via `ssh`, you can use one of the following (remove this since it’s discussed in the HPC introduction?):

- PuTTY
- `ssh` in PowerShell
- `ssh` in Terminal.app
- `ssh` in Terminal
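As a brief reminder (covered in depth in the HPC Introduction lesson), a typical first contact with a cluster looks roughly like the following sketch; the host name is a placeholder and module names are site-specific:

```bash
# Log in to a login node of the cluster (placeholder host name).
ssh username@login.cluster.example.org

# List centrally provided software and load a specific compiler version
# (module names and versions differ between sites).
module avail
module load GCC/13.2.0
```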
Example Workload: Snowman Raytracer
- Needs a specific example job.
- Gradual improvement throughout the course
- Introduce only topics that are directly observed/experienced with the example
- Point to additional information/overview in hpc-wiki where useful
- Maybe close every episode with the same metric? (snowman pictures per hour at a given energy?)
- Could start with “?” in the first episodes, before we have learned how to measure it
- Motivates the discovery of certain metrics, tools, etc.
Get the code:
```bash
git clone --recursive git@github.com:HellmannM/raytracer-vectorization-example.git
cd raytracer-vectorization-example
git checkout CUDA_snowman
```
CPU Build
Prepare the out-of-source build:
```bash
cd ..
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=OFF ../raytracer-vectorization-example
```
To build the example, you need to provide the following dependencies:
- Compiler, e.g. GCC
- MPI, e.g. OpenMPI
- CMake
- Boost
- libpng
On HPC systems, this often happens by loading software modules. Exactly how the modules are named and what has to be loaded depends heavily on the specific configuration of your cluster. In this case it looks like this:
```bash
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 buildenv/default Boost/1.83.0 CMake/3.27.6 libpng/1.6.40
```
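If the module names on your cluster differ, you can usually search for them first; `module avail <pattern>` works with most module systems, while `module spider` is specific to Lmod:

```bash
# Search for available Boost and OpenMPI modules
# (output format and module hierarchy depend on your site's module system).
module avail Boost
module spider OpenMPI   # Lmod only
```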
Finally, build and run the code:

```bash
cmake --build . --parallel
mpirun -n 4 ./raytracer -width=512 -height=512 -spp=128 -threads=1 -png=snowman.png
```
This is

- starting the raytracer with a prepared scene,
- calculating the raytraced picture with \(N = 4\) MPI processes, each using a single thread (`-threads=1`),
- calculating \(128 / N = 32\) samples per pixel (`-spp=128`) in each MPI process,
- setting `height` and `width` of the resulting picture to \(512\) pixels, and finally
- storing the picture as `snowman.png`.
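Since this course focuses on scheduled batch jobs, the same run would normally be submitted through the scheduler rather than started interactively. A minimal Slurm sketch, assuming it is submitted from within the `build` directory; the partition name and walltime are placeholders:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=snowman
#SBATCH --ntasks=4                # one task per MPI process
#SBATCH --cpus-per-task=1         # matches -threads=1
#SBATCH --time=00:10:00           # placeholder walltime
#SBATCH --partition=<partition>   # placeholder: site-specific partition

# Load the same modules that were used for building (names are site-specific).
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 buildenv/default Boost/1.83.0 CMake/3.27.6 libpng/1.6.40

# Let the scheduler start the MPI processes; depending on your MPI setup
# you may need mpirun -n $SLURM_NTASKS instead of srun.
srun ./raytracer -width=512 -height=512 -spp=128 -threads=1 -png=snowman.png
```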
More on what a raytracer is and how it works. How does it parallelize?
CUDA Build
Prepare the out-of-source build:
```bash
cd ..
mkdir build_gpu && cd build_gpu
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=ON ../raytracer-vectorization-example
```
In addition to the above dependencies, this relies on CUDA and the corresponding modules of your site. The application is still run with MPI, but mostly to manage multiple processes, e.g. one per GPU.
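A sketch of such a run, assuming a CUDA module is available under that name at your site, that the binary again ends up directly in the build directory, and that two GPUs (one MPI process each) are used:

```bash
# Load a CUDA module in addition to the build dependencies
# (the exact module name and version are site-specific).
module load CUDA

# Build the CUDA variant.
cmake --build . --parallel

# Start one MPI process per GPU (two GPUs assumed here);
# the remaining options mirror the CPU run above.
mpirun -n 2 ./raytracer -width=512 -height=512 -spp=128 -threads=1 -png=snowman.png
```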
Acknowledgements
Course created in the context of HPC.NRW.