Summary and Setup
Outlining the course
- Target audience (see learner profiles: New HPC users, Research Software Engineers with users on HPC systems, researchers in HPC.NRW)
- Estimated length and recommended formats (e.g. X full days, X * 2 half days, in-person/online, live-coding)
- Course intentions (focus on the learners' perspective!):
  - Speed up research (efficient computations, more work per unit of time, shorter iteration times, “less in the way”)
  - Convey intuition about job sizes: what is considered large, and what small?
  - Improve batch utilization by matching application requirements to requested hardware (minimal resource requirements, maximum resource utilization)
  - Sharpen awareness of the importance of not wasting time and energy on a shared system
  - Teach common concepts and terms of performance
  - First steps into performance optimization (cluster, node, and application level)
- Course context for learners:
  - Working on HPC systems (batch system, shared file systems, software modules, …)
  - Performance of scheduled batch jobs
  - Application performance is touched upon (as it relates to job efficiency), but in-depth analysis is outside the scope; next steps point towards deeper performance analysis, e.g. with tracers and profilers
Learning Objectives
After attending this training, participants will be able to:
- Explain efficiency in the context of HPC systems
- Use batch-system and third-party tools to measure job efficiency
- Distinguish between worse- and better-performing jobs
- Describe common concepts and terms related to performance on HPC systems
- Identify hardware components involved in performance considerations
- Achieve first results in performance optimization of their application
- Recall next steps to take towards learning performance optimization
Prerequisites
- Access to an HPC system
- Example workload setup
- Basic knowledge of HPC systems (batch systems, parallel file systems, modules) – being able to submit a simple job and understand what happens in broad terms
- Knowledge of tools to work with HPC systems:
  - Bash shell & scripting
  - ssh & scp
  - Simple Slurm job scripts and commands like srun, sbatch, squeue, scancel (a minimal job script sketch follows this list)
  - git
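As a reminder of that prerequisite level, a minimal Slurm job script could look like the following (a generic sketch: the job name, resource limits, and any partition or account settings are placeholders and depend on your cluster):
BASH
#!/bin/bash
#SBATCH --job-name=hello        # name shown by squeue
#SBATCH --ntasks=1              # a single task
#SBATCH --cpus-per-task=1       # on a single CPU core
#SBATCH --time=00:05:00         # wall-clock time limit
#SBATCH --mem=1G                # requested memory

# The actual work: report which node the job ran on
srun hostname
Such a script is submitted with sbatch, monitored with squeue -u $USER, and cancelled with scancel <jobid>.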
HPC Access
You will need access to an HPC cluster to run the examples in this lesson. Discuss how to find out where to apply for access as a researcher (in general, in the EU, in Germany, in NRW?). Refer to the HPC Introduction lessons to learn how to access and use a compute cluster of that scale.
- Executive summary of typical HPC workflow? Or refer to other HPCC courses that cover this
- “HPC etiquette”
  - E.g. don’t run benchmarks on the login node; use an interactive allocation on a compute node instead (a sketch follows this list)
  - Don’t disturb jobs on shared nodes
- Setup of example for performance studies
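As a concrete illustration of that etiquette point, short test or benchmark runs can be moved to a compute node via an interactive job (a sketch assuming a Slurm cluster; partition, account, and time limits are site-specific):
BASH
# Request an interactive shell on a compute node (one core, 30 minutes);
# add site-specific options such as --partition or --account if required
srun --ntasks=1 --cpus-per-task=1 --time=00:30:00 --pty bash

# ... run short tests or benchmarks here, on the compute node ...

exit    # leave the interactive job and return to the login node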
Common Software on HPC Systems
Working on an HPC system commonly involves a
- batch system to schedule jobs (e.g. Slurm, PBS Pro, HTCondor, …), a
- module system to load certain versions of centrally provided software, and a
- way to log in to a login node of the cluster.
To log in via ssh, you can use one of (remove this since it’s discussed in HPC introduction?)
- PuTTY (Windows)
- ssh in PowerShell (Windows)
- ssh in Terminal.app (macOS)
- ssh in Terminal (Linux)
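For example, a typical login could look like this (user name and host name are placeholders; your cluster’s documentation has the real values):
BASH
# Connect to the cluster's login node (replace user name and host name)
ssh jdoe@login.example-cluster.org

# Copy a file to the cluster with scp (same placeholder host)
scp results.tar.gz jdoe@login.example-cluster.org:~/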
Example Workload: Snowman Raytracer
Get the code:
BASH
git clone --recursive git@github.com:HellmannM/raytracer-vectorization-example.git raytracer-vectorization-example.git
cd raytracer-vectorization-example.git
git checkout CUDA_snowman
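If you do not have SSH keys set up for GitHub, cloning via HTTPS should work as well, assuming the repository is publicly accessible; the remaining steps stay the same:
BASH
# Same repository via HTTPS, cloned into the same directory name used below
git clone --recursive https://github.com/HellmannM/raytracer-vectorization-example.git raytracer-vectorization-example.git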
CPU Build
Prepare the out-of-source build:
BASH
cd ..
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=OFF ../raytracer-vectorization-example.git
To build the example, you need to provide the following dependencies:
- Compiler, e.g. GCC
- MPI, e.g. OpenMPI
- CMake
- Boost
- libpng
On HPC systems, providing these dependencies often happens by loading software modules. How exactly the modules are named and what has to be loaded depends heavily on the specific configuration of your cluster. In this case it looks like this:
BASH
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 buildenv/default Boost/1.83.0 CMake/3.27.6 libpng/1.6.40
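To find out how the modules are named on your cluster, the module system itself can be queried (a sketch; module spider is only available on Lmod-based systems, and the exact output is site-specific):
BASH
module avail            # list modules visible in the current environment
module spider GCC       # search across the whole module hierarchy (Lmod only)
module list             # show what is currently loaded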
Finally, build and run the code:
BASH
cmake --build . --parallel
mpirun -n 4 ./build/raytracer -width=512 -height=512 -spp=128 -threads=1 -png=snowman.png
This is
- starting the raytracer with a prepared scene,
- calculating the raytraced picture with \(N = 4\) MPI processes, each using a single thread (-threads=1),
- calculating \(128 / N = 32\) samples per pixel (-spp=128) in each MPI process,
- setting height and width of the resulting picture to \(512\) pixels, and finally
- storing the picture as snowman.png.
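To get a first feeling for how the work is distributed, the same picture can be rendered with different process and thread counts (a sketch; it assumes -threads accepts values other than 1, and that you stay within the cores you actually requested from the batch system):
BASH
# More MPI processes: each rank now computes 128 / 8 = 16 samples per pixel
mpirun -n 8 ./build/raytracer -width=512 -height=512 -spp=128 -threads=1 -png=snowman.png

# Fewer processes, more threads per process (hybrid MPI + threading)
mpirun -n 2 ./build/raytracer -width=512 -height=512 -spp=128 -threads=4 -png=snowman.png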
More on what a raytracer is and how it works. How does it parallelize?
CUDA Build
Prepare the out-of-source build:
BASH
cd ..
mkdir build_gpu && cd build_gpu
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=ON ../raytracer-vectorization-example.git
In addition to the above dependencies, this build relies on CUDA and the corresponding modules of your site. The application is still run with MPI, but mostly to manage multiple processes, e.g. one per GPU.
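A sketch of what this could look like (the CUDA module name and version, the number of GPUs, and the carried-over command-line options are assumptions; check your site’s documentation and the repository’s README):
BASH
# Load a CUDA module on top of the modules used for the CPU build
# (the exact name is site-specific; "module avail CUDA" shows what exists)
module load CUDA/12.3.0

# Build the GPU variant in the build_gpu directory prepared above
cmake --build . --parallel

# Run with one MPI process per GPU, e.g. on a node with two GPUs
mpirun -n 2 ./build_gpu/raytracer -width=512 -height=512 -spp=128 -png=snowman.png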
Acknowledgements
Course created in the context of HPC.NRW.