Summary and Setup

Callout

đźš§ Under Construction đźš§

This course is still being constructed—please be patient.

Outlining the course

  • Targeted audience (see learner profiles: New HPC users, Research Software Engineers with users on HPC systems, researchers in HPC.NRW)
  • Estimated length and recommended formats (e.g. X full days, X * 2 half days, in-person/online, live-coding)
  • Course intentions (focus on the learners’ perspective!):
    • Speed up research (efficient computations, more results per unit of time, shorter iteration times, “less in the way”)
    • Convey intuition about job sizes. What is considered large, what small?
    • Improve batch utilization through matching application requirements to requested hardware (minimal resource requirements, maximum resource utilization)
    • Sharpen awareness of the importance of not wasting time/energy on a shared system
    • Teach common concepts and terms of performance on a beginner level
    • First steps into performance optimizations (cluster-, node-, and application level)
  • Course context for learners:
    • Working on HPC Systems (Batch system, shared file systems, software modules, …)
    • Performance of scheduled batch jobs
    • Application performance is only addressed briefly (as it relates to job efficiency); in-depth analysis is outside the scope. The episode “Next Steps” should point towards deeper performance analyses, e.g. with tracers and profilers, and how to get started with them

Learning Objectives


After attending this training, participants will be able to:

  • Explain efficiency in the context of High Performance Computing (HPC) systems
  • Use batch system tools and third party tools to measure job efficiency
  • Distinguish between better- and worse-performing jobs
  • Describe common concepts and terms related to performance on HPC systems
  • Identify hardware components involved in performance considerations
  • Achieve first results in performance optimization of their application
  • Recall next steps to take towards learning performance optimization

Prerequisites


Prerequisite
  • Access to an HPC system
  • Example workload setup
  • Basic knowledge of HPC systems (batch systems, parallel file systems, modules) – being able to submit a simple job and understand what happens in broad terms
  • Knowledge of tools to work with HPC systems (a quick self-check follows below this list):
    • Bash shell & scripting
    • ssh & scp
    • Simple Slurm job scripts and commands like srun, sbatch, squeue, scancel
    • git
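
If you want to refresh these basics, the following commands are a minimal self-check; the host name, job script name, and job ID below are placeholders and will differ on your cluster.

BASH

# Minimal self-check (host name, script name, and job ID are placeholders)
ssh yourname@login.your-cluster.example.org   # log in to a login node
module avail                                  # list centrally provided software modules
sbatch my_job.sbatch                          # submit an existing job script
squeue -u $USER                               # show your jobs in the queue
scancel 123456                                # cancel a job by its job ID
git --version                                 # confirm git is available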

HPC Access

You will need access to an HPC cluster to run the examples in this lesson. Discuss how to find out where to apply for access as a researcher (in general, in the EU, in Germany, in NRW?). Refer to the HPC Introduction lessons to learn how to access and use a compute cluster of that scale.

  • Executive summary of typical HPC workflow? Or refer to other HPCC courses that cover this
  • “HPC etiquette”
    • E.g. don’t run benchmarks and other computationally heavy workloads on the login node; emphasise its purpose (see the sketch after this list)
    • Don’t disturb jobs on shared nodes (← this phrasing is hard to grasp for newcomers and should be avoided. It will block them from trying things if they are afraid to break anything. Maybe this is more the responsibility of admins, and users should just be aware that they may affect other users?)
  • Setup of the workflow example below (next section)
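
For example, instead of running a computationally heavy test on a login node, you can request a short interactive session on a compute node. This is only a sketch: the partition name and time limit are placeholders, so consult your cluster documentation for suitable values.

BASH

# Request a short interactive session on a compute node
# (partition name and time limit are placeholders)
srun --ntasks=1 --time=00:30:00 --partition=devel --pty bash

# ... run your quick tests on the compute node, then leave the shell again
exit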
Discussion

Common Software on HPC Systems

Working on an HPC system commonly involves:

  • a batch system to schedule jobs (e.g. Slurm, PBS Pro, HTCondor, …),
  • a module system to load certain versions of centrally provided software, and
  • a way to log in to a login node of the cluster.

To log in via ssh, you can use one of the following clients (remove this since it’s discussed in the HPC introduction?); an example command follows the list:

  • PuTTY
  • ssh in PowerShell
  • ssh in Terminal.app
  • ssh in Terminal
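
Regardless of the client, the command itself looks roughly the same; the user and host names below are placeholders for your actual account and cluster address.

BASH

# Placeholder user and host name; use your own account and cluster address
ssh yourname@login.your-cluster.example.org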

Example Workload: Snowman Raytracer

Throughout the course, we will use an example application to learn workflows and tools for evaluating job performance. The example is a raytracer that renders a prepared scene. It provides different means of parallelization, i.e. multiple processes (MPI), multithreading, or a GPU (CUDA). MPI and multithreading can be combined. The GPU-accelerated version uses MPI only to manage processes; all calculations are done on one or more GPUs.

We do not have to study and understand the example code in detail. After compilation, all necessary options are exposed as different binaries or through command line arguments.

We do, however, have to prepare a build environment with all necessary libraries and build the code with CMake. This is a common situation in scientific software as well: researchers depend on existing software, and their first contact with an unknown code is often exactly this, having to build and prepare it. Their first interest typically is: is this project useful for my research?

The example application should be prepared in a central location, e.g. your HPC cluster’s parallel file system, such that it is accessible for multiple runs on various worker nodes of your cluster.

Let’s get started by cloning the repository:

BASH

# Log in to your cluster via ssh first
mkdir jobefficiencyguide && cd jobefficiencyguide
git clone --recursive https://codeberg.org/HPC-NRW/SnowmanRaytracer.git
cd SnowmanRaytracer
Callout

Do not forget --recursive

Our example project depends on another project that implements the basic raytracing methods. This dependency is included as a git submodule, so recursive cloning is necessary; otherwise we cannot build the project.
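
If you did clone without --recursive, you do not have to start over; git can fetch the missing submodules afterwards:

BASH

# Fetch missing submodules in an already cloned repository
git submodule update --init --recursive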

CPU Build

The example application can perform calculations on CPUs, potentially across multiple nodes using MPI for communication, and/or in multiple threads. To prepare the out-of-source build:

BASH

# Assuming you are still in the SnowmanRaytracer source directory
cd ..
mkdir build && cd build
Dependencies

To build the example, you need to provide the following dependencies:

  • Compiler, e.g. GCC
  • MPI, e.g. OpenMPI
  • CMake
  • Boost
  • libpng

On HPC systems this often happens by loading software modules that are centrally provided by your administrators. How exactly the modules are named and what has to be loaded depends very much on the specific configuration of your cluster. In one particular case it may look like this:

BASH

# Only one example, consult your cluster documentation or ask the instructor or your HPC support
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 Boost/1.83.0 CMake/3.27.6 libpng/1.6.40 buildenv/default
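
To find out how the required modules are named on your system, most module systems provide search commands. The command module avail works with both Lmod and Environment Modules, while module spider is specific to Lmod; the module names used here are just examples.

BASH

# Search for available modules (exact behaviour depends on your module system)
module avail CMake       # list modules matching "CMake"
module spider Boost      # Lmod only: search the full module tree for Boost
module list              # show which modules are currently loaded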
Callout

Software management differs widely on HPC systems

The details of how you load different versions of compilers and libraries very much depend on your particular HPC system. Follow the instructor or consult your site’s documentation or support staff in case of questions!

Building the Software

Typically, it is recommended to build the software on the same hardware architecture on which it is also intended to run later. On HPC systems, you have to ask yourself whether the login nodes have enough resources for software compilation and whether they share exactly the same hardware as your worker nodes. Check your cluster documentation for any recommendations!
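
One way to check whether login and worker nodes differ is to compare their CPU models, for example with lscpu. This is only a sketch; the partition name is a placeholder.

BASH

# CPU model of the login node
lscpu | grep "Model name"

# CPU model of a worker node (partition name is a placeholder)
srun --ntasks=1 --partition=devel lscpu | grep "Model name"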

Here, we will build and test the software in a first Slurm job script, build_snowman.sbatch:

BASH

#!/usr/bin/env bash
#SBATCH --job-name=build-and-test-Snowman
#SBATCH --nodes=1
#SBATCH --ntasks=4

# Prepare your environment with the dependencies
# This will likely look different in your case!
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 Boost/1.83.0 CMake/3.27.6 libpng/1.6.40 buildenv/default

# Assuming you are submitting from the "build" directory
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=OFF ../SnowmanRaytracer

# Building the software in parallel
cmake --build . --parallel

# First test run with 4 MPI processes
mpirun -n 4 ./raytracer -width=800 -height=800 -spp=128 -threads=1 -png=snowman.png
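
Submitting and monitoring this job uses the standard Slurm commands from the prerequisites. The job ID in the default output file name (slurm-<jobid>.out) will of course differ for your run.

BASH

# Submit the build-and-test job from the "build" directory
sbatch build_snowman.sbatch

# Check whether the job is pending or running
squeue -u $USER

# Once it has finished, inspect the job output (replace the job ID with yours)
less slurm-123456.out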
Running the Raytracer

The mpirun command from our first test run, above, is

  • starting the raytracer binary, with the prepared scene,
  • calculating the raytraced picture with \(N = 4\) MPI processes, each using a single thread (-threads=1),
  • calculating \(128 / N = 32\) samples per pixel (-spp=128) in each MPI process,
  • setting height and width of the resulting picture to \(800\) pixels (-width=800 -height=800), and finally
  • storing the picture as snowman.png.

A raytracer calculates the interaction of straight “light rays” with objects placed in a 3D scene. Each object can have different material properties, resulting in different optical effects, e.g. matte or (partially) translucent surfaces. Light rays that reach the “camera” contribute to the final picture by accumulating their effects across all pixels.

Computationally, all operations can be reduced to matrix-matrix and matrix-vector calculations, which can be performed individually for each ray of light. This allows for different parallelization schemes. You could divide the pixels of the final picture into regions, where each parallel process calculates one region. Another strategy, which is applied here, is dividing the number of samples per pixel (spp) across all parallel processes. For each pixel, spp light rays contribute to the final pixel value. For example, with -spp=128 and \(4\) MPI processes, each MPI process is responsible for \(\frac{128}{4}=32\) samples for all pixels of the resulting picture.

Instead of, or in addition to, MPI processes, parallelization can be achieved with \(T\) threads via the -threads=T parameter. Threads within a process share the same memory and therefore may have a lower memory footprint than multiple processes.
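
As a sketch of how these options combine (the process and thread counts you should actually use depend on the hardware you request, and the output file name is just an example), a hybrid run with 2 MPI processes of 4 threads each could look like this:

BASH

# Hybrid run: 2 MPI processes x 4 threads = 8 parallel workers,
# each process computing 128 / 2 = 64 samples per pixel
mpirun -n 2 ./raytracer -width=800 -height=800 -spp=128 -threads=4 -png=snowman_hybrid.png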

CUDA Build

The example application can also utilize NVIDIA GPUs via CUDA. In this case, the raytracing calculations are performed directly on the GPUs, which are an ideal environment for this type of calculation, provided that the resolution and complexity are large enough.

CUDA support requires a separate build, which we will also run in its own Slurm job. In this case, it may be especially important to build on the target hardware, since your login nodes may not contain the accelerators we intend to use.
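
Before building, you can check which partitions provide GPUs and what accelerators their nodes offer. On many Slurm clusters the following commands work; the partition name is a placeholder.

BASH

# List partitions and their generic resources (GRES), such as GPUs
sinfo -o "%P %G"

# Inspect the GPUs of a node in a GPU partition (partition name is a placeholder)
srun --partition=gpus --gpus=1 nvidia-smi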

Let’s prepare the build directory again on our login node:

BASH

# Assuming you are still in the SnowmanRaytracer source directory or the CPU build directory
cd ..
mkdir build_gpu && cd build_gpu

In addition to the above dependencies, this build relies on CUDA, and you may have to load the corresponding modules for your HPC system. The application still uses MPI, but mostly to manage multiple processes, e.g. one process for each GPU if multiple GPUs are used.
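
As before, the exact module name is site-specific. On a cluster with a module setup like the one above, loading CUDA might look similar to the following; the version shown is only a placeholder.

BASH

# Example only; consult your cluster documentation for the actual CUDA module name
module load CUDA/12.3.0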

Our build script (build_snowman_cuda.sbatch) may look like this:

BASH

#!/usr/bin/env bash
#SBATCH --job-name=build-and-test-Snowman-CUDA
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --partition=gpus
#SBATCH --gpus=2

# Prepare your environment with the dependencies
# This will likely look different in your case!
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 Boost/1.83.0 CMake/3.27.6 libpng/1.6.40 buildenv/default

# Assuming you are submitting from the "build_gpu" directory
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=ON ../SnowmanRaytracer

# Building the software in parallel
cmake --build . --parallel

# First test run with 2 MPI processes, one per GPU
export CUDA_VISIBLE_DEVICES=0,1
mpirun -n 2 ./raytracer -width=800 -height=800 -spp=128 -threads=1 -png=snowman_gpu.png
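
Submitting the GPU build job works exactly like the CPU one. Afterwards you can check the job output and confirm that the image was written; the job ID below is a placeholder.

BASH

# Submit from the "build_gpu" directory
sbatch build_snowman_cuda.sbatch

# After completion, check the output log and the rendered image
less slurm-123456.out
ls -lh snowman_gpu.png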

We are all set to learn about job efficiency!


With the example application in place, we are ready to explore the many factors that affect job performance. We will use this application repeatedly in different configurations, so make sure to keep it in a central location that stays accessible throughout the course.

Acknowledgements


This course was created in the context of HPC.NRW.