Performance Overview

Last updated on 2025-11-11

Overview

After completing this episode, participants should be able to …

Explain different approaches to performance measurements.
Understand common terms and concepts in performance analyses.
Create a performance report through a third-party tool.
Describe what a performance report is meant for (establishing a baseline, documenting issues and improvements through optimization, publishing results, finding the next thread to pull in a quest for optimization).
Measure the performance of central components of underlying hardware (CPU, Memory, I/O, …) (split episode?)
Identify which general areas of computer hardware may affect performance.

Previously, we checked scaling behavior by looking at walltime.
What if we counted other things while our job is running? These could be:
- CPU utilization
- FLOPS
- Memory utilization
- …
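Two possible ways to look at this data with respect to time:
1. tracing: record events as they happen, giving a view of the metric over time
2. sampling: probe the job at regular intervals and accumulate the results into a summary at the end
Third-party tools can measure these things, and you can use them with your own jobs.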
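
As a minimal sketch of the difference between following a metric over time and looking only at the accumulated result, the Python snippet below runs a toy workload in several steps. It keeps a timeline of walltime and CPU time after each step and also builds a single summary at the end. The names `busy_work` and `run_with_measurements` are hypothetical and not part of any tool used in this course. Real tools read hardware counters rather than Python timers.

```python
import time


def busy_work(n):
    """CPU-bound toy workload (hypothetical stand-in for a real job step)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_with_measurements(steps=5, n=100_000):
    """Run the workload and return (timeline, summary)."""
    timeline = []  # one record per step: a view over time
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for step in range(steps):
        busy_work(n)
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        timeline.append((step, wall, cpu))
    # Accumulated result: a single set of numbers for the whole run
    walltime = time.perf_counter() - wall_start
    cpu_time = time.process_time() - cpu_start
    summary = {
        "walltime_s": walltime,
        "cpu_time_s": cpu_time,
        "cpu_utilization": cpu_time / walltime,
    }
    return timeline, summary


if __name__ == "__main__":
    timeline, summary = run_with_measurements()
    for step, wall, cpu in timeline:
        print(f"step {step}: wall={wall:.4f}s cpu={cpu:.4f}s")
    print(summary)
```

The timeline shows when time was spent, while the summary only tells how much was spent in total. Both views come from the same two clocks.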

Callout

Here you can choose between three alternative perspectives on our job:

ClusterCockpit: A job monitoring service available on many of our clusters. It needs to be centrally maintained by your HPC administration team.
Linaro Forge Performance Reports: A commercial application providing a single-page performance overview of your job. Your cluster may have licenses available.
TBD: A free, open-source tool or set of tools for getting a general performance overview of your job.

Reading performance counters requires permissions and may require --exclusive, depending on the system! Look at the documentation or talk to your administrators or support.

cap_perfmon,cap_sys_ptrace,cap_syslog=ep
kernel.perf_event_paranoid
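
To check what your system currently allows, the setting kernel.perf_event_paranoid can be read on Linux from /proc/sys/kernel/perf_event_paranoid. The sketch below is only an illustration: the function name `read_perf_event_paranoid` is hypothetical, and how the value is to be interpreted on your cluster is best confirmed in the kernel documentation or with your administrators. In general, lower values are more permissive for unprivileged users.

```python
from pathlib import Path

# Standard location of the setting on Linux systems
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_perf_event_paranoid(path=PARANOID_PATH):
    """Return the setting as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (e.g. not Linux), unreadable, or not a number
        return None


if __name__ == "__main__":
    level = read_perf_event_paranoid()
    if level is None:
        print("perf_event_paranoid is not readable on this system")
    else:
        print(f"kernel.perf_event_paranoid = {level}")
```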

Live coding:

Setup: webpage & login. Any conditions on when it is enabled on your particular cluster?
If always enabled: figure out the job ID of the previous 8-core job from Episode 4

N/A

N/A

(Following this structure throughout the course, trying to understand the performance in these terms)

Broad dimensions of performance:

CPU (Front- and Backend, FLOPS)
- Frontend: decoding instructions, branch prediction, pipeline
- Backend: getting data from memory, cache hierarchy & alignment
- Raw calculations
- Vectorization
- Out-of-order execution
Accelerators (e.g. GPUs)
- More calculations
- Offloading
- Memory & communication models
Memory (data hierarchy)
- Working memory, moving data from/to disk
- Bandwidth of data
I/O (broader data hierarchy: disk, network)
- Stored data
- Local disk (caching)
- Parallel fs (cluster-wide)
- MPI communication
Parallel timeline (synchronization, etc.)
- Application logic
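
To make two of the terms above concrete (FLOPS for the CPU, bandwidth for memory), here is a small sketch that times a triad-style loop, a[i] = b[i] + scalar * c[i], and derives both rates from simple counts. The function names are hypothetical, and the byte count is a simplified assumption (two reads and one write of 8 bytes per element). In pure Python the interpreter overhead dominates, so the numbers will be far below what the hardware can deliver. The point is only how the quantities are computed.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """a[i] = b[i] + scalar * c[i]: one multiply and one add per element."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def measure_triad(n=200_000):
    """Time the triad loop and derive FLOP and byte rates from simple counts."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start
    flop_count = 2 * n                 # multiply + add per element
    bytes_moved = 3 * a.itemsize * n   # read b, read c, write a
    return {
        "elapsed_s": elapsed,
        "flop_count": flop_count,
        "bytes_moved": bytes_moved,
        "flops_per_s": flop_count / elapsed,
        "bytes_per_s": bytes_moved / elapsed,
        "first_value": a[0],
    }


if __name__ == "__main__":
    result = measure_triad()
    print(f"{result['flops_per_s']:.3e} FLOP/s, {result['bytes_per_s']:.3e} B/s")
```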

Challenge

Which part of the computer hardware may become an issue for the following application patterns:

Maybe not the best questions, also missing something for accelerators.

General reports show the direction in which to continue
- Specialized tools may be necessary to move on

Leading question: The connection to hardware is quite deep. Why does it matter? -> Drill deeper, e.g. into NUMA & pinning
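
As a first, hedged look at pinning, the sketch below asks the operating system which CPU cores the current process is allowed to run on. It relies on `os.sched_getaffinity`, which Python only provides on some platforms (such as Linux), so the hypothetical helper `allowed_cpus` returns None elsewhere. It does not show the NUMA layout itself.

```python
import os


def allowed_cpus():
    """Return the sorted CPU ids this process may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):
        # 0 means "the calling process"
        return sorted(os.sched_getaffinity(0))
    return None


if __name__ == "__main__":
    cpus = allowed_cpus()
    if cpus is None:
        print("CPU affinity is not available on this platform")
    else:
        print(f"allowed on {len(cpus)} of {os.cpu_count()} CPUs: {cpus}")
```

Inside a batch job, the list of allowed CPUs is typically smaller than the number of CPUs in the node, which reflects what the scheduler assigned to the job.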

Key Points