Summary and Schedule
Outlining the course
- Target audience (see learner profiles: New HPC users, RSEs with users on HPC systems, researchers in HPC.NRW)
- Estimated length and recommended formats (e.g. X full days, X * 2 half days, in-person/online, live-coding)
- Course intentions (focus on the learners' perspective!):
  - Speed up research
  - Improve batch utilization by matching application requirements to the requested hardware (minimal resource requirements, maximum resource utilization)
  - Convey intuition about job sizes: what is considered large, what small?
  - Sharpen awareness of how important it is not to waste time and energy on a shared system
  - Teach common concepts and terms of performance
  - First steps into performance optimization (cluster, node, and application level)
- Well-defined context:
  - HPC systems
  - Performance of jobs
  - Application performance is touched on (related to job efficiency), but in-depth treatment is outside the scope; next steps point towards deeper performance analyses
| Duration | Episode | Questions |
|----------|---------|-----------|
| | Setup Instructions | Download files required for the lesson |
| 00h 00m | 1. Introduction | Why should I care about my job's performance? How is efficiency defined? How do I start measuring? |
| 00h 10m | 2. Resource Requirements | How many resources should I request initially? What options does the scheduler give to request resources? How do I know if they are used well? How large is my HPC cluster? |
| 00h 20m | 3. Scaling Study | How can I decide the amount of resources I should request for my job? How do I know how my application behaves at different scales? |
| 00h 30m | 4. Scheduler Tools | What information can the scheduler provide about my job's performance? What is the meaning of the collected metrics? |
| 00h 40m | 5. Workflow of Performance Measurements | Why are simple tools like `seff` and `sacct` not enough? What steps can I take to assess a job's performance? What popular types of reports exist (e.g. Roofline)? |
| 00h 50m | 6. How to Identify a Bottleneck? | How can I find the bottlenecks in a job at hand? |
| 01h 00m | 7. Special Aspects of Accelerators | What are accelerators? How do they affect my job's performance? |
| 01h 10m | 8. Next Steps | Are there common patterns of "pathological" performance? How can I evaluate the performance of my application in greater detail? |
| 01h 20m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
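As a small preview of the Scheduler Tools and Workflow episodes, the sketch below shows how a finished job could be inspected on a Slurm-based cluster; the job ID 123456 is a placeholder.

```bash
# Quick efficiency summary of a finished job (job ID is a placeholder)
seff 123456

# More detailed accounting data for the same job
sacct -j 123456 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
```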
Learning Objectives
After attending this training, participants will be able to:
- Explain efficiency in the context of HPC systems
- Use batch system tools and third party tools to measure job efficiency
- Distinguish between better- and worse-performing jobs
- Describe common concepts and terms related to performance on HPC systems
- Identify hardware components involved in performance considerations
- Achieve first results in performance optimization of their application
- Remember next steps to take towards learning performance optimization
Prerequisites
- Access to an HPC system
- Example workload setup
- Basic knowledge of HPC systems (batch systems, parallel file systems) – being able to submit a simple job and understand what happens in broad terms (see the minimal sketch after this list)
- Knowledge of tools to work with HPC systems:
- Bash shell & scripting
- ssh & scp
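As a rough reference point for the job-submission prerequisite, a minimal Slurm batch script could look like the sketch below; the job name, time limit and submission workflow are assumptions that differ between clusters.

```bash
#!/bin/bash
# Minimal Slurm batch script (a sketch only; adjust to your cluster)
#SBATCH --job-name=hello
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

echo "Running on $(hostname)"
```

Such a script would typically be submitted with `sbatch hello.sh` and monitored with `squeue -u $USER`.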
Example Workload & Setup
Example workload that:
- Has some instructive performance issues that can be discovered, e.g.
- Mismatch between the resources requested in the job script and the resources actually used (see the sketch after this list)
- Memory leak or unnecessary allocation with a quick fix? Either triggers an OOM error or simply wastes resources, depending on its size and the default memory per core
- No vectorization?
- Parallelism issues?
- Software that can run on CPU and GPU, to discuss both with the example
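To illustrate the first issue, the sketch below requests 16 CPU cores while running a program that is assumed to be single-threaded, so most of the allocation would sit idle; `serial_analysis` and `input.dat` are hypothetical names.

```bash
#!/bin/bash
# Sketch of a resource mismatch: 16 cores are requested, but the
# (hypothetical) program below is single-threaded, so 15 cores sit idle.
#SBATCH --cpus-per-task=16
#SBATCH --time=01:00:00

./serial_analysis input.dat
```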
You will need access to an HPC cluster to run the examples in this lesson. Discuss how to find out where to apply for access as a researcher (in general, in the EU, in Germany, in NRW?). Refer to the HPC Introduction lessons to learn how to access and use a compute cluster of that scale.
- Executive summary of a typical HPC workflow? Or refer to other HPCC courses that cover this
- “HPC etiquette”
- E.g. don’t run benchmarks on the login node
- Don’t disturb jobs on shared nodes
- Setup of example for performance studies
Common Software on HPC Systems
Working on an HPC system commonly involves a
- batch system to schedule jobs (e.g. Slurm, PBS Pro, HTCondor, …), a
- module system to load certain versions of centrally provided software (see the sketch below) and a
- way to log in to a login node of the cluster.
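As a sketch of typical module-system usage (the module name and version below are placeholders that vary between sites):

```bash
# List centrally provided software
module avail

# Load a specific version of a package (name/version are placeholders)
module load gcc/13.2.0

# Show currently loaded modules
module list
```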
To log in via `ssh`, you can use

- PuTTY (Windows)
- `ssh` in PowerShell (Windows)
- `ssh` in Terminal.app (macOS)
- `ssh` in Terminal (Linux)

(remove this since it’s discussed in the HPC introduction?)
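For example (the user name and host name below are placeholders for your cluster's actual login node):

```bash
# Log in to the cluster's login node
ssh jdoe@login.hpc.example.org

# Copy a file to the cluster with scp
scp results.tar.gz jdoe@login.hpc.example.org:~/
```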