Content from Introduction
Last updated on 2025-12-12
Overview
Questions
- What exactly is job efficiency in the computing world?
- Why would I care about job efficiency and what are potential pitfalls?
- How can I start measuring how my program performs?
Objectives
After completing this episode, participants should be able to …
- Use timing commands provided by time and date.
- Understand the benefits of efficient jobs in terms of runtime and numerical accuracy.
- Have developed some awareness about the overall high energy consumption of HPC.
Background
Job efficiency, as defined by Oxford’s English Dictionaries, is the ratio of the useful work performed by a machine […] to the total energy expended or heat taken in. In a high-performance-computing (HPC) context, the useful work is the entirety of all calculations to be performed by our (heat-generating) computers. Doing this efficiently thus translates to maximizing the calculations completed in some limited time span while minimizing the heat output. In more extreme words, we want to avoid running big computers for nothing but hot air.
One may object that a single user’s job may hardly have an effect on an HPC system’s power usage since such systems are in power-on state 24/7 anyway. The same may be argued about air travel. The plane will take off anyway, whether I board the plane or not. However, we indeed have some leverage in contributing to efficiency, defined by fuel consumption in air travel: traveling lightly, i.e., avoiding excessive baggage will improve the airplane’s ratio \(\frac{useful\;work}{total\;energy\;expended}\). So let’s get back to the ground and look at some inefficiencies in computing jobs, while we will continue to use the air-travel analogy.
time to sleep
Let’s look at the command sleep:
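BASH
# A two-second computer nap
sleep 2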
This command triggers a “computer nap”. It actually delays whatever
would come next for the specified time, here 2 seconds. You can verify
that nap time using a stopwatch, the latter given by
the time command:
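BASH
# Wrap the command with time to measure it
time sleep 2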
which will report something like
real 0m2.002s
user 0m0.001s
sys 0m0.000s
The time command shall be our first
performance-measuring tool. time has become a bit of a
hello-world equivalent in HPC contexts. This command gives you
a breakdown of how your program uses CPU (Central Processing Unit) and
wall-clock time. The standard output of time reports three
fields, real, user and sys:
| Field | Meaning |
| --- | --- |
| real | Wall-clock time = total runtime as seen on a stopwatch |
| user | Time spent in user mode: actual computations like math, loops, logic |
| sys | Time spent in the OS’s kernel mode (system calls): I/O = reading/writing files, handling memory, talking to other devices |
The above sleep command abstains from any kind of math, I/O, or other work that would show up in user or sys time, hence these entries show (almost) zero.
The time command is both a keyword directly built into the Bash shell as well as an executable file, usually residing under /usr/bin/time. While very similar, they are not exactly the same. Shell/Bash keywords take precedence, so preceding a command with time invokes the shell keyword. Therefore, if you want to force the usage of /usr/bin/time, you would do
BASH
# Explicitly calling the `time` binary
$ /usr/bin/time sleep 2
0.00user 0.00system 0:02.00elapsed 0%CPU (0avgtext+0avgdata 2176maxresident)k
0inputs+0outputs (0major+90minor)pagefaults 0swaps
# Compare the output to the Bash built-in:
$ time sleep 2
real 0m2,003s
user 0m0,001s
sys 0m0,003s
# Yet another output of `time` in zsh, an alternative shell implementation to bash
$ time sleep 2
sleep 2 0,00s user 0,00s system 0% cpu 2,003 total
Notice the different output formatting. All tools provide similar insight, but the formatting and exact information may differ. So, if you saw something that looks different from the bash built-in command, this may be why!
Further note that shell keyword documentation is invoked via help <KEYWORD>, for example help time, while most executables have manual pages, e.g., man time. Finally, you can prefix the shell keyword with a backslash in order to stop Bash from evaluating it, so \time sleep 2 will revert to /usr/bin/time.
Time for a date
The date command, as its manpage (man date) says, prints or sets the system date and time. In fact, this gives us a super accurate stopwatch when used like this:
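BASH
# Print the current time as seconds since a fixed reference point
date +%s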
This reports a point in time as a number of seconds elapsed since a fixed reference point. Such a reference time point is also referred to as the Epoch; according to the manpage of date, the (default) reference point is the beginning of the year 1970, given as “1970-01-01 00:00 UTC”.
While %s invokes output of the referenced time, the additional specifier %N enforces an accuracy down to nanoseconds. Give date +%s.%N a try and you will see a large number (of seconds) followed by 9 digits after the decimal point.
An accurate stopwatch: date
You can use the construct date +%s.%N on the command line or in a Bash script to save start and end time points as variables.
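A minimal sketch (here with sleep 2 standing in for the command to be timed):
BASH
start=$(date +%s.%N)        # store the start time
sleep 2                     # <- the command(s) to be timed
end=$(date +%s.%N)          # store the end time
echo "$end - $start" | bc   # elapsed time in seconds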
This gives you a stopwatch by setting a start time, running some
command(s), and then storing the end time after
command(s) into a second variable. Differencing the two
times produces the elapsed time. Give this a try with
the sleep command in between.
Part 1: Example for an inefficient job
After warming up with some timing methods, let’s analyze the
efficiency of a small script that makes our computer sweat a bit more
than the sleep command. Have a look at the following Bash
shell 7-liner.
BASH
#!/bin/bash
sum=0
for i in $(seq 1 1000); do
val=`echo "e(2 * l(${i}))" | bc -l`
sum=$(echo "$sum + $val" | bc -l)
done
echo Sum=$sum
Copy-paste this to a file, say sum.bash, and make it
executable via
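BASH
chmod +x sum.bash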
The main part of this shell script consists of a for statement which calculates the sum of all squares \(i^2\) over 1000 iterations; note that seq 1 1000 creates the number sequence (\(i=1,2,3,...,1000\)). Inside the for loop, the bc calculator tool is employed. The
first statement inside the loop (val=...) prints the
expression e(2 * l(${i})), which is bc-talk
for the expression \(i^2\) because of
the relation \(i^x=e^{x\cdot \ln(i)}\),
for example \(e^{2\cdot\ln(3)}=3^2\),
where ln is the natural logarithm. The second statement inside the loop
(sum=...) accumulates the expressions
val=\(i^2\)
into sum, so the output of the final echo line is
the total, \(\sum_{i=1}^{1000}i^2\).
Identify the inefficient pieces
In the above Bash script, the for loop invokes the bc calculator twice during every loop iteration. Compared to
another method to be investigated below, this method is rather slow. Any
idea why that is the case?
Each statement echo … | bc -l spawns a new bc process via a subshell.
The statement echo … | bc -l spawns a new bc process via a subshell. Here, each loop iteration invokes two of those. Each subshell is essentially a separate process and involves a certain startup cost, parsing overhead, and OS-internal inter-process communication. Such overhead will account for most of the total runtime of sum.bash.
The overhead in this shell script is dominated by process creation
and context switching, that is, calling the bc tool so many times. Going back to our air-travel analogy, the summation of 1000 numbers shall be equivalent to having a total of 1000 passengers board a large plane. When total boarding time counts, an inefficient boarding procedure would involve every passenger loading two carry-on pieces. Many of you may have experienced how stuffing an excessive number of baggage pieces into the overhead compartments can slow things down in the plane’s aisles, similar to the overhead due to the 2000 (two for each loop iteration) bc sub-processes that hinder the data stream inside the CPU’s “aisles”.
Let’s pull out our stopwatches
Using either time or date, can you get a runtime measurement for sum.bash?
You can precede any command with time. If you want to use date, remember that now=$(date +%s.%N) lets you store the current time point and && lets you join commands together.
A straightforward way is
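BASH
time ./sum.bash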
Alternatively, date and && can be combined into a wrapper in order to time sum.bash externally:
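BASH
# Record the start time, run the script, record the end time, print the difference
start=$(date +%s.%N) && ./sum.bash && end=$(date +%s.%N) && echo "$end - $start" | bc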
Another option is to place date inside the script sum.bash:
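BASH
#!/bin/bash
start=$(date +%s.%N)   # <- add near the top of sum.bash
# ... the original for loop ...
end=$(date +%s.%N)     # <- add at the bottom
echo "Elapsed: $(echo "$end - $start" | bc) s"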
Speeding things up
A remedy to the inefficiencies we found inside the for loop of sum.bash is to avoid the spawning of many sub-processes caused by repetitively calling bc. In other words, ideally, the many sub-processes conflate into one. In terms of the airplane analogy, we want people to store all their carry-on pieces in a big container whose subsequent loading onto the plane is a single process, as opposed to every passenger running their own sub-process. Collapsing things into one sub-process can be achieved by replacing the external loop by a bc-internal one:
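BASH
# One possible form of the bc-internal loop (the exact expression may differ)
echo "s = 0; for (i = 1; i <= 1000; i++) s = s + i^2; s" | bc -l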
In this method, to be called the one-liner, the loop, arithmetic, and accumulation all happen inside a single bc process and are thus free of the process-spawning overhead. This example shall be a placeholder for a common scenario, where potentially large efficiency gains can be achieved by replacing inefficient math implementations by numerically optimized software libraries.
Evaluate the runtime improvement
Compare the runtimes of the summation script sum.bash versus the one-liner.
The Bash keyword time is sufficient to see the runtime difference.
While it depends a bit on the employed hardware, one will notice that
the one-liner runs roughly 1000 times faster than sum.bash.
Of course, one could live with this inefficiency when it is just needed
once in a while and the script’s overall runtime amounts to just a few
seconds. However, imagine some large-scale computing job that is
supposed to finish within an hour on a supercomputer for which one has
to pay a usage fee on a per-hour basis. If implemented poorly, even a seemingly small overhead factor, say 2, renders this computing job expensive, both in terms of time and money.
The above runtime comparisons merely look at calculation speed, which depends on CPU processing speed. Such a task is thus called CPU-bound. On the other hand, the performance of a memory-bound process is limited by the speed of memory access. This happens when the CPU spends most of its time waiting for data to be fetched from memory (RAM), cache, or storage, causing its execution pipeline to stall. Optimization of memory-bound tasks addresses performance bottlenecks due to data transfer speeds rather than calculation speeds. Finally, when data transfer involves a high percentage of disk or network access, disk/networking speed becomes a limiting factor, rendering a process I/O-bound.
To be precise: Numerical efficiency
Inefficient computing is not limited to being unnecessarily slow. It can also mean that excessive accuracy leads to unnecessary runtime increases. Without going into details, let’s just keep in mind that in computing, accuracy depends on the precision of the numbers that are being processed by the CPU. Precision essentially governs how many digits after the decimal point are accounted for in mathematical operations. The higher the precision, the fewer calculations can be processed within a fixed time. On the other hand, within that same time, the CPU can crank through more low-precision numbers; however, an insufficient precision can render lengthy calculations useless. The optimal degree of precision, in terms of computing efficiency, is application dependent.
Compare numerical results
Our summation implementation via sum.bash exemplifies the
case of an inaccurate calculation. When running the two summation
methods in the previous challenge, have a look at the actual summation
results. Which of the two end results do you think is more accurate and
why? Is the erroneous result smaller or larger and why?
Think of another airplane example. Which scenario is more prone to things getting lost or forgotten? 1) Passengers bring and take their own baggage pieces to the cabin, or 2) Baggage pieces are stored and retrieved collectively.
The method sum.bash, using the external for loop, and the one-liner return the final sums,
respectively,
103075329 # bc, external loop (sum.bash)
333833500 # bc, internal loop (one-liner)
where the first result may vary on your machine. The method sum.bash is affected by the setting of the bc-internal parameter scale, which defines how many digits after the decimal point some operations, here the exponential function e(…) and the logarithm l(…), work with. The default value of scale is 0, which basically leads to truncation after the decimal point, so rounding errors accumulate at every loop iteration. Hence, the final sum drifts downward (by a lot) compared to the second (true) value. Of course, scale can be increased. The manpage of bc actually says that it is “an arbitrary precision calculator language”.
Part 2: About HPC power consumption
The HP (high performance) in HPC refers to the fact that the employed computer hardware is able to do a lot of multitasking, also called parallel computing. Parallel programming essentially exploits the CPU’s multitasking ability. Therefore, a lot of HPC-efficiency aspects revolve around keeping everyone in a CPU’s multitasking team equally busy. We will look at some of those aspects during the course of later episodes.
The more the merrier: CPU/GPU cores
Common parallel-computing jobs employ multiple cores of a CPU, or even multiple CPUs, simultaneously. A core is a processing unit within a CPU that independently executes instructions. These days (as of 2025), typical CPUs are quad-core (4 cores), octa-core (8 cores), and so on. High-end gaming CPUs often have 16+ cores; HPC cluster nodes feature multiple CPUs, oftentimes with 64+ cores each; and all these numbers keep going up.
Nowadays, almost all HPC centers are also equipped with GPU (Graphics
Processing Unit) hardware. Such hardware is optimal for jobs where having many (simpler) cores is more important than having fewer, more powerful cores.
The number of GPU cores varies greatly depending on the model, ranging
from a few hundred in low-end GPU cards to over 16,000 in high-end
ones.
Measuring parallel runtime: core hours
Owing to the inherent parallelism in the HPC world, a common accounting measure takes into account not only the requested runtime but also the number of requested cores. The unit core hour (core-h) represents the usage of one CPU core for one hour and scales with core count. For example, assume you have a monthly allocation of 500 core-h, with a fee incurred when exceeding that quota. With 500 core-h, you could run a one-hour parallel job utilizing 500 CPU cores for free. Or, in the other extreme, if your program does not or cannot multitask, you could run a single-core job for 500 hours, provided you won’t forget at the end what this job was about.
So far, the focus has been on core number and hours for HPC resource allocation. Keep in mind, however, that the HPC resource portfolio involves other hardware components as well:
- Memory: There are (whether parallel or not) jobs, that request a large amount of memory (RAM). For example, some mathematical solution methods for large equation systems do not allow the compartmentalization of the total required memory across CPU cores, that is, many-core processes need to know each other’s memory chunks. HPC centers usually have large-memory nodes assigned for such applications.
- Storage: Other applications process huge amounts of data, think of genomics or climate modelling, which can involve terabytes or even petabytes of data to be stored and analyzed.
A typical HPC computing job
Like in the automotive world, high performance means high power, which in turn involves a high energy demand. Let’s consider a typical parallel scientific-computing job to be run in some HPC center. Our example job shall be deemed too large for one CPU, so it employs multiple CPUs, which in turn are distributed across nodes. Node power usage is measured in W=Watt, which is the SI unit of power and corresponds to the rate of consumption of energy in an electric circuit. One compute node with a 64-core CPU can consume between 300 W in idle state, and 900 W (maximum load) for air-cooled systems, whereas this range is roughly 250-850 W for the slightly more efficient liquid-cooled systems. For comparison, an average coffee maker consumes between 800 W (drip coffee maker) and 1500 W (espresso machine). Our computing job shall then use these resources:
- 12 nodes are crunching numbers in parallel
- 64 cores/node (e.g., Intel® Xeon® 6774P, or AMD® EPYC® 9534)
- 12 hours of full load (realistic for many scientific simulations)
- Power per node: (idle vs. full load):
- Idle: ~300 W
- Full load: ~900 W
- Extra power per node: 600 W
- Total extra energy: 12 nodes × 600 W × 12 hours = 86,400 Wh = 86.4 kWh
How many core hours does this job involve?
HPC centers have different job queues for different kinds of computing jobs. For example, a queue named big-jobs may be reserved for jobs exceeding a total of 1024 parallel processes = tasks. Another queue named big-mem may accommodate tasks with high memory demands by giving access to high-memory nodes (e.g., 512 GB, 1 TB, or more RAM per compute node).
Let’s assume, you have three job queues available, all with identical memory layout:
- small-jobs: Total task count of up to 511.
- medium-jobs: Total task count 512-1023.
- big-jobs: Total task count of 1024 or more.
When submitting the above computing job, in which queue would it end up? And, if there would be a charge of 1 Cent per core-h, what is the total cost in € (1€ = 100 Cents)?
The total number of tasks results from the product cores-per-node \(\times\) nodes. Total core hours is the task count multiplied by the job’s requested time in hours.
The total number of tasks is cores-per-node \(\times\) nodes = \(64\times 12 = 768\), which would put the
job into the medium-jobs queue. The HPC center would bill us
for \(64\times 12\times 12 = 9216\)
core hours, hence €92.16.
What are Watt hours?
The unit Wh (Watt-hours) measures energy, so 86,400 Wh is the energy that a machine drawing 86,400 W (or 86.4 kW, k = kilo) consumes in one hour. Back to coffee, brewing one cup needs 50-100 Wh, depending on preparation time and method. So, running your 12-node HPC job for 12 hours is equivalent to brewing between 864 and 1,728 cups of coffee. For those of us who don’t drink coffee: assuming 100% conversion efficiency from our compute job’s heat to mechanical energy, which is unrealistic, we could lift an average African elephant (~6 tons) about 5,285 meters straight up, not quite to the top but in sight of Mount Kilimanjaro’s (5,895 m) summit.
Note that the focus is on extra power, that is, beyond the CPU’s idle state. Attributing our job’s extra power only to CPU usage underestimates its footprint. In practice, the actual delta from idle to full load will vary based on the load posed on other hardware components. Therefore, it is interesting to shed some light onto those other hardware components that start gearing up after hitting that Enter key which submits the above kind of HPC job.
- CPUs consume power through two main processes:
- Dynamic power consumption: It is caused by the constant switching of transistors and is influenced by the CPU’s clock frequency and voltage.
- Static power consumption: It is caused by small leakage currents even when the CPU is idle. This is a function of the total number of transistors.
Both processes convert electrical energy into heat, which makes CPU cooling so important.
- Memory (DRAM) consumes power primarily through its refresh cycles. These are required to counteract the charge leakage in the data-storing capacitors. Periodic refreshing is necessary to maintain data integrity, which is the main reason why DRAM draws power even when idle. Other power consumption factors include the static power drawn by the memory’s circuitry and the active power used during read/write operations.
- Network interface cards (NICs) consume power by converting digital data into electrical signals for transmission and reception. Power draw increases with data throughput, physical-media complexity, like fiber optics, and also depends on the specific interconnect technology used.
- Storage components: Hard drives (HDDs) require constant energy due to moving mechanical parts, like the disc-spinning motors. SSDs store data electronically via flash memory chips and are thus more power-efficient, especially when idle. However, when performing heavy read/write tasks, SSD power consumption can also be significant, though they complete these tasks faster than HDDs and return to their idle state sooner.
- Cooling is one of the biggest contributors to total energy use in HPC:
- Idle: Cooling uses ~10–20% of total system power.
- Max load: Cooling can consume ~50–70% of total power (depends on liquid- or air-cooled systems).
Cooling is essential because all electrical circuits generate heat during operation. Under heavy computational loads, insufficiently cooled CPUs and GPUs exceed their safe temperature limits.
These considerations hopefully highlight why there is benefit in identifying potential efficiency bottlenecks before submitting an energy-intense HPC job. If all passengers care about efficient job design, i.e., the total baggage load, more can simultaneously jump onto the HPC plane.
- Using a stopwatch like time gives you a first tool to log actual versus expected runtimes; it is also useful for carrying out runtime comparisons.
- Which hardware component (CPU, memory/RAM, disk, network, etc.) poses the limiting factor depends on the nature of a particular application.
- Large-scale computing is power hungry, so we want to use the energy wisely. As shown in the next episodes, you have more control over job efficiency, and thus the overall energy footprint, than you might expect.
- Computing job efficiency goes beyond individual gain in runtime as shared resources are used more effectively, that is, the ratio \(\frac{useful\;work}{total\;energy\;expended}\sim\frac{number\;of\;users}{total\;energy\;expended}\) improves.
So what’s next?
The following episodes will put a number of these introductory thoughts into concrete action by looking at efficiency aspects around a compute-intense graphical program. While it is not directly an action-loaded video game, it does contain essential pieces thereof, because it uses the technique of ray tracing.
Ray tracing is a technique that simulates how light travels in a 3D scene to create realistic images. It simulates the behaviour of light in terms of optical effects like reflection, refraction, shadows, absorption, etc. The underlying calculations involve real-world physics, which makes them computationally expensive - a perfect HPC case.
Here is a basic run script:
#!/usr/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=4
# Load the same module(s) ("module load ...") that were used when building the raytracer program
time mpirun -np 4 raytracer -width=800 -height=800 -spp=128
Check the time output at the end of the job’s output file
(named something like slurm-<NUMBER>.out). You will
notice that user time is by a certain factor larger than
real time.
Why is the user time larger than the real time, and what does it mean?
Any guess which number in the mpirun line corresponds roughly to that factor?
Content from Resource Requirements
Last updated on 2025-12-15
Overview
Questions
- How many resources should I request initially?
- What scheduler options exist to request resources?
- How do I know if they are used well?
- How large is my HPC cluster?
Objectives
After completing this episode, participants should be able to …
- Identify the size of their jobs in relation to the HPC system.
- Request a good amount of resources from the scheduler.
- Change the parameters to see how the execution time changes.
When you run a program on your local workstation or laptop, you typically don’t plan out the usage of computing resources like memory or core-hours. Your applications simply take as much as they need, and if your computer runs out of resources, you can just close a few of them. However, unless you are very rich, you probably don’t have a dedicated HPC cluster just to yourself; instead, you have to share one with your colleagues. In such a scenario, greedily consuming as many resources as possible is very impolite, so we need to restrain ourselves and carefully allocate just as many resources as needed. These resource constraints are then enforced by the cluster’s scheduling system so that you cannot accidentally use more resources than you requested.
Getting a feel for the size of your cluster
To start with your resource planning, it is always a good idea to first get a feeling for the size of the cluster available to you. For example, if your cluster has tens of thousands of CPU cores and you use only 10 of them, you are far away from what would be considered excessive usage of resources. However, if your calculation utilizes GPUs and your cluster has only a handful of them, you should really make sure to use only the minimum amount necessary to get your work done.
Let’s start by getting an overview of the partitions of your cluster:
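BASH
# One possible set of sinfo format options for such an overview; the exact command may differ
sinfo -o "%P %D %c %m %G %l"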
Here is a (simplified) example output for the command above:
PARTITION NODES CPUS MEMORY GRES TIMELIMIT
normal 223 36 95000+ (null) 1-00:00:00
long 90 36 192000 (null) 7-00:00:00
express 6 36 95000+ (null) 2:00:00
zen4 46 192 763758+ (null) 2-00:00:00
gpuexpress 1 32 240000 gpu:rtx2080:7 12:00:00
gpu4090 8 32 360448 gpu:rtx4090:6 7-00:00:00
gpuh200 4 128 1547843 gpu:h200:8 7-00:00:00
In the output, we see the name of each partition, the number of nodes in this partition, the number of CPU cores per node, the amount of memory per node (in megabytes), the number of generic resources (typically GPUs) per node, and finally the maximum amount of time any job is allowed to take.
Compare the resources available in the different partitions of your local cluster. Can you draw conclusions on what the purpose of each partition is based on the resources it contains?
For our example output above we can make some educated guesses on what the partitions are supposed to be used for:
- The normal partition has a (relatively) small amount of memory and limits jobs to at most one day, but has by far the most nodes. This partition is probably designed for small- to medium-sized jobs. Since there are no GRES in this partition, only CPU computations can be performed here. Also, as the number of cores per node is (relatively) small, this partition only allows multithreading up to 36 threads and requires MPI for a higher degree of parallelism.
- The long partition has double the memory compared to the normal partition, but less than half the number of nodes. It also allows for much longer running jobs. This partition is likely intended for jobs that are too big for the normal partition.
- express is a very small partition with a similar configuration to normal, but a very short time limit of only 2 hours. The purpose of this partition is likely testing and very short running jobs like software compilation.
- Unlike the former partitions, zen4 has a lot more cores and memory per node. The intent of this partition is probably to run jobs using large-scale multithreading. The name of the partition implies a certain CPU generation (AMD Zen 4), which appears to be newer than the CPU model used in the normal, long and express partitions (typically core counts increase in newer CPU generations).
- gpuexpress is the first partition that features GPU resources. However, with only a single node and a maximum job duration of 12 hours, this partition seems to be intended again for testing purposes rather than large-scale computations. This also matches the relatively old GPU model.
- In contrast, gpu4090 has more nodes and a much longer walltime of seven days and is thus suitable for actual HPC workloads. Given the low number of CPU cores, this partition is intended for GPU workloads only. More details can be gleaned from the GPU model used in this partition (RTX 4090). This GPU type is typically used for workloads using single-precision floating point calculations.
- Finally, the gpuh200 partition combines a large number of very powerful H200 GPUs with a high core count and a very large amount of memory. This partition seems to be intended for the heaviest workloads that can make use of both CPU and GPU resources. The drawback is the low number of nodes in this partition.
To get a point of reference, you can also compare the total number of cores in the entire cluster to the number of CPU cores on the login node or on your local machine.
BASH
lscpu | grep "CPU(s):"
# If lscpu is not available on your machine, you can also use this command
cat /proc/cpuinfo | grep "core id" | wc -l
As you can see, your cluster likely has multiple orders of magnitude more cores in total than the login node or your local machine. To see the amount of memory on the machine you are logged into you can use
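BASH
# One common way; the exact command used in the lesson may differ
free -h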
Again, the total memory of the cluster is going to be much, much larger than the memory of any individual machine.
All of these cores and all of that memory are shared between you and all the other users of your cluster. To get a feeling for the amount of resources per user, let’s try to get an estimate for how many users there are by counting the number of home directories.
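One possible way to do this, assuming home directories live directly under /home (see the note below for variations):
BASH
find /home -mindepth 1 -maxdepth 1 -type d | wc -l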
On some clusters, home directories are not placed directly in
/home, but are split up into subdirectories first (e.g., by
first letter of the username like /home/s/someuser). In
this case, you have to use -maxdepth 2 -mindepth 2 to count
the contents of these subdirectories. If your cluster does not use
/home for the users’ home directories, you might have to
use a different path (check dirname "$HOME" for a clue).
Also, this command only gives an upper limit to the number of real
cluster users as there might be home directories for service users as
well.
By dividing the total number of cores or the total memory by the number of users, you get an estimate of how many resources each user has available in a perfectly fair world.
Does this mean you can never use more than this amount of resources?
Now that you have an idea of how big your cluster is, you can start to make informed decisions on how many resources are reasonable to ask for.
Challenge
sinfo can show a lot more information on the nodes and
partitions of your cluster. Check out the documentation
and experiment with additional output options. Try to find a single
command that shows, for each partition, the number of allocated and
idle nodes and CPU cores.
BASH
$ sinfo -O Partition,CPUsState,NodeAIOT
PARTITION CPUS(A/I/O/T) NODES(A/I/O/T)
normal* 6336/720/972/8028 196/0/27/223
long 2205/351/684/3240 71/0/19/90
express 44/172/0/216 3/3/0/6
zen4 7532/1108/192/8832 44/1/1/46
gpuexpress 0/32/0/32 0/1/0/1
gpu4090 177/35/44/256 7/0/1/8
gpuh200 90/166/256/512 2/0/2/4
Sizing your jobs
The resources required by your jobs primarily depend on the application you want to run and are thus very specific to your particular HPC use case. While it is tempting to just wildly overestimate the resource requirements of your application to make sure it cannot possibly run out, this is not a good strategy. Not only would you have to face the wrath of your cluster administrators (and the other users!) for being inefficient, but you would also be punished by the scheduler itself: In most cluster configurations, your scheduling priority decreases faster if you request more resources and larger jobs often need to wait longer until a suitable slot becomes free. Thus, if you want to get your calculations done faster, you should request just enough resources for your application to work.
Finding this amount of resources is often a matter of trial and error
as many applications do not have precisely predictable resource
requirements. Let’s try this for our snowman renderer. Put the following
in a file named snowman.job:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=<put your partition here>
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:01:00
#SBATCH --output=snowman-stdout.log
#SBATCH --job-name=snowman
# Always a good idea to purge modules first to start with a clean module environment
module purge
# <put the module load commands for your cluster here>
# Start the raytracer
mpirun -n 4 ./SnowmanRaytracer/build/raytracer -width=1024 -height=1024 -spp=256 -threads=1 -alloc_mode=3 -png=snowman.png
The #SBATCH directives assign our job the following
resources (line-by-line):
- 1 node…
- … from the partition <put your partition here>
- 4 MPI tasks…
- … each of which uses one CPU core (so 4 cores in total)
- 1 GB of memory
- A time limit of 1 minute
The last two #SBATCH directives specify that we want the output of our job to be captured in the file snowman-stdout.log and that the job should appear under the name snowman.
The --mem directive is somewhat unfortunately named as
it does not define the total amount of memory of your job, but the total
amount of memory per node. Here, this distinction does not
matter as we only use one node, but you should keep in mind that
changing the number of nodes often implies that you need to adapt the
--mem value as well. Alternatively, you can also use the
--mem-per-cpu directive such that the memory allocation
automatically scales with the number of cores. However, even in this
case you need to verify that your memory consumption actually scales
linearly with the number of cores for your application!
To test if our estimate works, you have to submit the job to the scheduler:
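BASH
sbatch snowman.job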
This command will also print the ID of the job, so we can observe what is happening with it. Wait a bit and have a look at how your job is doing:
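BASH
# Replace <jobid> with the ID printed by sbatch
sacct -X -j <jobid>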
After a while, you will see that the status of your job is given as
TIMEOUT.
You might wonder what the -X flag does in the
sacct call above. This option instructs Slurm to not output
information on the “job steps” associated with your job. Since we don’t
care about these right now, we set this flag to make the output more
concise.
Check the file snowman-stdout.log as well. Near the
bottom you will see a line like this:
slurmstepd: error: *** JOB 1234567 ON somenode CANCELLED AT 2025-04-01T13:37:00 DUE TO TIME LIMIT ***
Evidently our job was aborted because it did not finish within the time limit of one minute that we set above. Let’s try giving our job a time limit of 10 minutes instead.
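For example, in snowman.job (then resubmit with sbatch):
BASH
#SBATCH --time=00:10:00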
This time the job should succeed and show a status of “COMPLETED” in
sacct. We can check the resources actually needed by our
job with the help of seff:
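BASH
# <jobid> as reported by sbatch or sacct
seff <jobid>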
The output of seff contains many useful bits of
information for sizing our job. In particular, let’s look at these
lines:
[...]
CPU Utilized: 00:21:34
CPU Efficiency: 98.93% of 00:21:48 core-walltime
Job Wall-clock time: 00:05:27
Memory Utilized: 367.28 MB
Memory Efficiency: 35.87% of 1.00 GB
The exact numbers here depend a lot on the hardware and software of your local cluster.
The Job Wall-clock time is the time our job took. As we
can see, our job takes much longer than one minute to complete which is
why our first attempt with a time limit of one minute has failed.
The CPU Utilized line shows us how much CPU time our job
has used. This is calculated by determining the busy time for each core
and then summing these times for all cores. In an ideal world, the CPU
cores should be busy for the entire time of our job, so the CPU time
should be equal to the time the job took times the number of CPU cores.
The ratio between the real CPU time and the ideal CPU time is shown in
the CPU Efficiency line.
Finally, the Memory Utilized line shows the peak memory
consumption that your job had at any point during its runtime, while
Memory Efficiency is the ratio between this peak value and
the requested amount of memory for the allocation. As we will see later,
this value has to be taken with a grain of salt.
Starting from the set of parameters that successfully run our
program, we can now try to reduce the amount of requested resources. As
is good scientific practice, we should only vary one parameter at a time
and observe the result. Let’s start by reducing the time limit. There is
often a bit of jitter in the time needed to run a job since not all
nodes are perfectly identical, so you should add a safety margin of 10 to 20 percent.
According to the time reported by seff, seven minutes
should therefore be a good time limit. If your cluster is faster, you
might reduce this even further.
As you can see, your job will still complete successfully.
Next, we can optimize our memory allocation. According to SLURM, we used 367.28 MB of memory in our last run, so let’s set the memory limit to 500 MB.
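That is, in snowman.job:
BASH
#SBATCH --mem=500M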
After submitting the job with the lowered memory allocation
everything seems fine for a while. But then, right at the end of the
computation, our job will crash. Checking the job status with
sacct will reveal that the job status is
OUT_OF_MEMORY meaning that our job exceeded its memory
limit and was terminated by the scheduler.
This behavior seems contradictory at first: SLURM reported previously that our job only used around 367 MB of memory at most, which is well below the 500 MB limit we set. The explanation for this discrepancy lies in the fact that SLURM measures the peak memory consumption of jobs by polling, i.e., by periodically sampling how much memory the job currently uses. Unfortunately, if the program has spikes in memory consumption that are small enough to fit between two samples, SLURM will miss them and report an incorrect peak memory value. Spikes in memory usage are quite common, for example if your application uses short-lived subprocesses. Most annoyingly, many programs allocate a large chunk of memory right at the end of the computation to write out the results. In the case of the snowman raytracer, we encode the raw pixel data into a PNG at the end, which means we temporarily keep both the raw image and the PNG data in memory.
SLURM determines memory consumption by polling, i.e., periodically checking on the memory consumption of your job. If your job
has a memory allocation profile with short spikes in memory usage, the
value reported by seff can be incorrect. In particular, if
the job gets cancelled due to memory exhaustion, you should not rely on
the value reported by seff as it is likely significantly
too low.
So how big is the peak memory consumption of our process really? Luckily, the Linux kernel keeps track of this for us, if SLURM is configured to use the so-called “cgroups v2” mechanism to enforce resource limits (which many HPC systems are). Let’s use this system to find out how much memory the raytracer actually needs. First, we set the memory limit back to 1 GB, i.e., to a configuration that is known to work.
Next, add these lines at the end of your job script:
BASH
echo -n "Total amount of memory used (in bytes): "
cat /sys/fs/cgroup/$(cat /proc/self/cgroup | awk -F ':' '{print $3}')/memory.peak
Let’s break down what these lines do:
- The first line prints out a nice label for our peak memory output. We use -n to omit the usual newline that echo adds at the end of its output.
- The second line outputs the contents of a file (cat). The path of this file starts with /sys/fs/cgroup, which is a location where the Linux kernel exports all the cgroups v2 information as files.
- For the next part of the path we need the so-called “cgroup path” of our job. To find out this path, we can use the /proc/self/cgroup file, which contains this path as the third entry of a colon-separated list. Therefore, we read the contents of this file (cat) and extract the third entry of the colon-separated list (awk -F ':' '{print $3}'). Since we do this in $(...), Bash will place the output of these commands (i.e., the cgroup path) at this point.
- The final part of the path is the information we actually want from the cgroup. In our case, we are interested in memory.peak, which contains the peak memory consumption of the cgroup.
When you submit your job and look at the output once it finishes, you will find a line like this:
[...]
Total amount of memory used (in bytes): 579346432
[...]
So even though SLURM reported our job to only use 367.28 MB of memory, we actually used nearly 600 MB! With this measurement we can make an informed decision on how to set the memory limit for our job:
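BASH
# A value with some headroom above the measured ~580 MB; the seff output later
# in this episode suggests 750 MB, but the exact choice is a judgment call
#SBATCH --mem=750M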
Run your job again with this limit to verify that it completes successfully.
So far we have tuned the time and memory limits of our job. Now let
us have a look at the CPU core limit. This limit works slightly
differently than the ones we looked at so far in the sense that your job
is not getting terminated if you try to use more cores than you have
allocated. Instead, the scheduler exploits the fact that multitasking
operating systems can switch out the process a given CPU core is working
on. If you have more active processes in your job than you have CPU
cores (i.e., CPU oversubscription), the operating system will
simply switch processes in and out while trying to ensure that each
process gets an equal amount of CPU time. This happens very fast, so you
can’t see the switching directly, but tools like htop will
show your processes running at less than 100% CPU utilization. Below you
can see a situation of four processes running on three CPU cores, which
results in each process running only 75% of the time.
CPU oversubscription can even be harmful to performance as all the switching between processes by the operating system can cost a non-trivial amount of CPU time itself.
Let’s try reducing the number of cores we allocate by reducing the number of MPI tasks we request in our job script:
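BASH
# For example, request only 2 tasks while still launching 4 MPI processes
# (the exact reduced value may differ)
#SBATCH --ntasks=2
# [...]
mpirun -n 4 ./SnowmanRaytracer/build/raytracer -width=1024 -height=1024 -spp=256 -threads=1 -alloc_mode=3 -png=snowman.png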
Now we have a mismatch between the number of tasks we request and the
number of tasks we use in mpirun. However, MPI catches our
folly and prevents us from accidentally oversubscribing our CPU cores.
In the output file, you will see the full explanation:
There are not enough slots available in the system to satisfy the 4
slots that were requested by the application:
./SnowmanRaytracer/build/raytracer
Either request fewer procs for your application, or make more slots
available for use.
A "slot" is the PRRTE term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which PRRTE processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, PRRTE defaults to the number of processor cores
In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
If we actually want to see oversubscription in action, we need to switch from MPI to multithreading. First, let us try without oversubscribing the CPU cores:
BASH
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# [...]
./SnowmanRaytracer/build/raytracer -width=1024 -height=1024 -spp=256 -threads=4 -alloc_mode=3 -png=snowman.png
This works and if we look at the output of seff again we
get a baseline for our multithreaded job
[...]
CPU Utilized: 00:21:32
CPU Efficiency: 99.08% of 00:21:44 core-walltime
Job Wall-clock time: 00:05:26
Memory Utilized: 90.85 MB
Memory Efficiency: 12.11% of 750.00 MB
Challenge
Compare our measurements for 4 threads here to the measurements we made for doing the computation with 4 MPI tasks earlier. What metrics are similar and which ones are different? Do you have an explanation for this?
We can see that the CPU utilization time and the walltime are virtually identical to the MPI version of our job, while the memory utilization is much lower. The exact reasons for this will be discussed in the following episodes, but here is the gist of it:
- Our job is strongly compute-bound, i.e., the time our job takes is mostly determined by how fast the CPU can do its calculations. This is why it does not matter much for CPU utilization whether we use MPI or threads as long as both can keep the same number of CPU cores busy.
- MPI typically incurs an overhead in CPU usage and memory due to the need to communicate between the tasks (in comparison, threads can just share a block of memory without communication). In our raytracer, this overhead for CPU usage is negligible (hence the same CPU utilization time metrics), but there is a significant memory overhead.
Now let’s see what happens when we oversubscribe our CPU by doubling the number of threads without increasing the number of allocated cores in our job script:
BASH
./SnowmanRaytracer/build/raytracer -width=1024 -height=1024 -spp=256 -threads=8 -alloc_mode=3 -png=snowman.png
Challenge
If you cluster allows direct access to the compute nodes, try logging into the node your job is running on and watch the CPU utilization live using
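BASH
htop -u $USER   # filter the process list to your own user; plain htop works as well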
(Note: Sometimes htop hides threads to make the process
list easier to read. This option can be changed by pressing F2, using
the arrow keys to navigate to the “Hide userland process threads” setting,
toggling with the return key and then applying the change with F10.)
Compare the CPU utilization of the raytracer threads
with different total numbers of threads.
In the top right of htop you can also see a metric
called load average. Simplified, this is the number of
processes / threads that are currently either running or could run if a
CPU core was free. Compare the amount of load you generate with your job
depending on the number of threads.
You can see that the CPU utilization of each raytracer
thread goes down as the number of threads increases. This means each thread is only active for a fraction of the total compute time as the
operating system switches between threads.
For the load metric, you can see that the load increases linearly with the number of threads regardless if they are actually running or waiting for a CPU core. Load is a fairly common metric to be monitored by cluster administrators, so if you cause excessive load by CPU oversubscription you will probably hear from your local admin.
Despite using twice the amount of threads, we barely see any
difference in the output of seff:
CPU Utilized: 00:21:29
CPU Efficiency: 98.85% of 00:21:44 core-walltime
Job Wall-clock time: 00:05:26
Memory Utilized: 93.32 MB
Memory Efficiency: 12.44% of 750.00 MB
This shows that despite having more threads, the CPU cores are not performing more work. Instead, the operating system periodically rotates the threads running on each allocated core, making sure every thread gets a time slice to make progress.
Let’s see what happens when we increase the thread count to extreme levels:
BASH
./SnowmanRaytracer/build/raytracer -width=1024 -height=1024 -spp=256 -threads=1024 -alloc_mode=3 -png=snowman.png
With this setting, seff yields
CPU Utilized: 00:26:45
CPU Efficiency: 99.07% of 00:27:00 core-walltime
Job Wall-clock time: 00:06:45
Memory Utilized: 113.29 MB
Memory Efficiency: 15.11% of 750.00 MB
As we can see, our job is actually getting slowed down by all the switching between threads. This means that for our raytracer application, CPU oversubscription is either pointless or actively harmful to performance.
If CPU oversubscription is so bad, then why do most operating systems default to this behavior?
In this case we have a CPU-bound application, i.e., the work done by the CPU is the limiting factor and thus dividing this work into smaller chunks does not help with performance. However, there are also applications bound by other resources. For these applications it makes sense to assign the CPU core elsewhere while the process is waiting, e.g., on a storage medium. Also, in most systems it is desirable to have more programs running than your computer has CPU cores since often only a few of them are active at the same time.
Multi-node jobs
So far, we have only used a single node for our job. The big advantage of MPI as a parallelism scheme is the fact that not all MPI tasks need to run on the same node. Let’s try this with our Snowman raytracer example:
BASH
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --partition=<put your partition here>
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=700M
#SBATCH --time=00:07:00
#SBATCH --output=snowman-stdout.log
#SBATCH --job-name=snowman
# Always a good idea to purge modules first to start with a clean module environment
module purge
# <put the module load commands for your cluster here>
mpirun -- ./SnowmanRaytracer/build/raytracer -width=1024 -height=1024 -spp=256 -threads=1 -alloc_mode=3 -png=snowman.png
echo -n "Total amount of memory used (in bytes): "
cat /sys/fs/cgroup$(cat /proc/self/cgroup | awk -F ':' '{print $3}')/memory.peak
The important change here compared to the MPI jobs before is the
--nodes=2 directive, which instructs Slurm to distribute
the 4 tasks across exactly two nodes.
You can also leave the decision of how many nodes to use up to Slurm by specifying a minimum and a maximum number of nodes, e.g.,
--nodes=1-3
would mean that Slurm can assign your job either one, two or three nodes.
Let’s look at the seff report of our job once again:
[...]
Nodes: 2
Cores per node: 2
CPU Utilized: 00:21:32
CPU Efficiency: 98.78% of 00:21:48 core-walltime
Job Wall-clock time: 00:05:27
Memory Utilized: 280.80 MB
Memory Efficiency: 20.06% of 1.37 GB
We can see that Slurm did indeed split up the job such that each of the two nodes is running two tasks. We can also see that the walltime and CPU time of our job are basically the same as before. Considering the fact that communication between nodes is usually much slower than communication within a node, this result is surprising at first. However, we can find an explanation in the way our raytracer works. Most of the compute time is spent on tracing light rays through the scene for each pixel. Since these light rays are independent from one another, there is no need to communicate between the MPI tasks. Only at the very end, when the final image is assembled from the samples calculated by each task, there is some MPI communication happening. The overall communication overhead is therefore vanishingly small.
How well your program scales as you increase the number of nodes depends strongly on the amount of communication in your program.
We can also look at the memory consumption:
[...]
Total amount of memory used (in bytes): 464834560
[...]
As we can see, there was indeed less memory consumed on the node running our submit script compared to before (470 MB vs 580 MB). However, our method of measuring peak memory consumption does not tell us about the memory consumption of the second node and we have to use slightly more sophisticated tooling to find out how much memory we actually use.
In the course material is a directory
mpi-cgroups-memory-report that can help us out here, but we
need to compile it first:
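BASH
# The exact build steps depend on the course material; if it ships a Makefile:
cd mpi-cgroups-memory-report
make
# Otherwise, assuming a single source file (file name is a guess):
# mpicc -shared -fPIC -o mpi-mem-report.so mpi-mem-report.c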
Make sure you have a working MPI C Compiler (check with
which mpicc). It is part of the same modules that you need
to run the example raytracer application.
The memory reporting tool works by hooking itself into the
MPI_Finalize function that needs to be called at the very
end of every MPI program. Then, it does basically the same thing as we
did in the script before and checks the memory.peak value
from cgroups v2. To apply the hook to a program, you need to add the
path to the mpi-mem-report.so file we just created to the
environment variable LD_PRELOAD:
BASH
LD_PRELOAD=$(pwd)/mpi-cgroups-memory-report/mpi-mem-report.so mpirun -- ./SnowmanRaytracer/build/raytracer -width=1024 -height=1024 -spp=256 -threads=1 -alloc_mode=3 -png=snowman.png
After submitting this job and waiting for it to complete, we can check the output log:
[...]
[MPI Memory Reporting Hook]: Node r05n10 has used 464564224 bytes of memory (peak value)
[MPI Memory Reporting Hook]: Node r07n04 has used 151105536 bytes of memory (peak value)
[...]
The memory consumption of the first node matches our previous result, but we can now also see the memory consumption of the second node. Compared to the first node the second node uses much less memory, however in total both nodes use slightly more memory than running all four tasks on a single node (610 MB vs 580 MB). This memory imbalance between the nodes is an interesting observation that we should keep in mind when it comes to estimating how much memory we need per node.
Tips for job submission
To end this lesson, we discuss some tips for choosing resource allocations such that your jobs get scheduled more quickly.
- Many clusters have activated the so-called backfill scheduler option in Slurm. This mechanism tries to squeeze low priority jobs in the gaps between jobs of higher priority (as long as the larger jobs are not delayed by this). In this case, smaller jobs are generally advantageous as they can “skip ahead” in the queue and start early.
- Using sinfo -t idle you can specifically search for partitions that have idle nodes. Consider using these partitions for your job if possible as an idle node will typically start your job immediately.
- Different partitions might have different billing weights, i.e., they might use different factors to determine the “cost” of your calculation, which is subtracted from your compute budget or fairshare score. You can check these weights using scontrol show partition <partitionname> | grep TRESBillingWeights. The idea behind different billing weights is to even out the cost of the different resources (i.e., how many hours of memory use correspond to one hour of CPU use) and to ensure that using more expensive hardware carries an appropriate cost for the users.
- Typically, it takes longer for a large slot to free up than it takes for several small slots to open. Splitting your job across multiple nodes might not be the most computationally efficient way to run it due to the possible communication overhead, but it can be more efficient in terms of scheduling.
- Slurm produces an estimate on when your job will be started which you can check with scontrol show job <jobid> | grep StartTime.
- Your cluster might seem to have an enormous amount of computing resources, but these resources are a shared good. You should only use as much as you need.
- Resource requests are a promise to the scheduler to not use more
than a specific amount of resources. If you break your promise to the
scheduler and try to use more resources, terrible things will happen.
- Overstepping memory or time allocations will result in your job being terminated.
- Oversubscribing CPU cores will at best do nothing and at worst diminish performance.
- Finding the minimal resource requirements takes a bit of trial and error. Slurm collects a lot of useful metrics to aid you in this.
Content from Scheduler Tools
Last updated on 2025-11-11
Overview
Questions
- What can the scheduler tell about job performance?
- What’s the meaning of collected metrics?
Objectives
After completing this episode, participants should be able to …
- Explain basic performance metrics.
- Use tools provided by the scheduler to collect basic performance metrics of their jobs.
Scheduler Tools
A scheduler performs important tasks such as accepting and scheduling jobs, monitoring job status, starting user applications, cleaning up jobs that have finished or exceeded their allocated time. The scheduler also keeps a history of jobs that have been run and how they behaved. The information that is collected can be queried by the job owner to learn about how the job utilized the resources it was given.
The seff tool
The seff command can be used to learn about how
efficiently your job has run. The seff command takes the
job identifier as an argument to select which job it displays
information about. That means we need to run a job first to get a job
identifier we can query SLURM about. Then we can ask about the
efficiency of the job.
seff may not be available
seff is an optional SLURM tool for more convenient access to sacct. It does not come standard with every SLURM installation. Your particular HPC system may or may not provide it. Check for its availability on your login nodes, or consult your cluster
documentation or support staff.
Other third party alternatives, e.g. reportseff, can be installed with default user permissions.
The sbatch command is used to submit a job. It takes a
job script as an argument. The job script contains the resource
requests, such as the amount of time needed for the calculation, the
number of nodes, the number of tasks per node, and so on. It also
contains the commands to execute the calculations.
Using your favorite editor, create the job script
render_snowman.sbatch with the contents below.
#!/usr/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=4
# Possibly a "module load ..." command to load required libraries
# Depends on your particular HPC system
mpirun -np 4 raytracer -width=800 -height=800 -spp=128 -alloc_mode=3
Next submit the job with sbatch, and see what
seff says about the job with the following commands.
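The commands could look like this; the job ID printed by sbatch will differ on your system, here we use the one from the output below:
BASH
sbatch render_snowman.sbatch
# Once the job has finished, ask seff about it (replace the ID with your own)
seff 309489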
OUTPUT
Job ID: 309489
Cluster: bigiron
User/Group: usr123/grp123
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:07:43
CPU Efficiency: 98.93% of 00:07:48 core-walltime
Job Wall-clock time: 00:01:57
Memory Utilized: 35.75 MB
Memory Efficiency: 0.20% of 17.58 GB (4.39 GB/core)
The job script we created asks for 4 CPUs for an hour. After
submitting the job script we need to wait until the job has finished as
seff can only report sensible statistics after the job is
completed. The report from seff shows basic statistics
about the job, such as
- The resources the job was given
- the number of nodes
- the number of cores per node
- the amount of memory per core
- The amount of resources used
- CPU Utilized: the aggregate CPU time the job actually consumed across all allocated cores
- CPU Efficiency: the CPU time used as a percentage of the core-walltime, i.e. the elapsed time multiplied by the number of allocated cores
- Job Wall-clock time: the time the job took from start to finish
- Memory Utilized: the aggregate memory usage
- Memory Efficiency: the actual memory usage as a percentage of the total available memory
Looking at the Job Wall-clock time, we see that the job took just under 2 minutes, far less than the one hour we asked for. This can be problematic, as the scheduler looks for time windows into which it can fit a job. Long-running jobs cannot be squeezed in as easily as short-running jobs. As a result, jobs that request a long time to complete typically have to wait longer before they can be started. Asking for more than 10 times as much time as the job really needs therefore simply means that you will have to wait longer for the job to start. On the other hand, you do not want to ask for too little time. Few things are more annoying than waiting for a long-running calculation to finish, just to see the job being killed right before the end because it would have needed a couple of minutes more than you asked for. So the best approach is to ask for more time than the job needs, but not to go overboard. As the job’s elapsed time depends on many machine conditions, including congestion in the data communication, disk access, operating system jitter, and so on, you might want to include a substantial buffer. Nevertheless, asking for more than twice as much time as the job is expected to need usually doesn’t make sense.
Another thing is that SLURM by default reserves a certain amount of
memory per core. In this case the actual memory usage is just a fraction
of that amount. We could reduce the memory allocation by explicitly
asking for less by modifying the render_snowman.sbatch job
script.
Challenge
Edit the batch file to reduce the amount of memory requested for the
job. Note that the amount of memory per node can be requested with the
--mem= argument. The amount of memory is specified by a
number followed by a unit. The units can represent kilobytes (KB),
megabytes (MB), gigabytes (GB). For the calculations we are doing here
100 megabytes per node is more than sufficient. Submit the job, and
inspect the efficiency with seff. What is the memory usage
efficiency you get?
The batch file after adding the memory request becomes.
#!/usr/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=4
#SBATCH --mem=100MB
# Possibly a "module load ..." command to load required libraries
# Depends on your particular HPC system
mpirun -np 4 raytracer -width=800 -height=800 -spp=128 -alloc_mode=3
Submit this jobscript, as before, with the following command.
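As before, the submission and the subsequent seff query might look like this (the job ID below is taken from the example output; yours will differ):
BASH
sbatch render_snowman.sbatch
seff 310002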
OUTPUT
Job ID: 310002
Cluster: bigiron
User/Group: usr123/grp123
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:07:43
CPU Efficiency: 98.09% of 00:07:52 core-walltime
Job Wall-clock time: 00:01:58
Memory Utilized: 50.35 MB
Memory Efficiency: 50.35% of 100.00 MB (100.00 MB/node)
The output of seff shows that about 50% of requested
memory was used.
Now we see that a much larger fraction of the allocated memory has been used. Normally you would not worry too much about the memory request. However, HPC clusters are increasingly used for machine-learning workloads, which tend to require a lot of memory. Their memory requirements per core might be so large that they cannot use all the cores in a node, leaving spare cores available for jobs that need little memory. In such a scenario, tightening up your memory allocation could allow the scheduler to start your job early. How much mileage you get from this depends on the job mix at the HPC site where you run your calculations.
Note that the CPU utilization is reported as almost 100%, but this
just means that the CPU was busy with your job 100% of the time. It does
not mean that this time was well spent. For example, every parallel
program has some serial parts to the code. Typically those parts are
executed redundantly on all cores, which is wasteful but not reflected
in the CPU efficiency. Also, this number does not reflect how well the
capabilities of the CPU are used. If your CPU offers vector
instructions, for example, but your code does not use them, then your code will simply run slowly. The CPU efficiency will still show that the CPU
was busy 100% of the time even though the program is just running at a
fraction of the speed it could achieve if it fully exploited the
hardware capabilities. It is worth keeping these limitations of
seff in mind.
Good utilization does not imply efficiency
Measuring close to 100% CPU utilization does not say anything about how useful the calculations are. It merely states that the CPU was mostly busy with calculations instead of waiting for data or idling while waiting for other conditions to occur.
High CPU utilization is only efficient if the CPU spends its time on “useful” calculations that contribute new results towards the intended goal.
The seff command cannot give you any information about
the I/O performance of your job. You have to use other approaches for
that, and sacct may be one of them.
The sacct tool
The sacct command shows data stored in the job
accounting database. You can query the data of any of your previously
run jobs. Just like with seff you will need to provide the
job ID to query the accounting database. Rather than keeping track of
all your jobs yourself you can ask sacct to provide you
with an overview of the jobs you have run.
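Called without any options, sacct lists the jobs you ran today:
BASH
sacct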
OUTPUT
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
309902 render_sn+ STD-s-96h project_a 4 COMPLETED 0:0
309902.batch batch project_a 4 COMPLETED 0:0
309902.exte+ extern project_a 4 COMPLETED 0:0
309903 render_sn+ STD-s-96h project_a 4 COMPLETED 0:0
309903.batch batch project_a 4 COMPLETED 0:0
309903.exte+ extern project_a 4 COMPLETED 0:0
310002 render_sn+ STD-s-96h project_a 4 COMPLETED 0:0
310002.batch batch project_a 4 COMPLETED 0:0
310002.exte+ extern project_a 4 COMPLETED 0:0
In the output, every job is shown three times. This is because
sacct lists one line for the primary job entry, followed by
a line for every job step. A job step corresponds to an
mpirun or srun command. The
extern line corresponds to all work that is done outside of
SLURM’s control, for example an ssh command that runs
something somewhere else.
Note that by default sacct only lists the jobs that have
been run today. You can use the --starttime option to list
all jobs that have been run since the given start date. For example, try
running
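One possible invocation, using the date that is referenced below:
BASH
sacct --starttime=2025-09-25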
OUTPUT
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
308755 snowman.s+ STD-s-96h project_a 16 COMPLETED 0:0
308755.batch batch project_a 16 COMPLETED 0:0
308755.exte+ extern project_a 16 COMPLETED 0:0
308756 snowman.s+ STD-s-96h project_a 4 COMPLETED 0:0
308756.batch batch project_a 4 COMPLETED 0:0
308756.exte+ extern project_a 4 COMPLETED 0:0
309486 interacti+ STD-s-96h project_a 4 FAILED 1:0
309486.exte+ extern project_a 4 COMPLETED 0:0
309486.0 prted project_a 4 COMPLETED 0:0
309489 render_sn+ STD-s-96h project_a 4 COMPLETED 0:0
309489.batch batch project_a 4 COMPLETED 0:0
309489.exte+ extern project_a 4 COMPLETED 0:0
309902 render_sn+ STD-s-96h project_a 4 COMPLETED 0:0
309902.batch batch project_a 4 COMPLETED 0:0
309902.exte+ extern project_a 4 COMPLETED 0:0
309903 render_sn+ STD-s-96h project_a 4 COMPLETED 0:0
309903.batch batch project_a 4 COMPLETED 0:0
309903.exte+ extern project_a 4 COMPLETED 0:0
310002 render_sn+ STD-s-96h project_a 4 COMPLETED 0:0
310002.batch batch project_a 4 COMPLETED 0:0
310002.exte+ extern project_a 4 COMPLETED 0:0
You may want to change the date of 2025-09-25 to
something more sensible when you work through this tutorial. Note that
some HPC systems may limit the range of such a request to a maximum of,
for example, 30 days to avoid overloading the Slurm database with too
large requests.
With the job ID you can ask sacct for information about
a specific job as in
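For instance, for the job from the previous section:
BASH
sacct --jobs=310002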
OUTPUT
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
310002 render_sn+ STD-s-96h project_a 4 COMPLETED 0:0
310002.batch batch project_a 4 COMPLETED 0:0
310002.exte+ extern project_a 4 COMPLETED 0:0
Using sacct with the --jobs flag is just
another way to select which jobs we want more information about. In
itself it does not provide any additional information. To get more
specific data we need to explicitly ask for the information we want. As
SLURM collects a broad range of data about every job, it is worth evaluating which items are the most relevant.
- MaxRSS, AveRSS: the Maximum or Average Resident Set Size (RSS). The RSS is the memory allocated by a program that is actually resident in the main memory of the computer. If the computer runs low on memory, the virtual memory manager can extend the apparently available memory by moving some of the data from memory to disk. This is done entirely transparently to the application, but the data that has been moved to disk is no longer resident in main memory. As a result, accessing it will be slower because it needs to be retrieved from disk first. Therefore, if the RSS is small compared to the total amount of memory the program uses, this might affect the performance of the program.
- MaxPages, AvePages: the Maximum or Average number of Page Faults. These quantities are related to the Resident Set Size. When the program tries to access data that is not resident in main memory, this triggers a page fault. The virtual memory manager responds to a page fault by retrieving the accessed data from disk (and potentially migrating other data to disk to make space). These operations are typically costly. Therefore, high numbers of page faults typically correspond to a significant reduction in the program’s performance. For example, the CPU utilization might drop from as high as 98% to as low as 2% due to page faults. For that reason some HPC machines are configured to kill your job if the application generates a high rate of page faults.
- AllocCPUS: the number of CPUs allocated for the job.
- Elapsed: the amount of wall-clock time it took to complete the job, i.e. the amount of time that passed between the start and finish of the job.
- MaxDiskRead: the maximum amount of data read from disk.
- MaxDiskWrite: the maximum amount of data written to disk.
- ConsumedEnergy: the amount of energy consumed by the job, if that information was collected. The data may not be collected on your particular HPC system, in which case it is reported as 0.
- AveCPUFreq: the average CPU frequency of all tasks in a job, given in kHz. In general, the higher the clock frequency of the processor, the faster the calculation runs. The exception is if the application is limited by memory bandwidth and the data cannot be moved to the processor fast enough to keep it busy. In that case, modern hardware might throttle the frequency. This saves energy, as the power consumption scales roughly linearly with the clock frequency, but doesn’t slow the calculation down, as the processor was having to wait for data anyway.
We can explicitly select the data elements that we are interested in. To see how long the job took to complete run
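For the example job this could be:
BASH
sacct --jobs=310002 --format=Elapsed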
OUTPUT
Elapsed
----------
00:01:58
00:01:58
00:01:58
Challenge
Request information regarding all of the above variables from
sacct, including JobID. Note that the
--format flag takes a comma separated list. Also note that
the result shows that more data is read than written, even though the program generates and writes an image and reads no data at all. Why
would that be?
To query all of the above variables, run
BASH
sacct --jobs=310002 --format=MaxRSS,AveRSS,MaxPages,AvePages,AllocCPUS,Elapsed,MaxDiskRead,MaxDiskWrite,ConsumedEnergy,AveCPUFreq
OUTPUT
MaxRSS AveRSS MaxPages AvePages AllocCPUS Elapsed MaxDiskRead MaxDiskWrite ConsumedEnergy AveCPUFreq
---------- ---------- -------- ---------- ---------- ---------- ------------ ------------ -------------- ----------
4 00:01:58 0
51556K 51556K 132 132 4 00:01:58 6.91M 0.72M 0 3M
0 0 0 0 4 00:01:58 0.01M 0.00M 0 3M
Although the program we have run generates an image and writes it to a file, a non-zero amount of data is also read. The writing is associated with the image file the program produces. The reading is not associated with anything the program does, as it doesn’t read anything from disk. It is instead due to the operating system having to read the program itself and its dependencies in order to execute it.
Shortcomings
While seff and sacct provide a lot of
information it is still incomplete. For example, the information is
accumulated for the entire calculation. Variations in the metrics as a
function of time throughout the job are not available. Communication
between different MPI processes is not recorded. The collection of the
energy consumption depends on the hardware and system configuration at
the HPC center and might not be available. We are also often missing
reliable measurements for I/O via the interconnect between nodes and the
parallel file system.
So while we might be able to glean some indications for different types of performance problems, for a proper analysis more detailed information is needed.
Summary
This episode introduced the SLURM tools seff and
sacct to get a high level perspective on a job’s
performance. As these tools just use the statistics that SLURM collected
on a job as it ran, they can always be used without any special
preparation.
Challenge
So far we have considered our initial calculation using 4 cores. To
run the calculation faster we could consider using more cores. Run the
same calculation on 8, 16, and 32 cores as well. Collect and compare the
results from sacct and see how the job performance
changes.
The machine these calculations have been run on has 112 cores per node. So we can double the number of cores from 4 up to 64 and stay within one node. If we go to two nodes, then some of the communication between tasks will have to go across the interconnect. At that point the performance characteristics might change in a discontinuous manner. Hence we try to avoid doing that.
Alternatively you might scale the calculation across multiple nodes, for example 2, 4, 8, 16 nodes. With 112 cores per node you would have to make sure that the calculation is large enough for such a large number of cores to make sense.
Create running_snowmen.sh with
#!/usr/bin/bash
for nn in 4 8 16 32; do
id=`sbatch --parsable --time=00:12:00 --nodes=1 --tasks-per-node=$nn --ntasks-per-core=1 render_snowman.sh`
echo "ntasks $nn jobid $id"
done
Create render_snowman.sh with
#!/usr/bin/bash
# Possibly a "module load ..." command to load required libraries
# Depends on your particular HPC system
export START=`pwd`
# Create a sub-directory for this job if it doesn't exist already
mkdir -p $START/test.$SLURM_NTASKS
cd $START/test.$SLURM_NTASKS
# The -spp flag ensures we have enough samples per ray such that the job
# on 32 cores takes longer than 30s. Slurm by default is configured such
# that job data is collected every 30s. If the job finishes in less than
# that Slurm might fail to collect some of the data about the job.
mpirun -np $SLURM_NTASKS raytracer -width=800 -height=800 -spp 1024 -threads=1 -alloc_mode=3 -png=rendered_snowman.png
Next we submit this whole set of calculations.
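One way to launch them, assuming both scripts were saved in the current working directory:
BASH
# Submits one job per task count via the helper script
bash running_snowmen.sh
This produces output along the lines of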
OUTPUT
ntasks 4 jobid 349291
ntasks 8 jobid 349292
ntasks 16 jobid 349293
ntasks 32 jobid 349294
After the jobs are completed we can run
BASH
sacct --jobs=349291,349292,349293,349294 \
--format=MaxRSS,AveRSS,MaxPages,AvePages,AllocCPUS,Elapsed,MaxDiskRead,MaxDiskWrite,ConsumedEnergy,AveCPUFreq
to produce
OUTPUT
MaxRSS AveRSS MaxPages AvePages AllocCPUS Elapsed MaxDiskRead MaxDiskWrite ConsumedEnergy AveCPUFreq
---------- ---------- -------- ---------- ---------- ---------- ------------ ------------ -------------- ----------
4 00:09:35 0
142676K 142676K 1 1 4 00:09:35 7.75M 0.72M 0 743K
0 0 0 0 4 00:09:35 0.01M 0.00M 0 2.61M
8 00:05:01 0
289024K 289024K 0 0 8 00:05:01 10.15M 1.45M 0 960K
0 0 0 0 8 00:05:02 0.01M 0.00M 0 2.42M
16 00:02:21 0
563972K 563972K 93 93 16 00:02:21 15.00M 2.94M 0 1.03M
0 0 0 0 16 00:02:21 0.01M 0.00M 0 2.99M
32 00:01:14 0
1082540K 1082540K 260 260 32 00:01:14 24.83M 6.07M 0 1.08M
0 0 0 0 32 00:01:14 0.01M 0.00M 0 3M
Note that the elapsed time goes down as the number of cores increases,
which is reasonable as more cores normally can get the job done quicker.
The amount of data read also increases as every MPI rank has to read the
executable and all associated shared libraries. The volume of data
written is harder to understand. Every run produces an image file
rendered_snowman.png that is about 100KB in size. This file
is written just by the root MPI rank. This cannot explain the increase
in data written with increasing numbers of cores. The increasing number
of page faults with increasing numbers of cores suggests that paging
memory to disk is responsible for the majority of data written.
- Schedulers provide tools for a high-level view on our jobs, e.g. sacct and seff
- Important basic performance metrics we can gather this way are:
  - CPU utilization, often as the fraction of time the CPU was active over the elapsed time of the job
  - Memory utilization, often measured as Resident Set Size (RSS) and number of pages
- sacct can also provide metrics about disk I/O and energy consumption
- Metrics gathered through sacct are accumulated over the whole job runtime and may be too broad for more specific insight
Content from Scaling Study
Last updated on 2025-12-12 | Edit this page
Overview
Questions
- How many resources should be requested for a given job?
- How does our application behave at different scales?
Objectives
After completing this episode, participants should be able to …
- Perform a scaling study for a given application.
- Notice different perspectives on scaling parameters.
- Identify good working points for the job configuration.
The deadline is approaching way too fast and we may not finish our project in time. Maybe requesting more resources from our cluster’s scheduler does the trick? How could we know whether it helps and by how much?
What is Scaling?
The execution time of parallel applications changes with the number of parallel processes or threads. In a scaling study we measure how much the execution time changes by scanning a reasonable range of number of processes. In a common phrasing, this approach answers how the execution time scales with the number of parallel processors.
Starting from the job script render_snowman.sbatch:
BASH
#!/usr/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=200MB
# The `module load` command you had to load for building the raytracer
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 buildenv/default Boost/1.83.0 CMake/3.27.6 libpng/1.6.40
time mpirun -- ./raytracer -width=800 -height=800 -spp=128 -png "$(date +%Y-%m-%d_%H%M%S).png"
we can manually run such a scaling study by submitting multiple jobs.
In OpenMPI versions 4 and 5 the number of Slurm tasks is automatically
picked up, so we do not set -n or -np of
mpirun. We use -- to separate the arguments of
mpirun – none in this case – from the MPI application
raytracer and its arguments. Otherwise you may experience
errors in some versions of OpenMPI 5, where mpirun
misinterprets the arguments of raytracer as its own.
Scaling other resources with number of CPU cores
When scaling the resources outside of the job script, e.g. with
sbatch --ntasks=X ..., as done above, we make sure to scale
other resource requirements with the number of parallel processors. In
this case, --mem-per-cpu=200MB is necessary, since
--mem results in a fixed memory limit, independent of the
number of processes.
For example, if each MPI process needs \(100\,\)MB, requesting \(2\,\)GB would only be enough for up to 20 MPI processes.
Forgetting a limit like this is a common pitfall in this situation.
Let’s start some measurements with \(1\), \(2\), \(4\), and \(8\) tasks:
OUTPUT
$ sbatch --ntasks 1 render_snowman.sbatch
Submitted batch job 16142767
$ sbatch --ntasks 2 render_snowman.sbatch
Submitted batch job 16142768
$ sbatch --ntasks 4 render_snowman.sbatch
Submitted batch job 16142769
$ sbatch --ntasks 8 render_snowman.sbatch
Submitted batch job 16142770
Now we have to wait until all four jobs are finished.
Regular update of squeue
You can use squeue --me -i 30 to get an update of all of
your jobs every 30 seconds.
Unless you really need more frequent updates, it is good practice to keep the interval on the order of 30s to a couple of minutes, just to be nice to Slurm’s server resources.
Once the jobs are finished, we can use grep to get the
wall clock time of all four jobs:
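For example, assuming the slurm-*.out files of these four jobs are in the current directory:
BASH
grep "real" slurm-*.out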
OUTPUT
slurm-16142767.out:real 2m7.218s
slurm-16142768.out:real 1m7.443s
slurm-16142769.out:real 0m32.584s
slurm-16142770.out:real 0m17.480s
The real time decreases significantly each time we double the number of Slurm tasks. From this, it seems that doubling the number of CPU cores really is a winning strategy!
Exercise: Continue scaling study to larger values
Run the same scaling study and continue it for even larger numbers of --ntasks, e.g. 16, 32, 64, 128. So far, we have been using --nodes=1 to stay on a single node. At which point are your MPI processes distributed across more than one node? Use Slurm command line tools to find out how many CPU cores (MPI processes) are available on a single node. You may have to increase the number of nodes with --nodes if you want to go beyond that limit.
Gather your real time results and place them in a
.csv file. Here is an example for our previous
measurements:
ntasks,time
1,127.218
2,67.443
4,32.584
8,17.480
...
How much does each doubling of the CPU resources help with running the parallel raytracer?
You can use sinfo to find out the node names of your
particular Slurm partition. Then use scontrol to show all
details about a single node from that partition. It will show you the
number of CPU (cores) available on that node.
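A possible sequence, with placeholder partition and node names that you will need to replace with values from your own system:
BASH
# List the nodes that belong to a partition (partition name is an example)
sinfo -p STD-s-96h -N
# Show the details of one of these nodes; CPUTot is the number of CPU cores
scontrol show node nodename0001 | grep CPUTot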
ntasks,time
1,127.218
2,67.443
4,32.584
8,17.480
16,10.251
32,7.257
64,8.044
128,8.575
Using grep "real" slurm-*.out, we can see the execution
time is halved in the beginning, with each doubling of the CPU cores.
However, somewhere between \(8\) and
\(16\) cores, we start to see less and
less improvement.
Adding more resources does not help indefinitely. At some point the overhead of managing the calculation in separate tasks outweighs the benefit of the parallel calculation. There is too little to do in each task and the overhead starts to dominate.
At some point adding more CPU cores does not help us anymore.
Adding more CPU cores can actively slow down the calculation after a certain point. The optimal point is different for each application and each configuration. It depends on the ratio between calculations, communications and various management overheads in the whole process of running everything.
Overheads and Reliable Measurements
Many overheads, and the point at which they show up, also depend on the underlying hardware. So the sweet spot may very well be different for different clusters, even if the application and configuration stay the same!
Another common issue lies within the measurements themselves. We perform a single time measurement on a worker node that is possibly shared with other jobs at the same time. What if another user runs an application that hogs shared resources like the local disk or the network interface card? In this case our measurements become somewhat non-deterministic. Running the same measurement twice may result in significantly different values. If you need reliable results, e.g. for a publication, requesting exclusive access to Slurm’s resources through the sbatch flag --exclusive is the best approach.
As a drawback, this typically results in longer waiting times, since
whole nodes have to be reserved for the measurement jobs, even if not
all resources are used.
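For example, one of the earlier measurements could be repeated on an exclusively allocated node like this (a sketch; the task count is just an example):
BASH
sbatch --exclusive --ntasks=8 render_snowman.sbatch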
Even on exclusive resources, the measurements cannot be 100% reliable. For example, the scheduling behavior of the Linux kernel, or access to remote resources like the parallel file system or data from the web, still affect your measurements in unpredictable ways. Therefore, the best results are achieved by taking the mean and standard deviation of repeated measurements for the same configuration. The measured minimum also has strong informative value, since it documents the best observed behavior.
Keep in mind that --exclusive will always request all resources of a given node, even if only a few cores are used. In these
cases, tools like seff show worse resource utilization
results, since measurements are done with respect to all booked
resources.
Scaling studies can be done with respect to different application and job parameters. For example, what is the execution time when we change the workload, e.g. a larger number of pixels, more samples per pixel, or a more complex scene? How much does the communication overhead change if we change the number of involved nodes while keeping the workload and number of tasks fixed, i.e. changing the network communication surface? Scaling studies like these can help identify pressure points that affect the application’s performance.
Scaling studies typically occur in a preparation phase where the application is evaluated with a representative example workload. Once a good configuration is found, we know the application is running close to an optimal performance and larger number of calculations can start, often called the production phase.
In a similar vein, scaling studies can be a formal requirement for compute time applications on larger HPC systems. On these systems and for larger calculation campaigns it is more crucial to run efficient calculations, since the resources are typically more contested and the potential energy- and carbon footprint becomes much larger.
Speedup, Efficiency, and Strong Scaling
To quantitatively and empirically study the scaling behavior of a given application, it is common to look at the speedup and efficiency with respect to adding more parallel processors.
Speedup is a metric to compare the execution times with different amounts of resources. It answers the question
How much faster is the application with \(N\) parallel processes/threads, compared to the serial execution with \(1\) process/thread?
It is defined by the comparison of wall times \(T(N)\) of the application with \(N\) parallel processes: \[S(N) = \frac{T(1)}{T(N)}\] Here, \(T(1)\) is the wall time for a sequential execution, and \(T(N)\) is the execution with \(N\) parallel processes. For \(2\) processes, we observe a speedup of \(S(2) = \frac{127.218}{67.443} \approx 1.89\)
Efficiency in this context is defined as \[\eta(N) = \frac{S(N)}{N}\] with speedup \(S(N)\), and describes how far the achieved speedup with \(N\) parallel processes deviates from the theoretical linear optimum of \(\eta = 1\).
Exercise: Calculate Speedup and Efficiency
Extend the .csv file of your measurements from above
with a speedup and efficiency column. It may
look like this:
ntasks,time,speedup,efficiency
1,127.218,1.00,1.00
2,67.443,1.89,0.94
4,32.584,3.90,0.98
8,17.480,7.28,0.91
...
You may want to use any data visualization tool, e.g. python or spreadsheets, to visualize the data.
What number of processes may be a good working point for the raytracer with \(800 \times 800\) pixel and \(128\) samples per pixel?
For all of our measurements, the speedup and efficiencies are
ntasks,time,speedup,efficiency
1,127.218,1.00,1.00
2,67.443,1.89,0.94
4,32.584,3.90,0.98
8,17.480,7.28,0.91
16,10.251,12.41,0.78
32,7.257,17.53,0.55
64,8.044,15.82,0.25
128,8.575,14.84,0.12
Plotting the speedup and efficiency helps with identifying a good working point:

With 16 processes, the efficiency is still close to 80%. The corresponding speedup is less than the theoretical optimum, which is visualized by a red line of slope \(1\).
There is no exact optimum and the best working point is open for discussion. However, it would be difficult to justify additional cores, if their contribution to speedup is only 50% efficient or even less.
If you have experience with python, you can use our python script to create the
same plots as above, but for your own data. It depends on
numpy, pandas, and matplotlib, so
make sure to prepare a corresponding python environment.
The script expects your .csv files to be called
strong.csv and weak.csv, and be placed in the
same directory.
So far, we kept the workload size fixed to \(800 \times 800\) pixels and \(128\) samples per pixel for the same scene with three snowmen. The diminishing returns for adding more and more parallel processors lead to a famous observation. The speedup of a program through parallelization is limited by the execution time of the serial fraction that is not parallelizable. No application is 100% parallelizable, so adding an arbitrary amount of parallel processors can only affect the parallelizable section. In the best case, the execution time gets reduced to the serial fraction of the application.
An application is said to scale strongly if adding more cores significantly reduces the execution time.
Amdahl’s Law1
The speedup of a program through parallelization is limited by the execution time of the serial fraction that is not parallelizable. For a given execution time \(T(N) = s + \frac{p}{N}\), with \(s\) the serial fraction and \(p = 1 - s\) the parallelizable fraction of the normalized execution time (so that \(s + p = 1\)), the speedup \(S\) is defined as \[S(N) = \frac{s+p}{s+\frac{p}{N}} = \frac{1}{s + \frac{p}{N}} \Rightarrow \lim_{N\rightarrow \infty} S(N) = \frac{1}{s}\]
Discussion: When should we stop adding CPU cores?
Discuss your previous results and decide on a good working point. How many cores still usefully reduce the execution time?
What other factors could affect your decision, e.g. available hardware and the corresponding waiting times?
If scaling is limited, why are there larger HPC systems? Weak scaling.
For a fixed problem size, we observed that adding more parallel processors can only help up to a certain point. But what if the project benefits from increasing the workload size? Does a higher resolution, more accuracy, or more statistics, etc., improve our insights and results? If that is the case, the perspective on the issue changes and adding more parallel processors can become more feasible as well. For our raytracer example, increasing the workload corresponds to more pixels, more samples per pixel, and/or a more complex scene.
Weak scaling refers to the scaling behavior of an application for a fixed workload per parallel processing unit, e.g. increasing the number of pixels by the same amount as the number of parallel processors \(N\).
Gustafson’s Law2
A program scales on \(N\) parallel processors if the problem size also scales with the number of processors. With \(s\) and \(p = 1 - s\) the serial and parallel fractions of the execution time on the parallel system (so that \(s + p = 1\)), the speedup \(S\) becomes \[S(N) = \frac{s+pN}{s+p} = s+pN = N+s(1-N)\] for \(N\) processors.
To scale the workload of the snowman raytracer, we can increase the number of calculated pixels by the same factor with which we increase the number of parallel processors. For one processor we have \(800 \times 800 = 640000\) pixels. That means for two processors we need a height and a width of \(\sqrt{2 \times 640000} = 1131.371 \approx 1131\), and similarly increasing numbers of pixels for --ntasks=4 and so on.
The job script could look like this:
BASH
#!/usr/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=3800MB
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 buildenv/default Boost/1.83.0 CMake/3.27.6 libpng/1.6.40
# Create associative array
declare -A pixel
pixel[1]="800"
pixel[2]="1131"
pixel[4]="1600"
pixel[8]="2263"
pixel[16]="3200"
pixel[32]="4526"
pixel[64]="6400"
time mpirun -- ./build/raytracer -width=${pixel[${SLURM_NTASKS}]} -height=${pixel[${SLURM_NTASKS}]} -spp=128 -threads=1 -png "$(date +%Y-%m-%d_%H%M%S).png"
To scale the workload of the snowman raytracer, we can multiply the
number of parallel MPI processes, ${SLURM_NTASKS}, with the
samples per pixel (starting from -spp=128). For a single
process, the whole \(800 \times 800\)
pixel picture is calculated in a single MPI process with 128 samples per pixel.
Running with two MPI processes, both have to calculate half the number
of pixels, but twice the amount of samples per pixel.
BASH
#!/usr/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=500MB
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 buildenv/default Boost/1.83.0 CMake/3.27.6 libpng/1.6.40
SPP="$[${SLURM_NTASKS}*128]"
time mpirun -- ./build/raytracer -width=800 -height=800 -spp=${SPP} -threads=1 -png "$(date +%Y-%m-%d_%H%M%S).png"


In direct comparison, and zooming in really close, you can see more noise in the first image, e.g. in the shadows. One could argue that we passed the point of diminishing returns, though. Is a \(64\times\) increase in computational cost worth the observed quality improvement? For the samples per pixel, we do not seem to benefit much from weak scaling. A larger resolution, i.e. increasing the number of pixels, is the more useful dimension to scale in this case.
Increasing the resolution may be worth the effort, if we have a use for a larger, more detailed picture. In practice, there is a cutoff, beyond which no reasonable improvement is to be expected. This is a question about accuracy, error margins, and overall quality, which can only be answered in the specific context of each research project. If there is no real improvement by increasing the workload, running a weakly scaling application is really just wasting valuable computational time and energy.
If we increase the workload at the same rate as our number of parallel processes (\(N\)), our speedup is defined as \[S_{\text{weak}}(N) = \frac{T(1)}{T(N)} \times N\] since we do \(N\) times more work with \(N\) processors, compared to our reference \(T(N=1)\). Efficiency is still defined as \[\eta_{\text{weak}}(N) = \frac{S_{\text{weak}}(N)}{N} = \frac{T(1)}{T(N)}\]
Exercise: Weak scaling
Repeat the previous scaling study and increase the number of pixels accordingly to study the raytracer’s weak scaling behavior.
- Run with 1, 2, 4, 8, 16, 32, 64 MPI processes on a single node
- Take time measurements and consider running with --exclusive to ensure more reliable results.
- Create a .csv file and run the plotting script
ntasks,pixel,time,speedup,efficiency
1,800,123.162
2,1131,122.562
4,1600,124.522
8,2263,125.606
...
How well does the application scale with an increasing workload size?
Do you see a qualitative difference in the resulting .png files, and is the increased workload worth the computational costs?
ntasks,pixel,time,speedup,efficiency
1,800,123.162
2,1131,122.562
4,1600,124.522
8,2263,125.606
16,3200,125.803
32,4526,130.137
64,6400,138.636
The scaling behavior approaches an asymptotic limit, where each additional processor contributes to the increased workload with roughly the same efficiency.

Weakly scaling jobs can make efficient use of a huge amount of resources.
The most important question is whether an increased workload produces useful results. Here, we have the rendered picture of three snowmen in 800x800 with 128 samples per pixel and three snowmen in 6400x6400 with 128 samples per pixel. The second image has a much higher resolution. However, going way beyond \(6400 \times 6400\) pixels is probably not very meaningful, unless you are trying to print the world’s largest billboards or similar.
Summary
In this episode, we have seen that we can study the scaling behavior of our application with respect to different metrics, while varying its configuration. Most commonly, we study the execution time of an application with an increasing number of parallel processors. In such a scaling study, we collect comparable walltime measurements for an increasing number of Slurm tasks of a parallelizable and representative job. If a good working point is found, larger scale “production” jobs can be submitted to the HPC system.
If the application has good strong scaling behavior, adding more cores leads to an effective improvement in execution time. We observe diminishing returns when adding more cores to a fixed-size problem, so there is a (subjective) optimal number of parallel processors for a given application configuration. (Amdahl’s Law)
If increasing the workload size leads to better results, maybe because of improved accuracy and quality, we can study the weak scaling behavior and increase the workload size by the same factor of increasing parallel processors.
A good working point depends on the availability of resources, specifics of the underlying hardware, the particular application, and a particular configuration for the application. For that reason, scaling studies are a common requirement for formal compute time applications to prove an efficient execution of a given application.
We can study the impact of any parameter on metrics like, for example, walltime, CPU utilization, FLOPS, memory utilization, communication, output size on disk, and so on.
If you find yourself repeating similar measurements over and over again, you may be interested in an automation approach. This can be done by creating reproducible HPC workflows using JUBE, among other things.
Up to now, we were still working with basic metrics like the wall-clock time. In the next episode, we start with more in-depth measurements of many other aspects of our job and application.
- Jobs behave differently with increasing parallel resources and fixed or scaling workloads
- Scaling studies can help to quantitatively grasp this changing behavior
- Good working points are defined by configurations where more cores still provide sufficient speedup or improve quality through increasing workloads
- Amdahl’s law: speedup is limited by the serial fraction of a program
- Gustafson’s law: more resources for parallel processing still help, if larger workloads can meaningfully contribute to project results
G. M. Amdahl, ‘Validity of the single processor approach to achieving large scale computing capabilities’, in Proceedings of the April 18-20, 1967, spring joint computer conference, in AFIPS ’67 (Spring). New York, NY, USA: Association for Computing Machinery, Apr. 1967, pp. 483–485. doi: 10.1145/1465482.1465560.↩︎
J. L. Gustafson, ‘Reevaluating Amdahl’s law’, Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988, doi: 10.1145/42411.42415.↩︎
Content from Performance Overview
Last updated on 2025-11-11 | Edit this page
Overview
Questions
- Is it enough to look at a job’s walltime?
- What steps can I take to evaluate a job’s performance?
- What popular types of reports exist?
Objectives
After completing this episode, participants should be able to …
- Explain different approaches to performance measurements.
- Understand common terms and concepts in performance analyses.
- Create a performance report through a third-party tool.
- Describe what a performance report is meant for (establish baseline, documentation of issues and improvements through optimization, publication of results, finding the next thread to pull in a quest for optimization)
- Measure the performance of central components of underlying hardware (CPU, Memory, I/O, …) (split episode?)
- Identify which general areas of computer hardware may affect performance.
Workflow
- Previously checked scaling behavior by looking at walltime
- what if we would count other things while our job is running? Could
be
- CPU utilization
- FLOPS
- Memory utilization
- …
- Two possible ways to look at this data with respect to time:
- tracing: over time
- sampling: accumulated results at the end
- Third-party tools to measure these things - you can use them with your jobs
Here you can choose between three alternative perspectives on our job:
- ClusterCockpit: A job monitoring service available on many of our clusters. Needs to be centrally maintained by your HPC administration team.
- Linaro Forge Performance Reports: A commercial application providing a single page performance overview of your job. Your cluster may have licenses available.
- TBD: A free, open source tool/set of tools, to get a general performance overview of your job.
Performance counters and permissions, may require
--exclusive, depends on system! Look at documentation /
talk to your administrators / support.
cap_perfmon,cap_sys_ptrace,cap_syslog=ep
kernel.perf_event_paranoid
Live coding:
- Set up the main tool. How do I access it? How can I use it with my job?
- Run snowman with 8 cores
General report
How Does Performance Relate to Hardware?
(Following this structure throughout the course, trying to understand the performance in these terms)
Broad dimensions of performance:
- CPU (Front- and Backend, FLOPS)
- Frontend: decoding instructions, branch prediction, pipeline
- Backend: getting data from memory, cache hierarchy & alignment
- Raw calculations
- Vectorization
- Out-of-order execution
- Accelerators (e.g. GPUs)
- More calculations
- Offloading
- Memory & communication models
- Memory (data hierarchy)
- Working memory, reading data from/to disk
- Bandwidth of data
- I/O (broader data hierarchy: disk, network)
- Stored data
- Local disk (caching)
- Parallel fs (cluster-wide)
- MPI-Communiction
- Parallel timeline (synchronization, etc.)
- Application logic

Exercise: Match application behavior to hardware
Which part of the computer hardware may become an issue for the following application patterns:
- Calculating matrix multiplications
- Reading data from processes on other computers
- Calling many different functions from many equally likely if/else branches
- Writing very large files (TB)
- Comparing many different strings if they match
- Constructing a large simulation model
- Reading thousands of small files for each iteration
- CPU (FLOPS) and/or Parallel timeline
- I/O (network)
- CPU (Front-End)
- I/O (disk)
- (?) CPU-Backend, getting strings through the cache?
- Memory (size)
- I/O (disk)
Summary
- General reports show direction in which to continue
- Specialized tools may be necessary to move on
Leading question: Connection to hardware is quite deep, why does it matter? -> Drill deeper, e.g. on NUMA & pinning
- First things first, second things second, …
- Profiling, tracing
- Sampling, summation
- Different HPC centers may provide different approaches to this workflow
- Performance reports offer more insight into the job and application behavior
Content from Pinning
Last updated on 2025-10-31 | Edit this page
Overview
Questions
- What is “pinning” of job resources?
- How can pinning improve the performance?
- How can I see, if pinning resources would help?
- What requirement hints can I give to the scheduler?
Objectives
After completing this episode, participants should be able to …
- Define the concept of “pinning” and how it can affect job performance.
- Name Slurm’s options for memory and CPU binding.
- Use hints to tell Slurm how to optimize their job allocation.
Binding / pinning:
- --mem-bind=[{quiet|verbose},]<type>
- -m, --distribution={*|block|cyclic|arbitrary|plane=<size>}[:{*|block|cyclic|fcyclic}[:{*|block|cyclic|fcyclic}]][,{Pack|NoPack}]
- --hint=: Hints for CPU-bound (compute_bound) and memory-bound (memory_bound) jobs, but also multithread and nomultithread
- --cpu-bind=[{quiet|verbose},]<type> (srun)
- Mapping of application <-> job resources
Motivation
Exercise
Case 1: 1 thread per rank
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=1 -alloc_mode=3 -png=snowman.png
Case 2: 2 threads per rank
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=2 -alloc_mode=3 -png=snowman.png
Questions: - Do you notice any difference in runtime between the two cases? - Is the increase in threads providing a speedup as expected?
- Observation: The computation times are almost the same.
- Expected behavior: Increasing threads should ideally reduce runtime.
- Hypothesis: Additional threads do not contribute.
How to investigate?
You can verify the actual core usage in two ways:
1. Use --report-bindings with mpirun
2. Use the htop command on the compute node
Exercise
Follow any one of the options above and run with 2 threads per rank
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=2 -alloc_mode=3 -png=snowman.png
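For the first option, the command could look like this (same raytracer arguments, just with the binding report enabled):
BASH
mpirun --report-bindings -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=2 -alloc_mode=3 -png=snowman.png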
Questions: - Did you find any justification for the hypothesis we made?
Only 8 cores are active instead of 16
Explanation:
- Even though we requested 2 threads per MPI rank, both threads are pinned to the same core.
- The second thread waits for the first thread to finish, so no actual thread-level parallelization is achieved.
How to achieve proper binding?
Exercise: Understanding Process and Thread Binding
Pinning (or binding) means locking a process or thread to a specific hardware resource such as a CPU core, socket, or NUMA region. Without pinning, the operating system may move tasks between cores, which can reduce cache reuse and increase memory latency, directly diminishing performance.
In this exercise we will explore how MPI process and thread binding works. We will try binding to core, socket, and numa, and observe timings and bindings.
Exercise
Case 1: --bind-to numa
mpirun -n 8 --bind-to numa ./raytracer -width=512 -height=512 -spp=128 -threads=12 -alloc_mode=3 -png=snowman.png
Case 2: --bind-to socket
mpirun -n 4 --bind-to socket ./raytracer -width=512 -height=512 -spp=128 -threads=48 -alloc_mode=3 -png=snowman.png
Questions: - What is the difference between Case 1 and Case 2? Is there any difference in performance? How many workers are used? - How could you adjust process/thread counts to better utilize the hardware in Case 2?
- MPI and thread pinning is hardware-aware.
- If the number of processes matches the number of domains (socket or NUMA), then the number of threads should equal the cores per domain to fully utilize the node.
- No speedup in Case 2: Oversubscription occurs because we requested 4 processes on a system with only 2 sockets.
- Threads compete for the same cores → they are time-sliced by the operating system and effectively wait for each other instead of running in parallel.
Best Practices for MPI Process and Thread Pinning
Mapping vs. Binding Analogy
Think of running MPI processes and threads like booking seats for a group of friends:
- Mapping is like planning where your group will sit in the theatre or on a flight.
  - Example: You decide some friends sit in Economy, some in Premium Economy, and some in Business.
  - Similarly, --map-by distributes MPI ranks across nodes, sockets, or NUMA regions.
- Binding is like reserving the exact seats for each friend in the planned area.
  - Example: Once the seating area is chosen, you assign specific seat numbers to each friend.
  - Similarly, --bind-to pins each MPI process or thread to a specific core or hardware unit to avoid movement.
This analogy helps illustrate why mapping defines placement and binding enforces it.
We will use --bind-to core (the smallest hardware unit)
and --map-by to distribute MPI processes across sockets or
NUMA or node regions efficiently.
Choosing the Smallest Hardware Unit
Binding processes to the smallest unit (core) is recommended because:
- Exclusive use of resources: Each process or thread is pinned to its own core, preventing multiple threads or processes from competing for the same CPU.
- Predictable performance: When processes share cores, execution times can fluctuate due to scheduling conflicts. Binding to cores ensures consistent timing across runs.
- Best practice: Always bind processes to the smallest unit (core) and spread processes evenly across the available hardware using --map-by.
- Example options:
  - --bind-to core → binds each process to a dedicated core (avoids oversubscription).
  - --map-by socket:PE=<threads> → spreads processes across sockets, assigning <threads> cores (processing elements) per process.
  - --map-by numa:PE=<threads> → spreads processes across NUMA domains, assigning <threads> cores per process.
  - --cpus-per-rank <n> → assigns <n> cores (hardware threads) to each MPI rank, ensuring that all threads within a rank occupy separate cores.
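As a sketch of how these options can be combined for 8 ranks with 4 threads each (the exact counts depend on your hardware):
BASH
mpirun -n 8 --map-by numa:PE=4 --bind-to core --report-bindings ./raytracer -width=512 -height=512 -spp=128 -threads=4 -alloc_mode=3 -png=snowman.png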
Exercise
Use the best practices given above for case 1 (-n 8, -threads=1) and case 2 (-n 8, -threads=4) and answer the following questions.
Questions: - How many cores do the two jobs use? - Did you get more workers than you requested? - Did you see the expected scaling when running with 4 threads?
- 8 and 32
- No.
- Yes
Summary
- Always check how pinning works: Use verbose reporting (e.g., --report-bindings) to see how MPI processes and threads are mapped to cores and sockets.
- Documentation is your friend: For OpenMPI with mpirun, consult the manual: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php
- Know your hardware: Understanding the number of sockets, cores per socket, and NUMA regions on your cluster helps you make effective binding decisions.
- Avoid oversubscription: Assigning more threads or processes than available cores hurts performance, because it causes contention and idle waits.
- Recommended practice for OpenMPI: Use --bind-to core along with --map-by to control placement and threads per process to maximize throughput.
Content from How to identify a bottleneck?
Last updated on 2025-09-24 | Edit this page
Overview
Questions
- How can I find the bottlenecks in a given job?
- What are common workflows to evaluate performance?
- What are some common types of bottlenecks?
Objectives
After completing this episode, participants should be able to …
- Choose between multiple workflows to evaluate job performance.
- Name typical performance issues.
- Determine if their job is affected by one of these issues.
How to identify a bottleneck?
Summary
Leading question: We were looking at a standard configuration with CPU, Memory, Disks, Network, so far. What about GPU applications, which are very common these days?
- General advice on the workflow
- Performance reports may provide an automated summary with recommendations
- Performance metrics can be categorized by the underlying hardware, e.g. CPU, memory, I/O, accelerators.
- Bottlenecks can appear by metrics being saturated at the physical limits of the hardware or indirectly by other metrics being far from what the physical limits are.
- Interpreting bottlenecks is closely related to what the application is supposed to do.
- Relative measurements (baseline vs. change)
- system is quiescent, fixed CPU freq + affinity, warmups, …
- Reproducibility -> link to git course?
- Scanning results for smoking guns
- Any best practices etc.
Content from Performance of Accelerators
Last updated on 2025-09-24 | Edit this page
Overview
Questions
- What are accelerators?
- How do they affect my job’s performance?
- How can I measure accelerator utilization?
Objectives
After completing this episode, participants should be able to …
- Understand difference of performance measurements on accelerators (GPUs, FPGAs) to CPUs.
- Understand how batch systems and performance measurements tools treat accelerators.
Introduction
Run the same example workload on GPU and compare.
Summary
Leading question: Performance optimization is a deep topic and we are not done learning. How could I continue exploring the topic?
- Tools to measure GPU/FPGA performance of a job
- Common symptoms of GPU/FPGA problems
Content from Next Steps
Last updated on 2025-10-29 | Edit this page
Overview
Questions
- What are other patterns of performance bottlenecks?
- How to evaluate an application in more detail?
Objectives
After completing this episode, participants should be able to …
- Find collection of performance patterns on hpc-wiki.info
- Identify next steps to take with regard to performance optimization.
Next Steps
Performance patterns on hpc-wiki.info:
- I/O
- CPU Front End
- CPU Back End
- Memory leak
- Oversubscription
- Underutilization
Summary
- There are many profilers, some are language-specific, others are vendor-related, …
- Simple profile with exclusive resources
- Repeated measurements for reliability