Performance Overview

Last updated on 2025-09-24

Overview

Questions

  • Why are tools like seff and sacct not enough?
  • What steps can I take to assess a job's performance?
  • What popular types of reports exist? (e.g. Roofline)

Objectives

After completing this episode, participants should be able to …

  • Explain different approaches to performance measurements.
  • Understand common terms and concepts in performance analyses.
  • Create a performance report through a third-party tool.
  • Describe what a performance report is meant for (establishing a baseline, documenting issues and the improvements gained through optimization, publishing results, finding the next thread to pull in a quest for optimization).
  • Measure the performance of central components of the underlying hardware (CPU, memory, I/O, …) (split episode?)
  • Identify which general areas of computer hardware may affect performance.

Workflow


  • Define sampling and tracing
  • Describe common approaches
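
As a rough illustration of the two terms above: tracing records every event of interest (here, every Python function call), while sampling only looks at the program at regular intervals and counts where it happens to be. The following is a minimal sketch using only the Python standard library. The names `work`, `square`, `trace_calls`, and `sample_stack` are hypothetical and chosen for this example; they are not part of any profiling tool.

```python
import sys
import threading
import time
from collections import Counter


def square(x):
    return x * x


def work(n):
    """Toy workload: calls square() exactly n times."""
    total = 0
    for i in range(n):
        total += square(i)
    return total


def trace_calls(func, *args):
    """Tracing: count every Python function call made while func runs."""
    calls = Counter()

    def hook(frame, event, arg):
        # "call" is emitted for Python functions; C functions emit "c_call".
        if event == "call":
            calls[frame.f_code.co_name] += 1

    sys.setprofile(hook)
    try:
        func(*args)
    finally:
        sys.setprofile(None)
    return calls


def sample_stack(func, *args, interval=0.001):
    """Sampling: periodically record which function the calling thread is in."""
    samples = Counter()
    target_id = threading.get_ident()
    done = threading.Event()

    def sampler():
        while not done.is_set():
            # sys._current_frames() is a CPython-specific helper.
            frame = sys._current_frames().get(target_id)
            if frame is not None:
                samples[frame.f_code.co_name] += 1
            time.sleep(interval)

    thread = threading.Thread(target=sampler)
    thread.start()
    try:
        func(*args)
    finally:
        done.set()
        thread.join()
    return samples


if __name__ == "__main__":
    print("traced :", dict(trace_calls(work, 1000)))
    print("sampled:", dict(sample_stack(work, 200_000)))
```

Tracing yields exact counts but adds overhead to every call. Sampling has little overhead, but its counts are statistical estimates that vary from run to run and depend on the chosen interval.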

Tools


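Access to hardware performance counters depends on permissions and may require running the job with --exclusive. This differs between systems, so check the documentation or talk to your administrators or support.

Linux capabilities that may be granted to a profiling tool (e.g. via setcap):

cap_perfmon,cap_sys_ptrace,cap_syslog=ep

Kernel setting that controls access to performance events for unprivileged users:

kernel.perf_event_paranoid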
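
To see what a given system allows, the current value of kernel.perf_event_paranoid can be read from /proc on Linux. The sketch below does this and prints a short interpretation. The function names are hypothetical, the descriptions paraphrase the Linux kernel documentation for the values -1 to 2, and some distributions define additional, stricter levels, so treat the output as a hint rather than a definitive answer.

```python
from pathlib import Path

# Location of the sysctl value on Linux; other systems will not have it.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path=PARANOID_PATH):
    """Return the current level as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_level(level):
    """Describe what unprivileged users may measure at a given level."""
    if level is None:
        return "unknown (setting not readable on this system)"
    if level <= -1:
        return "no restrictions for unprivileged users"
    if level == 0:
        return "CPU events and kernel profiling allowed, raw tracepoints restricted"
    if level == 1:
        return "user and kernel profiling allowed, CPU event access restricted"
    # Level 2 and above (some distributions add stricter levels).
    return "only user-space measurements allowed for unprivileged users"


if __name__ == "__main__":
    level = read_paranoid_level()
    print(f"kernel.perf_event_paranoid = {level}: {describe_level(level)}")
```

If the reported level is too restrictive for the tool you want to use, that is a question for your administrators or support rather than something to change yourself.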

General report


  • General reports show direction in which to continue
    • Specialized tools may be necessary
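
To make the idea of a general report concrete, the following is a minimal sketch that collects a few coarse numbers (wall time, CPU time, peak memory) for a single Python function using only the standard library. It is not a replacement for a third-party reporting tool. The name `run_with_report` and the report keys are assumptions made for this example, and the `resource` module is only available on Unix-like systems.

```python
import resource
import time


def run_with_report(func, *args):
    """Run func(*args) and return (result, report) with coarse metrics."""
    start_wall = time.perf_counter()
    start = resource.getrusage(resource.RUSAGE_SELF)

    result = func(*args)

    end = resource.getrusage(resource.RUSAGE_SELF)
    wall = time.perf_counter() - start_wall
    user = end.ru_utime - start.ru_utime
    system = end.ru_stime - start.ru_stime

    report = {
        "wall_s": wall,
        "user_cpu_s": user,
        "system_cpu_s": system,
        # Ratio of CPU time to wall time; well below 1 hints at waiting.
        "cpu_utilization": (user + system) / wall if wall > 0 else 0.0,
        # Peak resident set size of the process (KiB on Linux).
        "max_rss_kib": end.ru_maxrss,
    }
    return result, report


if __name__ == "__main__":
    _, report = run_with_report(sum, range(1_000_000))
    for key, value in report.items():
        print(f"{key:>16}: {value}")
```

Numbers like these only point in a direction: a low CPU utilization suggests looking at I/O or synchronization, while a high peak memory suggests looking at the memory dimension with a more specialized tool.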

How Does Performance Relate to Hardware?


(The course follows this structure throughout and tries to explain performance in these terms.)

Broad dimensions of performance:

  • CPU (Front- and Backend, FLOPS)
    • Frontend: decoding instructions, branch prediction, pipeline
    • Backend: getting data from memory, cache hierarchy & alignment
    • Raw calculations
    • Vectorization
    • Out-of-order execution
  • Accelerators (e.g. GPUs)
    • More calculations
    • Offloading
    • Memory & communication models
  • Memory (data hierarchy)
    • Working memory, moving data from and to disk
    • Bandwidth of data
  • I/O (broader data hierarchy: disk, network)
    • Stored data
    • Local disk (caching)
    • Parallel fs (cluster-wide)
    • MPI communication
  • Parallel timeline (synchronization, etc.)
    • Application logic

[Figure: Hardware]

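
One common way to relate the CPU and memory dimensions above is the roofline model: the attainable performance of a kernel is bounded either by the peak compute rate or by memory bandwidth times the kernel's arithmetic intensity (floating-point operations per byte moved). The sketch below is a minimal illustration; the function names and the machine numbers in the usage part are hypothetical and do not describe any particular system.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Roofline bound in GFLOP/s.

    peak_gflops:    peak compute rate in GFLOP/s
    bandwidth_gb_s: memory bandwidth in GB/s
    intensity:      arithmetic intensity in FLOP per byte
    """
    return min(peak_gflops, bandwidth_gb_s * intensity)


def limiting_resource(peak_gflops, bandwidth_gb_s, intensity):
    """Name the roof that limits a kernel with the given intensity."""
    # The ridge point is the intensity at which both roofs meet.
    ridge = peak_gflops / bandwidth_gb_s
    return "memory bandwidth" if intensity < ridge else "compute (FLOPS)"


if __name__ == "__main__":
    # Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
    peak, bandwidth = 1000.0, 100.0
    for intensity in (0.25, 2.0, 50.0):
        bound = attainable_gflops(peak, bandwidth, intensity)
        limit = limiting_resource(peak, bandwidth, intensity)
        print(f"intensity {intensity:5.2f} FLOP/B: at most {bound:7.1f} GFLOP/s, limited by {limit}")
```

A kernel left of the ridge point cannot be sped up by more raw calculations alone; it first needs to move less data per operation.
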
Challenge

Exercise: Match application behavior to hardware

Which part of the computer hardware may become an issue for the following application patterns:

  1. Calculating matrix multiplications
  2. Reading data from processes on other computers
  3. Calling many different functions from many equally likely if/else branches
  4. Writing very large files (TB)
  5. Comparing many different strings to check whether they match
  6. Constructing a large simulation model
  7. Reading thousands of small files for each iteration

These may not be the best questions, and an item covering accelerators is still missing.

  1. CPU (FLOPS) and/or Parallel timeline
  2. I/O (network)
  3. CPU (Frontend)
  4. I/O (disk)
  5. (?) CPU (Backend): moving the strings through the cache hierarchy?
  6. Memory (size)
  7. I/O (disk)

Summary


Leading question: The connection to hardware is quite deep, so why does it matter? Drill deeper, e.g. into NUMA and pinning.
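
As a small first step towards pinning, a process can ask which CPU cores it is allowed to run on. The sketch below uses `os.sched_getaffinity`, which exists on Linux but not on every platform, so it falls back to assuming all cores are usable. The function name `allowed_cpus` is hypothetical.

```python
import os


def allowed_cpus():
    """Return the sorted list of CPU ids this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # 0 means "the calling process".
        return sorted(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support: assume all CPUs.
    return list(range(os.cpu_count() or 1))


if __name__ == "__main__":
    cpus = allowed_cpus()
    print(f"allowed on {len(cpus)} of {os.cpu_count()} CPUs: {cpus}")
```

Inside a batch job the allowed set may be smaller than the number of cores in the node, and which cores it contains can matter for performance on NUMA systems.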

Key Points
  • First things first, second things second, …
  • Profiling, tracing
  • Sampling, summation
  • Different HPC centers may provide different approaches to this workflow
  • Performance reports offer more insight into the job and application behavior