Scheduler Tools
Last updated on 2025-09-24
Estimated time: 10 minutes
Overview
Questions
- What can the scheduler tell us about job performance?
- What do the collected metrics mean?
Objectives
After completing this episode, participants should be able to …
- Explain basic performance metrics.
- Use tools provided by the scheduler to collect basic performance metrics of their jobs.
Narrative:
- Okay, so the first couple of jobs ran, but were they “quick enough”?
- How many renders could I generate per minute/hour/day at the current utilization?
- Our cluster uses certain hardware; maybe we didn’t use it as much as we could have?
- But I couldn’t see all metrics (may be cluster dependent): energy, disk I/O, network I/O?
What we’re doing here:
- What `seff` and `sacct` have to offer
- Introduce a simple relation to hardware: what do RSS, CPU, disk read/write and their utilization mean?
- Point out what’s missing from a complete picture
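Before diving into the scheduler’s tools, it can help to connect RSS to something tangible. As a small sketch, any process’s resident set size can be read with `ps`; here we read the current shell’s:

```shell
# RSS (resident set size) is the portion of a process's memory that
# currently resides in physical RAM; ps reports it in kilobytes.
# $$ expands to the PID of the current shell.
rss_kb=$(ps -o rss= -p $$)
echo "This shell currently uses ${rss_kb} kB of resident memory"
```

The `MaxRSS` and `AvgRSS` fields below are this same quantity, tracked by the scheduler over the lifetime of the job’s processes.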
Scheduler Tools
- `sacct`
  - `MaxRSS`, `AvgRSS`
  - `MaxPages`, `AvgPages`
  - `AvgCPU`, `AllocCPUS`
  - `Elapsed`
  - `MaxDiskRead`, `AvgDiskRead`
  - `MaxDiskWrite`, `AvgDiskWrite`
  - `ConsumedEnergy`
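As a sketch of how these fields fit together: job ID 123456 and all values below are invented, and which fields are populated varies by cluster. `AvgCPU` is CPU time summed over tasks, so utilization relates it to wall time times allocated CPUs:

```shell
# Query a finished job in parsable form (123456 is a placeholder ID):
#   sacct -j 123456 -P --units=M \
#     --format=JobID,Elapsed,AllocCPUS,AvgCPU,MaxRSS

# Sample output in the same pipe-separated layout (invented numbers):
sample='JobID|Elapsed|AllocCPUS|AvgCPU|MaxRSS
123456|01:00:00|4|03:30:00|2048M'

# 03:30:00 of CPU time within 01:00:00 of wall time on 4 CPUs
# means 3.5 / (1 x 4) = 87.5% CPU utilization.
echo "$sample" | awk -F'|' 'NR==2 {
  split($2, w, ":"); wall = w[1]*3600 + w[2]*60 + w[3]
  split($4, c, ":"); cpu  = c[1]*3600 + c[2]*60 + c[3]
  printf "CPU utilization: %.1f%%\n", 100 * cpu / (wall * $3)
}'
```

The `-P` flag gives machine-readable output, which makes this kind of post-processing straightforward.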
- `seff`
  - Utilization of the time allocation
  - Utilization of allocated CPUs (is 100% <=> efficient? Not if calculations are redundant etc.!)
  - Utilization of allocated memory
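A sketch of what `seff` reports; the job ID and every figure below are invented for illustration, and the exact output format may differ between Slurm versions:

```shell
# seff summarizes a completed job's efficiency from accounting data.
# Placeholder invocation and illustrative output:
#   seff 123456
#   ...
#   CPU Efficiency: 87.50% of 04:00:00 core-walltime
#   Memory Efficiency: 25.00% of 8.00 GB

# Memory efficiency is simply peak usage (MaxRSS) over allocated memory:
max_rss_mb=2048
alloc_mem_mb=8192
awk -v u="$max_rss_mb" -v a="$alloc_mem_mb" \
  'BEGIN { printf "Memory Efficiency: %.2f%%\n", 100 * u / a }'
```

A low memory efficiency like this suggests the job could request less memory next time, freeing resources for other users.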
Shortcomings
- Not enough info about e.g. I/O; no timeline of metrics during job execution, …
- I/O may be available, but likely only for local disks
  - => no parallel FS
  - => no network
- Energy demand may be missing or wrong
  - Depends on available features
  - Doesn’t estimate energy for network switches, cooling, etc.
- => Try other tools! (motivation for subsequent episodes)
Can / should we cover I/O and energy metrics at this point?
E.g. use something like `beegfs-ctl` to get a rough estimate of parallel FS performance. Use `pidstat` etc. to get numbers on node-local I/O (and much more).
Summary
Leading question: Is there a systematic approach to study a job’s performance at different scales? -> Scaling study
- `sacct` and `seff` for first results
- Small scaling study; a maximum of X% overhead is “still good” (larger resource requirements vs. speedup)
- Getting a feel for the scale of the HPC system, e.g. “is 64 cores a lot?”, how large is my job in comparison?
- CPU and memory utilization
- Core-h and relationship to power efficiency
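As a closing sketch (the numbers are invented), core-hours are simply allocated CPUs times elapsed wall time, which is the usual unit HPC centers use to account for resource consumption:

```shell
# Core-hours = allocated CPUs x elapsed wall-clock time in hours.
# In practice both inputs come from sacct's AllocCPUS and Elapsed
# fields; the values here are illustrative.
alloc_cpus=64
elapsed_seconds=5400   # 01:30:00 of wall time
core_hours=$(awk -v c="$alloc_cpus" -v s="$elapsed_seconds" \
  'BEGIN { printf "%.1f", c * s / 3600 }')
echo "$core_hours core-hours consumed"
```

Note that core-hours measure what was reserved, not what was used: a job at 50% CPU utilization burns the same core-hours as one at 100%, which is why utilization and power efficiency belong in the same discussion.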