Resource Requirements

Last updated on 2025-07-02

Overview

Questions

  • How many resources should I request initially?
  • What options does the scheduler offer for requesting resources?
  • How do I know whether the requested resources are used well?
  • How large is my HPC cluster?

Objectives

After completing this episode, participants should be able to …

  • Identify the size of their jobs in relation to the HPC system.
  • Request the right amount of resources from the scheduler.
  • Change the parameters if the application’s resource requirements change.

Starting Somewhere


Didactic path: I have no idea how many resources to ask for -> just guess and start with a few combinations. Next, identify the runs that were slow or that failed (OOM, time limit) and choose the best one. What does that say about efficiency?

Exercise: Starting Somewhere

  • Run the job with a time limit of 1 minute -> trigger the time limit. What’s a good time limit for our task?
  • Run the job with few cores, so that it needs more memory per core than requested -> trigger an out-of-memory (OOM) kill. What’s a good memory limit for our task?
  • Run the job requesting far too many cores -> endless waiting, or rejection due to account limits. What’s a good CPU request for our task?
  • Use squeue to learn about scheduling issues and the reasons a job is pending (see the sketch after this list)
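
A minimal batch script for such a first guess might look like the following sketch; the program name my_app and all numbers are placeholders to adapt to your system and application.

    #!/bin/bash
    #SBATCH --job-name=first-guess
    #SBATCH --time=00:01:00        # deliberately tight: 1 minute, triggers the time limit
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=1G

    ./my_app                       # hypothetical application

Submit it with sbatch and watch it with squeue -u $USER; the last column of the squeue output shows the reason a pending job is still waiting.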

Summarize the dimensions in which a job has to be sized correctly (time, cores, memory, GPUs, …).

Compared to the HPC System


  • What’s the relationship between your job and existing hardware of the system?
    • What hardware does your HPC system offer?
    • Consult the documentation and Slurm commands (see the sinfo sketch after this list)
  • Is my job large or small?
    • What’s considered large, medium, or small? Perhaps as a percentage of the whole system?
    • Issues of large jobs: long waiting times
    • Issues of many (thousands of) small jobs: per-job scheduling overhead; consider bundling work or using job arrays
  • How many resources are currently free?
  • How long do I have to wait? (look up the scheduler’s estimate and apply common sense)
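
One way to get a quick overview of the system’s size is an sinfo format string such as the sketch below; the exact columns and their values depend on your site’s configuration.

    # One line per partition: name, node count, CPUs and memory per node, time limit
    sinfo -o "%P %D %c %m %l"

    # Current node states per partition (idle, mixed, allocated, ...)
    sinfo -o "%P %D %t"

Comparing your job’s core, memory, and time request with these numbers tells you whether it is small, medium, or large for this system.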

Exercise: Comparing to the system

  • sinfo to learn about partitions and currently free resources
  • scontrol to learn about the nodes in those partitions
  • lscpu and cat /proc/cpuinfo to inspect the CPUs of a node
  • Submit a job with a reasonable amount of resources and use squeue and/or scontrol show job to learn about Slurm’s estimated start time (see the sketch after this list)
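
A possible sequence of commands for this exercise; node names and job IDs are placeholders.

    # Partitions and their current utilization
    sinfo

    # Details of one node (replace node001 with a real node name from sinfo)
    scontrol show node node001

    # CPU layout of the node you are on: sockets, cores, threads per core
    lscpu

    # Estimated start times of your pending jobs
    squeue -u $USER --start

    # Full details of a single job, including the scheduler’s start-time estimate
    scontrol show job <jobid>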

Answer questions about the number and type of CPUs, HT/SMT, memory per core, and time limits.

Summarize with a well-sized job that’s a good starting point for the example.

Requesting Resources


More detail on the options Slurm provides (among others); several of them are combined in the sketch after the list:

  • -t, --time=<time>: Time limit of the job
  • -N, --nodes: Number of nodes
  • -n, --ntasks: Number of tasks/processes
  • -c, --cpus-per-task: Number of CPUs per task/process
  • --threads-per-core=<threads>: Restrict node selection to nodes with at least this many threads per core
  • --mem=<size>[units]: Memory per node; can also be given as --mem-per-cpu, …
  • -G, --gpus: Number of GPUs
  • --exclusive: Use the allocated nodes exclusively, i.e. do not share them with other jobs
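
A sketch combining several of these options into one batch script; the numbers are purely illustrative, and my_app stands for a hypothetical MPI-parallel program.

    #!/bin/bash
    #SBATCH --time=02:00:00        # 2 hours
    #SBATCH --nodes=2
    #SBATCH --ntasks=8             # 8 tasks (processes) in total
    #SBATCH --cpus-per-task=4      # 4 CPUs (threads) per task
    #SBATCH --mem-per-cpu=2G
    #SBATCH --gpus=2               # 2 GPUs in total, only if the application uses GPUs

    srun ./my_app                  # srun launches the 8 tasks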

Binding of tasks to CPUs and memory (see the srun sketch after this list):

  • --mem-bind=[{quiet|verbose},]<type>
  • -m, --distribution={*|block|cyclic|arbitrary|plane=<size>}[:{*|block|cyclic|fcyclic}[:{*|block|cyclic|fcyclic}]][,{Pack|NoPack}]
  • --hint=<type>: Hints for compute-bound (compute_bound) and memory-bound (memory_bound) applications, as well as multithread / nomultithread
  • --cpu-bind=[{quiet|verbose},]<type> (srun)
  • Mapping between the application’s processes/threads and the job’s resources
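
Two hedged examples of how such binding options might be used with srun; the exact behaviour depends on the site’s Slurm configuration, and my_app is again a placeholder.

    # Bind each task to cores and report the chosen binding verbosely;
    # distribute tasks block-wise across nodes and cyclically across sockets
    srun --cpu-bind=verbose,cores --distribution=block:cyclic ./my_app

    # Use only one hardware thread per core (no SMT/hyper-threading)
    srun --hint=nomultithread ./my_app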

Maybe discuss:

  • Minimizing/maximizing involved number of nodes
    • Shared nodes: longer waiting times until a whole node becomes free
    • Minimizing the number of nodes minimizes inter-node communication; maximizing it does the opposite
  • Different wait times for certain configurations
    • A few tasks spread over many shared nodes might be scheduled sooner than many tasks packed onto a few exclusive nodes.
  • What is a task versus a process – what is the difference?
  • Requesting more memory than the per-core share -> idle cores (see the worked example after this list)
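
A small worked example for the last point, with purely hypothetical numbers; job.sh is a placeholder script.

    # A node with 64 cores and 256 GB of RAM offers 4 GB per core.
    # A single-core task requesting 16 GB ties up the memory of 4 cores;
    # depending on the configuration (MaxMemPerCPU), Slurm may even raise
    # the allocated CPU count accordingly, leaving the extra cores idle.
    sbatch --ntasks=1 --cpus-per-task=1 --mem-per-cpu=16G job.sh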

Changing requirements


  • Motivate why requirements might change (higher simulation resolution, more data, a more complex model, …)
  • How to change the requested resources if the application should run differently (e.g. with more processes)? See the rescaling sketch after this list.
  • Considerations & estimates for
    • changing compute-time (more/less workload)
    • changing memory requirements (smaller/larger model)
    • changing number of processes / nodes
    • changing I/O -> more/less or larger/smaller files
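
As an illustration, the earlier sketch could be rescaled roughly like this; all numbers are hypothetical and have to be derived from your own estimates.

    # Twice the number of processes and twice the memory per process,
    # plus extra head room on the time limit for the larger workload
    sbatch --ntasks=16 --cpus-per-task=4 --mem-per-cpu=4G --time=04:00:00 job.sh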

Exercise: Changing requirements

  • Walk through how to estimate the increase in CPU cores, memory, etc.
  • Run the previous job with a larger workload
  • Check whether and how it behaves differently from the smaller job (see the sacct/seff sketch after this list)
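
After both runs have finished, their actual resource usage can be compared, for instance along these lines; job IDs are placeholders, and seff is a contributed tool that is only available where the site has installed it.

    # Elapsed time, CPU time, peak memory, and CPU count of a finished job
    sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,NCPUS,State

    # Summary of CPU and memory efficiency of a finished job
    seff <jobid>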

Summary


Discussion: Recollection

Circle back to efficiency. What’s considered good or efficient in the context of job requirements and parameters?

Key Points

  • Estimate resource requirements and request them in terms the scheduler understands
  • Be aware of your job in relation to the whole system (available hardware, size)
  • Aim for a good match between requested and utilized resources
  • Achieve an optimal time-to-solution by minimizing batch-queue waiting times and maximizing parallelism