Pinning
Last updated on 2025-10-31
Estimated time: 10 minutes
Overview
Questions
- What is “pinning” of job resources?
- How can pinning improve performance?
- How can I see if pinning resources would help?
- What requirement hints can I give to the scheduler?
Objectives
After completing this episode, participants should be able to …
- Define the concept of “pinning” and explain how it can affect job performance.
- Name Slurm’s options for memory and CPU binding.
- Use hints to tell Slurm how to optimize their job allocation.
Narrative:
- We get the feeling that hardware has a lot to offer, but the rabbit hole is deep!
- What are the “dimensions” in which we can optimize the throughput of snowman pictures per hour?
- Can we improve how the work maps to certain CPUs / Memory regions?
What we’re doing here:
- Introduce pinning and Slurm hint options
- Relate to hardware effects
- Use third-party performance tools to observe the effects!
Stick to simple options here. Put more complex options for pinning, hints, etc. into their own episode later in the course.
Pinning is an important part of job optimization, but it requires some knowledge, e.g. about the hardware hierarchies in a cluster, NUMA, etc., so it should be covered after we have introduced the different performance reports and their perspective on hardware.
Maybe point to the JSC pinning simulator and provide similar diagrams as an independent “offline” version in this course.
Binding / pinning options (a short sketch follows the list):
- --mem-bind=[{quiet|verbose},]<type>
- -m, --distribution={*|block|cyclic|arbitrary|plane=<size>}[:{*|block|cyclic|fcyclic}[:{*|block|cyclic|fcyclic}]][,{Pack|NoPack}]
- --hint=: hints for CPU-bound (compute_bound) and memory-bound (memory_bound) applications, but also multithread and nomultithread
- --cpu-bind=[{quiet|verbose},]<type> (srun): mapping of application <-> job resources
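A minimal sketch of how these options might look on the command line, assuming the raytracer is launched with srun; the values are illustrative, and on many systems --hint cannot be combined with an explicit --cpu-bind setting, so it is shown on a separate line:
srun --ntasks=8 --cpu-bind=verbose,cores --mem-bind=verbose,local ./raytracer -width=512 -height=512 -spp=128 -threads=1 -alloc_mode=3 -png=snowman.png
srun --ntasks=8 --hint=nomultithread ./raytracer -width=512 -height=512 -spp=128 -threads=1 -alloc_mode=3 -png=snowman.png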
Motivation
Exercise
Case 1: 1 thread per rank
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=1 -alloc_mode=3 -png=snowman.png
Case 2: 2 threads per rank
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=2 -alloc_mode=3 -png=snowman.png
Questions:
- Do you notice any difference in runtime between the two cases?
- Is the increase in threads providing a speedup as expected?
- Observation: The computation times are almost the same.
- Expected behavior: Increasing threads should ideally reduce runtime.
- Hypothesis: Additional threads do not contribute.
How to investigate?
You can verify the actual core usage in two ways:
1. Use --report-bindings with mpirun (a sketch follows below)
2. Use the htop command on the compute node
Getting onto the compute node is cluster specific. It can usually be done in one of two ways:
1. srun --pty --overlap --jobid=<jobid> /bin/bash
2. Check which node the job runs on and log in to that node via SSH
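As a sketch, option 1 could look like the following (same raytracer arguments as above); --report-bindings is an Open MPI option that prints the process-to-core mapping at startup:
mpirun --report-bindings -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=2 -alloc_mode=3 -png=snowman.png
For option 2, once you have a shell inside the job, run htop and watch how many cores show activity.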
Exercise
Follow any one of the options above and run with 2 threads per rank:
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=2 -alloc_mode=3 -png=snowman.png
Questions:
- Did you find any justification for the hypothesis we made?
Only 8 cores are active instead of 16
Explanation:
- Even though we requested 2 threads per MPI rank, both threads are pinned to the same core.
- The second thread waits for the first thread to finish, so no actual thread-level parallelization is achieved.
- Current behavior: the two threads overlap on the same core.
- Expected behavior: the threads are pinned to separate cores.
How to achieve this?
Exercise: Understanding Process and Thread Binding
Pinning (or binding) means locking a process or thread to a specific hardware resource such as a CPU core, socket, or NUMA region. Without pinning, the operating system may move tasks between cores, which can reduce cache reuse and increase memory latency, directly diminishing performance.
In this exercise we will explore how MPI process and thread binding works. We will try binding to core, socket, and NUMA region, and observe the resulting timings and bindings.
- This exercise assumes the following hardware setup:
  - Dual-socket system (2 sockets, 48 cores per socket, 8 NUMA regions, 96 cores total).
  - Each MPI process can use multiple threads (-threads) for parallel execution.
- The idea is to demonstrate oversubscription by starting more MPI processes than there are sockets or NUMA regions, or by over-allocating threads per domain.
- You are free to adjust -n and -threads based on your cluster (commands to check your node’s layout are sketched right after this list).
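To see how a node is organized before choosing binding options, standard tools such as lscpu or numactl --hardware (if installed) report the sockets, cores, and NUMA regions:
lscpu
numactl --hardware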
Exercise
Case 1: --bind-to numa
mpirun -n 8 --bind-to numa ./raytracer -width=512 -height=512 -spp=128 -threads=12 -alloc_mode=3 -png=snowman.png
Case 2: --bind-to socket
mpirun -n 4 --bind-to socket ./raytracer -width=512 -height=512 -spp=128 -threads=48 -alloc_mode=3 -png=snowman.png
Questions:
- What is the difference between Case 1 and Case 2? Is there any difference in performance? How many workers do you get?
- How could you adjust process/thread counts to better utilize the hardware in Case 2?
- MPI and thread pinning is hardware-aware.
- If the number of processes matches the number of domains (socket or NUMA), then the number of threads should equal the cores per domain to fully utilize the node.
- No speedup in Case 2: Oversubscription occurs because we requested 4 processes on a system with only 2 sockets.
- Threads compete for the same cores → they are queued and have to wait until other threads finish instead of running in parallel (one possible adjustment is sketched below).
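One possible adjustment for Case 2, assuming the 2-socket, 96-core node described above, is to match the number of processes to the number of sockets so that the 48 threads of each process get a full socket to themselves:
mpirun -n 2 --bind-to socket ./raytracer -width=512 -height=512 -spp=128 -threads=48 -alloc_mode=3 -png=snowman.png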
Best Practices for MPI Process and Thread Pinning
Mapping vs. Binding Analogy
Think of running MPI processes and threads like booking seats for a group of friends:
- Mapping is like planning where your group will sit in the theatre or on a flight.
  - Example: You decide some friends sit in Economy, some in Premium Economy, and some in Business.
  - Similarly, --map-by distributes MPI ranks across nodes, sockets, or NUMA regions.
- Binding is like reserving the exact seats for each friend in the planned area.
  - Example: Once the seating area is chosen, you assign specific seat numbers to each friend.
  - Similarly, --bind-to pins each MPI process or thread to a specific core or hardware unit to avoid movement.
This analogy helps illustrate why mapping defines placement and binding enforces it.
We will use --bind-to core (the smallest hardware unit) and --map-by to distribute MPI processes efficiently across sockets, NUMA regions, or nodes.
Choosing the Smallest Hardware Unit
Binding processes to the smallest unit (core) is recommended because:
- Exclusive use of resources: Each process or thread is pinned to its own core, preventing multiple threads or processes from competing for the same CPU.
- Predictable performance: When processes share cores, execution times can fluctuate due to scheduling conflicts. Binding to cores ensures consistent timing across runs.
- Best practice: Always bind processes to the smallest unit (core) and spread processes evenly across the available hardware using --map-by.
- Example options (a combined example follows this list):
  - --bind-to core → binds each process to a dedicated core (avoids oversubscription).
  - --map-by socket:PE=<threads> → spreads processes across sockets, assigning <threads> cores (processing elements) per process.
  - --map-by numa:PE=<threads> → spreads processes across NUMA domains, assigning <threads> cores per process.
  - --cpus-per-rank <n> → assigns <n> cores (hardware threads) to each MPI rank, ensuring that all threads within a rank occupy separate cores.
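As a sketch for the assumed 2-socket, 8-NUMA-region node with 96 cores, the options can be combined so that each of the 8 ranks gets its own NUMA region with 12 dedicated cores; the numbers are illustrative and should be adapted to your cluster:
mpirun -n 8 --map-by numa:PE=12 --bind-to core ./raytracer -width=512 -height=512 -spp=128 -threads=12 -alloc_mode=3 -png=snowman.png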
Exercise
Use the best practices given above for Case 1 (-n 8, -threads=1) and Case 2 (-n 8, -threads=4) and answer the following questions.
Questions:
- How many cores do the two jobs use?
- Did you get more workers than you requested?
- Did you see the expected scaling when running with 4 threads?
- 8 and 32 cores, respectively.
- No.
- Yes.
Summary
- Always check how pinning works: Use verbose reporting (e.g., --report-bindings) to see how MPI processes and threads are mapped to cores and sockets.
- Documentation is your friend: For Open MPI with mpirun, consult the manual: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php
- Know your hardware: Understanding the number of sockets, cores per socket, and NUMA regions on your cluster helps you make effective binding decisions.
- Avoid oversubscription: Assigning more threads or processes than available cores hurts performance; it causes contention and idle waits.
- Recommended practice for Open MPI: Use --bind-to core along with --map-by to control placement and threads per process to maximize throughput.