Thread Affinitization

Pin threads to specific CPU cores for optimal performance, reduced jitter, and NUMA optimization.

Prerequisites: Before diving into configuration, review the Threading Model page to understand the architectural concepts and design rationale behind Rumi's threading architecture.

Overview

To achieve the lowest possible latency and best throughput with minimal jitter, Rumi supports the ability to pin critical threads to individual CPU cores. This section presents two approaches for affinitizing your microservice.

Terminology and Concepts

Before diving into configuration details, let's review key terminology:

  • CPU Socket: A physical connector on a motherboard that accepts a single physical CPU package. Modern CPUs provide multiple physical cores, which are exposed to the operating system as logical CPUs capable of executing in parallel. See Also: CPU Socket

  • NUMA: Non-Uniform Memory Access refers to the commonplace architecture in which machines with multiple CPU sockets divide the banks of RAM into memory nodes on a per-socket basis. Access to memory on a socket's "local" memory node is faster than access to memory on a remote node tied to a different socket. See Also: NUMA

  • CPU Core: Contemporary CPUs typically contain multiple cores, each of which is exposed to the underlying OS as one or more logical CPUs. See Also: Multi-core processing

  • Hyper-threading: Intel technology that makes a single physical core appear to the OS as two logical CPUs on the same chip to improve performance

  • Logical CPU: What the operating system sees as a CPU. The number of logical CPUs available to the OS is: <num sockets> * <cores per socket> * <hyper threads per core>. For example, 2 sockets * 6 cores per socket * 2 hyper-threads per core = 24 logical CPUs.

  • Processor Affinity: The act of restricting the set of logical CPUs on which a particular program thread can execute

Benefits of Thread Affinitization

Pinning a thread to a particular CPU ensures that the OS won't reschedule the thread to another core and incur a context switch that would force the thread to reload its working state from main memory, which results in jitter. When all critical threads in the processing pipeline are pinned to their own CPU and busy spinning, the OS scheduler is less likely to schedule another thread onto that core, keeping the threads' processor caches hot.

Preparing for Affinitization

To get the most out of affinitization, each busy-spinning thread should be pinned to its own CPU core which prevents the operating system from relocating the thread to another logical CPU while the program is executing.

Identifying Busy-Spinning Threads

Any platform threads that are marked as critical in the Thread Reference (see the Threading Model page) should be affinitized. An easy way to see which threads are busy spinning is to enable container thread stats and trace:
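The snippet below is a sketch only: the property names (nv.stats.thread.enable, nv.stats.thread.trace) and the DDL layout are assumptions used for illustration, so consult your platform's configuration reference for the authoritative flag names. Conceptually, the env section of the DDL enables periodic per-thread CPU statistics and traces them to the container log:

    <env>
        <!-- Assumed flag name: enable periodic per-thread CPU usage stats -->
        <nv.stats.thread.enable>true</nv.stats.thread.enable>
        <!-- Assumed flag name: trace the collected thread stats to the log -->
        <nv.stats.thread.trace>true</nv.stats.thread.trace>
    </env>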

Assuming your machine has enough CPUs that no two critical threads are scheduled on the same CPU, any thread that consistently uses >90% CPU while your microservice is not processing messages is one that will benefit from affinitization. Counting the busy-spinning threads also lets you determine whether it is possible to pin them all to processors on the same NUMA node.

Determining CPU Layout

CPU layout is machine dependent. Before configuring CPU affinity masks, it is necessary to determine the CPU layout on the target machine. Rumi includes a utility class, UtlThread, that can be run to assist with this:
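For example, you might run the class directly from the command line. The jar name and the UtlThread package below are placeholders; substitute the ones from your Rumi installation:

    # Placeholder jar and package: adjust to your Rumi distribution.
    java -cp rumi-core.jar com.rumi.util.UtlThread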

which will produce output similar to the following:
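The exact output format varies by Rumi version and OS; the abbreviated mock-up below simply illustrates the kind of layout report to expect for the machine discussed next (field names are assumptions):

    cpuId=0   socketId=0  coreId=0   threadId=0
    cpuId=1   socketId=0  coreId=1   threadId=0
    cpuId=2   socketId=0  coreId=2   threadId=0
    cpuId=3   socketId=0  coreId=8   threadId=0
    cpuId=4   socketId=0  coreId=9   threadId=0
    cpuId=5   socketId=0  coreId=10  threadId=0
    cpuId=6   socketId=1  coreId=0   threadId=0
    ...
    cpuId=12  socketId=0  coreId=0   threadId=1
    ...
    cpuId=23  socketId=1  coreId=10  threadId=1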

In the above, we can see:

  • The machine has 24 logical CPUs (0 through 23)

  • There are two processor sockets (socketId=0, socketId=1)

  • There are 12 physical cores total - 6 physical cores per socket (coreIds 0, 1, 2, 8, 9, and 10)

  • Hyper-threading is enabled and there are two hardware threads per core (threadId=0, threadId=1)

Note: How the OS assigns logical CPU and core numbers is OS dependent.

Linux Only: The UtlThread class is only supported on Linux currently. Eventually, support for other platforms will be added.

Best Practices

  • Before launching your process, validate that there aren't other processes running that are spinning on a core to which you are affinitizing

  • Check what other processes on the host will use busy spinning and find out the cores they will use

  • On Linux, the OS often uses core 0 for some of its own tasks, so it is better to avoid this core if possible

  • When feasible, it is best to disable hyper-threading to maximize the amount of CPU cache available to each logical CPU

Approach 1: Basic Affinitization

The basic affinitization approach requires no additional DDL configuration and simply uses the numactl command to restrict the NUMA memory nodes and logical CPUs on which your microservice can execute. Using this approach can be a good first step in evaluating the performance benefits of affinitizing your microservice, but is not ideal for reducing jitter.

Pros:

  • Simple

  • Avoids remote NUMA node access

Cons:

  • Does not prevent thread context switches; the OS is free to move threads between logical CPUs which leads to jitter

  • Does not provide visibility into which CPU a thread is running on, making it harder to diagnose cases where two critical threads are scheduled on the same core

Launching with Basic Affinitization

Given the CPU layout determined above, one could launch a microservice with its memory pinned to NUMA node 1 and restricted to the CPUs on socket 1 as follows:
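A sketch of the launch command is shown below, assuming NUMA node 1 corresponds to socket 1 (as is typical on a two-socket machine) and using my-microservice.jar as a placeholder for your actual launch command:

    numactl --cpunodebind=1 --membind=1 java -jar my-microservice.jar

Here --cpunodebind restricts the process's threads to the CPUs of node 1, while --membind forces memory allocations to come from node 1's memory.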

Refer to your numactl manual pages for more information.

Validating Basic Affinitization

With basic affinitization, it isn't straightforward to determine which CPUs a particular thread ends up running on, but you can use a command like top to validate that all of your microservice's threads are running on the expected node by pressing the '1' key after launching top to show per-CPU usage. With enough effort, it may be possible to correlate the thread IDs displayed in a stack dump with those shown in a tool such as htop, but that is outside the scope of this document.

Approach 2: Advanced Affinitization

For microservices that are most concerned with reducing jitter, the basic affinitization approach described above still leaves open the possibility that the operating system relocates your threads from one CPU to another, which can lead to latency spikes. With the advanced affinitization approach described here, you avoid this by pinning each busy-spinning or critical thread in the microservice to its own CPU so that it is not context switched.

CPU Affinity Mask Format

Thread affinities are configured by supplying a mask that indicates the CPUs on which a thread can run. The mask can be either a long value interpreted as a bit mask of logical CPUs (bit N selects logical CPU N), or a square-bracket-enclosed, comma-separated list enumerating the logical CPUs to which a thread should be affinitized. The latter format is recommended as it is easier to read.

Examples:

  • "0" - no affinity specified (empty bit mask)

  • "[]" - no affinity specified (empty list)

  • "1" - mask specifying logical CPU 0 (binary 0001)

  • "[0]" - list specifying logical CPU 0

  • "4" - mask specifying logical CPU 2 (binary 0100)

  • "[2]" - list specifying logical CPU 2

  • "6" - mask specifying logical CPUs 1 and 2 (binary 0110)

  • "4294967296" - mask specifying logical CPU 32 (2^32, i.e. binary 1 followed by 32 zeros)

  • "[32]" - list specifying logical CPU 32

  • "[1,2]" - list specifying logical CPUs 1 and 2

Enabling Affinitization

By default, CPU affinitization is disabled. To enable it, set the following env flags in the DDL configuration:
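A sketch of what this looks like in the DDL's env section is shown below. The nv.enablecpuaffinitymasks flag name and the XML layout are assumptions for illustration (only nv.defaultcpuaffinitymask is referenced elsewhere on this page), so verify the exact names against your platform's configuration reference:

    <env>
        <!-- Assumed flag name: turns on CPU affinitization platform-wide -->
        <nv.enablecpuaffinitymasks>true</nv.enablecpuaffinitymasks>
    </env>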

Configuring CPU Affinities

Step 1: Configure Default CPU Affinity Mask

Threads that are critical for reducing microservice latency and improving throughput are listed in the Thread Reference (see the Threading Model page), but not all threads are critical. To prevent non-critical threads from being scheduled on a CPU used by a critical thread, the platform allows the microservice to configure one or more 'default' CPUs to which non-critical threads are affinitized, by setting the nv.defaultcpuaffinitymask environment variable. For example, the platform's statistics collection thread doesn't need its own dedicated CPU to perform its relatively simple task of periodically reporting heartbeats. However, we still want to ensure that the operating system doesn't schedule it onto the same core as a critical thread, so the platform affinitizes it with the default CPU affinity mask.
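For example, to affinitize non-critical threads to logical CPU 1 using the list form (the XML layout below is illustrative; nv.defaultcpuaffinitymask is the property named above):

    <env>
        <!-- Non-critical platform threads are affinitized to logical CPU 1 -->
        <nv.defaultcpuaffinitymask>[1]</nv.defaultcpuaffinitymask>
    </env>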

Step 2: Configure Critical Platform Threads Affinities

Critical platform-related threads are those that have the most impact on latency and performance. When the platform is optimized for latency or throughput, these threads are set to use BusySpin or Yielding, respectively, to avoid being context switched. Each of these threads should be assigned its own CPU.

See the Critical Thread Affinity Configuration Reference section below for a listing of these threads and how to configure their affinities.

Step 3: Affinitizing Non-Platform Threads

If your microservice uses its own threads, they can be affinitized as well using the platform's UtlThread utility class. Non-critical, non-spinning threads should be affinitized to the default cores, and critical or busy threads should be pinned to their own cores to prevent them from being scheduled on top of the platform's threads.

Non-Critical, Non-Spinning Thread

Non-critical threads can be affinitized to the set of default CPUs configured by nv.defaultcpuaffinitymask by calling setDefaultCpuAffinityMask from the thread to be affinitized:
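A minimal sketch is shown below; the UtlThread package is assumed (adjust the import to your Rumi distribution), and the worker logic is a placeholder:

    import com.rumi.util.UtlThread; // assumed package; adjust to your distribution

    public class HousekeepingWorker implements Runnable {
        @Override
        public void run() {
            // Affinitize this non-critical, non-spinning thread to the CPUs
            // configured via nv.defaultcpuaffinitymask. Must be called from
            // the thread being affinitized.
            UtlThread.setDefaultCpuAffinityMask();

            // ... periodic, non-latency-critical work goes here ...
        }
    }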

Critical or Busy Threads

Threads that participate in your transaction processing flow, busy-spin, or are otherwise heavy CPU users should each be pinned to their own core so that they don't interfere with affinitized platform threads. For example:
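The sketch below pins an application thread to its own logical CPU. The setCpuAffinityMask method name and the UtlThread package are assumptions; consult the UtlThread javadoc in your distribution for the exact API. Logical CPU 4 is an arbitrary example using the bracketed list format described earlier:

    import com.rumi.util.UtlThread; // assumed package; adjust to your distribution

    public class OrderFeedPoller implements Runnable {
        @Override
        public void run() {
            // Pin this busy-spinning thread to logical CPU 4 (list form) so the
            // OS cannot migrate it onto a core used by a critical platform thread.
            // Method name is assumed; call it from the thread being pinned.
            UtlThread.setCpuAffinityMask("[4]");

            while (true) {
                // ... busy-spin polling work goes here ...
            }
        }
    }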

Launching with NUMA Affinitization

Unlike with the basic affinitization approach, when all threads have been affinitized to their own core or to the default cores, it is not strictly necessary to restrict the cores on which the process runs; only the memory node needs to be restricted. In fact, it can even be beneficial to let threads outside the platform's or microservice's control be scheduled on other NUMA nodes.
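For example, to bind memory allocation to NUMA node 1 while leaving CPU placement to the per-thread affinities configured above (my-microservice.jar is a placeholder for your launch command):

    numactl --membind=1 java -jar my-microservice.jar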

Validating Affinitization

Via Thread Stats Output

The easiest way to check your work is to enable container thread stats. Thread stats are emitted in heartbeats and affinities can be reported in monitoring tools. If the container is configured to trace thread stats, then thread usage is printed as follows:
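The thread names and layout below are a mock-up for illustration only; the important parts are the per-thread CPU% and the aff= affinity annotation, whose format is explained below the pitfalls list:

    critical-thread-a    cpu=100.0%  (aff=[4(s0c9t0)])
    critical-thread-b    cpu=99.9%   (aff=[5(s0c10t0)])
    housekeeping-thread  cpu=0.2%    (aff=[1(s0c1t0)])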

You can look for any spinning thread (CPU% at 100) that doesn't have an affinity assigned. This will help you avoid the following pitfalls:

  • Having two threads spinning on the same coreId will make performance worse (whether on the same coreId with different threadIds, or worse, on the same coreId and threadId)

  • Having some other non-Rumi process spinning on one of the coreIds that you've affinitized to

  • Affinitizing across multiple socketIds (which are on different NUMA nodes) can make performance worse

  • Your maximum heap size will be limited by the amount of physical memory in the memory bank of the NUMA node to which you are pinning

The platform outputs thread affinity using a format like (aff=[6(s0c9t0)]), which can be interpreted as logical CPU 6, which is on socket 0, core 9, thread 0.

Programmatically

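The sketch below illustrates the idea; the package and the dumpAffinities method name are assumptions, so check the UtlThread javadoc in your distribution for the actual entry point:

    import com.rumi.util.UtlThread; // assumed package; adjust to your distribution

    public class AffinityReport {
        public static void main(String[] args) {
            // Assumed method name: logs the affinity of every thread that was
            // affinitized through UtlThread.
            UtlThread.dumpAffinities();
        }
    }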
This will dump the affinitization state of all threads affinitized through UtlThread.

Via Trace

The above trace will also be printed by an AepEngine after messaging has been started or alternatively when it assumes a backup role (in most cases all platform threads will have been started by this time).

Limitations

The following limitations apply to thread affinitization support:

  • Thread affinitization is currently only supported on Linux

  • Threads can only be affinitized to logical CPUs 0 through 63

  • Affinitization of a thread does not reserve the CPU core; it just limits the cores on which the thread will execute. This is important because, if not all threads are affinitized, the OS thread scheduler may schedule another thread on top of a critical thread when CPU resources are scarce


Next Steps

  1. Review the Threading Model to understand NUMA and affinitization

  2. Determine your machine's CPU layout using UtlThread

  3. Start with basic affinitization to evaluate benefits

  4. Move to advanced affinitization if jitter reduction is critical

  5. Test performance with and without affinitization
