Thread Affinitization

Pin threads to specific CPU cores for optimal performance, reduced jitter, and NUMA optimization.

Prerequisites: Before diving into configuration, review the Threading Model page to understand the architectural concepts and design rationale behind Rumi's threading architecture.

Overview

To achieve the lowest possible latency and best throughput with minimal jitter, Rumi supports the ability to pin critical threads to individual CPU cores. This section presents two approaches for affinitizing your microservice.

Terminology and Concepts

Before diving into configuration details, let's review key terminology:

  • CPU Socket: A physical connector on a motherboard that accepts a single physical CPU package. Modern CPUs provide multiple physical cores, which are exposed to the operating system as logical CPUs capable of executing in parallel. See Also: CPU Socket

  • NUMA: Non-Uniform Memory Access refers to the commonplace architecture in which machines with multiple CPU sockets divide the banks of RAM into memory nodes on a per-socket basis. Access to memory on a socket's "local" memory node is faster than access to memory on a remote node tied to a different socket. See Also: NUMA

  • CPU Core: Contemporary CPUs typically contain multiple cores, each of which is exposed to the underlying OS as one or more logical CPUs. See Also: Multi-core processing

  • Hyper-threading: Intel technology that makes a single physical core appear to the OS as two logical CPUs on the same chip to improve performance

  • Logical CPU: What the operating system sees as a CPU. The number of logical CPUs available to the OS is: <num sockets> * <cores per socket> * <hyper threads per core>. For example, 2 sockets * 6 cores per socket * 2 hyper-threads per core = 24 logical CPUs.

  • Processor Affinity: The act of restricting the set of logical CPUs on which a particular program thread can execute

Benefits of Thread Affinitization

Pinning a thread to a particular CPU ensures that the OS won't reschedule the thread to another core and incur a context switch that would force the thread to reload its working state from main memory, which results in jitter. When all critical threads in the processing pipeline are pinned to their own CPU and busy spinning, the OS scheduler is less likely to schedule another thread onto that core, keeping the threads' processor caches hot.

Preparing for Affinitization

To get the most out of affinitization, each busy-spinning thread should be pinned to its own CPU core which prevents the operating system from relocating the thread to another logical CPU while the program is executing.

Identifying Busy-Spinning Threads

Any platform threads that are marked as critical in the Thread Reference (see the Threading Model page) should be affinitized. An easy way to see which threads are busy spinning is to enable container thread stats and trace:
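The snippet below is a sketch only: the property names (nv.stats.thread.enable, nv.stats.thread.trace) and the DDL layout are assumptions used for illustration, so consult your platform's configuration reference for the authoritative flag names. Conceptually, the env section of the DDL enables periodic per-thread CPU statistics and traces them to the container log:

    <env>
        <!-- Assumed flag name: enable periodic per-thread CPU usage stats -->
        <nv.stats.thread.enable>true</nv.stats.thread.enable>
        <!-- Assumed flag name: trace the collected thread stats to the log -->
        <nv.stats.thread.trace>true</nv.stats.thread.trace>
    </env>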

Assuming your machine has enough CPUs that no two critical threads are scheduled on the same CPU, any thread that consistently uses >90% CPU while your microservice is not processing messages is one that will benefit from affinitization. Counting the busy-spinning threads also lets you determine whether it is possible to pin them all to processors on the same NUMA node.

Determining CPU Layout

CPU layout is machine dependent. Before configuring CPU affinity masks, it is necessary to determine the CPU layout on the target machine. Rumi includes a utility class, UtlThread, that can be run to assist with this:
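For example, you might run the class directly from the command line. The jar name and the UtlThread package below are placeholders; substitute the ones from your Rumi installation:

    # Placeholder jar and package: adjust to your Rumi distribution.
    java -cp rumi-core.jar com.rumi.util.UtlThread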

which will produce output similar to the following:
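The exact output format varies by Rumi version and OS; the abbreviated mock-up below simply illustrates the kind of layout report to expect for the machine discussed next (field names are assumptions):

    cpuId=0   socketId=0  coreId=0   threadId=0
    cpuId=1   socketId=0  coreId=1   threadId=0
    cpuId=2   socketId=0  coreId=2   threadId=0
    cpuId=3   socketId=0  coreId=8   threadId=0
    cpuId=4   socketId=0  coreId=9   threadId=0
    cpuId=5   socketId=0  coreId=10  threadId=0
    cpuId=6   socketId=1  coreId=0   threadId=0
    ...
    cpuId=12  socketId=0  coreId=0   threadId=1
    ...
    cpuId=23  socketId=1  coreId=10  threadId=1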

In the above, we can see:

  • The machine has 24 logical CPUs (0 through 23)

  • There are two processor sockets (socketId=0, socketId=1)

  • There are 12 physical cores total - 6 physical cores per socket (coreIds 0, 1, 2, 8, 9, and 10)

  • Hyper-threading is enabled and there are two hardware threads per core (threadId=0, threadId=1)

Note: How the OS assigns logical CPU and core numbers is OS dependent.

Linux Only: The UtlThread class is only supported on Linux currently. Eventually, support for other platforms will be added.

Best Practices

  • Before launching your process, validate that there aren't other processes running that are spinning on a core to which you are affinitizing

  • Check what other processes on the host will use busy spinning and find out the cores they will use

  • On Linux, the OS often uses core 0 for some of its own tasks, so it is better to avoid this core if possible

  • When feasible, it is best to disable hyper-threading to maximize the amount of CPU cache available to each logical CPU

Approach 1: Basic Affinitization

The basic affinitization approach requires no additional DDL configuration and simply uses the numactl command to restrict the NUMA memory nodes and logical CPUs on which your microservice can execute. Using this approach can be a good first step in evaluating the performance benefits of affinitizing your microservice, but is not ideal for reducing jitter.

Pros:

  • Simple

  • Avoids remote NUMA node access

Cons:

  • Does not prevent thread context switches; the OS is free to move threads between logical CPUs which leads to jitter

  • Does not provide visibility into which CPU a thread is running on, making it harder to diagnose cases where two critical threads are scheduled on the same core

Launching with Basic Affinitization

Given the CPU layout determined above, one could launch a microservice with its memory pinned to NUMA node 1 and restricted to the CPUs on socket 1 as follows:
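A sketch of the launch command is shown below, assuming NUMA node 1 corresponds to socket 1 (as is typical on a two-socket machine) and using my-microservice.jar as a placeholder for your actual launch command:

    numactl --cpunodebind=1 --membind=1 java -jar my-microservice.jar

Here --cpunodebind restricts the process's threads to the CPUs of node 1, while --membind forces memory allocations to come from node 1's memory.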

Refer to your numactl manual pages for more information.

Validating Basic Affinitization

With basic affinitization, it isn't straightforward to determine which CPUs a particular thread ends up running on, but you can use a command like top to validate that all of your microservice's threads are running on the expected node by pressing the '1' key after launching top to show per-CPU usage. With enough effort, it may be possible to correlate the thread IDs displayed in a stack dump with those shown in a tool such as htop, but that is outside the scope of this document.

Approach 2: Advanced Affinitization

For microservices that are most concerned with reducing jitter, the basic affinitization approach described above still leaves open the possibility that the operating system relocates your threads from one CPU to another, which can lead to latency spikes. With the advanced affinitization approach described here, you avoid this by pinning each busy-spinning or critical thread in the microservice to its own CPU so that it is not context switched.

CPU Affinity Mask Format

Thread affinities are configured by supplying a mask that indicates the CPUs on which a thread can run. The mask can be either a long value interpreted as a bit mask of logical CPUs (bit N selects logical CPU N), or a square-bracket-enclosed, comma-separated list enumerating the logical CPUs to which a thread should be affinitized. The latter format is recommended as it is easier to read.

Examples:

  • "0" - no affinity specified (empty bit mask)

  • "[]" - no affinity specified (empty list)

  • "1" - mask specifying logical CPU 0 (binary 0001)

  • "[0]" - list specifying logical CPU 0

  • "4" - mask specifying logical CPU 2 (binary 0100)

  • "[2]" - list specifying logical CPU 2

  • "6" - mask specifying logical CPUs 1 and 2 (binary 0110)

  • "4294967296" - mask specifying logical CPU 32 (2^32, i.e. binary 1 followed by 32 zeros)

  • "[32]" - list specifying logical CPU 32

  • "[1,2]" - list specifying logical CPUs 1 and 2

Enabling Affinitization

By default, CPU affinitization is disabled. To enable it, set the following env flags in the DDL configuration:
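A sketch of what this looks like in the DDL's env section is shown below. The nv.enablecpuaffinitymasks flag name and the XML layout are assumptions for illustration (only nv.defaultcpuaffinitymask is referenced elsewhere on this page), so verify the exact names against your platform's configuration reference:

    <env>
        <!-- Assumed flag name: turns on CPU affinitization platform-wide -->
        <nv.enablecpuaffinitymasks>true</nv.enablecpuaffinitymasks>
    </env>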

Configuring CPU Affinities

Step 1: Configure Default CPU Affinity Mask

Threads that are critical for reducing microservice latency and improving throughput are listed in the Thread Reference (see the Threading Model page), but not all threads are critical. To prevent non-critical threads from being scheduled on a CPU used by a critical thread, the platform allows the microservice to configure one or more 'default' CPUs to which non-critical threads are affinitized, by setting the nv.defaultcpuaffinitymask environment variable. For example, the platform's statistics collection thread doesn't need its own dedicated CPU to perform its relatively simple task of periodically reporting heartbeats. However, we still want to ensure that the operating system doesn't schedule it onto the same core as a critical thread, so the platform affinitizes it with the default CPU affinity mask.
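For example, to affinitize non-critical threads to logical CPU 1 using the list form (the XML layout below is illustrative; nv.defaultcpuaffinitymask is the property named above):

    <env>
        <!-- Non-critical platform threads are affinitized to logical CPU 1 -->
        <nv.defaultcpuaffinitymask>[1]</nv.defaultcpuaffinitymask>
    </env>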

Step 2: Configure Critical Platform Threads Affinities

Critical platform-related threads are those that have the most impact on latency and performance. When the platform is optimized for latency or throughput, these threads are set to use BusySpin or Yielding, respectively, to avoid being context switched. Each of these threads should be assigned its own CPU.

See the Critical Thread Affinity Configuration Reference section below for a listing of these threads and how to configure their affinities.

Step 3: Affinitizing Non-Platform Threads

If your microservice uses its own threads, they can be affinitized as well using the platform's UtlThread utility class. Non-critical, non-spinning threads should be affinitized to the default cores, and critical or busy threads should be pinned to their own cores to prevent them from being scheduled on top of the platform's threads.

Non-Critical, Non-Spinning Thread

Non-critical threads can be affinitized to the set of default CPUs configured by nv.defaultcpuaffinitymask by calling setDefaultCpuAffinityMask from the thread to be affinitized:
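A minimal sketch is shown below; the UtlThread package is assumed (adjust the import to your Rumi distribution), and the worker logic is a placeholder:

    import com.rumi.util.UtlThread; // assumed package; adjust to your distribution

    public class HousekeepingWorker implements Runnable {
        @Override
        public void run() {
            // Affinitize this non-critical, non-spinning thread to the CPUs
            // configured via nv.defaultcpuaffinitymask. Must be called from
            // the thread being affinitized.
            UtlThread.setDefaultCpuAffinityMask();

            // ... periodic, non-latency-critical work goes here ...
        }
    }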

Critical or Busy Threads

Threads that participate in your transaction processing flow, busy-spin, or are otherwise heavy CPU users should each be pinned to their own core so that they don't interfere with affinitized platform threads. For example:
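The sketch below pins an application thread to its own logical CPU. The setCpuAffinityMask method name and the UtlThread package are assumptions; consult the UtlThread javadoc in your distribution for the exact API. Logical CPU 4 is an arbitrary example using the bracketed list format described earlier:

    import com.rumi.util.UtlThread; // assumed package; adjust to your distribution

    public class OrderFeedPoller implements Runnable {
        @Override
        public void run() {
            // Pin this busy-spinning thread to logical CPU 4 (list form) so the
            // OS cannot migrate it onto a core used by a critical platform thread.
            // Method name is assumed; call it from the thread being pinned.
            UtlThread.setCpuAffinityMask("[4]");

            while (true) {
                // ... busy-spin polling work goes here ...
            }
        }
    }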

Launching with NUMA Affinitization

Unlike with the basic affinitization approach, when all threads have been affinitized to their own core or to the default cores, it is not strictly necessary to restrict the cores on which the process runs; only the memory node needs to be restricted. In fact, it can even be beneficial to let threads outside the platform's or microservice's control be scheduled on other NUMA nodes.
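For example, to bind memory allocation to NUMA node 1 while leaving CPU placement to the per-thread affinities configured above (my-microservice.jar is a placeholder for your launch command):

    numactl --membind=1 java -jar my-microservice.jar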

Validating Affinitization

Via Thread Stats Output

The easiest way to check your work is to enable container thread stats. Thread stats are emitted in heartbeats and affinities can be reported in monitoring tools. If the container is configured to trace thread stats, then thread usage is printed as follows:
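The thread names and layout below are a mock-up for illustration only; the important parts are the per-thread CPU% and the aff= affinity annotation, whose format is explained below the pitfalls list:

    critical-thread-a    cpu=100.0%  (aff=[4(s0c9t0)])
    critical-thread-b    cpu=99.9%   (aff=[5(s0c10t0)])
    housekeeping-thread  cpu=0.2%    (aff=[1(s0c1t0)])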

You can look for any spinning thread (CPU% at 100) that doesn't have an affinity assigned. This will help you avoid the following pitfalls:

  • Having two threads spinning on the same coreId will make performance worse (whether on the same coreId with different threadIds, or worse, on the same coreId and threadId)

  • Having some other non-Rumi process spinning on one of the coreIds that you've affinitized to

  • Affinitizing across multiple socketIds (which are on different NUMA nodes) can make performance worse

  • Your maximum heap size will be limited by the amount of physical memory in the memory bank of the NUMA node to which you are pinning

The platform outputs thread affinity using a format like (aff=[6(s0c9t0)]), which can be interpreted as logical CPU 6, which is on socket 0, core 9, thread 0.

Programmatically

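The sketch below illustrates the idea; the package and the dumpAffinities method name are assumptions, so check the UtlThread javadoc in your distribution for the actual entry point:

    import com.rumi.util.UtlThread; // assumed package; adjust to your distribution

    public class AffinityReport {
        public static void main(String[] args) {
            // Assumed method name: logs the affinity of every thread that was
            // affinitized through UtlThread.
            UtlThread.dumpAffinities();
        }
    }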
This will dump the affinitization state of all threads affinitized through UtlThread.

Via Trace

The above trace will also be printed by an AepEngine after messaging has been started or alternatively when it assumes a backup role (in most cases all platform threads will have been started by this time).

Limitations

The following limitations apply to thread affinitization support:

  • Thread affinitization is currently only supported on Linux

  • Threads can only be affinitized to logical CPUs 0 through 63

  • Affinitization of a thread does not reserve the CPU core; it just limits the cores on which the thread will execute. This is important because, if not all threads are affinitized, the OS thread scheduler may schedule another thread on top of a critical thread when CPU resources are scarce


Next Steps

  1. Review the Threading Model to understand NUMA and affinitization

  2. Determine your machine's CPU layout using UtlThread

  3. Start with basic affinitization to evaluate benefits

  4. Move to advanced affinitization if jitter reduction is critical

  5. Test performance with and without affinitization
