AEP Module

The AEP (Application Event Processing) module contains the canonical end-to-end performance benchmark for Rumi. This benchmark measures the complete Receive-Process-Send flow of a clustered microservice.

Canonical Benchmark: The AEP module's ESProcessor benchmark is used to generate Rumi's official performance metrics. Results from this benchmark are published in the Canonical Benchmark Results section.

Overview

The AEP module benchmarks exercise the entire Rumi stack, including:

  • Messaging: Inbound and outbound message handling

  • Serialization: Message encoding/decoding (Xbuf2)

  • Handler Dispatch: Event routing to business logic

  • State Management: Object store operations

  • Persistence: Transaction log writes

  • Clustering: State replication to backup instances

  • Consensus: Acknowledgment protocol between primary and backup

This represents the most comprehensive benchmark in the suite and is used to publish official performance metrics for Rumi releases.

Test Programs

The AEP module provides two test programs:

ESProcessor (Event Sourcing)

Class: com.neeve.perf.aep.engine.ESProcessor

The Event Sourcing processor is the canonical benchmark used for published Rumi performance results. It uses the Event Sourcing HA policy, in which:

  • Messages are the source of truth

  • State is replayed from the message log on recovery

  • Optimal for high-throughput message processing

Used for: Official Rumi performance benchmarks published in the Canonical Benchmark Results.

SRProcessor (State Replication)

Class: com.neeve.perf.aep.engine.SRProcessor

The State Replication processor uses the State Replication HA policy, in which:

  • State objects are the source of truth

  • State changes are replicated to backup

  • State is persisted and recovered directly

Used for: Benchmarking state-heavy applications.

Test Flow

The canonical benchmark exercises the following flow:

Primary Instance

  1. Receive - Inbound message arrives from test driver

  2. Decode - Deserialize message from wire format (Xbuf2)

  3. Dispatch - Route message to handler

  4. Process - Business logic reads message fields

  5. Create Response - Business logic creates outbound message

  6. Replicate - Replicate transaction to backup (concurrent with 7)

  7. Persist - Write transaction to log on primary

  8. Consensus ACK - Receive acknowledgment from backup

  9. Encode - Serialize outbound message

  10. Send - Transmit outbound message to test driver

Backup Instance

The backup instance (concurrent with primary's persist step):

  1. Receive Replication - Receive replicated transaction from primary

  2. Persist - Write transaction to log on backup

  3. Dispatch - Route message to handler

  4. Replay - Execute business logic for consistency

  5. Send ACK - Acknowledge to primary

Test Message

The benchmark uses a Car message (defined in nvx-perf-models) that:

  • Exercises the complete Xbuf2 data model

  • Contains ~200 bytes when serialized

  • Includes primitives, strings, nested objects, and arrays

  • Represents a realistic business message

Test Driver

The test uses a custom in-process driver (LocalProvider) that:

  • Injects messages at configurable rates

  • Measures wire-to-wire (w2w) latency

  • Captures latency distributions (50th, 99th, 99.9th percentiles)

  • Measures maximum throughput

  • Eliminates network overhead for consistent measurements
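The driver's injection rate, message count, and warm-up period map to the general command-line parameters documented later on this page. A brief illustration (the flag combination is chosen for illustration only):

```
# Inject 10 million messages at 100,000 msgs/sec, discarding the first 2 seconds
# as warmup and printing interval stats while the test runs.
--count 10000000 --rate 100000 --warmupTime 2 --printIntervalStats
```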

CPU Configurations

The benchmark is run in three CPU configurations:

MinCPU (1 CPU)

Minimal CPU footprint with all threads on dedicated cores:

Threads:

  • Business logic thread (hot, spinning)

  • Cluster replication reader (affinitized, not hot)

Configuration:
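The published MinCPU flags live in the test configuration file; as a rough sketch using the affinity parameters documented below (mask values are placeholders, and mapping the business-logic thread to the event multiplexer is an assumption):

```
# Sketch only: pin the event multiplexer (business logic) and the cluster
# replication reader; <cpu-mask> is a placeholder, not the published value.
--muxCPUAffinityMask <cpu-mask> \
--clusteringLinkReaderCPUAffinityMask <cpu-mask>
```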

Use Case: Resource-constrained environments

Default (2 CPUs)

Balanced configuration where Rumi decides thread allocation:

Threads: Automatically determined by Rumi

Configuration:
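In this configuration no affinity or detached-thread flags are passed; a minimal sketch (the exact published flags are not reproduced here):

```
# Sketch only: clustering and persistence enabled, thread placement left to Rumi.
--enableClustering --enablePersistence
```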

Use Case: General-purpose production deployments

MaxCPU (6 CPUs)

Maximum parallelization with additional detached threads:

Additional Threads:

  • Detached sender (hot, spinning)

  • Detached dispatcher (hot, spinning)

  • Detached persister

Configuration:
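A sketch of the additional flags implied by the thread list above (which detached sender the published configuration enables, bus or replication, is not stated here; masks are placeholders):

```
# Sketch only: detached send, detached replication dispatch, and a detached
# persister, each pinned via a placeholder mask.
--busDetachedSend \
--busDetachedSendCPUAffinityMask <cpu-mask> \
--clusteringDetachedDispatch \
--clusteringDetachedDispatcherCPUAffinityMask <cpu-mask> \
--persisterDetached \
--persisterWriterCPUAffinityMask <cpu-mask>
```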

Use Case: Ultra-low latency requirements with available CPU resources

Runtime Optimization Modes

The benchmark is run in two optimization modes:

Latency Mode

Optimizes for lowest latency:

Message Rate: 10,000 messages/second (sustained)

Measurement: Latency percentiles (50th, 99th, 99.9th)

Throughput Mode

Optimizes for highest throughput:

Message Rate: As fast as possible (saturated)

Measurement: Maximum messages per second

Message Access Methods

The benchmark tests two message access patterns:

Indirect Access (protobuf.serial/protobuf.random)

Message data is accessed via POJO getters and setters.
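The encoding names in the heading are values of the --encoding flag described under General Parameters; for example:

```
# Select serial indirect (POJO getter/setter) access.
--encoding protobuf.serial
```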

Direct Access (Serializer/Deserializer)

Message data is accessed via zero-copy serializers.

Direct access provides ~10% lower latency and 3x higher throughput.

Running the Canonical Benchmark

Prerequisites

  1. Two Linux containers with InfiniBand or 10GbE networking

  2. Rumi Perf distribution installed on both containers

  3. Synchronized time between containers

Example: Latency Test (Default Config)

On Primary (192.168.4.24):
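The original command line is not reproduced here; the following is an illustrative sketch that assumes ESProcessor is launched directly on the distribution's classpath (placeholder shown) and uses only the flags documented under Command-Line Parameters:

```
# Sketch only: latency mode at a sustained 10,000 msgs/sec with clustering and
# persistence enabled. The classpath placeholder and flag syntax are assumptions.
java -cp <rumi-perf-classpath> com.neeve.perf.aep.engine.ESProcessor \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.24 \
  --enablePersistence \
  --rate 10000 \
  --count 10000000 \
  --warmupTime 2
```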

On Backup (192.168.4.26):
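A matching sketch for the backup, under the same assumptions (the backup does not inject messages):

```
# Sketch only: same processor on the backup host; role assignment is assumed to
# follow start order (backup first, then primary, as noted below).
java -cp <rumi-perf-classpath> com.neeve.perf.aep.engine.ESProcessor \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.26 \
  --enablePersistence
```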

Note: Start the backup first, then start the primary. The primary will inject messages and report results.

Example: Throughput Test (MinCPU Config)

On Primary:
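Again an illustrative sketch only, combining the MinCPU affinity flags sketched earlier with an injection rate set well above capacity to saturate the pipeline (the published command lines are kept in the test configuration file):

```
# Sketch only: MinCPU throughput run. Masks, classpath, and the oversubscribed
# rate are assumptions.
java -cp <rumi-perf-classpath> com.neeve.perf.aep.engine.ESProcessor \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.24 \
  --enablePersistence \
  --muxCPUAffinityMask <cpu-mask> \
  --clusteringLinkReaderCPUAffinityMask <cpu-mask> \
  --rate 10000000 \
  --count 10000000
```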

Command-Line Parameters

General Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --encoding | -l | protobuf.serial | Message encoding: protobuf.serial, protobuf.random, quark.serial, quark.random |
| --count | -c | 10,000,000 | Number of messages to inject |
| --rate | -r | 100,000 | Message injection rate (msgs/sec) |
| --warmupTime | -t | 2 | Warmup time in seconds before collecting stats |
| --emptyMessage | -E | false | Don't populate message fields (minimal test) |
| --noLatencyWrites | -a | false | Don't write latency data to file |
| --printIntervalStats | -b | false | Print interval stats during test |

Output Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --outputFile | -O | null | Excel file to write results to |
| --outputCell | -C | null | Cell in Excel file (ROW-COL format) |
| --outputThroughput | -T | false | Write throughput instead of latency |
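For example, to record a throughput result into a spreadsheet cell (the file name is arbitrary and the ROW-COL value shown is a guess at the documented format):

```
# Write the measured throughput into the given cell of the Excel workbook.
--outputFile results.xlsx --outputCell 5-3 --outputThroughput
```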

CPU Affinity Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --injectorCPUAffinityMask | -j | null | CPU mask for message injector thread |
| --muxCPUAffinityMask | -y | null | CPU mask for event multiplexer thread |
| --busDetachedSendCPUAffinityMask | -o | null | CPU mask for detached sender thread |

Message Bus Parameters

Parameter
Short
Default
Description

--busDetachedSend

-u

false

Enable detached sending (separate thread)

--busDetachedSendQueueDepth

-n

1024

Depth of detached send queue

Persistence Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --enablePersistence | -e | false | Enable transaction log persistence |
| --persisterLogLocation | -k | . | Directory for transaction log |
| --persisterInitialLogLength | -i | 1 | Preallocated log length (GB) |
| --persisterZeroOutInitial | -z | false | Zero out preallocated log |
| --persisterWriteBufferSize | -w | 8192 | Write buffer size (bytes) |
| --persisterFlushOnCommit | -f | false | Flush log on every commit |
| --persisterFlushUsingMappedMemory | -m | false | Use memory-mapped flush |
| --persisterDetached | -d | false | Use detached persistence thread |
| --persisterWriterCPUAffinityMask | -x | null | CPU mask for detached persister thread |
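An illustrative combination of these flags (directory, flag syntax, and the choice of a detached persister are assumptions, not the published tuning):

```
# Persist to a preallocated, zeroed 1 GB log using a detached persister thread
# pinned via a placeholder mask.
--enablePersistence \
--persisterLogLocation /data/txlog \
--persisterInitialLogLength 1 \
--persisterZeroOutInitial \
--persisterDetached \
--persisterWriterCPUAffinityMask <cpu-mask>
```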

Clustering Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --enableClustering | -v | false | Enable clustering (primary/backup) |
| --clusteringLocalIfAddr | -I | 0.0.0.0 | Local interface for replication |
| --clusteringDiscoveryLocalIfAddr | -U | 0.0.0.0 | Local interface for discovery |
| --clusteringLinkSpinRead | -W | false | Spin on replication link reads |
| --clusteringLinkReaderCPUAffinityMask | -V | null | CPU mask for replication reader |
| --clusteringDetachedSend | -S | false | Enable detached replication send |
| --clusteringDetachedSenderCPUAffinityMask | -A | null | CPU mask for detached sender |
| --clusteringDetachedDispatch | -D | false | Enable detached replication dispatch |
| --clusteringDetachedDispatcherCPUAffinityMask | -B | null | CPU mask for detached dispatcher |
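An illustrative combination for a replication link tuned for low latency (the address and mask are placeholders, not published values):

```
# Enable clustering, bind replication to a specific interface, and spin on
# replication link reads with the reader pinned via a placeholder mask.
--enableClustering \
--clusteringLocalIfAddr 192.168.4.24 \
--clusteringLinkSpinRead \
--clusteringLinkReaderCPUAffinityMask <cpu-mask>
```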

Interpreting Results

Latency Results

The test outputs latency percentiles (50th, 99th, 99.9th) in microseconds. The reported wire-to-wire latency includes:

  • Inbound deserialization

  • Handler dispatch

  • Business logic execution

  • Persistence

  • Replication to backup

  • Consensus acknowledgment

  • Outbound serialization

  • Round-trip wire latency (~23µs on unoptimized network)

Throughput Results

The test outputs the maximum sustained throughput in messages per second. This represents the maximum rate at which the clustered microservice can process messages while maintaining:

  • Full persistence

  • Replication to backup

  • Consensus acknowledgment

Published Results

Official performance results from this benchmark are published in the Canonical Benchmark Results section.

Results are organized by:

  • Rumi version

  • CPU configuration (MinCPU, Default, MaxCPU)

  • Optimization mode (Latency, Throughput)

  • Message access method (Indirect, Direct)

For complete test methodology and hardware configuration, see the Test Description.

Test Configuration Files

The exact command lines used to generate the published performance results for each Rumi release are captured in a test configuration file.
