AEP Module

The AEP (Application Event Processing) module contains the canonical end-to-end performance benchmark for Rumi. This benchmark measures the complete Receive-Process-Send flow of a clustered microservice.

Canonical Benchmark: The AEP module's ESProcessor benchmark is used to generate Rumi's official performance metrics. Results from this benchmark are published in the Canonical Benchmark Results section.

Overview

The AEP module benchmarks exercise the entire Rumi stack, including:

  • Messaging: Inbound and outbound message handling

  • Serialization: Message encoding/decoding (Xbuf2)

  • Handler Dispatch: Event routing to business logic

  • State Management: Object store operations

  • Persistence: Transaction log writes

  • Clustering: State replication to backup instances

  • Consensus: Acknowledgment protocol between primary and backup

This represents the most comprehensive benchmark in the suite and is used to publish official performance metrics for Rumi releases.

Test Programs

The AEP module provides two test programs:

ESProcessor (Event Sourcing)

Class: com.neeve.perf.aep.engine.ESProcessor

The Event Sourcing processor is the canonical benchmark used for published Rumi performance results. It uses the Event Sourcing HA policy, in which:

  • Messages are the source of truth

  • State is replayed from the message log on recovery

  • Optimal for high-throughput message processing

Used for: Official Rumi performance benchmarks published in the Canonical Benchmark Results.

SRProcessor (State Replication)

Class: com.neeve.perf.aep.engine.SRProcessor

The State Replication processor uses the State Replication HA policy, in which:

  • State objects are the source of truth

  • State changes are replicated to backup

  • State is persisted and recovered directly

Used for: Benchmarking state-heavy applications.

Test Flow

The canonical benchmark exercises the following flow:

Primary Instance

  1. Receive - Inbound message arrives from test driver

  2. Decode - Deserialize message from wire format (Xbuf2)

  3. Dispatch - Route message to handler

  4. Process - Business logic reads message fields

  5. Create Response - Business logic creates outbound message

  6. Replicate - Replicate transaction to backup (concurrent with 7)

  7. Persist - Write transaction to log on primary

  8. Consensus ACK - Receive acknowledgment from backup

  9. Encode - Serialize outbound message

  10. Send - Transmit outbound message to test driver

Backup Instance

The backup instance (concurrent with primary's persist step):

  1. Receive Replication - Receive replicated transaction from primary

  2. Persist - Write transaction to log on backup

  3. Dispatch - Route message to handler

  4. Replay - Execute business logic for consistency

  5. Send ACK - Acknowledge to primary

Test Message

The benchmark uses a Car message (defined in nvx-perf-models) that:

  • Exercises the complete Xbuf2 data model

  • Contains ~200 bytes when serialized

  • Includes primitives, strings, nested objects, and arrays

  • Represents a realistic business message

Test Driver

The test uses a custom in-process driver (LocalProvider) that:

  • Injects messages at configurable rates

  • Measures wire-to-wire (w2w) latency

  • Captures latency distributions (50th, 99th, 99.9th percentiles)

  • Measures maximum throughput

  • Eliminates network overhead for consistent measurements
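The driver's injection rate, message count, and warm-up period map to the general command-line parameters documented later on this page. A brief illustration (the flag combination is chosen for illustration only):

```
# Inject 10 million messages at 100,000 msgs/sec, discarding the first 2 seconds
# as warmup and printing interval stats while the test runs.
--count 10000000 --rate 100000 --warmupTime 2 --printIntervalStats
```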

CPU Configurations

The benchmark is run in three CPU configurations:

MinCPU (1 CPU)

Minimal CPU footprint with all threads on dedicated cores:

Threads:

  • Business logic thread (hot, spinning)

  • Cluster replication reader (affinitized, not hot)

Configuration:
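The published MinCPU flags live in the test configuration file; as a rough sketch using the affinity parameters documented below (mask values are placeholders, and mapping the business-logic thread to the event multiplexer is an assumption):

```
# Sketch only: pin the event multiplexer (business logic) and the cluster
# replication reader; <cpu-mask> is a placeholder, not the published value.
--muxCPUAffinityMask <cpu-mask> \
--clusteringLinkReaderCPUAffinityMask <cpu-mask>
```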

Use Case: Resource-constrained environments

Default (2 CPUs)

Balanced configuration where Rumi decides thread allocation:

Threads: Automatically determined by Rumi

Configuration:
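In this configuration no affinity or detached-thread flags are passed; a minimal sketch (the exact published flags are not reproduced here):

```
# Sketch only: clustering and persistence enabled, thread placement left to Rumi.
--enableClustering --enablePersistence
```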

Use Case: General-purpose production deployments

MaxCPU (6 CPUs)

Maximum parallelization with additional detached threads:

Additional Threads:

  • Detached sender (hot, spinning)

  • Detached dispatcher (hot, spinning)

  • Detached persister

Configuration:
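A sketch of the additional flags implied by the thread list above (which detached sender the published configuration enables, bus or replication, is not stated here; masks are placeholders):

```
# Sketch only: detached send, detached replication dispatch, and a detached
# persister, each pinned via a placeholder mask.
--busDetachedSend \
--busDetachedSendCPUAffinityMask <cpu-mask> \
--clusteringDetachedDispatch \
--clusteringDetachedDispatcherCPUAffinityMask <cpu-mask> \
--persisterDetached \
--persisterWriterCPUAffinityMask <cpu-mask>
```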

Use Case: Ultra-low latency requirements with available CPU resources

Runtime Optimization Modes

The benchmark is run in two optimization modes:

Latency Mode

Optimizes for lowest latency:

Message Rate: 10,000 messages/second (sustained)

Measurement: Latency percentiles (50th, 99th, 99.9th)

Throughput Mode

Optimizes for highest throughput:

Message Rate: As fast as possible (saturated)

Measurement: Maximum messages per second

Message Access Methods

The benchmark tests two message access patterns:

Indirect Access (protobuf.serial/protobuf.random)

Message data is accessed via POJO getters and setters.
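The encoding names in the heading are values of the --encoding flag described under General Parameters; for example:

```
# Select serial indirect (POJO getter/setter) access.
--encoding protobuf.serial
```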

Direct Access (Serializer/Deserializer)

Message data is accessed via zero-copy serializers.

Direct access provides ~10% lower latency and 3x higher throughput.

Running the Canonical Benchmark

Prerequisites

  1. Two Linux containers with InfiniBand or 10GbE networking

  2. Rumi Perf distribution installed on both containers

  3. Synchronized time between containers

Example: Latency Test (Default Config)

On Primary (192.168.4.24):
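The original command line is not reproduced here; the following is an illustrative sketch that assumes ESProcessor is launched directly on the distribution's classpath (placeholder shown) and uses only the flags documented under Command-Line Parameters:

```
# Sketch only: latency mode at a sustained 10,000 msgs/sec with clustering and
# persistence enabled. The classpath placeholder and flag syntax are assumptions.
java -cp <rumi-perf-classpath> com.neeve.perf.aep.engine.ESProcessor \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.24 \
  --enablePersistence \
  --rate 10000 \
  --count 10000000 \
  --warmupTime 2
```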

On Backup (192.168.4.26):
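A matching sketch for the backup, under the same assumptions (the backup does not inject messages):

```
# Sketch only: same processor on the backup host; role assignment is assumed to
# follow start order (backup first, then primary, as noted below).
java -cp <rumi-perf-classpath> com.neeve.perf.aep.engine.ESProcessor \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.26 \
  --enablePersistence
```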

Note: Start the backup first, then start the primary. The primary will inject messages and report results.

Example: Throughput Test (MinCPU Config)

On Primary:
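Again an illustrative sketch only, combining the MinCPU affinity flags sketched earlier with an injection rate set well above capacity to saturate the pipeline (the published command lines are kept in the test configuration file):

```
# Sketch only: MinCPU throughput run. Masks, classpath, and the oversubscribed
# rate are assumptions.
java -cp <rumi-perf-classpath> com.neeve.perf.aep.engine.ESProcessor \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.24 \
  --enablePersistence \
  --muxCPUAffinityMask <cpu-mask> \
  --clusteringLinkReaderCPUAffinityMask <cpu-mask> \
  --rate 10000000 \
  --count 10000000
```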

Command-Line Parameters

General Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --encoding | -l | protobuf.serial | Message encoding: protobuf.serial, protobuf.random, quark.serial, quark.random |
| --count | -c | 10,000,000 | Number of messages to inject |
| --rate | -r | 100,000 | Message injection rate (msgs/sec) |
| --warmupTime | -t | 2 | Warmup time in seconds before collecting stats |
| --emptyMessage | -E | false | Don't populate message fields (minimal test) |
| --noLatencyWrites | -a | false | Don't write latency data to file |
| --printIntervalStats | -b | false | Print interval stats during test |

Output Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --outputFile | -O | null | Excel file to write results to |
| --outputCell | -C | null | Cell in Excel file (ROW-COL format) |
| --outputThroughput | -T | false | Write throughput instead of latency |
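For example, to record a throughput result into a spreadsheet cell (the file name is arbitrary and the ROW-COL value shown is a guess at the documented format):

```
# Write the measured throughput into the given cell of the Excel workbook.
--outputFile results.xlsx --outputCell 5-3 --outputThroughput
```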

CPU Affinity Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --injectorCPUAffinityMask | -j | null | CPU mask for message injector thread |
| --muxCPUAffinityMask | -y | null | CPU mask for event multiplexer thread |
| --busDetachedSendCPUAffinityMask | -o | null | CPU mask for detached sender thread |

Message Bus Parameters

Parameter
Short
Default
Description

--busDetachedSend

-u

false

Enable detached sending (separate thread)

--busDetachedSendQueueDepth

-n

1024

Depth of detached send queue

Persistence Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --enablePersistence | -e | false | Enable transaction log persistence |
| --persisterLogLocation | -k | . | Directory for transaction log |
| --persisterInitialLogLength | -i | 1 | Preallocated log length (GB) |
| --persisterZeroOutInitial | -z | false | Zero out preallocated log |
| --persisterWriteBufferSize | -w | 8192 | Write buffer size (bytes) |
| --persisterFlushOnCommit | -f | false | Flush log on every commit |
| --persisterFlushUsingMappedMemory | -m | false | Use memory-mapped flush |
| --persisterDetached | -d | false | Use detached persistence thread |
| --persisterWriterCPUAffinityMask | -x | null | CPU mask for detached persister thread |
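An illustrative combination of these flags (directory, flag syntax, and the choice of a detached persister are assumptions, not the published tuning):

```
# Persist to a preallocated, zeroed 1 GB log using a detached persister thread
# pinned via a placeholder mask.
--enablePersistence \
--persisterLogLocation /data/txlog \
--persisterInitialLogLength 1 \
--persisterZeroOutInitial \
--persisterDetached \
--persisterWriterCPUAffinityMask <cpu-mask>
```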

Clustering Parameters

| Parameter | Short | Default | Description |
| --- | --- | --- | --- |
| --enableClustering | -v | false | Enable clustering (primary/backup) |
| --clusteringLocalIfAddr | -I | 0.0.0.0 | Local interface for replication |
| --clusteringDiscoveryLocalIfAddr | -U | 0.0.0.0 | Local interface for discovery |
| --clusteringLinkSpinRead | -W | false | Spin on replication link reads |
| --clusteringLinkReaderCPUAffinityMask | -V | null | CPU mask for replication reader |
| --clusteringDetachedSend | -S | false | Enable detached replication send |
| --clusteringDetachedSenderCPUAffinityMask | -A | null | CPU mask for detached sender |
| --clusteringDetachedDispatch | -D | false | Enable detached replication dispatch |
| --clusteringDetachedDispatcherCPUAffinityMask | -B | null | CPU mask for detached dispatcher |
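An illustrative combination for a replication link tuned for low latency (the address and mask are placeholders, not published values):

```
# Enable clustering, bind replication to a specific interface, and spin on
# replication link reads with the reader pinned via a placeholder mask.
--enableClustering \
--clusteringLocalIfAddr 192.168.4.24 \
--clusteringLinkSpinRead \
--clusteringLinkReaderCPUAffinityMask <cpu-mask>
```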

Interpreting Results

Latency Results

The test outputs latency percentiles (50th, 99th, 99.9th) in microseconds. The reported wire-to-wire latency includes:

  • Inbound deserialization

  • Handler dispatch

  • Business logic execution

  • Persistence

  • Replication to backup

  • Consensus acknowledgment

  • Outbound serialization

  • Round-trip wire latency (~23µs on unoptimized network)

Throughput Results

The test outputs the maximum sustained throughput in messages per second. This represents the maximum rate at which the clustered microservice can process messages while maintaining:

  • Full persistence

  • Replication to backup

  • Consensus acknowledgment

Published Results

Official performance results from this benchmark are published in the Canonical Benchmark Results section.

Results are organized by:

  • Rumi version

  • CPU configuration (MinCPU, Default, MaxCPU)

  • Optimization mode (Latency, Throughput)

  • Message access method (Indirect, Direct)

For complete test methodology and hardware configuration, see the Test Description.

Test Configuration Files

The exact command lines used to generate the published performance results for each Rumi release are captured in a test configuration file.
