AEP Module
The AEP (Application Event Processing) module contains the canonical end-to-end performance benchmark for Rumi. This benchmark measures the complete Receive-Process-Send flow of a clustered microservice.
Canonical Benchmark: The AEP module's ESProcessor benchmark is used to generate Rumi's official performance metrics. Results from this benchmark are published in the Canonical Benchmark Results section.
Overview
The AEP module benchmarks exercise the entire Rumi stack, including:
Messaging: Inbound and outbound message handling
Serialization: Message encoding/decoding (Xbuf2)
Handler Dispatch: Event routing to business logic
State Management: Object store operations
Persistence: Transaction log writes
Clustering: State replication to backup instances
Consensus: Acknowledgment protocol between primary and backup
This represents the most comprehensive benchmark in the suite and is used to publish official performance metrics for Rumi releases.
Test Programs
The AEP module provides two test programs:
ESProcessor (Event Sourcing)
Class: com.neeve.perf.aep.engine.ESProcessor
The Event Sourcing processor is the canonical benchmark used for published Rumi performance results. It uses Event Sourcing HA policy where:
Messages are the source of truth
State is replayed from message log on recovery
Optimal for high-throughput message processing
Used for: Official Rumi performance benchmarks published in the Canonical Benchmark Results.
SRProcessor (State Replication)
Class: com.neeve.perf.aep.engine.SRProcessor
The State Replication processor uses State Replication HA policy where:
State objects are the source of truth
State changes are replicated to backup
State is persisted and recovered directly
Used for: Benchmarking state-heavy applications.
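Both processors are plain Java main classes, so either can be launched directly once the Rumi Perf distribution is on the classpath. A hedged sketch (the jar name and flag values below are placeholders, not a published configuration; the flags themselves are documented later on this page):

```shell
# Launch the canonical Event Sourcing benchmark (classpath is a placeholder)
java -cp nvx-perf.jar com.neeve.perf.aep.engine.ESProcessor --count 10000000 --rate 100000

# The State Replication variant is launched the same way
java -cp nvx-perf.jar com.neeve.perf.aep.engine.SRProcessor --count 10000000 --rate 100000
```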
Test Flow
The canonical benchmark exercises the following flow:
Primary Instance
1. Receive - Inbound message arrives from test driver
2. Decode - Deserialize message from wire format (Xbuf2)
3. Dispatch - Route message to handler
4. Process - Business logic reads message fields
5. Create Response - Business logic creates outbound message
6. Replicate - Replicate transaction to backup (concurrent with step 7)
7. Persist - Write transaction to log on primary
8. Consensus ACK - Receive acknowledgment from backup
9. Encode - Serialize outbound message
10. Send - Transmit outbound message to test driver
Backup Instance
The backup instance performs the following steps (concurrently with the primary's Persist step):
1. Receive Replication - Receive replicated transaction from primary
2. Persist - Write transaction to log on backup
3. Dispatch - Route message to handler
4. Replay - Execute business logic for consistency
5. Send ACK - Acknowledge to primary
Test Message
The benchmark uses a Car message (defined in nvx-perf-models) that:
Exercises the complete Xbuf2 data model
Contains ~200 bytes when serialized
Includes primitives, strings, nested objects, and arrays
Represents a realistic business message
Test Driver
The test uses a custom in-process driver (LocalProvider) that:
Injects messages at configurable rates
Measures wire-to-wire (w2w) latency
Captures latency distributions (50th, 99th, 99.9th percentiles)
Measures maximum throughput
Eliminates network overhead for consistent measurements
CPU Configurations
The benchmark is run in three CPU configurations:
MinCPU (1 CPU)
Minimal CPU footprint with all threads on dedicated cores:
Threads:
Business logic thread (hot, spinning)
Cluster replication reader (affinitized, not hot)
Configuration:
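The published configuration block is not reproduced here. A sketch consistent with the thread layout above, using the clustering flags documented below (the CPU mask value is a placeholder, not the published setting):

```shell
# MinCPU sketch: affinitize the replication reader (mask value is a placeholder)
java -cp nvx-perf.jar com.neeve.perf.aep.engine.ESProcessor \
  --enableClustering true \
  --clusteringLinkReaderCPUAffinityMask 2
```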
Use Case: Resource-constrained environments
Default (2 CPUs)
Balanced configuration where Rumi decides thread allocation:
Threads: Automatically determined by Rumi
Configuration:
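The published configuration block is not reproduced here. Since Rumi decides thread allocation in this mode, a sketch would simply omit affinity and detached-thread flags (jar name is a placeholder):

```shell
# Default sketch: no explicit affinity flags; Rumi decides thread placement
java -cp nvx-perf.jar com.neeve.perf.aep.engine.ESProcessor \
  --enableClustering true \
  --enablePersistence true
```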
Use Case: General-purpose production deployments
MaxCPU (6 CPUs)
Maximum parallelization with additional detached threads:
Additional Threads:
Detached sender (hot, spinning)
Detached dispatcher (hot, spinning)
Detached persister
Configuration:
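The published configuration block is not reproduced here. A sketch that enables the additional detached threads listed above and pins each one, using the flags documented below (all mask values are placeholders):

```shell
# MaxCPU sketch: enable detached threads and pin each one (mask values are placeholders)
java -cp nvx-perf.jar com.neeve.perf.aep.engine.ESProcessor \
  --busDetachedSend true --busDetachedSendCPUAffinityMask 4 \
  --clusteringDetachedSend true --clusteringDetachedSenderCPUAffinityMask 8 \
  --clusteringDetachedDispatch true --clusteringDetachedDispatcherCPUAffinityMask 16 \
  --persisterDetached true --persisterWriterCPUAffinityMask 32
```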
Use Case: Ultra-low latency requirements with available CPU resources
Runtime Optimization Modes
The benchmark is run in two optimization modes:
Latency Mode
Optimizes for lowest latency:
Message Rate: 10,000 messages/second (sustained)
Measurement: Latency percentiles (50th, 99th, 99.9th)
Throughput Mode
Optimizes for highest throughput:
Message Rate: As fast as possible (saturated)
Measurement: Maximum messages per second
Message Access Methods
The benchmark tests two message access patterns:
Indirect Access (protobuf.serial/protobuf.random)
Message data is accessed via POJO getters/setters.
Direct Access (Serializer/Deserializer)
Message data is accessed via zero-copy serializers.
Direct access provides ~10% lower latency and 3x higher throughput than indirect access.
Running the Canonical Benchmark
Prerequisites
Two Linux containers with InfiniBand or 10GbE networking
Rumi Perf distribution installed on both containers
Synchronized time between containers
Example: Latency Test (Default Config)
On Primary (192.168.4.24):
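The exact published command line is not reproduced here. A hedged sketch consistent with the latency-mode parameters on this page (10,000 msgs/sec with persistence and clustering enabled; the jar name is a placeholder, the interface address comes from the example above):

```shell
java -cp nvx-perf.jar com.neeve.perf.aep.engine.ESProcessor \
  --rate 10000 --count 10000000 \
  --enablePersistence true \
  --enableClustering true --clusteringLocalIfAddr 192.168.4.24
```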
On Backup (192.168.4.26):
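A matching backup-side sketch (same caveats: jar name is a placeholder, interface address from the example above; the backup only joins the cluster and does not inject messages):

```shell
java -cp nvx-perf.jar com.neeve.perf.aep.engine.ESProcessor \
  --enablePersistence true \
  --enableClustering true --clusteringLocalIfAddr 192.168.4.26
```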
Note: Start the backup first, then start the primary. The primary will inject messages and report results.
Example: Throughput Test (MinCPU Config)
On Primary:
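The exact published command line is not reproduced here. A hedged sketch for throughput mode with MinCPU-style pinning: in the absence of a documented "unthrottled" setting, it assumes saturation is approximated by setting --rate far above the achievable rate (jar name and mask value are placeholders):

```shell
java -cp nvx-perf.jar com.neeve.perf.aep.engine.ESProcessor \
  --rate 100000000 --count 10000000 \
  --enableClustering true \
  --clusteringLinkReaderCPUAffinityMask 2
```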
Command-Line Parameters
General Parameters
| Parameter | Short | Default | Description |
|---|---|---|---|
| --encoding | -l | protobuf.serial | Message encoding: protobuf.serial, protobuf.random, quark.serial, quark.random |
| --count | -c | 10,000,000 | Number of messages to inject |
| --rate | -r | 100,000 | Message injection rate (msgs/sec) |
| --warmupTime | -t | 2 | Warmup time in seconds before collecting stats |
| --emptyMessage | -E | false | Don't populate message fields (minimal test) |
| --noLatencyWrites | -a | false | Don't write latency data to file |
| --printIntervalStats | -b | false | Print interval stats during test |
Output Parameters
| Parameter | Short | Default | Description |
|---|---|---|---|
| --outputFile | -O | null | Excel file to write results to |
| --outputCell | -C | null | Cell in Excel file (ROW-COL format) |
| --outputThroughput | -T | false | Write throughput instead of latency |
CPU Affinity Parameters
| Parameter | Short | Default | Description |
|---|---|---|---|
| --injectorCPUAffinityMask | -j | null | CPU mask for message injector thread |
| --muxCPUAffinityMask | -y | null | CPU mask for event multiplexer thread |
| --busDetachedSendCPUAffinityMask | -o | null | CPU mask for detached sender thread |
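Assuming the affinity masks are standard CPU bitmasks (bit N selects CPU N), a mask value can be computed with plain shell arithmetic; nothing here is Rumi-specific:

```shell
# Mask for CPU 2: set bit 2
echo $((1 << 2))                  # prints 4

# Mask covering CPUs 1 and 3 combined
echo $(( (1 << 1) | (1 << 3) ))   # prints 10
```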
Message Bus Parameters
| Parameter | Short | Default | Description |
|---|---|---|---|
| --busDetachedSend | -u | false | Enable detached sending (separate thread) |
| --busDetachedSendQueueDepth | -n | 1024 | Depth of detached send queue |
Persistence Parameters
| Parameter | Short | Default | Description |
|---|---|---|---|
| --enablePersistence | -e | false | Enable transaction log persistence |
| --persisterLogLocation | -k | . | Directory for transaction log |
| --persisterInitialLogLength | -i | 1 | Preallocated log length (GB) |
| --persisterZeroOutInitial | -z | false | Zero out preallocated log |
| --persisterWriteBufferSize | -w | 8192 | Write buffer size (bytes) |
| --persisterFlushOnCommit | -f | false | Flush log on every commit |
| --persisterFlushUsingMappedMemory | -m | false | Use memory-mapped flush |
| --persisterDetached | -d | false | Use detached persistence thread |
| --persisterWriterCPUAffinityMask | -x | null | CPU mask for detached persister thread |
Clustering Parameters
| Parameter | Short | Default | Description |
|---|---|---|---|
| --enableClustering | -v | false | Enable clustering (primary/backup) |
| --clusteringLocalIfAddr | -I | 0.0.0.0 | Local interface for replication |
| --clusteringDiscoveryLocalIfAddr | -U | 0.0.0.0 | Local interface for discovery |
| --clusteringLinkSpinRead | -W | false | Spin on replication link reads |
| --clusteringLinkReaderCPUAffinityMask | -V | null | CPU mask for replication reader |
| --clusteringDetachedSend | -S | false | Enable detached replication send |
| --clusteringDetachedSenderCPUAffinityMask | -A | null | CPU mask for detached sender |
| --clusteringDetachedDispatch | -D | false | Enable detached replication dispatch |
| --clusteringDetachedDispatcherCPUAffinityMask | -B | null | CPU mask for detached dispatcher |
Interpreting Results
Latency Results
The test outputs latency percentiles in microseconds:
Includes:
Inbound deserialization
Handler dispatch
Business logic execution
Persistence
Replication to backup
Consensus acknowledgment
Outbound serialization
Round-trip wire latency (~23µs on unoptimized network)
Throughput Results
The test outputs maximum sustained throughput:
Represents: Maximum rate at which the clustered microservice can process messages while maintaining:
Full persistence
Replication to backup
Consensus acknowledgment
Published Results
Official performance results from this benchmark are published in the Canonical Benchmark Results section.
Results are organized by:
Rumi version
CPU configuration (MinCPU, Default, MaxCPU)
Optimization mode (Latency, Throughput)
Message access method (Indirect, Direct)
For complete test methodology and hardware configuration, see the Test Description.
Test Configuration Files
Complete test configurations for published results are available in a configuration file that records the exact command lines used to generate published performance results for each Rumi release.
Next Steps
Review Canonical Benchmark Results for published performance metrics
See Test Description for complete methodology
Explore other modules for component-level benchmarks
Read Rumi documentation for tuning guidance