# AEP Module

The AEP (Application Event Processing) module contains the canonical end-to-end performance benchmark for Rumi. This benchmark measures the complete Receive-Process-Send flow of a clustered microservice.

{% hint style="info" %}
**Canonical Benchmark**: The AEP module's ESProcessor benchmark is used to generate Rumi's official performance metrics. Results from this benchmark are published in the [Canonical Benchmark Results](/performance/canonical-benchmark.md) section.
{% endhint %}

## Overview

The AEP module benchmarks exercise the entire Rumi stack, including:

* **Messaging**: Inbound and outbound message handling
* **Serialization**: Message encoding/decoding (Xbuf2)
* **Handler Dispatch**: Event routing to business logic
* **State Management**: Object store operations
* **Persistence**: Transaction log writes
* **Clustering**: State replication to backup instances
* **Consensus**: Acknowledgment protocol between primary and backup

This represents the most comprehensive benchmark in the suite and is used to publish official performance metrics for Rumi releases.

## Test Programs

The AEP module provides two test programs:

### ESProcessor (Event Sourcing)

**Class**: `com.neeve.perf.aep.engine.ESProcessor`

The Event Sourcing processor is the **canonical benchmark** used for published Rumi performance results. It uses the Event Sourcing HA policy, in which:

* Messages are the source of truth
* State is replayed from message log on recovery
* Optimal for high-throughput message processing

**Used for**: Official Rumi performance benchmarks published in the [Canonical Benchmark Results](/performance/canonical-benchmark.md).

### SRProcessor (State Replication)

**Class**: `com.neeve.perf.aep.engine.SRProcessor`

The State Replication processor uses the State Replication HA policy, in which:

* State objects are the source of truth
* State changes are replicated to backup
* State is persisted and recovered directly

**Used for**: Benchmarking state-heavy applications.
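To make the difference between the two HA policies concrete, here is a toy sketch of the two recovery styles. It does not use any Rumi API: the `HaPolicySketch` class, its `State` type, and the handler are hypothetical stand-ins, and real recovery involves far more than summing numbers.

```java
import java.util.List;

/** Hypothetical illustration of the two recovery styles; not Rumi code. */
final class HaPolicySketch {
    /** Toy application state: a running total. */
    static final class State {
        long total;
    }

    /** Toy business logic: applies one inbound message to the state. */
    static void handle(State state, long quantity) {
        state.total += quantity;
    }

    /** Event Sourcing style: rebuild state by replaying the message log. */
    static State recoverByReplay(List<Long> messageLog) {
        State recovered = new State();
        for (long quantity : messageLog) {
            handle(recovered, quantity);
        }
        return recovered;
    }

    /** State Replication style: restore state directly from a replicated copy. */
    static State recoverFromState(State replicated) {
        State recovered = new State();
        recovered.total = replicated.total;
        return recovered;
    }

    public static void main(String[] args) {
        List<Long> log = List.of(5L, 7L, 3L);
        State viaReplay = recoverByReplay(log);

        State replica = new State();
        replica.total = 15;
        State viaState = recoverFromState(replica);

        System.out.println("replay total = " + viaReplay.total);
        System.out.println("state total  = " + viaState.total);
    }
}
```

Both paths arrive at the same state. What differs is which artifact (the message log or the state objects) has to be persisted and replicated.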

## Test Flow

The canonical benchmark exercises the following flow:

### Primary Instance

1. **Receive** - Inbound message arrives from test driver
2. **Decode** - Deserialize message from wire format (Xbuf2)
3. **Dispatch** - Route message to handler
4. **Process** - Business logic reads message fields
5. **Create Response** - Business logic creates outbound message
6. **Replicate** - Replicate transaction to backup (concurrent with 7)
7. **Persist** - Write transaction to log on primary
8. **Consensus ACK** - Receive acknowledgment from backup
9. **Encode** - Serialize outbound message
10. **Send** - Transmit outbound message to test driver

### Backup Instance

The backup instance performs the following steps, concurrently with the primary's persist step:

1. **Receive Replication** - Receive replicated transaction from primary
2. **Persist** - Write transaction to log on backup
3. **Dispatch** - Route message to handler
4. **Replay** - Execute business logic for consistency
5. **Send ACK** - Acknowledge to primary
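The ordering of the primary's steps can be sketched with plain `CompletableFuture`s: replication and the local log write start together, and the outbound send happens only after both have completed. This is an illustrative model under that assumption, not Rumi's engine code. `PrimaryFlowSketch` and the trace labels are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical model of the primary's step ordering; not Rumi code. */
final class PrimaryFlowSketch {
    /** Thread-safe record of the order in which steps completed. */
    final List<String> trace = new CopyOnWriteArrayList<>();

    String handle(String inbound) {
        // Steps 1-5: receive, decode, dispatch, process, create response.
        trace.add("process");
        String response = "response-to-" + inbound;

        // Steps 6-8: replicate to the backup (and await its ACK) while
        // the primary writes its own transaction log concurrently.
        CompletableFuture<Void> backupAck =
            CompletableFuture.runAsync(() -> trace.add("replicate+ack"));
        CompletableFuture<Void> persisted =
            CompletableFuture.runAsync(() -> trace.add("persist"));

        // Block until both the backup ACK and the local persist are done.
        CompletableFuture.allOf(backupAck, persisted).join();

        // Steps 9-10: encode and send the outbound message.
        trace.add("send");
        return response;
    }

    public static void main(String[] args) {
        PrimaryFlowSketch primary = new PrimaryFlowSketch();
        System.out.println(primary.handle("m1"));
        System.out.println(primary.trace);
    }
}
```

The relative order of `replicate+ack` and `persist` in the trace can vary between runs, which is the point: neither waits for the other, but `send` always comes last.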

## Test Message

The benchmark uses a `Car` message (defined in `nvx-perf-models`) that:

* Exercises the complete Xbuf2 data model
* Contains \~200 bytes when serialized
* Includes primitives, strings, nested objects, and arrays
* Represents a realistic business message

## Test Driver

The test uses a custom in-process driver (`LocalProvider`) that:

* Injects messages at configurable rates
* Measures wire-to-wire (w2w) latency
* Captures latency distributions (50th, 99th, 99.9th percentiles)
* Measures maximum throughput
* Eliminates network overhead for consistent measurements
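As a rough illustration of how percentile figures such as the 50th, 99th, and 99.9th can be derived from recorded samples, the sketch below uses the nearest-rank method. It is not `LocalProvider`'s implementation, which may use a different estimator or a histogram. The `LatencyPercentiles` class and the sample values are made up for the example.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentiles over recorded latencies. */
final class LatencyPercentiles {
    /** Returns the nearest-rank percentile (0 < p <= 100) of the samples. */
    static long percentile(long[] samplesNanos, double p) {
        if (samplesNanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        // Nearest rank: the smallest sample such that at least p percent
        // of all samples are less than or equal to it.
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = {27_100, 27_400, 26_900, 35_000, 27_300};
        System.out.println("p50   = " + percentile(samples, 50.0) + " ns");
        System.out.println("p99   = " + percentile(samples, 99.0) + " ns");
        System.out.println("p99.9 = " + percentile(samples, 99.9) + " ns");
    }
}
```

With only a handful of samples the high percentiles collapse onto the maximum. Tail percentiles only become meaningful with large sample counts such as the 600,000 messages used in the latency runs.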

## CPU Configurations

The benchmark is run in three CPU configurations:

### MinCPU (1 CPU)

Minimal CPU footprint with all threads on dedicated cores:

**Threads**:

* Business logic thread (hot, spinning)
* Cluster replication reader (affinitized, not hot)

**Configuration**:

```bash
--clusteringLinkSpinRead=false
--busDetachedSend=false
--persisterDetached=false
--clusteringLinkReaderCPUAffinityMask [1]
--muxCPUAffinityMask [3]
--injectorCPUAffinityMask [2]
```

**Use Case**: Resource-constrained environments

### Default (2 CPUs)

Balanced configuration where Rumi decides thread allocation:

**Threads**: Automatically determined by Rumi

**Configuration**:

```bash
# Uses default settings for detached operations
--clusteringLinkReaderCPUAffinityMask [1]
--muxCPUAffinityMask [3]
--injectorCPUAffinityMask [2]
--busDetachedSendCPUAffinityMask [4]
--persisterWriterCPUAffinityMask [5]
```

**Use Case**: General-purpose production deployments

### MaxCPU (6 CPUs)

Maximum parallelization with additional detached threads:

**Additional Threads**:

* Detached sender (hot, spinning)
* Detached dispatcher (hot, spinning)
* Detached persister

**Configuration**:

```bash
--clusteringLinkSpinRead=true
--busDetachedSend=true
--persisterDetached=true
--clusteringDetachedSend
--clusteringDetachedDispatch
--clusteringLinkReaderCPUAffinityMask [1]
--muxCPUAffinityMask [3]
--injectorCPUAffinityMask [2]
--busDetachedSendCPUAffinityMask [4]
--persisterWriterCPUAffinityMask [5]
--clusteringDetachedSenderCPUAffinityMask [6]
--clusteringDetachedDispatcherCPUAffinityMask [7]
```

**Use Case**: Ultra-low latency requirements with available CPU resources
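The three flag sets differ only in a few switches and in which affinity masks are supplied. The sketch below collects them in one place so the differences are easy to compare. The `CpuConfig` enum and its `flags()` method are hypothetical helpers. The flags and mask values themselves are copied from the listings above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reproduces the flag sets listed above. */
enum CpuConfig {
    MIN_CPU, DEFAULT, MAX_CPU;

    /** Returns the flags for this configuration, one token per element. */
    List<String> flags() {
        List<String> flags = new ArrayList<>();
        if (this == MIN_CPU) {
            flags.addAll(List.of(
                "--clusteringLinkSpinRead=false",
                "--busDetachedSend=false",
                "--persisterDetached=false"));
        } else if (this == MAX_CPU) {
            flags.addAll(List.of(
                "--clusteringLinkSpinRead=true",
                "--busDetachedSend=true",
                "--persisterDetached=true",
                "--clusteringDetachedSend",
                "--clusteringDetachedDispatch"));
        }
        // Affinity masks that appear in all three configurations.
        flags.addAll(List.of(
            "--clusteringLinkReaderCPUAffinityMask", "[1]",
            "--muxCPUAffinityMask", "[3]",
            "--injectorCPUAffinityMask", "[2]"));
        // Masks for the detached bus sender and persister writer.
        if (this != MIN_CPU) {
            flags.addAll(List.of(
                "--busDetachedSendCPUAffinityMask", "[4]",
                "--persisterWriterCPUAffinityMask", "[5]"));
        }
        // Masks for the additional detached clustering threads.
        if (this == MAX_CPU) {
            flags.addAll(List.of(
                "--clusteringDetachedSenderCPUAffinityMask", "[6]",
                "--clusteringDetachedDispatcherCPUAffinityMask", "[7]"));
        }
        return flags;
    }

    public static void main(String[] args) {
        for (CpuConfig config : values()) {
            System.out.println(config + ": " + String.join(" ", config.flags()));
        }
    }
}
```

A list of tokens like this could be appended to a `ProcessBuilder` command, which passes `[1]` through literally because no shell is involved.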

## Runtime Optimization Modes

The benchmark is run in two optimization modes:

### Latency Mode

Optimizes for lowest latency:

```bash
-Dnv.optimizefor=latency
```

**Message Rate**: 10,000 messages/second (sustained)

**Measurement**: Latency percentiles (50th, 99th, 99.9th)

### Throughput Mode

Optimizes for highest throughput:

```bash
-Dnv.optimizefor=throughput
```

**Message Rate**: As fast as possible (saturated)

**Measurement**: Maximum messages per second

## Message Access Methods

The benchmark tests two message access patterns:

### Indirect Access (protobuf.serial/protobuf.random)

Message data accessed via POJO getters/setters:

```java
@EventHandler
public void onMessage(Car inMessage) {
    // Read via getters
    String make = inMessage.getMake();
    String model = inMessage.getModel();

    // Create outbound message
    Car outMessage = Car.create();
    outMessage.setMake(make);
    outMessage.setModel(model);

    // Send
    messageSender.sendMessage(1, outMessage);
}
```

### Direct Access (Serializer/Deserializer)

Message data accessed via zero-copy serializers:

```java
@EventHandler
public void onMessage(Car inMessage) {
    // Read via deserializer (zero-copy)
    deserializer.decode(inMessage);

    // Create via serializer (zero-copy)
    Car outMessage = serializer.create();

    // Send
    messageSender.sendMessage(1, outMessage);
}
```

**Direct access provides \~10% lower latency and 3x higher throughput than indirect access.**

## Running the Canonical Benchmark

### Prerequisites

1. Two Linux containers with InfiniBand or 10GbE networking
2. Rumi Perf distribution installed on both containers
3. Synchronized time between containers

### Example: Latency Test (Default Config)

**On Primary (192.168.4.24)**:

```bash
rm -rf rdat
LD_LIBRARY_PATH=$HOME/.nvx/native \
numactl -m 0 \
$JAVA_HOME/bin/java \
  -Dnv.optimizefor=latency \
  -Dnv.optimizeMemoryUsage=false \
  -Dnv.conservecpu=false \
  -Xms16g -Xmx16g \
  -cp "libs/*" \
  com.neeve.perf.aep.engine.ESProcessor \
  --count 600000 \
  --rate 10000 \
  --warmupTime 10 \
  --printIntervalStats \
  --enablePersistence \
  --persisterLogLocation $(pwd)/rdat \
  --persisterInitialLogLength 10 \
  --persisterZeroOutInitial \
  --persisterFlushOnCommit \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.24 \
  --clusteringDiscoveryLocalIfAddr 192.168.3.24 \
  --clusteringLinkReaderCPUAffinityMask '[1]' \
  --injectorCPUAffinityMask '[2]' \
  --muxCPUAffinityMask '[3]' \
  --busDetachedSendCPUAffinityMask '[4]' \
  --persisterWriterCPUAffinityMask '[5]' \
  --encoding protobuf.random
```

**On Backup (192.168.4.26)**:

```bash
rm -rf rdat
LD_LIBRARY_PATH=$HOME/.nvx/native \
numactl -m 0 \
$JAVA_HOME/bin/java \
  -Dnv.optimizefor=latency \
  -Dnv.optimizeMemoryUsage=false \
  -Dnv.conservecpu=false \
  -Xms16g -Xmx16g \
  -cp "libs/*" \
  com.neeve.perf.aep.engine.ESProcessor \
  --count 600000 \
  --rate 10000 \
  --warmupTime 10 \
  --printIntervalStats \
  --enablePersistence \
  --persisterLogLocation $(pwd)/rdat \
  --persisterInitialLogLength 10 \
  --persisterZeroOutInitial \
  --persisterFlushOnCommit \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.26 \
  --clusteringDiscoveryLocalIfAddr 192.168.3.26 \
  --clusteringLinkReaderCPUAffinityMask '[1]' \
  --injectorCPUAffinityMask '[2]' \
  --muxCPUAffinityMask '[3]' \
  --busDetachedSendCPUAffinityMask '[4]' \
  --persisterWriterCPUAffinityMask '[5]' \
  --encoding protobuf.random
```

**Note**: Start the backup first, then start the primary. The primary will inject messages and report results.
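For planning a run, the `--count` and `--rate` values in the commands above translate directly into an injection interval and a minimum run length. The sketch below shows that arithmetic. `InjectionPlan` is a hypothetical helper, and it assumes the injector holds the requested rate exactly and ignores warmup and startup time.

```java
import java.time.Duration;

/** Hypothetical helper: back-of-the-envelope timing for a paced run. */
final class InjectionPlan {
    private static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Time between consecutive injections at a fixed rate (truncated to whole ns). */
    static Duration interval(long ratePerSecond) {
        if (ratePerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return Duration.ofNanos(NANOS_PER_SECOND / ratePerSecond);
    }

    /** Time needed to inject count messages at a fixed rate. */
    static Duration injectionTime(long count, long ratePerSecond) {
        if (ratePerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        // Multiply first so rates that do not divide one second evenly
        // lose as little precision as possible.
        long totalNanos = Math.multiplyExact(count, NANOS_PER_SECOND) / ratePerSecond;
        return Duration.ofNanos(totalNanos);
    }

    public static void main(String[] args) {
        long count = 600_000; // --count from the latency example
        long rate = 10_000;   // --rate from the latency example
        System.out.println("interval (ns): " + interval(rate).toNanos());
        System.out.println("injection time (s): " + injectionTime(count, rate).getSeconds());
    }
}
```

For the latency example this gives one message every 100 µs and about 60 seconds of injection.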

### Example: Throughput Test (MinCPU Config)

**On Primary**:

```bash
rm -rf rdat
LD_LIBRARY_PATH=$HOME/.nvx/native \
numactl -m 0 \
$JAVA_HOME/bin/java \
  -Dnv.optimizefor=throughput \
  -Dnv.optimizeMemoryUsage=false \
  -Dnv.conservecpu=false \
  -Xms16g -Xmx16g \
  -cp "libs/*" \
  com.neeve.perf.aep.engine.ESProcessor \
  --count 5000000 \
  --rate 1000000 \
  --warmupTime 10 \
  --printIntervalStats \
  --enablePersistence \
  --persisterDetached=false \
  --persisterLogLocation $(pwd)/rdat \
  --persisterInitialLogLength 10 \
  --persisterZeroOutInitial \
  --persisterFlushOnCommit \
  --enableClustering \
  --clusteringLocalIfAddr 192.168.4.24 \
  --clusteringDiscoveryLocalIfAddr 192.168.3.24 \
  --clusteringLinkSpinRead=false \
  --busDetachedSend=false \
  --clusteringLinkReaderCPUAffinityMask '[1]' \
  --injectorCPUAffinityMask '[2]' \
  --muxCPUAffinityMask '[3]' \
  --busDetachedSendCPUAffinityMask '[4]' \
  --persisterWriterCPUAffinityMask '[5]' \
  --outputThroughput \
  --encoding xbuf2.random
```

## Command-Line Parameters

### General Parameters

| Parameter              | Short | Default         | Description                                                                            |
| ---------------------- | ----- | --------------- | -------------------------------------------------------------------------------------- |
| `--encoding`           | `-l`  | protobuf.serial | Message encoding: `protobuf.serial`, `protobuf.random`, `quark.serial`, `quark.random` |
| `--count`              | `-c`  | 10,000,000      | Number of messages to inject                                                           |
| `--rate`               | `-r`  | 100,000         | Message injection rate (msgs/sec)                                                      |
| `--warmupTime`         | `-t`  | 2               | Warmup time in seconds before collecting stats                                         |
| `--emptyMessage`       | `-E`  | false           | Don't populate message fields (minimal test)                                           |
| `--noLatencyWrites`    | `-a`  | false           | Don't write latency data to file                                                       |
| `--printIntervalStats` | `-b`  | false           | Print interval stats during test                                                       |

### Output Parameters

| Parameter            | Short | Default | Description                         |
| -------------------- | ----- | ------- | ----------------------------------- |
| `--outputFile`       | `-O`  | null    | Excel file to write results to      |
| `--outputCell`       | `-C`  | null    | Cell in Excel file (ROW-COL format) |
| `--outputThroughput` | `-T`  | false   | Write throughput instead of latency |

### CPU Affinity Parameters

| Parameter                          | Short | Default | Description                           |
| ---------------------------------- | ----- | ------- | ------------------------------------- |
| `--injectorCPUAffinityMask`        | `-j`  | null    | CPU mask for message injector thread  |
| `--muxCPUAffinityMask`             | `-y`  | null    | CPU mask for event multiplexer thread |
| `--busDetachedSendCPUAffinityMask` | `-o`  | null    | CPU mask for detached sender thread   |

### Message Bus Parameters

| Parameter                     | Short | Default | Description                               |
| ----------------------------- | ----- | ------- | ----------------------------------------- |
| `--busDetachedSend`           | `-u`  | false   | Enable detached sending (separate thread) |
| `--busDetachedSendQueueDepth` | `-n`  | 1024    | Depth of detached send queue              |

### Persistence Parameters

| Parameter                           | Short | Default | Description                            |
| ----------------------------------- | ----- | ------- | -------------------------------------- |
| `--enablePersistence`               | `-e`  | false   | Enable transaction log persistence     |
| `--persisterLogLocation`            | `-k`  | .       | Directory for transaction log          |
| `--persisterInitialLogLength`       | `-i`  | 1       | Preallocated log length (GB)           |
| `--persisterZeroOutInitial`         | `-z`  | false   | Zero out preallocated log              |
| `--persisterWriteBufferSize`        | `-w`  | 8192    | Write buffer size (bytes)              |
| `--persisterFlushOnCommit`          | `-f`  | false   | Flush log on every commit              |
| `--persisterFlushUsingMappedMemory` | `-m`  | false   | Use memory-mapped flush                |
| `--persisterDetached`               | `-d`  | false   | Use detached persistence thread        |
| `--persisterWriterCPUAffinityMask`  | `-x`  | null    | CPU mask for detached persister thread |

### Clustering Parameters

| Parameter                                       | Short | Default | Description                          |
| ----------------------------------------------- | ----- | ------- | ------------------------------------ |
| `--enableClustering`                            | `-v`  | false   | Enable clustering (primary/backup)   |
| `--clusteringLocalIfAddr`                       | `-I`  | 0.0.0.0 | Local interface for replication      |
| `--clusteringDiscoveryLocalIfAddr`              | `-U`  | 0.0.0.0 | Local interface for discovery        |
| `--clusteringLinkSpinRead`                      | `-W`  | false   | Spin on replication link reads       |
| `--clusteringLinkReaderCPUAffinityMask`         | `-V`  | null    | CPU mask for replication reader      |
| `--clusteringDetachedSend`                      | `-S`  | false   | Enable detached replication send     |
| `--clusteringDetachedSenderCPUAffinityMask`     | `-A`  | null    | CPU mask for detached sender         |
| `--clusteringDetachedDispatch`                  | `-D`  | false   | Enable detached replication dispatch |
| `--clusteringDetachedDispatcherCPUAffinityMask` | `-B`  | null    | CPU mask for detached dispatcher     |

## Interpreting Results

### Latency Results

The test outputs latency percentiles in microseconds:

```
Wire-to-Wire Latency Stats:
  Count: 600000
  50th percentile: 27.34 µs
  99th percentile: 30.14 µs
  99.9th percentile: 35.23 µs
  Mean: 27.89 µs
```

**Includes**:

* Inbound deserialization
* Handler dispatch
* Business logic execution
* Persistence
* Replication to backup
* Consensus acknowledgment
* Outbound serialization
* Round-trip wire latency (\~23µs on unoptimized network)
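When comparing runs, it can help to pull the percentile lines out of the report and separate the wire component from everything else. The sketch below assumes the report lines look exactly like the sample output above and uses the approximate 23 µs wire figure quoted in the list. `LatencyReport` is a hypothetical helper, not part of the benchmark suite.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the percentile lines of the latency report. */
final class LatencyReport {
    // Matches e.g. "99.9th percentile: 35.23" and captures both numbers.
    private static final Pattern PERCENTILE_LINE =
        Pattern.compile("(\\d+(?:\\.\\d+)?)th percentile:\\s*(\\d+(?:\\.\\d+)?)");

    /** Maps a percentile label (e.g. "99.9") to its latency in microseconds. */
    static Map<String, Double> parse(String output) {
        Map<String, Double> result = new LinkedHashMap<>();
        Matcher matcher = PERCENTILE_LINE.matcher(output);
        while (matcher.find()) {
            result.put(matcher.group(1), Double.parseDouble(matcher.group(2)));
        }
        return result;
    }

    public static void main(String[] args) {
        // \u00B5 is the micro sign, escaped to avoid source-encoding issues.
        String sample = String.join("\n",
            "Wire-to-Wire Latency Stats:",
            "  Count: 600000",
            "  50th percentile: 27.34 \u00B5s",
            "  99th percentile: 30.14 \u00B5s",
            "  99.9th percentile: 35.23 \u00B5s",
            "  Mean: 27.89 \u00B5s");
        double wireMicros = 23.0; // approximate round-trip wire latency
        parse(sample).forEach((label, micros) ->
            System.out.printf("p%s: %.2f us total, about %.2f us excluding wire%n",
                label, micros, micros - wireMicros));
    }
}
```

On the sample figures, the median leaves roughly 4 µs for everything other than the wire round trip. Treat that as an estimate, since the wire figure itself is approximate.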

### Throughput Results

The test outputs maximum sustained throughput:

```
Throughput: 281,947 messages/second
```

**Represents**: Maximum rate at which the clustered microservice can process messages while maintaining:

* Full persistence
* Replication to backup
* Consensus acknowledgment
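A throughput figure can also be read as an average time budget per message, which is sometimes easier to reason about. The sketch below shows the conversion in both directions. `ThroughputMath` is a hypothetical helper, and the budget is an average across a pipelined system, not the latency of any single message.

```java
/** Hypothetical helper: converts between throughput and per-message time. */
final class ThroughputMath {
    /** Messages per second given a message count and elapsed wall-clock time. */
    static double messagesPerSecond(long count, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsed time must be positive");
        }
        return count * 1e9 / elapsedNanos;
    }

    /** Average nanoseconds available per message at a given throughput. */
    static double nanosPerMessage(double messagesPerSecond) {
        if (messagesPerSecond <= 0) {
            throw new IllegalArgumentException("throughput must be positive");
        }
        return 1e9 / messagesPerSecond;
    }

    public static void main(String[] args) {
        double sampleThroughput = 281_947; // figure from the sample output
        System.out.printf("budget per message: %.0f ns%n",
            nanosPerMessage(sampleThroughput));
    }
}
```

At the sample throughput this works out to roughly 3.5 µs per message on average.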

## Published Results

Official performance results from this benchmark are published in the [Canonical Benchmark Results](/performance/canonical-benchmark.md) section.

Results are organized by:

* Rumi version
* CPU configuration (MinCPU, Default, MaxCPU)
* Optimization mode (Latency, Throughput)
* Message access method (Indirect, Direct)

For complete test methodology and hardware configuration, see the [Test Description](/performance/canonical-benchmark/test-description.md).

## Test Configuration Files

Complete test configurations for published results are available in:

```
nvx-perf-aep/results/mincpu_default_maxcpu_tests.txt
```

This file contains the exact command lines used to generate published performance results for each Rumi release.

## Next Steps

* Review [Canonical Benchmark Results](/performance/canonical-benchmark.md) for published performance metrics
* See [Test Description](/performance/canonical-benchmark/test-description.md) for complete methodology
* Explore [other modules](/performance/benchmark-suite.md) for component-level benchmarks
* Read [Rumi documentation](https://docs.rumi.systems) for tuning guidance

