Rumi 4.0.579-4.0.39

Performance results for Rumi 4.0.579-4.0.39 (Core: 4.0.579, Bindings: 4.0.39), based on testing conducted in February 2025.

Note: Rumi 4.0 is the successor to X Platform 3.16. The platform was rebranded from "X Platform" to "Rumi" starting with version 4.0. Compared with X Platform 3.16, these results show 8-23% lower latency and 19-293% higher throughput, depending on configuration.

Test Configuration

Rumi Core Version: 4.0.579
Rumi Bindings Version: 4.0.39
Java Runtime: Oracle Java 8
Message Encoding: Xbuf2
Cluster Configuration: Primary + Backup with persistence and replication
Test Hardware: Intel Xeon Gold 6334 (8-Core, 3.6 GHz), 128GB RAM, InfiniBand network
Message Rate (Latency Tests): 10,000 messages/second
Message Rate (Throughput Tests): As fast as possible (saturated)
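
To make the latency-test message rate concrete, the sketch below converts 10,000 messages/second into a per-message send interval. It assumes evenly spaced sends, which the configuration above does not state; `send_schedule_us` is a hypothetical helper, not part of Rumi or the test harness.

```python
def send_schedule_us(rate_per_sec, count):
    """Return target send times (in microseconds) for `count` evenly paced messages."""
    interval_us = 1_000_000 / rate_per_sec  # 10,000 msgs/sec -> 100 µs between sends
    return [i * interval_us for i in range(count)]


if __name__ == "__main__":
    # Assumption: the load generator spaces messages evenly at the configured rate.
    print(send_schedule_us(10_000, 5))
```

Under that assumption, messages are 100µs apart, which is longer than any latency percentile reported on this page.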

See the Test Description for complete test methodology and hardware specifications.

Latency Results

All latency numbers are in microseconds (µs). Round-trip wire latency (~23µs on unoptimized network) is included in all results.
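
For readers unfamiliar with percentile reporting, the following is a minimal sketch of how 50th, 99th, and 99.9th percentile values can be derived from a set of latency samples. It uses the nearest-rank method, which is an assumption: this page does not say which method the test harness uses. The sample values are synthetic and exist only to make the example runnable; they are not measurements.

```python
import math


def percentile_nearest_rank(samples_us, pct):
    """Return the nearest-rank percentile of a list of latency samples (µs)."""
    if not samples_us:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_us)
    # 1-based rank of the smallest sample that covers `pct` percent of the data.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Synthetic samples for illustration only (not taken from the Rumi tests).
    samples = [24.1, 25.0, 25.3, 26.2, 26.4, 27.0, 27.9, 29.5, 33.8, 41.2]
    for pct in (50, 99, 99.9):
        print(f"{pct}th percentile: {percentile_nearest_rank(samples, pct)} µs")
```

With only ten samples the 99th and 99.9th percentiles both resolve to the largest sample, which is why tail percentiles are only meaningful over large sample counts.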

Indirect Message Access | Latency Optimization

Configuration: Xbuf2.Indirect | OptimizeFor=Latency | Message Rate=10,000/sec

Message data accessed via POJO setter/getter methods.

CPU Config | # CPUs     | 50th % (µs) | 99th % (µs) | 99.9th % (µs) | vs X Platform 3.16
MinCPU     | not listed | 31.46       | 34.09       | 43.43         | -23%
Default    | 2          | 26.31       | 29.18       | 37.39         | -14%
MaxCPU     | not listed | 29.86       | 31.12       | 37.32         | -8%

Best Configuration: Default (2 CPUs) - 26.31µs median latency

Improvement: 14% lower latency than X Platform 3.16 with Default configuration (26.31µs vs 30.55µs)
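
The improvement figure is a relative change against the X Platform 3.16 baseline. A minimal sketch of that arithmetic, using the median latencies quoted on this page (`percent_change` is an illustrative helper name):

```python
def percent_change(baseline, measured):
    """Relative change of `measured` versus `baseline`, in percent (negative = lower)."""
    return (measured - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Median latencies (µs) for the Default configuration, as quoted on this page.
    indirect = percent_change(30.55, 26.31)
    direct = percent_change(27.34, 23.60)
    print(f"Indirect: {indirect:.1f}%")  # about -13.9%, reported as -14%
    print(f"Direct:   {direct:.1f}%")    # about -13.7%, reported as -14%
```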

Direct Message Access | Latency Optimization

Configuration: Xbuf2.Direct | OptimizeFor=Latency | Message Rate=10,000/sec

Message data accessed via serializer/deserializer objects (zero-copy access).

CPU Config | # CPUs     | 50th % (µs) | 99th % (µs) | 99.9th % (µs) | vs X Platform 3.16
MinCPU     | not listed | 28.90       | 31.56       | 38.61         | -23%
Default    | 2          | 23.60       | 26.40       | 31.98         | -14%
MaxCPU     | not listed | 26.68       | 27.78       | 31.86         | -8%

Best Configuration: Default (2 CPUs) - 23.60µs median latency

Improvement: 14% lower latency than X Platform 3.16 with Default configuration (23.60µs vs 27.34µs)

Throughput Results

Throughput measured in messages per second. Test mode: saturated load (as fast as possible).

Indirect Message Access | Throughput Optimization

Configuration: Xbuf2.Indirect | OptimizeFor=Throughput | Message Rate=[As Fast As Possible]

CPU Config | # CPUs     | Throughput (msgs/sec) | vs X Platform 3.16
MinCPU     | 1          | 139,875               | +19%
Default    | not listed | 113,377               | +25%
MaxCPU     | not listed | 97,441                | +73%

Best Configuration: MinCPU (1 CPU) - 139,875 msgs/sec

Improvement: 19% higher throughput than X Platform 3.16 with MinCPU configuration

Direct Message Access | Throughput Optimization

Configuration: Xbuf2.Direct | OptimizeFor=Throughput | Message Rate=[As Fast As Possible]

CPU Config | # CPUs     | Throughput (msgs/sec) | vs X Platform 3.16
MinCPU     | not listed | 421,498               | +49%
Default    | 3          | 428,029               | +52%
MaxCPU     | not listed | 422,400               | +293%

Best Configuration: Default (3 CPUs) - 428,029 msgs/sec

Improvement: 52% higher throughput than X Platform 3.16 with Default configuration (428K vs 282K msgs/sec)
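
Only the Default baseline (282K msgs/sec) is quoted directly. As a rough cross-check, the X Platform 3.16 baseline for each row can be backed out from the Rumi 4.0 throughput and the reported percentage gain. This is a sketch: the percentages are rounded, so the derived baselines are approximate, and `implied_baseline` is an illustrative helper name.

```python
def implied_baseline(measured, pct_gain):
    """Baseline throughput implied by a measured value and its percentage gain."""
    return measured / (1 + pct_gain / 100.0)


# Direct access rows from the table above: (Rumi 4.0 msgs/sec, gain vs 3.16 in %).
DIRECT_ROWS = {
    "MinCPU": (421_498, 49),
    "Default": (428_029, 52),
    "MaxCPU": (422_400, 293),
}

if __name__ == "__main__":
    for cpu_config, (msgs_per_sec, gain) in DIRECT_ROWS.items():
        baseline = implied_baseline(msgs_per_sec, gain)
        print(f"{cpu_config:8} implied 3.16 baseline: ~{baseline / 1000:.0f}K msgs/sec")
```

The Default row works out to roughly 282K msgs/sec, matching the baseline quoted above.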

Performance Analysis

Key Improvements Over X Platform 3.16

Latency Reduction: 8-23% lower across all configurations
- Best improvement: 23% reduction with MinCPU
- Improved mechanical sympathy and reduced overhead
Throughput Increase: 19-293% higher depending on configuration
- Direct access: 49-293% improvement
- Indirect access: 19-73% improvement
- Largest gains in MaxCPU scenarios (+73% Indirect, +293% Direct)
Better CPU Efficiency: Rumi 4.0 requires fewer CPUs for optimal performance
- Latency: 2 CPUs optimal (vs 4 in X Platform 3.16)
- Throughput: Scales better across CPU configurations

Latency Characteristics

Optimal CPU Configuration: 2 CPUs (Default) provides best latency
- Reduced from 4 CPUs in X Platform 3.16
- More efficient thread coordination
- 16-18% better than MinCPU configuration
Direct vs Indirect Access: Direct access reduces latency by ~10%
- Median: 23.60µs (Direct) vs 26.31µs (Indirect)
- Consistent benefit across CPU configurations
Tail Latency: Excellent 99.9th percentile characteristics
- 99.9th percentile is 1.19x to 1.42x the median, depending on configuration
- Better consistency than X Platform 3.16
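
The ratios discussed in this section can be recomputed directly from the latency tables. The sketch below derives the 99.9th-percentile-to-median ratio for every configuration and the median gain of Default over MinCPU. The data is copied from the tables on this page; the function and variable names are illustrative.

```python
# (access method, CPU config) -> (50th, 99th, 99.9th percentile) in µs
LATENCY_US = {
    ("Indirect", "MinCPU"): (31.46, 34.09, 43.43),
    ("Indirect", "Default"): (26.31, 29.18, 37.39),
    ("Indirect", "MaxCPU"): (29.86, 31.12, 37.32),
    ("Direct", "MinCPU"): (28.90, 31.56, 38.61),
    ("Direct", "Default"): (23.60, 26.40, 31.98),
    ("Direct", "MaxCPU"): (26.68, 27.78, 31.86),
}


def tail_ratio(p50, p999):
    """How many times larger the 99.9th percentile is than the median."""
    return p999 / p50


def default_gain_over_mincpu(access):
    """Median latency reduction (%) of Default relative to MinCPU."""
    min_cpu = LATENCY_US[(access, "MinCPU")][0]
    default = LATENCY_US[(access, "Default")][0]
    return (min_cpu - default) / min_cpu * 100.0


if __name__ == "__main__":
    for (access, cpu), (p50, _p99, p999) in LATENCY_US.items():
        print(f"{access:8} {cpu:7} tail ratio {tail_ratio(p50, p999):.2f}x")
    for access in ("Indirect", "Direct"):
        print(f"{access:8} Default vs MinCPU: {default_gain_over_mincpu(access):.1f}% lower")
```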

Throughput Characteristics

Optimal CPU Configuration: Varies by access method
- Direct: 3 CPUs (Default) - 428K msgs/sec
- Indirect: 1 CPU (MinCPU) - 140K msgs/sec
Direct vs Indirect Access: Direct access provides 3x to 4.3x higher throughput, depending on CPU configuration
- Default config: 428K msgs/sec (Direct) vs 113K msgs/sec (Indirect), about 3.8x
- Zero-copy access eliminates serialization bottleneck
Better CPU Scaling: Gains over X Platform 3.16 grow as more CPUs are configured
- MaxCPU throughput: +73% (Indirect) and +293% (Direct) over X Platform 3.16
- Improved threading architecture reduces coordination overhead
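
The Direct-to-Indirect throughput ratio can be checked per CPU configuration using the numbers in the throughput tables. This is a small sketch with illustrative names; the data is copied from this page.

```python
# CPU config -> (Indirect msgs/sec, Direct msgs/sec), from the throughput tables.
THROUGHPUT = {
    "MinCPU": (139_875, 421_498),
    "Default": (113_377, 428_029),
    "MaxCPU": (97_441, 422_400),
}


def direct_to_indirect(cpu_config):
    """Ratio of Direct to Indirect throughput for one CPU configuration."""
    indirect, direct = THROUGHPUT[cpu_config]
    return direct / indirect


if __name__ == "__main__":
    for cpu_config in THROUGHPUT:
        print(f"{cpu_config:8} Direct is {direct_to_indirect(cpu_config):.2f}x Indirect")
```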

Tuning Recommendations

For Lowest Latency

Use Direct message access (serializer/deserializer objects)
Configure 2 CPUs (Default configuration)
Enable latency optimization mode
Expected: ~24µs median, ~26µs 99th percentile
Improvement over X Platform 3.16: 14% lower latency

For Highest Throughput

Use Direct message access (serializer/deserializer objects)
Configure 3 CPUs (Default configuration)
Enable throughput optimization mode
Expected: ~428K msgs/sec
Improvement over X Platform 3.16: 52% higher throughput
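
Both recommendations amount to a simple selection over the result tables: the lowest median latency and the highest throughput. A sketch of that selection follows, with the rows copied from this page. The `LatencyRow` and `ThroughputRow` types are hypothetical containers for illustration and are not part of any Rumi API.

```python
from typing import NamedTuple


class LatencyRow(NamedTuple):
    access: str
    cpu_config: str
    p50_us: float


class ThroughputRow(NamedTuple):
    access: str
    cpu_config: str
    msgs_per_sec: int


LATENCY_ROWS = [
    LatencyRow("Indirect", "MinCPU", 31.46),
    LatencyRow("Indirect", "Default", 26.31),
    LatencyRow("Indirect", "MaxCPU", 29.86),
    LatencyRow("Direct", "MinCPU", 28.90),
    LatencyRow("Direct", "Default", 23.60),
    LatencyRow("Direct", "MaxCPU", 26.68),
]

THROUGHPUT_ROWS = [
    ThroughputRow("Indirect", "MinCPU", 139_875),
    ThroughputRow("Indirect", "Default", 113_377),
    ThroughputRow("Indirect", "MaxCPU", 97_441),
    ThroughputRow("Direct", "MinCPU", 421_498),
    ThroughputRow("Direct", "Default", 428_029),
    ThroughputRow("Direct", "MaxCPU", 422_400),
]

# Lowest median latency and highest throughput across all tested configurations.
best_latency = min(LATENCY_ROWS, key=lambda row: row.p50_us)
best_throughput = max(THROUGHPUT_ROWS, key=lambda row: row.msgs_per_sec)

if __name__ == "__main__":
    print("Lowest latency:    ", best_latency)
    print("Highest throughput:", best_throughput)
```

Both selections land on Direct access with the Default CPU configuration, which is what the recommendations above state.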

Application-Specific Considerations

Heavier business logic: Will see larger benefits from Rumi 4.0 improvements
Complex message transformations: Direct access provides even larger benefits in 4.0
Multi-CPU workloads: Rumi 4.0's improved threading scales much better
Network-limited scenarios: VMA enablement can further reduce latency

Comparison Summary: Rumi 4.0 vs X Platform 3.16

Latency Comparison (Best Configurations)

Access Method | X Platform 3.16 (4 CPUs) | Rumi 4.0 (2 CPUs) | Improvement
Indirect      | 30.55µs                  | 26.31µs           | -14%
Direct        | 27.34µs                  | 23.60µs           | -14%

Throughput Comparison (Best Configurations)

Access Method | X Platform 3.16       | Rumi 4.0               | Improvement
Indirect      | 117K msgs/sec (1 CPU) | 140K msgs/sec (1 CPU)  | +19%
Direct        | 282K msgs/sec (1 CPU) | 428K msgs/sec (3 CPUs) | +52%

Key Takeaways

✅ Rumi 4.0 outperforms X Platform 3.16 in every tested configuration

8-23% lower latency (about 15% on average across CPU configurations)
19-293% higher throughput (about 85% on average across all tested configurations)

✅ More efficient CPU utilization

Optimal latency with fewer CPUs (2 vs 4)
Better scaling with more CPUs for throughput

✅ Improved mechanical sympathy

Better cache utilization
Reduced thread coordination overhead
More predictable tail latencies
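
The average throughput figure can be reproduced as the simple mean of the six per-configuration gains reported in the throughput tables. This sketch assumes an unweighted mean, which the page does not state explicitly.

```python
from statistics import mean

# Throughput gains vs X Platform 3.16 (%), from the Indirect and Direct tables.
THROUGHPUT_GAINS_PCT = {
    ("Indirect", "MinCPU"): 19,
    ("Indirect", "Default"): 25,
    ("Indirect", "MaxCPU"): 73,
    ("Direct", "MinCPU"): 49,
    ("Direct", "Default"): 52,
    ("Direct", "MaxCPU"): 293,
}

mean_gain = mean(THROUGHPUT_GAINS_PCT.values())
lowest_gain = min(THROUGHPUT_GAINS_PCT.values())
highest_gain = max(THROUGHPUT_GAINS_PCT.values())

if __name__ == "__main__":
    print(f"Mean gain: {mean_gain:.0f}% (range {lowest_gain}% to {highest_gain}%)")
```

Note that the single +293% MaxCPU result pulls the mean up considerably; the median of the six gains is closer to 50%.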

Architecture Improvements in Rumi 4.0

The performance improvements in Rumi 4.0 come from several key architectural enhancements:

Optimized Memory Layout: Improved data structure alignment for better cache utilization
Reduced Allocation: Fewer object allocations in critical paths
Better Threading: More efficient thread coordination and work distribution
Improved Serialization: Faster encoding/decoding with better mechanical sympathy
Lock-free Algorithms: Reduced contention in high-throughput scenarios

Next Steps

Review Test Description for complete test methodology
Return to Performance Overview
