Performance results for Rumi 4.0.579-4.0.39 (Core: 4.0.579, Bindings: 4.0.39), based on testing conducted in February 2025.
Test Configuration
Rumi Core Version: 4.0.579
Rumi Bindings Version: 4.0.39
Java Runtime: Oracle Java 8
Cluster Configuration: Primary + Backup with persistence and replication
Test Hardware: Intel Xeon Gold 6334 (8-Core, 3.6 GHz), 128GB RAM, InfiniBand network
Message Rate (Latency Tests): 10,000 messages/second
Message Rate (Throughput Tests): As fast as possible (saturated)
See the Test Description for complete test methodology and hardware specifications.
Latency Results
All latency numbers are in microseconds (µs). Round-trip wire latency (~23µs on an unoptimized network) is included in all results.
Indirect Message Access | Latency Optimization
Configuration: Xbuf2.Indirect | OptimizeFor=Latency | Message Rate=10,000/sec
Message data accessed via POJO setter/getter methods.
CPU Config | # CPUs | 50th % (µs) | 99th % (µs) | 99.9th % (µs) | vs X Platform 3.16
Best Configuration: Default (2 CPUs) - 26.31µs median latency
Improvement: 14% lower latency than X Platform 3.16 with Default configuration (26.31µs vs 30.55µs)
Direct Message Access | Latency Optimization
Configuration: Xbuf2.Direct | OptimizeFor=Latency | Message Rate=10,000/sec
Message data accessed via serializer/deserializer objects (zero-copy access).
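The difference between the two access styles can be sketched in plain Java. This is a hypothetical illustration only (the class names `OrderPojo` and `OrderView` and the field layout are invented, not the Xbuf2 API): indirect access deserializes each message into a POJO, paying an allocation and a copy per message, while direct access reads fields in place from the wire buffer through a reusable view.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the two access patterns; the real Xbuf2 API differs.
public class AccessStyles {
    // Assumed wire layout: 8-byte orderId at offset 0, 8-byte price at offset 8.
    static ByteBuffer encode(long orderId, long price) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(0, orderId);
        buf.putLong(8, price);
        return buf;
    }

    // Indirect style: deserialize into a POJO, then use getters.
    // Costs one allocation and a full copy per message.
    static final class OrderPojo {
        final long orderId, price;
        OrderPojo(ByteBuffer buf) { orderId = buf.getLong(0); price = buf.getLong(8); }
    }

    // Direct style: a reusable view reads fields in place --
    // zero copy and zero per-message allocation.
    static final class OrderView {
        private ByteBuffer buf;
        OrderView wrap(ByteBuffer b) { this.buf = b; return this; }
        long orderId() { return buf.getLong(0); }
        long price()   { return buf.getLong(8); }
    }

    public static void main(String[] args) {
        ByteBuffer wire = encode(42L, 101_50L);
        OrderPojo pojo = new OrderPojo(wire);        // indirect: copy out
        OrderView view = new OrderView().wrap(wire); // direct: read in place
        System.out.println(pojo.orderId + " " + view.orderId()); // both read 42
    }
}
```

The view can be wrapped around each incoming buffer in turn, which is why the direct style avoids the per-message serialization cost measured below.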
CPU Config | # CPUs | 50th % (µs) | 99th % (µs) | 99.9th % (µs) | vs X Platform 3.16
Best Configuration: Default (2 CPUs) - 23.60µs median latency
Improvement: 14% lower latency than X Platform 3.16 with Default configuration (23.60µs vs 27.34µs)
Throughput Results
Throughput measured in messages per second. Test mode: saturated load (as fast as possible).
Indirect Message Access | Throughput Optimization
Configuration: Xbuf2.Indirect | OptimizeFor=Throughput | Message Rate=[As Fast As Possible]
CPU Config | # CPUs | Throughput (msgs/sec) | vs X Platform 3.16
Best Configuration: MinCPU (1 CPU) - 139,875 msgs/sec
Improvement: 19% higher throughput than X Platform 3.16 with MinCPU configuration
Direct Message Access | Throughput Optimization
Configuration: Xbuf2.Direct | OptimizeFor=Throughput | Message Rate=[As Fast As Possible]
CPU Config | # CPUs | Throughput (msgs/sec) | vs X Platform 3.16
Best Configuration: Default (3 CPUs) - 428,029 msgs/sec
Improvement: 52% higher throughput than X Platform 3.16 with Default configuration (428K vs 282K msgs/sec)
Key Improvements
Latency Reduction: 10-30% lower across all configurations
Best improvement: 23% reduction with MinCPU
Improved mechanical sympathy and reduced overhead
Throughput Increase: 20-300% higher depending on configuration
Direct access: 49-293% improvement
Indirect access: 19-73% improvement
Significant improvements in MaxCPU scenarios
Better CPU Efficiency: Rumi 4.0 requires fewer CPUs for optimal performance
Latency: 2 CPUs optimal (vs 4 in X Platform 3.16)
Throughput: Scales better across CPU configurations
Latency Characteristics
Optimal CPU Configuration: 2 CPUs (Default) provides the best latency
Reduced from 4 CPUs in X Platform 3.16
More efficient thread coordination
16-18% better than MinCPU configuration
Direct vs Indirect Access: Direct access reduces latency by ~10%
Median: 23.60µs (Direct) vs 26.31µs (Indirect)
Consistent benefit across CPU configurations
Tail Latency: Excellent 99.9th percentile characteristics
Within 1.35x of median latency
Better consistency than X Platform 3.16
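The percentile figures above (p50, p99, p99.9) and the 1.35x tail check can be reproduced from raw timing samples with a simple nearest-rank percentile. A minimal, self-contained sketch (the sample data here is synthetic, not from the benchmark):

```java
import java.util.Arrays;

public class Percentiles {
    // Nearest-rank percentile: smallest value with at least p% of samples <= it.
    // The input array must already be sorted ascending.
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Synthetic latency samples in microseconds, for illustration only.
        long[] lat = new long[1000];
        for (int i = 0; i < lat.length; i++) lat[i] = 20 + i % 15;
        Arrays.sort(lat);
        long p50  = percentile(lat, 50.0);
        long p99  = percentile(lat, 99.0);
        long p999 = percentile(lat, 99.9);
        // The tail-latency criterion quoted above: p99.9 within 1.35x of median.
        System.out.println(p50 + " " + p99 + " " + p999 + " tailOk=" + (p999 <= 1.35 * p50));
    }
}
```

Real measurements would replace the synthetic array with timestamps captured per message at the 10,000 msgs/sec test rate.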
Throughput Characteristics
Optimal CPU Configuration: Varies by access method
Direct: 3 CPUs (Default) - 428K msgs/sec
Indirect: 1 CPU (MinCPU) - 140K msgs/sec
Direct vs Indirect Access: Direct access delivers roughly 3.8x the Indirect throughput
Default config: 428K msgs/sec (Direct) vs 113K msgs/sec (Indirect)
Zero-copy access eliminates serialization bottleneck
Better CPU Scaling: Rumi 4.0 scales much better with more CPUs
MaxCPU throughput: 293% improvement over X Platform 3.16
Improved threading architecture reduces coordination overhead
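The comparison ratios quoted in this report can be recomputed directly from the published figures. A quick sanity check (numbers taken from the result sections above; nothing here is new data):

```java
public class RatioCheck {
    // Percentage by which a exceeds b.
    static double pctHigher(double a, double b) { return (a - b) / b * 100.0; }

    // Plain ratio a/b.
    static double ratio(double a, double b) { return a / b; }

    public static void main(String[] args) {
        // Figures from this report (msgs/sec), Default configurations.
        double rumiDirect   = 428_029; // Rumi 4.0, Direct, 3 CPUs
        double rumiIndirect = 113_000; // Rumi 4.0, Indirect (approx.)
        double x316Direct   = 282_000; // X Platform 3.16, Direct (approx.)

        System.out.printf("Direct vs Indirect: %.1fx%n", ratio(rumiDirect, rumiIndirect));
        System.out.printf("Rumi 4.0 vs X 3.16 (Direct): %.0f%% higher%n",
                pctHigher(rumiDirect, x316Direct));
    }
}
```

This reproduces the ~3.8x Direct-vs-Indirect ratio and the 52% Direct-throughput improvement over X Platform 3.16.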
Tuning Recommendations
For Lowest Latency
Use Direct message access (serializer/deserializer objects)
Configure 2 CPUs (Default configuration)
Enable latency optimization mode
Expected: ~24µs median, ~26µs 99th percentile
Improvement over X Platform 3.16: 14% lower latency
For Highest Throughput
Use Direct message access (serializer/deserializer objects)
Configure 3 CPUs (Default configuration)
Enable throughput optimization mode
Improvement over X Platform 3.16: 52% higher throughput
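The two recommended profiles can be captured as configuration sets. The property key names below are illustrative only, modeled on the labels used in this report (Xbuf2.Direct, OptimizeFor, CPU count); consult the Rumi configuration reference for the actual key names and syntax.

```java
import java.util.Properties;

public class TuningProfiles {
    // NOTE: hypothetical keys for illustration -- not the real Rumi property names.
    static Properties lowestLatency() {
        Properties p = new Properties();
        p.setProperty("message.access", "Xbuf2.Direct"); // zero-copy access
        p.setProperty("optimize.for",   "Latency");
        p.setProperty("cpu.count",      "2");            // Default latency config
        return p;
    }

    static Properties highestThroughput() {
        Properties p = new Properties();
        p.setProperty("message.access", "Xbuf2.Direct");
        p.setProperty("optimize.for",   "Throughput");
        p.setProperty("cpu.count",      "3");            // Default throughput config
        return p;
    }

    public static void main(String[] args) {
        System.out.println(lowestLatency());
        System.out.println(highestThroughput());
    }
}
```

Both profiles share Direct access; only the optimization mode and CPU count differ.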
Application-Specific Considerations
Heavier business logic: will see larger benefits from Rumi 4.0 improvements
Complex message transformations: Direct access provides even larger benefits in 4.0
Multi-CPU workloads: Rumi 4.0's improved threading scales much better
Network-limited scenarios: VMA enablement can further reduce latency
Latency Comparison (Best Configurations)
Access Method | X Platform 3.16 (4 CPUs) | Rumi 4.0 (2 CPUs) | Improvement
Throughput Comparison (Best Configurations)
Access Method | X Platform 3.16 | Rumi 4.0 | Improvement
✅ Rumi 4.0 is better than X Platform 3.16 across the board
Average 20% lower latency (ranging 10-30%)
Average 85% higher throughput (ranging 20-300%)
✅ More efficient CPU utilization
Optimal latency with fewer CPUs (2 vs 4)
Better scaling with more CPUs for throughput
✅ Improved mechanical sympathy
Reduced thread coordination overhead
More predictable tail latencies
Architecture Improvements in Rumi 4.0
The performance improvements in Rumi 4.0 come from several key architectural enhancements:
Optimized Memory Layout: Improved data structure alignment for better cache utilization
Reduced Allocation: Fewer object allocations in critical paths
Better Threading: More efficient thread coordination and work distribution
Improved Serialization: Faster encoding/decoding with better mechanical sympathy
Lock-free Algorithms: Reduced contention in high-throughput scenarios
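The lock-free point can be illustrated with the classic compare-and-set retry loop. This is a generic sketch of the technique, not Rumi's internals: threads never block on a lock; a failed CAS simply retries, so contention costs a retry rather than a context switch.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LockFreeDemo {
    static final AtomicLong counter = new AtomicLong();

    // Lock-free add: compute the new value, then attempt to publish it with
    // compareAndSet; if another thread won the race, loop and retry.
    static long addViaCas(long delta) {
        long prev, next;
        do {
            prev = counter.get();
            next = prev + delta;
        } while (!counter.compareAndSet(prev, next));
        return next;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int n = 0; n < 10_000; n++) addViaCas(1); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // No updates are lost and no thread ever held a lock.
        System.out.println(counter.get()); // 40000
    }
}
```

Under light contention this behaves like an uncontended increment; under heavy contention it degrades gracefully instead of serializing all threads behind a mutex.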