Rumi 4.0.579-4.0.39

Performance results for Rumi 4.0.579-4.0.39 (Core: 4.0.579, Bindings: 4.0.39), based on testing conducted in February 2025.

Note: Rumi 4.0 is the successor to X Platform 3.16. The platform was rebranded from "X Platform" to "Rumi" starting with version 4.0. These results show significant performance improvements over X Platform 3.16.

Test Configuration

  • Rumi Core Version: 4.0.579

  • Rumi Bindings Version: 4.0.39

  • Java Runtime: Oracle Java 8

  • Message Encoding: Xbuf2

  • Cluster Configuration: Primary + Backup with persistence and replication

  • Test Hardware: Intel Xeon Gold 6334 (8-Core, 3.6 GHz), 128GB RAM, InfiniBand network

  • Message Rate (Latency Tests): 10,000 messages/second

  • Message Rate (Throughput Tests): As fast as possible (saturated)

See the Test Description for complete test methodology and hardware specifications.
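
To put the two message rates in perspective, the following Python sketch relates the fixed 10,000 msgs/sec latency-test rate to the best saturated throughput figures reported further down this page. It is a back-of-the-envelope illustration only: the latency and throughput tests use different OptimizeFor settings, so the load fractions are approximate, and the function names are made up for this example.

```python
# Figures copied from this page.
LATENCY_TEST_RATE = 10_000  # msgs/sec used in the latency tests
BEST_SATURATED_THROUGHPUT = {  # best msgs/sec per access method
    "Indirect": 139_875,
    "Direct": 428_029,
}


def inter_arrival_us(rate_per_sec):
    """Average gap between messages, in microseconds, at a fixed rate."""
    return 1_000_000 / rate_per_sec


def load_fraction(rate_per_sec, saturated_per_sec):
    """Fraction of saturated throughput that a fixed rate represents."""
    return rate_per_sec / saturated_per_sec


print(f"Inter-arrival gap: {inter_arrival_us(LATENCY_TEST_RATE):.0f} µs")
for access, saturated in BEST_SATURATED_THROUGHPUT.items():
    share = load_fraction(LATENCY_TEST_RATE, saturated) * 100
    print(f"{access}: latency tests run at about {share:.1f}% of saturated throughput")
```

The latency figures are therefore measured well below saturation; latency under heavier load is not covered by these tests.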

Latency Results

All latency numbers are in microseconds (µs). Round-trip wire latency (~23µs on unoptimized network) is included in all results.

Indirect Message Access | Latency Optimization

Configuration: Xbuf2.Indirect | OptimizeFor=Latency | Message Rate=10,000/sec

Message data accessed via POJO setter/getter methods.

CPU Config | # CPUs | 50th % (µs) | 99th % (µs) | 99.9th % (µs) | vs X Platform 3.16
MinCPU     | 1      | 31.46       | 34.09       | 43.43         | -23%
Default    | 2      | 26.31       | 29.18       | 37.39         | -14%
MaxCPU     | 6      | 29.86       | 31.12       | 37.32         | -8%

Best Configuration: Default (2 CPUs) - 26.31µs median latency

Direct Message Access | Latency Optimization

Configuration: Xbuf2.Direct | OptimizeFor=Latency | Message Rate=10,000/sec

Message data accessed via serializer/deserializer objects (zero-copy access).

CPU Config | # CPUs | 50th % (µs) | 99th % (µs) | 99.9th % (µs) | vs X Platform 3.16
MinCPU     | 1      | 28.90       | 31.56       | 38.61         | -23%
Default    | 2      | 23.60       | 26.40       | 31.98         | -14%
MaxCPU     | 6      | 26.68       | 27.78       | 31.86         | -8%

Best Configuration: Default (2 CPUs) - 23.60µs median latency

Throughput Results

Throughput measured in messages per second. Test mode: saturated load (as fast as possible).

Indirect Message Access | Throughput Optimization

Configuration: Xbuf2.Indirect | OptimizeFor=Throughput | Message Rate=[As Fast As Possible]

CPU Config | # CPUs | Throughput (msgs/sec) | vs X Platform 3.16
MinCPU     | 1      | 139,875               | +19%
Default    | 3      | 113,377               | +25%
MaxCPU     | 6      | 97,441                | +73%

Best Configuration: MinCPU (1 CPU) - 139,875 msgs/sec

Direct Message Access | Throughput Optimization

Configuration: Xbuf2.Direct | OptimizeFor=Throughput | Message Rate=[As Fast As Possible]

CPU Config | # CPUs | Throughput (msgs/sec) | vs X Platform 3.16
MinCPU     | 1      | 421,498               | +49%
Default    | 3      | 428,029               | +52%
MaxCPU     | 6      | 422,400               | +293%

Best Configuration: Default (3 CPUs) - 428,029 msgs/sec

Performance Analysis

Key Improvements Over X Platform 3.16

  1. Latency Reduction: 8-23% lower across the tested configurations

    • Best improvement: 23% reduction with MinCPU

    • Improved mechanical sympathy and reduced overhead

  2. Throughput Increase: 19-293% higher depending on configuration

    • Direct access: 49-293% improvement

    • Indirect access: 19-73% improvement

    • Significant improvements in MaxCPU scenarios

  3. Better CPU Efficiency: Rumi 4.0 requires fewer CPUs for optimal performance

    • Latency: 2 CPUs optimal (vs 4 in X Platform 3.16)

    • Throughput: Scales better across CPU configurations

Latency Characteristics

  1. Optimal CPU Configuration: 2 CPUs (Default) provides best latency

    • Reduced from 4 CPUs in X Platform 3.16

    • More efficient thread coordination

    • 16-18% better than MinCPU configuration

  2. Direct vs Indirect Access: Direct access reduces latency by ~10%

    • Median: 23.60µs (Direct) vs 26.31µs (Indirect)

    • Consistent benefit across CPU configurations

  3. Tail Latency: The 99.9th percentile stays close to the median

    • Between roughly 1.2x and 1.4x of median latency across the tested configurations

    • Better consistency than X Platform 3.16
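
The ratios discussed in this list can be rechecked directly from the two latency tables. The Python sketch below is one way to do that; the dictionary layout and function names are illustrative choices for this example, not part of any Rumi tooling.

```python
# Latency figures (µs) copied from the two latency tables above.
# Each entry: (50th percentile, 99th percentile, 99.9th percentile)
LATENCY_US = {
    ("Indirect", "MinCPU"): (31.46, 34.09, 43.43),
    ("Indirect", "Default"): (26.31, 29.18, 37.39),
    ("Indirect", "MaxCPU"): (29.86, 31.12, 37.32),
    ("Direct", "MinCPU"): (28.90, 31.56, 38.61),
    ("Direct", "Default"): (23.60, 26.40, 31.98),
    ("Direct", "MaxCPU"): (26.68, 27.78, 31.86),
}


def tail_ratio(median, p999):
    """Ratio of 99.9th percentile latency to median latency."""
    return p999 / median


def direct_saving_pct(cpu_config):
    """Percent reduction in median latency when moving from Indirect to Direct."""
    indirect = LATENCY_US[("Indirect", cpu_config)][0]
    direct = LATENCY_US[("Direct", cpu_config)][0]
    return (indirect - direct) / indirect * 100.0


for (access, cpu), (p50, _p99, p999) in LATENCY_US.items():
    print(f"{access:8s} {cpu:8s} p99.9/median = {tail_ratio(p50, p999):.2f}x")

for cpu in ("MinCPU", "Default", "MaxCPU"):
    print(f"{cpu:8s} Direct vs Indirect median: -{direct_saving_pct(cpu):.1f}%")
```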

Throughput Characteristics

  1. Optimal CPU Configuration: Varies by access method

    • Direct: 3 CPUs (Default) - 428K msgs/sec

    • Indirect: 1 CPU (MinCPU) - 140K msgs/sec

  2. Direct vs Indirect Access: Direct access provides roughly 3x to 4.3x the throughput of Indirect access

    • Default config: 428K msgs/sec (Direct) vs 113K msgs/sec (Indirect), about 3.8x

    • Zero-copy access eliminates serialization bottleneck

  3. Better CPU Scaling: The advantage over X Platform 3.16 widens as CPUs are added

    • Direct access: +49% at 1 CPU, +52% at 3 CPUs, +293% at 6 CPUs (MaxCPU)

    • Improved threading architecture reduces coordination overhead
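
The Direct-to-Indirect throughput ratio differs by CPU configuration, so it helps to compute it per row. The Python sketch below does this from the two throughput tables and also divides by the CPU count to show throughput per CPU; the names are illustrative and the per-CPU figure is a simple division, not a measured quantity.

```python
# Throughput figures (msgs/sec) copied from the two throughput tables above.
THROUGHPUT = {
    "MinCPU": {"cpus": 1, "Indirect": 139_875, "Direct": 421_498},
    "Default": {"cpus": 3, "Indirect": 113_377, "Direct": 428_029},
    "MaxCPU": {"cpus": 6, "Indirect": 97_441, "Direct": 422_400},
}


def direct_speedup(cpu_config):
    """How many times faster Direct access is than Indirect for one CPU config."""
    row = THROUGHPUT[cpu_config]
    return row["Direct"] / row["Indirect"]


def per_cpu(cpu_config, access):
    """Throughput divided by the number of CPUs in the configuration."""
    row = THROUGHPUT[cpu_config]
    return row[access] / row["cpus"]


for cfg in THROUGHPUT:
    print(
        f"{cfg:8s} Direct/Indirect = {direct_speedup(cfg):.2f}x, "
        f"Direct per CPU = {per_cpu(cfg, 'Direct'):,.0f} msgs/sec"
    )
```

In these results Direct throughput is nearly flat from 1 to 6 CPUs, so the per-CPU figure falls as CPUs are added.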

Tuning Recommendations

For Lowest Latency

  1. Use Direct message access (serializer/deserializer objects)

  2. Configure 2 CPUs (Default configuration)

  3. Enable latency optimization mode

  4. Expected: ~24µs median, ~26µs 99th percentile

  5. Improvement over X Platform 3.16: 14% lower latency

For Highest Throughput

  1. Use Direct message access (serializer/deserializer objects)

  2. Configure 3 CPUs (Default configuration)

  3. Enable throughput optimization mode

  4. Expected: ~428K msgs/sec

  5. Improvement over X Platform 3.16: 52% higher throughput

Application-Specific Considerations

  • Heavier business logic: Will see larger benefits from Rumi 4.0 improvements

  • Complex message transformations: Direct access provides even larger benefits in 4.0

  • Multi-CPU workloads: Rumi 4.0's improved threading scales much better

  • Network-limited scenarios: VMA enablement can further reduce latency

Comparison Summary: Rumi 4.0 vs X Platform 3.16

Latency Comparison (Best Configurations)

Access Method | X Platform 3.16 (4 CPUs) | Rumi 4.0 (2 CPUs) | Improvement
Indirect      | 30.55µs                  | 26.31µs           | -14%
Direct        | 27.34µs                  | 23.60µs           | -14%

Throughput Comparison (Best Configurations)

Access Method | X Platform 3.16       | Rumi 4.0               | Improvement
Indirect      | 117K msgs/sec (1 CPU) | 140K msgs/sec (1 CPU)  | +19%
Direct        | 282K msgs/sec (1 CPU) | 428K msgs/sec (3 CPUs) | +52%
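
The Improvement columns are consistent with the usual relative-change formula, (new - old) / old. The Python sketch below rechecks the latency rows and backs out the implied X Platform 3.16 throughput from a Rumi 1-CPU figure and its quoted percentage. The function names are illustrative, and the source does not state how its percentages were rounded.

```python
def pct_change(baseline, current):
    """Relative change from baseline to current, in percent."""
    return (current - baseline) / baseline * 100.0


def implied_baseline(current, pct):
    """Baseline value that yields `current` after a `pct` percent change."""
    return current / (1.0 + pct / 100.0)


# Median latency (µs): X Platform 3.16 at 4 CPUs vs Rumi 4.0 at 2 CPUs.
print(f"Indirect latency: {pct_change(30.55, 26.31):+.1f}%")
print(f"Direct latency:   {pct_change(27.34, 23.60):+.1f}%")

# Throughput: back out the 3.16 figures from the Rumi 1-CPU results.
print(f"Indirect 3.16 baseline: {implied_baseline(139_875, 19):,.0f} msgs/sec")
print(f"Direct 3.16 baseline:   {implied_baseline(421_498, 49):,.0f} msgs/sec")
```

Both latency rows come out just under 14%, which rounds to the -14% shown.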

Key Takeaways

Rumi 4.0 outperforms X Platform 3.16 in every tested configuration

  • About 15% lower latency on average (ranging 8-23%)

  • About 85% higher throughput on average (ranging 19-293%)

More efficient CPU utilization

  • Optimal latency with fewer CPUs (2 vs 4)

  • Better scaling with more CPUs for throughput

Improved mechanical sympathy

  • Better cache utilization

  • Reduced thread coordination overhead

  • More predictable tail latencies
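
The summary ranges and averages can be recomputed from the "vs X Platform 3.16" columns of the result tables. The Python sketch below uses a simple unweighted mean over the table rows; the source does not say how its averages were derived, so a different weighting may have been used there.

```python
from statistics import mean

# "vs X Platform 3.16" columns copied from the result tables (percent).
# The latency deltas are identical for Indirect and Direct access.
LATENCY_DELTA = {"MinCPU": -23, "Default": -14, "MaxCPU": -8}
THROUGHPUT_DELTA = {
    ("Indirect", "MinCPU"): 19,
    ("Indirect", "Default"): 25,
    ("Indirect", "MaxCPU"): 73,
    ("Direct", "MinCPU"): 49,
    ("Direct", "Default"): 52,
    ("Direct", "MaxCPU"): 293,
}


def summarize(deltas):
    """Return (smallest, unweighted mean, largest) of the percent changes."""
    values = list(deltas)
    return min(values), mean(values), max(values)


lat_lo, lat_avg, lat_hi = summarize(LATENCY_DELTA.values())
thr_lo, thr_avg, thr_hi = summarize(THROUGHPUT_DELTA.values())
print(f"Latency change:    {lat_lo}% to {lat_hi}%, mean {lat_avg:.1f}%")
print(f"Throughput change: +{thr_lo}% to +{thr_hi}%, mean +{thr_avg:.1f}%")
```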

Architecture Improvements in Rumi 4.0

The performance improvements in Rumi 4.0 come from several key architectural enhancements:

  1. Optimized Memory Layout: Improved data structure alignment for better cache utilization

  2. Reduced Allocation: Fewer object allocations in critical paths

  3. Better Threading: More efficient thread coordination and work distribution

  4. Improved Serialization: Faster encoding/decoding with better mechanical sympathy

  5. Lock-free Algorithms: Reduced contention in high-throughput scenarios
