Container Stats and Heartbeats

A running container continuously collects raw statistics. The container can also be configured to start a background thread that periodically does the following:

Performs higher-level statistical computations, such as calculating message rates and average latencies
Emits heartbeat messages to be processed by handlers
Optionally outputs rendered stats to a trace logger, which is useful in testing and diagnostic situations
Optionally writes heartbeat messages containing useful container-wide statistics to a binary transaction log (with zero steady-state allocations), which is useful for zero-garbage capture of performance data in production

The raw metrics collected by the container are used by the background statistical thread for its computations and can also be retrieved programmatically by an application for its own use.
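
As a rough illustration of the kind of higher-level computation described above, the sketch below derives a message rate and an average latency from two snapshots of raw counters. The IntervalStats and RawSnapshot classes and their fields are hypothetical and are not part of the platform API; the statistics thread's actual calculations may differ.

```java
// Hypothetical sketch: derive per-interval statistics from raw counters.
// None of these types are platform APIs.
final class IntervalStats {

    /** A snapshot of raw, monotonically increasing counters (hypothetical). */
    static final class RawSnapshot {
        final long timestampNanos;     // when the snapshot was taken
        final long messageCount;       // total messages processed so far
        final long totalLatencyNanos;  // sum of all recorded latencies so far

        RawSnapshot(long timestampNanos, long messageCount, long totalLatencyNanos) {
            this.timestampNanos = timestampNanos;
            this.messageCount = messageCount;
            this.totalLatencyNanos = totalLatencyNanos;
        }
    }

    /** Messages per second between two snapshots; 0 if no time elapsed. */
    static double messageRate(RawSnapshot prev, RawSnapshot curr) {
        long elapsed = curr.timestampNanos - prev.timestampNanos;
        if (elapsed <= 0) {
            return 0.0;
        }
        return (curr.messageCount - prev.messageCount) * 1_000_000_000.0 / elapsed;
    }

    /** Average latency in nanoseconds for the interval's messages; 0 if none. */
    static double averageLatencyNanos(RawSnapshot prev, RawSnapshot curr) {
        long messages = curr.messageCount - prev.messageCount;
        if (messages <= 0) {
            return 0.0;
        }
        return (double) (curr.totalLatencyNanos - prev.totalLatencyNanos) / messages;
    }

    public static void main(String[] args) {
        // Two snapshots taken 5 seconds apart (made-up values).
        RawSnapshot prev = new RawSnapshot(0L, 1_000L, 50_000_000L);
        RawSnapshot curr = new RawSnapshot(5_000_000_000L, 6_000L, 425_000_000L);
        System.out.println("rate (msgs/sec): " + messageRate(prev, curr));
        System.out.println("avg latency (ns): " + averageLatencyNanos(prev, curr));
    }
}
```

Working from deltas between consecutive snapshots keeps the collection path cheap: the hot path only increments counters, and the division happens on the background thread.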

This document describes:

How to enable and configure container stats collection and emission
The higher level statistics calculations performed by the statistics thread
The format of the output of the statistics thread

Configuring Heartbeats

Heartbeats must be configured in your DDL to enable statistics collection. For complete configuration details including all parameters, collection settings, and logging/tracing options, see:

Configuring Monitoring - Complete heartbeat and statistics configuration guide

This page focuses on understanding and interpreting the heartbeat output once configured.

What Heartbeats Contain

Heartbeats contain several categories of statistics:

System Stats: CPU, memory, disk, threads, GC
Thread Stats: Per-thread CPU utilization and affinitization
Pool Stats: Object pool usage and depletion
Engine Stats: AEP engine metrics (see AEP Engine Statistics)
User Stats: Application-defined statistics (see Exposing Application Stats)

Consuming Container Heartbeats

When heartbeats are enabled, they can be consumed in several ways:

Heartbeat Event Handlers

Your application can register an event handler for container heartbeats to handle them in process:

@EventHandler
public void onHeartbeat(SrvMonHeartbeatMessage message) {
   // Your logic here:
   // - You could emit over an SMA message bus
   // - log to a time series database
   // etc, etc.
}

See the SrvMonHeartbeatMessage JavaDoc for API details.

Admin Clients

Administrative and monitoring tools can connect to a container via a direct admin connection over TCP to listen for heartbeats for monitoring purposes. The container's stats thread will queue copies of each emitted heartbeat to each connected admin client.

Heartbeat Trace Output

Heartbeat trace is emitted to the nv.server.heartbeat logger at a level of INFO. Trace is only emitted for the types of heartbeat trace for which tracing has been enabled.

For configuration details on enabling trace output for different statistic types, see Configuring Monitoring.

This section explains how to interpret the trace output for each type of heartbeat statistic.

See Also: Trace Logging for general information on trace logging.

System Stats

Sample Trace Output:

[System Stats]
Sat May 13 12:14:03 PDT 2017 'market' container (pid=54449) 2 apps (collection time=0 ns)
System: 20 processors, load average: 0.73 (load 0.10 process, 0.10 total system)
Memory (system): 94.4G total, 89.8G free, 5.5G committed (Swap: 96.6G total, 96.6G free)
Memory (proc): HEAP 1.5G init, 522M used, 1.5G commit, 1.5G max NON-HEAP 2M init, 47M used, 48M commit, 0K max
Disk:
  [/ Total: 49.2GB, Usable:  18GB, Free:  18GB]
  [/dev/shm Total: 47.2GB, Usable: 47.2GB, Free: 47.2GB]
  [/boot Total: 484.2MB, Usable: 422.4MB, Free: 422.4MB]
  [/home Total: 405.2GB, Usable: 267GB, Free: 267GB]
  [/distributions Total: 196.9GB, Usable: 8.1GB, Free: 8.1GB]
Threads: 20 total (16 daemon) 21 peak
JIT: HotSpot 64-Bit Tiered Compilers, time: 2959 ms
GC:
...ParNew [0 collections, commulative time: 0 ms]
...MarkSweepCompact [1 collections, commulative time: 54 ms]

The above trace can be interpreted as follows:

General Info

Date and time that statistics gathering started
Container name
Container process ID (PID)
Number of apps running in the container
Time spent gathering container statistics (for the current interval, excluding formatting)

System Info

Number of available processors
System load average

Memory Info

For the entire system:

Total memory
Free memory
Committed memory
Swap space (total and free)

For the process:

Initial heap size
Heap used
Heap committed
Max heap size
Initial non-heap size
Non-heap memory used
Non-heap memory committed
Non-heap memory max size

Reference: For more info regarding the process statistics above, you can reference the Oracle JavaDoc on MemoryUsage.

Note: JDK 7 or newer is needed to collect all available memory stats. In addition, some stats are not available on all JVMs.
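
For readers who want to see where figures of this shape come from, the following sketch reads the heap and non-heap values through the standard java.lang.management API and prints them in the same init/used/commit/max order as the trace. It is an illustration only: the ProcessMemory class is hypothetical, values are printed in raw bytes rather than the abbreviated form used in the trace, and it is an assumption that the container gathers its numbers in this way.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not a platform class.
final class ProcessMemory {

    /** Formats a MemoryUsage in init/used/commit/max order, in bytes. */
    static String describe(String label, MemoryUsage usage) {
        // getInit() and getMax() return -1 when the value is undefined.
        return label + " " + usage.getInit() + " init, "
                + usage.getUsed() + " used, "
                + usage.getCommitted() + " commit, "
                + usage.getMax() + " max";
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println(describe("HEAP", memory.getHeapMemoryUsage()));
        System.out.println(describe("NON-HEAP", memory.getNonHeapMemoryUsage()));
    }
}
```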

Disk

For each volume available:

Total space
Usable space
Free space

Note: Listing of disk system roots requires JDK 7 or newer. With JDK 6 or older, some disk information may not be available.
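
The sketch below shows one way to obtain per-volume total, usable, and free space with the java.nio.file API introduced in JDK 7. The DiskSpace class is hypothetical, values are printed in raw bytes, and mapping the trace's "Free" figure to getUnallocatedSpace() is an assumption rather than a statement about how the container computes it.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileSystems;

// Hypothetical helper, not a platform class.
final class DiskSpace {

    /** Formats one volume in a layout similar to the trace, in bytes. */
    static String describe(String name, long total, long usable, long free) {
        return "[" + name + " Total: " + total + ", Usable: " + usable + ", Free: " + free + "]";
    }

    public static void main(String[] args) {
        for (FileStore store : FileSystems.getDefault().getFileStores()) {
            try {
                System.out.println(describe(
                        store.toString(),
                        store.getTotalSpace(),
                        store.getUsableSpace(),          // space available to this JVM
                        store.getUnallocatedSpace()));   // assumed to correspond to "Free"
            } catch (IOException e) {
                // Some stores cannot be queried (for example, removable media).
                System.out.println("[" + store + " unavailable: " + e.getMessage() + "]");
            }
        }
    }
}
```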

Thread Info

Total thread count
Daemon thread count
Peak thread count

JIT Info

JIT name
Total compilation time

Tip: Compare the total compilation time of two consecutive heartbeats to determine whether JIT compilation occurred in the interval.

GC Info

Collection count (for each collector)
Cumulative collection time (for each collector)

Tip: Compare the collection counts of two consecutive heartbeats to determine whether a GC occurred in the interval.
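
The comparison suggested in the tip can be sketched as follows. The example takes two snapshots of the cumulative per-collector counts exposed by the standard GarbageCollectorMXBean and subtracts them; the same approach applies to the total compilation time reported for the JIT. The GcDelta class is hypothetical, and the short sleep merely stands in for one heartbeat interval.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a platform class.
final class GcDelta {

    /** Captures the cumulative collection count of each collector, keyed by name. */
    static Map<String, Long> snapshotCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            counts.put(gc.getName(), gc.getCollectionCount()); // -1 if undefined
        }
        return counts;
    }

    /** Returns, per collector, the number of collections between two snapshots. */
    static Map<String, Long> delta(Map<String, Long> prev, Map<String, Long> curr) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : curr.entrySet()) {
            long before = prev.getOrDefault(entry.getKey(), 0L);
            result.put(entry.getKey(), entry.getValue() - before);
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> first = snapshotCounts();
        Thread.sleep(200); // stand-in for one heartbeat interval
        Map<String, Long> second = snapshotCounts();
        for (Map.Entry<String, Long> entry : delta(first, second).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue() + " collections in interval");
        }
    }
}
```

A non-zero delta for a collector means at least one collection by that collector completed between the two snapshots.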

Thread Stats

Since 3.7

Sample Trace Output:

[Thread Stats]
ID    CPU       DCPU    DUSER   CPU%  USER% WAIT% STATE           NAME
1     6.0s      982.8us 0       1     0     0     RUNNABLE        X-Server-blackbird1-Main (aff=[])
2     9.3ms     0       0       0     0     0     WAITING         Reference Handler
3     8.7ms     0       0       0     0     0     WAITING         Finalizer
4     43.8us    0       0       0     0     0     RUNNABLE        Signal Dispatcher
23    53.9ms    722.7us 0       1     0     0     RUNNABLE        X-EDP-McastReceiver (aff=[1(s0c1t0)])
24    26.3ms    426.5us 0       1     0     0     TIMED_WAITING   X-EDP-Timer (aff=[1(s0c1t0)])
26    1.9s      33.9ms  30.0ms  1     1     0     RUNNABLE        X-Server-blackbird1-StatsRunner (aff=[1(s0c1t0)])
28    6.9m      10.2s   4.8s    100   48    0     RUNNABLE        X-Server-blackbird1-IOThread-1 (aff=[8(s0c11t0)])
30    236.6us   0       0       0     0     0     TIMED_WAITING   X-EventMultiplexer-Wakeup-admin (aff=[1(s0c1t0)])
34    685.4ms   11.5ms  0       1     0     0     TIMED_WAITING   X-EventMultiplexer-Wakeup-blackbird (aff=[1(s0c1t0)])
35    9.2m      10.3s   10.3s   100   100   100   RUNNABLE        X-ODS-StoreLog-blackbird-1 (aff=[4(s0c4t0)])
40    9.2m      10.3s   10.3s   100   100   0     RUNNABLE        SorProcessor (aff=[5(s0c8t0)])
41    11.7ms    0       0       0     0     100   WAITING         X-STEMux-admin-1 (aff=[])
42    9.0m      10.3s   10.2s   100   99    90    RUNNABLE        X-STEMux-blackbird-2 (aff=[2(s0c2t0)])
43    7.0m      10.2s   4.8s    100   47    0     RUNNABLE        X-ODS-StoreReplicatorLinkReader-myapp-93323c0d-5e4c-48d7-8cd4-f251963a6310 (aff=[3(s0c3t0)])
44    52.0ms    973.7us 0       1     0     0     RUNNABLE        X-ODS-StoreLinkAcceptor-1 (aff=[1(s0c1t0)])
45    58.9ms    1.0ms   0       1     0     0     RUNNABLE        X-EDP-McastReceiver (aff=[1(s0c1t0)])
46    41.9ms    592.2us 0       1     0     0     TIMED_WAITING   X-EDP-Timer (aff=[1(s0c1t0)])
48    9.1m      10.3s   10.1s   100   98    98    RUNNABLE        X-AEP-BusManager-IO-blackbird.market (aff=[7(s0c10t0)])
49    1.1s      0       0       0     0     0     RUNNABLE        X-Client-LinkManagerReader[c43b3977-572f-4366-8524-f17678e71515] (aff=[9(s0c12t0)])
50    9.1m      10.3s   10.3s   100   100   93    RUNNABLE        X-AEP-BusManager-IO-blackbird.blackbird (aff=[6(s0c9t0)])

The columns can be interpreted as follows:

ID: The thread's ID.

CPU: The total amount of time, in nanoseconds, that the thread has executed (as reported by the JMX thread bean).

DCPU: The amount of time, in nanoseconds, that the thread has executed in user mode or system mode in the given interval (as reported by the JMX thread bean).

DUSER: The amount of time, in nanoseconds, that the thread has executed in user mode in the given interval (as reported by the JMX thread bean).

CPU%: The percentage of CPU time the thread used during the interval (i.e., DCPU * 100 / interval time).

USER%: The percentage of user mode CPU time the thread used during the interval (i.e., DUSER * 100 / interval time).

WAIT%: The percentage of the interval during which the thread was recorded in a wait state such as a busy spin loop or a disruptor wait. The platform captures wait times through code instrumentation that takes a timestamp on entering and on exiting the wait condition. Unlike CPU% and USER%, this percentage can therefore include time when the thread was not scheduled and was not consuming CPU resources, so it is generally not possible to subtract WAIT% from CPU% to calculate how long the thread actually executed. For example, if CPU% is 50, WAIT% is 50, and the interval is 5 seconds, it could be that 2.5 seconds of real work was done and the 2.5 seconds of wait time occurred while the thread was context switched out, or it could be that all 2.5 seconds of wait time coincided with the 2.5 seconds of CPU time, in which case all of the CPU time was spent busy spinning. In other words, WAIT% gives a definitive indication of time during which the thread was not doing active work; how the remaining CPU time was spent is at the mercy of the operating system's thread scheduler.

STATE: The thread's state (for example, RUNNABLE or WAITING) at the time of collection.

NAME: The thread name. When affinitization is enabled and the thread has been affinitized, the affinitization information is appended to the thread name.

Tip: This is useful when trying to determine whether a thread should be affinitized. A busy spinning thread will typically have a CPU% of ~100. If the thread is not affinitized, it might be a good candidate.

Affinity: The affinity summary string reported along with individual thread stats does not have a column of its own, because the affinitizer appends it to the thread name.
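
The CPU% and USER% formulas and the WAIT% example above can be worked through in a few lines. In the sketch below, the ThreadIntervalMath class is hypothetical, the use of integer division is an assumption about rounding, and the bounds calculation is plain interval arithmetic on the example rather than something the platform reports. In a live JVM, the raw inputs could come from ThreadMXBean.getThreadCpuTime(id) and ThreadMXBean.getThreadUserTime(id) sampled at the start and end of an interval.

```java
// Hypothetical helper, not a platform class.
final class ThreadIntervalMath {

    /** CPU% or USER%: delta time * 100 / interval time (integer division assumed). */
    static long percent(long deltaNanos, long intervalNanos) {
        if (intervalNanos <= 0) {
            return 0;
        }
        return deltaNanos * 100 / intervalNanos;
    }

    /**
     * Lower and upper bounds, in nanoseconds, on the CPU time that was not
     * spent inside an instrumented wait. The overlap between CPU time and
     * wait time is unknown, so only a range can be derived.
     */
    static long[] activeWorkBounds(long cpuNanos, long waitNanos, long intervalNanos) {
        // Most overlap: as much of the wait as possible was spent spinning on CPU.
        long maxOverlap = Math.min(cpuNanos, waitNanos);
        // Least overlap: the wait happened off CPU wherever the interval allows.
        long minOverlap = Math.max(0L, cpuNanos + waitNanos - intervalNanos);
        return new long[] {cpuNanos - maxOverlap, cpuNanos - minOverlap};
    }

    public static void main(String[] args) {
        long interval = 5_000_000_000L; // 5 second interval
        long dcpu = 2_500_000_000L;     // 2.5 s on CPU, so CPU% is 50
        long wait = 2_500_000_000L;     // 2.5 s in a wait state, so WAIT% is 50
        System.out.println("CPU%  = " + percent(dcpu, interval));
        System.out.println("WAIT% = " + percent(wait, interval));
        long[] bounds = activeWorkBounds(dcpu, wait, interval);
        System.out.println("active work between " + bounds[0] + " ns and " + bounds[1] + " ns");
    }
}
```

For the 50/50 example, the range runs from 0 (all CPU time was busy spinning) to 2.5 seconds (all waiting happened while the thread was switched out), which is why WAIT% cannot simply be subtracted from CPU%.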

CPU times are reported in the most appropriate short form for the following units (abbreviations are shown for the units that appear in the sample output above):

Days
Hours
Minutes: m
Seconds: s
Milliseconds: ms
Microseconds: us
Nanoseconds

Pool Stats

Pool stats are only included in heartbeats when:

A miss has been recorded for the pool in a given interval and it results in a new object being allocated
The number of preallocated objects taken from a pool drops below the configured value for the pool depletion threshold

Sample Trace Output:

[Pool Stats]
PUT   DPUT  GET   DGET  HIT   DHIT  MISS  DMISS GROW  DGROW EVIC  DEVIC DWSH  DDWSH SIZE  PRE   CAP   NAME
38    0     16.8M 0     38    0     16.8M 0     0     0     0     0     0     0     0     0     1024  iobuf.native-32.20
1     0     62    0     1     0     61    0     0     0     0     0     0     0     0     0     1024  iobuf.native-64.21
1     0     1.0M  0     1     0     1.0M  0     0     0     0     0     0     0     0     0     1024  iobuf.native-256.23
7     0     75    0     7     0     68    0     0     0     0     0     0     0     0     0     1024  iobuf.heap-32.1

The stats can be interpreted as follows:

PUT: The overall number of times items were put (returned) to a pool.

DPUT: The number of times items were put (returned) to a pool since the last time the pool was reported in a heartbeat (the delta).

GET: The overall number of times an item was taken from a pool.

Tip: If pool items are not being leaked, GET - PUT indicates the number of items that have been taken from the pool and not returned (e.g., items that are being held by messages in the transaction processing pipeline or microservice state).

DGET: The number of times an item was taken from a pool since the last time the pool was reported in a heartbeat (the delta).

HIT: The overall number of times that a request for an item was satisfied by an available item in the pool.

DHIT: The number of times that a request for an item was satisfied by an available item in the pool since the last time the pool was reported in a heartbeat (the delta).

MISS: The overall number of times that a request for an item could not be satisfied by an available item in the pool, resulting in an allocation.

DMISS: The number of times that a request for an item could not be satisfied by an available item in the pool, resulting in an allocation, since the last time the pool was reported in a heartbeat (the delta).

GROW: The overall number of times the capacity of a pool had to be increased to accommodate returned items.

DGROW: The number of times the capacity of a pool had to be increased to accommodate returned items since the last time the pool was reported in a heartbeat (the delta).

EVIC: The overall number of items that were evicted from the pool because the pool did not have adequate capacity to store them.

DEVIC: The number of items that were evicted from the pool because the pool did not have adequate capacity to store them since the last time the pool was reported in a heartbeat (the delta).

DWSH: The overall number of times that an item returned to the pool was washed (e.g., fields reset) in the detached pool washer thread.

DDWSH: The number of times that an item returned to the pool was washed (e.g., fields reset) in the detached pool washer thread since the last time the pool was reported in a heartbeat (the delta).

SIZE: The number of items that are currently in the pool and available for pool gets. This number will be 0 if all objects that have been allocated by the pool have been taken.

Note: Because pool stats are generally printed when there are pool misses, this value will often be 0, reflecting that there are no items available in the pool.

PRE: The number of items initially preallocated for the pool.

CAP: The capacity of the backing array that is allocated to hold available pool items that have been preallocated or returned to the pool.

Tip: The capacity of a pool grows automatically as items are returned to the pool without being taken out. A large capacity generally indicates that at some point in the past a larger number of items was needed, but that they are not currently being used.

NAME: The unique identifier for the pool.
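
The relationships between the cumulative counters can be checked with simple arithmetic. The sketch below uses the values from the iobuf.native-64.21 row of the sample output; the PoolRow class is a hypothetical holder for one trace row, not a platform type, and the hit ratio is an illustrative derived figure that the heartbeat itself does not report.

```java
// Hypothetical holder for one row of pool stats, not a platform class.
final class PoolRow {
    final long put;
    final long get;
    final long hit;
    final long miss;

    PoolRow(long put, long get, long hit, long miss) {
        this.put = put;
        this.get = get;
        this.hit = hit;
        this.miss = miss;
    }

    /** Items taken and not yet returned (GET - PUT), assuming no leaks. */
    long outstanding() {
        return get - put;
    }

    /** Every get is either a hit or a miss, so HIT + MISS should equal GET. */
    boolean isConsistent() {
        return hit + miss == get;
    }

    /** Fraction of gets served from the pool without allocating; 0 if no gets. */
    double hitRatio() {
        return get == 0 ? 0.0 : (double) hit / get;
    }

    public static void main(String[] args) {
        // PUT=1, GET=62, HIT=1, MISS=61 from the iobuf.native-64.21 sample row.
        PoolRow row = new PoolRow(1, 62, 1, 61);
        System.out.println("outstanding items: " + row.outstanding());
        System.out.println("hit + miss == get: " + row.isConsistent());
        System.out.println("hit ratio: " + row.hitRatio());
    }
}
```

A low hit ratio combined with a PRE of 0, as in the sample rows, suggests that nearly every get resulted in an allocation.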

Engine Stats

Stats collected by the AEP engine underlying your application are also included in heartbeats. See AEP Engine Statistics for more detail about engine stats.

User Stats

User stats collected by your application are also included in heartbeats.

Sample Trace Output:

[App (ems) User Stats]
...Gauges{
......EMS Messages Received: 142604
......EMS Orders Received: 35651
...}
...Series{
......[In Proc Tick To Trade(sno=35651, #points=150, #skipped=0)
.........In Proc Tick To Trade(interval): [sample=150, min=72 max=84 mean=75 median=75 75%ile=77 90%ile=79 99%ile=83 99.9%ile=84 99.99%ile=84]
.........In Proc Tick To Trade (running): [sample=35651, min=72 max=2000 mean=93 median=76 75%ile=82 90%ile=111 99%ile=227 99.9%ile=805 99.99%ile=1197]
......[In Proc Time To First Slice(sno=35651, #points=150, #skipped=0)
.........In Proc Time To First Slice(interval): [sample=150, min=85 max=98 mean=88 median=88 75%ile=90 90%ile=92 99%ile=95 99.9%ile=98 99.99%ile=98]
.........In Proc Time To First Slice (running): [sample=35651, min=84 max=4469 mean=249 median=88 75%ile=95 90%ile=133 99%ile=283 99.9%ile=3628 99.99%ile=4143]
...}

See Also: Exposing Application Stats for adding stats specific to your application to heartbeats.
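
When working with captured trace during testing, it can be handy to pull the numbers out of a series line. The sketch below extracts the key=value pairs from a line shaped like the interval and running lines in the sample above. The SeriesLineParser class is hypothetical, and the regular expression is based only on the sample shown here; the exact trace layout may vary between releases, so treat this as a diagnostic aid rather than a stable parsing contract.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a platform class.
final class SeriesLineParser {

    // Matches pairs such as "sample=150", "min=72" or "99.9%ile=84".
    private static final Pattern PAIR = Pattern.compile("([\\w.%]+)=(\\d+)");

    /** Returns the numeric key=value pairs found on the line, in order of appearance. */
    static Map<String, Long> parse(String line) {
        Map<String, Long> values = new LinkedHashMap<>();
        Matcher matcher = PAIR.matcher(line);
        while (matcher.find()) {
            values.put(matcher.group(1), Long.parseLong(matcher.group(2)));
        }
        return values;
    }

    public static void main(String[] args) {
        String line = ".........In Proc Tick To Trade(interval): [sample=150, min=72 max=84 "
                + "mean=75 median=75 75%ile=77 90%ile=79 99%ile=83 99.9%ile=84 99.99%ile=84]";
        for (Map.Entry<String, Long> entry : parse(line).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```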

Related Topics

Configuring Monitoring - Configure heartbeat and statistics collection
AEP Engine Statistics - Engine-level statistics reference
Exposing Application Stats - Define custom application statistics
Stats Dump Tool - Convert binary heartbeat logs to human-readable format
Per Transaction Stats - Transaction-level statistics

Next Steps

Enable heartbeats in your container configuration
Configure appropriate collection settings for your performance requirements
Choose a heartbeat output method (tracing, logging, or event handlers)
Monitor application performance using collected statistics
Use Stats Dump Tool for offline analysis of binary logs
