Container Stats and Heartbeats
A running container continuously collects raw statistics. The container can also be configured to start a background thread that periodically does the following:
Performs higher level statistical computations such as calculating message rates and average latencies
Emits heartbeat messages to be processed by handlers
Optionally outputs rendered stats to a trace logger, which is useful in testing and diagnostic situations
Optionally writes heartbeat messages containing useful container-wide statistics to a binary transaction log with zero steady-state allocations, which is useful for zero-garbage capture of performance data in production
The raw metrics collected by the container are used by the background statistical thread for its computations and can also be retrieved programmatically by an application for its own use.
This document describes:
How to enable and configure container stats collection and emission
The higher level statistics calculations performed by the statistics thread
The format of the output of the statistics thread
Configuring Heartbeats
Heartbeats must be configured in your DDL to enable statistics collection. For complete configuration details including all parameters, collection settings, and logging/tracing options, see:
Configuring Monitoring - Complete heartbeat and statistics configuration guide
This page focuses on understanding and interpreting the heartbeat output once configured.
What Heartbeats Contain
Heartbeats contain several categories of statistics:
System Stats: CPU, memory, disk, threads, GC
Thread Stats: Per-thread CPU utilization and affinitization
Pool Stats: Object pool usage and depletion
Engine Stats: AEP engine metrics (see AEP Engine Statistics)
User Stats: Application-defined statistics (see Exposing Application Stats)
Consuming Container Heartbeats
When heartbeats are enabled, they can be consumed in several ways:
Heartbeat Event Handlers
Your application can register an event handler for container heartbeats to handle them in process.
See the SrvMonHeartbeatMessage JavaDoc for API details.
Admin Clients
Administrative and monitoring tools can connect to a container via a direct admin connection over TCP to listen for heartbeats for monitoring purposes. The container's stats thread will queue copies of each emitted heartbeat to each connected admin client.
Heartbeat Trace Output
Heartbeat trace is emitted to the nv.server.heartbeat logger at a level of INFO. Trace is only emitted for the types of heartbeat trace for which tracing has been enabled.
For configuration details on enabling trace output for different statistic types, see Configuring Monitoring.
This section explains how to interpret the trace output for each type of heartbeat statistic.
See Also: Trace Logging for general information on trace logging.
System Stats
Sample Trace Output:
The system stats trace can be interpreted as follows:
General Info
Date and time that statistics gathering started
Server name
Server PID
Number of apps running in the container
Time spent gathering container statistics (for the current interval, excluding formatting)
System Info
Number of available processors
System load average
Memory Info
For the entire system:
Total available memory
Free memory
Committed memory
Swap total/free
For the process:
Initial heap size
Heap used
Heap committed
Max heap size
Initial non-heap size
Non-heap memory used
Non-heap memory committed
Non-heap memory max size
Reference: For more info regarding the process statistics above, you can reference the Oracle JavaDoc on MemoryUsage.
Note: JDK 7 or newer is needed to collect all available memory stats. In addition, some stats are not available on all JVMs.
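The process-level values listed above correspond to what the standard java.lang.management beans expose. The following is a minimal sketch, assuming only the JDK's own MXBeans, of how an application could read the same kind of values for itself. The class name JvmMemorySnapshot and its output format are illustrative and are not the platform's trace format. System-wide memory and swap figures are not covered because they are not available through the standard beans on all JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper for illustration; not part of the platform API.
final class JvmMemorySnapshot {

    /** Renders one MemoryUsage in bytes; init and max are -1 when undefined. */
    static String render(String label, MemoryUsage usage) {
        return label + " init=" + usage.getInit()
                + " used=" + usage.getUsed()
                + " committed=" + usage.getCommitted()
                + " max=" + usage.getMax();
    }

    /** Collects processor count, load average, and heap/non-heap usage. */
    static String snapshot() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        StringBuilder sb = new StringBuilder();
        sb.append("processors=").append(os.getAvailableProcessors());
        // getSystemLoadAverage() is negative when the platform cannot supply it
        sb.append(" load=").append(os.getSystemLoadAverage()).append('\n');
        sb.append(render("heap", memory.getHeapMemoryUsage())).append('\n');
        sb.append(render("non-heap", memory.getNonHeapMemoryUsage()));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

The four fields printed per memory area are the same init, used, committed, and max values described in the MemoryUsage JavaDoc.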
Disk
For each volume available:
Total space
Usable space
Available space
Note: Listing file system roots requires JDK 7 or newer. With JDK 6 or older, some disk information may not be available.
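Per-volume space figures of this kind can be read with the JDK 7 NIO file store API. The sketch below assumes java.nio.file.FileStore is an acceptable way to enumerate volumes. The class name DiskVolumes is illustrative, and this page does not state how the three FileStore values map onto the "usable" and "available" figures in the trace.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the platform API.
final class DiskVolumes {

    /** Describes one file store in bytes, or notes that it could not be queried. */
    static String describe(FileStore store) {
        try {
            return store.name()
                    + " total=" + store.getTotalSpace()
                    + " usable=" + store.getUsableSpace()
                    + " unallocated=" + store.getUnallocatedSpace();
        } catch (IOException e) {
            // Some virtual or unmounted stores reject space queries
            return store.name() + " unavailable: " + e.getMessage();
        }
    }

    /** Describes every file store visible to the default file system. */
    static List<String> describeAll() {
        List<String> lines = new ArrayList<>();
        for (FileStore store : FileSystems.getDefault().getFileStores()) {
            lines.add(describe(store));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeAll()) {
            System.out.println(line);
        }
    }
}
```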
Thread Info
Total thread count
Daemon thread count
Peak thread count
JIT Info
JIT name
Total compilation time
Tip: Compare two consecutive intervals to determine whether JIT compilation occurred in the interval.
GC Info
Collection count (for all GCs)
Collection time (for all GCs)
Tip: Compare two consecutive intervals to determine whether a GC occurred in the interval.
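Because the JIT and GC values are cumulative, activity within an interval shows up only as a difference between two samples. The following sketch, which assumes the standard GarbageCollectorMXBean and CompilationMXBean as the data source, illustrates that comparison. The class IntervalDeltas and its Sample type are hypothetical names introduced for this example.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; not part of the platform API.
final class IntervalDeltas {

    /** Cumulative counters captured at one point in time. */
    static final class Sample {
        final long gcCount;
        final long gcTimeMillis;
        final long jitTimeMillis;

        Sample(long gcCount, long gcTimeMillis, long jitTimeMillis) {
            this.gcCount = gcCount;
            this.gcTimeMillis = gcTimeMillis;
            this.jitTimeMillis = jitTimeMillis;
        }
    }

    /** Sums counts and times across all collectors and reads total JIT time. */
    static Sample capture() {
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value
            count += Math.max(0, gc.getCollectionCount());
            time += Math.max(0, gc.getCollectionTime());
        }
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        long jitTime = (jit != null && jit.isCompilationTimeMonitoringSupported())
                ? jit.getTotalCompilationTime()
                : 0;
        return new Sample(count, time, jitTime);
    }

    /** True when at least one collection completed between the two samples. */
    static boolean gcOccurred(Sample previous, Sample current) {
        return current.gcCount > previous.gcCount;
    }

    /** True when the JIT compiler accumulated time between the two samples. */
    static boolean jitOccurred(Sample previous, Sample current) {
        return current.jitTimeMillis > previous.jitTimeMillis;
    }
}
```

When reading trace output rather than sampling in process, the same subtraction applies to the values printed in two consecutive heartbeats.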
Thread Stats
Since 3.7
Sample Trace Output:
Where columns can be interpreted as:
ID
The thread's id
CPU
The total amount of time in nanoseconds that the thread has executed (as reported by the JMX thread bean)
DCPU
The amount of time in nanoseconds that the thread has executed in user mode or system mode in the given interval (as reported by the JMX thread bean)
DUSER
The amount of time that the thread has executed in user mode in the given interval in nanoseconds (as reported by the JMX thread bean)
CPU%
The percentage of CPU time the thread used during the interval (i.e. DCPU * 100 / interval time)
USER%
The percentage of user mode CPU time the thread used during the interval (i.e. DUSER * 100 / interval time)
WAIT%
The percentage of the interval during which the thread was recorded in a wait state such as a busy spin loop or a disruptor wait. Wait times are captured proactively by the platform through code instrumentation that takes a timestamp on entering and on exiting the wait condition. Unlike CPU% or USER%, this percentage can therefore include time when the thread was not scheduled and was not consuming CPU resources, so it is not generally possible to subtract WAIT% from CPU% to calculate the time the thread actually executed. For example, if CPU% is 50, WAIT% is 50, and the interval is 5 seconds, it could be that 2.5 seconds of real work was done and the 2.5 seconds of wait time occurred while the thread was context switched out. It could equally be that all 2.5 seconds of wait time coincided with the 2.5 seconds of CPU time, so that all of the CPU time was spent busy spinning. In other words, WAIT% gives a definitive indication of time that the thread was not doing active work during the interval; how the remaining CPU time was spent is at the mercy of the operating system's thread scheduler.
STATE
The thread's runnable state at the time of collection
NAME
The thread name. Note that when affinitization is enabled and the thread has been affinitized, that affinitization information is appended to the thread name.
Tip: This is useful when trying to determine whether a thread should be affinitized. A busy spinning thread will typically have a CPU% of ~100. If the thread is not affinitized, it might be a good candidate.
Affinity
The affinity summary string reported with individual thread stats does not have a column of its own; the affinitizer appends it to the thread name.
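The CPU% and USER% formulas, and the ambiguity described for WAIT%, can be made concrete with a small amount of arithmetic. The sketch below is illustrative only: ThreadIntervalMath is a hypothetical class, and the bounds calculation is derived from the WAIT% description above rather than taken from the platform. It assumes a single thread, so CPU% and WAIT% each lie between 0 and 100.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; not part of the platform API.
final class ThreadIntervalMath {

    /** CPU% or USER%: a delta in nanoseconds scaled by the interval length. */
    static double percent(long deltaNanos, long intervalNanos) {
        if (intervalNanos <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return deltaNanos * 100.0 / intervalNanos;
    }

    /**
     * Lower and upper bound, as a percentage of the interval, on CPU time that
     * was not spent waiting. The lower bound assumes all wait time overlapped
     * CPU time; the upper bound assumes as much wait time as possible fell
     * while the thread was off CPU.
     */
    static double[] activeWorkBounds(double cpuPercent, double waitPercent) {
        double min = Math.max(0.0, cpuPercent - waitPercent);
        double max = Math.min(cpuPercent, 100.0 - waitPercent);
        // Guard against measurement noise pushing max below min
        return new double[] { min, Math.max(min, max) };
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            System.out.println("thread CPU time is not available on this JVM");
            return;
        }
        long wallStart = System.nanoTime();
        long cpuStart = threads.getCurrentThreadCpuTime();
        long userStart = threads.getCurrentThreadUserTime();
        long spinUntil = wallStart + 50_000_000L; // busy spin for roughly 50 ms
        while (System.nanoTime() < spinUntil) { }
        long interval = System.nanoTime() - wallStart;
        long dcpu = threads.getCurrentThreadCpuTime() - cpuStart;
        long duser = threads.getCurrentThreadUserTime() - userStart;
        System.out.printf("CPU%%=%.1f USER%%=%.1f%n",
                percent(dcpu, interval), percent(duser, interval));
    }
}
```

For the example in the WAIT% description (CPU% 50, WAIT% 50, 5 second interval), activeWorkBounds(50, 50) yields a range of 0 to 50 percent, which is 0 to 2.5 seconds of real work. The range narrows only when WAIT% is small relative to CPU%.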
CPU times are reported according to the most appropriate short form:
Days: d
Hours: h
Minutes: m
Seconds: s
Milliseconds: ms
Microseconds: us
Nanoseconds: ns
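As an illustration of what "most appropriate short form" can mean, the sketch below picks the largest unit in which the value is at least 1. That selection rule, the one-decimal precision, and the class name CpuTimeFormat are all assumptions made for this example; the platform's actual rendering may differ.

```java
import java.util.Locale;

// Hypothetical formatter for illustration; not the platform's implementation.
final class CpuTimeFormat {

    // Units from largest to smallest, expressed in nanoseconds
    private static final long[] UNIT_NANOS = {
        86_400_000_000_000L, // d
        3_600_000_000_000L,  // h
        60_000_000_000L,     // m
        1_000_000_000L,      // s
        1_000_000L,          // ms
        1_000L,              // us
        1L                   // ns
    };
    private static final String[] UNIT_NAMES = { "d", "h", "m", "s", "ms", "us", "ns" };

    /** Formats a nanosecond duration using the largest unit with a value of at least 1. */
    static String format(long nanos) {
        if (nanos <= 0) {
            return "0ns";
        }
        for (int i = 0; i < UNIT_NANOS.length; i++) {
            if (nanos >= UNIT_NANOS[i]) {
                if (UNIT_NANOS[i] == 1L) {
                    return nanos + "ns"; // whole nanoseconds need no decimals
                }
                double value = (double) nanos / UNIT_NANOS[i];
                return String.format(Locale.ROOT, "%.1f%s", value, UNIT_NAMES[i]);
            }
        }
        return "0ns"; // unreachable for positive input; kept for completeness
    }
}
```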
Pool Stats
Pool stats are only included in heartbeats when:
A miss has been recorded for the pool in a given interval and it results in a new object being allocated
The number of preallocated objects taken from a pool drops below the configured value for the pool depletion threshold
Sample Trace Output:
PUT
The overall number of times items were put (returned) to a pool
DPUT
The number of times items were put (returned) to a pool since the last time the pool was reported in a heartbeat (the delta)
GET
The overall number of times an item was taken from a pool.
Tip: If pool items are not being leaked, GET - PUT indicates the number of items that have been taken from the pool and not returned (e.g., items that are being held by messages in the transaction processing pipeline or microservice state).
DGET
The number of times an item was taken from a pool since the last time the pool was reported in a heartbeat (the delta)
HIT
The overall number of times that an item taken from a pool was satisfied by there being an available item in the pool
DHIT
The number of times that an item taken from a pool was satisfied by there being an available item in the pool since the last time the pool was reported in a heartbeat (the delta)
MISS
The overall number of times that an item taken from a pool was not satisfied by there being an available item in the pool resulting in an allocation
DMISS
The number of times that an item taken from a pool was not satisfied by there being an available item in the pool resulting in an allocation since the last time the pool was reported in a heartbeat (the delta)
GROW
The overall number of times the capacity of a pool had to be increased to accommodate returned items
DGROW
The number of times the capacity of a pool had to be increased to accommodate returned items since the last time the pool was reported in a heartbeat (the delta)
EVIC
The overall number of items that were evicted from the pool because the pool did not have an adequate capacity to store them
DEVIC
The number of items that were evicted from the pool because the pool did not have an adequate capacity to store them since the last time the pool was reported in a heartbeat (the delta)
DWSH
The overall number of times that an item returned to the pool was washed (e.g., fields reset) in the detached pool washer thread
DDWSH
The number of times that an item returned to the pool was washed (e.g., fields reset) in the detached pool washer thread since the last time the pool was reported in a heartbeat
SIZE
The number of items that are currently in the pool available for pool gets. This number will be 0 if all objects that have been allocated by the pool have been taken.
Note: Because pool stats are generally printed when there are pool misses, this value will often be 0 reflecting that there are no items available in the pool.
PRE
The number of items initially preallocated for the pool
CAP
The capacity of the backing array that is allocated to hold available pool items that have been preallocated or returned to the pool.
Tip: The capacity of a pool will grow automatically as items are returned to the pool without being taken out. A large capacity generally indicates that at some point in the past a larger number of items were needed but are not currently in use.
NAME
The unique identifier for the pool
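Two derived values are often useful when reading the columns above: the number of outstanding items (GET - PUT, as noted in the GET tip) and the share of gets that were served from the pool. The sketch below computes both from cumulative counters. PoolCounters is a hypothetical holder class, and computing the hit ratio over HIT + MISS is a choice made for this example rather than something the platform reports.

```java
// Hypothetical holder for illustration; not part of the platform API.
final class PoolCounters {
    final long put;
    final long get;
    final long hit;
    final long miss;

    PoolCounters(long put, long get, long hit, long miss) {
        this.put = put;
        this.get = get;
        this.hit = hit;
        this.miss = miss;
    }

    /** Items taken and not yet returned, assuming no items have been leaked. */
    long outstanding() {
        return get - put;
    }

    /** Fraction of recorded gets served from the pool; NaN when none were recorded. */
    double hitRatio() {
        long total = hit + miss;
        return total == 0 ? Double.NaN : (double) hit / total;
    }

    public static void main(String[] args) {
        PoolCounters counters = new PoolCounters(90, 100, 75, 25);
        System.out.println("outstanding=" + counters.outstanding()
                + " hitRatio=" + counters.hitRatio());
    }
}
```

A steadily rising outstanding count across heartbeats, with no matching growth in in-flight work, is one sign that items may not be getting returned to the pool.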
Engine Stats
Stats collected by the AEP engine underlying your application are also included in heartbeats. See AEP Engine Statistics for more detail about engine stats.
User Stats
User stats collected by your application are also included in heartbeats.
Sample Trace Output:
See Also: Exposing Application Stats for adding stats specific to your application to heartbeats.
Related Topics
Configuring Monitoring - Configure heartbeat and statistics collection
AEP Engine Statistics - Engine-level statistics reference
Exposing Application Stats - Define custom application statistics
Stats Dump Tool - Convert binary heartbeat logs to human-readable format
Per Transaction Stats - Transaction-level statistics
Next Steps
Enable heartbeats in your container configuration
Configure appropriate collection settings for your performance requirements
Choose heartbeat output method (tracing, logging, or event handlers)
Monitor application performance using collected statistics
Use Stats Dump Tool for offline analysis of binary logs