Container Stats and Heartbeats

An operational container continuously collects raw statistics as it runs. The container can also be configured to start a background thread that periodically does the following:

  1. Performs higher level statistical computations such as calculating message rates and average latencies

  2. Emits heartbeat messages to be processed by handlers

  3. Optionally outputs rendered stats to a trace logger, which is useful in testing and diagnostic situations

  4. Optionally writes heartbeat messages containing useful container-wide statistics to a binary transaction log (with zero steady-state allocations), which is useful for zero-garbage capture of performance data in production

The raw metrics collected by the container are used by the background statistical thread for its computations and can also be retrieved programmatically by an application for its own use.
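As an illustration of the kind of higher level computation the statistics thread performs, the sketch below derives a message rate from two samples of a raw counter. It is a minimal example under stated assumptions: the class, the `messagesReceived` counter, and the method names are hypothetical and are not part of the container's API.

```java
import java.util.concurrent.atomic.AtomicLong;

final class RateSketch {
    // Hypothetical raw counter, incremented by application threads as messages arrive.
    static final AtomicLong messagesReceived = new AtomicLong();

    /** Messages per second over one interval, computed from two raw counter samples. */
    static double ratePerSecond(long previousCount, long currentCount, long intervalNanos) {
        if (intervalNanos <= 0) {
            return 0.0; // no elapsed time, so no meaningful rate
        }
        return (currentCount - previousCount) * 1_000_000_000.0 / intervalNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        long previousCount = messagesReceived.get();
        long previousTime = System.nanoTime();
        for (int i = 0; i < 3; i++) {
            messagesReceived.addAndGet(1000); // simulate message traffic
            Thread.sleep(100);                // simulate the collection interval
            long count = messagesReceived.get();
            long now = System.nanoTime();
            System.out.printf("rate=%.1f msg/s%n",
                    ratePerSecond(previousCount, count, now - previousTime));
            previousCount = count;
            previousTime = now;
        }
    }
}
```

Keeping the hot path to a single counter increment and deferring the division to a periodic sampler is what keeps this style of collection cheap for application threads.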

This document describes:

  • How to enable and configure container stats collection and emission

  • The higher level statistics calculations performed by the statistics thread

  • The format of the output of the statistics thread

Configuring Heartbeats

Heartbeats must be configured in your DDL to enable statistics collection. For complete configuration details including all parameters, collection settings, and logging/tracing options, see:

This page focuses on understanding and interpreting the heartbeat output once configured.

What Heartbeats Contain

Heartbeats contain several categories of statistics:

  • System Stats: CPU, memory, disk, threads, GC

  • Thread Stats: Per-thread CPU utilization and affinitization

  • Pool Stats: Object pool usage and depletion

  • Engine Stats: AEP engine metrics (see AEP Engine Statistics)

  • User Stats: Application-defined statistics (see Exposing Application Stats)

Consuming Container Heartbeats

When heartbeats are enabled, they can be consumed in several ways:

Heartbeat Event Handlers

Your application can register an event handler for container heartbeats to handle them in process:

See the SrvMonHeartbeatMessage JavaDoc for API details.

Admin Clients

Administrative and monitoring tools can connect to a container via a direct admin connection over TCP to listen for heartbeats for monitoring purposes. The container's stats thread will queue copies of each emitted heartbeat to each connected admin client.

Heartbeat Trace Output

Heartbeat trace is emitted to the nv.server.heartbeat logger at the INFO level. Trace is emitted only for the heartbeat statistic types for which tracing has been enabled.

For configuration details on enabling trace output for different statistic types, see Configuring Monitoring.

This section explains how to interpret the trace output for each type of heartbeat statistic.

See Also: Trace Logging for general information on trace logging.

System Stats

Sample Trace Output:

The above trace can be interpreted as follows:

General Info

  • Date and time that statistics gathering started

  • Server name

  • Server PID

  • Number of apps running in the container

  • Time spent gathering container statistics (for the current interval, excluding formatting)

System Info

  • Number of available processors

  • System load average

Memory Info

For the entire system:

  • Total available memory

  • Free memory

  • Committed memory

  • Swap total/free

For the process:

  • Initial heap size

  • Heap used

  • Heap committed

  • Max heap size

  • Initial non-heap size

  • Non-heap memory used

  • Non-heap memory committed

  • Non-heap memory max size

Reference: For more information about the process statistics above, see the Oracle JavaDoc on MemoryUsage.

Note: JDK 7 or newer is needed to collect all available memory stats. In addition, some stats are not available on all JVMs.
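The process-level values listed above correspond to the fields of the standard `java.lang.management.MemoryUsage` type. The following sketch reads the same four values for heap and non-heap memory from the JVM's `MemoryMXBean`. It assumes nothing about how the container itself gathers them; the class and the `render` helper are illustrative names only.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

final class MemorySnapshot {
    /** Renders a MemoryUsage as "init/used/committed/max" in bytes (-1 means undefined). */
    static String render(MemoryUsage usage) {
        return usage.getInit() + "/" + usage.getUsed() + "/"
                + usage.getCommitted() + "/" + usage.getMax();
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println("heap     " + render(memory.getHeapMemoryUsage()));
        System.out.println("non-heap " + render(memory.getNonHeapMemoryUsage()));
    }
}
```

A reported max of -1 for non-heap memory is common and simply means the JVM does not define an upper bound for it.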

Disk

For each volume available:

  • Total space

  • Usable space

  • Available space

Note: Listing disk system roots requires JDK 7 or newer. With JDK 6 or older, some disk information may not be available.
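For comparison, similar per-volume figures can be read from a plain JVM with `java.io.File`. This is a sketch only: which API the container uses, and how `getUsableSpace()` and `getFreeSpace()` map onto the "Usable space" and "Available space" values in the trace, are assumptions rather than documented behavior.

```java
import java.io.File;

final class DiskSnapshot {
    /** Describes one volume; all sizes are in bytes. */
    static String describe(File root) {
        return root.getPath()
                + " total=" + root.getTotalSpace()
                + " usable=" + root.getUsableSpace()
                + " free=" + root.getFreeSpace();
    }

    public static void main(String[] args) {
        File[] roots = File.listRoots(); // may be null if the roots cannot be determined
        if (roots == null) {
            System.out.println("no file system roots available");
            return;
        }
        for (File root : roots) {
            System.out.println(describe(root));
        }
    }
}
```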

Thread Info

  • Total thread count

  • Daemon thread count

  • Peak thread count

JIT Info

  • JIT name

  • Total compilation time

Tip: Compare the total compilation time reported in two consecutive intervals to determine whether JIT compilation occurred in the interval.

GC Info

  • Collection count (for all GCs)

  • Collection time (for all GCs)

Tip: Compare the collection counts reported in two consecutive intervals to determine whether a GC occurred in the interval.
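Because the JIT and GC figures are cumulative, activity within an interval shows up only as a change between two samples. The sketch below applies that comparison to the standard JVM management beans. The class and method names are hypothetical; the same check works equally well on values copied from two consecutive heartbeats.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class IntervalDeltas {
    /** Sum of collection counts across all collectors; undefined (-1) counts are skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Cumulative JIT compilation time in milliseconds, or 0 when the JVM does not report it. */
    static long totalJitMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        return (jit != null && jit.isCompilationTimeMonitoringSupported())
                ? jit.getTotalCompilationTime()
                : 0L;
    }

    /** True when a cumulative counter advanced between two consecutive samples. */
    static boolean occurredInInterval(long previous, long current) {
        return current > previous;
    }

    public static void main(String[] args) {
        long gcBefore = totalGcCount();
        long jitBefore = totalJitMillis();
        System.gc(); // only a hint; the JVM may ignore it
        long gcAfter = totalGcCount();
        long jitAfter = totalJitMillis();
        System.out.println("GC occurred in interval:  " + occurredInInterval(gcBefore, gcAfter));
        System.out.println("JIT occurred in interval: " + occurredInInterval(jitBefore, jitAfter));
    }
}
```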

Thread Stats

Since 3.7

Sample Trace Output:

Where columns can be interpreted as:

| Column | Description |
| --- | --- |
| ID | The thread's id |
| CPU | The total amount of time, in nanoseconds, that the thread has executed (as reported by the JMX thread bean) |
| DCPU | The amount of time, in nanoseconds, that the thread has executed in user mode or system mode in the given interval (as reported by the JMX thread bean) |
| DUSER | The amount of time, in nanoseconds, that the thread has executed in user mode in the given interval (as reported by the JMX thread bean) |
| CPU% | The percentage of CPU time the thread used during the interval (i.e., DCPU * 100 / interval time) |
| USER% | The percentage of user mode CPU time the thread used during the interval (i.e., DUSER * 100 / interval time) |
| WAIT% | The percentage of the interval during which the thread was recorded in a wait state such as a busy spin loop or a disruptor wait. Wait times are captured by the platform via code instrumentation that takes a timestamp on entering and on exiting the wait condition. This means that, unlike CPU% or USER%, this percentage can include time when the thread is not scheduled and is not consuming CPU resources. Because of this, it is not generally possible to subtract WAIT% from CPU% to calculate the amount of time the thread actually executed. For example, if CPU% is 50, WAIT% is also 50, and the interval is 5 seconds, it could be that 2.5 seconds of real work was done and the 2.5 seconds of wait time occurred while the thread was context switched out, or it could be that all 2.5 seconds of wait time coincided with the 2.5 seconds of CPU time and all of the CPU time was spent busy spinning. In other words, WAIT% gives a definitive indication of the time that the thread was not doing active work during the interval; the remaining CPU time is at the mercy of the operating system's thread scheduler. |
| STATE | The thread's runnable state at the time of collection |
| NAME | The thread name. When affinitization is enabled and the thread has been affinitized, the affinitization information is appended to the thread name. Tip: This is useful when trying to determine whether a thread should be affinitized. A busy spinning thread will typically have a CPU% of ~100. If the thread is not affinitized, it might be a good candidate. |
| affinity | The affinity summary string reported along with individual thread stats is not reported in a column of its own because the affinitizer appends it to the thread name |
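The CPU% and USER% formulas above can be reproduced on any JVM with the JMX thread bean. The sketch below samples the current thread's CPU and user time around a short busy spin and applies delta * 100 / interval time. The class and method names are illustrative and are not the container's implementation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class ThreadCpuSketch {
    /** CPU% or USER% for one interval: delta nanos * 100 / interval nanos. */
    static double percent(long deltaNanos, long intervalNanos) {
        return intervalNanos <= 0 ? 0.0 : deltaNanos * 100.0 / intervalNanos;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            System.out.println("thread CPU time is not available on this JVM");
            return;
        }
        long cpuStart = threads.getCurrentThreadCpuTime();   // CPU at start of interval
        long userStart = threads.getCurrentThreadUserTime();
        long wallStart = System.nanoTime();

        // Busy spin for roughly 200 ms so the thread accumulates CPU time.
        long spinUntil = wallStart + 200_000_000L;
        while (System.nanoTime() < spinUntil) {
            // intentionally empty
        }

        long interval = System.nanoTime() - wallStart;
        long dcpu = threads.getCurrentThreadCpuTime() - cpuStart;    // DCPU
        long duser = threads.getCurrentThreadUserTime() - userStart; // DUSER
        System.out.printf("DCPU=%dns DUSER=%dns CPU%%=%.1f USER%%=%.1f%n",
                dcpu, duser, percent(dcpu, interval), percent(duser, interval));
    }
}
```

A spinning thread like this one should report a CPU% close to 100 when it is not preempted, which matches the tip about identifying candidates for affinitization.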

CPU times are reported according to the most appropriate short form:

| Unit | Abbreviation |
| --- | --- |
| Days | d |
| Hours | h |
| Minutes | m |
| Seconds | s |
| Milliseconds | ms |
| Microseconds | us |
| Nanoseconds | ns |
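One way to pick a short form is to use the largest unit for which the value is at least 1. The formatter below is a sketch of that rule under stated assumptions: the selection rule, the single decimal place, and the class name are hypothetical, so the exact text may differ from what the container prints.

```java
import java.util.Locale;

final class CpuTimeFormat {
    // Units from largest to smallest, excluding nanoseconds (handled as the fallback).
    private static final long[] UNIT_NANOS = {
        86_400_000_000_000L, // days
        3_600_000_000_000L,  // hours
        60_000_000_000L,     // minutes
        1_000_000_000L,      // seconds
        1_000_000L,          // milliseconds
        1_000L               // microseconds
    };
    private static final String[] UNIT_ABBREVIATIONS = {"d", "h", "m", "s", "ms", "us"};

    /** Formats a duration in nanoseconds using the largest unit that yields a value >= 1. */
    static String format(long nanos) {
        for (int i = 0; i < UNIT_NANOS.length; i++) {
            if (nanos >= UNIT_NANOS[i]) {
                return String.format(Locale.ROOT, "%.1f%s",
                        (double) nanos / UNIT_NANOS[i], UNIT_ABBREVIATIONS[i]);
            }
        }
        return nanos + "ns"; // anything below one microsecond stays in whole nanoseconds
    }

    public static void main(String[] args) {
        System.out.println(format(999L));
        System.out.println(format(1_500_000L));
        System.out.println(format(90_000_000_000L));
    }
}
```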

Pool Stats

Pool stats are only included in heartbeats when:

  • A miss has been recorded for the pool in a given interval and it results in a new object being allocated

  • The number of preallocated objects taken from a pool drops below the configured value for the pool depletion threshold

Sample Trace Output:

| Stat | Description |
| --- | --- |
| PUT | The overall number of times items were put (returned) to a pool |
| DPUT | The number of times items were put (returned) to a pool since the last time the pool was reported in a heartbeat (the delta) |
| GET | The overall number of times an item was taken from a pool. Tip: If pool items are not being leaked, GET - PUT indicates the number of items that have been taken from the pool and not returned (e.g., items that are being held by messages in the transaction processing pipeline or microservice state). |
| DGET | The number of times an item was taken from a pool since the last time the pool was reported in a heartbeat (the delta) |
| HIT | The overall number of times that a request to take an item from a pool was satisfied by an available item in the pool |
| DHIT | The number of times that a request to take an item from a pool was satisfied by an available item in the pool since the last time the pool was reported in a heartbeat (the delta) |
| MISS | The overall number of times that a request to take an item from a pool was not satisfied by an available item in the pool, resulting in an allocation |
| DMISS | The number of times that a request to take an item from a pool was not satisfied by an available item in the pool, resulting in an allocation, since the last time the pool was reported in a heartbeat (the delta) |
| GROW | The overall number of times the capacity of a pool had to be increased to accommodate returned items |
| DGROW | The number of times the capacity of a pool had to be increased to accommodate returned items since the last time the pool was reported in a heartbeat (the delta) |
| EVIC | The overall number of items that were evicted from the pool because the pool did not have adequate capacity to store them |
| DEVIC | The number of items that were evicted from the pool because the pool did not have adequate capacity to store them since the last time the pool was reported in a heartbeat (the delta) |
| DWSH | The overall number of times that an item returned to the pool was washed (e.g., fields reset) in the detached pool washer thread |
| DDWSH | The number of times that an item returned to the pool was washed (e.g., fields reset) in the detached pool washer thread since the last time the pool was reported in a heartbeat (the delta) |
| SIZE | The number of items that are currently in the pool and available for pool gets. This number will be 0 if all objects that have been allocated by the pool have been taken. Note: Because pool stats are generally printed when there are pool misses, this value will often be 0, reflecting that there are no items available in the pool. |
| PRE | The number of items initially preallocated for the pool |
| CAP | The capacity of the backing array that is allocated to hold available pool items that have been preallocated or returned to the pool. Tip: The capacity of a pool grows automatically as items are returned to the pool without being taken out. A large capacity generally indicates that at some point in the past a larger number of items were needed but are not currently in use. |
| NAME | The unique identifier for the pool |
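To make the relationship between the main counters concrete, the following toy pool tracks GET, PUT, HIT, and MISS and derives the GET - PUT figure described above. It is a simplified, single-threaded sketch with hypothetical names, not the platform's pool implementation, and it omits capacity growth, eviction, and washing.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

final class CountingPool<T> {
    private final ArrayDeque<T> available = new ArrayDeque<>();
    private final Supplier<T> factory;
    final int preallocated;      // PRE
    long get, put, hit, miss;    // cumulative GET, PUT, HIT, MISS

    CountingPool(int preallocate, Supplier<T> factory) {
        this.factory = factory;
        this.preallocated = preallocate;
        for (int i = 0; i < preallocate; i++) {
            available.push(factory.get());
        }
    }

    /** Takes an item, counting a hit when one is available and a miss (allocation) otherwise. */
    T take() {
        get++;
        T item = available.poll();
        if (item != null) {
            hit++;
            return item;
        }
        miss++;
        return factory.get();
    }

    /** Returns an item to the pool. */
    void give(T item) {
        put++;
        available.push(item);
    }

    /** GET - PUT: items taken and not yet returned. */
    long outstanding() {
        return get - put;
    }

    /** SIZE: items currently available in the pool. */
    int size() {
        return available.size();
    }

    public static void main(String[] args) {
        CountingPool<StringBuilder> pool = new CountingPool<>(2, StringBuilder::new);
        StringBuilder first = pool.take();
        pool.take();
        pool.take(); // the pool is empty here, so this take is a miss
        pool.give(first);
        System.out.printf("GET=%d PUT=%d HIT=%d MISS=%d SIZE=%d PRE=%d outstanding=%d%n",
                pool.get, pool.put, pool.hit, pool.miss, pool.size(),
                pool.preallocated, pool.outstanding());
    }
}
```

In this model HIT + MISS always equals GET, and a steadily growing GET - PUT with no matching growth in in-flight work is the usual sign of a leak.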

Engine Stats

Stats collected by the AEP engine underlying your application are also included in heartbeats. See AEP Engine Statistics for more detail about engine stats.

User Stats

User stats collected by your application are also included in heartbeats.

Sample Trace Output:

See Also: Exposing Application Stats for adding stats specific to your application to heartbeats.

Next Steps

  1. Enable heartbeats in your container configuration

  2. Configure appropriate collection settings for your performance requirements

  3. Choose a heartbeat output method (tracing, logging, or event handlers)

  4. Monitor application performance using collected statistics

  5. Use the Stats Dump Tool for offline analysis of binary logs
