Monitoring cluster state replication in Bamboo

Statistics describing the state replication are written into the main atlassian-bamboo.log file. They show the details of state replication, like the size of the local queue or the most frequently used event types.

There are two types of statistics:

Total – statistics aggregated since the node has been started. For example:

Replication queue stats per node: <destination-node-id> total stats: <statistics-in-json>

Snapshot – statistics aggregated since the last snapshot has been taken (10 minutes by default). For example:
```
Replication queue stats per node: <destination-node-id> snapshot stats: <statistics-in-json>
```

For example, the log on the first node of a 3-node cluster will contain total and snapshot statistics for replication from the first node to the second node and to the third node.
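
To work with these entries outside the log file, the two message shapes above can be matched with a regular expression. The following is a minimal sketch, not an official tool: the function name `parse_stats_line` is hypothetical, the log prefix in the example is invented for illustration, and it assumes the default single-line JSON output.

```python
import json
import re

# Pattern for the two message shapes shown above. The text before the message
# (timestamp, log level, thread name) depends on the logging setup, so we
# search within the line instead of matching it from the start.
STATS_LINE = re.compile(
    r"Replication queue stats per node: (?P<node>\S+) "
    r"(?P<kind>total|snapshot) stats: (?P<payload>\{.*\})"
)


def parse_stats_line(line):
    """Return (node_id, kind, stats_dict), or None if this is not a stats line."""
    match = STATS_LINE.search(line)
    if match is None:
        return None
    return match["node"], match["kind"], json.loads(match["payload"])


# Hypothetical log line: the prefix is made up, the message shape is from above.
example = (
    "2024-01-23 20:53:42,390 INFO [some-thread] "
    "Replication queue stats per node: ea8cbc38-ac78-41a1-a7a8-879ff3757429 "
    'snapshot stats: {"queueSize": 0, "addCounter": 46}'
)
print(parse_stats_line(example))
```

If pretty-printing is enabled, the JSON spans several lines and this single-line pattern will not match it.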

The frequency of saving the snapshot statistics can be set with the bamboo.per.node.queue.stats.logging.interval.minutes system property.

Log parameters

The table below describes all the parameters that are present in the statistics:

Parameter	Description
`timestampMillis`	The timestamp of when the statistics were generated (in milliseconds).
`nodeId`	Destination node ID.
`queueSize`	The current size of the queue (number of modifications).
`startQueueSize`	The initial queue size from when the node was started. For example, if the node was restarted with 20 events still in the queue, this parameter will show 20.
`startTimestampMillis`	The timestamp of when the queue was started on the node (in milliseconds).
`startMillisAgo`	The time the queue has been active (in milliseconds).
`closeCounter`	Changes to 1 after closing the queue. The queue will be closed when a node's status changes to offline.
`addCounter`	The number of events added to the queue.
`droppedOnAddCounter`	The number of events that couldn't be added to the queue.
`criticalAddCounter`	The number of events that couldn't be added to the queue. This is represented in the log file as follows: `Critical state of local replication queue - cannot add: <event> to queue: <queue-name>, error: <error-message>`
`criticalPeekCounter`	The number of events that couldn't be read from the queue.
`criticalRemoveCounter`	The number of events that couldn't be removed from the queue.
`peekCounter`	The number of events read from the queue. These events are then replicated and removed from the queue.
`removeCounter`	The number of events removed from the queue. It usually means these events were successfully replicated.
`backupQueueCounter`	The number of backups created for a queue. Backups are created when the queue file is corrupted and can't be read or written to. After a backup has been created, a new queue is started. This is represented in the log file as follows: `Re-created persistent replication queue for node: <destination-node> with id: <queue-name> in: <localq-home>`
`closeErrorsCounter`	Increases when a queue can't be closed.
`addErrorsCounter`	Increases when an event can't be added to the queue.
`peekErrorsCounter`	Increases when an event can't be read from the queue.
`removeErrorsCounter`	Increases when an event can't be removed from the queue.
`backupQueueErrorsCounter`	Increases when a queue backup can't be created.
`lastAddTimestampMillis`	The timestamp of when the last event was added to the queue (in milliseconds).
`lastAddMillisAgo`	The time elapsed since the last event was added to the queue (in milliseconds).
`lastPeekTimestampMillis`	The timestamp of when the last event was read from the queue (in milliseconds).
`lastPeekMillisAgo`	The time elapsed since the last event was read from the queue (in milliseconds).
`lastRemoveTimestampMillis`	The timestamp of when the last event was removed from the queue (in milliseconds).
`lastRemoveMillisAgo`	The time elapsed since the last event was removed from the queue (in milliseconds).
`lastBackupQueueTimestampMillis`	The timestamp of when the last backup was created (in milliseconds).
`lastBackupQueueMillisAgo`	The time elapsed since the last backup was created (in milliseconds).
`timeToAddMillis`	The time it takes to add an event to the queue (in milliseconds).
`timeToPeekMillis`	The time it takes to read an event from the queue (in milliseconds).
`timeToRemoveMillis`	The time it takes to remove an event from the queue (in milliseconds).
`timeToBackupQueueMillis`	The time it takes to back up a queue (in milliseconds).
`sendCounter`	The number of events that were successfully replicated.
`droppedOnSendCounter`	The number of events that couldn't be replicated.
`timeToSendMillis`	The time it takes to replicate an event to other nodes (in milliseconds). Can be interpreted as the latency between the publisher and receiver nodes.
`sendFailureExceptionCounter`	The number of errors that occurred on the gRPC-based communication layer.
`sendUnavailableExceptionCounter`	The number of unavailability errors that indicate the receiver node was not reachable from the publisher node. For example, due to the node being offline.
`numberOfEventTypes`	The number of actively replicating event types.
`addCounterTopN`	The number of events replicated by the topN event types.
`addCounterOthers`	The number of events replicated by all other (non-topN) event types.
`addCounterByEventTypeTopN`	Sorted list of topN event types (10 by default): event type name + number of messages. If the top 10 is not enough, you can change that by assigning a higher value to the `bamboo.per.node.queue.stats.names.topN` system property. For example: `-Dbamboo.per.node.queue.stats.names.topN=20`

Sample statistics

Statistics are presented in JSON format. By default, statistics are aggregated in a single line to limit the amount of space they take up. To pretty-print statistics similar to what you can see in the example below, set the bamboo.per.node.queue.stats.pretty.printing.enabled property to true.

Here's a sample to give you an idea of what to look for:

{
  "timestampMillis": 1706043222390,
  "nodeId": "ea8cbc38-ac78-41a1-a7a8-879ff3757429",
  "queueSize": 0,
  "startQueueSize": 0,
  "startTimestampMillis": 1706043102392,
  "startMillisAgo": 119998,
  "closeCounter": 0,
  "addCounter": 46,
  "droppedOnAddCounter": 0,
  "criticalAddCounter": 0,
  "criticalPeekCounter": 0,
  "criticalRemoveCounter": 0,
  "peekCounter": 89,
  "removeCounter": 46,
  "backupQueueCounter": 0,
  "closeErrorsCounter": 0,
  "addErrorsCounter": 0,
  "peekErrorsCounter": 0,
  "removeErrorsCounter": 0,
  "backupQueueErrorsCounter": 0,
  "lastAddTimestampMillis": 1706043220684,
  "lastAddMillisAgo": 1706,
  "lastPeekTimestampMillis": 1706043220689,
  "lastPeekMillisAgo": 1701,
  "lastRemoveTimestampMillis": 1706043220689,
  "lastRemoveMillisAgo": 1701,
  "lastBackupQueueTimestampMillis": 0,
  "lastBackupQueueMillisAgo": 0,
  "timeToAddMillis": {
    "count": 46,
    "min": 0,
    "max": 2,
    "sum": 5,
    "avg": 0,
    "distributionCounter": {
      "10": 46,
      "20": 0,
      "50": 0,
      "100": 0
    }
  },
  "timeToPeekMillis": {
    "count": 89,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "timeToRemoveMillis": {
    "count": 46,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {
      "10": 46,
      "20": 0,
      "50": 0,
      "100": 0
    }
  },
  "timeToBackupQueueMillis": {
    "count": 0,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "sendCounter": 46,
  "droppedOnSendCounter": 0,
  "timeToSendMillis": {
    "count": 46,
    "min": 0,
    "max": 191,
    "sum": 529,
    "avg": 12,
    "distributionCounter": {
      "10": 38,
      "20": 2,
      "50": 5,
      "100": 0,
      "200": 1,
      "500": 0,
      "1000": 0,
      "5000": 0
    }
  },
  "sendFailureExceptionCounter": 0,
  "sendUnavailableExceptionCounter": 0,
  "numberOfEventTypes": 5,
  "addCounterTopN": 46,
  "addCounterOthers": 0,
  "addCounterByEventTypeTopN": {
    "INVALIDATE_PLAN_CACHE": 23,
    "REFRESH_ADMINISTRATION_CONFIGURATION": 9,
    "ATLASSIAN_CACHE_REMOVE_BY_KEY": 7,
    "INVALIDATE_REPOSITORY_CACHE": 6,
    "HIDE_PLAN": 1
  }
}
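
One way to read the `addCounterByEventTypeTopN` block is as each event type's share of all events added to the queue. The sketch below uses a subset of the sample above. The helper name `event_type_shares` is hypothetical, and it assumes that `addCounterTopN` plus `addCounterOthers` gives the total number of added events, which holds in the sample (46 + 0 = 46).

```python
import json

# Subset of the sample statistics above; a real caller would pass the full object.
sample = json.loads("""
{
  "addCounter": 46,
  "addCounterTopN": 46,
  "addCounterOthers": 0,
  "addCounterByEventTypeTopN": {
    "INVALIDATE_PLAN_CACHE": 23,
    "REFRESH_ADMINISTRATION_CONFIGURATION": 9,
    "ATLASSIAN_CACHE_REMOVE_BY_KEY": 7,
    "INVALIDATE_REPOSITORY_CACHE": 6,
    "HIDE_PLAN": 1
  }
}
""")


def event_type_shares(stats):
    """Map each top-N event type to its fraction of all events added to the queue."""
    # Assumption: top-N events plus all other events make up the total.
    total = stats["addCounterTopN"] + stats["addCounterOthers"]
    if total == 0:
        return {}
    by_type = stats["addCounterByEventTypeTopN"]
    return {name: count / total for name, count in by_type.items()}


shares = event_type_shares(sample)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

In this sample, `INVALIDATE_PLAN_CACHE` accounts for half of the added events (23 of 46).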

Basic monitoring recommendations

You don't need to keep a close eye on all parameters to tell whether state replication is working properly. We've selected just a few basic parameters that will give you reliable insight into the health of cluster state replication in your Data Center deployment. Review the recommended values in the following table and compare them to what you're seeing in your environment.

Parameter	Description	Recommended value
`queueSize`	The current size of the queue. This number represents how many events are waiting to be replicated.	The value should be close to 0. It means that events are successfully replicated, and then removed from the queue.
`startQueueSize + addCounter`	The initial size of the queue when the node was started plus the number of events added to the queue since then.	The sum of `startQueueSize` and `addCounter` should be equal to `sendCounter` and to `removeCounter`. It means that all events added to the queue were successfully replicated and then removed from the queue.
`sendCounter`, `removeCounter`	The number of events replicated to another node and the number of events removed from the queue.	Each value should be equal to `startQueueSize` + `addCounter`.
`timeToAddMillis`	The time it takes to add events to the local queue.	The average value should be in the range of 0-10 milliseconds. If it's much higher, you're likely having I/O problems with the local home directory storage volume.
`timeToRemoveMillis`	The time it takes to remove events from the local queue.	Same as for `timeToAddMillis`.
`timeToSendMillis`	The time it takes to send the event from the local queue to another node over gRPC.
`droppedOnAddCounter`, `droppedOnSendCounter`	The number of events that couldn't be added to the queue, or replicated to other nodes.	The values should be close to 0. It means that events are successfully being added to the queue and replicated to other nodes.
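
The recommendations in this table can be turned into a simple automated check over one parsed statistics object. This is a minimal sketch under stated assumptions: the function name `check_replication_health` is hypothetical, and the thresholds `max_queue_size` and `max_avg_millis` are illustrative defaults, not values defined by Bamboo.

```python
def check_replication_health(stats, max_queue_size=10, max_avg_millis=10):
    """Return a list of warnings; an empty list means no check failed.

    Both thresholds are illustrative assumptions; tune them for your cluster.
    """
    warnings = []

    # The queue should be close to empty.
    if stats["queueSize"] > max_queue_size:
        warnings.append(f"queueSize is {stats['queueSize']}, expected a value close to 0")

    # Everything that entered the queue should have been sent and removed.
    expected = stats["startQueueSize"] + stats["addCounter"]
    for name in ("sendCounter", "removeCounter"):
        if stats[name] != expected:
            warnings.append(f"{name} is {stats[name]}, expected {expected}")

    # Slow local queue operations point at storage I/O problems.
    for name in ("timeToAddMillis", "timeToRemoveMillis"):
        avg = stats[name]["avg"]
        if avg > max_avg_millis:
            warnings.append(f"{name} average is {avg} ms, expected at most {max_avg_millis} ms")

    # Dropped events should not occur.
    for name in ("droppedOnAddCounter", "droppedOnSendCounter"):
        if stats[name] > 0:
            warnings.append(f"{name} is {stats[name]}, expected 0")

    return warnings


# Values taken from the sample statistics earlier on this page.
healthy = {
    "queueSize": 0,
    "startQueueSize": 0,
    "addCounter": 46,
    "sendCounter": 46,
    "removeCounter": 46,
    "timeToAddMillis": {"avg": 0},
    "timeToRemoveMillis": {"avg": 0},
    "droppedOnAddCounter": 0,
    "droppedOnSendCounter": 0,
}
print(check_replication_health(healthy))  # prints []
```

The counter comparison is strict, so treat a mismatch on a busy node as a prompt to look closer rather than as proof of a fault.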

More information

Cluster state replication
Configuring cluster state replication
