Monitoring cluster state replication in Bamboo

Statistics describing state replication are written to the main atlassian-bamboo.log file. They show the details of state replication, such as the size of the local queue or the most frequently used event types.

There are two types of statistics:

  • Total – statistics aggregated since the node was started. For example:

    Replication queue stats per node: <destination-node-id> total stats: <statistics-in-json>
  • Snapshot – statistics aggregated since the last snapshot was taken (every 10 minutes by default). For example:

    Replication queue stats per node: <destination-node-id> snapshot stats: <statistics-in-json>


For example, the log on the first node of a 3-node cluster will contain total and snapshot statistics for replication from the first node to the second node, and from the first node to the third node.
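
The two line shapes above can be picked out of the log with a regular expression. The following is a minimal sketch, not an official tool: it assumes the default single-line JSON output (it will not match if pretty-printing is enabled), and it assumes that whatever the logger prints before the message (timestamp, level, and so on) can be ignored. The function name parse_stats_line and the sample line, including its prefix and node ID, are hypothetical.

```python
import json
import re

# Matches the "total stats" and "snapshot stats" lines described above.
# search() skips anything the logger prints before the message itself.
STATS_LINE = re.compile(
    r"Replication queue stats per node: (?P<node>\S+) "
    r"(?P<kind>total|snapshot) stats: (?P<json>\{.*\})"
)


def parse_stats_line(line):
    """Return (destination_node_id, kind, stats_dict), or None if no match."""
    match = STATS_LINE.search(line)
    if match is None:
        return None
    return match["node"], match["kind"], json.loads(match["json"])


# Hypothetical log line; the prefix format and node ID are assumptions.
sample = (
    "2024-01-23 20:53:42,390 INFO Replication queue stats per node: "
    'node-2 snapshot stats: {"queueSize": 0, "addCounter": 46}'
)
result = parse_stats_line(sample)
```

Each parsed tuple tells you which destination node the statistics refer to and whether they are total or snapshot values, so the two kinds can be tracked separately.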

The interval at which snapshot statistics are logged (in minutes) can be set with the bamboo.per.node.queue.stats.logging.interval.minutes system property.

Log parameters

The table below describes all the parameters that are present in the statistics:

| Parameter | Description |
|---|---|
| timestampMillis | The timestamp of when the statistics were generated (in milliseconds). |
| nodeId | Destination node ID. |
| queueSize | The current size of the queue (number of modifications). |
| startQueueSize | The initial queue size from when the node was started. For example, if the node was restarted with 20 events still in the queue, this parameter will show 20. |
| startTimestampMillis | The timestamp of when the queue was started on the node (in milliseconds). |
| startMillisAgo | The time the queue has been active (in milliseconds). |
| closeCounter | Changes to 1 after closing the queue. The queue is closed when a node's status changes to offline. |
| addCounter | The number of events added to the queue. |
| droppedOnAddCounter | The number of events that couldn't be added to the queue. |
| criticalAddCounter | The number of events that couldn't be added to the queue. This is represented in the log file as follows: Critical state of local replication queue - cannot add: <event> to queue: <queue-name>, error: <error-message> |
| criticalPeekCounter | The number of events that couldn't be read from the queue. |
| criticalRemoveCounter | The number of events that couldn't be removed from the queue. |
| peekCounter | The number of events read from the queue. These events are then replicated and removed from the queue. |
| removeCounter | The number of events removed from the queue. It usually means these events were successfully replicated. |
| backupQueueCounter | The number of backups created for a queue. Backups are created when the queue file is corrupted and can't be read or written to. After a backup has been created, a new queue is started. This is represented in the log file as follows: Re-created persistent replication queue for node: <destination-node> with id: <queue-name> in: <localq-home> |
| closeErrorsCounter | Increases when a queue can't be closed. |
| addErrorsCounter | Increases when an event can't be added to the queue. |
| peekErrorsCounter | Increases when an event can't be read from the queue. |
| removeErrorsCounter | Increases when an event can't be removed from the queue. |
| backupQueueErrorsCounter | Increases when a queue backup can't be created. |
| lastAddTimestampMillis | The timestamp of when the last event was added to the queue (in milliseconds). |
| lastAddMillisAgo | The time elapsed since the last event was added to the queue (in milliseconds). |
| lastPeekTimestampMillis | The timestamp of when the last event was read from the queue (in milliseconds). |
| lastPeekMillisAgo | The time elapsed since the last event was read from the queue (in milliseconds). |
| lastRemoveTimestampMillis | The timestamp of when the last event was removed from the queue (in milliseconds). |
| lastRemoveMillisAgo | The time elapsed since the last event was removed from the queue (in milliseconds). |
| lastBackupQueueTimestampMillis | The timestamp of when the last backup was created (in milliseconds). |
| lastBackupQueueMillisAgo | The time elapsed since the last backup was created (in milliseconds). |
| timeToAddMillis | The time it takes to add an event to the queue (in milliseconds). |
| timeToPeekMillis | The time it takes to read an event from the queue (in milliseconds). |
| timeToRemoveMillis | The time it takes to remove an event from the queue (in milliseconds). |
| timeToBackupQueueMillis | The time it takes to back up a queue (in milliseconds). |
| sendCounter | The number of events that were successfully replicated. |
| droppedOnSendCounter | The number of events that couldn't be replicated. |
| timeToSendMillis | The time it takes to replicate an event to other nodes (in milliseconds). Can be interpreted as the latency between publisher and receiver nodes. |
| sendFailureExceptionCounter | The number of errors that occurred on the gRPC-based communication layer. |
| sendUnavailableExceptionCounter | The number of unavailability errors that indicate the receiver node was not reachable from the publisher node. For example, due to the node being offline. |
| numberOfEventTypes | The number of actively replicating event types. |
| addCounterTopN | The number of events replicated by the topN event types. |
| addCounterOthers | The number of events replicated by all other (non-topN) event types. |
| addCounterByEventTypeTopN | Sorted list of the topN event types (10 by default): event type name + number of messages. |

If the top 10 is not enough, assign a higher value to the bamboo.per.node.queue.stats.names.topN system property.

For example:

-Dbamboo.per.node.queue.stats.names.topN=20

Sample statistics

Statistics are presented in JSON format. By default, statistics are written on a single line to limit the amount of space they take up. To pretty-print statistics similar to what you can see in the example below, set the bamboo.per.node.queue.stats.pretty.printing.enabled system property to true.

Here's a sample to give you an idea of what to look for:

{
  "timestampMillis": 1706043222390,
  "nodeId": "ea8cbc38-ac78-41a1-a7a8-879ff3757429",
  "queueSize": 0,
  "startQueueSize": 0,
  "startTimestampMillis": 1706043102392,
  "startMillisAgo": 119998,
  "closeCounter": 0,
  "addCounter": 46,
  "droppedOnAddCounter": 0,
  "criticalAddCounter": 0,
  "criticalPeekCounter": 0,
  "criticalRemoveCounter": 0,
  "peekCounter": 89,
  "removeCounter": 46,
  "backupQueueCounter": 0,
  "closeErrorsCounter": 0,
  "addErrorsCounter": 0,
  "peekErrorsCounter": 0,
  "removeErrorsCounter": 0,
  "backupQueueErrorsCounter": 0,
  "lastAddTimestampMillis": 1706043220684,
  "lastAddMillisAgo": 1706,
  "lastPeekTimestampMillis": 1706043220689,
  "lastPeekMillisAgo": 1701,
  "lastRemoveTimestampMillis": 1706043220689,
  "lastRemoveMillisAgo": 1701,
  "lastBackupQueueTimestampMillis": 0,
  "lastBackupQueueMillisAgo": 0,
  "timeToAddMillis": {
    "count": 46,
    "min": 0,
    "max": 2,
    "sum": 5,
    "avg": 0,
    "distributionCounter": {
      "10": 46,
      "20": 0,
      "50": 0,
      "100": 0
    }
  },
  "timeToPeekMillis": {
    "count": 89,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "timeToRemoveMillis": {
    "count": 46,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {
      "10": 46,
      "20": 0,
      "50": 0,
      "100": 0
    }
  },
  "timeToBackupQueueMillis": {
    "count": 0,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "sendCounter": 46,
  "droppedOnSendCounter": 0,
  "timeToSendMillis": {
    "count": 46,
    "min": 0,
    "max": 191,
    "sum": 529,
    "avg": 12,
    "distributionCounter": {
      "10": 38,
      "20": 2,
      "50": 5,
      "100": 0,
      "200": 1,
      "500": 0,
      "1000": 0,
      "5000": 0
    }
  },
  "sendFailureExceptionCounter": 0,
  "sendUnavailableExceptionCounter": 0,
  "numberOfEventTypes": 5,
  "addCounterTopN": 46,
  "addCounterOthers": 0,
  "addCounterByEventTypeTopN": {
    "INVALIDATE_PLAN_CACHE": 23,
    "REFRESH_ADMINISTRATION_CONFIGURATION": 9,
    "ATLASSIAN_CACHE_REMOVE_BY_KEY": 7,
    "INVALIDATE_REPOSITORY_CACHE": 6,
    "HIDE_PLAN": 1
  }
}
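
The timestamp fields in the sample above are easier to read once converted. The sketch below assumes they are Unix epoch milliseconds (the sample values are consistent with that, but the source does not state it) and treats a timestamp of 0 as "never happened", which is also an assumption. The helper names to_utc and elapsed_since are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc(millis):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


def elapsed_since(stats, timestamp_key):
    """Milliseconds between a recorded timestamp and the moment the
    statistics were generated. Returns None when the timestamp is 0,
    which this sketch treats as "never happened" (an assumption)."""
    timestamp = stats[timestamp_key]
    if timestamp == 0:
        return None
    return stats["timestampMillis"] - timestamp


# Values copied from the sample above.
stats = {
    "timestampMillis": 1706043222390,
    "startTimestampMillis": 1706043102392,
    "lastAddTimestampMillis": 1706043220684,
    "lastBackupQueueTimestampMillis": 0,
}

generated_at = to_utc(stats["timestampMillis"]).isoformat()
uptime_ms = elapsed_since(stats, "startTimestampMillis")
since_last_add_ms = elapsed_since(stats, "lastAddTimestampMillis")
since_last_backup_ms = elapsed_since(stats, "lastBackupQueueTimestampMillis")
```

For this sample, the computed differences match the logged startMillisAgo (119998) and lastAddMillisAgo (1706) values, which suggests each MillisAgo field is measured relative to timestampMillis.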

Basic monitoring recommendations

You don't need to keep a close eye on all parameters to tell whether state replication is working properly. We've selected just a few basic parameters that will give you reliable insight into the health of cluster state replication in your Data Center deployment. Review the recommended values in the following table and compare them to what you're seeing in your environment.

| Parameter | Description | Recommended value |
|---|---|---|
| queueSize | The current size of the queue. This number represents how many events are waiting to be replicated. | The value should be close to 0. It means that events are successfully replicated and then removed from the queue. |
| startQueueSize + addCounter | The initial size of the queue when the node was started, plus the number of events added to the queue since then. | The sum of startQueueSize and addCounter should be equal to sendCounter and to removeCounter. |
| sendCounter, removeCounter | The number of events replicated to another node, and the number of events removed from the queue. | Both values should be equal to startQueueSize + addCounter. It means that all events added to the queue were successfully replicated and then removed from the queue. |
| timeToAddMillis | The time it takes to add events to the local queue. | The average values should be close to 0-10 milliseconds. If they are much higher than that, you're having I/O problems with the local home directory storage volume. |
| timeToRemoveMillis | The time it takes to remove events from the local queue. | Same as for timeToAddMillis. |
| timeToSendMillis | The time it takes to send the event from the local queue to another node over gRPC. | Same as for timeToAddMillis. |
| droppedOnAddCounter, droppedOnSendCounter | The number of events that couldn't be added to the queue, or replicated to other nodes. | The values should be close to 0. It means that events are successfully being added to the queue and replicated to other nodes. |
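
The recommendations above can be turned into an automated check. This is a minimal sketch under stated assumptions: the function name check_replication_health and the default thresholds (10 queued events as "close to 0", 10 milliseconds as the upper bound for average timings, applied to all three timing parameters) are illustrative choices, not values defined by Bamboo. On a busy node the counters may briefly differ while events are in flight, so a single warning is not necessarily a problem.

```python
def check_replication_health(stats, max_queue_size=10, max_avg_millis=10):
    """Return a list of warnings; an empty list means none of the
    checks fired. Both thresholds are illustrative assumptions."""
    warnings = []

    # The queue should stay close to empty.
    if stats["queueSize"] > max_queue_size:
        warnings.append(f"queueSize is {stats['queueSize']}, expected close to 0")

    # Everything queued should have been sent and then removed.
    expected = stats["startQueueSize"] + stats["addCounter"]
    for key in ("sendCounter", "removeCounter"):
        if stats[key] != expected:
            warnings.append(f"{key} is {stats[key]}, expected {expected}")

    # Average timings should stay within roughly 0-10 ms.
    for key in ("timeToAddMillis", "timeToRemoveMillis", "timeToSendMillis"):
        avg = stats[key]["avg"]
        if avg > max_avg_millis:
            warnings.append(f"{key} average is {avg} ms, expected 0-10 ms")

    # No events should be dropped.
    for key in ("droppedOnAddCounter", "droppedOnSendCounter"):
        if stats[key] > 0:
            warnings.append(f"{key} is {stats[key]}, expected close to 0")

    return warnings


# Relevant fields copied from the sample statistics on this page.
sample = {
    "queueSize": 0,
    "startQueueSize": 0,
    "addCounter": 46,
    "sendCounter": 46,
    "removeCounter": 46,
    "droppedOnAddCounter": 0,
    "droppedOnSendCounter": 0,
    "timeToAddMillis": {"avg": 0},
    "timeToRemoveMillis": {"avg": 0},
    "timeToSendMillis": {"avg": 12},
}
warnings = check_replication_health(sample)
```

With the sample values, the counters line up (0 + 46 equals both sendCounter and removeCounter) and the only warning comes from the timeToSendMillis average of 12 ms, which sits just above the illustrative 10 ms threshold.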

Last modified on Jul 1, 2024
