Monitoring cluster state replication
Log parameters
The table below describes all the parameters that are present in the statistics:
Parameter | Description |
---|---|
timestampMillis | The timestamp of when the statistics were generated (in milliseconds). |
nodeId | Destination node ID. |
queueSize | The current size of the queue (number of modifications). |
startQueueSize | The initial queue size from when the node was started. For example, if the node was restarted with 20 events still in the queue, this parameter will show 20. |
startTimestampMillis | The timestamp of when the queue was started on the node (in milliseconds). |
startMillisAgo | The time the queue has been active (in milliseconds). |
closeCounter | Changes to 1 after closing the queue. The queue will be closed when a node's status changes to offline. |
addCounter | The number of events added to the queue. |
droppedOnAddCounter | The number of events that couldn't be added to the queue. |
criticalAddCounter | The number of events that couldn't be added to the queue. In the log file:
|
criticalPeekCounter | The number of events that couldn't be read from the queue. |
criticalRemoveCounter | The number of events that couldn't be removed from the queue. |
peekCounter | The number of events read from the queue. These events are then replicated and removed from the queue. |
removeCounter | The number of events removed from the queue. It usually means these events were successfully replicated. |
backupQueueCounter | The number of backups created for a queue. Backups are created when the queue file is corrupted and can't be read or written to. After a backup has been created, a new queue is started. This is represented in the log file as follows:
|
closeErrorsCounter | Increases when a queue can't be closed. |
addErrorsCounter | Increases when an event can't be added to the queue. |
peekErrorsCounter | Increases when an event can't be read from the queue. |
removeErrorsCounter | Increases when an event can't be removed from the queue. |
backupQueueErrorsCounter | Increases when a queue backup can't be created. |
lastAddTimestampMillis | The timestamp of when the last event was added to the queue (in milliseconds). |
lastAddMillisAgo | The time elapsed since the last event was added to the queue (in milliseconds). |
lastPeekTimestampMillis | The timestamp of when the last event was read from the queue (in milliseconds). |
lastPeekMillisAgo | The time elapsed since the last event was read from the queue (in milliseconds). |
lastRemoveTimestampMillis | The timestamp of when the last event was removed from the queue (in milliseconds). |
lastRemoveMillisAgo | The time elapsed since the last event was removed from the queue (in milliseconds). |
lastBackupQueueTimestampMillis | The timestamp of when the last backup was created (in milliseconds). |
lastBackupQueueMillisAgo | The time elapsed since the last backup was created (in milliseconds). |
timeToAddMillis | The time it takes to add an event to the queue (in milliseconds). |
timeToPeekMillis | The time it takes to read an event from the queue (in milliseconds). |
timeToRemoveMillis | The time it takes to remove an event from the queue (in milliseconds). |
timeToBackupQueueMillis | The time it takes to back up a queue (in milliseconds). |
sendCounter | The number of events that were successfully replicated. |
droppedOnSendCounter | The number of events that couldn't be replicated. |
timeToSendMillis | The time it takes to replicate an event to other nodes (in milliseconds). Can be interpreted as the latency between publisher and receiver nodes. |
sendFailureExceptionCounter | The number of errors that occurred on the gRPC-based communication layer. |
sendUnavailableExceptionCounter | The number of unavailability errors that indicate the receiver node was not reachable from the publisher node. For example, due to the node being offline. |
numberOfEventTypes | The number of actively replicating event types. |
addCounterTopN | The number of events replicated by the topN event types. |
addCounterOthers | The number of events replicated by all other (non-topN) event types. |
addCounterByEventTypeTopN | Sorted list of topN event types (10 by default): event type name + number of messages. If the top 10 is not enough, you can change that by assigning a higher value to the For example:
|
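The `...MillisAgo` parameters appear to be the difference between `timestampMillis` and the matching `...TimestampMillis` value; the numbers in the sample statistics on this page are consistent with that reading. The following is a minimal sketch under that assumption. The helper names `millis_ago` and `to_utc` are hypothetical, and treating a timestamp of 0 as "never happened" is also an assumption based on the sample.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_ago(stats, timestamp_field):
    """Elapsed milliseconds between an event and statistics generation.

    Returns None when the timestamp is 0, which this sketch assumes
    means the event has not happened yet.
    """
    event_ts = stats[timestamp_field]
    if event_ts == 0:
        return None
    return stats["timestampMillis"] - event_ts


def to_utc(millis):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# Values copied from the sample statistics on this page.
stats = {
    "timestampMillis": 1706043222390,
    "startTimestampMillis": 1706043102392,
    "lastAddTimestampMillis": 1706043220684,
    "lastBackupQueueTimestampMillis": 0,
}

print(millis_ago(stats, "startTimestampMillis"))    # matches startMillisAgo: 119998
print(millis_ago(stats, "lastAddTimestampMillis"))  # matches lastAddMillisAgo: 1706
print(to_utc(stats["timestampMillis"]).isoformat())
```

Converting the timestamps to UTC makes it easier to line up a statistics entry with other log entries from the same period.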
Sample statistics
Statistics are presented in JSON format. By default, statistics are aggregated in a single line to limit the amount of space they take up. To pretty-print statistics similar to what you can see in the example below, set the bamboo.per.node.queue.stats.pretty.printing.enabled property to true.
Here's a sample to give you an idea of what to look for:
{
  "timestampMillis": 1706043222390,
  "nodeId": "ea8cbc38-ac78-41a1-a7a8-879ff3757429",
  "queueSize": 0,
  "startQueueSize": 0,
  "startTimestampMillis": 1706043102392,
  "startMillisAgo": 119998,
  "closeCounter": 0,
  "addCounter": 46,
  "droppedOnAddCounter": 0,
  "criticalAddCounter": 0,
  "criticalPeekCounter": 0,
  "criticalRemoveCounter": 0,
  "peekCounter": 89,
  "removeCounter": 46,
  "backupQueueCounter": 0,
  "closeErrorsCounter": 0,
  "addErrorsCounter": 0,
  "peekErrorsCounter": 0,
  "removeErrorsCounter": 0,
  "backupQueueErrorsCounter": 0,
  "lastAddTimestampMillis": 1706043220684,
  "lastAddMillisAgo": 1706,
  "lastPeekTimestampMillis": 1706043220689,
  "lastPeekMillisAgo": 1701,
  "lastRemoveTimestampMillis": 1706043220689,
  "lastRemoveMillisAgo": 1701,
  "lastBackupQueueTimestampMillis": 0,
  "lastBackupQueueMillisAgo": 0,
  "timeToAddMillis": {
    "count": 46,
    "min": 0,
    "max": 2,
    "sum": 5,
    "avg": 0,
    "distributionCounter": {
      "10": 46,
      "20": 0,
      "50": 0,
      "100": 0
    }
  },
  "timeToPeekMillis": {
    "count": 89,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "timeToRemoveMillis": {
    "count": 46,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {
      "10": 46,
      "20": 0,
      "50": 0,
      "100": 0
    }
  },
  "timeToBackupQueueMillis": {
    "count": 0,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "sendCounter": 46,
  "droppedOnSendCounter": 0,
  "timeToSendMillis": {
    "count": 46,
    "min": 0,
    "max": 191,
    "sum": 529,
    "avg": 12,
    "distributionCounter": {
      "10": 38,
      "20": 2,
      "50": 5,
      "100": 0,
      "200": 1,
      "500": 0,
      "1000": 0,
      "5000": 0
    }
  },
  "sendFailureExceptionCounter": 0,
  "sendUnavailableExceptionCounter": 0,
  "numberOfEventTypes": 5,
  "addCounterTopN": 46,
  "addCounterOthers": 0,
  "addCounterByEventTypeTopN": {
    "INVALIDATE_PLAN_CACHE": 23,
    "REFRESH_ADMINISTRATION_CONFIGURATION": 9,
    "ATLASSIAN_CACHE_REMOVE_BY_KEY": 7,
    "INVALIDATE_REPOSITORY_CACHE": 6,
    "HIDE_PLAN": 1
  }
}
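The timing parameters (`timeToAddMillis`, `timeToSendMillis`, and so on) carry a `distributionCounter` histogram alongside `count`, `sum`, and `avg`. The sketch below shows one way to read it. It assumes that each key is the upper bound of a bucket in milliseconds and that the counts are per bucket rather than cumulative; the sample is consistent with this (the bucket counts add up to `count`, and the `max` of 191 falls in the `200` bucket), but the source does not state it. The function names are hypothetical.

```python
def mean_millis(timing):
    """Unrounded mean computed from sum and count (0.0 when nothing was recorded)."""
    count = timing["count"]
    return timing["sum"] / count if count else 0.0


def bucket_for_percentile(timing, pct):
    """Smallest bucket bound (ms) covering pct percent of recorded events.

    Assumes distributionCounter keys are bucket upper bounds and the
    values are non-cumulative counts. Returns None for an empty histogram.
    """
    buckets = sorted((int(bound), n) for bound, n in timing["distributionCounter"].items())
    total = sum(n for _, n in buckets)
    if total == 0:
        return None
    needed = total * pct / 100
    running = 0
    for bound, n in buckets:
        running += n
        if running >= needed:
            return bound
    return None


# timeToSendMillis copied from the sample above.
time_to_send = {
    "count": 46, "min": 0, "max": 191, "sum": 529, "avg": 12,
    "distributionCounter": {
        "10": 38, "20": 2, "50": 5, "100": 0,
        "200": 1, "500": 0, "1000": 0, "5000": 0,
    },
}

print(mean_millis(time_to_send))                # 11.5 (the reported avg of 12 looks rounded)
print(bucket_for_percentile(time_to_send, 50))  # 10
print(bucket_for_percentile(time_to_send, 90))  # 50
```

Under these assumptions, half of the events in the sample were replicated within 10 ms and 90% within 50 ms, which gives a clearer picture of latency than the average alone when a single slow event (191 ms here) skews it.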
Basic monitoring recommendations
You don't need to keep a close eye on all parameters to tell whether state replication is working properly. We've selected just a few basic parameters that give you reliable insight into the health of cluster state replication in your Data Center deployment. Review the recommended values in the following table and compare them to what you're seeing in your environment.
Parameter | Description | Recommended value |
---|---|---|
queueSize | The current size of the queue. This number represents how many events are waiting to be replicated. | The value should be close to 0. It means that events are successfully replicated, and then removed from the queue. |
startQueueSize + addCounter | The initial size of the queue when the node was started and the number of events added to the queue since then. | The sum of startQueueSize + addCounter should be equal to sendCounter, and to removeCounter. |
sendCounter + removeCounter | The number of events replicated to another node and the number of events removed from the queue. | Both values should be equal to the sum of startQueueSize + addCounter. It means that all events added to the queue were successfully replicated and then removed from the queue. |
timeToAddMillis | The time it takes to add events to the local queue. | The average values should be close to 0-10 milliseconds. If it's much more than that, you're having I/O problems with the local home directory storage volume. |
timeToRemoveMillis | The time it takes to remove events from the local queue. | |
timeToSendMillis | The time it takes to send the event from the local queue to another node over gRPC. | |
droppedOnAddCounter, droppedOnSendCounter | The number of events that couldn't be added to the queue, or replicated to other nodes. | The values should be close to 0. It means that events are successfully being added to the queue and replicated to other nodes. |
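The recommendations in the table can be turned into a simple automated check. The following is a minimal sketch, not an official tool: the function name `check_replication_health` and the thresholds `max_queue_size` and `max_avg_millis` are hypothetical, and "close to 0" is interpreted here as a configurable limit. It expects one statistics entry that has already been parsed from JSON into a dictionary.

```python
def check_replication_health(stats, max_queue_size=5, max_avg_millis=10):
    """Return a list of warnings based on the basic monitoring recommendations.

    An empty list means none of the checks fired. Thresholds are
    illustrative assumptions, not documented limits.
    """
    warnings = []

    # queueSize should stay close to 0.
    if stats["queueSize"] > max_queue_size:
        warnings.append(f"queueSize is {stats['queueSize']}")

    # startQueueSize + addCounter should equal sendCounter and removeCounter.
    # Events still in flight can cause a brief mismatch on a busy node.
    expected = stats["startQueueSize"] + stats["addCounter"]
    for name in ("sendCounter", "removeCounter"):
        if stats[name] != expected:
            warnings.append(f"{name} is {stats[name]}, expected {expected}")

    # Average local queue timings should stay within roughly 0-10 ms.
    for name in ("timeToAddMillis", "timeToRemoveMillis"):
        avg = stats[name]["avg"]
        if avg > max_avg_millis:
            warnings.append(f"{name} avg is {avg} ms")

    # Dropped events should stay at 0.
    for name in ("droppedOnAddCounter", "droppedOnSendCounter"):
        if stats[name] > 0:
            warnings.append(f"{name} is {stats[name]}")

    return warnings


# Relevant fields from the sample statistics on this page.
sample = {
    "queueSize": 0, "startQueueSize": 0, "addCounter": 46,
    "sendCounter": 46, "removeCounter": 46,
    "droppedOnAddCounter": 0, "droppedOnSendCounter": 0,
    "timeToAddMillis": {"avg": 0}, "timeToRemoveMillis": {"avg": 0},
}

print(check_replication_health(sample))  # [] for the healthy sample
```

Because the counters are cumulative since the queue was started, a single mismatch on a busy node is not necessarily a problem; a warning that persists across several consecutive statistics entries is a stronger signal.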