Monitoring the cache replication

Jira Data Center cache replication

On this page

Still need help?

The Atlassian Community is here for you.

Ask the community

Statistics describing the cache replication are written into the main Jira log file, atlassian-jira.log. They show the details of cache replication, like the size of the local queue or the average time it takes to replicate a cache modification.

Logged statistics

Each node logs statistics that describe the state of cache replication. There are two types of statistics:

  • Total: statistics aggregated since the node started.

    In the log file...
    Cache replication queue stats per node: {destination_node_id} total stats: {statistics_in_json}
  • Snapshot: statistics aggregated since the last snapshot was taken (60 sec. by default)

    In the log file...
    Cache replication queue stats per node: {destination_node_id} snapshot stats: {statistics_in_json}

Example:

In a 3-node cluster, you’ll find the following statistics on Node 1.

  • Total: Node 1 to Node 2 replication
  • Total: Node 1 to Node 3 replication
  • Snapshot: Node 1 to Node 2 replication
  • Snapshot: Node 1 to Node 3 replication


Monitoring cache replication

Sample statistics

Statistics are presented in the JSON format. Here are sample ones to give you an idea what to look for.


Show sample statistics...
{
   "timestampMillis":1526489492166, 
   "nodeId":"i-02c18edad029cc01a",
   "queueSize":0,
   "startQueueSize":0,
   "startTimestampMillis":1523651486765,
   "startMillisAgo":2838005401,
   "closeCounter":0,
   "addCounter":320528,
   "droppedOnAddCounter":0,
   "criticalAddCounter":0,
   "criticalPeekCounter":2,
   "criticalRemoveCounter":0,
   "peekCounter":0,
   "peekOrBlockCounter":320528,
   "removeCounter":320528,
   "backupQueueCounter":0,
   "closeErrorsCounter":0,
   "addErrorsCounter":0,
   "peekErrorsCounter":0,
   "peekOrBlockErrorsCounter":2,
   "removeErrorsCounter":0,
   "backupQueueErrorsCounter":0,
   "lastAddTimestampMillis":1526489491576,
   "lastAddMillisAgo":590,
   "lastPeekTimestampMillis":0,
   "lastPeekMillisAgo":0,
   "lastPeekOrBlockTimestampMillis":1526489491576,
   "lastPeekOrBlockMillisAgo":590,
   "lastRemoveTimestampMillis":1526489491580,
   "lastRemoveMillisAgo":586,
   "lastBackupQueueTimestampMillis":0,
   "lastBackupQueueMillisAgo":0,
   "timeToAddMillis":{
      "count":320528,
      "min":0,
      "max":1368,
      "sum":393651,
      "avg":1
   },
   "timeToPeekMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0
   },
   "timeToPeekOrBlockMillis":{
      "count":320528,
      "min":0,
      "max":28113261,
      "sum":28375252091,
      "avg":88526
   },
   "timeToRemoveMillis":{
      "count":320528,
      "min":0,
      "max":1448,
      "sum":209845,
      "avg":0
   },
   "timeToBackupQueueMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0
   },
   "staleCounter":0,
   "sendCounter":320526,
   "droppedOnSendCounter":0,
   "timeToSendMillis":{
      "count":320526,
      "min":0,
      "max":1768,
      "sum":771778,
      "avg":2
   },
   "sendRuntimeExceptionCounter":0,
   "sendCheckedExceptionCounter":0
}

Basic monitoring

By monitoring just these few parameters, you can tell whether cache replication is working properly. Review the recommended values in the table below, and compare them to what you're seeing in your environment.


ParameterDescriptionRecommended value
queueSize

The current size of the queue, which is how many cache modifications are waiting to be replicated.

The value should be close to 0.

It means that modifications are successfully replicated, and then removed from the queue.

startQueueSize + addCounter

The initial size of the queue when the node was started, and the number of modifications added to the queue since then.

The sum of startQueueSize + addCounter should be equal to sendCounter, and to removeCounter.

It means that all modifications added to the queue were successfully replicated and then removed from the queue.

sendCounter = removeCounter

The number of modifications replicated to other node, and the number of modifications removed from the queue.

timeToAddMillis

The time it takes to add and remove a modification from the local queue

The avg values should be close to 1-10 milliseconds.

If it's much more than that, you're having I/O problems with the storage where you have the local home directory. 

timeToRemoveMillis

timeToSendMillis

The time it takes to send the modification from the local queue to the other node over RMI


droppedOnAddCounter

droppedOnSendCounter

The number of modifications that couldn't be added to the queue, or replicated to other nodes.

The values should be close to 0.

It means that all modifications are successfully added to the queue and replicated to other nodes.


Parameters

The table below shows all parameters that you can find in the statistics.


Show all parameters...
ParameterDescription

timestampMillis

The timestamp when the statistics were generated.

nodeIdDestination node ID.
queueSizeCurrent queue size (number of modifications).
startQueueSize

The initial queue size when the node was started. For example, if the node was restarted with 20 modification requests still in the queue, this parameter will show 20 once the node is up again.

startTimestampMillis

The timestamp when the queue was started on this node.

startMillisAgo

How long the queue has been active.

closeCounter

Changes to 1 after closing the queue. The queue will be closed when a node's status changes to offline.


addCounter

The number of modifications added to the queue.

droppedOnAddCounter

The number of modifications that couldn't be added to the queue.

criticalAddCounter

The number of modifications that couldn't be added to the queue. In the log file:

Critical state of local cache replication queue - cannot add: {cache-operation} to queue: {queue-name}, error: {error-message}
criticalPeekCounter

The number of modifications that couldn't be read from the queue. In the log file: 

Critical state of local cache replication queue - cannot peek from queue: {queue-name}, error: {error-message}
criticalRemoveCounter

The number of modifications that couldn't be removed from the queue. In the log file:

Critical state of local cache replication queue - cannot remove localQCacheOp: {cache-operation} from queue: {queue-name}, error: {error-message}.
peekOrBlockCounter

The number of modifications read from the queue. These modifications are then replicated and removed from the queue.

removeCounter

The number of modifications removed from the queue. It usually means these replications were successfully replicated.

backupQueueCounter

The number of backups created for a queue. Backups are created when the queue file is corrupted and we can't read it or write to it. After a backup is created, a new queue is started. In the log file:

Re-created persistent cache replication queue for node: {destination-node} with id: {queue-name} in : {localq-home}
closeErrorsCounter

Increases when a queue couldn't be closed.

addErrorsCounter

Increases when a cache modification request can't be added to the queue.

peekOrBlockErrorsCounter

Increases when a cache modification request can't be read from the queue.

removeErrorsCounter

Increases when a cache modification request can't be removed from the queue.

backupQueueErrorsCounter

Increases when a queue backup can't be created.

lastAddTimestampMillis

The timestamp of the last cache modification request added to the queue.

lastPeekOrBlockTimestampMillis
lastPeekOrBlockMillisAgo

The timestamp of the last cache modification request read from the queue.
lastRemoveTimestampMillis
lastRemoveMillisAgo

The timestamp of the last cache modification request removed to the queue.

lastBackupQueueTimestampMillis
lastBackupQueueMillisAgo

The timestamp of the last backup created.


timeToAddMillisThe time it takes to add a modification to the queue.
timeToPeekOrBlockMillisThe time it takes to read a modification request from the queue.
timeToRemoveMillis

The time it takes to remove a modification request from the queue.

timeToBackupQueueMillis

The time it takes to back up a queue.

staleCounter

The number of modification requests that haven't been replicated because the destination node was restarted after the modification was created. Such modifications don't need to be delivered, because after the restart the node will load the values from the database.

sendCounter

The number of modifications that were successfully replicated.

droppedOnSendCounter

The number of modifications that couldn't be replicated.

timeToSendMillisThe time it takes to replicate a modification to other nodes (milliseconds.)
sendRuntimeExceptionCounter

The number of unrecoverable errors that occurred.

sendCheckedExceptionCounter

The number of recoverable errors that occurred.


Last modified on Jul 9, 2019

Was this helpful?

Yes
No
Provide feedback about this article
Powered by Confluence and Scroll Viewport.