Monitoring the cache replication

Jira Data Center cache replication

On this page

Still need help?

The Atlassian Community is here for you.

Ask the community

Statistics describing the cache replication are written into the main Jira log file, atlassian-jira.log. They show the details of cache replication, like the size of the local queue or the average time it takes to replicate a cache modification.

Logged statistics

Each node logs statistics that describe the state of cache replication. There are two types of statistics:

  • Total: statistics aggregated since the node started.

    In the log file...
    Cache replication queue stats per node: {destination_node_id} total stats: {statistics_in_json}
  • Snapshot: statistics aggregated since the last snapshot was taken (60 sec. by default)

    In the log file...
    Cache replication queue stats per node: {destination_node_id} snapshot stats: {statistics_in_json}

Example:

In a 3-node cluster, you’ll find the following statistics on Node 1.

  • Total: Node 1 to Node 2 replication
  • Total: Node 1 to Node 3 replication
  • Snapshot: Node 1 to Node 2 replication
  • Snapshot: Node 1 to Node 3 replication


Monitoring cache replication

Sample statistics

Statistics are presented in the JSON format. Here are sample ones to give you an idea what to look for.


Show sample statistics...
{
   "timestampMillis":1526489492166, 
   "nodeId":"i-02c18edad029cc01a",
   "queueSize":0,
   "startQueueSize":0,
   "startTimestampMillis":1523651486765,
   "startMillisAgo":2838005401,
   "closeCounter":0,
   "addCounter":320528,
   "droppedOnAddCounter":0,
   "criticalAddCounter":0,
   "criticalPeekCounter":2,
   "criticalRemoveCounter":0,
   "peekCounter":0,
   "peekOrBlockCounter":320528,
   "removeCounter":320528,
   "backupQueueCounter":0,
   "closeErrorsCounter":0,
   "addErrorsCounter":0,
   "peekErrorsCounter":0,
   "peekOrBlockErrorsCounter":2,
   "removeErrorsCounter":0,
   "backupQueueErrorsCounter":0,
   "lastAddTimestampMillis":1526489491576,
   "lastAddMillisAgo":590,
   "lastPeekTimestampMillis":0,
   "lastPeekMillisAgo":0,
   "lastPeekOrBlockTimestampMillis":1526489491576,
   "lastPeekOrBlockMillisAgo":590,
   "lastRemoveTimestampMillis":1526489491580,
   "lastRemoveMillisAgo":586,
   "lastBackupQueueTimestampMillis":0,
   "lastBackupQueueMillisAgo":0,
   "timeToAddMillis":{
      "count":320528,
      "min":0,
      "max":1368,
      "sum":393651,
      "avg":1
   },
   "timeToPeekMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0
   },
   "timeToPeekOrBlockMillis":{
      "count":320528,
      "min":0,
      "max":28113261,
      "sum":28375252091,
      "avg":88526
   },
   "timeToRemoveMillis":{
      "count":320528,
      "min":0,
      "max":1448,
      "sum":209845,
      "avg":0
   },
   "timeToBackupQueueMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0
   },
   "staleCounter":0,
   "sendCounter":320526,
   "droppedOnSendCounter":0,
   "timeToSendMillis":{
      "count":320526,
      "min":0,
      "max":1768,
      "sum":771778,
      "avg":2
   },
   "sendRuntimeExceptionCounter":0,
   "sendCheckedExceptionCounter":0
}

Basic monitoring

By monitoring just these few parameters, you can tell whether cache replication is working properly. Review the recommended values in the table below, and compare them to what you're seeing in your environment.


ParameterDescriptionRecommended value
queueSize

The current size of the queue, which is how many cache modifications are waiting to be replicated.

The value should be close to 0.

It means that modifications are successfully replicated, and then removed from the queue.

startQueueSize + addCounter

The initial size of the queue when the node was started, and the number of modifications added to the queue since then.

The sum of startQueueSize + addCounter should be equal to sendCounter, and to removeCounter.

It means that all modifications added to the queue were successfully replicated and then removed from the queue.

sendCounter = removeCounter

The number of modifications replicated to other node, and the number of modifications removed from the queue.

timeToAddMillis

The time it takes to add and remove a modification from the local queue

The avg values should be close to 1-10 milliseconds.

If it's much more than that, you're having I/O problems with the storage where you have the local home directory. 

timeToRemoveMillis

timeToSendMillis

The time it takes to send the modification from the local queue to the other node over RMI


droppedOnAddCounter

droppedOnSendCounter

The number of modifications that couldn't be added to the queue, or replicated to other nodes.

The values should be close to 0.

It means that all modifications are successfully added to the queue and replicated to other nodes.


Parameters

The table below shows all parameters that you can find in the statistics.


Show all parameters...
ParameterDescription

timestampMillis

The timestamp when the statistics were generated.

nodeIdDestination node ID.
queueSizeCurrent queue size (number of modifications).
startQueueSize

The initial queue size when the node was started. For example, if the node was restarted with 20 modification requests still in the queue, this parameter will show 20 once the node is up again.

startTimestampMillis

The timestamp when the queue was started on this node.

startMillisAgo

How long the queue has been active.

closeCounter

Changes to 1 after closing the queue. The queue will be closed when a node's status changes to offline.


addCounter

The number of modifications added to the queue.

droppedOnAddCounter

The number of modifications that couldn't be added to the queue.

criticalAddCounter

The number of modifications that couldn't be added to the queue. In the log file:

Critical state of local cache replication queue - cannot add: {cache-operation} to queue: {queue-name}, error: {error-message}
criticalPeekCounter

The number of modifications that couldn't be read from the queue. In the log file: 

Critical state of local cache replication queue - cannot peek from queue: {queue-name}, error: {error-message}
criticalRemoveCounter

The number of modifications that couldn't be removed from the queue. In the log file:

Critical state of local cache replication queue - cannot remove localQCacheOp: {cache-operation} from queue: {queue-name}, error: {error-message}.
peekOrBlockCounter

The number of modifications read from the queue. These modifications are then replicated and removed from the queue.

removeCounter

The number of modifications removed from the queue. It usually means these replications were successfully replicated.

backupQueueCounter

The number of backups created for a queue. Backups are created when the queue file is corrupted and we can't read it or write to it. After a backup is created, a new queue is started. In the log file:

Re-created persistent cache replication queue for node: {destination-node} with id: {queue-name} in : {localq-home}
closeErrorsCounter

Increases when a queue couldn't be closed.

addErrorsCounter

Increases when a cache modification request can't be added to the queue.

peekOrBlockErrorsCounter

Increases when a cache modification request can't be read from the queue.

removeErrorsCounter

Increases when a cache modification request can't be removed from the queue.

backupQueueErrorsCounter

Increases when a queue backup can't be created.

lastAddTimestampMillis

The timestamp of the last cache modification request added to the queue.

lastPeekOrBlockTimestampMillis
lastPeekOrBlockMillisAgo

The timestamp of the last cache modification request read from the queue.
lastRemoveTimestampMillis
lastRemoveMillisAgo

The timestamp of the last cache modification request removed to the queue.

lastBackupQueueTimestampMillis
lastBackupQueueMillisAgo

The timestamp of the last backup created.


timeToAddMillisThe time it takes to add a modification to the queue.
timeToPeekOrBlockMillisThe time it takes to read a modification request from the queue.
timeToRemoveMillis

The time it takes to remove a modification request from the queue.

timeToBackupQueueMillis

The time it takes to back up a queue.

staleCounter

The number of modification requests that haven't been replicated because the destination node was restarted after the modification was created. Such modifications don't need to be delivered, because after the restart the node will load the values from the database.

sendCounter

The number of modifications that were successfully replicated.

droppedOnSendCounter

The number of modifications that couldn't be replicated.

timeToSendMillisThe time it takes to replicate a modification to other nodes (milliseconds.)
sendRuntimeExceptionCounter

The number of unrecoverable errors that occurred.

sendCheckedExceptionCounter

The number of recoverable errors that occurred.


Changes to localq and new stats

There were couple of changes done to localqs I would like to them describe here.

[VIA-COPY] vs [VIA-INVALIDATION]

JIRA 8.12

Since JIRA 8.12 the channel for replicating messages from caches replicated via copy slightly differs from messages from caches replicated via invalidation.

The main difference is that messages from caches replicated via invalidation have guaranteed delivery, and messages from caches replicated via copy have NO guarantee of delivery. Technically these are the main differences:

  • no sync to file after message added to queue [via-copy]

  • much smaller queue size: 10x10k [via-copy] vs 10x100k [via-invalidation]

  • no retries when delivery to the other node fails [via-copy]

  • messages only delivered to live nodes (with active heartbeat) [via-copy] vs delivered to all active (ACTIVE state so not OFFLINE) nodes [via-invalidation]

Because of this we have now a different set of LOCALQ stats for both channels with different prefixes:

  • [LOCALQ][VIA-COPY]

  • [LOCALQ][VIA-INVALIDATION]

In most of the cases [via-copy] is about replicating DBR, user and groups and [via-invalidation] about everything else.

Top caches

JIRA 8.13.4

Have you ever wondered which cache is the root cause of all those millions of replication messages? Sure you did. But until now it was really tricky to get this information. You had to turn DEBUG logging for every replication message and then aggregate. Because of the volume of the log messages it would generate on a large instance it may have not be even an option on a production instance.

Now in every LOCALQ stats you will see top 10 cache contributors. In total stats you will find the top contributors from the start of this instance and in snapshot stats you will see the top contributors from last 5min.

This one is available since JIRA 8.14 and was just back-ported to JIRA 8.13.4.

New stats example

...
  "numberOfCaches": 29,
  "addCounterTopN": 1421,
  "addCounterOthers": 113,
  "addCounterByCacheTopN": {
    "UserPropertyManager.ApplicationUser.cache": 543,
    "com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
    "com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache": 182,
    "com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
    "com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
    "DbBackedCachedPropertySetManager.jira.properties.cache": 66,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
    "com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
  }
...

numberOfCaches - number of replicating caches (actively replicating)

addCounterTopN - number of messages replicated by the topN caches

addCounterOthers - number of messages replicated by all other caches

addCounterByCacheTopN - sorted list of topN (10) caches: cache name + number of messages

So here the most active cache is UserPropertyManager.ApplicationUser.cache which send 543 messages from 1421 + 113 total replications.

TopN configuration

If top 10 is not enough you can change it via system property:

-Dcom.atlassian.jira.cluster.distribution.localq.stats.queue.names.topN=20

Full log message example

[LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: node2 total stats: 
 {
  "timestampMillis": 1610720566437,
  "nodeId": "node2",
  "queueSize": 0,
  "startQueueSize": 0,
  "startTimestampMillis": 1610707317150,
  "startMillisAgo": 13249287,
  "closeCounter": 0,
  "addCounter": 1534,
  "droppedOnAddCounter": 0,
  "criticalAddCounter": 0,
  "criticalPeekCounter": 0,
  "criticalRemoveCounter": 0,
  "peekCounter": 0,
  "peekOrBlockCounter": 1594,
  "removeCounter": 1534,
  "backupQueueCounter": 0,
  "closeErrorsCounter": 0,
  "addErrorsCounter": 0,
  "peekErrorsCounter": 0,
  "peekOrBlockErrorsCounter": 0,
  "removeErrorsCounter": 0,
  "backupQueueErrorsCounter": 0,
  "lastAddTimestampMillis": 1610720560592,
  "lastAddMillisAgo": 5845,
  "lastPeekTimestampMillis": 0,
  "lastPeekMillisAgo": 0,
  "lastPeekOrBlockTimestampMillis": 1610720560592,
  "lastPeekOrBlockMillisAgo": 5845,
  "lastRemoveTimestampMillis": 1610720560595,
  "lastRemoveMillisAgo": 5842,
  "lastBackupQueueTimestampMillis": 0,
  "lastBackupQueueMillisAgo": 0,
  "timeToAddMillis": {
    "count": 1534,
    "min": 1,
    "max": 13,
    "sum": 2124,
    "avg": 1,
    "distributionCounter": {
      "10": 1531,
      "20": 3,
      "50": 0,
      "100": 0
    }
  },
  "timeToPeekMillis": {
    "count": 0,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "timeToPeekOrBlockMillis": {
    "count": 1594,
    "min": 0,
    "max": 1549827,
    "sum": 131262632,
    "avg": 82347,
    "distributionCounter": {}
  },
  "timeToRemoveMillis": {
    "count": 1534,
    "min": 0,
    "max": 13,
    "sum": 1376,
    "avg": 0,
    "distributionCounter": {
      "10": 1533,
      "20": 1,
      "50": 0,
      "100": 0
    }
  },
  "timeToBackupQueueMillis": {
    "count": 0,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "staleCounter": 0,
  "sendCounter": 1479,
  "droppedOnSendCounter": 0,
  "timeToSendMillis": {
    "count": 1479,
    "min": 0,
    "max": 2114,
    "sum": 8830,
    "avg": 5,
    "distributionCounter": {
      "10": 1461,
      "20": 6,
      "50": 2,
      "100": 2,
      "200": 3,
      "500": 0,
      "1000": 1,
      "5000": 4
    }
  },
  "sendRuntimeExceptionCounter": 7,
  "sendCheckedExceptionCounter": 0,
  "sendNotBoundExceptionCounter": 108,
  "numberOfCaches": 29,
  "addCounterTopN": 1421,
  "addCounterOthers": 113,
  "addCounterByCacheTopN": {
    "UserPropertyManager.ApplicationUser.cache": 543,
    "com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
    "com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache": 182,
    "com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
    "com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
    "DbBackedCachedPropertySetManager.jira.properties.cache": 66,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
    "com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
  }
}


Last modified on Jan 26, 2022

Was this helpful?

Yes
No
Provide feedback about this article
Powered by Confluence and Scroll Viewport.