Monitoring the cache replication

Statistics describing the cache replication are written into the main Jira log file, atlassian-jira.log. They show the details of cache replication, like the size of the local queue or the average time it takes to replicate a cache modification.

Logged statistics

Each node logs statistics that describe the state of cache replication. There are two types of statistics:

Total: statistics aggregated since the node started.

In the log file...

Cache replication queue stats per node: {destination_node_id} total stats: {statistics_in_json}

Snapshot: statistics aggregated since the last snapshot was taken (60 sec. by default)
In the log file...
Cache replication queue stats per node: {destination_node_id} snapshot stats: {statistics_in_json}

Example:

In a 3-node cluster, you’ll find the following statistics on Node 1.

Total: Node 1 to Node 2 replication
Total: Node 1 to Node 3 replication
Snapshot: Node 1 to Node 2 replication
Snapshot: Node 1 to Node 3 replication

Monitoring cache replication

Sample statistics

Statistics are presented in the JSON format. Here are sample ones to give you an idea what to look for.

Show sample statistics...

{
   "timestampMillis":1526489492166, 
   "nodeId":"i-02c18edad029cc01a",
   "queueSize":0,
   "startQueueSize":0,
   "startTimestampMillis":1523651486765,
   "startMillisAgo":2838005401,
   "closeCounter":0,
   "addCounter":320528,
   "droppedOnAddCounter":0,
   "criticalAddCounter":0,
   "criticalPeekCounter":2,
   "criticalRemoveCounter":0,
   "peekCounter":0,
   "peekOrBlockCounter":320528,
   "removeCounter":320528,
   "backupQueueCounter":0,
   "closeErrorsCounter":0,
   "addErrorsCounter":0,
   "peekErrorsCounter":0,
   "peekOrBlockErrorsCounter":2,
   "removeErrorsCounter":0,
   "backupQueueErrorsCounter":0,
   "lastAddTimestampMillis":1526489491576,
   "lastAddMillisAgo":590,
   "lastPeekTimestampMillis":0,
   "lastPeekMillisAgo":0,
   "lastPeekOrBlockTimestampMillis":1526489491576,
   "lastPeekOrBlockMillisAgo":590,
   "lastRemoveTimestampMillis":1526489491580,
   "lastRemoveMillisAgo":586,
   "lastBackupQueueTimestampMillis":0,
   "lastBackupQueueMillisAgo":0,
   "timeToAddMillis":{
      "count":320528,
      "min":0,
      "max":1368,
      "sum":393651,
      "avg":1
   },
   "timeToPeekMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0
   },
   "timeToPeekOrBlockMillis":{
      "count":320528,
      "min":0,
      "max":28113261,
      "sum":28375252091,
      "avg":88526
   },
   "timeToRemoveMillis":{
      "count":320528,
      "min":0,
      "max":1448,
      "sum":209845,
      "avg":0
   },
   "timeToBackupQueueMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0
   },
   "staleCounter":0,
   "sendCounter":320526,
   "droppedOnSendCounter":0,
   "timeToSendMillis":{
      "count":320526,
      "min":0,
      "max":1768,
      "sum":771778,
      "avg":2
   },
   "sendRuntimeExceptionCounter":0,
   "sendCheckedExceptionCounter":0
}

Basic monitoring

By monitoring just these few parameters, you can tell whether cache replication is working properly. Review the recommended values in the table below, and compare them to what you're seeing in your environment.

Parameter	Description	Recommended value
`queueSize`	The current size of the queue, which is how many cache modifications are waiting to be replicated.	The value should be close to 0. It means that modifications are successfully replicated, and then removed from the queue.
`startQueueSize` + `addCounter`	The initial size of the queue when the node was started, and the number of modifications added to the queue since then.	The sum of `startQueueSize` + `addCounter` should be equal to `sendCounter`, and to `removeCounter`. It means that all modifications added to the queue were successfully replicated and then removed from the queue.
`sendCounter` = `removeCounter`	The number of modifications replicated to other node, and the number of modifications removed from the queue.
`timeToAddMillis`	The time it takes to add and remove a modification from the local queue	The `avg` values should be close to 1-10 milliseconds. If it's much more than that, you're having I/O problems with the storage where you have the local home directory.
`timeToRemoveMillis`
timeToSendMillis	The time it takes to send the modification from the local queue to the other node over RMI
`droppedOnAddCounter` `droppedOnSendCounter`	The number of modifications that couldn't be added to the queue, or replicated to other nodes.	The values should be close to 0. It means that all modifications are successfully added to the queue and replicated to other nodes.

Parameters

The table below shows all parameters that you can find in the statistics.

Show all parameters...

Parameter	Description
`timestampMillis`	The timestamp when the statistics were generated.
`nodeId`	Destination node ID.
`queueSize`	Current queue size (number of modifications).
`startQueueSize`	The initial queue size when the node was started. For example, if the node was restarted with 20 modification requests still in the queue, this parameter will show 20 once the node is up again.
`startTimestampMillis`	The timestamp when the queue was started on this node.
`startMillisAgo`	How long the queue has been active.
`closeCounter`	Changes to 1 after closing the queue. The queue will be closed when a node's status changes to `offline`.
`addCounter`	The number of modifications added to the queue.
`droppedOnAddCounter`	The number of modifications that couldn't be added to the queue.
`criticalAddCounter`	The number of modifications that couldn't be added to the queue. In the log file: `Critical state of local cache replication queue - cannot add: {cache-operation} to queue: {queue-name}, error: {error-message}`
`criticalPeekCounter`	The number of modifications that couldn't be read from the queue. In the log file: `Critical state of local cache replication queue - cannot peek from queue: {queue-name}, error: {error-message}`
`criticalRemoveCounter`	The number of modifications that couldn't be removed from the queue. In the log file: `Critical state of local cache replication queue - cannot remove localQCacheOp: {cache-operation} from queue: {queue-name}, error: {error-message}.`
`peekOrBlockCounter`	The number of modifications read from the queue. These modifications are then replicated and removed from the queue.
`removeCounter`	The number of modifications removed from the queue. It usually means these replications were successfully replicated.
`backupQueueCounter`	The number of backups created for a queue. Backups are created when the queue file is corrupted and we can't read it or write to it. After a backup is created, a new queue is started. In the log file: `Re-created persistent cache replication queue for node: {destination-node} with id: {queue-name} in : {localq-home}`
`closeErrorsCounter`	Increases when a queue couldn't be closed.
`addErrorsCounter`	Increases when a cache modification request can't be added to the queue.
`peekOrBlockErrorsCounter`	Increases when a cache modification request can't be read from the queue.
`removeErrorsCounter`	Increases when a cache modification request can't be removed from the queue.
`backupQueueErrorsCounter`	Increases when a queue backup can't be created.
`lastAddTimestampMillis`	The timestamp of the last cache modification request added to the queue.
`lastPeekOrBlockTimestampMillis lastPeekOrBlockMillisAgo`	The timestamp of the last cache modification request read from the queue.
`lastRemoveTimestampMillis lastRemoveMillisAgo`	The timestamp of the last cache modification request removed to the queue.
`lastBackupQueueTimestampMillis lastBackupQueueMillisAgo`	The timestamp of the last backup created.
`timeToAddMillis`	The time it takes to add a modification to the queue.
`timeToPeekOrBlockMillis`	The time it takes to read a modification request from the queue.
`timeToRemoveMillis`	The time it takes to remove a modification request from the queue.
`timeToBackupQueueMillis`	The time it takes to back up a queue.
`staleCounter`	The number of modification requests that haven't been replicated because the destination node was restarted after the modification was created. Such modifications don't need to be delivered, because after the restart the node will load the values from the database.
`sendCounter`	The number of modifications that were successfully replicated.
`droppedOnSendCounter`	The number of modifications that couldn't be replicated.
`timeToSendMillis`	The time it takes to replicate a modification to other nodes (milliseconds.)
`sendRuntimeExceptionCounter`	The number of unrecoverable errors that occurred.
`sendCheckedExceptionCounter`	The number of recoverable errors that occurred.

Changes to localq and new stats

There were couple of changes done to localqs I would like to them describe here.

[VIA-COPY] vs [VIA-INVALIDATION]

JIRA 8.12

Since JIRA 8.12 the channel for replicating messages from caches replicated via copy slightly differs from messages from caches replicated via invalidation.

The main difference is that messages from caches replicated via invalidation have guaranteed delivery, and messages from caches replicated via copy have NO guarantee of delivery. Technically these are the main differences:

no sync to file after message added to queue [via-copy]
much smaller queue size: 10x10k [via-copy] vs 10x100k [via-invalidation]
no retries when delivery to the other node fails [via-copy]
messages only delivered to live nodes (with active heartbeat) [via-copy] vs delivered to all active (ACTIVE state so not OFFLINE) nodes [via-invalidation]

Because of this we have now a different set of LOCALQ stats for both channels with different prefixes:

[LOCALQ][VIA-COPY]
[LOCALQ][VIA-INVALIDATION]

In most of the cases [via-copy] is about replicating DBR, user and groups and [via-invalidation] about everything else.

Top caches

JIRA 8.13.4

Have you ever wondered which cache is the root cause of all those millions of replication messages? Sure you did. But until now it was really tricky to get this information. You had to turn DEBUG logging for every replication message and then aggregate. Because of the volume of the log messages it would generate on a large instance it may have not be even an option on a production instance.

Now in every LOCALQ stats you will see top 10 cache contributors. In total stats you will find the top contributors from the start of this instance and in snapshot stats you will see the top contributors from last 5min.

This one is available since JIRA 8.14 and was just back-ported to JIRA 8.13.4.

New stats example

...
  "numberOfCaches": 29,
  "addCounterTopN": 1421,
  "addCounterOthers": 113,
  "addCounterByCacheTopN": {
    "UserPropertyManager.ApplicationUser.cache": 543,
    "com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
    "com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache": 182,
    "com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
    "com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
    "DbBackedCachedPropertySetManager.jira.properties.cache": 66,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
    "com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
  }
...

numberOfCaches - number of replicating caches (actively replicating)

addCounterTopN - number of messages replicated by the topN caches

addCounterOthers - number of messages replicated by all other caches

addCounterByCacheTopN - sorted list of topN (10) caches: cache name + number of messages

So here the most active cache is UserPropertyManager.ApplicationUser.cache which send 543 messages from 1421 + 113 total replications.

TopN configuration

If top 10 is not enough you can change it via system property:

-Dcom.atlassian.jira.cluster.distribution.localq.stats.queue.names.topN=20

Full log message example

[LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: node2 total stats: 
 {
  "timestampMillis": 1610720566437,
  "nodeId": "node2",
  "queueSize": 0,
  "startQueueSize": 0,
  "startTimestampMillis": 1610707317150,
  "startMillisAgo": 13249287,
  "closeCounter": 0,
  "addCounter": 1534,
  "droppedOnAddCounter": 0,
  "criticalAddCounter": 0,
  "criticalPeekCounter": 0,
  "criticalRemoveCounter": 0,
  "peekCounter": 0,
  "peekOrBlockCounter": 1594,
  "removeCounter": 1534,
  "backupQueueCounter": 0,
  "closeErrorsCounter": 0,
  "addErrorsCounter": 0,
  "peekErrorsCounter": 0,
  "peekOrBlockErrorsCounter": 0,
  "removeErrorsCounter": 0,
  "backupQueueErrorsCounter": 0,
  "lastAddTimestampMillis": 1610720560592,
  "lastAddMillisAgo": 5845,
  "lastPeekTimestampMillis": 0,
  "lastPeekMillisAgo": 0,
  "lastPeekOrBlockTimestampMillis": 1610720560592,
  "lastPeekOrBlockMillisAgo": 5845,
  "lastRemoveTimestampMillis": 1610720560595,
  "lastRemoveMillisAgo": 5842,
  "lastBackupQueueTimestampMillis": 0,
  "lastBackupQueueMillisAgo": 0,
  "timeToAddMillis": {
    "count": 1534,
    "min": 1,
    "max": 13,
    "sum": 2124,
    "avg": 1,
    "distributionCounter": {
      "10": 1531,
      "20": 3,
      "50": 0,
      "100": 0
    }
  },
  "timeToPeekMillis": {
    "count": 0,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "timeToPeekOrBlockMillis": {
    "count": 1594,
    "min": 0,
    "max": 1549827,
    "sum": 131262632,
    "avg": 82347,
    "distributionCounter": {}
  },
  "timeToRemoveMillis": {
    "count": 1534,
    "min": 0,
    "max": 13,
    "sum": 1376,
    "avg": 0,
    "distributionCounter": {
      "10": 1533,
      "20": 1,
      "50": 0,
      "100": 0
    }
  },
  "timeToBackupQueueMillis": {
    "count": 0,
    "min": 0,
    "max": 0,
    "sum": 0,
    "avg": 0,
    "distributionCounter": {}
  },
  "staleCounter": 0,
  "sendCounter": 1479,
  "droppedOnSendCounter": 0,
  "timeToSendMillis": {
    "count": 1479,
    "min": 0,
    "max": 2114,
    "sum": 8830,
    "avg": 5,
    "distributionCounter": {
      "10": 1461,
      "20": 6,
      "50": 2,
      "100": 2,
      "200": 3,
      "500": 0,
      "1000": 1,
      "5000": 4
    }
  },
  "sendRuntimeExceptionCounter": 7,
  "sendCheckedExceptionCounter": 0,
  "sendNotBoundExceptionCounter": 108,
  "numberOfCaches": 29,
  "addCounterTopN": 1421,
  "addCounterOthers": 113,
  "addCounterByCacheTopN": {
    "UserPropertyManager.ApplicationUser.cache": 543,
    "com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
    "com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache": 182,
    "com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
    "com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
    "DbBackedCachedPropertySetManager.jira.properties.cache": 66,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
    "com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
  }
}

Products

Jira Software

Jira Service Management

Jira Work Management

Confluence

Bitbucket

Resources

Documentation

Community

System Status

Suggestions and bugs

Marketplace

Billing and licensing

Monitoring the cache replication

Jira Data Center cache replication

On this page

Still need help?

Logged statistics

Monitoring cache replication

Sample statistics

Basic monitoring

Parameters

Changes to localq and new stats

[VIA-COPY] vs [VIA-INVALIDATION]

Top caches

New stats example

TopN configuration

Full log message example

Page

Viewport

Confluence

Monitoring the cache replication

Jira Data Center cache replication

On this page

Related content

Still need help?

Logged statistics

Monitoring cache replication

Sample statistics

Basic monitoring

Parameters

Changes to localq and new stats

[VIA-COPY] vs [VIA-INVALIDATION]

Top caches

New stats example

TopN configuration

Full log message example

Related content