Monitoring the cache replication
Statistics describing the cache replication are written into the main Jira log file, atlassian-jira.log
. They show the details of cache replication, like the size of the local queue or the average time it takes to replicate a cache modification.
Logged statistics
Each node logs statistics that describe the state of cache replication. There are two types of statistics:
Total: statistics aggregated since the node started.
Snapshot: statistics aggregated since the last snapshot was taken (60 sec. by default)
Example:
In a 3-node cluster, you’ll find the following statistics on Node 1.
- Total: Node 1 to Node 2 replication
- Total: Node 1 to Node 3 replication
- Snapshot: Node 1 to Node 2 replication
- Snapshot: Node 1 to Node 3 replication
Monitoring cache replication
Sample statistics
Statistics are presented in the JSON format. Here are sample ones to give you an idea what to look for.
Basic monitoring
By monitoring just these few parameters, you can tell whether cache replication is working properly. Review the recommended values in the table below, and compare them to what you're seeing in your environment.
Parameter | Description | Recommended value |
---|---|---|
queueSize | The current size of the queue, which is how many cache modifications are waiting to be replicated. | The value should be close to 0. It means that modifications are successfully replicated, and then removed from the queue. |
| The initial size of the queue when the node was started, and the number of modifications added to the queue since then. | The sum of It means that all modifications added to the queue were successfully replicated and then removed from the queue. |
| The number of modifications replicated to other node, and the number of modifications removed from the queue. | |
| The time it takes to add and remove a modification from the local queue | The If it's much more than that, you're having I/O problems with the storage where you have the local home directory. |
| ||
timeToSendMillis | The time it takes to send the modification from the local queue to the other node over RMI | |
| The number of modifications that couldn't be added to the queue, or replicated to other nodes. | The values should be close to 0. It means that all modifications are successfully added to the queue and replicated to other nodes. |
Parameters
The table below shows all parameters that you can find in the statistics.
Changes to localq and new stats
There were couple of changes done to localqs I would like to them describe here.
[VIA-COPY] vs [VIA-INVALIDATION]
JIRA 8.12
Since JIRA 8.12 the channel for replicating messages from caches replicated via copy slightly differs from messages from caches replicated via invalidation.
The main difference is that messages from caches replicated via invalidation have guaranteed delivery, and messages from caches replicated via copy have NO guarantee of delivery. Technically these are the main differences:
no sync to file after message added to queue [via-copy]
much smaller queue size: 10x10k [via-copy] vs 10x100k [via-invalidation]
no retries when delivery to the other node fails [via-copy]
messages only delivered to live nodes (with active heartbeat) [via-copy] vs delivered to all active (ACTIVE state so not OFFLINE) nodes [via-invalidation]
Because of this we have now a different set of LOCALQ stats for both channels with different prefixes:
[LOCALQ][VIA-COPY]
[LOCALQ][VIA-INVALIDATION]
In most of the cases [via-copy] is about replicating DBR, user and groups and [via-invalidation] about everything else.
Top caches
JIRA 8.13.4
Have you ever wondered which cache is the root cause of all those millions of replication messages? Sure you did. But until now it was really tricky to get this information. You had to turn DEBUG logging for every replication message and then aggregate. Because of the volume of the log messages it would generate on a large instance it may have not be even an option on a production instance.
Now in every LOCALQ stats you will see top 10 cache contributors. In total stats you will find the top contributors from the start of this instance and in snapshot stats you will see the top contributors from last 5min.
This one is available since JIRA 8.14 and was just back-ported to JIRA 8.13.4.
New stats example
...
"numberOfCaches": 29,
"addCounterTopN": 1421,
"addCounterOthers": 113,
"addCounterByCacheTopN": {
"UserPropertyManager.ApplicationUser.cache": 543,
"com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
"com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache": 182,
"com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
"com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
"DbBackedCachedPropertySetManager.jira.properties.cache": 66,
"com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
"com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
"com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
"com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
}
...
numberOfCaches
- number of replicating caches (actively replicating)
addCounterTopN
- number of messages replicated by the topN caches
addCounterOthers
- number of messages replicated by all other caches
addCounterByCacheTopN
- sorted list of topN (10) caches: cache name + number of messages
So here the most active cache is UserPropertyManager.ApplicationUser.cache
which send 543
messages from 1421
+ 113
total replications.
TopN configuration
If top 10 is not enough you can change it via system property:
-Dcom.atlassian.jira.cluster.distribution.localq.stats.queue.names.topN=20
Full log message example
[LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: node2 total stats:
{
"timestampMillis": 1610720566437,
"nodeId": "node2",
"queueSize": 0,
"startQueueSize": 0,
"startTimestampMillis": 1610707317150,
"startMillisAgo": 13249287,
"closeCounter": 0,
"addCounter": 1534,
"droppedOnAddCounter": 0,
"criticalAddCounter": 0,
"criticalPeekCounter": 0,
"criticalRemoveCounter": 0,
"peekCounter": 0,
"peekOrBlockCounter": 1594,
"removeCounter": 1534,
"backupQueueCounter": 0,
"closeErrorsCounter": 0,
"addErrorsCounter": 0,
"peekErrorsCounter": 0,
"peekOrBlockErrorsCounter": 0,
"removeErrorsCounter": 0,
"backupQueueErrorsCounter": 0,
"lastAddTimestampMillis": 1610720560592,
"lastAddMillisAgo": 5845,
"lastPeekTimestampMillis": 0,
"lastPeekMillisAgo": 0,
"lastPeekOrBlockTimestampMillis": 1610720560592,
"lastPeekOrBlockMillisAgo": 5845,
"lastRemoveTimestampMillis": 1610720560595,
"lastRemoveMillisAgo": 5842,
"lastBackupQueueTimestampMillis": 0,
"lastBackupQueueMillisAgo": 0,
"timeToAddMillis": {
"count": 1534,
"min": 1,
"max": 13,
"sum": 2124,
"avg": 1,
"distributionCounter": {
"10": 1531,
"20": 3,
"50": 0,
"100": 0
}
},
"timeToPeekMillis": {
"count": 0,
"min": 0,
"max": 0,
"sum": 0,
"avg": 0,
"distributionCounter": {}
},
"timeToPeekOrBlockMillis": {
"count": 1594,
"min": 0,
"max": 1549827,
"sum": 131262632,
"avg": 82347,
"distributionCounter": {}
},
"timeToRemoveMillis": {
"count": 1534,
"min": 0,
"max": 13,
"sum": 1376,
"avg": 0,
"distributionCounter": {
"10": 1533,
"20": 1,
"50": 0,
"100": 0
}
},
"timeToBackupQueueMillis": {
"count": 0,
"min": 0,
"max": 0,
"sum": 0,
"avg": 0,
"distributionCounter": {}
},
"staleCounter": 0,
"sendCounter": 1479,
"droppedOnSendCounter": 0,
"timeToSendMillis": {
"count": 1479,
"min": 0,
"max": 2114,
"sum": 8830,
"avg": 5,
"distributionCounter": {
"10": 1461,
"20": 6,
"50": 2,
"100": 2,
"200": 3,
"500": 0,
"1000": 1,
"5000": 4
}
},
"sendRuntimeExceptionCounter": 7,
"sendCheckedExceptionCounter": 0,
"sendNotBoundExceptionCounter": 108,
"numberOfCaches": 29,
"addCounterTopN": 1421,
"addCounterOthers": 113,
"addCounterByCacheTopN": {
"UserPropertyManager.ApplicationUser.cache": 543,
"com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
"com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache": 182,
"com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
"com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
"DbBackedCachedPropertySetManager.jira.properties.cache": 66,
"com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
"com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
"com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
"com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
}
}