The Instance Health Checks report that at least one node in the Jira Data Center cluster is not replicating, even though it is replicating successfully

Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.

Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Summary

The Instance Health Checks report that at least one node in the Jira Data Center cluster is not replicating, even though cache replication from that node is working.

Environment

Any Jira 8.x version
Data Center only

Diagnosis

  • The Jira application logs of a healthy node contain errors stating that one or more Jira nodes are not replicating:

    grep -h 'is not replicating' atlassian-jira.log* | sort
    2021-09-07 19:34:52,751+0000 Caesium-1-1 ERROR ServiceRunner     [c.a.t.healthcheck.concurrent.SupportHealthCheckProcess] Health check 'Cluster Cache Replication' failed with severity 'critical': 'The node problematic-node-ID is not replicating'
    2021-09-07 20:34:52,706+0000 Caesium-1-3 ERROR ServiceRunner     [c.a.t.healthcheck.concurrent.SupportHealthCheckProcess] Health check 'Cluster Cache Replication' failed with severity 'critical': 'The node problematic-node-ID is not replicating'
    2021-09-07 21:34:52,768+0000 Caesium-1-1 ERROR ServiceRunner     [c.a.t.healthcheck.concurrent.SupportHealthCheckProcess] Health check 'Cluster Cache Replication' failed with severity 'critical': 'The node problematic-node-ID is not replicating'
  • However, the same logs show that cache replication for the "problematic node" (the one the health checks report) completes successfully:

    2021-09-08 08:21:32,287+0000 localq-stats-0 INFO      [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] [scheduled] Running cache replication queue stats for: 20 queues...
    2021-09-08 08:00:27,926+0000 localq-stats-0 INFO      [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: problematic-node-ID snapshot stats:
    ...
    2021-09-08 08:00:27,927+0000 localq-stats-0 INFO      [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] [VIA-COPY] Cache replication replicatePutsViaCopy-queue stats per node: problematic-node-ID snapshot stats:
    ...
    2021-09-08 08:21:32,289+0000 localq-stats-0 INFO      [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] [scheduled] ... done running cache replication queue stats for: 20 queues.
  • An issue created while logged in directly to the "problematic node" can be found and opened when logged in directly to any other "healthy" node, which is another indication that replication is working properly.
  • Running a telnet command between all the Jira nodes of the cluster, using their hostname/IP address and the ehcache ports (configured in the file <JIRA_HOME>/cluster.properties on each node), confirms that all the nodes can communicate with each other.
  • On the Clustering page in ⚙ > System, the application status of the "problematic node" might be empty.
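The connectivity check between nodes can be scripted rather than run by hand with telnet. The following is a minimal sketch, assuming bash and the coreutils `timeout` command are available on the node. The function names `check_port` and `check_nodes`, and the hostnames and port in the usage comment, are placeholders introduced for illustration; use the hostnames/IP addresses and ehcache ports from each node's cluster.properties.

```shell
#!/usr/bin/env bash
# Sketch only: check_port and check_nodes are illustrative helper names,
# not part of Jira or any Atlassian tooling.

# check_port HOST PORT
# Succeeds if a TCP connection to HOST:PORT opens within 3 seconds.
# Uses bash's /dev/tcp redirection in place of an interactive telnet session.
check_port() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# check_nodes HOST:PORT [HOST:PORT ...]
# Prints OK or FAIL for each entry; returns non-zero if any entry failed.
check_nodes() {
  local failed=0 entry host port
  for entry in "$@"; do
    host="${entry%:*}"    # text before the last colon
    port="${entry##*:}"   # text after the last colon
    if check_port "$host" "$port"; then
      echo "OK   ${host}:${port}"
    else
      echo "FAIL ${host}:${port}"
      failed=1
    fi
  done
  return "$failed"
}

# Example (placeholder hostnames and port), run from each node in turn:
#   check_nodes jira-node1.example.com:40001 jira-node2.example.com:40001
```

Running it from every node against every other node covers both directions of each connection. An OK result only shows that the port accepts TCP connections; it does not prove that cache replication itself is working.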

Cause

We have seen situations where the health check is reporting false positives about the cluster cache replication for some nodes. Unfortunately, the exact root cause of this issue is currently unknown.

Solution

Schedule a maintenance window and restart the "problematic" Jira node. After the restart, the health checks should stop reporting this node.
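To confirm that the errors stopped after the restart, the grep from the Diagnosis section can be limited to entries logged after the restart time. This is a minimal sketch, assuming the log lines begin with a timestamp in the format shown in the Diagnosis section. `failures_since` is a helper name introduced for illustration, and the cutoff must be written in the same time zone as the log timestamps.

```shell
#!/usr/bin/env bash
# Sketch only: failures_since is an illustrative helper, not an Atlassian tool.

# failures_since "YYYY-MM-DD HH:MM:SS" FILE [FILE ...]
# Prints 'is not replicating' log entries whose timestamp is at or after
# the given cutoff. Timestamps in this format sort correctly as plain
# strings, so a string comparison is sufficient.
failures_since() {
  local since="$1"
  shift
  awk -v since="$since" '
    /is not replicating/ {
      # $1 is the date; the first 8 characters of $2 are HH:MM:SS
      ts = $1 " " substr($2, 1, 8)
      if (ts >= since) print
    }' "$@"
}

# Example (placeholder cutoff), run on a healthy node:
#   failures_since "2021-09-08 10:00:00" atlassian-jira.log*
```

Empty output suggests that the health check has not failed since the cutoff. In the sample log in the Diagnosis section the failures appear about once an hour, so it is worth waiting at least that long after the restart before checking.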

Providing data to Atlassian Support

If restarting the node does not resolve the issue, contact Atlassian Support. To help the support team investigate faster, attach a support zip from each node to the support ticket.


Last modified on Sep 24, 2021
