Cluster Cache Replication Health Check fails because it is unable to complete within the timeout

Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.

Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible



Problem

Jira Data Center cluster replication relies on nodes being recorded in the database and on each node sending and receiving updates. The Cluster Cache Replication Health Check confirms that replication is working across the entire cluster. If an active node is not responding, the other nodes report warnings and the affected node reports a critical result. For more details, see: Cluster Cache Replication health check fails in Jira Data Center

In some cases, because the health check takes a long time to execute, it may fail with the following error: The health check was unable to complete within the timeout of 20000


The following errors appear in the atlassian-jira.log:

2018-02-12 05:47:56,248  WARN  HealthCheckWatchdog:thread-6  ServiceRunner          [support.healthcheck.concurrent.SupportHealthCheckTask]  Health check Cluster Cache Replication was unable to complete within the timeout of 20000.  
2018-02-12 05:47:56,249  ERROR  HealthCheck:thread-3  ServiceRunner          [plugins.healthcheck.service.ClusterHeartbeatService]  Failed to wait until cluster node appear in the cache  
java.lang.InterruptedException: sleep interrupted
	at java.lang.Thread.sleep(Native Method) [?:1.8.0_102]
	at com.atlassian.jira.plugins.healthcheck.service.SleepTimeoutFactory$SleepTimeout.sleep(SleepTimeoutFactory.java:32) [?:?]
	at com.atlassian.jira.plugins.healthcheck.service.ClusterHeartbeatService.getClusterNodesReplicationInfo(ClusterHeartbeatService.java:71) [?:?]
	at com.atlassian.jira.plugins.healthcheck.cluster.ClusterReplicationHealthCheck.doCheck(ClusterReplicationHealthCheck.java:41) [?:?]
	at com.atlassian.jira.plugins.healthcheck.cluster.AbstractClusterHealthCheck.check(AbstractClusterHealthCheck.java:52) [?:?]
	at com.atlassian.support.healthcheck.impl.PluginSuppliedSupportHealthCheck.check(PluginSuppliedSupportHealthCheck.java:51) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_102]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_102]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_102]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
2018-02-12 05:47:56,278  WARN  HealthCheck:thread-3  ServiceRunner          [plugins.healthcheck.cluster.ClusterReplicationHealthCheck]  Node jiranode-2 does not seem to replicate its cache  

Please note that this is a different problem from the one where the cluster is not properly configured and the node is not replicating due to a network condition. See: Cluster Cache Replication health check fails in Jira Data Center


Diagnosis

Environment

  • Jira Data Center
  • Large number of scheduled threads

Cause 1 (Jira 7.4.2 and earlier)

Jira 7.4.2 and earlier use the Jira Instance Health (JIH) plugin, which sets a 20s timeout for the check run, while the check itself has a 60s timeout. JIH was replaced in Jira 7.4.3 by the Atlassian Troubleshooting & Support Tools (ATST) plugin. The bug is fixed in ATST 1.6.1, which increases the timeout for the check; later versions of Jira also use a different way of sending the heartbeats that the check verifies.

Workaround

Add the below JVM argument as per Setting properties and options on startup:

-Datlassian.healthcheck.timeout-ms=60000
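After restarting, one hedged way to confirm that the flag was applied is to read the system property from inside the running JVM (for example via a script console). The sketch below only prints the property for whatever JVM it runs in; the property name is the one from the argument above, while the class name is made up for illustration.

// Minimal sketch: prints the effective health check timeout for the JVM it runs in.
// Run standalone it only shows whether the -D flag was passed to that particular JVM;
// it reflects Jira's value only when executed inside the Jira JVM.
public class TimeoutCheck {
    public static void main(String[] args) {
        System.out.println("atlassian.healthcheck.timeout-ms = "
                + System.getProperty("atlassian.healthcheck.timeout-ms"));
    }
}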

If this doesn't resolve it, you may be affected by one of the additional causes below.

Cause 2

The underlying cause is that some or all nodes do not schedule the cluster replication heartbeat because the scheduler (Caesium) is busy, so the health check cannot read the value within the timeout. The scheduler has a limited number of threads, and this can cause contention for the check: waiting for an available thread may take longer than 20s (or 60s after increasing the timeout).

The cluster replication health check uses its own cache to store heartbeats, keyed by node ID.

  • Each node periodically puts a heart-beat value into that cache; this is done by a thread scheduled by the scheduler. Example of the scheduled HealthCheckSchedulerImpl task triggered during start-up:

    2017-03-31 13:12:15,241 localhost-startStop-1 INFO      [c.a.j.p.h.scheduler.impl.HealthCheckSchedulerImpl] Scheduling job with : JobConfig[jobRunnerKey=com.atlassian.jira.plugins.healthcheck.scheduler.impl.HealthCheckSchedulerImpl,runMode=RUN_LOCALLY,schedule=Schedule[type=INTERVAL,intervalScheduleInfo=IntervalScheduleInfo[firstRunTime=Fri Mar 31 13:12:30 CEST 2017,intervalInMillis=10000]],parameters={}]
  • Then, during the execution of the health check, it verifies the status:
    • It checks all non-replicating nodes (reading the isReplicating status from the cache) and waits for a heart-beat from each live node.
  • Normally, when the heart-beat thread runs periodically, the data is already in the cache, so the check gets its answer immediately.
  • In the current case, it waits in a loop until it is interrupted after 20000 ms, as per the error above (see the simplified sketch after this list).
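Here is a simplified, self-contained sketch of that interaction, assuming a plain map as a stand-in for the replicated cache and a small thread pool as a stand-in for the scheduler. The class and node names are made up; only the 10-second heartbeat interval and the 20000 ms timeout come from the log lines above.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified stand-in for the heartbeat/health-check interaction; not Jira internals.
public class HeartbeatSketch {

    // Stand-in for the replicated cache: one heartbeat timestamp per node ID.
    static final Map<String, Long> heartbeatCache = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the scheduler; the real Caesium pool is hardcoded to 4 threads.
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);

        // The local node writes its heartbeat into the cache every 10 seconds.
        scheduler.scheduleAtFixedRate(
                () -> heartbeatCache.put("jiranode-1", System.currentTimeMillis()),
                0, 10, TimeUnit.SECONDS);

        // The health check polls the cache for another node's heartbeat until the timeout.
        long timeoutMs = 20_000;                                // the 20000 ms from the log
        long deadline = System.currentTimeMillis() + timeoutMs;
        boolean replicating = false;
        while (System.currentTimeMillis() < deadline) {
            Long lastBeat = heartbeatCache.get("jiranode-2");   // node under inspection
            if (lastBeat != null) {
                replicating = true;                             // heartbeat seen in time
                break;
            }
            Thread.sleep(500);                                  // wait and re-check
        }
        // jiranode-2 never wrote a heartbeat (in the real cluster: its scheduler was too
        // busy to run the job), so the loop exhausts the timeout and the check fails.
        System.out.println("jiranode-2 replicating: " + replicating);
        scheduler.shutdownNow();
    }
}

When a node's scheduler pool is saturated by other jobs (for example a high number of Mail Handlers, as mentioned in the workaround below), its heartbeat task is delayed in the same way, and the check on the other nodes reports it as not replicating.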

Workaround

Unfortunately, there is no reliable workaround for this problem:

  • You can try to run the health check during off-peak hours, which might decrease contention for the scheduler.
  • Reduce the number of scheduled jobs (e.g. if you have a high number of Mail Handlers) or spread them across the day.

Resolution

Unfortunately, there is no resolution. The best option would be to increase the number of scheduler threads, but this value is hardcoded and set to 4. See JRASERVER-65809.

Cause 3

There is a trailing white space in the node's ID and Jira doesn't handle this consistently: the value is read from the cluster.properties file with the trailing space and saved to the database's clusternode table. However, the health check mechanism appears to trim trailing white space, which results in the node not being found. We have raised JRASERVER-67243 to address this.
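To illustrate why the trailing space causes a miss, here is a minimal sketch: it loads a property the way java.util.Properties does (trailing white space in a value is preserved) and compares the raw value with a trimmed one. The trim-and-compare step is an assumption about where the mismatch occurs, and the file content is simulated in memory; nothing below is Jira code.

import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustrative sketch: a trailing space in cluster.properties survives loading,
// so an exact comparison against a trimmed node ID fails.
public class TrailingSpaceSketch {
    public static void main(String[] args) throws IOException {
        // Simulated cluster.properties content with a trailing space after the node ID.
        Properties props = new Properties();
        props.load(new StringReader("jira.node.id=jiranode-2 \n"));

        String rawNodeId = props.getProperty("jira.node.id");  // "jiranode-2 " (space kept)
        String trimmedNodeId = rawNodeId.trim();                // "jiranode-2"

        System.out.println("raw length:     " + rawNodeId.length());      // 11
        System.out.println("trimmed length: " + trimmedNodeId.length());  // 10
        // The exact match fails, so the node appears to be missing even though it is up.
        System.out.println("exact match:    " + rawNodeId.equals(trimmedNodeId)); // false
    }
}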

Workaround

  1. Make sure the jira.node.id property has no trailing space in its value in the cluster.properties file.
  2. Ensure the NODE_ID column of the clusternode database table has no trailing space after the value.

Cause 4

The Java argument -Djava.rmi.server.hostname= is set to the wrong server hostname. RMI is used for Ehcache, so a wrong setting there affects the replication configuration and workflow.

Resolution

  1. Check the Java argument -Djava.rmi.server.hostname= and either remove it (if not required) or ensure it's set to the proper value. One way to verify the effective value is shown in the sketch below.
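The following is a minimal sketch of such a check, assuming it is started with the same JVM options as Jira; the class name and the example hostname are illustrative, not part of Jira.

import java.net.InetAddress;

// Illustrative sketch: prints the configured -Djava.rmi.server.hostname value next to
// the hostname/address the local machine resolves for itself, so mismatches stand out.
public class RmiHostnameCheck {
    public static void main(String[] args) throws Exception {
        // Value of -Djava.rmi.server.hostname, or null if the flag is not set.
        String configured = System.getProperty("java.rmi.server.hostname");
        InetAddress local = InetAddress.getLocalHost();

        System.out.println("java.rmi.server.hostname = " + configured);
        System.out.println("local hostname           = " + local.getHostName());
        System.out.println("local address            = " + local.getHostAddress());
        // If the configured name does not resolve to this node, other nodes will try to
        // open Ehcache RMI connections to the wrong host and replication breaks.
        if (configured != null) {
            System.out.println("configured resolves to   = "
                    + InetAddress.getByName(configured).getHostAddress());
        }
    }
}

For example: java -Djava.rmi.server.hostname=jiranode-2.example.com RmiHostnameCheck (the hostname here is hypothetical).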

