Cluster Cache Replication health check fails because it is unable to complete within the timeout
Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.
Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Problem
Jira Data Center cluster replication relies on nodes being recorded in the database and on sending and receiving updates. The Cluster Cache Replication health check confirms that replication is working across the entire cluster. If an active node is not responding, the other nodes will report warnings and the node with the error will report a critical result. For more details, see Cluster Cache Replication health check fails in Jira Data Center.
In some cases, because the health check takes a long time to execute, it might fail with the following error: The health check was unable to complete within the timeout of 20000
The following errors appear in the atlassian-jira.log:
2018-02-12 05:47:56,248 WARN HealthCheckWatchdog:thread-6 ServiceRunner [support.healthcheck.concurrent.SupportHealthCheckTask] Health check Cluster Cache Replication was unable to complete within the timeout of 20000.
2018-02-12 05:47:56,249 ERROR HealthCheck:thread-3 ServiceRunner [plugins.healthcheck.service.ClusterHeartbeatService] Failed to wait until cluster node appear in the cache
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method) [?:1.8.0_102]
at com.atlassian.jira.plugins.healthcheck.service.SleepTimeoutFactory$SleepTimeout.sleep(SleepTimeoutFactory.java:32) [?:?]
at com.atlassian.jira.plugins.healthcheck.service.ClusterHeartbeatService.getClusterNodesReplicationInfo(ClusterHeartbeatService.java:71) [?:?]
at com.atlassian.jira.plugins.healthcheck.cluster.ClusterReplicationHealthCheck.doCheck(ClusterReplicationHealthCheck.java:41) [?:?]
at com.atlassian.jira.plugins.healthcheck.cluster.AbstractClusterHealthCheck.check(AbstractClusterHealthCheck.java:52) [?:?]
at com.atlassian.support.healthcheck.impl.PluginSuppliedSupportHealthCheck.check(PluginSuppliedSupportHealthCheck.java:51) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_102]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_102]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_102]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
2018-02-12 05:47:56,278 WARN HealthCheck:thread-3 ServiceRunner [plugins.healthcheck.cluster.ClusterReplicationHealthCheck] Node jiranode-2 does not seem to replicate its cache
Please note that this is a different problem from the case where the cluster is not properly configured and the node is not replicating due to a network condition. See: Cluster Cache Replication health check fails in Jira Data Center
Diagnosis
Environment
- Jira Data Center
- Large number of scheduled jobs competing for scheduler threads
Cause 1 (Jira 7.4.2 and earlier)
In Jira 7.4.2 and earlier, the Jira Instance Health (JIH) plugin runs the check with a 20-second timeout, while the check itself has a 60-second timeout. We replaced JIH in Jira 7.4.3 with the Atlassian Troubleshooting & Support Tools plugin (ATST). The bug is fixed in ATST 1.6.1, which increases the timeout for the check; later versions of Jira also use a different way of sending the heartbeats that the check verifies.
Workaround
Add the JVM argument below, as described in Setting properties and options on startup:
-Datlassian.healthcheck.timeout-ms=60000
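For example, on a Linux installation this flag is typically appended to JVM_SUPPORT_RECOMMENDED_ARGS in <jira-install>/bin/setenv.sh. Treat the snippet below as an illustrative sketch (the file location and variable depend on your installation and operating system) and follow the linked documentation for your setup:
# <jira-install>/bin/setenv.sh
JVM_SUPPORT_RECOMMENDED_ARGS="${JVM_SUPPORT_RECOMMENDED_ARGS} -Datlassian.healthcheck.timeout-ms=60000"
Restart Jira (each node, one at a time) for the change to take effect.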
If this doesn't resolve the problem, you may be affected by one of the additional causes below.
Cause 2
The underlying cause is that some or all nodes fail to schedule the cluster replication heartbeat because the scheduler (Caesium) is busy, so the health check cannot read the value within the timeout. The scheduler has a limited number of threads, and this can cause contention for the check: it may have to wait for an available thread, and that can take longer than 20s (or 60s after increasing the timeout).
The cluster replication health check uses its own cache to store node IDs.
Each node periodically puts a heartbeat value into that cache; this is done by a thread scheduled by the Scheduler. Example of the scheduled task HealthCheckSchedulerImpl triggered during start-up:
2017-03-31 13:12:15,241 localhost-startStop-1 INFO [c.a.j.p.h.scheduler.impl.HealthCheckSchedulerImpl] Scheduling job with : JobConfig[jobRunnerKey=com.atlassian.jira.plugins.healthcheck.scheduler.impl.HealthCheckSchedulerImpl,runMode=RUN_LOCALLY,schedule=Schedule[type=INTERVAL,intervalScheduleInfo=IntervalScheduleInfo[firstRunTime=Fri Mar 31 13:12:30 CEST 2017,intervalInMillis=10000]],parameters={}]
- Then, during execution, the health check verifies the status:
- It checks all non-replicating nodes (reading the isReplicating status from the cache) and waits for a heartbeat from each live node.
- Normally, because the heartbeat thread runs periodically, the data is already in the cache, so the check returns immediately.
- In this case, it waits in a loop until it is interrupted after 20000 ms, as in the error above (illustrated in the sketch after this list).
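A minimal sketch of this polling pattern, assuming a simple nodeId-to-timestamp cache; the class, method, and field names below are illustrative and are not the actual plugin code:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatPollSketch {

    // Replicated cache of nodeId -> timestamp (ms) of the node's last heartbeat.
    private final Map<String, Long> heartbeatCache = new ConcurrentHashMap<>();

    // Waits until the given node has written a heartbeat newer than maxAgeMs.
    // When the heartbeat job runs on schedule, a fresh value is already in the
    // cache and the loop returns on the first iteration. When the scheduler is
    // too busy to run the heartbeat job, the loop keeps sleeping until the
    // health check watchdog interrupts it, which produces the
    // "sleep interrupted" stack trace shown above.
    boolean waitForHeartbeat(String nodeId, long maxAgeMs) throws InterruptedException {
        while (true) {
            Long lastBeat = heartbeatCache.get(nodeId);
            if (lastBeat != null && System.currentTimeMillis() - lastBeat <= maxAgeMs) {
                return true;
            }
            Thread.sleep(1000);
        }
    }
}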
Workaround
Unfortunately, there is no reliable workaround for this problem:
- You can try to run the health check during off-peak hours, which might reduce contention for the scheduler.
- Reduce the number of scheduled jobs (for example, if you have a high number of Mail Handlers) or spread them across the day.
Resolution
Unfortunately, there is no resolution. The best option would be to increase the number of scheduler threads, but this value is hardcoded and set to 4. See JRASERVER-65809.
Cause 3
There is a trailing white space in the node's ID, and Jira doesn't handle this consistently: the value is read from the cluster.properties file with the trailing space and saved to the database's clusternode table. However, the health check mechanism trims trailing white spaces, which results in the node not being found. We have raised JRASERVER-67243 to address this.
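To confirm whether a node ID carries trailing whitespace, a small check along the lines of the sketch below can help. This is illustrative only; the cluster.properties path is an assumption and should be adjusted to your Jira home directory (the clusternode table still needs to be checked separately, as described in the workaround):
import java.io.FileReader;
import java.util.Properties;

public class NodeIdWhitespaceCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Path is an assumption; point it at the cluster.properties in your Jira home directory.
        try (FileReader reader = new FileReader("/var/atlassian/application-data/jira/cluster.properties")) {
            props.load(reader);
        }
        String nodeId = props.getProperty("jira.node.id");
        if (nodeId == null) {
            System.out.println("jira.node.id is not set");
        } else if (!nodeId.equals(nodeId.trim())) {
            // Properties.load() keeps trailing whitespace in the value, so this detects it.
            System.out.println("jira.node.id has trailing whitespace: '" + nodeId + "'");
        } else {
            System.out.println("jira.node.id looks clean: '" + nodeId + "'");
        }
    }
}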
Workaround
- Make sure the jira.node.id property has no trailing space in its value in the cluster.properties file;
- Ensure the NODE_ID column of the clusternode database table has no trailing space after the value.
Cause 4
The Java argument -Djava.rmi.server.hostname= is set to the wrong server hostname. RMI is used by Ehcache, so a wrong setting there affects the cache configuration and the replication workflow.
Resolution
- Check the Java argument -Djava.rmi.server.hostname= and either remove it (if not required) or ensure it is set to the proper value.