Jira Data Center Functionality Loss Due to a Cluster-Wide Lock
Platform Notice: Data Center - This article applies to Atlassian products on the Data Center platform.
Note that this knowledge base article was created for the Data Center version of the product. Data Center knowledge base articles for non-Data Center-specific features may also work for Server versions of the product; however, they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Problem
This issue affects only Jira Data Center. The application stops responding on every system administration page, and restarting a node leaves it stuck during application startup because another node is holding a cluster-wide lock.
Symptoms
The remaining nodes in the cluster may exhibit one or more of the following symptoms:
- Application slowness.
- The system administration page loads indefinitely. The following stack trace can be found in thread dumps captured on the affected node:
"http-nio-8080-exec-177" #520500 daemon prio=5 tid=0x00007f08d4153800 nid=0x78b9 waiting on condition [0x00007f085e640000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at com.atlassian.beehive.db.DatabaseClusterLock.sleep(DatabaseClusterLock.java:530) at com.atlassian.beehive.db.DatabaseClusterLock.uninterruptibleWait(DatabaseClusterLock.java:102) at com.atlassian.beehive.db.DatabaseClusterLock.lock(DatabaseClusterLock.java:82) at com.atlassian.jira.config.DefaultReindexMessageManager.getMessage(DefaultReindexMessageManager.java:159) at org.apache.jsp.decorators.admin_jsp._jspService(admin_jsp.java:485) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
- The node does not come back up after a restart. The following stack trace can be found in thread dumps captured on the restarted node:
"localhost-startStop-1" #29 daemon prio=5 os_prio=0 tid=0x00002b4ed0002000 nid=0xb288 waiting on condition [0x00002b4c2a2bc000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at com.atlassian.beehive.db.DatabaseClusterLock.sleep(DatabaseClusterLock.java:530) at com.atlassian.beehive.db.DatabaseClusterLock.uninterruptibleWait(DatabaseClusterLock.java:102) at com.atlassian.beehive.db.DatabaseClusterLock.lock(DatabaseClusterLock.java:82) at com.atlassian.jira.config.DefaultReindexMessageManager.pushMessage(DefaultReindexMessageManager.java:105) at com.atlassian.jira.event.listeners.reindex.ReindexMessageListener.pluginModuleEnabled(ReindexMessageListener.java:41) ... at com.atlassian.plugin.event.impl.DefaultPluginEventManager.broadcast(DefaultPluginEventManager.java:73) at com.atlassian.plugin.manager.DefaultPluginManager.broadcastIgnoreError(DefaultPluginManager.java:2101) at com.atlassian.plugin.manager.DefaultPluginManager.notifyModuleEnabled(DefaultPluginManager.java:2013) at com.atlassian.plugin.manager.DefaultPluginManager.enableConfiguredPluginModule(DefaultPluginManager.java:1758) at com.atlassian.plugin.manager.DefaultPluginManager.enableConfiguredPluginModules(DefaultPluginManager.java:1735) at com.atlassian.plugin.manager.DefaultPluginManager.enableDependentPlugins(DefaultPluginManager.java:1247) at com.atlassian.plugin.manager.DefaultPluginManager.addPlugins(DefaultPluginManager.java:1202) at com.atlassian.jira.plugin.JiraPluginManager.addPlugins(JiraPluginManager.java:150) at com.atlassian.plugin.manager.DefaultPluginManager.lateStartup(DefaultPluginManager.java:645)
Cause
One of the nodes in the cluster is holding a cluster-wide lock. That node is experiencing issues that make the application unresponsive, so it is unable to release the lock.
Resolution
Run the following SQL query to determine which node is holding the cluster-wide lock.
SELECT * FROM clusterlockstatus WHERE locked_by_node IS NOT NULL;
It should return a result similar to the following, which indicates that node3 is holding the cluster-wide lock:
"15600","com.atlassian.jira.config.DefaultReindexMessageManager.messageLock","node3","1514969668397"
- Restart the problematic node so that the cluster-wide lock can be released. If the node cannot be restarted or the lock persists, see the SQL sketch after this list.
- Once the cluster-wide lock is released and the application is responsive again, it is strongly recommended to restart the remaining nodes one by one to ensure the application is fully functional on every node.
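If restarting the node does not release the lock (for example, because the row is stale), the lock row can be cleared directly in the database. This is a last-resort sketch based on the table shown above, not verified official guidance for every version: shut down the node named in locked_by_node (ideally the whole cluster) and back up the table before running it.
-- Last resort: clear a stale lock row while the holding node is fully shut down.
-- Back up the clusterlockstatus table first.
DELETE FROM clusterlockstatus
WHERE locked_by_node = 'node3';  -- replace 'node3' with the node returned by the earlier query
Lock rows are typically re-registered on demand once nodes come back up, but treat this as a manual intervention of last resort rather than a routine step.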
Relevant Improvements and Bugs
- JRASERVER-65890 - Create logging event for clusterlockstatus during startup
- JRASERVER-66596 - JIRA Datacenter - Add Cluster lock status page which doesn't use locks
- JRASERVER-66597 - JIRA DC might lose Cluster lock due to database connectivity problems
- JRASERVER-66658 - JIRA does not retry cluster unlock actions after database connectivity problems
- JRASERVER-74298 - Jira node fails to start due to cluster lock at the active objects