Jira Data Center Functionality Loss Due to a Cluster-Wide Lock

Platform Notice: Server and Data Center Only - This article only applies to Atlassian products on the server and data center platforms.

Problem

This issue only affects Jira Data Center: the application does not respond on any system administration page, and restarting a node leaves it stuck during application startup because another node is holding a cluster-wide lock.

Symptoms

The remaining nodes within the cluster may encounter one or more of the symptoms listed below:

  1. Application slowness.
  2. The system administration page loads forever. The following stack trace can be found in the thread dumps captured on the affected node.

    "http-nio-8080-exec-177" #520500 daemon prio=5 tid=0x00007f08d4153800 nid=0x78b9 waiting on condition [0x00007f085e640000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
    	at java.lang.Thread.sleep(Native Method)
    	at com.atlassian.beehive.db.DatabaseClusterLock.sleep(DatabaseClusterLock.java:530)
    	at com.atlassian.beehive.db.DatabaseClusterLock.uninterruptibleWait(DatabaseClusterLock.java:102)
    	at com.atlassian.beehive.db.DatabaseClusterLock.lock(DatabaseClusterLock.java:82)
    	at com.atlassian.jira.config.DefaultReindexMessageManager.getMessage(DefaultReindexMessageManager.java:159)
    	at org.apache.jsp.decorators.admin_jsp._jspService(admin_jsp.java:485)
    	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    	at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
  3. The node does not come up after a restart. The following stack trace can be found in the thread dumps captured on the restarted node.

    "localhost-startStop-1" #29 daemon prio=5 os_prio=0 tid=0x00002b4ed0002000 nid=0xb288 waiting on condition [0x00002b4c2a2bc000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
    	at java.lang.Thread.sleep(Native Method)
    	at com.atlassian.beehive.db.DatabaseClusterLock.sleep(DatabaseClusterLock.java:530)
    	at com.atlassian.beehive.db.DatabaseClusterLock.uninterruptibleWait(DatabaseClusterLock.java:102)
    	at com.atlassian.beehive.db.DatabaseClusterLock.lock(DatabaseClusterLock.java:82)
    	at com.atlassian.jira.config.DefaultReindexMessageManager.pushMessage(DefaultReindexMessageManager.java:105)
    	at com.atlassian.jira.event.listeners.reindex.ReindexMessageListener.pluginModuleEnabled(ReindexMessageListener.java:41)
    
    ...
    	at com.atlassian.plugin.event.impl.DefaultPluginEventManager.broadcast(DefaultPluginEventManager.java:73)
    	at com.atlassian.plugin.manager.DefaultPluginManager.broadcastIgnoreError(DefaultPluginManager.java:2101)
    	at com.atlassian.plugin.manager.DefaultPluginManager.notifyModuleEnabled(DefaultPluginManager.java:2013)
    	at com.atlassian.plugin.manager.DefaultPluginManager.enableConfiguredPluginModule(DefaultPluginManager.java:1758)
    	at com.atlassian.plugin.manager.DefaultPluginManager.enableConfiguredPluginModules(DefaultPluginManager.java:1735)
    	at com.atlassian.plugin.manager.DefaultPluginManager.enableDependentPlugins(DefaultPluginManager.java:1247)
    	at com.atlassian.plugin.manager.DefaultPluginManager.addPlugins(DefaultPluginManager.java:1202)
    	at com.atlassian.jira.plugin.JiraPluginManager.addPlugins(JiraPluginManager.java:150)
    	at com.atlassian.plugin.manager.DefaultPluginManager.lateStartup(DefaultPluginManager.java:645)

Cause

One of the nodes within the cluster is holding a cluster-wide lock. That node is experiencing an issue that makes the application unresponsive, and as a result it is unable to release the cluster-wide lock.
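
To see which cluster-wide locks are currently assigned to a node, the clusterlockstatus table can be queried directly. The sketch below is a minimal example; it reuses the column names already shown in the Resolution section (lock_name, locked_by_node) and lists every held lock, not just the reindex message lock.

    -- Minimal sketch: list every cluster-wide lock that is still held by a node.
    -- Reuses the columns referenced in the Resolution section below.
    SELECT *
    FROM clusterlockstatus
    WHERE locked_by_node IS NOT NULL;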

Resolution

  1. Run the following SQL query to determine which node is holding the cluster-wide lock.

    SELECT * FROM clusterlockstatus WHERE lock_name LIKE '%messageLock' AND locked_by_node IS NOT NULL;
  2. It should return a result similar to the following. In this example, node3 is holding the cluster-wide lock; the last column is a timestamp (see the sketch after this list).

    "15600","com.atlassian.jira.config.DefaultReindexMessageManager.messageLock","node3","1514969668397"
  3. Restart the problematic node so that the cluster-wide lock can be released.
  4. Once the cluster-wide lock is released and the application is responsive again, it is strongly recommended to restart the remaining nodes one by one to ensure the whole cluster is back in a consistent state.
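
The last column returned in step 2 looks like an epoch timestamp in milliseconds, which can help judge how long the lock has been held. The sketch below is a minimal example that assumes a PostgreSQL database and that this column is named update_time; adjust the conversion function and column name for other databases.

    -- Minimal sketch (assumes PostgreSQL and an update_time column storing epoch milliseconds):
    -- show which node holds the reindex message lock and roughly since when.
    SELECT locked_by_node,
           lock_name,
           to_timestamp(update_time / 1000) AS held_since
    FROM clusterlockstatus
    WHERE lock_name LIKE '%messageLock'
      AND locked_by_node IS NOT NULL;

If the held_since value is far in the past and the owning node is unresponsive, proceed with step 3 and restart that node.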
