Jira Data Center hangs after restoring the database because of cluster locks

Problem

If Jira Data Center hangs when restoring the database, it might be because of how cluster locks work. See Diagnosis below to check whether you’re affected by this problem.

Summary

In Jira 8.3.2, we've changed the way database cluster locks work in Jira Data Center as part of fixing JRASERVER-66597. The benefit of these changes is that locks are synced with atomic database operations, which practically removes any race conditions that could occur when two nodes try to acquire the same lock in the database at the same time. However, in the specific case of restoring the database, an attempt to acquire a lock can leave Jira Data Center hanging.

Details

Cluster locks rely on the assumption that once a particular lock object has been initialized, a record for it will always be present in the database. In most cases, that's a justified assumption because all locks should have such a record.

There's one case where this isn't true: when you're restoring the database. The new database won't have records for locks that weren't included in the backup. Because of these missing records, a node interprets the related locks as being held by another node (a node looks for a record with specific details; if that record is missing, it assumes another node has acquired the lock). As a result, all nodes report being locked out and can't acquire the cluster lock.

When we created this mechanism, this seemed like an acceptable trade-off: while the database is being restored, cluster locks shouldn't be needed, because only one of the nodes is active and the whole database is being overwritten anyway.

The problem is that right after the database backup is restored, the Pico container restarts the Jira components. This happens automatically in the background, but it may trigger shutdown operations that use cluster locks. When a process tries to acquire a lock that the node used before but that has no record in the new database, the node assumes the lock is held by another node. If the acquisition is performed with a retry-until-success method (such as lock()), the process gets stuck permanently.
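
Conceptually, the retry-until-success behaviour looks something like the sketch below. This is a simplified illustration rather than the actual Beehive implementation; the class and the boolean field are made up so the example is self-contained, and only show why the loop can never end once the lock record is gone.

```java
// Simplified sketch of a retry-until-success lock() - NOT the actual Beehive source.
public class RetryUntilSuccessLockSketch {

    // Stands in for "the database contains a lock record that this node can claim".
    // After a database restore, the record for a previously used lock is simply missing.
    private volatile boolean lockRecordClaimable = false;

    public void lock() {
        while (!lockRecordClaimable) {
            // A missing record is indistinguishable from a lock held by another node,
            // so the thread sleeps and retries - this is the repeating TIMED_WAITING
            // state visible in the thread dumps mentioned in the Diagnosis section.
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // a real implementation wouldn't give up here; this keeps the sketch simple
            }
        }
    }
}
```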

Diagnosis

If your Jira Data Center hangs when restoring the database, there are two ways to check if it’s related to this problem.

  • Set the logging level for com.atlassian.beehive to DEBUG, and look for repeating messages like: Acquisition of cluster lock '<lock name>' by <thread name> failed. Lock is owned by another node.

  • Trigger a few thread dumps and look for threads repeatedly in the TIMED_WAITING state, sleeping in the DatabaseClusterLock.lock() method.

Mitigation

This problem is an edge case, but it will exist until we change the mechanism behind cluster locks. In the meantime, we have some suggestions for app developers and admins who encounter it.

For app developers:

  • Avoid using retry-until-success locking methods (such as com.atlassian.beehive.db.DatabaseClusterLock#lock()) in shutdown operations, as these might be triggered immediately after database imports. One type of operation we know about is performed in response to PluginFrameworkShutdownEvent, but there might be more.

  • For these operations, acquire locks with methods that don't block permanently, like com.atlassian.beehive.db.DatabaseClusterLock#tryLock() or com.atlassian.beehive.db.DatabaseClusterLock#tryLock(long, java.util.concurrent.TimeUnit), as shown in the sketch below.
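
Here's a minimal sketch of a shutdown-time operation that acquires a cluster lock with a timeout instead of blocking forever. It assumes the ClusterLockService / ClusterLock interfaces from the Beehive API (which follow java.util.concurrent.locks.Lock); the class name, the lock name, and the 30-second timeout are made up for the example.

```java
import java.util.concurrent.TimeUnit;

import com.atlassian.beehive.ClusterLock;
import com.atlassian.beehive.ClusterLockService;

public class ShutdownCleanupExample {

    private final ClusterLockService clusterLockService;

    public ShutdownCleanupExample(ClusterLockService clusterLockService) {
        this.clusterLockService = clusterLockService;
    }

    // Called from a shutdown-time handler, for example in response to PluginFrameworkShutdownEvent.
    public void cleanUpOnShutdown() {
        ClusterLock lock = clusterLockService.getLockForName("my-plugin.shutdown-cleanup");
        try {
            // tryLock(timeout) gives up after 30 seconds instead of retrying forever,
            // so a lock record that's missing after a database restore can't hang the node.
            if (lock.tryLock(30, TimeUnit.SECONDS)) {
                try {
                    // ... do the cleanup that needs cluster-wide exclusivity ...
                } finally {
                    lock.unlock();
                }
            } else {
                // Couldn't get the lock in time: log it and skip the cleanup
                // rather than block the node's shutdown.
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Whether you skip the work, retry later, or fail is up to your app; the important part is that the lock acquisition can time out.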

For admins:

  • The operation gets stuck after the database is restored, but before the new database content is reindexed. To mitigate the problem, restart the frozen node and then trigger the reindexing process.

Last modified on Dec 17, 2019
