Confluence will not start due to fatal error in Confluence cluster

Problem

An error message appears when accessing Confluence:

Fatal error in Confluence cluster: Database is being updated by an instance which is not part of the current cluster. 
You should check network connections between cluster nodes, especially multicast traffic.

Background

Confluence has a CLUSTERSAFETY table in its database. This table exists even in non-clustered environments. Every 30 seconds, Confluence reads this table and compares the stored value with the one it holds in memory. If the two values differ, this error appears and Confluence cannot proceed. This is the cluster safety mechanism.

How the cluster safety mechanism works

The cluster safety mechanism is designed to ensure that your wiki cannot become inconsistent because updates made by one user are not visible to another. A failure of this mechanism is a fatal error in Confluence and is called a cluster panic. Because the cluster safety mechanism helps prevent data inconsistency whenever any two copies of Confluence run against the same database, it is enabled in all instances of Confluence, not just clusters.

A scheduled task, ClusterSafetyJob, runs every 30 seconds in Confluence. In a cluster, this job is run only on one of the nodes. The scheduled task operates on a safety number – a randomly generated number that is stored both in the database and in the distributed cache used across a cluster. It does the following:

  1. Generate a new random number.
  2. Compare the existing safety numbers in the database and the cache, if both are already set.
  3. If the numbers differ, publish a ClusterPanicEvent. Currently in Confluence, this causes the following to happen on each node in the cluster:
    • Disable all access to the application.
    • Disable all scheduled tasks.
    • In Confluence 5.5 and earlier, update the database safety number to a new value, which causes all nodes accessing the database to fail. From Confluence 5.6 onwards, the database safety number is not updated, so that the other Confluence node(s) can continue processing requests.
  4. If the numbers are the same or aren't set yet, update the safety numbers:
    • Set the safety number in the database to the new random number.
    • Set the safety number in the cache to the new random number.
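
For illustration only, the database side of this check can be pictured with two SQL statements. This is a minimal sketch, assuming the CLUSTERSAFETY table stores the safety number in a column named SAFETYNUMBER (the column name and the literal value below are assumptions for the example; the cached copy and the comparison itself live inside Confluence, not in SQL):

-- Step 2: read the safety number currently stored in the database
SELECT SAFETYNUMBER FROM CLUSTERSAFETY;

-- Step 4: if the stored value matched the cached value (or neither was set),
-- both copies are replaced with a freshly generated random number, e.g.
UPDATE CLUSTERSAFETY SET SAFETYNUMBER = 123456789;  -- placeholder value

Do not run the UPDATE against a live system; it is shown only to make the mechanism concrete.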

Causes

Although it appears to be a cluster-related problem, this error occurs in non-clustered environments as well. Several distinct issues can cause the same error message:

  1. Multiple instances of Confluence are deployed against the same database. This often happens when a production environment is cloned or a staging environment is started without changing hibernate.connection.url, which still points to the original database.
  2. Confluence is using a duplicate Server ID that already exists in another environment. This often happens when a production environment is cloned or a staging environment is started without generating a new Server ID.
  3. A performance issue (usually invasive garbage collection) has prevented the cluster safety job from running. In Confluence 3.0.1 and 3.0.2, this is exacerbated and happens with a much higher frequency. See Cluster panics (Non Clustered Confluence 2.10.4, 3.0.1 and 3.0.2).
  4. Communication between the nodes in a cluster has been severed.
  5. Confluence is using a read-only database.
  6. In a single-node (non-clustered) deployment, there are two records in the CLUSTERSAFETY table.
  7. Confluence is connected to a MySQL database that is configured as a master server in a replication setup.

Diagnosis and Resolutions

  1. If the problem occurs shortly after startup in a single-node (non-clustered) deployment, see Cluster Panic due to Multiple Deployments.
  2. If the problem occurs shortly after startup in a single-node (non-clustered) deployment and it wasn't caused by multiple deployments, check whether the CLUSTERSAFETY table in the database has more than one record. If so, delete the extra record (see the example queries after this list).
  3. If you are using 3.0.1 or 3.0.2, you are likely experiencing a bug. See Cluster panics (Non Clustered Confluence 2.10.4, 3.0.1 and 3.0.2).
  4. If the problem happens spontaneously during production usage in a single-node (non-clustered) deployment, see Cluster Panic due to Performance Problems.
  5. If the problem happens in a multi-node cluster (Confluence 5.4 and earlier), see Cluster Panic due to Multicast Traffic Communication Problem.
  6. If the problem happens in Confluence Data Center 5.6 or later (clustered), see Recovering from a Data Center cluster split-brain.
  7. If Confluence is connected to a MySQL master database, follow the resolution method in this guide (the example queries after this list can help confirm the replication role).
  8. If Confluence is using a duplicate Server ID as a result of a cloned environment, see How to change the server ID of Confluence.
  9. If none of the documents above identifies the problem, check Data Center Troubleshooting.
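
The following queries, run directly against the Confluence database, can help with steps 2 and 7 above. They assume you have direct database access; adapt them to your database client as needed:

-- Step 2: a healthy deployment has exactly one row in CLUSTERSAFETY.
-- Two or more rows on a single node indicate the duplicate-record cause;
-- stop Confluence before removing the extra row.
SELECT * FROM CLUSTERSAFETY;

-- Step 7 (MySQL only): a non-empty result means binary logging is enabled,
-- which is typically the case when the server acts as a replication master.
SHOW MASTER STATUS;

Both statements only read data, so they are safe to run while investigating.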

