Troubleshooting a Data Center cluster outage
Confluence Data Center cluster outages can be difficult to troubleshoot as the environments are complex and logging can be very verbose.
This page provides a starting point for investigating outages in your cluster.
Establish the originating node
The most common outage scenario is when something, such as a database connectivity issue, a network outage or a long garbage collection (GC) pause, prevents a node from communicating with the cluster for 30 seconds or more, and Hazelcast removes that node from the cluster. The affected node then continues to write to the database, causing a cluster panic.
To establish the originating node:
- Gather the atlassian-confluence.log files from each node as soon as possible after the outage. Time is critical as the logs will roll over and you may lose the relevant time period.
Record identifying information about each node to help you interpret the log messages (IP address, node ID and name of each node).
Make a chronological timeline of the events:
- Record the time that users or monitoring systems started reporting problems.
- View the logs for each node side by side (Hint: we find opening three tabs in node number order helps you always know which logs you are viewing).
- Search the logs for 'removing member' and 'panic'. This will give you a good idea of which nodes caused the issue and when.
Make a chronological timeline of events from errors to node removal to panics. You can essentially disregard all logging that happens post-panic because once a node panics it needs to be restarted to function effectively. There will be a lot of noise in the logs, but it won't be very useful. The time period we're most interested in will be the minute or so leading up to the first removal or panic event in the logs.
- 02:50:15 (approx) Node 3 stopped heartbeating to the cluster for 30s (we can estimate this from the time of node removal)
- 02:50:45 Node 3 was removed by Node 2
- 02:53:15 Node 4 panics
- 02:54:15 Node 1, Node 3 and Node 4 receive the panic event and stop processing
- Node 2 remains serving requests
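The search-and-sort steps above can be sketched in Python. This is a minimal sketch, not an Atlassian tool: the function names (`find_events`, `build_timeline`, `estimate_heartbeat_loss`), the timestamp layout assumed at the start of each log line, and the sample log lines are all assumptions made for illustration. Adjust the regular expression to match the layout your logs actually use.

```python
import re
from datetime import datetime, timedelta

# Assumption: each log entry starts with "YYYY-MM-DD HH:MM:SS". Your layout may differ.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
KEYWORDS = ("removing member", "panic")  # the two search terms from the steps above
HEARTBEAT_TIMEOUT = timedelta(seconds=30)  # the 30 second window described on this page


def find_events(node_name, lines):
    """Return (timestamp, node_name, keyword, line) for each timestamped line with a keyword."""
    events = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip stack traces and other continuation lines
        lowered = line.lower()
        for keyword in KEYWORDS:
            if keyword in lowered:
                when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
                events.append((when, node_name, keyword, line.strip()))
                break
    return events


def build_timeline(logs_by_node):
    """Merge the events from every node into one list sorted by time."""
    events = []
    for node_name, lines in logs_by_node.items():
        events.extend(find_events(node_name, lines))
    return sorted(events, key=lambda event: event[0])


def estimate_heartbeat_loss(timeline):
    """Estimate when heartbeats stopped: first removal time minus the timeout."""
    for when, _node, keyword, _line in timeline:
        if keyword == "removing member":
            return when - HEARTBEAT_TIMEOUT
    return None


# Made-up placeholder lines for illustration only, not real Confluence output.
sample_logs = {
    "node2": ["2024-01-15 02:50:45,000 placeholder text: Removing member node3"],
    "node4": [
        "2024-01-15 02:53:15,000 placeholder text: cluster PANIC",
        "    at placeholder.stack.Frame",
    ],
}
timeline = build_timeline(sample_logs)
for when, node, keyword, _line in timeline:
    print(when, node, keyword)
print("Heartbeats probably stopped around", estimate_heartbeat_loss(timeline))
```

To run this against real files, replace `sample_logs` with a dictionary that maps each node name to the lines read from that node's log. Only the first removal and the minute or so before it are likely to matter, so the later entries in the merged list can usually be ignored.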
When you've established when the first affected node was removed, or when the first cluster panic occurred, look back through the logs on that node for root causes.
Investigate common root causes
Once you know when the first affected node was removed you can start investigating root causes. From this point on, you're only looking at events on the affected node around the time of removal (in our example above, this is Node 3 at around 2:50). The subsequent removals and panics are usually flow-on effects of the original node removal event, and aren't likely to provide useful root cause information.
Check the GC logs for the node that was removed (Node 3 in our example). Were there any GC pauses longer than the Hazelcast heartbeat interval (30 seconds by default)? Nodes can't heartbeat during Garbage Collection, so they will be removed from the cluster by one of the other nodes.
If there was a cluster panic, but the node was not removed from the cluster first, check the GC logs for pauses around the time of the panic - pauses that are relatively short (less than 30 seconds) can sometimes still cause panics (due to a race condition) in Confluence 5.10.1 and earlier.
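One way to scan a GC log for long pauses is sketched below. It assumes JDK 8 style GC logging, where each collection ends with a `[Times: user=... sys=..., real=N.NN secs]` summary. Newer JVMs with unified logging write a different layout and would need a different pattern. The function name `long_pauses` and the sample lines are hypothetical.

```python
import re

# Assumption: JDK 8 style GC logging, where each collection reports
# its wall-clock duration as "real=N.NN secs".
REAL_TIME = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(lines, threshold_secs=30.0):
    """Return (line_number, pause_seconds) for every pause at or above the threshold."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for value in REAL_TIME.findall(line):
            seconds = float(value)
            if seconds >= threshold_secs:
                hits.append((number, seconds))
    return hits


# Made-up sample lines in the assumed format, for illustration only.
sample_gc_log = [
    "[Full GC (placeholder) [Times: user=1.00 sys=0.10, real=0.45 secs]",
    "[Full GC (placeholder) [Times: user=40.00 sys=0.50, real=35.20 secs]",
]
for line_number, seconds in long_pauses(sample_gc_log):
    print(f"line {line_number}: paused for {seconds} seconds")
```

The default threshold matches the 30 second heartbeat interval. When investigating a panic that was not preceded by a removal, pass a lower `threshold_secs` to surface the shorter pauses as well.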
Check any database monitoring tools you may have. How many connections to the database were there at the time of the outage? Heartbeats can fail to send if a node can get a connection from its connection pool but not from the database itself, which can lead to nodes being removed from the cluster.
You won't be able to diagnose this from the Confluence logs, so you'll need to look at any external monitoring tools you have for your database. If the outage happens again, check the current number of connections at the database level while the outage is in progress.
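As a rough sanity check, you can compare the total number of connections your nodes could request with the limit configured on the database. The sketch below is only arithmetic under stated assumptions: every node uses the same pool size, and the function name, parameters and numbers are hypothetical rather than Confluence defaults.

```python
def connection_headroom(node_count, pool_size_per_node, db_max_connections, other_clients=0):
    """Return the spare database connections left if every node fills its pool.

    A negative result means the pools could ask for more connections than the
    database allows, which is the situation where a node can get a connection
    from its pool but not from the database itself.
    """
    demand = node_count * pool_size_per_node + other_clients
    return db_max_connections - demand


# Hypothetical numbers for illustration only.
headroom = connection_headroom(node_count=4, pool_size_per_node=60, db_max_connections=200)
print("Spare connections:", headroom)
```

A negative or very small result does not prove the database caused the outage, but it is a reason to look closely at the connection count in your database monitoring for the time of the removal.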
Check your network monitoring tools. If a node drops off the network for a short time and cannot communicate with the cluster, it can be removed by the other nodes. Your load balancer logs may be useful here.
Still having trouble?
Contact Support for help troubleshooting these outages. Provide them with as much of the information above as possible, to help their investigation.