Cluster communication problems: Member has left cluster, or Member has been forcefully evicted from cluster, or A potential communication problem has been detected


This article applies to clustered Confluence 5.4 and earlier.

Symptoms

The Confluence cluster does not work as expected. For example, you cannot start more than one node without the new member being evicted from the cluster.

Messages like the following appear in atlassian-confluence.log:

2014-05-11 18:24:05,957 WARN [Logger@9233091 3.3.1/389] [Coherence] log 2014-05-11 18:24:05.957 Oracle Coherence GE 3.3.1/389 <Warning> (thread=PacketPublisher, member=2): A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after 45 seconds, although other packets were acknowledged by the same cluster member (Member(Id=1, Timestamp=2014-05-11 18:17:48.519, Address=xxx.xxx.xxx.x:8090, MachineId=12345, Location=process:1234@CONFLUENCE01)) to this member (Member(Id=2, Timestamp=2014-05-11 18:23:16.19, Address=xxx.xxx.xxx.x:8090, MachineId=67891, Location=process:1234@CONFLUENCE02)) as recently as 0 seconds ago. It is possible that the packet size greater than 1468 is responsible; for example, some network equipment cannot handle packets larger than 1472 bytes (IPv4) or 1468 bytes (IPv6). Use the 'ping' command with the <size> option to verify successful delivery of specifically sized packets. Other possible causes include network failure, poor thread scheduling (see FAQ if running on Windows), an extremely overloaded server, a server that is attempting to run its processes using swap space, and unreasonably lengthy GC times.

2014-05-11 18:13:49,218 WARN [Logger@9226875 3.3.1/389] [Coherence] log 2014-05-11 18:13:49.218 Oracle Coherence GE 3.3.1/389 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; the member appears to be alive, but exhibits long periods of unresponsiveness; removing Member(Id=1, Timestamp=2014-05-11 18:09:52.641, Address=xxx.xxx.xxx.x:8090, MachineId=41352, Location=process:1234@CONFLUENCE01)

2014-05-11 18:13:49,249 INFO [Cluster:EventDispatcher] [confluence.cluster.coherence.TangosolClusterManager] memberLeft Member has left cluster: Member(Id=1, Timestamp=2014-05-11 18:13:49.218, Address=xxx.xxx.xxx.x:8090, MachineId=12345, Location=process:1234@CONFLUENCE01)

2014-05-11 18:13:49,436 WARN [Logger@9226875 3.3.1/389] [Coherence] log 2014-05-11 18:13:49.436 Oracle Coherence GE 3.3.1/389 <Warning> (thread=Cluster, member=2): The member formerly known as Member(Id=1, Timestamp=2014-05-11 18:13:49.218, Address=xxx.xxx.xxx.x:8090, MachineId=12345, Location=process:1234@CONFLUENCE01) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.
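To see how often these messages occur, and how long packets went unacknowledged, you can scan the log with a small script. The following is a minimal sketch, assuming the log file path is passed on the command line; the message fragments are copied from the log entries above, while the script name and the category names (`communication_problem`, `member_left`, and so on) are hypothetical labels chosen for this example.

```python
import sys
from collections import Counter
import re

# Message fragments copied from the Coherence / TangosolClusterManager log entries in this article.
# The dictionary keys are hypothetical labels used only for reporting.
PATTERNS = {
    "communication_problem": "A potential communication problem has been detected",
    "delivery_timeout": "Timeout while delivering a packet",
    "member_left": "Member has left cluster",
    "evicted": "has been forcefully evicted from the cluster",
}
# Captures the "after N seconds" value from the communication-problem warning.
FAILED_AFTER = re.compile(r"failed to be delivered \(or acknowledged\) after (\d+) seconds")


def scan(lines):
    """Count cluster-warning messages and collect packet-delivery delays (in seconds)."""
    counts = Counter()
    delays = []
    for line in lines:
        for name, fragment in PATTERNS.items():
            if fragment in line:
                counts[name] += 1
        match = FAILED_AFTER.search(line)
        if match:
            delays.append(int(match.group(1)))
    return counts, delays


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python scan_cluster_log.py /path/to/atlassian-confluence.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        counts, delays = scan(log)
    for name in PATTERNS:
        print(f"{name}: {counts[name]}")
    if delays:
        print(f"longest undelivered packet: {max(delays)} s")
```

A steady stream of communication-problem warnings followed by evictions points towards the network or packet-size causes below; isolated warnings that line up with long GC pauses point towards garbage collection.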

 

Cause

There are several potential causes for this issue:

  1. Cluster packets are too large for the network configuration to handle
  2. Long garbage collection (GC) pauses
  3. Other environmental issues:
    1. Network failure
    2. A server (or virtual machine) running its processes from swap space
    3. An otherwise overloaded server

Workaround

Run a single node and let it serve your users on its own while you investigate the root cause of the issue.

Resolution

Packet Size

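Run these commands on each node, replacing <other-node-address> with the address of the other cluster node, to confirm that larger packets get through your network:

ping <other-node-address>
ping -l 1500 <other-node-address>
ping -l 3000 <other-node-address>

The -l option sets the packet size on Windows; on Linux, use -s instead (for example, ping -s 1500 <other-node-address>).

If any of these fail or time out, ask your network administrators to allow larger packets between the cluster nodes.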
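The 1472-byte figure in the log warning is the usual 1500-byte Ethernet MTU minus a 20-byte IPv4 header and an 8-byte ICMP (or UDP) header. The sketch below, assuming Python 3.7+ and a standard `ping` binary on the PATH, builds the platform-specific ping commands (Windows `-n`/`-l`/`-f`, Linux iputils `-c`/`-s`/`-M do`) and runs them for several sizes. The function names and the list of sizes are hypothetical choices for this example, and the exit status is only a rough signal of success.

```python
import platform
import subprocess
import sys

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_unfragmented_payload(mtu=1500):
    """Largest ping payload that fits in a single IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_command(host, size, windows, dont_fragment=False, count=3):
    """Build a ping command line for Windows (-n/-l/-f) or Linux iputils (-c/-s/-M do)."""
    if windows:
        cmd = ["ping", "-n", str(count), "-l", str(size)]
        if dont_fragment:
            cmd.append("-f")
    else:
        cmd = ["ping", "-c", str(count), "-s", str(size)]
        if dont_fragment:
            cmd += ["-M", "do"]  # iputils (Linux) option to prohibit fragmentation
    cmd.append(host)
    return cmd


def probe(host, sizes, dont_fragment=False):
    """Return {size: True/False}, treating ping's exit status 0 as success."""
    windows = platform.system() == "Windows"
    results = {}
    for size in sizes:
        proc = subprocess.run(ping_command(host, size, windows, dont_fragment),
                              capture_output=True, text=True)
        results[size] = proc.returncode == 0
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python ping_sizes.py <other-node-address>
    limit = max_unfragmented_payload()
    for size, ok in probe(sys.argv[1], [56, limit, 1500, 3000]).items():
        print(f"{size:>5} bytes: {'ok' if ok else 'FAILED'}")
```

By default the sketch mirrors the commands above and lets large pings be fragmented, which tests whether fragmented packets reach the other node. Passing `dont_fragment=True` instead shows the largest size that fits in one packet along the path; on a standard 1500-byte MTU, sizes above 1472 are then expected to fail.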

Garbage Collection

  1. Enable GC logging, as described in How to Enable Garbage Collection (GC) Logging
  2. Review the logs using a tool like GCViewer
  3. Raise a Support Request if you'd like Support to help you analyse the logs and determine whether long GC pauses are causing the issue
  4. If GC times are long, follow the heap-sizing guidelines to reduce the size of your heap and bring GC times down
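Before loading the logs into a viewer, a quick scan can show whether any pause comes close to the 45-second packet timeout seen in the warnings. This is a minimal sketch, assuming GC logging was enabled on a Java 6 to 8 HotSpot JVM with -XX:+PrintGCDetails (and optionally -XX:+PrintGCDateStamps), so each record ends with a `[Times: ... real=N secs]` summary on the same line. Records that span several lines, or logs from newer JVMs, use other formats this sketch does not handle. The 5-second default threshold is an arbitrary, hypothetical value.

```python
import re
import sys

# Wall-clock pause from HotSpot -XX:+PrintGCDetails output (Java 6-8 format), e.g.
# "[Times: user=40.10 sys=0.30, real=48.21 secs]"
REAL_TIME = re.compile(r"real=(\d+(?:\.\d+)?) secs")
# ISO timestamp written at the start of a record when -XX:+PrintGCDateStamps is enabled
DATE_STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4})")


def long_pauses(lines, threshold_secs=5.0):
    """Yield (timestamp or None, pause_secs) for every pause of at least threshold_secs."""
    for line in lines:
        match = REAL_TIME.search(line)
        if not match:
            continue
        pause = float(match.group(1))
        if pause >= threshold_secs:
            stamp = DATE_STAMP.match(line)
            yield (stamp.group(1) if stamp else None, pause)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python gc_pauses.py /path/to/gc.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for stamp, pause in long_pauses(log):
            print(f"{stamp or '?'}  {pause:.2f} s")
```

Comparing the reported timestamps with the times of the Coherence warnings in atlassian-confluence.log helps confirm whether a pause on one node preceded its eviction.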

Other environmental issues

Ask your network and infrastructure administrators to investigate the current state of the network and of the servers themselves.
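On Linux, one quick check for the swap-space cause is the `VmSwap` line in `/proc/<pid>/status`, which reports how much of a process's memory is currently swapped out. The sketch below assumes a Linux kernel recent enough to report `VmSwap`, and that you already know the process ID of the Confluence JVM (for example, from your process list); the function names are hypothetical.

```python
import re
import sys

# "VmSwap:" line in /proc/<pid>/status, e.g. "VmSwap:\t  524288 kB"
VM_SWAP = re.compile(r"^VmSwap:\s+(\d+)\s+kB", re.MULTILINE)


def swapped_kb(status_text):
    """Return the process's swapped-out memory in kB, or None if the kernel does not report it."""
    match = VM_SWAP.search(status_text)
    return int(match.group(1)) if match else None


def process_swap_kb(pid):
    """Read /proc/<pid>/status for the given process (Linux only)."""
    with open(f"/proc/{pid}/status", encoding="utf-8") as status:
        return swapped_kb(status.read())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python jvm_swap.py <confluence-jvm-pid>
    kb = process_swap_kb(sys.argv[1])
    print("VmSwap not reported" if kb is None else f"{kb} kB of the process is in swap")
```

A non-zero value does not by itself prove the JVM is actively paging, but a large or growing figure on a node that keeps getting evicted is worth raising with your administrators.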

Last modified on Feb 26, 2016
