Cluster communication problems: Member has left cluster, or Member has been forcefully evicted from cluster, or A potential communication problem has been detected
This article applies to clustered Confluence 5.4 or earlier.
The Confluence cluster is not working as expected. For example, you cannot start more than one node without the new member being evicted from the cluster.
The following appears in the logs:
2014-05-11 18:24:05,957 WARN [Logger@9233091 3.3.1/389] [Coherence] log 2014-05-11 18:24:05.957 Oracle Coherence GE 3.3.1/389 <Warning> (thread=PacketPublisher, member=2): A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after 45 seconds, although other packets were acknowledged by the same cluster member (Member(Id=1, Timestamp=2014-05-11 18:17:48.519, Address=xxx.xxx.xxx.x:8090, MachineId=12345, Location=process:1234@CONFLUENCE01)) to this member (Member(Id=2, Timestamp=2014-05-11 18:23:16.19, Address=xxx.xxx.xxx.x:8090, MachineId=67891, Location=process:1234@CONFLUENCE02)) as recently as 0 seconds ago. It is possible that the packet size greater than 1468 is responsible; for example, some network equipment cannot handle packets larger than 1472 bytes (IPv4) or 1468 bytes (IPv6). Use the 'ping' command with the <size> option to verify successful delivery of specifically sized packets. Other possible causes include network failure, poor thread scheduling (see FAQ if running on Windows), an extremely overloaded server, a server that is attempting to run its processes using swap space, and unreasonably lengthy GC times.
2014-05-11 18:13:49,218 WARN [Logger@9226875 3.3.1/389] [Coherence] log 2014-05-11 18:13:49.218 Oracle Coherence GE 3.3.1/389 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; the member appears to be alive, but exhibits long periods of unresponsiveness; removing Member(Id=1, Timestamp=2014-05-11 18:09:52.641, Address=xxx.xxx.xxx.x:8090, MachineId=41352, Location=process:1234@CONFLUENCE01)
2014-05-11 18:13:49,249 INFO [Cluster:EventDispatcher] [confluence.cluster.coherence.TangosolClusterManager] memberLeft Member has left cluster: Member(Id=1, Timestamp=2014-05-11 18:13:49.218, Address=xxx.xxx.xxx.x:8090, MachineId=12345, Location=process:1234@CONFLUENCE01)
2014-05-11 18:13:49,436 WARN [Logger@9226875 3.3.1/389] [Coherence] log 2014-05-11 18:13:49.436 Oracle Coherence GE 3.3.1/389 <Warning> (thread=Cluster, member=2): The member formerly known as Member(Id=1, Timestamp=2014-05-11 18:13:49.218, Address=xxx.xxx.xxx.x:8090, MachineId=12345, Location=process:1234@CONFLUENCE01) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.
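To see how often each of these symptoms occurs, the log can be scanned for the key phrases. The following is a minimal sketch, not an Atlassian tool: the phrases are copied from the excerpts above, while the labels and the function name `count_symptoms` are illustrative assumptions.

```python
from collections import Counter

# Phrases copied from the log excerpts above; the labels are illustrative.
SYMPTOMS = {
    "communication problem": "A potential communication problem has been detected",
    "delivery timeout": "Timeout while delivering a packet",
    "member left": "Member has left cluster",
    "forced eviction": "has been forcefully evicted from the cluster",
}


def count_symptoms(lines):
    """Count how many lines contain each symptom phrase."""
    counts = Counter()
    for line in lines:
        for label, phrase in SYMPTOMS.items():
            if phrase in line:
                counts[label] += 1
    return counts
```

Passing an open log file (or any iterable of lines) to `count_symptoms` returns a counter keyed by label. A high count of delivery timeouts followed by evictions would match the pattern described in this article.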
There are multiple potential causes for this issue:
- The packet size is too large for the network configuration to handle
- Lengthy Garbage Collection (GC) pauses
- Other environmental issues:
  - Network failure
  - A VM using swap space
  - An otherwise overloaded server
Start just one node and let it serve your customers on its own while you investigate the root cause of the issue.
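Packet size
Run these commands from one node to confirm that larger packets are allowed through your network. Replace <other-node> with the address of the other cluster node. The -l option sets the packet size in ping on Windows; on Linux the equivalent option is -s.
ping <other-node>
ping -l 1500 <other-node>
ping -l 3000 <other-node>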
If any of these pings fail, get your network administrators to allow larger packet sizes.
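The packet size check can also be scripted so that it runs the same way on each node. The following is a minimal sketch under stated assumptions: the function names and the host name `node2` are hypothetical, and it relies only on the standard size options of the system `ping` command (`-l` on Windows, `-s` on Linux and macOS).

```python
import platform
import subprocess


def build_ping_command(host, payload_size, system=None):
    """Return a ping command that sends one echo request of payload_size bytes."""
    system = system or platform.system()
    if system == "Windows":
        # -n: number of requests, -l: payload size in bytes
        return ["ping", "-n", "1", "-l", str(payload_size), host]
    # Linux and macOS: -c: number of requests, -s: payload size in bytes
    return ["ping", "-c", "1", "-s", str(payload_size), host]


def check_packet_sizes(host, sizes=(1500, 3000)):
    """Ping host once per size; return {size: True if ping exited with code 0}."""
    results = {}
    for size in sizes:
        completed = subprocess.run(
            build_ping_command(host, size), capture_output=True
        )
        results[size] = completed.returncode == 0
    return results
```

For example, `check_packet_sizes("node2")` pings the hypothetical host `node2` with 1500-byte and 3000-byte payloads. Treat the exit code as a rough signal only, particularly on Windows, and read the ping output before concluding that a packet size is blocked.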
Garbage Collection
- Enable GC logging; see How to Enable Garbage Collection (GC) Logging
- Review the logs using a tool like GCViewer
- Raise a Support Request if you'd like Support to help you analyse the logs and determine whether GC pauses are causing the issue
- Follow these guidelines to reduce the size of your heap and bring the GC times down
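Before opening the GC logs in a viewer, a quick scan for unusually long pauses can show whether GC is a plausible cause. The sketch below assumes the classic HotSpot GC log layout, in which each collection entry ends with a duration such as `, 0.0123456 secs]`. The function name and the 10-second default threshold are illustrative assumptions, not recommended values.

```python
import re

# Matches the duration of a classic HotSpot GC log entry, e.g. ", 0.0123456 secs]".
PAUSE_RE = re.compile(r", (\d+\.\d+) secs\]")


def long_pauses(lines, threshold_secs=10.0):
    """Yield (pause_seconds, line) for GC entries whose pause exceeds threshold_secs."""
    for line in lines:
        matches = PAUSE_RE.findall(line)
        if not matches:
            continue
        # Nested entries report inner durations first; the last one is the total pause.
        total = float(matches[-1])
        if total > threshold_secs:
            yield total, line.rstrip()
```

Pauses that approach the 45-second delivery timeout reported in the Coherence warning above would be a strong hint that GC is involved. Shorter pauses are better judged in a tool like GCViewer.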
Other environmental issues
Get your network and infrastructure administrators to investigate the current state of the network and the server itself.