Bamboo agents go offline and come back online without restart
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
Bamboo remote agents go offline suddenly and come back online after a brief period without any manual intervention or restarting the agents.
Environment
This was observed in Bamboo 9.6.1, but it may apply to other versions as well because the issue is network-related.
Diagnosis
When you start seeing your agents going offline or shutting down suddenly, it is important to check the <bamboo-home>/logs/atlassian-bamboo.log file to understand what could be going wrong.
- If you see the logs below, there is an ActiveMQ (remote agent broker) failure. The logs show that a request sent to the ActiveMQ server did not receive a response within the timeout period:
2024-05-21 19:25:08,345 WARN [RemoteEventBroadcast-1] [RemoteBroadcastEventListener] Broadcast failed with timeout, backing off...
2024-05-21 19:25:08,362 INFO [RemoteEventBroadcast-1] [RemoteBroadcastEventListener] Caught UncategorizedJmsException
2024-05-21 19:25:08,362 INFO [RemoteEventBroadcast-1] [Emergency] Caught UncategorizedJmsException
org.springframework.jms.UncategorizedJmsException: Uncategorized exception occurred during JMS processing; nested exception is javax.jms.JMSException: org.apache.activemq.transport.RequestTimedOutIOException
at org.springframework.jms.support.JmsUtils.convertJmsAccessException(JmsUtils.java:311) ~[spring-jms-5.3.33.jar:5.3.33]
...
Caused by: javax.jms.JMSException: org.apache.activemq.transport.RequestTimedOutIOException
- These logs may be followed by the messages below, which indicate that the agents have been disconnected:
2024-05-21 19:44:33,364 INFO [scheduler_Worker-10] [PlanStatePersisterImpl] Updating delta states of build following NPV-1
2024-05-21 19:44:33,504 WARN [scheduler_Worker-10] [RemoteAgentManagerImpl] Detected that remote agent 'agent1' has been inactive since Tue May 21 19:20:15 UAT 2024
2024-05-21 19:45:33,604 WARN [scheduler_Worker-10] [RemoteAgentManagerImpl] Marking remote agent 'agent1' as unresponsive
2024-05-21 19:45:33,604 WARN [scheduler_Worker-10] [RemoteAgentManagerImpl] Detected that remote agent 'agent2' has been inactive since Tue May 21 19:22:15 UAT 2024
- You may also find many network-related errors:
2024-05-21 19:55:19,602 ERROR [http-nio-8085-exec-282] [ArtifactServlet] Exception when storing the artifact
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Connection reset by peer
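The checks above can be sketched as a small Python script that counts how often each symptom appears in atlassian-bamboo.log. This is a minimal sketch, not an Atlassian tool: the names SIGNATURES, classify_line, summarize, and summarize_file are hypothetical, and the patterns are taken only from the log excerpts in this article, so other Bamboo versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts in this article (assumption:
# your Bamboo version emits the same wording).
SIGNATURES = {
    "broker_timeout": re.compile(
        r"Broadcast failed with timeout|RequestTimedOutIOException"
    ),
    "agent_unresponsive": re.compile(
        r"has been inactive since|as unresponsive"
    ),
    "connection_reset": re.compile(r"Connection reset by peer"),
}


def classify_line(line):
    """Return the name of the first matching signature, or None."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines match each signature."""
    counts = Counter()
    for line in lines:
        name = classify_line(line)
        if name is not None:
            counts[name] += 1
    return counts


def summarize_file(path):
    """Summarize a log file, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize(handle)
```

For example, calling summarize_file with the path to <bamboo-home>/logs/atlassian-bamboo.log returns a Counter keyed by symptom name. Seeing all three symptoms in the same period is consistent with the network-related cause described in this article.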
Cause
These messages are usually caused by a network connectivity issue that makes ActiveMQ unresponsive. This in turn causes the agents to disconnect.
Solution
Check whether any network errors were reported during the time frame in which the agents went offline.
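To know which time frame to look at, it can help to extract the first and last timestamps of the symptom lines from atlassian-bamboo.log. The following is a minimal sketch under stated assumptions: incident_window, TIMESTAMP, and SYMPTOM are hypothetical names, the timestamp layout is the one shown in the log excerpts in this article, and the timestamps are in the Bamboo server's local time, so they may need converting before they are compared with network monitoring data.

```python
import re
from datetime import datetime

# Timestamp layout as seen in this article's log excerpts,
# e.g. "2024-05-21 19:25:08,345 WARN ...".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")

# Symptom messages from this article that appear on timestamped lines.
SYMPTOM = re.compile(
    r"Broadcast failed with timeout"
    r"|has been inactive since"
    r"|as unresponsive"
    r"|Exception when storing the artifact"
)


def incident_window(lines):
    """Return (first, last) datetimes of symptom lines, or None."""
    stamps = []
    for line in lines:
        if not SYMPTOM.search(line):
            continue
        match = TIMESTAMP.match(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            )
    if not stamps:
        return None
    return min(stamps), max(stamps)
```

The returned pair gives a window to check against any network incidents reported for the same period.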