The Bamboo cluster standby node takes over the service and kills the active node on startup even though the active node is operational

Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.

Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product; however, they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Summary

On a Bamboo cluster, when an active node is already running, standby nodes are expected to wait for the active node to go offline before taking over the service. Under certain circumstances, a standby node takes over the service from the active node during startup (and vice versa), causing a cluster takeover race condition.

Environment

Bamboo 8 Data Center in an Active/Standby Cluster setup

Diagnosis

When the Bamboo standby node starts, the following error messages appear on the active node:

2022-07-12 22:12:15,236 INFO [http-nio-8085-exec-2] [AbstractIndexer] Can't find approximateTimePerResult value in Bandana for indexer com.atlassian.bamboo.index.buildresult.DefaultBuildResultsIndexer, assuming default 100ms
2022-07-12 22:12:15,241 INFO [http-nio-8085-exec-2] [AbstractIndexer] Can't find approximateTimePerResult value in Bandana for indexer com.atlassian.bamboo.deployments.environments.index.EnvironmentIndexerImpl, assuming default 100ms
2022-07-12 22:12:15,246 INFO [http-nio-8085-exec-2] [AbstractIndexer] Can't find approximateTimePerResult value in Bandana for indexer com.atlassian.bamboo.deployments.versions.index.VersionIndexerImpl, assuming default 100ms
2022-07-12 22:12:15,251 INFO [http-nio-8085-exec-2] [AbstractIndexer] Can't find approximateTimePerResult value in Bandana for indexer com.atlassian.bamboo.index.quicksearch.QuickSearchIndexerImpl, assuming default 100ms
2022-07-12 22:13:08,244 INFO [ActiveMQ Journal Checkpoint Worker] [PageFile] Unexpected io error on pagefile write of 1 pages.
java.io.IOException: Stale file handle
	at java.base/sun.nio.ch.FileDispatcherImpl.force0(Native Method)
	at java.base/sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:82)
	at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:461)
	at org.apache.activemq.util.RecoverableRandomAccessFile.sync(RecoverableRandomAccessFile.java:401)
	at org.apache.activemq.store.kahadb.disk.page.PageFile.writeBatch(PageFile.java:1187)
	at org.apache.activemq.store.kahadb.disk.page.PageFile.flush(PageFile.java:608)
	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1795)
	at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:1104)
	at org.apache.activemq.store.kahadb.MessageDatabase$CheckpointRunner.run(MessageDatabase.java:445)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-12 22:13:08,247 ERROR [ActiveMQ Journal Checkpoint Worker] [MessageDatabase] Checkpoint failed 


The following log entries also appear on both cluster nodes:

Server 1
2022-07-12 22:10:36,743 INFO [localhost-startStop-1] [ClusterLockBootstrapServiceImpl] Primary lock acquired with node id a186cbf4-f6d8-49da-878c-b2bc56a5bde1, proceeding with startup...
Server 2
2022-07-12 22:12:43,540 INFO [localhost-startStop-1] [ClusterLockBootstrapServiceImpl] Primary lock acquired with node id a186cbf4-f6d8-49da-878c-b2bc56a5bde1, proceeding with startup...

Cause

If you look closely at the entries above, you'll see that the node id is the same on both nodes. When Bamboo starts, it compares its node id with the one stored in the database. If the ids match, it assumes it can start normally as the active node; if they differ, it understands that another server is active and puts itself into standby mode.
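The snippet below is only a simplified sketch of the comparison described above, not Bamboo's actual implementation; the function name and the example ids are made up for illustration.

def decide_startup_mode(local_node_id, db_node_id):
    """Return 'active' or 'standby' based on the node id comparison."""
    if db_node_id is None or db_node_id == local_node_id:
        # No id registered yet, or the registered id matches ours:
        # the node assumes it can start normally as the active node.
        return "active"
    # A different id is registered: another server is active,
    # so this node puts itself into standby mode.
    return "standby"

# With duplicated cluster-node.properties files, both servers present the
# same id, so both take the "active" branch and fight over the service:
shared_id = "a186cbf4-f6d8-49da-878c-b2bc56a5bde1"
print(decide_startup_mode(shared_id, shared_id))  # -> active, on both nodes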

In this case, the <bamboo-home>/cluster-node.properties file was confirmed to be identical on both nodes:

#Bamboo DC node instance settings. Don't share this file between nodes.
#Manual changes might cause issues with cluster lock management. Read documentation before introducing changes.
#Wed Mar 30 14:28:12 AEDT 2022
node.id=a186cbf4-f6d8-49da-878c-b2bc56a5bde1
node.name=Node a186cbf4-f6d8-49da-878c-b2bc56a5bde1

Because the <bamboo-home>/cluster-node.properties files on Server 1 and Server 2 contain the same node.id, each node concludes on startup that it should be the active node, which produces the takeover behaviour described above.
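To confirm this quickly, a small script can compare the node.id values from copies of the file taken from each node. This is a hypothetical helper for illustration; the file paths are placeholders for wherever you copied the files.

from pathlib import Path

def read_node_id(path):
    # Pull the node.id value out of a copy of cluster-node.properties.
    for line in Path(path).read_text().splitlines():
        if line.startswith("node.id="):
            return line.split("=", 1)[1].strip()
    raise ValueError(f"node.id not found in {path}")

# Placeholder paths: point these at the copies collected from each server.
ids = {
    "Server 1": read_node_id("server1/cluster-node.properties"),
    "Server 2": read_node_id("server2/cluster-node.properties"),
}
print(ids)

if len(set(ids.values())) < len(ids):
    print("Duplicate node.id detected: both nodes will try to become active.")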

Solution

Remove the <bamboo-home>/cluster-node.properties file from one of your Bamboo servers and start the application again. Bamboo will generate a new file with a random node id, and the nodes will stop trying to take over the service from each other.
