Cluster state replication
The state replication process consists of the following stages:
- Creating local queues
- Adding state modifications to the local queues
- Replicating state modifications from local queues to the other nodes
Creating local queues
Before we can queue state modifications, we need to create separate, local queues on each cluster node. We do this so that modifications are properly grouped and ordered.
The queues are created automatically. The nodes retrieve information about the whole cluster from the database by means of the following query:
select * from cluster_node_heartbeat;
Whenever you add a node to the cluster, the already existing nodes detect it and create another 8 queues for the new one in their file system. The new node will also create 8 queues for each remaining node.
By default, each node has 8 separate queues for each remaining node in the active state. We've opted for 8 queues to increase throughput— we can't speed up the replication of a single modification due to the number of actions that must be completed, but we can replicate 8 modifications at the same time.
In a 3-node cluster, the following queues will be created:
On the first node:
- 8 queues for replication modifications to the second node
- 8 queues for replicating modifications to the third node
On the second node:
- 8 queues for replicating modifications to the first node
- 8 queues for replicating modifications to the third node
On the third node:
- 8 queues for replicating modifications to the first node
- 8 queues for replicating modifications to the second node
The queues are created in the <local-bamboo-home>/localq
directory on each node. The contents of this directory look similar to the following example:
queue_fa2dcb4cd9614f27ba7cf9cf8932a097_0_fefb71b80fc2701443da7258e8646bce
..
queue_fa2dcb4cd9614f27ba7cf9cf8932a097_2_fefb71b80fc2701443da7258e8646bce
queue_fa2dcb4cd9614f27ba7cf9cf8932a097_3_fefb71b80fc2701443da7258e8646bce
..
queue_fa2dcb4cd9614f27ba7cf9cf8932a097_7_fefb71b80fc2701443da7258e8646bce
The queues are named according to the following scheme:
queue_<node-id>_<node-queue-number>_<hashed-node-id>
Where:
-
<node-id>
is the unique identifier of each node in your Bamboo Data Center cluster <node-queue-number>
is the number of the local queue<hashed-node-id>
is the hashed node identifier
You can obtain the node identifier by inspecting the node.id
field in the cluster-node.properties
file on each node.
Adding state modifications to the local queues
Once all the queues have been created, state invalidation events can be added to the right queue. Let's use the previous example of the 3-node Bamboo Data Center cluster to illustrate how this occurs:
- A state invalidation event occurs on the first node (for example, you make changes to the plan structure).
- The changes are made in the database, and the local state related to that change is invalidated.
- A request to invalidate the state is added to the following local queues:
queue_<node2-id>_<node2-queue-number>_<hashed-node2-id>
queue_<node3_id>_<node3-queue-number>_<hashed-node3-id>
- The change is replicated to the other nodes based on the order in which it was queued.
Modifications occurring in a single thread are always added to the same queue. This is to keep the order in which they'll be replicated in case two events in the same thread are dependent on each other.
Replicating state modifications from local queues to the other nodes
After all state modifications have been added to the local queues, the replication is handled by dispatchers. Each queue is paired with a dispatcher that handles:
- reading the state modification from the queue
- delivering the state modification request to the receiver node over gRPC in a non-blocking manner
- removing the state modification request from the queue
The dispatcher keeps a single gRPC channel between the publisher and the receiver nodes. The channel transmits only the modifications originating in the local queue the dispatcher was created for.
Following the previous example of the 3-node cluster, this is the process of replicating state modifications:
- The dispatcher responsible for the local
queue_<node2-id>_<node2-queue-number>_<hashed-node2-id>
reads the modifications from that queue and attempts to deliver them over the gRPC channel to the second node. - Another dispatcher responsible for the local
queue_<node3_id>_<node3-queue-number>_<hashed-node3-id>
reads the modifications from that queue and attempts to deliver them over the gRPC channel to the third node.
If the modifications have been successfully replicated, they're removed from the queues. If the replication fails, error messages are written to the log file.
How to monitor cluster state replication
You can monitor state replication by reviewing statistics that are written in the log file. They’ll show you the size of the local queues, and whether state modifications are successfully replicated or persisted in the queues for too long. In most cases, monitoring just a few parameters will tell you if the replication is working properly.
Learn how to monitor cluster state replication
How to configure cluster state replication
The default state replication settings should be sufficient for most use cases, but if needed, you can adjust options such as the number of local queues per node or the frequency of collecting statistics.