Disaster recovery guide for Bamboo Data Center
Overview
Before you begin, read the following subsections to get familiar with the components and terminology used in this guide, and review the requirements you must meet to follow it.
Terminology
Warm standby – This guide describes what is generally referred to as a warm standby strategy. That means that any secondary Bamboo instances are continuously running and kept in sync with the primary node for seamless failover.
Recovery Point Objective (RPO) – How up-to-date you require your Bamboo instance to be after a failure.
Recovery Time Objective (RTO) – How quickly you require your standby system to be available after a failure.
Recovery Cost Objective (RCO) – How much you are willing to spend on your disaster recovery solution.
What is the difference between high availability and disaster recovery?
The terms "high availability", "disaster recovery", and "failover" are often confused. To avoid confusion, this guide defines them as follows:
High availability – A strategy to provide a specific level of availability. In the case of Bamboo, this refers to access to the application and an acceptable response time. Automated correction and failover within the same location are usually part of high-availability planning.
Disaster recovery – A strategy to resume operations in an alternate data center (usually in another geographic location) in the event of a disaster whereby the main data center becomes unavailable. Failover (to another location) is a fundamental part of disaster recovery.
Failover – Failover happens when one machine takes over from another machine that has failed. This could be within the same data center or across different geographical locations. Failover is usually part of both high availability and disaster recovery planning.
Components
A typical deployment of Bamboo configured for disaster recovery is depicted in the following diagram. The standby Bamboo instances, the shared file server, and the database instances are all warm (that is, constantly running) so that replication and failover can occur.
Replication of data sources
Replication of your primary Bamboo system to a standby system requires:
your shared home directory to be on a file system that supports atomic snapshot-level replication to a remote standby
your database to be capable of replication to a remote standby
File system
The shared home directory contains your build results data, log files, user-installed apps, and so on. For more details, go to Bamboo home migration.
You need to replicate your shared home directory onto your standby file server. This process needs to be quick, reliable, and incremental to ensure the standby is as up-to-date as possible.
You need to choose a file system replication technology that will maintain the data integrity of the artifacts and build logs on disk.
For example, a replication technology such as rsync would not meet this requirement, because artifacts might change during the transfer, causing corruption.
The best way to achieve the desired level of consistency is to use a file system that supports atomic block-level snapshots. All files in the snapshot are effectively frozen at the same point in time regardless of how long the replication to the standby takes.
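As a rough illustration of snapshot-based replication, the sketch below builds the commands for one incremental replication cycle. It assumes ZFS purely as an example of a file system with atomic snapshots (this guide does not prescribe one), and the dataset name, snapshot prefix, and standby host are hypothetical.

```python
import shlex
from datetime import datetime, timezone


def snapshot_name(dataset: str, now: datetime) -> str:
    """Return a timestamped snapshot name for the given dataset."""
    return f"{dataset}@dr-{now.strftime('%Y%m%dT%H%M%SZ')}"


def replication_commands(dataset: str, previous: str, current: str,
                         standby_host: str) -> list[str]:
    """Build the shell commands for one incremental replication cycle.

    1. Take an atomic snapshot of the dataset holding the shared home.
    2. Send only the blocks changed since the previous snapshot to the standby.
    """
    q = shlex.quote
    return [
        f"zfs snapshot {q(current)}",
        f"zfs send -i {q(previous)} {q(current)} | "
        f"ssh {q(standby_host)} zfs receive -F {q(dataset)}",
    ]


if __name__ == "__main__":
    # Hypothetical dataset, snapshot, and host names, for illustration only.
    dataset = "tank/bamboo-shared"
    current = snapshot_name(dataset, datetime.now(timezone.utc))
    for command in replication_commands(
        dataset, f"{dataset}@dr-previous", current, "standby.example.com"
    ):
        print(command)
```

Because every file in the snapshot is frozen at the same instant, the transfer can take as long as it needs without artifacts changing underneath it, which is the property a file-by-file copy lacks.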
Database
The database contains data about build and deployment configurations, agent settings, users, groups, permissions, and so on. You need to replicate your database and continuously keep it up to date to satisfy your RPO.
See Supported platforms for more details about choosing a supported database technology. Suppliers of supported database solutions provide their own database replication solutions.
These data sources comprise the entire state of your Bamboo instance. App vendors should make sure that files added by their apps are stored in the shared Bamboo home instead of the local home.
Setting up a backup system
Step 1. Install Bamboo Data Center on the backup instance
Install Bamboo Data Center on the backup instance the same way you would set up a primary instance.
The backup instance will effectively be an exact replica of the primary instance and therefore requires all the components deployed in the primary to be deployed to the backup. This includes a Data Center instance, database, and a file server to store the shared home folder.
See Installing Bamboo Data Center for specific, detailed installation procedures. Make sure Bamboo uses the backup database and not the production database.
Confirm Bamboo starts and can connect to its own database, not the production one. Starting your backup instance to test your setup is discussed later in the Disaster recovery testing section.
DB access credentials are stored in the local home directory of the Bamboo instance. The location of that directory is set by bamboo.home in bamboo-init.properties or by the BAMBOO_HOME environment variable.
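To find which local home directory a backup instance actually uses, a small helper along these lines may be useful. This is a minimal sketch: the lookup order (properties file first, then environment variable) is an assumption to verify against your Bamboo version, and the parser handles only simple key=value lines.

```python
import os
from typing import Mapping, Optional


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def resolve_bamboo_home(properties_text: str,
                        environ: Mapping[str, str]) -> Optional[str]:
    """Return the configured local home, or None if neither source sets it.

    Assumption: bamboo.home in the properties file is checked first and the
    BAMBOO_HOME environment variable second.
    """
    home = parse_properties(properties_text).get("bamboo.home")
    return home or environ.get("BAMBOO_HOME")


if __name__ == "__main__":
    # The inline text stands in for the contents of bamboo-init.properties.
    sample = "# local home\nbamboo.home=/var/atlassian/bamboo-home\n"
    print(resolve_bamboo_home(sample, os.environ))
```

Running this on both the primary and the backup node is a quick way to confirm that each one reads its database settings from its own local home.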
Step 2. Set up replication from the primary instance
Set up file server replication – Set up your backup file server as a replica of your primary. Only the volume containing Bamboo’s shared home directory needs to be replicated. Refer to your file server vendor's documentation for more information.
Set up database replication – Set up your backup database as a read replica of your primary database. Refer to your database vendor's documentation for more information.
Step 3. Initiate replication
Once you set up replication, ensure the primary instance is replicating to the backup instance. The method of replication varies depending on the chosen technology.
Once set up correctly, most database replication technologies are automatic and don't require ongoing manual steps. File system replication technologies may require the ongoing transfer of snapshots from the primary system to the standby to be automated (for example, using cron).
Make sure:
that the primary and backup instances run the same version of Bamboo
not to use rsync for file system replication
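When snapshot transfers are automated on a schedule, it helps to check that the schedule can actually satisfy your RPO. The sketch below uses a simple model that is an assumption, not part of this guide: in the worst case, a disaster strikes just before a snapshot finishes transferring, so the standby is behind by one full snapshot interval plus the transfer time.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval: timedelta,
                         transfer_time: timedelta) -> timedelta:
    """Upper bound on how stale the standby file system can be.

    Changes made right after a snapshot are only captured by the next one,
    and that snapshot is only usable once its transfer completes.
    """
    return snapshot_interval + transfer_time


def meets_rpo(snapshot_interval: timedelta, transfer_time: timedelta,
              rpo: timedelta) -> bool:
    """Return True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(snapshot_interval, transfer_time) <= rpo


if __name__ == "__main__":
    # Hypothetical figures: snapshots every 15 minutes, 5 minutes to transfer.
    interval = timedelta(minutes=15)
    transfer = timedelta(minutes=5)
    print(meets_rpo(interval, transfer, rpo=timedelta(minutes=30)))
```

If the check fails, shorten the snapshot interval or speed up the transfer. The database replica usually lags far less than this, so the file system tends to set the effective RPO.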
Disaster recovery testing
You should exercise extreme care when testing any disaster recovery plan. Simple mistakes, such as inserting test updates into a production database, could corrupt your live instance. You may negatively impact your ability to recover from a real disaster while testing your disaster recovery plan.
Your backup instance will be configured as if it were the primary, which can cause issues when testing your disaster recovery plan. When Bamboo starts on the backup node, the build orphan job may attempt to change the status of any ongoing build results it finds if the file server is out of sync with the database.
Make sure remote agents can connect to your backup instances.
Before testing your disaster recovery deployment
Before testing, isolate your standby instance and keep the standby data center as isolated as possible from production systems during testing. You will need to prevent replication from the primary while testing your standby instance.
To ensure your standby instance is isolated from your primary instance:
Isolate your database – Temporarily pause all replication to the standby database, then promote the standby database.
Isolate your file system – Temporarily pause replication to the standby file system, then promote the standby file system.
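As an extra safeguard before starting a test, you might verify that the standby's database connection string does not point at a production host. The following is a minimal sketch under stated assumptions: it handles JDBC URLs of the form jdbc:postgresql://host:port/db, and the host names are hypothetical.

```python
from urllib.parse import urlsplit


def jdbc_host(jdbc_url: str) -> str:
    """Extract the host from a URL such as jdbc:postgresql://host:5432/bamboo."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError(f"not a JDBC URL: {jdbc_url!r}")
    host = urlsplit(jdbc_url[len("jdbc:"):]).hostname
    if host is None:
        raise ValueError(f"no host found in {jdbc_url!r}")
    return host  # urlsplit returns the host in lower case


def assert_not_production(jdbc_url: str, production_hosts: set[str]) -> None:
    """Raise RuntimeError if the URL targets a known production database host."""
    host = jdbc_host(jdbc_url)
    if host in {h.lower() for h in production_hosts}:
        raise RuntimeError(f"refusing to test against production database {host}")


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    assert_not_production(
        "jdbc:postgresql://db-standby.example.com:5432/bamboo",
        {"db-prod.example.com"},
    )
    print("standby database URL looks isolated from production")
```

A check like this does not replace pausing replication and promoting the standby. It only catches the specific mistake of a standby node configured with the production connection string.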
Handling a failover
If your primary instance becomes unavailable, you need to fail over to your standby system. This section describes how to do this, including instructions on how to check the data in your standby system.
To fail over to your standby instance:
Promote your standby database – Ensure the standby database is ready to accept connections and can no longer receive any further updates from the primary. Refer to your database vendor's documentation for more information.
Promote your standby file server – Ensure the standby file server doesn’t receive any further updates from the primary. Refer to your file server vendor's documentation for more information.
Start Bamboo in the standby instance.
Monitor the Bamboo log file and check for issues.
Update your DNS, HTTP proxy, or other front-end devices to route traffic to your standby server.
Update the mail server configuration if the mail server differs from the production mail server.
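The order of the steps above matters: traffic should not be routed to the standby until the database and file server have been promoted and Bamboo has started cleanly. One way to encode that in a runbook script is sketched below. The step actions are hypothetical placeholders; in practice each would wrap a vendor-specific command.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_failover(steps: list[Step]) -> list[str]:
    """Run the steps in order and stop at the first failure.

    Returns the names of the steps that completed, so an operator can see
    exactly where the procedure halted.
    """
    completed: list[str] = []
    for step in steps:
        if not step.action():
            break
        completed.append(step.name)
    return completed


if __name__ == "__main__":
    # Placeholder actions that always succeed; replace with real checks.
    runbook = [
        Step("promote standby database", lambda: True),
        Step("promote standby file server", lambda: True),
        Step("start Bamboo on standby", lambda: True),
        Step("check Bamboo log for issues", lambda: True),
        Step("route traffic to standby", lambda: True),
        Step("update mail server configuration", lambda: True),
    ]
    print(run_failover(runbook))
```

Stopping at the first failed step keeps a half-promoted standby from receiving user traffic.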
Returning to the primary instance
In most cases, you'll want to return to using your primary instance after you've resolved the problems that caused the disaster. There are two approaches to this:
Schedule a reasonably sized outage window. Take a full backup of the standby system's home directory and database, and restore it on the primary system.
Run the replication and failover steps in reverse, where the standby system now takes the role of the primary, and the original primary system acts as a standby until you cut over.