Confluence Data Center disaster recovery
A disaster recovery strategy is a key part of any business continuity plan. It outlines the processes to follow in the event of a disaster, to ensure that the business can recover and keep operating. For Confluence, this means ensuring Confluence's availability in the event that your primary site becomes unavailable.
Confluence Data Center is the only Atlassian-supported high-availability solution for Confluence. However, if you don't choose Confluence Data Center, our Experts may be able to help you implement a high-availability solution for your environment. Contact our Experts team for more information.
This page demonstrates how you can use Confluence Data Center 5.9 or later to implement and manage a disaster recovery strategy for Confluence. It doesn't, however, cover the broader business practices, such as setting the key objectives (RTO, RPO, and RCO, defined at the end of this page) and standard operating procedures.
What's the difference between high availability and disaster recovery?
The terms "high availability", "disaster recovery" and "failover" can often be confused. For the purposes of this page, we've defined them as follows:
- High availability – A strategy to provide a specific level of availability. In Confluence's case, access to the application and an acceptable response time. Automated correction and failover (within the same location) are usually part of high-availability planning.
- Disaster recovery – A strategy to resume operations in an alternate data center (usually in another geographic location), if the main data center becomes unavailable (i.e. a disaster). Failover (to another location) is a fundamental part of disaster recovery.
- Failover – When one machine takes over from another machine after the first machine fails. This could be within the same data center or from one data center to another. Failover is usually part of both high availability and disaster recovery planning.
Before you start, you need Confluence Data Center 5.9 or later to implement the strategy described in this guide.
This page describes what is generally referred to as a 'cold standby' strategy, which means the standby Confluence instance isn't continuously running and that you need to take some administrative steps to start the standby instance and ensure it's in a suitable state to service the business needs of your organization.
Maintaining a runbook
The detailed steps will vary from organization to organization, so we recommend you keep a full runbook of steps on file, away from the production system it references. Make your runbook detailed enough that anyone in the relevant team can complete the steps and recover your service, regardless of prior knowledge or experience. We expect any runbook to contain steps that cover the following parts of the disaster recovery process:
- Detection of the problem
- Isolation of the current production environment and bringing it down gracefully
- Synchronization of data between failed production and intended recovery point
- Warm up instructions for the recovery instance
- Documentation, communication, and escalation guidelines
The major components you need to consider in your disaster recovery plan are:
|Confluence installation||Your standby site should have exactly the same version of Confluence installed as your production site.|
|Database||This is the primary source of truth for Confluence and contains most of the Confluence data (except for attachments, avatars, etc.). You need to replicate your database and continuously keep it up to date to satisfy your RPO.|
|Attachments||All attachments are stored in the Confluence Data Center shared home directory, and you need to ensure it's replicated to the standby instance.|
|Search Index||The search index isn't a primary source of truth, and can always be recreated from the database. For large installations, though, this can be quite time consuming and the functionality of Confluence will be greatly reduced until the index is fully recovered. Confluence Data Center stores search index backups in the shared home directory, which are covered by the shared home directory replication.|
|Plugins||User installed plugins are stored in the database and are covered by the database replication.|
|Other data||A few other non-critical items are stored in the Confluence Data Center shared home. Ensure they're also replicated to your standby instance.|
Set up a standby system
Step 1. Install Confluence Data Center 5.9 or higher
Install the same version of Confluence on your standby system. Configure the system to attach to the standby database.
DO NOT start the standby Confluence system
Starting Confluence would write data to the database and shared home, which you do not want to do.
You may want to test the installation, in which case you should temporarily connect it to a different database and different shared home directory and start Confluence to make sure it works as expected. Don't forget to update the database configuration to point to the standby database and the shared home directory configuration to point to the standby shared home directory after your testing.
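Before leaving the standby system in place, it can help to confirm that its configuration really points back at the standby database and standby shared home. The following is a minimal sketch, assuming the settings live as `<property name="...">` elements in an XML configuration file and that the relevant property names are `hibernate.connection.url` and `confluence.cluster.home`. Those property names, the sample XML, and every host name and path below are assumptions for illustration; check them against your own installation (for example, a JNDI datasource setup would not have a JDBC URL property at all).

```python
import xml.etree.ElementTree as ET


def read_properties(cfg_xml: str) -> dict:
    """Return {name: value} for every <property name="..."> element in the XML text."""
    root = ET.fromstring(cfg_xml)
    return {
        prop.get("name"): (prop.text or "").strip()
        for prop in root.iter("property")
        if prop.get("name")
    }


def config_mismatches(cfg_xml: str, expected: dict) -> list:
    """List readable mismatches between the configuration and the expected standby values."""
    actual = read_properties(cfg_xml)
    problems = []
    for name, want in expected.items():
        got = actual.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, found {got!r}")
    return problems


# Hypothetical sample: the database URL still points at the temporary test database.
SAMPLE_CFG = """<confluence-configuration>
  <properties>
    <property name="hibernate.connection.url">jdbc:postgresql://test-db.example.internal:5432/confluence</property>
    <property name="confluence.cluster.home">/mnt/standby-shared-home</property>
  </properties>
</confluence-configuration>"""

# Hypothetical values you expect on the standby system after testing.
EXPECTED_STANDBY = {
    "hibernate.connection.url": "jdbc:postgresql://standby-db.example.internal:5432/confluence",
    "confluence.cluster.home": "/mnt/standby-shared-home",
}

if __name__ == "__main__":
    for problem in config_mismatches(SAMPLE_CFG, EXPECTED_STANDBY):
        print(problem)
```

With the sample above, the check reports only the database URL, because the shared home already matches the expected standby value.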
Step 2. Implement a data replication strategy
Replicating data to your standby location is crucial to a cold standby failover strategy. You don't want to fail over to your standby Confluence instance and find that it's out of date or that it takes many hours to re-index.
The database suppliers that Confluence supports provide their own database replication solutions.
You need to implement a database replication strategy that meets your RTO, RPO, and RCO.
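One way to reason about whether database replication meets your RPO is to compare the replication lag with the objective. The sketch below is illustrative only: how you obtain the timestamp of the last replicated change depends on your database vendor, and the function names and the 15-minute RPO are assumptions, not part of Confluence.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_replicated_at: datetime, now: datetime) -> timedelta:
    """Time elapsed since the standby last received data (never negative)."""
    return max(now - last_replicated_at, timedelta(0))


def meets_rpo(last_replicated_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True when the data you would lose in a failover right now is within the RPO."""
    return replication_lag(last_replicated_at, now) <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=15)  # hypothetical objective
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last = now - timedelta(minutes=20)  # hypothetical last replicated change
    print("within RPO" if meets_rpo(last, now, rpo) else "RPO exceeded")
```

A check like this could feed the detection step of your runbook, so that you learn about stalled replication before a disaster rather than during one.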
You also need to implement a file server replication strategy for the Confluence shared home directory that meets your RTO, RPO, and RCO.
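If you choose rsync for shared home replication, a scheduled job could look roughly like the sketch below. It only builds and runs a standard rsync command line; the host name and paths are hypothetical, and the choice of `--delete` (which removes files on the standby that no longer exist on the source) is an assumption you should review against your own retention needs. The sketch defaults to `--dry-run` so that nothing is changed until you opt in.

```python
import subprocess


def build_rsync_command(source_dir: str, dest_host: str, dest_dir: str,
                        dry_run: bool = True) -> list:
    """Build an rsync argument list that mirrors source_dir's contents to dest_host:dest_dir."""
    # A trailing slash makes rsync copy the directory's contents, not the directory itself.
    source = source_dir.rstrip("/") + "/"
    command = ["rsync", "-a", "--delete"]
    if dry_run:
        command.append("--dry-run")  # report what would change without changing it
    command += [source, f"{dest_host}:{dest_dir}"]
    return command


def replicate_shared_home(source_dir: str, dest_host: str, dest_dir: str,
                          dry_run: bool = True) -> None:
    """Run rsync and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_rsync_command(source_dir, dest_host, dest_dir, dry_run), check=True)


if __name__ == "__main__":
    # Hypothetical paths and host; prints the command instead of running it.
    print(" ".join(build_rsync_command(
        "/var/atlassian/shared-home",
        "standby.example.internal",
        "/var/atlassian/shared-home",
    )))
```

How often you run such a job determines how much attachment and index snapshot data you could lose, so the schedule should follow from your RPO.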
For your clustered environment you need to be aware of the following, in addition to the information above:
There's no need for the configuration of the standby cluster to reflect that of the live cluster. It may contain more or fewer nodes, depending on your requirements and budget. Fewer nodes may result in lower throughput, but that may be acceptable depending on your circumstances.
Where we mention
|Starting the standby cluster||It's important to initially start only one node of the cluster, allow it to recover the search index, and check it's working correctly before starting additional nodes.|
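The "start one node, check it, then start the rest" rule can be partly automated by polling the first node before bringing up the others. The sketch below assumes the node exposes a `/status` URL that returns JSON with a `state` field whose value is `RUNNING` once the application is up; treat that endpoint, the field name, and the base URL as assumptions to verify for your Confluence version. A `RUNNING` state does not by itself prove the search index has been recovered, so a manual check is still advisable.

```python
import json
import time
import urllib.request


def fetch_state(base_url: str, timeout: float = 5.0) -> str:
    """Return the 'state' reported at <base_url>/status, or 'UNREACHABLE' on any error."""
    url = f"{base_url.rstrip('/')}/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
            return body.get("state", "UNKNOWN")
    except (OSError, ValueError, AttributeError):
        # Covers connection failures, HTTP errors, and responses that are not a JSON object.
        return "UNREACHABLE"


def wait_until_running(get_state, attempts: int = 60, delay_seconds: float = 10.0,
                       sleep=time.sleep) -> bool:
    """Poll get_state() until it returns 'RUNNING' or the attempts run out."""
    for attempt in range(attempts):
        if get_state() == "RUNNING":
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    first_node = "http://standby-node1.example.internal:8090"  # hypothetical URL
    if wait_until_running(lambda: fetch_state(first_node), attempts=3, delay_seconds=1.0):
        print("First node is running; verify it manually before starting other nodes.")
    else:
        print("First node did not report RUNNING; do not start additional nodes yet.")
```

Passing the state lookup in as a function keeps the polling logic separate from the network call, which makes it easy to rehearse with canned states.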
Disaster recovery testing
You should exercise extreme care when testing any disaster recovery plan. Simple mistakes may corrupt your live instance, for example if testing updates are inserted into your production database. A careless test can also damage your ability to recover from a real disaster.
The key is to keep the main data center as isolated as possible from the disaster recovery testing.
This procedure will ensure that the standby environment will have all the right data, but as the testing environment is completely separate from the standby environment, possible configuration problems on the standby instance are not covered.
Before you perform any testing, you need to isolate your production data.
|Attachments, plugins and indexes||
You need to ensure that no plugin updates or index backups occur during the test:
After this you can resume all replication to the standby instance, including the database.
Perform disaster recovery testing
Once you have isolated your production data, follow the steps below to test your disaster recovery plan:
- Ensure that the new database is ready, with the latest snapshot and no replication
- Ensure that the new shared home directory is ready, with the latest snapshot and no replication
- Ensure you have a copy of Confluence on a clean server with the right database and shared home directory settings in
- Ensure you have confluence.home mapped, as it was in the standby instance, in the test server
- Disable email (see atlassian.mail.senddisabled in Configuring System Properties)
- Start Confluence
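Because a test instance that sends real notifications can confuse users, you may want a pre-flight check that refuses to start the test server unless email is disabled. The sketch below parses a JVM argument string and looks for the `atlassian.mail.senddisabled` system property. It assumes the property is passed as a `-D` argument and that a value of `true` disables outgoing mail; confirm both against Configuring System Properties. The function names and sample arguments are hypothetical.

```python
import shlex


def system_properties(jvm_args: str) -> dict:
    """Extract -Dname=value pairs from a JVM argument string."""
    properties = {}
    for token in shlex.split(jvm_args):
        if token.startswith("-D"):
            name, _, value = token[2:].partition("=")
            properties[name] = value
    return properties


def outgoing_mail_disabled(jvm_args: str) -> bool:
    """True when atlassian.mail.senddisabled is set to 'true' in the arguments."""
    value = system_properties(jvm_args).get("atlassian.mail.senddisabled", "")
    return value.lower() == "true"


if __name__ == "__main__":
    args = "-Xms2g -Xmx4g -Datlassian.mail.senddisabled=true"  # hypothetical arguments
    if outgoing_mail_disabled(args):
        print("Email is disabled; safe to start the test instance.")
    else:
        print("Email is NOT disabled; fix the system properties before starting.")
```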
Handling a failover
In the event your primary site is unavailable, you'll need to fail over to your standby system. The steps are as follows:
- Ensure your live system is shut down and no longer updating the database
- Ensure the contents of <confluencesharedhome> are synced to your standby instance
- Perform whatever steps are required to activate your standby database
- Start Confluence on one node in the standby instance
- Wait for Confluence to start and check it is operating as expected
- Start up other Confluence nodes
- Update your DNS, HTTP Proxy, or other front end devices to route traffic to your standby server
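The order of the failover steps matters: each one should succeed before the next begins. A runbook can capture that with a small runner that executes steps in sequence and stops at the first failure. The sketch below is a generic illustration; the `Step` type and `run_steps` function are hypothetical, and the placeholder actions stand in for whatever checks or commands your organization uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, and return the completed descriptions."""
    completed = []
    for step in steps:
        if not step.action():
            break
        completed.append(step.description)
    return completed


if __name__ == "__main__":
    # Placeholder actions: replace each lambda with a real check or command.
    failover = [
        Step("Live system is shut down and no longer updating the database", lambda: True),
        Step("Shared home contents are synced to the standby instance", lambda: True),
        Step("Standby database is activated", lambda: True),
        Step("Confluence is started on one standby node", lambda: True),
        Step("Confluence is operating as expected", lambda: False),  # simulated failure
        Step("Other Confluence nodes are started", lambda: True),
        Step("Traffic is routed to the standby site", lambda: True),
    ]
    done = run_steps(failover)
    print(f"Completed {len(done)} of {len(failover)} steps")
```

In this simulated run the runner stops after four steps, so the remaining nodes are never started and traffic is never rerouted to an instance that failed its check.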
Returning to the primary instance
In most cases, you'll want to return to using your primary instance after you've resolved the problems that caused the disaster. This is easiest to achieve if you can schedule a reasonably-sized outage window.
You need to:
- Synchronize your primary database with the state of the secondary
- Synchronize the primary shared home directory with the state of the secondary
Perform the cut over
- Shut down Confluence on the standby instance
- Ensure the database is synchronized correctly and configured as required
- Use rsync or a similar utility to synchronize the shared home directory to the primary server
- Start Confluence
- Check that Confluence is operating as expected
- Update your DNS, HTTP Proxy, or other front end devices to route traffic to your primary server
If you encounter problems after failing over to your standby instance, check these FAQs for guidance:
If your database doesn't have the data available that it should, then you'll need to restore the database from a backup.
Once you've restored your database, the search index will no longer be in sync with the database. You can either do a full re-index, background or foreground, or recover from the latest index snapshot if you have one. This includes the journal id file for each index snapshot. The index snapshot can be older than your database backup; it'll synchronize itself as part of the recovery process.
If the search index is corrupt, you can either do a full re-index, background or foreground, or recover from an earlier index snapshot from the shared home directory if you have one.
You may be able to recover missing attachments from backups if you have them, or from the primary site if you have access to the hard drives. Tools such as rsync may be useful in these circumstances. Missing attachments won't stop Confluence performing normally; the missing attachments won't be available, but users may be able to upload them again.
Application links are stored in the database. If the database replica is up to date, then the application links will be preserved.
You do, however, also need to consider how each end of the link knows the address of the other:
- If you use host names to address the partners in the link and the backup Confluence server has the same hostname, via updates to the DNS or similar, then the links should remain intact and working.
- If the application links were built using IP addresses and these aren't the same, then the application links will need to be re-established.
- If you use IP addresses that are valid on the internal company network but your backup system is remote and outside the original firewall, you'll need to re-establish your application links.
|RPO||Recovery Point Objective||How up-to-date you require your Confluence instance to be after a failure.|
|RTO||Recovery Time Objective||How quickly you require your standby system to be available after a failure.|
|RCO||Recovery Cost Objective||How much you are willing to spend on your disaster recovery solution.|