Confluence Data Center Technical Overview

This page provides information on Confluence Data Center, which is a clustered solution that can provide performance at scale and high availability. This is essential if your Confluence instance is mission critical or has very high load.

How it works

The basics

Confluence Data Center enables you to configure a cluster similar to the one pictured below with:

Multiple server nodes for Confluence that store:
- logs
- caches
- Lucene indexes
- configuration files
- plugins
Multiple server nodes to run Synchrony, which is required for collaborative editing.
A shared file system that stores:
- attachments
- avatars / profile pictures
- icons
- export files
- import files
- plugins
A database that all nodes read and write to.
A load balancer to evenly direct requests to each node.

All Confluence nodes are active and process requests. A user will access the same Confluence node for all requests until their session times out, they log out, or a node is removed from the cluster.

Licensing

Your Data Center license is based on the number of users in your cluster, rather than the number of nodes. You can monitor the available license seats on the License page.

If you want to automate this process (for example, to send alerts when you are nearing full allocation), you can use the REST API.

REST API...

The following GET requests require an authenticated user with system administrator permissions. The requests return JSON.

`<confluenceurl>/rest/license/1.0/license/userCount`	Number of active users
`<confluenceurl>/rest/license/1.0/license/remainingSeats`	Number of users you can add before reaching your license limit
`<confluenceurl>/rest/license/1.0/license/maxUsers`	Maximum number of users allowed by your license
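
The three endpoints above can be polled from a script. The following is a minimal sketch, not an official client: the function names, the 10% alert threshold and the use of HTTP basic authentication are all assumptions. Because the shape of the JSON body isn't described on this page, `fetch_json` returns the parsed response as-is rather than picking out a field.

```python
import base64
import json
import urllib.request

# Paths are taken from the table above; everything else is illustrative.
LICENSE_PATHS = {
    "userCount": "/rest/license/1.0/license/userCount",
    "remainingSeats": "/rest/license/1.0/license/remainingSeats",
    "maxUsers": "/rest/license/1.0/license/maxUsers",
}


def license_url(base_url: str, name: str) -> str:
    """Build the full URL for one of the license endpoints."""
    return base_url.rstrip("/") + LICENSE_PATHS[name]


def fetch_json(url: str, username: str, password: str):
    """GET a URL as a system administrator and return the parsed JSON.

    Basic authentication is an assumption; use whatever your site allows.
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def nearing_limit(remaining_seats: int, max_users: int,
                  threshold: float = 0.10) -> bool:
    """True when fewer than `threshold` of the licensed seats remain."""
    if max_users <= 0:
        return False  # no meaningful limit to compare against
    return remaining_seats / max_users < threshold
```

Inspect the value returned by `fetch_json` for your Confluence version before extracting numbers from it to pass to `nearing_limit`.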

Home directories

Confluence has a concept of a local home and shared home. Each Confluence node has a local home that contains logs, caches, Lucene indexes and configuration files. Everything else is stored in the shared home, which is accessible to each Confluence node in the cluster. Attachments, icons and avatars are stored in the shared home as are export and import files.

Add-ons can choose whether to store data in the local or shared home, depending on the needs of the add-on.

If you are currently storing attachments in your database you can continue to do so, but this is not available for new installations.

Caching

Confluence uses a distributed cache that is managed using Hazelcast. Data is evenly partitioned across all the Confluence nodes in a cluster, instead of being replicated on each node. This allows for better horizontal scalability, and requires less storage and processing power than a fully replicated cache.

Because of this caching solution, to minimize latency, your nodes should be located in the same physical location.

Indexes

A full copy of the Confluence indexes is stored on each Confluence node individually. A journal service keeps each index in sync.

When you first set up your cluster, you will copy the local home directory, including the indexes, from the first node to each new node.

When adding a new Confluence node to an existing cluster, you will copy the local home directory of an existing node to the new node. When you start the new node, Confluence will check if the index is current, and if not, request a recovery snapshot of the index from either the shared home directory, or a running node (with a matching build number) and extract it into the index directory before continuing the start up process. If the snapshot can't be generated or is not received by the new node in time, existing index files will be removed, and Confluence will perform a full re-index.

If a Confluence node is disconnected from the cluster for a short amount of time (hours), it will be able to use the journal service to bring its copy of the index up-to-date when it rejoins the cluster. If a node is down for a significant amount of time (days) its Lucene index will have become stale, and it will request a recovery snapshot from an existing node as part of the node startup process.

If you suspect there is a problem with the index on all nodes, you can temporarily disable index recovery on one node, rebuild the index on that node, then copy the new index over to each remaining node.

Cluster safety mechanism

The ClusterSafetyJob scheduled task runs every 30 seconds in Confluence. In a cluster, this job is run on one Confluence node only. The scheduled task operates on a safety number, a randomly generated number that is stored both in the database and in the distributed cache used across the cluster. The ClusterSafetyJob compares the value in the database with the one in the cache, and if the values differ, Confluence will shut the node down. This situation is known as cluster split-brain. The safety mechanism ensures your cluster nodes cannot get into an inconsistent state.

If cluster split-brain does occur, you need to ensure proper network connectivity between the clustered nodes. Most likely multicast traffic is being blocked or not routed correctly.

This mechanism also exists in standalone Confluence.

Balancing uptime and data integrity

By changing how often the cluster safety scheduled job runs and the duration of the Hazelcast heartbeat (which controls how long a node can be out of communication before it's removed from the cluster), you can fine-tune the balance between uptime and data integrity in your cluster. In most cases the default values will be appropriate, but in some circumstances you may decide, for example, to trade off data integrity for increased uptime.

Here are some examples:

Uptime over data integrity

Cluster safety job	Hazelcast heartbeat	Effect
1 minute	1 minute	You could have network interruptions or garbage collection pauses of up to 1 minute without triggering a cluster panic. However, if two nodes are no longer communicating, conflicting data could be written to the database for up to 1 minute, affecting your data integrity.
10 minutes	30 seconds	You could have network interruptions or garbage collection pauses of up to 30 seconds without nodes being evicted from the cluster. Evicted nodes then have up to 10 minutes to rejoin the cluster before the cluster safety job kicks in and shuts down the problem node. Although this may result in higher uptime for your site, conflicting data could be written to the database for up to 10 minutes, affecting your data integrity.

Data integrity over uptime

Cluster safety job	Hazelcast heartbeat	Effect
15 seconds	15 seconds	Network interruptions or garbage collection pauses longer than 15 seconds will trigger a cluster panic. Although this may result in higher downtime for your site, nodes can only write to the database while out of communication with each other for a maximum of 15 seconds, ensuring greater data integrity.
15 seconds	1 minute	You could have network interruptions or garbage collection pauses of up to 1 minute without nodes being evicted from the cluster. Once a node is evicted, it can only write to the database for a maximum of 15 seconds, minimizing the impact on your data integrity.

To find out how to change the cluster safety scheduled job, see Scheduled Jobs.

You can change the Hazelcast heartbeat default via the confluence.cluster.hazelcast.max.no.heartbeat.seconds system property. See Configuring System Properties.

Cluster locks and event handling

Where an action must only run on one node, for example a scheduled job or sending daily email notifications, Confluence uses a cluster lock to ensure the action is only performed on one node.

Similarly, some actions need to be performed on one node, and then published to others. Event handling ensures that Confluence only publishes cluster events when the current transaction is committed and complete. This is to ensure that any data stored in the database will be available to other instances in the cluster when the event is received and processed. Event broadcasting is done only for certain events, like enabling or disabling an add-on.

Cluster node discovery

When configuring your cluster nodes you can either supply the IP address of each cluster node, or a multicast address.

If you're using multicast:

Confluence will broadcast a join request on the multicast network address. Confluence must be able to open a UDP port on this multicast address, or it won't be able to find the other cluster nodes. Once the nodes are discovered, each responds with a unicast (normal) IP address and port where it can be contacted for cache updates. Confluence must be able to open a UDP port for regular communication with the other nodes.

A multicast address can be auto-generated from the cluster name, or you can enter your own, during the set-up of the first node.

Infrastructure and hardware requirements

The choice of hardware and infrastructure is up to you. Below are some areas to think about when planning your hardware and infrastructure requirements.

AWS Quick Start deployment option

If you plan to run Confluence Data Center on AWS, a Quick Start is available to help you deploy Confluence Data Center in a new or existing Virtual Private Cloud (VPC). You'll get your Confluence and Synchrony nodes, Amazon RDS PostgreSQL database and application load balancer all configured and ready to use in minutes. If you're new to AWS, the step-by-step Quick Start Guide will assist you through the whole process.

Confluence can only be deployed in a region that supports Amazon Elastic File System (EFS). See Running Confluence Data Center in AWS for more information.

Note that if you deploy Confluence using the Quick Start, it will use the Java Runtime Environment (JRE) that is bundled with Confluence (/opt/atlassian/confluence/jre/), and not the JRE that is installed on the EC2 instances (/usr/lib/jvm/jre/).

Servers

We recommend your servers have at least 4GB of physical RAM. A high number of concurrent users means that a lot of RAM will be consumed. You usually don't need to assign more than 4GB per JVM process, but can fine tune the settings as required.

You should also not run any additional applications (other than core operating system services) on the same servers as Confluence. Running Confluence, Jira and Bamboo on a dedicated Atlassian software server works well for small installations but is discouraged when running at scale.

Confluence Data Center can be run successfully on virtual machines. If you're using multicast, you can't run Confluence Data Center in Amazon Web Services (AWS) environments as AWS doesn't currently support multicast traffic.

Cluster nodes

Your Data Center license does not restrict the number of nodes in your cluster. We have tested the performance and stability with up to 4 nodes.

Each node does not need to be identical, but for consistent performance we recommend they are as close as possible. All cluster nodes must:

be located in the same data center
run the same Confluence version (for Confluence nodes) or the same Synchrony version (for Synchrony nodes)
have the same OS, Java and application server version
have the same memory configuration (both the JVM and the physical memory) (recommended)
be configured with the same time zone (and keep the current time synchronized). Using ntpd or a similar service is a good way to ensure this.

You must ensure the clocks on your nodes don't diverge, as it can result in a range of problems with your cluster.

Database

The most important requirement for the cluster database is that it have sufficient connections available to support the number of nodes.

For example, if:

each Confluence node has a maximum pool size of 20 connections
each Synchrony node has a maximum pool size of 15 connections (the default)
you plan to run 3 Confluence nodes and 3 Synchrony nodes

your database server must allow at least 105 connections to the Confluence database. In practice, you may require more than the minimum for debugging or administrative purposes.
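
The arithmetic in the example above can be written out as a short calculation. This is only a sketch: the function name and the optional `headroom` parameter (extra connections for debugging or administration) are illustrative, not Confluence settings.

```python
def min_db_connections(confluence_nodes: int, confluence_pool: int,
                       synchrony_nodes: int, synchrony_pool: int,
                       headroom: int = 0) -> int:
    """Minimum connections the database must allow for the whole cluster."""
    return (confluence_nodes * confluence_pool
            + synchrony_nodes * synchrony_pool
            + headroom)


# 3 Confluence nodes x 20 + 3 Synchrony nodes x 15 = 105
print(min_db_connections(3, 20, 3, 15))
```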

You should also ensure your intended database is listed in the current Supported Platforms. The load on an average cluster solution is higher than on a standalone installation, so it is crucial to use a supported database.

You must also use a supported database driver. Collaborative editing will fail with an error if you're using an unsupported or custom JDBC driver (or driverClassName in the case of a JNDI datasource connection). See Database JDBC Drivers for the list of drivers we support.

Shared home directory and storage requirements

All Confluence cluster nodes must have access to a shared directory in the same path. NFS and SMB/CIFS shares are supported as the locations of the shared directory. As this directory will contain a large amount of data (including attachments and backups), it should be generously sized, and you should have a plan for how to increase the available disk space when required.

Load balancers

We suggest using the load balancer you are most familiar with. The load balancer needs to support ‘session affinity’ and WebSockets. This is required for both Confluence and Synchrony. If you're deploying on AWS you'll need to use an Application Load Balancer (ALB).

Here are some recommendations when configuring your load balancer:

Queue requests at the load balancer. By making sure the maximum number of requests served to a node does not exceed the total number of HTTP threads that Tomcat can accept, you can avoid overwhelming a node with more requests than it can handle. You can check the maxThreads in <install-directory>/conf/server.xml.
Don't replay failed idempotent requests on other nodes, as this can propagate problems across all your nodes very quickly.
Using least connections as the load balancing method, rather than round robin, can better balance the load when a node joins the cluster or rejoins after being removed.
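
To compare your load balancer's per-node limit with Tomcat's thread pool, you can read `maxThreads` out of `server.xml`. The following is a minimal sketch using only the Python standard library. The sample XML is a cut-down, hypothetical connector definition and the function name is illustrative. Connectors that don't set the attribute are reported as `None` rather than guessing Tomcat's default.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down server.xml used only for illustration.
SAMPLE_SERVER_XML = """
<Server port="8000" shutdown="SHUTDOWN">
  <Service name="Tomcat-Standalone">
    <Connector port="8090" maxThreads="48"
               protocol="org.apache.coyote.http11.Http11NioProtocol"/>
  </Service>
</Server>
"""


def connector_max_threads(server_xml: str) -> dict:
    """Map each Connector's port to its maxThreads (None if not set)."""
    root = ET.fromstring(server_xml)
    result = {}
    for connector in root.iter("Connector"):
        value = connector.get("maxThreads")
        result[connector.get("port")] = int(value) if value is not None else None
    return result


print(connector_max_threads(SAMPLE_SERVER_XML))
```

For a real installation, read `<install-directory>/conf/server.xml` from disk and pass its contents to the function.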

Many load balancers require a URL to constantly check the health of their backends in order to automatically remove them from the pool. It's important that this URL is stable and fast, and lightweight enough not to consume unnecessary resources. The following URL returns Confluence's status and can be used for this purpose.

URL	Expected content	Expected HTTP Status
http://<confluenceurl>/status	{"state":"RUNNING"}	200 OK

See all status codes and responses...

HTTP Status Code	Response entity	Description
200	`{"state":"RUNNING"}`	Running normally
500	`{"state":"ERROR"}`	An error state
503	`{"state":"STARTING"}`	Application is starting
503	`{"state":"STOPPING"}`	Application is stopping
200	`{"state":"FIRST_RUN"}`	Application is running for the first time and has not yet been configured
404		Application failed to start up in an unexpected way (the web application failed to deploy)
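
A health check only needs to treat one row of this table as healthy. The sketch below uses hypothetical function names and marks a node as able to serve traffic only when the response is HTTP 200 with the state RUNNING. Treating FIRST_RUN as not ready is a choice made for this example, since that node has not been configured yet.

```python
import json


def node_state(body: str):
    """Return the "state" value from a /status response body, or None."""
    try:
        parsed = json.loads(body)
    except (TypeError, ValueError):
        return None  # empty or non-JSON body, e.g. the 404 case
    return parsed.get("state") if isinstance(parsed, dict) else None


def is_healthy(status_code: int, body: str) -> bool:
    """True only for HTTP 200 with {"state":"RUNNING"}."""
    return status_code == 200 and node_state(body) == "RUNNING"
```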

Here are some recommendations, when setting up monitoring, that can help a node survive small problems, such as a long GC pause:

Wait for two consecutive failures before removing a node.
Allow existing connections to the node to finish, for say 30 seconds, before the node is removed from the pool.
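
The first recommendation can be sketched as a small counter. This is illustrative only: load balancers implement this logic themselves, and the class name is hypothetical. The default threshold of two comes from the recommendation above, not from any product's configuration. Connection draining (the second recommendation) is left to the load balancer.

```python
class FailureTracker:
    """Remove a node only after N consecutive failed health checks."""

    def __init__(self, failures_before_removal: int = 2):
        self.failures_before_removal = failures_before_removal
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True if the node should be removed."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_before_removal
```

With the default threshold, a single failed check followed by a success (for example, during a long GC pause) leaves the node in the pool.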

Network adapters

Use separate network adapters for communication between servers. Cluster nodes should have a separate physical network (i.e. separate NICs) for inter-server communication. This is the best way to get the cluster to run fast and reliably. Performance problems are likely to occur if you connect cluster nodes via a network that has lots of other data streaming through it.

Additional requirements for collaborative editing

Collaborative editing in Confluence 6.0 and later is powered by Synchrony, which runs as a separate process. You can deploy Synchrony on the same nodes as Confluence, or in its own cluster with as many nodes as you need.

If you choose to run Synchrony on the same nodes as Confluence, you will need at least 2 GB of additional memory (the default maximum heap size for Synchrony is 1 GB).

Your load balancer (and any other proxies) must support WebSocket connections and session affinity.

Additional requirements for high availability

Confluence Data Center removes the application server as a single point of failure. You can further minimize single points of failure by ensuring your load balancer, database and shared file system are also highly available.

User management

You can manage users in Confluence's internal directory, in an external LDAP directory, or in Atlassian Crowd or Jira.

You can also connect Confluence Data Center to a SAML 2.0 identity provider for authentication and single sign-on (only available to Confluence Data Center).

Plugins and add-ons

The process for installing add-ons in Confluence Data Center is the same as for a standalone instance of Confluence. You will not need to stop the cluster, or bring down any nodes to install or update an add-on.

The Atlassian Marketplace indicates add-ons that are compatible with Confluence Data Center.

Add-on licenses for Data Center are sold at the single server rate, but must match or exceed your Confluence Data Center license tier. For example, if you are looking to have 3,000 people using Confluence Data Center, then you would buy any add-ons at the 2,001-10,000 user tier.

If you have developed your own plugins for Confluence you should refer to our developer documentation on How do I ensure my add-on works properly in a cluster? to find out how you can confirm your plugin is cluster compatible.

Ready to get started?

Contact us to speak with an Atlassian or get going with Data Center straight away.

For help with installation, take a look at Installing Confluence Data Center.
