Confluence Data Center sample deployment and monitoring strategy

At Atlassian, we use several enterprise-scale development tools. Our Confluence Data Center instance is used by about 2,400 full-time Atlassian employees, globally. 

How we use Confluence

Collaboration and open communication are vital to our culture, and much of this collaboration happens in Confluence. As of April 2018, we had 14,930,000 content items across 6,500 total spaces. In a six-hour snapshot of the instance's traffic, we saw an average of 341,000 HTTP calls per hour (with one hour peaking at 456,000 HTTP calls). Based on our Confluence Data Center load profiles, we'd categorise this instance as Large for both content and traffic.


These details were collected on April 30, 2018.


Content

Statistic Value
Attachments 2,424,000
Comments 1,976,000
Pages 686,000
Blog posts 60,000
Drafts 32,500

Scale

Statistic Value
Total spaces 6,500
Site spaces 1,500
Personal spaces 5,000
Content (All versions) 14,930,000
Content (Current versions) 6,350,500
Local users 11,000
Local groups 8,500


Good to know:

  • Content (All versions) is the total number of all versions of all pages, blog posts, comments, and files in the instance. It's the total number of rows in the CONTENT table in the Confluence database.
  • Content (Current versions) is the number of pages, blog posts, comments, and files in the instance. It doesn't include historical versions of pages, blog posts, or files.
  • Local users is the number of user accounts from local and remote directories that have been synced to the database. It includes both active and inactive accounts.

Load

The following statistics are from 9:00am to 3:00pm (Sydney time) of a typical workday in April 2018. These hours capture peaks and dips in our Confluence usage, showing the kind of traffic the instance gets.

Statistic Average Peak
HTTP calls per hour 341,000 456,000
Concurrent users per hour 222 277

The Large traffic profile covers 350,000 to 700,000 HTTP calls per hour. The average here is just under the threshold, but we have a one-hour peak of 456,000. The instance accommodates both average and peak traffic, so we'd classify its load as Large.

Our Confluence Data Center instance is used heavily during working hours in each of our locations, so to keep all Atlassians happy and productive, we need to make sure it performs. The instance is highly available, so an individual node failure won't be fatal to the instance.  This means we can focus more on maintaining acceptable performance, and less on keeping the site running. 

Infrastructure and setup


Our Confluence Data Center instance is hosted on an Amazon Web Services (AWS) Virtual Private Cloud, and is made up of the following:

Function Instance type Number
Load balancer AWS Application Load Balancer 2
Confluence application c5.2xlarge (running Amazon Linux) 4
Synchrony application (for collaborative editing) c5.large (running Amazon Linux) 2
Database (Amazon RDS PostgreSQL) m4.xlarge 1
Shared home directory Elastic File System (2.1 TiB) 1

Load distribution is managed by a proprietary Virtual Traffic Manager (VTM) and an application load balancer. The Atlassian VTM performs two functions:

  1. Routing traffic between different Atlassian instances on the same domain, and
  2. Terminating SSL for the Confluence Data Center instance. 

The Synchrony cluster has two nodes, has XHR fallback enabled, and does not use the internal Synchrony proxy. The instance also uses 2.1 TiB of Amazon Elastic File System for storage.

Refer to the AWS documentation on Instance Types (specifically, General Purpose Instances and Compute-Optimized Instances) for details on each node type.

Integrated services

This Confluence Data Center instance has 70 user-installed apps (also known as add-ons) enabled, and is linked to the following mix of Atlassian applications:

  • 5 x Jira Software and Service Desk
  • 7 x Confluence
  • 2 x Bitbucket
  • 11 x Bamboo 
  • 2 x Fisheye / Crucible

This instance is also connected to Crowd, for user management.

Monitoring strategy

The strategies described here are tailored to Atlassian's business needs and available resources. Your enterprise environment might have unique needs and quirks that require different strategies. Consult an Atlassian Technical Account Manager for guidance.

Our performance monitoring strategy targets an Apdex of 0.7 and aims to keep it above 0.4. This index treats a 1-second response time as our Tolerating threshold, and anything beyond 4 seconds as Frustrated.

With the Apdex index, maintaining general satisfaction in the instance involves managing the ratio of "happy" and "unhappy" users. Refer to Apdex overview for more information.
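To make these thresholds concrete, here is a minimal Python sketch of how an Apdex score can be computed from response times. It uses the standard Apdex formula (satisfied samples plus half of the tolerating samples, divided by all samples). The function names, the status labels, and the sample response times are hypothetical; only the 1-second and 4-second thresholds and the 0.7 and 0.4 levels come from this page.

```python
def apdex(response_times_s, tolerating_after_s=1.0, frustrated_after_s=4.0):
    """Return the Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("at least one response time is required")
    satisfied = sum(1 for t in response_times_s if t <= tolerating_after_s)
    tolerating = sum(
        1 for t in response_times_s
        if tolerating_after_s < t <= frustrated_after_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


def apdex_status(score, target=0.7, floor=0.4):
    """Label a score against the target and floor (labels are illustrative)."""
    if score >= target:
        return "on target"
    if score > floor:
        return "below target"
    return "at or below floor"


# Hypothetical sample: 6 satisfied, 3 tolerating, 1 frustrated response.
samples = [0.2, 0.4, 0.5, 0.7, 0.9, 1.0, 1.5, 2.5, 3.9, 6.0]
score = apdex(samples)          # (6 + 3 / 2) / 10 = 0.75
status = apdex_status(score)    # "on target"
```

In this sketch a frustrated response contributes nothing to the score, which is why a handful of very slow requests can pull the index down quickly.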

Maintaining an Apdex level within Tolerating levels means actively monitoring the instance for potential problems that could cause major slowdowns. Many of these alerts, monitoring strategies, and action plans are based on previous incidents we've since learned to resolve quickly or avoid. 

The following tables list our monitoring alerts, and what we do when they're triggered. 

General load

Metric we track Alerting level What we do when alert is triggered

Long-running tasks
Some tasks can cause memory problems when the space is too large.

Our monitoring tools send us an alert if a user starts any of these tasks:

We also receive an alert for each hour the task continues to run.

If the task appears to be stuck and starts triggering other alerts, we'll usually restart the node and kill the task.

We'll also contact the user to discuss other options. For space exports, this could mean reducing the size of the exported space, or exporting from a staging server.

Network throughput
We check this to gauge instance health and detect any suspicious external activity (for example, DDoS).

20 Mbps or higher (as of April 2018)

We investigate other metrics to see if anything (other than high user traffic) caused the increased throughput. Over time, we check how many times the throughput triggers the alert to see whether we need to tweak the infrastructure (and the alerting level) again.

Number of active database connections
Having too many active database connections can slow down Confluence. 

More than 1,000 connections

The m4.xlarge node type supports a maximum of 1,320 connections. If the number of connections triggers the alert and keeps rising, we'll perform a rolling restart.

We'll also raise a ticket against Confluence, as a bug (specifically, a database connection leak) could also trigger this alert. To date, our instance has never triggered this alert.
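The relationship between the alerting level and the connection limit can be sketched as follows. This is only an illustration: the function name is hypothetical, and the two constants are the figures quoted in this row rather than values read from the database.

```python
MAX_CONNECTIONS = 1320   # limit quoted above for the m4.xlarge database node
ALERT_THRESHOLD = 1000   # alerting level quoted above ("more than" this value)


def connection_status(active_connections):
    """Return (alert_fired, remaining_headroom) for a connection count."""
    if active_connections < 0:
        raise ValueError("connection count cannot be negative")
    alert_fired = active_connections > ALERT_THRESHOLD
    headroom = MAX_CONNECTIONS - active_connections
    return alert_fired, headroom


# At the moment the alert first fires, 319 connections remain before the limit.
alert, headroom = connection_status(1001)
```

Alerting well below the hard limit leaves time to investigate a possible connection leak before new connections start being refused.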

Node CPU usage
High CPU use on either database or application nodes can indicate node-specific problems. These problems could slow down the instance.

We set two CPU usage alerts for application nodes:

  • 50% (warning)
  • 70% (error)

Likewise, for the database node:

  • 80% (warning)
  • 95% (error)

When an application node triggers its warning alert, we perform a heap dump and thread dump and investigate further. We perform a rolling restart if we think that instance is about to crash.

To date, the database node has always recovered on its own whenever it triggered its alerts.
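The two-tier CPU thresholds above could be expressed as a small lookup, as in the sketch below. The dictionary and function names are hypothetical, and treating a reading exactly at the threshold as triggering the alert is an assumption; the page only lists the percentages.

```python
# (warning, error) CPU percentages per node kind, as listed above.
CPU_THRESHOLDS = {
    "application": (50.0, 70.0),
    "database": (80.0, 95.0),
}


def cpu_alert_level(node_kind, cpu_percent):
    """Return "ok", "warning", or "error" for a CPU reading.

    Assumption: a reading equal to a threshold triggers that alert.
    """
    warning, error = CPU_THRESHOLDS[node_kind]
    if cpu_percent >= error:
        return "error"
    if cpu_percent >= warning:
        return "warning"
    return "ok"


# The same reading maps to different levels depending on the node kind.
app_level = cpu_alert_level("application", 75.0)   # "error"
db_level = cpu_alert_level("database", 75.0)       # "ok"
```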

Garbage Collector pauses
Long pauses can cause problems with cluster membership.

We also track this metric for development feedback. Pauses help us identify areas of Confluence that unnecessarily create a lot of objects. We analyze those areas for ways to improve Confluence.

Any Garbage Collector pause that lasts longer than 5 seconds

Usually this alert requires no action, but it can help warn us of possible outages. The data we collect here also helps us diagnose root causes of other outages.

If the Garbage Collector triggers this alert frequently, we check if the instance requires heap tuning.

Load balancer

Metric we track Alerting level What we do when alert is triggered

Number of timeouts
Requests time out when back end nodes are unable to process them.

300 timeouts within a 1-hour period.

This many timeouts in a short amount of time is typically followed by other alerts. We investigate any triggered alerts and other metrics to see if an outage or similar problem is imminent.
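One way to evaluate a "300 timeouts within a 1-hour period" rule is a sliding window over timeout timestamps. The sketch below is an assumption about how such a rule might be implemented, not a description of our monitoring tools; the class name is hypothetical, and it expects timestamps (in seconds) to arrive in non-decreasing order.

```python
from collections import deque


class TimeoutAlert:
    """Sliding-window counter for load balancer timeouts (illustrative)."""

    def __init__(self, limit=300, window_s=3600):
        self.limit = limit
        self.window_s = window_s
        self._events = deque()

    def record(self, timestamp_s):
        """Record one timeout; return True if the window holds `limit` or more."""
        self._events.append(timestamp_s)
        cutoff = timestamp_s - self.window_s
        # Drop timeouts that are at least one full window older than this one.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) >= self.limit


# 300 timeouts arriving one second apart: only the 300th trips the alert.
burst = TimeoutAlert()
burst_results = [burst.record(t) for t in range(300)]

# 300 timeouts arriving 13 seconds apart never fit inside a single hour.
slow = TimeoutAlert()
slow_results = [slow.record(13 * i) for i in range(300)]
```

A sliding window avoids the blind spot of fixed hourly buckets, where a burst that straddles the top of the hour could be split across two buckets and missed.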

Node health
The load balancer regularly checks the health of cluster nodes. It disconnects nodes that fail the check, and re-connects them if they pass the check later. An outage occurs when all nodes are disconnected at the same time.

See Load balancer configuration options for related details.

Whenever the load balancer disconnects or re-connects a node.

When a node disconnects, we check its state and restart it if necessary. We also check for other triggered alerts to see what could have caused the node to disconnect.
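The alerting rule for node health can be sketched as a small state tracker: every change in a node's connection state produces an event, and an outage is reported when no node remains connected. The class, method, node names, and event labels below are hypothetical; the sketch only mirrors the behaviour described in this row.

```python
class ClusterHealth:
    """Track which nodes the load balancer currently has connected."""

    def __init__(self, nodes):
        self.connected = {node: True for node in nodes}

    def update(self, node, passed_check):
        """Apply one health-check result; return the event to alert on, if any."""
        was_connected = self.connected[node]
        self.connected[node] = passed_check
        if was_connected == passed_check:
            return None  # no state change, so nothing to alert on
        if not any(self.connected.values()):
            return "outage"  # every node is disconnected at the same time
        return "reconnected" if passed_check else "disconnected"


cluster = ClusterHealth(["node-1", "node-2"])
events = [
    cluster.update("node-1", False),  # first node fails its check
    cluster.update("node-2", False),  # last remaining node fails
    cluster.update("node-1", True),   # first node passes again
    cluster.update("node-1", True),   # still healthy, no change
]
```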

Internal server errors
When the load balancer returns an internal server error (error code 500), it usually indicates that there are no back end nodes to process a request.

When the load balancer encounters more than 10 internal server errors in a second.

We check for other triggered alerts to see if there's a problem in a specific subsystem (for example, database or storage).

Storage

Metric we track Alerting level What we do when alert is triggered
I/O on file system of shared home
High I/O slows down file access, which can also lead to timeouts.

PercentIOLimit is greater than 98. This means that the shared home's file system I/O is now over 98% of its limit. See Monitoring with Amazon CloudWatch for more details.

We investigate whether we need to increase the I/O limit.

Disk space
We need to know early if we should expand disk space to accommodate usage or check further for storage-related issues.

We set two alerts for different levels of free disk space:

  • 30% (warning)
  • 10% (error)


If the amount of free space is running low, but the rate of consumption remains normal, we expand the available storage.

If the rate of disk consumption spikes abnormally, we check if there are misbehaving processes. This also involves checking the amount of disk space consumed within the last 24 hours.
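The two free-space alert levels can be checked with a short script such as the sketch below, which uses Python's standard `shutil.disk_usage`. The function names are hypothetical, and treating a reading exactly at a threshold as triggering the alert is an assumption. The example checks the current directory of a local volume purely for illustration.

```python
import shutil


def classify_free_space(free_bytes, total_bytes, warning_pct=30.0, error_pct=10.0):
    """Map free space to an alert level using the two levels listed above."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct <= error_pct:
        return "error"
    if free_pct <= warning_pct:
        return "warning"
    return "ok"


def check_path(path):
    """Classify the volume that holds `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.free, usage.total)


level = check_path(".")  # result depends on the machine running the sketch
```

Keeping the classification separate from the disk query makes the thresholds easy to test without touching a real file system.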

Stability monitoring

The following alerts relate to the instance's overall stability. They don't get triggered as often, as most of the work we do to address performance problems also prevents outages.

Metric we track Events we alert for What we do when alert is triggered

Common error conditions
Whenever the instance suffers an outage, we check the logs for any new errors that led up to it. If we determine that a specific error signals an imminent outage, we add alerts for it.

Common errors we set alerts for include (but aren't limited to):

Most of the errors we alert for are documented in the Confluence knowledge base, along with how to prevent them from causing an outage.

Hazelcast cluster membership
By default, Hazelcast removes and re-adds a node if it doesn't send a heartbeat within 30 seconds. We configured ours to do this within 60 seconds instead.

When a node is removed from and re-added to the cluster.

We perform a log analysis, thread dump, and heap dump on the affected node. If any of these show a cluster panic or outage is imminent, we perform a rolling restart.

Cluster panics
These are severe failures, and usually require a full Confluence restart.

Every time a cluster panic occurs. Our monitoring tools check the application logs for events that correlate with panics the instance has experienced in the past.

We perform a full Confluence restart. This means shutting down all application nodes, and starting them up one by one.

See Data Center Troubleshooting and Confluence Data Center Cluster Troubleshooting for related information.

Non-alerted metrics

We don't set any alerts for the following metrics, but we monitor them regularly for abnormal spikes or dips. We also check them whenever other alerts get triggered, or if the Apdex starts to drop.

Metric we track Monitoring practice

JVM memory
We regularly monitor the instance to ensure that the JVM doesn't run out of memory. See Confluence crashes due to 'OutOfMemoryError Java heap space' error for related details.

When the JVM starts running low on memory, we perform a heap dump and thread dump. We perform a rolling restart if either dump shows that a crash is imminent.


Number of active HTTP threads
Having too many threads in constant use can indicate an application deadlock.

If abnormal spikes in this metric coincide with deadlocks or similar problems, we restart each affected node.

If they don't, we just tune thread limits or take thread dumps for further investigation. See Health Check: Thread Limit for related information.

We're here to help

Over time, we may re-configure our instance and monitoring strategies based on changing needs, application improvements, or new insights. Contact an Atlassian Technical Account Manager for more guidance on deploying your own Data Center instance and forming a compatible monitoring strategy.

Our Premier Support team performs health checks by meticulously analyzing your application and logs, to ensure that your application's deployment fully meets the needs of your users. If the health check process reveals any performance gaps, Premier Support will recommend possible changes to your deployment.


Last modified on Jul 16, 2018
