Confluence 3.5 has reached end of life
This page gives guidelines for operational management teams who are responsible for a large Confluence installation, or for a Confluence installation which is crucial to the business of their organisation.
Introduction to this Page
Motivation for Presenting these Guidelines
Most Confluence installations start off small. Ten people in an early-adoption department use it for a couple of weeks. Everything works well and the good news starts spreading. Adoption increases throughout the organisation. More and more people use the wiki, and more and more rely on Confluence being up and running. After a while even the CEO starts blogging. And then a system outage occurs.
Now what?
Wikis like Confluence often grow into mission-critical applications within just a few months. Often adoption is so fast that IT departments haven't had the time to scale up their support.
We have assembled some requirements to help you make sure that your installation of Confluence can be mission critical. There are no surprises to be found here — all of the requirements would apply to any other piece of software that is mission critical within your organisation.
Who should Read these Guidelines?
The guidelines do not apply to you if you are using Confluence with just a few dozen users, and no one really minds if Confluence is down for a couple of hours because your database has crashed.
But if any one of the following applies to you, then these guidelines are a must read for you!
- The wiki has become your organisation's documentation base.
- Your users can't work properly when Confluence is down.
- Your boss or customer threatens to terminate your contract if you don't meet a strict service level agreement (SLA), such as 99.9% availability.
Requirements of Large or Mission-Critical Confluence Installations
Dedicated Hardware for Confluence
In a small work group with a few dozen or even hundreds of users, your Confluence installation can happily share the CPUs, memory and disks with other low-profile applications and a database.
But with thousands or even tens of thousands of users, you need dedicated hardware that runs Confluence and nothing else, and it needs to be fast hardware with plenty of RAM. While you can run Confluence in a virtualised environment such as VMware, we suggest you don't do it for mission-critical or high-load installations unless you are a real expert in virtualisation. Otherwise your other VMs might have performance problems which propagate to Confluence.
If you experience database-related problems, you should consider moving the Confluence database to a dedicated machine. Confluence itself can run queries that impact the performance of other applications, and problems or scheduled tasks in other applications can have an adverse effect on the usability of Confluence.
Dedicated Qualified Staff
If your Confluence installation is mission critical and your service level agreements require 24/7 uptime, you need to be able to pinpoint problems quickly. You need qualified staff, dedicated to looking after Confluence, who are available during business hours and possibly beyond.
If you require assistance from the Atlassian Support team, you may need to answer some pretty technical questions to help us diagnose what is going on in your systems. Also keep in mind that Atlassian support assists you in finding problems in Confluence, but we can't help you administer your systems.
In particular, we recommend that you have dedicated staff in the roles listed below.
Operations Team with General Administrators
If your organisation relies on Confluence being up and running around the clock with very little downtime, you need people who can set up, maintain, tune and improve your Confluence installation. This requires at least one person, but ideally you will have a team of operational engineers.
If your wiki is mission critical, chances are that other IT systems within your organisation have already made it necessary to have such an operations team. So you will probably not need to hire someone specifically to administer Confluence. But it is vital that supporting and maintaining Confluence is added to the list of responsibilities of that operations team, and that you can get them to troubleshoot and analyse Confluence at short notice.
If problems arise and you need to contact Atlassian Support, these engineers will be our first point of contact. We may ask them to provide details of log files, application-server settings, monitoring systems, and so on.
Network Staff
If Confluence is mission critical for large numbers of users, it is vital that you have dedicated network staff available to track down problems when they arise.
A mission-critical installation will usually be used by hundreds or even thousands of users, and you don't want to keep them waiting because a network card breaks, or because someone has made an undocumented change to the network and you don't have an expert around who can figure it out.
Again, this only applies to mission-critical systems. If you use Confluence for less critical collaboration and knowledge sharing, and a broken network cable causing a day's downtime is no major catastrophe, then you will not need dedicated networking staff.
Database Staff
If Confluence is mission critical for a large number of users, you need an experienced database administrator (DBA) available to troubleshoot database performance issues and other potential problems. It is dangerous not to have an experienced full-time DBA at hand at short notice when running a mission critical application. While small installations of Confluence basically work 'out of the box', any system that involves high load or high-availability requirements needs continual monitoring, optimising and fine tuning of the Confluence database. Database monitoring is no trivial task — it's not something that anyone can learn quickly.
Developers
You may have decided to customise Confluence by changing its source code, or by writing your own plugins. If your server is mission critical, you must nominate staff who will be responsible for that code, and they must be up to the task. Otherwise you might end up in a situation in which your server experiences downtime because custom code is broken, or no longer works with a newer version of Confluence, but you can't fix the problem because no one knows how the customised code works, and you can't uninstall it either because it has become critical for your Confluence usage pattern. Keep good track of changes, and have someone available to jump into action if there is a problem. Don't let the summer intern write mission-critical plugins, unless you have more senior staff to maintain that code as long as it is in use.
Constant Monitoring of Production Systems
You will need to monitor your production systems constantly.
When the wiki is the lifeblood of your organisation, you need to know exactly what is going on inside, so that you can plan for future needs and analyse potential bottlenecks.
Monitoring involves a number of essential tasks, including those listed below:
- Monitoring log files.
- Checking for HTTP-availability and performance (e.g. by getting the same page every five minutes and displaying the time on a graph).
- Looking at many different parameters such as load, connections, IO, database-trends, and so on.
- Charting long-term trends.
- Keeping an access log of requests to the web server. This is vital, especially when requesting performance-related support from Atlassian.
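The HTTP availability check in the list above could be sketched roughly as follows. This is a minimal Python sketch, not the sensor Atlassian uses: the function names, the CSV file name and its column layout are assumptions made for illustration, and the five-minute default interval is taken from the example in the list.

```python
import csv
import time
import urllib.request


def time_fetch(url, timeout=30, opener=urllib.request.urlopen, clock=time.monotonic):
    """Fetch url once and return (ok, seconds_elapsed).

    ok is False if the request raised an OSError, which covers
    urllib.error.URLError (including HTTP error statuses) and timeouts.
    """
    start = clock()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # download the whole page, not just the headers
        ok = True
    except OSError:
        ok = False
    return ok, clock() - start


def record_samples(url, count, interval_seconds=300,
                   path="response_times.csv", sleep=time.sleep, **fetch_kwargs):
    """Append `count` rows of 'unix_time,ok,seconds' to a CSV file (hypothetical layout)."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for i in range(count):
            ok, elapsed = time_fetch(url, **fetch_kwargs)
            writer.writerow([int(time.time()), int(ok), f"{elapsed:.3f}"])
            handle.flush()  # keep the file usable if the script is interrupted
            if i < count - 1:
                sleep(interval_seconds)
```

For example, `record_samples("http://wiki.example.com/display/DOC/Home", count=12)` would take twelve samples five minutes apart; the URL is a placeholder. The resulting file can be charted with whatever graphing tool you already use.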
Monitoring a web application like Confluence also means monitoring the subsystems it uses. Many outages are caused by broken mail servers, databases running out of space, file systems filling up and so on. It is often possible to detect these trends well before the web application itself breaks down. Keep an eye on the file system: if you see it approaching 90% utilisation, you can fix the situation before Confluence breaks down. And even in the worst case (e.g. the database breaks down and Confluence is affected straight away), proper monitoring of the database server makes troubleshooting a lot easier.
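As an illustration of the file system check described above, here is a minimal Python sketch. The function names and the example paths are hypothetical; the 90% threshold comes from the paragraph above. The used fraction is computed from `shutil.disk_usage` and may differ slightly from what your operating system's own tools report.

```python
import shutil


def disk_utilisation(path, usage=shutil.disk_usage):
    """Return the used fraction (0.0 to 1.0) of the file system holding path."""
    total, used, _free = usage(path)
    return used / total


def paths_over_threshold(paths, warn_at=0.90, usage=shutil.disk_usage):
    """Return (path, used_fraction) pairs for every path at or above warn_at."""
    warnings = []
    for path in paths:
        fraction = disk_utilisation(path, usage)
        if fraction >= warn_at:
            warnings.append((path, fraction))
    return warnings
```

A scheduled job could call `paths_over_threshold` with the Confluence home directory and the database data directory (wherever those live on your system) and send an alert if the returned list is not empty.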
Tools for Monitoring Confluence
At Atlassian we use Hyperic. But the list of monitoring systems is long and we can't recommend one specific product over another. If your organisation has a monitoring system already, make sure you hook up Confluence to it. If you don't have a monitoring system yet, you need to install one as soon as you feel Confluence is mission critical.
As an example of what our monitoring UI looks like, have a look at this screenshot:
The following screenshot shows one of our sensors looking at the HTTP response times of our documentation wiki over the last 8 days. You can clearly see an incident four days ago. Having the graph (and regularly looking at it) allowed us to pinpoint the problem. We analysed the access logs and found that webpage-profiling had been enabled but not disabled again, which caused performance problems.
This page would get too long if we described all our monitoring sensors - but just to give you an impression, this is what we monitor on the JVM level alone.
JVM basics
- Current Loaded Classes
- Daemon Thread Count
- Heap Memory Committed
- Heap Memory Max
- Heap Memory Used
- Loaded Classes
- Loaded Classes per Minute
- Object Pending Finalization Count
- Peak Thread Count
- Thread Count
- Unloaded Classes
- Unloaded Classes per Minute
JVM garbage collection
- Collection Count
- Collection Count per Minute
- Collection Time
- Collection Time per Minute
JVM memory: (Metrics for Eden space, Old Gen, Survivor space, Perm Gen)
- Committed Memory
- Used Memory
We get the same level of detail for our database, for the file system, for the CPU, for the network, and so on. Not all of this is needed all the time. But if your company depends on an application, then the more information you have at your fingertips the better. Fortunately these metrics can be extracted quite easily once you have a monitoring system in place.
Adherence to Strict Upgrade Procedures
Your organisation will have its own upgrading procedure. Here are a few recommendations that you should add to your list:
- Our main recommendation: Never change more than one component at a time. Sometimes it may be tempting to upgrade the server hardware when you upgrade Confluence, but we recommend you don't do that. It makes pinpointing errors much more difficult. So, for example, don't upgrade hard disks in conjunction with a Confluence version upgrade, don't change the Confluence configuration at the same time as you upgrade your Apache software, and don't upgrade a major third-party plugin the day you move your database system to a new machine. The list is endless; these are just a few examples to get you thinking.
- After each upgrade step, run Confluence for a couple of days to check that everything is still fine.
- Keep track diligently of what you change, and when. It will be nearly impossible for us to help you if you can't tell us what exactly you changed at what time.
- Keep a copy of all log files produced during the upgrade, together with notes about what changed between successive restarts.
Always take careful note of the upgrade notes published with the Release Notes of each Confluence version, as well as the Confluence Upgrade Guide.
Example
Here you can see an extract of our change log for http://confluence.atlassian.com, the server that hosts this very page.
| Sydney time | Server time | Event | Reason/Purpose (including JIRA issues) |
|---|---|---|---|
| | 2008-03-25 22:18 | Started upgrade to 2.8-m9-r3 (build #1314) | |
| | 2008-03-25 22:25 | App server brought down due to failed database upgrade | |
| | 2008-03-26 00:51 | Server brought back up after database restored from backup. Running 2.8-m9-r3. | |
| | 2008-03-28 04:18 | GC algorithm changed from concurrent to parallel collector. Max heap increased from 1.4 GB to 2.0 GB | |
| | 2008-04-24 | Hyperic agent started with connection to Resin. | |
| | 2008-05-08 20:30 - 22:30 | Manual updates to menu.css, comments.js and comments.css in webapp | Temporary fix for @JIRA, @JIRA which was impacting performance |
| | 2008-05-12 | Updated cache sizes for five caches, bounced server. | Cache efficiency was low on these caches. |
| 2008-05-13 18:00-18:20 | 2008-05-13 03:00-03:20 | Upgrade from Resin 3.0 to Tomcat 5.5 | |
| 2008-05-14 16:30-17:00 | | Upgrade from Confluence 2.8.1-rc2 to 2.8.1-rc3 | |
| | 2008-05-14 20:30 | Install new cronjob as j2ee for automating access log analysis | |
Testing of Upgrades before Production Implementation
You should test upgrades in a staging environment.
Before rolling out a new version of Confluence (or of the software or hardware that it uses, e.g. database systems, application servers, data storage), make sure that you test the upgrade with real data (e.g. a database dump) on a completely independent machine.
Here's an example of what such a test would pick up: The new release of Confluence may not be compatible with a custom third-party plugin you have previously installed, thus breaking the plugin's functionality. You may not even know that anyone installed that plugin, but many people may already be using it. You'll want to find out about this before you actually roll out the new version of Confluence.
Here is an outline for a simple upgrade test:
- Create a clone of your production environment, using a database dump to obtain a copy of the Confluence data. We'll call this your 'staging environment'.
- Upgrade the staging environment to the new version of Confluence.
- Ask a few selected users from different departments to check the pages they commonly access, but have them do it in the staging environment.
Hint: In addition to finding weirdnesses with plugins, this may also show whether training for new functionality is needed in some of the departments. The IT department staff may be able to handle the upgrade to a new version of Confluence without training, but perhaps the sales representatives who use the wiki less often will need some training.
Getting a license for your staging environment
Only a technical contact for your commercial/academic license is able to create a Developer license.
Atlassian supplies 'developer' licenses which can be used by existing commercial license holders who wish to deploy non-production installations of our software to use in QA/staging environments. Developer licenses are free of charge to commercial license holders and, like our commercial offerings, they include 12 months of updates starting from the date of purchase of the commercial license.
If you hold a commercial license, you can obtain a free developer license by following these steps:
- Log in to your Atlassian account.
- Under the "Licenses" heading, all of your licenses will be displayed. Click the plus sign next to a license to view its details.
- Click the 'View Developer License' link in the bottom right corner of the license detail panel, below your commercial license key.
Enforcing Security Guidelines
Security is one of the most important issues for Confluence. We constantly spend a large amount of effort to keep up with security threats and to improve Confluence's security model. We treat security breaches with utmost priority, and recent releases have been improved to fend off advanced attack vectors like cross-site scripting (XSS), cross-site request forgery (XSRF) and header injection flaws. Altogether we believe that Confluence is a very secure product. But of course, as with any software, there are occasional bugs, and we fix security issues whenever they come up. We regularly publish minor releases that contain security fixes. This means you should upgrade your system frequently. Obviously this can affect your system's uptime. You should also make sure that the whole infrastructure around Confluence is robust (consider operating systems, web servers, application servers, networks, social engineering aspects, etc).
As with any other distributed system, you need to decide on a case-by-case basis if classified documents can be stored in it. It is common practice to store the most secure documents on computers that are not even connected to the physical intranet. Please contact your company's security officer to learn more about your enterprise's security procedures.
Make sure to have qualified staff around, so you can deal with security issues quickly. Once a security patch becomes available or a security incident happens, speed is essential.
Please refer to our dedicated Configuring Confluence Security page for more technical details.
Load-Testing Environments
Many customers ask us,
So, how many users and spaces can I put into Confluence, and what is the best hardware to do so?
The answer is, 'It depends'.
It depends a lot on your use case. Confluence is so successful because it can cover a huge range of use cases. If most of your users only access Confluence infrequently, it is no problem to have 70 000 to 100 000 users. But if each user is a power user who uses the system the whole day, the number of users Confluence can handle without tuning decreases substantially. If your pages are short, simple, and don't contain a lot of macros, then the situation will be vastly different from a system that relies heavily on macros, background tasks, or other features.
If your system is large (for example serving more than 10 000 users or storing more than 1000 spaces) or mission critical (which it could be with as few as 1000 users who use it all the time), you need one or more load-testing environments.
Even if your system is working nicely for 20 000 users right now, it might take just another 2000 users to push it over the edge.
We recommend the following basic procedure:
- Set up an environment that closely resembles your production environment.
- Gather statistics from your production system.
- Regularly apply a similar kind of load (and slightly higher) to the load-testing environment.
- Analyse how well Confluence scales for your usage patterns.
The Confluence development team has load-testing scripts available which you can use to simulate load. You can also contact Atlassian Support for more details.
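To give a feel for the "apply load and analyse" steps of the procedure above, here is a minimal Python sketch. It is not one of the Atlassian load-testing scripts mentioned above, and all function names are hypothetical. It assumes you have already built a list of URLs that reflects the request mix seen in your production access logs, and that the target is your load-testing environment, never production.

```python
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_seconds(url, timeout=30):
    """Return the seconds taken to download url, or None if the request failed."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except OSError:
        return None
    return time.monotonic() - start


def apply_load(urls, concurrent_users, fetch=fetch_seconds):
    """Request every URL in urls using concurrent_users worker threads.

    Returns (timings, failures): the sorted successful response times
    in seconds, and the number of failed requests.
    """
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(fetch, urls))
    timings = sorted(r for r in results if r is not None)
    return timings, len(results) - len(timings)


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]
```

Running `apply_load` repeatedly with a growing `concurrent_users` value, and comparing for instance `percentile(timings, 0.95)` between runs, gives a rough picture of where response times start to degrade for your usage pattern.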
Tuning
You may need to be able to tune your installation in the ways mentioned below.
Optimising your System
If you have large numbers of users, then downloading all the static content (CSS, default images, JavaScript-files) may result in a high additional load on the application server that can be offloaded to a caching web server.
Please refer to the following additional information:
- Our general Performance Tuning page.
- Information on configuring a large Confluence installation.
Limiting Third-Party Plugins
You may have to restrict the number of third-party plugins installed on your Confluence instance.
Most third-party plugins are not specifically written for high-load environments. What works fine in low-load environments could have unexpected and adverse effects when thousands of users are competing for your application server's CPU time or for database IO.
A common source of problems is access to database connections. If you have fewer users than database connections, it does not matter if an operation holds on to a database connection for two seconds while it downloads some data from the internet. With hundreds of concurrent users, this could quickly become a bottleneck.
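The arithmetic behind this can be made concrete with a rough steady-state estimate. In the Python sketch below, only the two-second hold time comes from the paragraph above; the pool size of 50 connections and the request rates are illustrative assumptions, and the function names are hypothetical.

```python
def max_operations_per_second(pool_size, hold_seconds):
    """Upper bound on operations per second when each operation holds
    one database connection for hold_seconds."""
    return pool_size / hold_seconds


def pool_is_bottleneck(pool_size, hold_seconds, arrivals_per_second):
    """True if operations arrive faster than the pool can serve them,
    so that requests start queueing for a connection."""
    return arrivals_per_second > max_operations_per_second(pool_size, hold_seconds)


# Assumed pool of 50 connections, each held for 2 seconds:
# at most 50 / 2 = 25 such operations can complete per second.
print(max_operations_per_second(50, 2))

# 20 users triggering the operation once every 10 seconds: 2 per second.
print(pool_is_bottleneck(50, 2, arrivals_per_second=2))

# 300 users doing the same: 30 per second, more than the pool can serve.
print(pool_is_bottleneck(50, 2, arrivals_per_second=30))
```

Under these assumed numbers the plugin is harmless for 20 users but exhausts the pool for 300, even though nothing about the plugin itself has changed.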
Confluence itself is tested and optimised to handle high loads and avoids these kinds of problems. But if you install a number of plugins that have not been tested against high load, your system may become unstable.
We recommend that you load test the common use cases of each unofficial third-party plugin if your Confluence installation is mission critical. Only activate plugins that are vital to your business, and never allow experimental plugins onto your production system until they have been tested in a staging environment.
Selecting and Tuning your JVM
You should select your JVM carefully and you may need to be able to tune it.
The selection of the JVM for your large Confluence instance can have a huge impact on the performance perceived by the users. Between versions 1.4 and 6 of the Sun Java JVM there have been some impressive improvements in performance, especially under high concurrent load.
Here are some essential guidelines:
- Always run the most recent point release of your selected JVM.
- Wherever possible, run the most recent major release from your selected JVM manufacturer. The Sun JVM version 6 is much faster than 1.4, especially under high loads.
- Tune your garbage collection algorithms. Experiment with different algorithms and settings to get the response times you desire in your environment. The Sun documentation has specific guidelines for the Sun JVM.
Customising Confluence to Optimise Performance
You may need to customise Confluence for performance reasons. Depending on your usage scenario, there may be ways to enhance Confluence performance that become necessary when you reach a certain level of usage.
Here are some things you might decide to do:
- Remove the display of the space list on the Dashboard. See Customising the Dashboard.
- Configure any search appliances or other crawlers that index the Confluence site:
  - These should be suitably rate limited.
  - Configure them to crawl only pages in the /display/ URL path, and only current versions of pages.
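If you operate the crawler yourself, the two crawler rules above might be sketched as follows in Python. This is a hedged illustration only: the class and function names are hypothetical, it assumes Confluence is served from the root of the host (no context path), and it enforces only the /display/ path restriction, not the "current versions" rule. Commercial search appliances will have their own configuration settings for the same purpose.

```python
import time
from urllib.parse import urlsplit


def should_crawl(url):
    """True only for URLs whose path starts with /display/."""
    return urlsplit(url).path.startswith("/display/")


class RateLimiter:
    """Enforces a minimum interval between successive requests."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and self._next_allowed > now:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval
```

A crawler loop would call `limiter.wait()` before each request and skip any URL for which `should_crawl(url)` is false.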
Please refer to our general Performance Tuning page for more details.
RELATED TOPICS
Performance Tuning
Configuring a Large Confluence Installation
Confluence Clustering Overview
Requesting Performance Support
Confluence Administrator's Guide
Confluence Configuration Guide
Server Hardware Requirements Guide
Fix Out of Memory Errors by Increasing Available Memory