Jira Data Center sample deployment and monitoring strategy

At Atlassian, we use several enterprise-scale development tools, especially when it comes to our customer-facing services. The following two services are good examples:

  • getsupport.atlassian.com
  • jira.atlassian.com

Both services are backed by separate Jira Data Center instances. To date, the two instances together track a total of 1.9 million tickets, with a combined user base of around 4.8 million users.

Jira Data Center 8.4 monitoring features

We've introduced several monitoring features in Jira Data Center 8.4, which was released months after the publication of this article. If you're looking for updated information on how to monitor the performance of your Jira Data Center deployment, see Jira Data Center monitoring.

How our Jira Data Center instances are used

Both getsupport.atlassian.com and jira.atlassian.com are used by customers and Atlassian employees across all time zones. This translates to a much larger user base than other instances we've covered before. Since they're public-facing services, we need to maintain high availability and good response times. This allows us to demonstrate that Jira Software Data Center and Jira Service Management Data Center can perform well under heavy, enterprise-grade load.



Content

These statistics help illustrate the size and data complexity of each instance. 

Statistic | getsupport.atlassian.com | jira.atlassian.com

Issues | over 1.26 million | 661,000
Projects38140
Custom fields157224
WorkflowsActive2194
Total45242
Total users | over 1.1 million | over 3.68 million
Groups14861
Comments | over 8.6 million | over 1.38 million
Permission schemes8491
Issue security levels623
Custom filters (shared) | 8,500 | 3,500

Traffic

These statistics help illustrate the amount of load each instance supports.


Statistic | getsupport.atlassian.com | jira.atlassian.com
Active database connections per hour: monthly average | 62 | 62
Active database connections per hour: weekly average | 62 | 62
Active database connections per hour: peak | 65 | 63
HTTP-level throughput: total average per hour | over 22,500 | over 37,000
HTTP-level throughput: maximum per hour | almost 43,000 | almost 51,000
Concurrent users per hour | approximately 5000 concurrent users per hour, on average | approximately 3500 concurrent users per hour, on average


Good to know:

  • For Active database connections per hour, we used traffic data for the entire month of May 2018 as our sample. For Concurrent users per hour and HTTP-level throughput, we used traffic data for 5 consecutive business days in June 2018 to filter out low-traffic weekend hours.
  • Active database connections per hour: We track this through the DatabaseConnections metric (see Amazon RDS Metrics and Dimensions). Most of these open connections are pre-allocated. Since the peaks don't deviate much from the average, we take this to mean that the minimum thread pool we set for the database is adequate.
  • Concurrent users per hour: We counted the number of new, successfully authenticated sessions created per hour. We assume those users stay on the site for the entire hour.
  • HTTP-level throughput: We counted the number of requests made through the Load Balancer.
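
As a rough illustration of how the Concurrent users per hour figure can be derived, the sketch below buckets session-creation timestamps by clock hour and averages the hourly counts. The function names and the input format (one datetime per new, successfully authenticated session) are assumptions made for this example, not a description of the tooling we actually use.

```python
from collections import Counter
from datetime import datetime


def sessions_per_hour(session_starts):
    """Count new authenticated sessions in each clock hour."""
    counts = Counter()
    for started in session_starts:
        # Truncate each timestamp to the start of its hour.
        counts[started.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(counts)


def average_per_hour(hourly_counts):
    """Average the hourly counts over the hours that saw at least one session."""
    if not hourly_counts:
        return 0.0
    return sum(hourly_counts.values()) / len(hourly_counts)
```

Because each session is assumed to last the whole hour, the hourly count doubles as the concurrency estimate for that hour.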

In a selected 12-hour traffic sample of both instances, we also saw that most users spent their time viewing issues (as opposed to creating them, viewing dashboards, and other activities).

Atlassian support and development teams also use these services to track tickets collaboratively with customers and partners. This makes both services crucial to how we collaborate with people outside the company.

Both instances get heavy traffic throughout the day, with some (mostly predictable) major spikes. As Data Center instances, they're both highly available, so an individual node failure won't be fatal. We can then focus less on maintaining uptime, and more on managing performance. 

Infrastructure and setup

Both Jira Data Center instances are hosted on separate Amazon Web Services (AWS) Virtual Private Clouds. Their topologies and configuration are identical except for the nodes used for the database and application cluster: 

Function | Instance type (getsupport.atlassian.com) | Instance type (jira.atlassian.com) | Number
Jira application | c5.9xlarge | m4.4xlarge | 3
Database (Amazon RDS PostgreSQL) | db.m4.4xlarge | db.m4.2xlarge | 1
Load balancer | AWS Application Load Balancer | AWS Application Load Balancer | 1
Storage and file system | Elastic File System | Elastic File System | 3


Each node in getsupport.atlassian.com's Jira application cluster uses one local 240GB disk, storing both /root and Jira index cache. In jira.atlassian.com, the application cluster nodes use a separate 40GB disk for /root and 30GB disk for the Jira index cache. The getsupport.atlassian.com instance has almost double the number of issues compared to jira.atlassian.com, which accounts for the large difference in disk size. In addition, both instances' application clusters do not auto-scale (as we don't support this).

Other than the node types and the size of each application node's disk, getsupport.atlassian.com and jira.atlassian.com are configured identically:


Component | Configuration
Database (Amazon RDS PostgreSQL) | The database node uses General Purpose SSD storage, with single-instance High-Availability enabled.
Load balancer | The load balancer has only one target group.
Storage and file system | Each instance uses 3 Elastic File System shared volumes: one each for shared configuration, shared home, and shared binaries.
JVM | 24GB maximum heap size, using the G1 Garbage Collector

Application nodes

Each node also has swapping disabled. We did this because swapping can slow down Jira’s performance.

Instead, we set a maximum heap size of 24GB, using the G1 Garbage Collector. This helps ensure the node always has enough memory (which means it’ll never need to swap anyway).

Integrated services

The getsupport.atlassian.com instance has 43 user-installed apps enabled, while jira.atlassian.com has 40. Both instances use OpenID to authenticate users directly against an Active Directory store.

The following table shows each instance's mix of linked Atlassian applications:

Atlassian application | getsupport.atlassian.com | jira.atlassian.com
Jira | 11 | 13
Confluence | 5 | 8
Bitbucket | 0 | 2
Bamboo | 0 | 17
Fisheye/Crucible | 0 | 2
Others | 1 | 1

Monitoring strategy

The strategies described here are tailored to Atlassian's business needs and available resources. Your enterprise environment might have unique needs and quirks that require different strategies. Consult an Atlassian Technical Account Manager for guidance.

We keep a different update cadence for each instance: getsupport.atlassian.com is updated with every Long Term Support release, and jira.atlassian.com with every feature release. We also apply the latest bug fix release to each as soon as we can.

Most updates apply fixes and optimizations that help prevent outages or serious problems we've had in the past. Given our update cadence, this makes it tricky to set alerts, since thresholds that we set for one release might quickly become irrelevant in the next. In our experience, problems rarely persist across releases.

Instead, our alerts focus on a broader range of potential problems. That is, they also have to alert us to problems we've never had before. We use alerts that have consistently warned us of upcoming problems, even new ones, through multiple releases. The following sections explain the alerts and practices we use.

Focusing on sustained spikes

Both getsupport.atlassian.com and jira.atlassian.com get regular traffic spikes that can drive up resource usage. These spikes can sometimes hit critical levels (for example, 90% of CPU usage), but Jira can recover gracefully from most of them. Because of this, setting an alert for when metrics hit certain thresholds can set off many false positives.

What we do monitor, however, are sustained spikes. We want to know when a node's resource has been overworked for too long, that is, when a metric stays past a certain threshold for too long. To do this, we set three dimensions for each node alert:

  1. Time period
  2. Warning level
  3. Critical level

We use third-party monitoring tools to sample certain metrics and calculate whether each metric stays past its threshold for the set time period. This allows us to ignore, for example, a node that briefly drops below 10% free disk space. Instead, we get alerted when the node has had less than 10% free disk space for 10 minutes.
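
The three dimensions can be sketched in a few lines of Python. This is only an illustration of the sustained-spike idea, not the logic of the third-party tools we use: the SustainedAlert class, its field names, and the (timestamp, value) sample format are all assumptions made for this example. An alert fires only when every sample inside the time period breaches the level, and only when the samples reach back far enough to cover the whole period.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


@dataclass
class SustainedAlert:
    """Hypothetical alert definition: time period, warning level, critical level."""
    period_minutes: float
    warning: Callable[[float], bool]
    critical: Optional[Callable[[float], bool]] = None  # None means no critical level

    def evaluate(self, samples: Sequence[Sample], now: float) -> str:
        window_start = now - self.period_minutes * 60
        window = [value for ts, value in samples if window_start <= ts <= now]
        # Without a sample at least as old as the window, we can't claim
        # the breach lasted for the full time period.
        covers_period = any(ts <= window_start for ts, _ in samples)
        if not window or not covers_period:
            return "ok"
        if self.critical is not None and all(self.critical(v) for v in window):
            return "critical"
        if all(self.warning(v) for v in window):
            return "warning"
        return "ok"


# Free disk space below 10% for 10 minutes, with no critical level.
disk_alert = SustainedAlert(period_minutes=10, warning=lambda free_pct: free_pct < 10)
```

With one sample per minute, a node that reports 8% free disk space for the whole 10 minutes evaluates to "warning", while a node that only dips below 10% in the last two samples evaluates to "ok".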

Node

Metric we track | Time period (minutes) | Warning level | Critical level
IOWait: We track how much time the CPU sits idle waiting for I/O operations to complete. | 10 | > 15% | > 30%
System CPU time: This refers to the amount of time the CPU spends in kernel space. | 5 | > 80% | > 90%
Remaining disk space: We need to know early if we should expand disk space to accommodate usage or check further for storage-related issues. | 10 | < 10% | None
Usable memory: We need to ensure that each node has enough usable physical RAM. | 5 | < 5% | < 3%

JVM

We use third-party tools to regularly check whether Jira's processes are still up. We also use the following alerts to monitor the health of the JVM on all application cluster nodes. These alerts help warn us that the instance might slow down, or a node is about to go offline. 

Metric we track | Time period (minutes) | Warning level | Critical level
Garbage Collection time: An application node slows down if it spends too much CPU time in Garbage Collection. | 5 | > 10% | > 20%
Java Heap memory consumed: This alert tracks how much of the maximum Java Heap memory is currently in use. If this alert gets triggered too often, it could mean we have a memory leak. Or it could mean we need to tweak the JVM Heap. | 5 | > 90% | > 95%
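
To make the two percentages concrete, here is a minimal sketch of how they could be derived from raw readings. It assumes the monitoring agent can read a cumulative Garbage Collection time counter in milliseconds and the used and maximum heap sizes in bytes. The function names are hypothetical, and the calculation uses elapsed wall-clock time as a stand-in for CPU time, so treat it as an approximation rather than the exact formula our tools apply.

```python
def gc_time_percent(prev_gc_ms, curr_gc_ms, prev_wall_ms, curr_wall_ms):
    """Share of the elapsed interval spent in Garbage Collection, as a percentage."""
    elapsed_ms = curr_wall_ms - prev_wall_ms
    if elapsed_ms <= 0:
        raise ValueError("samples must be in chronological order")
    gc_delta_ms = curr_gc_ms - prev_gc_ms
    if gc_delta_ms < 0:
        # The cumulative counter went backwards, so assume the JVM restarted
        # and count only the time accumulated since the restart.
        gc_delta_ms = curr_gc_ms
    return 100.0 * gc_delta_ms / elapsed_ms


def heap_used_percent(used_bytes, max_bytes):
    """Share of the maximum Java Heap currently in use, as a percentage."""
    return 100.0 * used_bytes / max_bytes
```

For example, 6 seconds of collection time accumulated over a 60-second interval is 10%, which sits right at the boundary of the warning level above. With a 24GB maximum heap, 23GB in use is just under 96%, which is past the critical level.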

Monitoring HTTP response 

We measure the HTTP response time of getsupport.atlassian.com and jira.atlassian.com to get a quick overview of their performance. To do this, we periodically ping both instances from three US regions. Our on-duty admins are automatically notified whenever:

  • HTTP response time is 7-10 seconds (warning)
  • HTTP response time is 10 seconds or more (critical)
  • A ping returns any HTTP return code that is not 200 (OK) or 302 (Redirect)
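
The notification rules above amount to a small classification function. The sketch below is an assumption about how such a check could be expressed: the function name and the returned labels are made up for this example, and it treats a response of exactly 10 seconds as critical.

```python
ACCEPTED_STATUS_CODES = (200, 302)


def classify_ping(status_code, response_seconds):
    """Map one ping result to a notification level (labels are illustrative)."""
    if status_code not in ACCEPTED_STATUS_CODES:
        return "bad_status"
    if response_seconds >= 10:
        return "critical"
    if response_seconds >= 7:
        return "warning"
    return "ok"
```

Checking the status code first means an error page that happens to come back quickly still triggers a notification.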

Load balancer health checks

Every 30 seconds, we run an automated health check request on the load balancer. The request times out after 29 seconds, at which point we’ll run another one.

If the load balancer times out two consecutive times, it is labeled unhealthy. The health check requests will continue anyway. The load balancer will be labeled healthy again once it responds successfully to two consecutive health checks.
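
The labeling rule is a small state machine: two consecutive results that contradict the current label flip it. The sketch below illustrates that rule only. The HealthTracker class is hypothetical, and starting in the healthy state is an assumption.

```python
class HealthTracker:
    """Flip the health label after `threshold` consecutive contradicting results."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.healthy = True  # assumed starting state
        self._streak = 0     # consecutive results that contradict the current label

    def record(self, success):
        """Record one health check result and return the current label."""
        if success == self.healthy:
            # The result agrees with the current label, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            if self._streak >= self.threshold:
                self.healthy = success
                self._streak = 0
        return self.healthy
```

A single timeout followed by a success leaves the label unchanged, which keeps one slow response from notifying the on-duty admin.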

Our on-duty admins get notified once the load balancer is labeled unhealthy. If it’s healthy by the time the admin starts investigating, they check how long the load balancer took to recover.

Log rotation

We stream our logs in real-time to a separate server. This allows us to combine logs from all nodes, and analyze the clusters as a whole. We include the node ID in each streamed log statement in case we need to trace isolated problems to specific nodes. 

Streaming and storing logs on a separate server also means we can delete old logs from each node every hour. Each node will automatically delete logs that are more than five hours old. We automated this to prevent logs from consuming too much disk space. This is critical for the database, which becomes unresponsive if its node runs out of disk space (when this occurs, the entire instance goes offline). 
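
A cleanup job along these lines could be scheduled to run every hour on each node. This is a minimal sketch, not our actual automation: the function name, the "*.log*" file pattern, and the use of each file's last-modified time to judge its age are all assumptions. It also assumes the logs have already been streamed to the separate server before they are deleted.

```python
import time
from pathlib import Path

MAX_AGE_SECONDS = 5 * 60 * 60  # five hours


def delete_old_logs(log_dir, now=None, max_age_seconds=MAX_AGE_SECONDS):
    """Delete log files last modified more than max_age_seconds ago.

    Returns the sorted names of the deleted files.
    """
    now = time.time() if now is None else now
    deleted = []
    for path in Path(log_dir).glob("*.log*"):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)
```

Judging age by last-modified time means a log file that is still being written to is never removed, since its timestamp keeps moving forward.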


We're here to help

Over time, we may re-configure our instance and monitoring strategies based on changing needs, application improvements, or new insights. Contact an Atlassian Technical Account Manager for more guidance on deploying your own Data Center instance and forming a compatible monitoring strategy.

Our Premier Support team performs health checks by meticulously analyzing your application and logs, to ensure that your application's deployment fully meets the needs of your users. If the health check process reveals any performance gaps, Premier Support will recommend possible changes to your deployment.

We also introduced a number of monitors you can use to analyze bottlenecks and get alerts for notable events. For more information, see Jira Data Center monitoring.



Last modified on Nov 5, 2020
