Scaling Bitbucket Server
This page discusses performance and hardware considerations when using Bitbucket Server.
Note that Bitbucket Data Center resources, not discussed on this page, uses a cluster of Bitbucket Server nodes to provide Active/Active failover, and is the deployment option of choice for larger enterprises that require high availability and performance at scale.
The type of hardware you require to run Bitbucket Server depends on a number of factors:
- The number and frequency of clone operations. Cloning a repository is one of the most demanding operations. One major source of clone operations is continuous integration. When your CI builds involve multiple parallel stages, Bitbucket Server will be asked to perform multiple clones concurrently, putting significant load on your system.
- The size of your repositories – there are many operations in Bitbucket Server that require more memory and more CPUs when working with very large repositories. Furthermore, huge Git repositories (larger than a few GBs) are likely to impact the performance of the Git client.
- The number of users.
On this page:
Here are some rough guidelines for choosing your hardware:
- Estimate the number of concurrent clones that are expected to happen regularly (look at continuous integration). Add one CPU for every 2 concurrent clone operations.
- Estimate or calculate the average repository size and allocate 1.5 x number of concurrent clone operations x min(repository size, 700MB) of memory.
If you’re running Bitbucket Data Center, check your size using the Bitbucket Data Center load profiles. If your instance is Large or XLarge, take a look at our infrastructure recommendations for Bitbucket Data Center AWS deployments.
See Scaling Bitbucket Server for Continuous Integration performance for some additional information about how Bitbucket Server's SCM cache can help the system scale.
Tickets and throttling
Bitbucket Server uses a ticket-based approach to throttling requests. The system uses a limited number of different ticket buckets to throttle different types of requests independently, meaning one request type may be at or near its limit, while another type still has free capacity.
Each ticket bucket has a default size that will be sufficient in many systems, but as usage grows, the sizes may need to be tuned. In addition to a default size, each bucket has a default timeout which defines the longest a client request is allowed to wait to acquire a ticket before the request is rejected. Rejecting requests under heavy load helps prevent cascading failures, like running out of Tomcat request threads because too many requests are waiting for tickets.
The following table shows ticket buckets the system uses, the default size and acquisition timeout, and what each is used for:
|scm-command||40 (Fixed)||2 seconds|
“scm-command” is used to throttle most of the day-to-day Git commands the system runs. For example:
“scm-command” tickets are typically directly connected to web UI and REST requests, and generally have very quick turnaround - most commands typically complete in tens to hundreds of milliseconds. Because a user is typically waiting, “scm-command” tickets apply a very short timeout in order to favor showing users an error over displaying spinners for extended periods.
|scm-hosting||1x-4x (Adaptive; see below)||5 minutes|
“scm-hosting” is used to throttle
For SSH only, “scm-hosting” is also used to throttle
“scm-hosting” uses an adaptive throttling mechanism (described in detail below) which allows the system to dynamically adjust the number of available tickets in response to system load. The default range is proportional to a configurable scaling factor, which defaults to the number of CPUs reported by the JVM. For example, if the JVM reports 8 CPUs, the system will default to 1x8=8 tickets minimum and 4x8=32 tickets maximum.
|scm-refs||8x (Fixed proportional)||1 minute|
“scm-refs” is used to throttle ref advertisements, which are the first step in the process of servicing both pushes and pulls.
Additionally, because most of the CPU and memory load are client side, pushes are throttled using the “scm-refs” bucket. Unlike a clone or a fetch, the pack for a push is generated using the client’s CPU, memory and I/O. While processing the received pack does produce load on the server side, it’s minimal compared to generating a pack for a clone or fetch.
The default size for the “scm-refs” bucket is proportional to a configurable scaling factor, which defaults to the number of CPUs reported by the JVM. For example, if the JVM reports 8 CPUs, the system will default to 8x8=64 “scm-refs” tickets.
Ref advertisements are generally served fairly quickly, even for repositories with large numbers of refs, so the default timeout for “scm-refs” is shorter than the default for “scm-hosting”.
“git-lfs” is used to throttle requests for large objects using Git LFS. LFS requests are much more similar to a basic file download than a pack request, and produce little system load. The primary reason they’re throttled at all is to prevent large numbers of concurrent LFS requests from consuming all of Tomcat's limited HTTP request threads, thereby blocking access to users trying to browse the web UI, or make REST or hosting requests.
Because LFS is predominantly used for large objects, the amount of time a single LFS ticket may be held can vary widely. Since it’s hard to make a reasonable guess about when a ticket might become available, requests for “git-lfs” tickets timeout immediately when the available tickets are all in use.
Prior to Bitbucket Server 7.3, "scm-hosting" was used to throttle all parts of hosting requests, including ref advertisements and pushes. This meant that the "scm-hosting" bucket often needed to be sized very generously to prevent fast-completing ref advertisements from getting blocked behind slow-running clones or fetches when competing for tickets. However, when the "scm-hosting" limit was very high, if a large number of clone or fetch requests were initiated concurrently, it could result in a load spike that effectively crippled or even crashed the server.
"scm-refs" tickets were introducted in Bitbucket Server 7.3 in order to help combat that risk. With "scm-refs", administrators can configure the system to allow for heavy polling load (typically from CI servers like Bamboo or Jenkins) without necessarily increasing the number of available "scm-hosting" tickets.
When ref advertisement caching is enabled (see Scaling Bitbucket Server for Continuous Integration performance), “scm-hosting” tickets are still used to throttle ref advertisements, instead of “scm-refs” tickets. If ref advertisement caching is disabled (the default setting), “scm-refs” tickets are used. “scm-hosting” tickets are used when ref advertisement caching is enabled to allow the request to proceed from serving a ref advertisement, to serving a pack (potentially also from the cache) without needing to release an “scm-refs” ticket and then acquire an “scm-hosting” ticket. Regardless of whether ref advertisement caching is enabled or not, “scm-refs” tickets are always used to throttle pushes.
Prior to Bitbucket Server 4.11, resource throttling was achieved by allocating a fixed number of tickets, and each hosting operation would have to acquire a ticket before it could proceed. Hosting operations finding no tickets available had to queue until one was available and would time out if it queued for too long. The default was 1.5 tickets per number of CPU cores, but you could increase this within the app properties. Getting this number right was challenging.
Bitbucket Server 4.11 introduced a new throttling approach for SCM hosting operations that adapts to the stress the machine is under, referred to as adaptive throttling. With adaptive throttling, Bitbucket Server examines the total physical memory on the machine and determines a maximum ticket number that the machine can safely support given an estimate of how much memory a hosting operation consumes, how much memory Bitbucket Server needs, and how much Elasticsearch needs. The default minimum (1 ticket per CPU core) and maximum (4 tickets per CPU core) of the ticket range can be changed.
Other characteristics of adaptive resource throttling:
- Allows a variable range of ticket values depending on how close current CPU usage is to a target CPU usage (defaults to 75% CPU usage). This can be changed.
- Every 5 seconds it resamples the CPU usage and recalculates how many tickets within the range can be supported.
- CPU readings are smoothed so as to not respond too suddenly to CPU spikes and overshoot/undershoot the optimum number of tickets.
- You previously set a non-default fixed number of tickets, for instance
- You previously configured this strategy explicitly, for instance
- The adaptive throttling configuration is invalid in same way
- The total memory of the machine is so limited that even the minimum number of tickets is unsafe
Understanding Bitbucket Server's resource usage
Most of the things you do in Bitbucket Server involve both the Bitbucket Server instance and one or more Git processes. For instance, when you view a file in the web application, Bitbucket Server processes the incoming request, performs permission checks, creates a Git process to retrieve the file contents and formats the resulting webpage. The same is true for the 'hosting' operations like pushing commits, cloning a repository, or fetching the latest changes.
As a result, when configuring Bitbucket Server for performance, CPU and memory consumption for both Bitbucket Server and Git should be taken into account.
When deciding on how much memory to allocate for Bitbucket Server, the most important factor to consider is the amount of memory required for Git. Some Git operations are fairly expensive in terms of memory consumption, most notably the initial push of a large repository to Bitbucket Server and cloning large repositories from Bitbucket Server. For large repositories, it is not uncommon for Git to use up to 500 MB of memory during the clone process. The numbers vary from repository to repository, but as a rule of thumb 1.5x the repository size on disk (contents of the
.git/objects directory) is a reasonable initial estimate of the required memory for a single clone operation. However, note that for larger repositories, memory usage can be significantly higher than this and is effectively only bounded by the amount of RAM in the system.
The clone operation is the most memory intensive Git operation. Most other Git operations, such as viewing file history, file contents and commit lists are lightweight by comparison. Clone operations also tend to retain their memory for significantly longer than other operations.
Bitbucket Server has been designed to have fairly constant memory usage. Any pages that could show large amounts of data (e.g. viewing the source of a multi-megabyte file) perform incremental loading or have hard limits in place to prevent Bitbucket Server from holding on to large amounts of memory at any time. In general, the default memory settings (
-Xmxlg) should be sufficient to run Bitbucket Server. Installing third-party apps may increase the system's memory usage. The maximum amount of memory available to Bitbucket Server can be configured in
The memory consumption of Git is not managed by the memory settings in
_start-webapp.bat. The Git processes are executed outside of the Java virtual machine, and as a result the JVM memory settings do not apply to Git.
In Bitbucket Server, much of the heavy lifting is delegated to Git. As a result, when deciding on the required hardware to run Bitbucket Server, the CPU usage of the Git processes is the most important factor to consider. And, as is the case for memory usage, cloning large repositories is the most CPU intensive Git operation. When you clone a repository, Git on the server side will create a pack file (a compressed file containing all the commits and file versions in the repository) that is sent to the client. Git can use multiple CPUs while compressing objects to generate a pack, resulting in spikes of very high CPU usage. Other phases of the cloning process are single-threaded and will, at most, max out a single CPU.
Encryption (either SSH or HTTPS) may impose a significant CPU overhead if enabled. As for whether SSH or HTTPS should be preferred, there's no clear winner. Each has advantages and disadvantages as described in the following table:
No CPU overhead for encryption, but plain-text transfer and basic authentication may be unacceptable for security.
Encryption has CPU overhead, but this can be offloaded to a separate proxy server (if the SSL/TLS is terminated there).
Encryption has CPU overhead.
Authentication is slower – it requires remote authentication with the LDAP or Crowd server.
Authentication is generally faster, but may still require an LDAP or Crowd request to verify the connecting user is still active.
Cloning a repository is slightly slower – it takes at least 2 separate requests - and potentially significantly more - each performing its own authentication and permission checks. The extra overhead is typically small, but depends heavily on the latency between client and server.
Cloning a repository takes only a single request.
Cloning a Git repository, by default, includes the entire history. As a result, Git repositories can become quite large, especially if they’re used to track binary files, and serving clones can use a significant amount of network bandwidth.
There’s no fixed bandwidth threshold we can document for the system since it will depend heavily on things like; repository size, how heavy CI (Bamboo, Jenkins, etc.) load is, and more. However, it’s worth calling out that Bitbucket Server’s network usage will likely far exceed other Atlassian products like Jira or Confluence.
Additionally, when configuring a Data Center cluster, because repository data must be stored on a shared home which is mounted via NFS, Bitbucket Data Center's bandwidth needs are even higher -and its performance is far more sensitive to network latency. Ideally, in a Data Center installation, nodes would use separate NICs and networks for client-facing requests (like hosting) and NFS access to prevent either from starving the other. The NFS network configuration should be as low latency as possible, which excludes using technologies like Amazon EFS.
Git repository data is stored entirely on the filesystem. Storage with low latency/high IOPS will result in significantly better repository performance, which translates to faster overall performance and improved scaling.
Available disk space in $BITBUCKET_HOME/caches, where Bitbucket Server’s SCM cache stores packs to allow them to be reused to serve subsequent clones, is also important for scaling. The SCM cache allows Bitbucket Server to trade increased disk usage for reduced CPU and memory usage, since streaming a previously-built pack uses almost no resources compared to creating a pack, and running low on free space will automatically disable the SCM cache, resulting in substantially higher CPU usage and slower clone performance. When possible, $BITBUCKET_HOME/caches should have a similar amount of total disk space to $BITBUCKET_HOME/shared/data/repositories, where the Git repositories are stored.
The size of the database required for Bitbucket Server primarily depends on the number of repositories the system is hosting and the number of commits in those repositories.
A very rough guideline is: 100 + ((total number of commits across all repositories) / 2500) MB.
So, for example, for 20 repositories with an average of 25,000 commits each, the database would need 100 + (20 * 25,000 / 2500) = 300MB.
Note that repository data is not stored in the database; it’s stored on the filesystem. As a result, having multi-gigabyte repositories does not necessarily mean the system will use dramatically more database space.
Where possible, it is preferable, albeit not required, to have Bitbucket Server’s database on a separate machine or VM, so the two are not competing for CPU, memory and disk I/O.
Since cloning a repository is the most demanding operation in terms of CPU and memory, it is worthwhile analyzing the clone operation a bit closer. The following graphs show the CPU and memory usage of a clone of a 220 MB repository:
Git process (blue line)
Bitbucket Server (red line)
Git process (blue line)
Bitbucket Server (red line)
This graph shows how concurrency affects average response times for clones:
The measurements for this graph were done on a 4 CPU server with 12 GB of memory.
Response times become exponentially worse as the number of concurrent clone operations
exceed the number of CPUs.
In order to be able to effectively diagnose performance issues, and tune Bitbucket Server’s scaling settings, it is important to configure monitoring. While exactly how to set up monitoring is beyond the scope of this page, there are some guidelines that may be useful:
At a minimum, monitoring should include data about CPU, memory, disk I/O (for any disks where Bitbucket Server is storing data), free disk space, and network I/O
Monitoring free disk space can be very important for detecting when the SCM cache is nearing free space limits, which could result in it being automatically disabled
When Bitbucket Server is used to host large repositories, it can consume a large amount of network bandwidth. If repositories are stored on NFS, for a Data Center cluster, bandwidth requirements are even higher
Bitbucket Server exposes many JMX counters which may be useful for assembling dashboards to monitor overall system performance and utilitization
Retaining historical data for monitoring can be very useful for helping to track increases in resource usage over time as well as detecting any significant shifts in performance
As users create more repositories, push more commits, open more pull requests and generally just use the system, resource utilization will increase over time
Historical averages can be useful in determining when the system is approaching a point where additional hardware may be required or, for Data Center installations, when it may be time to consider adding another cluster node
Configuring Bitbucket Server scaling options and system properties
The sizes and timeouts for the various ticket buckets are all configurable; see Bitbucket Server config properties.
When the configured limit is reached for the given resource, requests will wait until a currently running request has completed. If no request completes within a configurable timeout, the request will be rejected. When requests while accessing the Bitbucket Server UI are rejected, users will see either a 501 error page indicating the server is under load, or a popup indicating part of the current page failed to load. When Git client 'hosting' commands (pull/push/clone) are rejected, Bitbucket Server does a number of things:
Bitbucket Server will return an error message to the client which the user will see on the command line: "Bitbucket is currently under heavy load and is not able to service your request. Please wait briefly and try your request again."
A warning message will be logged for every time a request is rejected due to the resource limits, using the following format:
"A [scm-hosting] ticket could not be acquired (0/12)"
The ticket bucket is shown in brackets, and may be any of the available buckets (e.g. “scm-command”, “scm-hosting”, “scm-refs” or “git-lfs”)
For five minutes after a request is rejected, Bitbucket Server will display a red banner in the UI for all users to warn that the server is under load.
This period is also configurable.
The hard, machine-level limits throttling is intended to prevent hitting are very OS- and hardware-dependent, so you may need to tune the configured limits for your instance of Bitbucket Server. When hyperthreading is enabled for the server CPU, for example, the default number of “scm-hosting” and “scm-refs” tickets may be too high, since the JVM will report double the number of physical CPU cores. In such cases, we recommend starting off with a less aggressive value; the value can be increased later if hosting operations begin to back up and system monitoring shows CPU, memory and I/O still have headroom.