High availability for Stash

If Stash is a critical part of your development workflow, maximizing application availability becomes an important consideration. This guide provides the background information you need to set up Stash in a highly available configuration. There are many possible HA configurations for Stash, depending on the infrastructure components and software (SAN, clustered databases, etc.) you have at your disposal, so this guide gives a high-level overview and then describes one possible configuration in more detail.

Please note that your feedback and comments are welcome! We very much value additional lessons learned from your experience with alternative scenarios!

High availability

High availability describes a set of practices aimed at delivering a specific level of "availability" by eliminating and/or mitigating failure via redundancy. Failure can result from unscheduled downtime due to network errors, hardware failures or application failures, but can also result from failed application upgrades. Practices for achieving high availability range from organizational concerns such as change management to automated failover procedures when system failures are detected. Setting up a highly available system involves:

Proactive Concerns

- Change Management (including Staging / Production instances for change implementation)
- Create redundancy of network, application, storage and databases
- Monitoring system(s) for both the network and applications

Reactive Concerns

- Technical Failover mechanism, either automatic or scripted semi-automatic with manual switchover
- Standard Operating Procedure for guided actions during crisis situations

This guide assumes that processes such as change management are covered and focuses on redundancy / replication and failover procedures. When it comes to setting up your infrastructure to recover quickly from system or application failure, you have different options, which vary in the level of uptime they can provide. In general, as the required uptime increases, so do the complexity of the infrastructure and the knowledge required to administer the environment (and, by extension, the cost).

Understanding the availability requirements for Stash

Centralized version control systems such as Subversion, CVS, ClearCase and many others require the central server to be available for any operation that involves the version control system. Committing code, fetching the latest changes from the repository, switching branches or retrieving a diff all require access to the central server. If that server goes down, developers are severely limited in what they can do. They can continue coding until they're ready to commit, but then they're blocked.

Git is a distributed version control system, and developers have a full clone of the repository on their machines. As a result, most operations that involve the version control system don't require access to the central repository. When Stash is unavailable, developers are not blocked to the same extent as with a centralized version control system.

The availability requirements for Stash may therefore be less strict than the requirements for, say, Subversion.

Consequences of Stash unavailability

Unaffected

Developer:
- Commit code
- Create branch
- Switch branches
- Diff commits and files
- ...
- Fetch changes from fellow developers

Affected

Developer:
- Clone repository
- Fetch changes from central repository
- Push changes to central repository
- Access Stash UI - create/do pull requests, browse code

Build server:
- Clone repository
- Poll for changes

Continuous Deployment:
- Clone repository

Failover options

High availability and recovery solutions can be categorized as follows:

Failover option: Automatic correction / restart
Recovery time: 2-10 min (application failure), hours-days (system failure)
Possible with Stash: Yes
Description:
- Single node, no secondary server available
- Application and server are monitored
- Upon failure of the production system, automatic restarting is conducted via scripting
- Disk or hardware failure may require reprovisioning of the server and restoring application data from a backup

Failover option: Cold standby
Recovery time: 2-10 min
Possible with Stash: Yes
Description:
- Secondary server is available
- Stash is NOT running on the secondary server
- Filesystem and (optionally) database data is replicated between the 'active' server and the 'standby' server
- All requests are routed to the 'active' server
- On failure, Stash is started on the 'standby' server and shut down on the 'active' server. All requests are now routed to the 'standby' server, which becomes 'active'.

Failover option: Warm standby
Recovery time: 0-30 sec
Possible with Stash: No
Description:
- Secondary server is available
- Stash is running on both the 'active' server and the 'standby' server, but all requests are routed to the 'active' server
- Filesystem and database data is replicated between the 'active' server and the 'standby' server
- On failure, all requests are routed to the 'standby' server, which becomes 'active'
- This configuration is currently not supported by Stash, because Stash uses in-memory caches and locking mechanisms. At this time, Stash only supports a single application instance writing to the Stash home directory at a time.

Automatic correction

Before implementing failover solutions for your Stash instance, consider evaluating and leveraging automatic correction measures. These can be implemented through a monitoring service that watches your application and runs scripts to start, stop, kill or restart services.

1. A monitoring service detects that the system has failed.
2. A correction script attempts to gracefully shut down the failed system.
   1. If the system does not shut down properly after a defined period of time, the correction script kills the process.
3. After it is confirmed that the process is no longer running, it is started again.
4. If this restart resolves the failure, the mechanism ends.
   1. If the correction attempts are unsuccessful or only partially successful, a failover mechanism should be triggered, if one was implemented.
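
The correction sequence above can be sketched as a small shell script. This is only an illustration under stated assumptions: the `stash_*` hook functions, the `PID_FILE` and `STASH_CTL` paths and the `STOP_TIMEOUT` value are hypothetical placeholders, not part of Stash, and would need to be replaced with whatever starts, stops and inspects Stash on your system.

```shell
#!/bin/sh
# Sketch of an automatic correction script. All names and paths are hypothetical.
PID_FILE="${PID_FILE:-/var/run/stash.pid}"    # assumed location of a PID file
STASH_CTL="${STASH_CTL:-/etc/init.d/stash}"   # assumed service control script
STOP_TIMEOUT="${STOP_TIMEOUT:-60}"            # seconds to wait for a graceful stop

# Hooks: replace these with the commands that apply to your installation.
stash_is_running() { [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; }
stash_stop()       { "$STASH_CTL" stop; }
stash_kill()       { kill -KILL "$(cat "$PID_FILE")" 2>/dev/null; }
stash_start()      { "$STASH_CTL" start; }

# Returns 0 if Stash was restarted; non-zero means the caller should
# trigger a failover (if one is implemented).
correct_stash() {
  stash_stop || true                          # attempt a graceful shutdown
  waited=0
  while stash_is_running && [ "$waited" -lt "$STOP_TIMEOUT" ]; do
    sleep 1                                   # wait for the defined period
    waited=$((waited + 1))
  done
  if stash_is_running; then stash_kill; fi    # still alive: kill the process
  if stash_is_running; then return 2; fi      # shutdown could not be confirmed
  stash_start || return 1                     # start the application again
  stash_is_running || return 1                # restart did not bring it back
  return 0
}
```

A monitoring service would call `correct_stash` after detecting a failure and treat a non-zero return value as the signal to start the failover mechanism.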

Cold standby

The cold standby (also called Active/Passive) configuration consists of two identical Stash servers, where only one server is ever running at a time. The Stash home directory on each of the servers is either a shared (and preferably highly available) network file system or is replicated from the active to the standby Stash server. When a system failure is detected, Stash is restarted on the active server. If the system failure persists, a failover mechanism is started that shuts down Stash on the active server and starts Stash on the standby server, which is promoted to 'active'. At this time, all requests should be routed to the newly active server.

For each component in the chain of high availability measures, there are various implementation alternatives. Although Atlassian does not recommend any particular technology or product, this guide gives examples and options for each step. In the following, each component in the system is described and an example configuration is used to illustrate the descriptions.

System setup

[Diagram: System Setup]

Component	Description
Request Router	Forwards traffic from users to the active Stash instance.
High Availability Manager	Tracks the health of the application servers and decides when to fail over to a standby server and designate it as active. Manages failover mechanisms and sends notifications on system failure.
Stash server	Each server hosts an identical Stash installation (identical versions). Only one server is ever running a Stash instance at any one time (known as the active server). All others are considered standbys. The Stash home directory resides on a replicated or shared file system visible to all application servers (described in more detail below). The Stash home directory must never be modified when the server is in standby mode.
Stash DB	The production database, which should be highly available. How this is achieved is not explored in this document. See the following database vendor-specific information on the HA options available to you:

Database	More Information
Postgres	http://www.postgresql.org/docs/9.2/static/high-availability.html
MySQL	http://dev.mysql.com/doc/refman/5.5/en/ha-overview.html
Oracle	http://www.oracle.com/technetwork/database/features/availability/index.html
SQLServer	http://technet.microsoft.com/en-us/library/ms190202.aspx

Example HA implementation

This particular implementation is provided to illustrate the concepts, but hasn't been tested in production. We strongly recommend that you devise a solution that best fits your organisation's existing best practices and standards and is thoroughly tested for production readiness.

The example configuration that we'll use to illustrate the concepts consists of a Linux cluster of two nodes. Each node is a CentOS server with Java, Git and Stash installed. Stash's home directory is replicated between the nodes using DRBD, a block-level disk replication mechanism. The cluster is managed by CMAN. Pacemaker, a high availability resource manager, is used to manage two HA resources: Stash and a Virtual IP. Pacemaker runs on each machine, elects the 'primary' node for Stash and starts Stash on this node. The Virtual IP resource is configured to run on the same node as the Stash resource, removing the need for a separate 'request router' component. Pacemaker monitors Stash and, when it detects a failure, tries to restart Stash on the primary node. If the restart fails or does not resolve the issue, Pacemaker fails Stash over to the secondary node, and the Virtual IP resource moves with it.

Scripts to create a virtual network based on this example configuration using packer, vagrant and VirtualBox can be found in the stash-ha-example repository. Specifically, the scripts for installing the required software components can be found in the packer/scripts directory. The scripts for configuring the cluster can be found in the vagrant/scripts directory.

[Diagram: Example Stash HA implementation]

Request router

All high availability solutions are based on redundancy, monitoring and failover. In the cold standby approach, only one server is running Stash at a time. It is the request router's responsibility to route all incoming requests to the node that is currently the primary node. For full high availability, the request router should be highly available itself, meaning that the component is monitored by the HA manager and can be failed over to a redundant copy in the network.

Requirements

Routes all incoming requests to the node that is currently the 'primary' node
Should be highly available itself

Options

Solution in example HA implementation

The example HA implementation does not include a separate Request Router server. Instead it includes a virtual IP HA resource that is co-located with the Stash resource. The virtual IP resource is managed by Pacemaker and will be moved to the standby node when the Stash resource fails over to the standby node.

Data replication

Stash stores its data in two places: the Stash home directory and the database that you have configured. The Stash home directory contains, among other things, the Git repositories being managed by Stash (with some additional Stash-specific files and directories), installed plugins, caches and log files. The database contains, among other things, your project and repository information and metadata, pull requests data and the data for your installed plugins.

Data in Stash's home directory and in the database are very tightly coupled. For instance, pull requests have their metadata, participants and comments stored in the database, but certain Git-oriented information around merging and conflicts (which is used to display the diffs in the user interface) is stored in the managed Git repositories. If the two were to fall out of sync, you might see an incorrect pull request diff, you might be unable to merge the pull request, or Stash may simply refuse to display the pull request at all. Similarly, Stash plugins are installed from jar files in the Stash home directory but their state is stored in the database. If the two were to fall out of sync, plugins may malfunction or not appear installed at all.

When designing a high availability solution for Stash based on a replicated file system and database, it's important that the file system replication is atomic. The replicated file system must be a consistent snapshot of the 'active' filesystem. This is important because changes to a Git repository happen in predictable ways: first the objects (files, trees and commits) are written to disk, followed by updates of the refs (branches and tags). Some synchronisation tools such as rsync perform file-by-file syncing, which can result in an inconsistent Git repository if the repository is modified while the sync is happening (for example, if object files have not been synced, but the updated refs have been).
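
One way to detect a half-synced repository on a replica is to verify it with `git fsck`, which reports refs that point at missing objects. The sketch below is an illustration only: the function names are hypothetical, the repository paths are whatever your replica uses, and such a check supplements snapshot-based replication rather than replacing it.

```shell
#!/bin/sh
# Returns 0 if the repository at $1 passes git's integrity check,
# non-zero if objects referenced by its refs are missing or corrupt.
repo_is_consistent() {
  git -C "$1" fsck --no-progress >/dev/null 2>&1
}

# Prints the path of every repository given as an argument that fails the check.
check_repos() {
  for repo in "$@"; do
    repo_is_consistent "$repo" || printf '%s\n' "$repo"
  done
}
```

Running `check_repos` over the replicated repositories before promoting a standby prints nothing when every repository is intact.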

Furthermore, the tight coupling between the Stash home directory and database makes it essential that the Stash home directory and database are always consistent and in sync (see here for more information). By extension, this means that any high availability solution based on a replicated file system and database needs to ensure that the replicated file system and database are in sync. For example, if the replication is based on hourly synchronisation to a standby node, care must be taken to ensure that the database and the filesystem are synchronised at the same time.

Requirements

File system replication must replicate a consistent snapshot of Stash's home directory.
The database and the file system must be replicated at the same time.

Options

Solution in example HA implementation

The example HA implementation uses a DRBD managed block device for its Stash home directory. By default, DRBD runs in a Primary/Secondary configuration in which only a single node can mount the DRBD managed volume at a time. In this configuration, DRBD should be managed by Pacemaker to ensure that the DRBD volume is co-located with the Stash resource.

In preparation for experimentation with an Active/Active configuration, the example HA implementation has configured DRBD in a dual-primary configuration, which allows both nodes to mount the DRBD managed volume at the same time.

Monitoring

To allow for monitoring in a high availability environment, Stash, since version 2.10, has supported a REST-based health check endpoint at /status that describes the current health of the instance. This endpoint supports only the GET verb and requires no authentication, XSRF protection header values, or mime-type headers. The /status endpoint has been designed to return sane output even when Stash is currently unavailable as a result of database migration or backup. Note that other URLs, such as /login or /rest/api/latest/application-properties, redirect to the maintenance page while Stash is performing a database migration or backup. Monitoring the health of the system through those URLs may therefore trigger an unintended failover.

Example usage:

> curl -i -u user -X GET http://localhost:7990/stash/status
Enter host password for user 'user':
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-AREQUESTID: 1040x7x0
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
Content-Type: application/json;charset=ISO-8859-1
Content-Length: 19
Date: Tue, 07 Jan 2014 17:20:04 GMT
{"state":"RUNNING"}

The following is a list of the responses the /status health check endpoint will return:

HTTP Status Code	Response entity	Description
200	{"state":"RUNNING"}	Stash is running normally
500	{"state":"ERROR"}	Stash is in an error state
200	{"state":"MAINTENANCE"}	Stash is in maintenance mode
200	{"state":"STARTING"}	Stash is starting
200	{"state":"STOPPING"}	Stash is stopping
200	{"state":"FIRST_RUN"}	Stash is running for the first time and has not yet been configured
404		Stash failed to start up in an unexpected way (the web application failed to deploy)

If a connection error occurs when trying to connect to the endpoint (but the server is reachable) then Tomcat has failed to start.
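
The responses listed above can be turned into a small lookup for use in monitoring scripts. This is a minimal sketch: the function name and the verdict strings are hypothetical, and the code `000` stands for "no HTTP response received", which is what curl reports through `-w '%{http_code}'` when the connection fails.

```shell
#!/bin/sh
# Maps an HTTP status code and response body from /status to a short verdict.
# Pass 000 as the code when no HTTP response was received.
interpret_status() {
  code="$1"; body="$2"
  case "$code" in
    200)
      case "$body" in
        *'"state":"RUNNING"'*)     echo "running" ;;
        *'"state":"MAINTENANCE"'*) echo "maintenance" ;;
        *'"state":"STARTING"'*)    echo "starting" ;;
        *'"state":"STOPPING"'*)    echo "stopping" ;;
        *'"state":"FIRST_RUN"'*)   echo "first-run" ;;
        *)                         echo "unknown" ;;
      esac ;;
    500) echo "error" ;;
    404) echo "webapp-not-deployed" ;;
    000) echo "tomcat-not-started" ;;   # only meaningful if the server itself is reachable
    *)   echo "unknown" ;;
  esac
}
```

For example, `interpret_status 200 '{"state":"RUNNING"}'` prints `running`.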

Monitoring frequency

Stash's health check is simple and not resource intensive. You should feel free to check as often as is deemed necessary to maximise continuity of Stash in your organisation. We do recommend, however, checking no more frequently than every 15 seconds, so that the HA resource manager / cluster does not mistake transitory slowdowns, such as stop-the-world garbage collection in Stash's JVM, for failures. We recommend a monitor timeout of 30 seconds because the first check after startup can be fairly slow. After startup completes, the check should take only a few milliseconds.

Requirements

Monitoring scripts must use the /status URL. Any other URL may redirect to the maintenance page when a backup is being performed, unintentionally triggering failover.
When a request to /status returns anything other than a 200 status code, Stash should be considered to be in an error state and should be failed over to the standby node.
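
A monitoring check that follows these requirements might look like the sketch below. The function and variable names are hypothetical, the URL is the one from the earlier curl example, and the 30-second value reflects the monitor timeout recommended above; it is not taken from the example implementation's OCF script.

```shell
#!/bin/sh
# Hypothetical defaults; adjust the URL to your own instance.
STATUS_URL="${STATUS_URL:-http://localhost:7990/stash/status}"
CHECK_TIMEOUT="${CHECK_TIMEOUT:-30}"   # seconds allowed for a single check

# Returns 0 when /status answers with HTTP 200, 1 otherwise. Timeouts and
# connection errors count as failures (curl reports the code 000 for them).
stash_is_healthy() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time "$CHECK_TIMEOUT" "$STATUS_URL") || true
  [ "$code" = "200" ]
}
```

A monitor would call `stash_is_healthy` no more often than every 15 seconds and hand a failure over to the HA resource manager.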

Solution in example HA implementation

The example HA implementation includes an OCF compliant script that's used for monitoring Stash's health. The script can be found here.

Failover

The following table outlines how we recommend that your HA resource manager responds to failure events:

Event	Response
Network connection from the request router to Stash is lost	Failover to a secondary node
Server failure	Failover to a secondary node
Stash crashes completely	Restart Stash on the active node
Stash reaches its memory limits (OOME)	Restart Stash on the active node
Stash loses connection to the database	Nothing. Stash will recover when the database comes back on line. Stash on another node will also fail to start if the database is unavailable.
The database is reported down	Nothing. Stash will recover when the database recovers. Stash on another node will also fail to start.
Stash fails to start up (e.g. wrong Git binary version)	Nothing. Manual intervention required. Stash on another node will also fail to start.
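
The table above can be encoded as a lookup that a custom failover script consults. This is a sketch only: the event identifiers and response labels are hypothetical names chosen for illustration, and a real HA resource manager would express the same policy in its own configuration.

```shell
#!/bin/sh
# Maps a failure event to the response recommended in the table above.
# Event names are hypothetical labels, not identifiers used by Stash.
recommended_response() {
  case "$1" in
    network-lost|server-failure)   echo "failover" ;;  # fail over to a secondary node
    stash-crashed|out-of-memory)   echo "restart" ;;   # restart Stash on the active node
    db-connection-lost|db-down)    echo "none" ;;      # Stash recovers when the database does
    startup-failure)               echo "manual" ;;    # manual intervention required
    *)                             echo "unknown"; return 1 ;;
  esac
}
```

For example, `recommended_response out-of-memory` prints `restart`.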

Split brain

A split-brain condition results when a cluster of nodes encounters a network partition and multiple nodes believe the others are dead and proceed to take over the cluster resources. In the context of a Stash HA installation this would involve multiple Stash instances running concurrently and making filesystem and database changes, potentially causing the filesystem and database to fall out of sync. As previously noted this must not be permitted to happen. There are several ways to address this:

Network redundancy

This involves configuring redundant and independent communications paths between nodes in the cluster. If you maximise the connectivity between nodes you minimise the likelihood of a network partition and a split brain. This is a preventative measure but it is still sometimes possible for the network to partition.

Resource fencing

This involves ensuring that the first node that believes the others are dead 'fences off' access to the resource that other nodes (which appear dead but may still be alive) may try to access. The losing nodes are prevented from making modifications, therefore maintaining consistency. In a Stash HA, the resources that would need to be fenced are the database and the replicated file system.

Node fencing or STONITH

This is a more aggressive tactic and again involves the first node that believes the others are dead, but instead of fencing off access to particular resources, it denies all resource access to them. This is most commonly achieved by power-cycling the losing nodes (aka "Shoot The Other Node In The Head" or STONITH). In a Stash HA, this would involve power-cycling the losing Stash servers.

Requirements

The application should fail over to a secondary node when a server failure is detected by the cluster manager (that is, the whole node is down or unreachable).
When an application failure is detected, the application should be restarted. If restarting does not resolve the issue, the application should be failed over to a secondary node.

Solution in example HA implementation

The example implementation uses Pacemaker to manage failover. Pacemaker in turn uses the provided OCF script to properly shut down the failing Stash and start Stash on the secondary node.

Please note that the vagrant provisioning script in the example implementation contains a simplified configuration that is aimed at testing failover. It configures Stash to fail over to a secondary node immediately, without attempting to restart the application. It also disables the STONITH feature for ease of testing. In a production system, at least one restart should be attempted before failing over, and STONITH should be enabled to handle 'split brain' occurrences.
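
As a rough illustration of the production settings described above, the following configuration fragment assumes that the cluster is administered with the `pcs` command-line tool and that the Stash resource is named `stash`. Both are assumptions; the example repository may use different tooling and resource names.

```shell
# Assumption: Pacemaker is managed with pcs and the Stash resource is called "stash".

# Allow one local restart: the resource moves away from a node only after its
# second failure there (with the default on-fail=restart behaviour).
pcs resource meta stash migration-threshold=2

# Re-enable STONITH. A fence device must also be configured for your hardware;
# that step is environment-specific and is not shown here.
pcs property set stonith-enabled=true
```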
