HealthCheck: Asynchronous Cache Replication Queues
Overview
The cache replication in Jira Data Center 7.9 and later is asynchronous, which means that cache modifications aren’t replicated to other nodes immediately after they occur, but are instead added to local queues. They are then replicated in the background based on their order in the queue.
Before we can queue cache modifications, we need to create local queues on each of your nodes. There are separate queues created for each node, so that modifications are properly grouped and ordered.
Queues are created automatically. Whenever you add a node to the cluster, the remaining nodes detect it and create a new queue on their file system. Nodes retrieve information about the whole cluster from the database by using the following query:
select node_id, node_state from clusternode
node_id | node_state |
---|---|
node1 | ACTIVE |
node2 | ACTIVE |
node3 | OFFLINE |
When a node is unable to deliver cache replication messages to another node the local queues will grow. This health check detects when the number of undelivered cache replication messages exceeds some limit:
- WARN when the size of the queue >= 1000
- ERROR when the size of the queue >= 100_000
Understanding the Results
Icon | Result | What this means |
---|---|---|
The health check passed successfully. | The health check did not discover any problems with cache replication. | |
Number of undelivered cache replications from node: < | The number of undelivered cache replications between two nodes has exceeded the warning threshold: 1000. | |
Number of undelivered cache replications from node: < | The number of undelivered cache replications between two nodes has exceeded the error threshold: 100_000. |
Troubleshooting
Stale node
If one of the nodes is a stale node but is still defined in the database in clusternode table and in state ACTIVE, please remove this node from the database or change the state to OFFLINE (see JRASERVER-42916 - Getting issue details... STATUS ). This change will be automatically picked up by all running nodes. When a node is removed from the cluster all nodes should stop delivering cache replication messages to this node. Queue files for this node are left and may be not empty. Those files can be safely manually deleted by the admin.
Connection issues
Node has a problem connecting to <node-name>
. Please check the network connectivity.
Cache replication messages are serialized before being persisted in the queue file and deserialized when the sending threads read the message from the queue. Standard Java Serialization is used, which is the same mechanism which is used "later" when sending this message over RMI.
With asynchronous caches we have also introduced caching of CachePeer - the object representing the remote cache. On a given node instance a single CachePeer is created representing a remote cache (i.e. a cache on a different node). It is re-used until we have any exception on any CachePeer operation or on any node configuration change.
Replication through RMI in EhCache has two phases:
- First phase it to lookup CachePeer in remote RMI registry
- Second phase is to do replication through CachePeer
In the first phase, the connection is done using the node attributes (IP, cacheListenerPort), as defined in clusternode table.
The connection details done in second phase are defined in the stub CachePeer set by the remote node. Note that this is a different port and may be a different IP/hostname then in the first phase.
Other
Following configuration options may be useful:
- cluster.properties file:
- ehcache.listener.socketTimeoutMillis, default: 5sec
- ehcache.object.port
- Java system properties:
- java.rmi.server.hostname, eg -Djava.rmi.server.hostname=127.0.0.1
Links
Jira Data Center cache replication
Monitoring the cache replication