Jira DC Cache Replication - dev tips & tricks
On this page I would like to focus on some specific properties of the remote cache implementation in Jira DC.
TLDR
Most of the time you want a cache replicated by invalidation, with a loader backed by the DB.
Asynchronous replication vs synchronous
In Jira DC a remote cache is always an asynchronously replicated cache.
//Note: those cache settings are ignored in Jira DC
//Jira DC supports only async cache replication
new CacheSettingsBuilder().replicateAsynchronously();
new CacheSettingsBuilder().replicateSynchronously();
The only guarantee is that operations performed on a replicated cache by a given thread are replicated to the other nodes in the same order.
You can find more details about how this works in Jira Data Center cache replication.
ReplicateViaInvalidation vs ReplicateViaCopy
There are 2 types of replication. The first one, replication by invalidation, works, is simple, is recommended, and is what most Jira caches use. Such a cache should always have a loader, so the only operations performed on it are remove (by key) and removeAll. Each of them triggers an asynchronous remove(key) or removeAll message to all other nodes.
Example of such cache definition:
cacheManager.getCache(
getClass().getName() + ".cache", //cache name
this::loadValueForKey, //cache loader
new CacheSettingsBuilder().remote().replicateViaInvalidation().unflushable().build() // replicated via invalidation
);
Warning
Note: if you don’t specify any cache settings explicitly, this is what you get by default:
- the cache is remote by default
- the cache is replicateViaInvalidation by default
- it is better to always specify cache settings explicitly
- without a loader such a cache is probably useless - see “Non-sense caches”
The sequence of using such cache usually looks like this:
value = ...;
updateDB(key1,value); // store the new value of key in the store (which the cache loader is using)
cache.remove(key1); // this invalidates the key in the local cache synchronously and asynchronously on other nodes by sending a remove(key1) message to all nodes
....
cache.get(key1); //triggers the loader to get the value for key from the shared store
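The sequence above can be tried out without Jira. The following is a minimal plain-Java model of it, not the atlassian-cache API: NodeCache, InvalidationDemo and the in-memory db map are hypothetical names, and the asynchronous replication is modelled as a queue that is drained by hand so the stale window on the other node is visible.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical model of one node's loader-backed cache (not the atlassian-cache API).
class NodeCache {
    private final Map<String, String> local = new HashMap<>();
    private final Function<String, String> loader;
    // Stands in for the asynchronous replication queue of this node.
    final Queue<String> outgoingRemoves = new ArrayDeque<>();

    NodeCache(Function<String, String> loader) {
        this.loader = loader;
    }

    // A cache miss triggers the loader, which reads from the shared store.
    String get(String key) {
        return local.computeIfAbsent(key, loader);
    }

    // Local invalidation is synchronous; the message for the other nodes is only queued.
    void remove(String key) {
        local.remove(key);
        outgoingRemoves.add(key);
    }

    // Called when a remove(key) message from another node arrives.
    void onRemoteRemove(String key) {
        local.remove(key);
    }
}

class InvalidationDemo {
    static List<String> run() {
        Map<String, String> db = new HashMap<>(); // shared store used by both loaders
        db.put("key1", "old");
        NodeCache node1 = new NodeCache(db::get);
        NodeCache node2 = new NodeCache(db::get);
        List<String> seen = new ArrayList<>();

        seen.add(node1.get("key1")); // both nodes load and cache "old"
        seen.add(node2.get("key1"));

        db.put("key1", "new");       // 1. update the store first
        node1.remove("key1");        // 2. invalidate locally, queue the message

        seen.add(node1.get("key1")); // node1 reloads from the store
        seen.add(node2.get("key1")); // message not delivered yet: node2 is still stale

        // 3. the asynchronous remove(key1) message arrives on node2
        while (!node1.outgoingRemoves.isEmpty()) {
            node2.onRemoteRemove(node1.outgoingRemoves.poll());
        }
        seen.add(node2.get("key1")); // node2 reloads from the store
        return seen;
    }
}
```

InvalidationDemo.run() returns the values in the order they were read: old, old, new, old, new. The fourth read is the window in which node2 still serves the stale value; once the message is delivered, the loader brings every node back to what the store holds.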
The other type of replication is replication by value. Such a cache is not backed by a loader; you do all operations on the cache itself (put, remove), and puts are replicated as puts and removes as removes. This one is tricky and there are only a few use cases where it works out of the box.
Values are replicated asynchronously. Remember this is a replicated cache, not a synchronised shared storage: all synchronous cache operations are local (i.e. applying the change to the cache in memory and persisting the replication message in the local store - localq).
The problem here is that asynchronous puts (and this is the main difference between using put and using remove only) can leave the cache in an undefined state on all nodes. Different put(K1) calls race with each other, whether they come from 2 different threads on a single node or from different nodes. You may want to solve this problem with a cluster lock, but the cluster lock service is based on the DB (and it does not support dynamic keys well), so it is better to switch to a cache replicated by invalidation.
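To make the race concrete, here is a small plain-Java model of two nodes putting different values under the same key. It is only a sketch under stated assumptions: CopyNode and PutRaceDemo are hypothetical names (not the atlassian-cache API), and the asynchronous delivery is modelled as a queue that is drained explicitly, in one of the orders that can occur in a real cluster.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of a node holding a cache replicated by value (not the atlassian-cache API).
class CopyNode {
    final Map<String, String> local = new HashMap<>();
    // Queued {key, value} messages waiting to be sent to the other node.
    final Queue<String[]> outgoingPuts = new ArrayDeque<>();

    // The local put is synchronous; replication to the other node is only queued.
    void put(String key, String value) {
        local.put(key, value);
        outgoingPuts.add(new String[] {key, value});
    }

    // Applies every queued put from this node to the target node.
    void deliverTo(CopyNode target) {
        String[] msg;
        while ((msg = outgoingPuts.poll()) != null) {
            target.local.put(msg[0], msg[1]);
        }
    }
}

class PutRaceDemo {
    // Returns the value of K1 on node1 and on node2 after all messages were delivered.
    static String[] run() {
        CopyNode node1 = new CopyNode();
        CopyNode node2 = new CopyNode();

        node1.put("K1", "A"); // node1 holds A
        node2.put("K1", "B"); // node2 holds B, before node1's message arrives

        node1.deliverTo(node2); // node2 now holds A
        node2.deliverTo(node1); // node1 now holds B

        return new String[] {node1.local.get("K1"), node2.local.get("K1")};
    }
}
```

With this delivery order the nodes end up swapped: node1 holds B and node2 holds A, and nothing will ever reconcile them. With removes only, the same crossing of messages is harmless, because both nodes simply reload from the store.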
There are, however, use cases where this is not a problem: when the data stored in the cache is keyed by the node (node id), i.e. only one node in the cluster updates a given key. Then the only thing we need to guarantee is that the cache on a given node is updated by a single thread only.
Example: Imagine a cache where each node would store a timestamp of the last created issue:
cacheTimestampByNode = cacheManager.getCache(
getClass().getName() + ".cache",
null, //no loader
new CacheSettingsBuilder().remote().replicateViaCopy().unflushable().build() //replicated via copy
);
The sequence would look like this:
onIssueCreate(issue) {
currentNode = getCurrentNodeId();
timestamp = issue.created();
runInCacheUpdateThread(() ->
cache.put(currentNode, timestamp) //this updates the local cache synchronously and sends an async put(currentNode, timestamp) to all nodes
);
}
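runInCacheUpdateThread in the pseudocode above is not a Jira API. One possible way to get the "single thread per node" guarantee is a single-thread executor, sketched below in plain Java. NodeTimestampUpdater and NodeTimestampDemo are hypothetical names, and the cache is modelled as a BiConsumer so the sketch is self-contained; in a plugin it would be the put method of the cache replicated via copy.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Hypothetical helper: funnels every cache update on this node through one thread.
class NodeTimestampUpdater {
    private final ExecutorService cacheUpdateThread = Executors.newSingleThreadExecutor();
    private final String currentNodeId;
    private final BiConsumer<String, Long> cachePut; // stands in for cache::put

    NodeTimestampUpdater(String currentNodeId, BiConsumer<String, Long> cachePut) {
        this.currentNodeId = currentNodeId;
        this.cachePut = cachePut;
    }

    // May be called from any thread; the put itself always runs on the single
    // update thread, so puts for this node's key are applied in submission order.
    void onIssueCreated(long createdTimestamp) {
        cacheUpdateThread.execute(() -> cachePut.accept(currentNodeId, createdTimestamp));
    }

    // Stops accepting updates and waits for the queued ones to finish.
    void shutdownAndWait() {
        cacheUpdateThread.shutdown();
        try {
            cacheUpdateThread.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class NodeTimestampDemo {
    static Long run() {
        // Stands in for the replicated cache keyed by node id.
        ConcurrentHashMap<String, Long> cache = new ConcurrentHashMap<>();
        NodeTimestampUpdater updater = new NodeTimestampUpdater("node1", cache::put);
        updater.onIssueCreated(100L);
        updater.onIssueCreated(200L);
        updater.shutdownAndWait();
        return cache.get("node1"); // the last submitted timestamp
    }
}
```

Because each node only ever writes its own key and does so from one thread, the puts for that key reach the other nodes in a well-defined order, which is exactly the per-thread ordering guarantee described earlier.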
If you would like to have a cache replicated by value where the value can be updated by any node: please change your mind. You would need another expensive and problematic layer to provide synchronisation (cluster locks, cluster messages). You would also have to deal with the node lifecycle (nodes going up, going down, clearing caches).
Warning
You almost never want to use this pattern: cache replicated via value with cluster synchronisation mechanism
Guarantee of delivery
Since Jira 8.12.0, there are 2 types of replicated caches with different delivery guarantees:
- replicated operations (remove/removeAll) triggered by a cache replicated by invalidation have a guarantee of delivery
- replicated operations (put/remove/removeAll) triggered by a cache replicated by value have NO guarantee of delivery
Cache lifecycle
A remote cache can not be created dynamically but must be created as part of the app/Jira lifecycle: the remote cache must be defined in a non-lazy singleton service constructor.
When a remote cache is created, an RMI cache peer representing this cache is created as well. If the cache has not been created by the time Jira or the plugin is up, operations on this cache happening on other nodes cannot be replicated to that node, and a warning like this is logged:
WARN [LOCALQ] [VIA-INVALIDATION] Abandoning sending because cache does not exist on destination node: [cache-name]
...
java.rmi.NotBoundException
...
Note that the above situation can also happen (legally) when the other node is in the process of starting: Jira is up with the RMI port open but the plugin is still starting, or Jira is up but the plugin with this cache is being restarted. It may also happen during ZDU (zero downtime upgrade), when a plugin with such a cache is not ZDU friendly and uses the cache while there are still other nodes which do not have this cache yet.
Non-sense caches
So, as described, there is really one type of cache you usually want to have in DC:
→ cache replicated via invalidation with a loader backed by a persistent data store (DB)
The other type of cache makes sense only in a few use cases where data is keyed by the node:
→ cache replicated via value without loader where given key value can only be updated by a single node by design
Watch out: our cache API allows you to create other caches which make no sense at all in Jira DC:
Example: cache replicated by invalidation with no loader
(puts and removes are both replicated as removes, so you end up with a cache which is very busy being very empty)
// node1
cache.put(k1, v); //replicated as remove(k1)
// node2
cache.put(k1, v); //replicated as remove(k1)
// node1
cache.get(k1) == null
// node2
cache.get(k1) == null
java.rmi.NotBoundException
java.rmi.NotBoundException is thrown when a node tries to deliver a cache replication message (REMOVE, PUT) and the destination node is up but the cache is not available on it. This can be a temporary state or a permanent one (something is wrong); both cases are covered below.
Jira assumes remote caches are (and should be) created when the plugin starts.
So usually the code will look something like this:
@Component
public class MyComponent {
private final Cache<Serializable, Serializable> myRemoteCache;
public MyComponent(@ComponentImport final CacheManager cacheManager) {
this.myRemoteCache = cacheManager.getCache(
"my.remote.cache.name",
null,
new CacheSettingsBuilder()
.replicateViaInvalidation()
.build());
}
}
Let's take a 2 node cluster and a cache replication event on Node1, and see when we could get this exception.
- Both nodes are up and on both nodes the plugin is initialised and the cache is created (on node start): java.rmi.NotBoundException - can not happen
- Node1 is up and Node2 is down (by down I mean Node1 can't connect to RMI port on Node2): java.rmi.NotBoundException - can not happen
- Node1 is up and Node2 is starting - the RMI port is open before all plugins are up (so before MyComponent was created on Node2): java.rmi.NotBoundException - can happen, but this is a temporary situation. Node1 will retry n times (with information in the log) to deliver the cache replication messages, assuming that Node2 is starting and that the cache will be available soon, once the plugin is fully up.
- Both nodes are up but on node2 the cache was not created: java.rmi.NotBoundException - can happen
  Possible explanations:
  - different versions of Jira are running on the two nodes and one version is missing this plugin - this can happen when doing a "zero downtime upgrade" and is usually not a problem (when using the recommended remote caches replicated via invalidation)
  - the remote cache is not created on node start but during the lifetime of the node - this is an error in the code, or would need another mechanism (cluster lock) to synchronise the creation of the cache on all nodes; not recommended anyway
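The difference between the temporary and the permanent case in the list above comes down to a bounded retry: keep trying while the destination may still be starting, then give up. The sketch below only illustrates that idea in plain Java; it is not Jira's replication code. Delivery, RetryingSender, the attempt count and the pause are all hypothetical, and only java.rmi.NotBoundException is the real JDK class.

```java
import java.rmi.NotBoundException;

// Hypothetical stand-in for sending one replication message to another node.
interface Delivery {
    void send() throws NotBoundException;
}

class RetryingSender {
    // Tries to deliver up to maxAttempts times. Returns true on success and
    // false if the cache was still not bound after the last attempt,
    // in which case the message is abandoned.
    static boolean deliverWithRetry(Delivery delivery, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                delivery.send();
                return true;
            } catch (NotBoundException e) {
                // The destination node is up but the cache is not registered (yet):
                // wait a little and try again.
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

If the cache appears on the destination within the retry window (a node or plugin that is still starting), the message gets through. If it never appears (a cache created lazily, or a plugin missing on that node), every attempt fails the same way and the message is dropped, which is why a cache replicated via invalidation with a loader is the safer choice: a lost remove can at worst leave a stale entry, never a wrong value in the store.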
Links
Jira Data Center cache replication
Monitoring the cache replication