Hazelcast network partition causing plugin issues in Bitbucket Data Center
Platform Notice: Data Center - This article applies to Atlassian products on the Data Center platform.
Note that this knowledge base article was created for the Data Center version of the product. Data Center knowledge base articles for non-Data Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
A Hazelcast network partition causes a node to leave the cluster. After this, some third-party apps, such as ScriptRunner, fail to enable.
Environment
Bitbucket Data Center 8.x and above
Diagnosis
Plugin Enable Event
Identify the node on which the plugin enable event occurred. In the logs below, the plugin starts to enable, but nothing further, including errors, is logged after this point (a sketch for checking the plugin state over REST follows the log lines).
2024-07-18 10:40:41,717 INFO [http-nio-7990-exec-102] B0845 *O6789Ox640x892167x18 18ggda <IP> "PUT /rest/plugins/1.0/com.onresolve.stash.groovy.groovyrunner-key HTTP/1.1" c.a.plugin.manager.PluginEnabler Resolving 1 plugins
2024-07-18 10:40:41,718 INFO [http-nio-7990-exec-102] B0845 *O6789Ox640x892167x18 18ggda <IP> "PUT /rest/plugins/1.0/com.onresolve.stash.groovy.groovyrunner-key HTTP/1.1" c.a.plugin.manager.PluginEnabler Enabling 1 plugins: [com.onresolve.stash.groovy.groovyrunner]
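To confirm on which node the plugin is still disabled, you can query each node's Universal Plugin Manager REST resource directly (the same /rest/plugins/1.0/ resource visible in the request logs above). The following is a minimal sketch using Java's built-in HttpClient; the node URL, credentials, and plugin key are placeholders for your environment.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class PluginStateCheck {
    public static void main(String[] args) throws Exception {
        // Placeholders: point this at each node directly (bypassing the load
        // balancer) to see which node reports the plugin as disabled.
        String nodeBaseUrl = "http://node1.example.com:7990";
        String pluginKey = "com.onresolve.stash.groovy.groovyrunner";
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin-password".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(nodeBaseUrl + "/rest/plugins/1.0/" + pluginKey + "-key"))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON payload includes an "enabled" field for the plugin.
        System.out.println(response.body());
    }
}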
Thread Dumps
The thread dumps show multiple blocked threads in the PluginEnabler code path. In the example below, the time at which these threads became blocked corresponds to the 10:40:41 timestamp of the plugin enable event in the example logs above. The stack trace for these blocked threads follows; it shows that the plugin enable operation is waiting on a lock held by a Hazelcast thread (a sketch for scanning dumps for such threads programmatically appears after the stack trace).
10:48:25 - http-nio-7990-exec-94
State: BLOCKED
CPU usage: 0.00%
Running for: 0:00.00
Waiting for
This thread is waiting for notification on lock [0x86d0f1f] owned by hz.hazelcast.cached.thread-2
Locks held
This thread does not hold any locks
Stack trace
com.atlassian.plugin.osgi.factory.OsgiPlugin.enableInternal(OsgiPlugin.java:388)
com.atlassian.plugin.impl.AbstractPlugin.enable(AbstractPlugin.java:260)
com.atlassian.plugin.manager.PluginEnabler.actualEnable(PluginEnabler.java:120)
com.atlassian.plugin.manager.PluginEnabler.enable(PluginEnabler.java:97)
com.atlassian.plugin.manager.PluginEnabler.enableAllRecursively(PluginEnabler.java:69)
com.atlassian.plugin.manager.DefaultPluginManager.lambda$enablePlugins$30(DefaultPluginManager.java:1582)
com.atlassian.plugin.manager.DefaultPluginManager$$Lambda$6457/0x0000000803536840.run(Unknown Source)
com.atlassian.plugin.manager.PluginTransactionContext.wrap(PluginTransactionContext.java:63)
com.atlassian.plugin.manager.DefaultPluginManager.enablePlugins(DefaultPluginManager.java:1549)
com.atlassian.stash.internal.plugin.ClusteredPluginController.enablePlugins(ClusteredPluginController.java:43)
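If you would rather scan thread dumps programmatically than read them by eye, the sketch below filters a thread dump file for threads that are BLOCKED inside the com.atlassian.plugin.manager.PluginEnabler package. The file name is a placeholder, and the parsing assumes jstack's output layout, where each thread section starts with a quoted thread name.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BlockedPluginEnablerThreads {
    public static void main(String[] args) throws IOException {
        // Placeholder path to a thread dump captured with jstack.
        List<String> lines = Files.readAllLines(Path.of("threaddump.txt"));

        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("\"")) {   // jstack starts each thread with its quoted name
                report(current);
                current.clear();
            }
            current.add(line);
        }
        report(current);
    }

    // Print the thread section if it is BLOCKED and stuck in PluginEnabler code.
    private static void report(List<String> section) {
        boolean blocked = section.stream().anyMatch(l -> l.contains("BLOCKED"));
        boolean inEnabler = section.stream()
                .anyMatch(l -> l.contains("com.atlassian.plugin.manager.PluginEnabler"));
        if (blocked && inEnabler) {
            section.forEach(System.out::println);
            System.out.println();
        }
    }
}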
Cause
The plugin enable event occurred on the node that had left and rejoined the cluster.
Hazelcast network partition
There was a Hazelcast network partition in which one node of the Bitbucket Data Center instance dropped out of the cluster membership and later rejoined. A large system clock jump, shown in the logs below, caused the node to leave the cluster (a simplified sketch of the detection mechanism follows the log excerpts):
2024-07-17 00:02:29,443 WARN [hz.hazelcast.cached.thread-4] c.h.i.c.impl.ClusterHeartbeatManager [<IP>]:5701 [user] [3.12.14-atlassian-6] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 154770 ms, Heartbeat-Timeout: 60000 ms
2024-07-17 08:02:11,922 WARN [hz.hazelcast.cached.thread-3] c.h.i.c.impl.ClusterHeartbeatManager [<IP>]:5701 [user] [3.12.14-atlassian-6] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 34369 ms, Heartbeat-Timeout: 60000 ms
2024-07-17 08:36:26,773 WARN [hz.hazelcast.cached.thread-12] c.h.i.c.impl.ClusterHeartbeatManager [<IP>]:5701 [user] [3.12.14-atlassian-6] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 37157 ms, Heartbeat-Timeout: 60000 ms
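For context, Hazelcast detects such jumps by comparing the wall-clock time that actually elapsed between heartbeat checks with the time that was expected to elapse; when the difference exceeds the tolerated threshold, heartbeat timestamps are reset and members can be dropped. The sketch below is a simplified illustration of that idea only, not Hazelcast's actual implementation; the interval and threshold values are assumptions.
public class ClockJumpDetector {
    // Assumed values for illustration; Hazelcast's real configuration differs.
    private static final long CHECK_INTERVAL_MS = 1_000;       // how often the clock is sampled
    private static final long MAX_TOLERATED_JUMP_MS = 60_000;  // analogous to the heartbeat timeout

    public static void main(String[] args) throws InterruptedException {
        long lastCheck = System.currentTimeMillis();
        while (true) {
            Thread.sleep(CHECK_INTERVAL_MS);
            long now = System.currentTimeMillis();
            // Wall-clock time that passed, minus the time expected to pass.
            long jump = (now - lastCheck) - CHECK_INTERVAL_MS;
            if (Math.abs(jump) > MAX_TOLERATED_JUMP_MS) {
                // In Hazelcast, this is roughly where heartbeat timestamps are
                // reset, after which peers may appear to have missed heartbeats.
                System.out.printf("Clock jump detected: %d ms%n", jump);
            }
            lastCheck = now;
        }
    }
}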
Cluster member leaving:
2024-07-17 00:02:39,804 INFO [hz.hazelcast.event-5] c.a.s.i.c.HazelcastClusterService Node '/<IP>:5701' was REMOVED from the cluster. Updated cluster:
[/<IP>:5701 master this uuid='5c79575a-d35f-4447-9070-13f22b7a9fc7' vm-id='1efd3a5c-2523-4e2c-86c1-860b1728e460']
2024-07-17 00:02:39,898 WARN [hz.hazelcast.cached.thread-17] c.h.m.i.n.i.MemberMapInvalidationMetaDataFetcher [<IP>]:5701 [admin] [3.12.14-atlassian-6] Can't fetch or extract invalidation meta-data of Member [<IP>]:5701 - 9ca5ed35-cee2-4cbc-8236-313dafcc5a31
Cluster member rejoining:
2024-07-17 00:02:51,413 WARN [hz.hazelcast.cached.thread-16] c.h.internal.cluster.ClusterService [<IP>]:5701 [admin] [3.12.14-atlassian-6] Resetting local member UUID. Previous: 5c79575a-d35f-4447-9070-13f22b7a9fc7, new: b1f8e6a5-9256-4f76-ba37-27c35de58c67
2024-07-17 00:02:59,751 INFO [hz.hazelcast.event-5] c.a.s.i.c.HazelcastClusterService Node '/<IP>:5701' was ADDED to the cluster. Updated cluster:
[/<IP>:5701 master uuid='9ca5ed35-cee2-4cbc-8236-313dafcc5a31' vm-id='c9138be2-8dd6-47a6-bd55-0a4c478daeff'],
[/<IP>:5701 this uuid='b1f8e6a5-9256-4f76-ba37-27c35de58c67' vm-id='1efd3a5c-2523-4e2c-86c1-860b1728e460']
During this phase, the PluginStateSplitBrainHandler was invoked. When that happens, the local state of the plugin system is compared to the state in the database, and if discrepancies are found, actions are taken to bring the local state back in sync with the plugin state in the database (a sketch of this reconciliation pattern follows the log lines below).
PluginStateSplitBrainHandler invoked
2024-07-17 00:02:59,783 INFO [threadpool:thread-3] c.a.s.i.p.PluginStateSplitBrainHandler 0 new plugins were found and installed
2024-07-17 00:02:59,905 INFO [threadpool:thread-2] c.a.s.i.p.PluginStateSplitBrainHandler 0 new plugins were found and installed
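Conceptually, the handler reconciles two views of the plugin system: the node-local state and the state persisted in the database. The sketch below illustrates this reconciliation pattern with plain Java collections; it is not Atlassian's implementation, and every name in it is hypothetical.
import java.util.Set;

public class PluginStateReconciler {

    // Bring the local plugin state in line with the database state. Both sets
    // contain the keys of enabled plugins; all names here are hypothetical and
    // only illustrate the reconciliation pattern described above.
    static void reconcile(Set<String> enabledLocally, Set<String> enabledInDb) {
        for (String key : enabledInDb) {
            if (!enabledLocally.contains(key)) {
                System.out.println("Enabling locally to match DB: " + key);
            }
        }
        for (String key : enabledLocally) {
            if (!enabledInDb.contains(key)) {
                System.out.println("Disabling locally to match DB: " + key);
            }
        }
    }

    public static void main(String[] args) {
        reconcile(Set.of("plugin.a"), Set.of("plugin.a", "plugin.b"));
    }
}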
Solution
- Try to enable the plugin from a session on a node that never left the cluster (see the REST sketch after this list for targeting a specific node directly).
- If this also fails, bring down all the nodes in the cluster, bring them back up one by one, and then try to enable the plugin again.
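If the load balancer makes it difficult to pin a browser session to a particular node, the enable request can be sent to that node's REST API directly, mirroring the PUT request visible in the diagnosis logs above. The sketch below assumes the standard Universal Plugin Manager endpoint and media type; the node URL, credentials, and plugin key are placeholders for your environment.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class EnablePluginOnNode {
    public static void main(String[] args) throws Exception {
        // Placeholders: target the healthy node directly, not the load balancer.
        String nodeBaseUrl = "http://node2.example.com:7990";
        String pluginKey = "com.onresolve.stash.groovy.groovyrunner";
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin-password".getBytes());

        // Minimal plugin representation asking the UPM to enable the plugin.
        String body = "{\"key\":\"" + pluginKey + "\",\"enabled\":true}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(nodeBaseUrl + "/rest/plugins/1.0/" + pluginKey + "-key"))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/vnd.atl.plugins.plugin+json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode() + " " + response.body());
    }
}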