Mirror farm fails to synchronise on update-ref after a recent startup or when a node is added to the farm

Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.

Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product; however, they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Summary

After multiple nodes in a mirror farm are started, or a new node is started, the initial sync can fail with an error such as:

1 "fatal: cannot update ref 'refs/heads/master': trying to write ref 'refs/heads/master' with nonexistent object"

Note that this error can also occur in other, legitimate synchronization failures. This article only covers the specific scenario where it occurs after a recent startup of a mirror node that joins already-running mirror nodes.

Environment

Bitbucket mirror farm with multiple nodes (6.7+).

Diagnosis

In the logs of one of the mirror farm nodes, we see an error after a startup:

2021-08-05 08:38:38,483 DEBUG [farm-operation-3] c.a.b.i.m.m.f.s.RepositorySynchronizationOperation Repository[2917]: snapshot synchronization
2021-08-05 08:38:38,492 ERROR [farm-operation-2] c.a.b.i.m.m.f.t.o.RetryingMirrorOperation MirrorOperation: updateRef failed attempt 5/5 for request: RepositorySynchronizationRequest{changeId=67113846669978b357c1a4c79c74cac374f7eca8, externalRepositoryId=2917, orchestratingNodeVmId=a8ff28f7-4a24-431e-8ce1-3b7e2fe2ea15, type=snapshot}; giving up
com.atlassian.bitbucket.scm.CommandFailedException: '/usr/local/git/bin/git update-ref --stdin -z' exited with code 128 saying: fatal: cannot update ref 'refs/heads/master': trying to write ref 'refs/heads/master' with nonexistent object 50c3207daa60bc493c7bea70a6ff07351f06ab09
    at com.atlassian.bitbucket.scm.DefaultCommandExitHandler.onError(DefaultCommandExitHandler.java:47)
    at com.atlassian.bitbucket.scm.git.command.GitCommandExitHandler.evaluateThrowable(GitCommandExitHandler.java:111)
    at com.atlassian.bitbucket.scm.git.command.GitCommandExitHandler.onError(GitCommandExitHandler.java:208)
    at com.atlassian.bitbucket.scm.DefaultCommandExitHandler.onExit(DefaultCommandExitHandler.java:32)
    at com.atlassian.bitbucket.internal.process.nu.NioNuProcessHandler.callExitHandler(NioNuProcessHandler.java:285)
    at com.atlassian.bitbucket.internal.process.nu.NioNuProcessHandler.finish(NioNuProcessHandler.java:326)
    at com.atlassian.bitbucket.internal.process.nu.NioNuProcessHandler.onExit(NioNuProcessHandler.java:123)
    at com.zaxxer.nuprocess.internal.BasePosixProcess.onExit(BasePosixProcess.java:319)
    at com.zaxxer.nuprocess.linux.ProcessEpoll.handleExit(ProcessEpoll.java:371)
    at com.zaxxer.nuprocess.linux.ProcessEpoll.cleanupProcess(ProcessEpoll.java:334)
    at com.zaxxer.nuprocess.linux.ProcessEpoll.process(ProcessEpoll.java:272)
    at com.zaxxer.nuprocess.internal.BaseEventProcessor.run(BaseEventProcessor.java:81)

Other nodes may see this log at the time of the error:

2021-08-05 08:37:53,388 DEBUG [farm-operation-5] c.a.b.i.m.m.f.s.RepositorySynchronizationOperation Repository[2917]: snapshot synchronization
2021-08-05 08:38:38,418 ERROR [threadpool:thread-31] c.a.b.i.m.m.f.s.RepositorySnapshotSyncEventVisitor Snapshot failed for externalRepository: [TEST/repo]#2917 error: {class com.atlassian.bitbucket.scm.CommandFailedException=[437e79a3-9088-4564-ba0f-380c2bc03fab]}

The node reporting the error was started recently and joined one or more already-running nodes. Leading up to the error, with debug logging enabled, the problem node records many entries like the following:

2021-08-04 15:42:28,687 DEBUG [threadpool:thread-4] c.a.b.i.m.m.f.s.FarmOrchestrator Skipping repository[2917] sync as lock could not be acquired

After the error, the repository sync is retried and succeeds.
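
To locate these entries on a mirror node, you can search the Bitbucket application logs. A minimal sketch, assuming the default log location under the mirror node's home directory (adjust the path and repository ID for your environment):

# Search a mirror node's logs for the failed update-ref and the lock-skip messages
grep -E "cannot update ref|Skipping repository.*lock could not be acquired" \
    $BITBUCKET_HOME/log/atlassian-bitbucket.log*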

Cause

To understand the cause, it helps to briefly explain what happens on the mirror farm during a sync. The farm splits synchronization into separate fetch and update-ref operations so that, whichever node the load balancer sends you to, you always see the same refs. Each node first fetches the objects it needs; only then is an update-ref scheduled across all nodes, so the refs change on every node at the same time and stay consistent. One node in the farm takes on the role of orchestrating this process; it can be any node and is usually the first to receive the request.
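
The following is a simplified sketch of that two-phase approach using plain git commands. It only illustrates the idea and is not Bitbucket's actual implementation; the remote name is a placeholder, and the ref and commit hash are taken from the example above:

# Phase 1 - each node fetches the new objects into its clone,
# without moving any branch refs yet
git fetch upstream

# Phase 2 - once every node has the objects, the orchestrator schedules a
# batched update-ref on all nodes so the refs move everywhere at the same time
printf 'update refs/heads/master 50c3207daa60bc493c7bea70a6ff07351f06ab09\n' |
    git update-ref --stdin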

The issue is that the running nodes can begin a fetch before a new node is added. If the fetch takes a long time, a node can join the cluster while the fetch is still in progress. The new node never received the fetch event, and although it is out of date it cannot start its own fetch because of the repository lock. The repository lock ensures that only one fetch of a given repository runs at a time on each node.
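
The "Skipping repository[2917] sync as lock could not be acquired" message shown in the Diagnosis section is this lock being hit. As a rough analogy only (this is not Bitbucket's internal locking mechanism), the behaviour resembles a non-blocking, per-repository file lock around the fetch:

# Analogy only: try to take a non-blocking per-repository lock; if a fetch of
# this repository is already running on the node, skip this sync attempt
flock -n /tmp/repo-2917.lock git fetch upstream \
    || echo "Skipping repository[2917] sync as lock could not be acquired"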

When the other nodes finish their fetch, the orchestrating node publishes an update-ref event to all nodes, including the new one. Because the new node is out of date and is missing the target object, its update-ref fails with the "fatal: cannot update ref" error. To preserve the consistency described above, this failure is reported back to all nodes and none of them update their refs. The whole repository sync operation is then retried; on the second attempt the new node also receives the fetch event, and the sync succeeds.
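
The underlying git behaviour can be reproduced outside Bitbucket: asking git to point a ref at a commit that is not present in the local object database fails with the same message. A scratch-repository sketch (the commit hash is taken from the example above and is assumed not to exist in the scratch repository):

git init /tmp/scratch-repo && cd /tmp/scratch-repo
echo "update refs/heads/master 50c3207daa60bc493c7bea70a6ff07351f06ab09" | git update-ref --stdin
# fatal: cannot update ref 'refs/heads/master': trying to write ref
# 'refs/heads/master' with nonexistent object 50c3207daa60bc493c7bea70a6ff07351f06ab09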

Solution

There are two workarounds for this issue:

  1. Ensure a node is fully synced after starting up before starting other nodes (one way to check this is sketched after this list)

  2. Wait for the sync to be retried; the second attempt will succeed
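
For the first workaround, one way to confirm that a newly started mirror node is serving the same refs as an existing node, before starting further nodes, is to compare the refs each node advertises for the same repository. This is only an illustrative check: the hostnames and repository path below are placeholders, and each URL must reach a specific node directly rather than going through the load balancer.

# No output from diff means both nodes advertise identical refs
diff <(git ls-remote https://mirror-node-1.example.com/scm/test/repo.git | sort) \
     <(git ls-remote https://mirror-node-2.example.com/scm/test/repo.git | sort)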

Updated on March 19, 2025
