Content index rebuild is marked as REBUILD FAILED but keeps progressing afterwards
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
A Confluence administrator rebuilds the content index of a site. Shortly afterwards the index is marked as REBUILD FAILED in the Confluence UI, but the rebuild keeps progressing in the background.
Diagnosis
Occasionally, Confluence takes longer than expected to start rebuilding the indexes after the Rebuild button is clicked on the Content Index Administration page.
The following entries can be found in atlassian-confluence-index.log:
2023-10-31 12:45:20,971 INFO [Caesium-1-4] [internal.index.lucene.LuceneFullReindexManager] fullReindexLock Locking indexes for full reindex
2023-10-31 12:46:20,362 WARN [Caesium-1-1] [index.status.schedule.ReIndexHouseKeepingJobRunner] lambda$repairRebuildingJobIfNeeded$1 There was no updates for current re-index job for a while. Last update received at 2023-10-31T12:45:10.367150Z. Marking it as REBUILD_FAILED
However, you can observe later that the reindexing starts and can even reach COMPLETED without any issue:
2023-10-31 12:49:16,891 INFO [Caesium-1-1] [internal.index.lucene.LuceneFullReindexManager] fullReindexUnlock Unlocking indexes after full reindex
2023-10-31 12:49:16,898 INFO [lucene-interactive-reindexing-thread] [internal.index.status.DefaultReIndexJobManager] indexRebuildStarted Scheduled a job to monitor progress of rebuilding index
2023-10-31 12:49:16,900 INFO [lucene-interactive-reindexing-thread] [internal.index.lucene.LuceneReIndexer] reIndex Indexing starting for stage CONTENT_ONLY
2023-10-31 12:51:22,128 INFO [lucene-interactive-reindexing-thread] [internal.index.lucene.LuceneReIndexer] lambda$reIndex$7 full reindex starting for CONTENT_ONLY, deleting documents from index
2023-10-31 12:51:22,128 INFO [lucene-interactive-reindexing-thread] [internal.index.lucene.LuceneReIndexer] lambda$reIndex$7 full reindex documents deleted for CONTENT_ONLY, starting full reindex
2023-10-31 12:51:22,133 INFO [lucene-interactive-reindexing-thread] [confluence.internal.index.ConcurrentBatchIndexer] submitBatches Partitioning indexable entities [387468 com.atlassian.confluence.pages.Draft] up to 100 at a time across indexing threads
...
...
2023-10-31 13:01:20,361 INFO [Indexer: 2] [confluence.internal.index.ConcurrentBatchIndexer] logProgress Re-index progress: 100 of 387468. 0% complete. Memory usage: 4 GB free, 9 GB total
2023-10-31 13:01:20,391 INFO [Indexer: 3] [confluence.internal.index.ConcurrentBatchIndexer] logProgress Re-index progress: 200 of 387468. 0% complete. Memory usage: 4 GB free, 9 GB total
...
...
Note that almost 4 minutes pass in the example above between the first entry related to the index rebuild and the actual start of the indexing operation.
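The delay can be measured directly from the log timestamps. The following is a minimal Python sketch, not an Atlassian tool: the helper names parse_log_timestamp and seconds_between are hypothetical, and it assumes every log line starts with a timestamp in the YYYY-MM-DD HH:MM:SS,mmm layout shown in the excerpt above.

```python
from datetime import datetime

# Timestamp layout used at the start of each line in the log excerpt
# (a comma separates seconds from milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
LOG_TS_LENGTH = 23  # len("2023-10-31 12:45:20,971")


def parse_log_timestamp(line: str) -> datetime:
    """Parse the leading timestamp of a log line (hypothetical helper)."""
    return datetime.strptime(line[:LOG_TS_LENGTH], LOG_TS_FORMAT)


def seconds_between(first_line: str, second_line: str) -> float:
    """Return the elapsed seconds between two log lines (hypothetical helper)."""
    delta = parse_log_timestamp(second_line) - parse_log_timestamp(first_line)
    return delta.total_seconds()


# The two entries from the excerpt: index lock and actual indexing start.
lock_line = (
    "2023-10-31 12:45:20,971 INFO [Caesium-1-4] "
    "[internal.index.lucene.LuceneFullReindexManager] fullReindexLock "
    "Locking indexes for full reindex"
)
start_line = (
    "2023-10-31 12:49:16,900 INFO [lucene-interactive-reindexing-thread] "
    "[internal.index.lucene.LuceneReIndexer] reIndex "
    "Indexing starting for stage CONTENT_ONLY"
)

gap_seconds = seconds_between(lock_line, start_line)
print(f"Delay before indexing started: {gap_seconds:.3f} seconds")
```

A delay measured this way gives a rough lower bound for how long the re-index job may go without a status update on a given site.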
Cause
The ReIndexHouseKeepingJobRunner runs periodically (every 60 seconds) to detect and fix stalled re-index jobs.
A site re-index job can become stalled when:
- The node that is rebuilding the index has been restarted.
- Some nodes drop out of the cluster while receiving the new index snapshot.
- Any issue blocks the re-index job state from being updated for a long time, for example no updates within a minute.
The job reads the com.atlassian.confluence.index.status.ReIndexJob value from the BANDANA table and then checks the lastRebuildingUpdate field. If the field was last updated more than REBUILDING_INDEX_NO_UPDATES_MAX_SECONDS (60 seconds by default) ago, the job marks the current re-index job as REBUILD_FAILED.
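The check described above can be illustrated with a short Python sketch. This is not Confluence's actual implementation: the function is_stalled and the constant DEFAULT_NO_UPDATES_MAX_SECONDS are hypothetical names, and the example assumes the server writes its log timestamps in UTC so that they can be compared with the "Last update received" value from the WARN entry.

```python
from datetime import datetime, timedelta, timezone

# Default threshold described in this article (60 seconds).
# Hypothetical constant name, used only for this illustration.
DEFAULT_NO_UPDATES_MAX_SECONDS = 60


def is_stalled(last_rebuilding_update: datetime,
               now: datetime,
               max_seconds: int = DEFAULT_NO_UPDATES_MAX_SECONDS) -> bool:
    """Return True when the last status update is older than the threshold."""
    return (now - last_rebuilding_update) > timedelta(seconds=max_seconds)


# Values taken from the WARN entry in the log excerpt
# (assumption: the log timestamp 12:46:20,362 is in UTC).
last_update = datetime(2023, 10, 31, 12, 45, 10, 367150, tzinfo=timezone.utc)
check_time = datetime(2023, 10, 31, 12, 46, 20, 362000, tzinfo=timezone.utc)

# About 70 seconds without an update exceeds the 60-second default.
print(is_stalled(last_update, check_time))

# With a 300-second threshold the same job would not be flagged.
print(is_stalled(last_update, check_time, max_seconds=300))
```

Under these assumptions the first call reports the job as stalled and the second does not, which matches the behaviour seen in the log: the job was flagged roughly 70 seconds after its last update, even though indexing had not yet begun.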
Solution
To mitigate this situation, you can use the system property confluence.rendex.noupdate.max.seconds, which raises the time that ReIndexHouseKeepingJobRunner allows without updates before it marks a re-index job as REBUILD_FAILED.
- Increase this system property in your setenv.sh file (in the example below, it is increased to 5 minutes):
CATALINA_OPTS="-Dconfluence.rendex.noupdate.max.seconds=300 ${CATALINA_OPTS}"
Note that the code currently defines this property as confluence.rendex.noupdate.max.seconds and not confluence.reindex.noupdate.max.seconds (the spelling in this article is not a typo).
- Restart Confluence and try reindexing again.