Jira Index Recovery Through Snapshot Restore Fails With "CorruptIndexException: file mismatch"


Platform Notice: Data Center - This article applies to Atlassian products on the Data Center platform.

Note that this knowledge base article was created for the Data Center version of the product. Data Center knowledge base articles for non-Data Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

 

Summary

While attempting to recover the Jira index through a snapshot restore, in rare circumstances you might run into the exception "CorruptIndexException: file mismatch". Although the corrective action for this exception is quite simple (i.e., rebuilding the index), it can be far more complex to understand what caused the index corruption in the first place.

This article explains certain scenarios where this can happen.

Environment

Any Jira Data Center environment

Diagnosis

Here's the general flow you'll notice in the logs after a failed attempt to restore the index from a snapshot:

  • Index snapshot restore is initiated:
2023-04-03 21:14:36,324-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring index with <X> issues
2023-04-03 21:14:36,332-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring search indexes - 1% complete... [INDEX-FIXER] Replacing indexes
2023-04-03 21:14:36,332-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Starting the index restore process by replacing existing indexes. For this, we're holding the 'stop-the-world' index lock.
  • It might progress for a while, depending on where in the process it finds the corrupted Lucene file:
2023-04-03 21:14:36,732-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Re-index start time: {2023-03-31 20:06:32.145}, Latest DB issue-version date: {2023-04-03 21:14:36.010}
2023-04-03 21:14:36,735-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.LegacyIndexFixer] [INDEX-FIXER] [LEGACY] Re-index start time: {2023-03-31 20:06:32.145}, Latest DB (jiraissue table) date: {2023-04-03 21:14:35.653}
2023-04-03 21:14:36,735-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Done replacing existing indexes. The rest of the index restore process will continue without the lock.
2023-04-03 21:14:36,735-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] Restoring search indexes - 20% complete... [INDEX-FIXER] Restored index backup
2023-04-03 21:14:36,735-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.DefaultIndexRecoveryManager] [INDEX-FIXER] [LEGACY] Until Jira 9.0, we're running the legacy index fixer before the new versions-based one
2023-04-03 21:14:36,736-0700 localhost-startStop-1 INFO      [c.a.j.index.ha.LegacyIndexFixer] [INDEX-FIXER] [LEGACY] Re-indexing issues from: Fri Mar 31 20:06:32 PDT 2023 to: Mon Apr 03 21:14:35 PDT 2023 ...
  • Then it finds a corrupted Lucene file, throws the "CorruptIndexException: file mismatch" error, and fails to restore the index:
2023-04-03 21:14:47,995-0700 localhost-startStop-1 WARN      [c.a.jira.index.AccumulatingResultBuilder] org.apache.lucene.index.CorruptIndexException: file mismatch, expected id=3szs4k6fwb774nmibsoiezrps, got=2e4x84h22ssxc50oywtt09wfw (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/atlassian/application-data/jira/caches/indexesV1/comments/_3rh8f.si")))
com.atlassian.jira.util.RuntimeIOException: org.apache.lucene.index.CorruptIndexException: file mismatch, expected id=3szs4k6fwb774nmibsoiezrps, got=2e4x84h22ssxc50oywtt09wfw (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/atlassian/application-data/jira/caches/indexesV1/comments/_3rh8f.si")))
	at com.atlassian.jira.index.WriterWrapper$1.get(WriterWrapper.java:88)
	at com.atlassian.jira.index.WriterWrapper$1.get(WriterWrapper.java:79)
	at com.atlassian.jira.index.WriterWrapper.<init>(WriterWrapper.java:73)
	at com.atlassian.jira.index.WriterWrapper.<init>(WriterWrapper.java:79)
	at com.atlassian.jira.index.DefaultIndexEngine$DefaultWriterFactory.apply(DefaultIndexEngine.java:246)
	at com.atlassian.jira.index.DefaultIndexEngine$DefaultWriterFactory.apply(DefaultIndexEngine.java:241)
	at com.atlassian.jira.index.DefaultIndexEngine$WriterReference.doCreate(DefaultIndexEngine.java:226)
	at com.atlassian.jira.index.DefaultIndexEngine$WriterReference.doCreate(DefaultIndexEngine.java:203)
	at com.atlassian.jira.index.DefaultIndexEngine$ReferenceHolder$2.get(DefaultIndexEngine.java:286)
	at com.atlassian.jira.concurrent.ResettableLazyReference.getOrCreateUnderLock(ResettableLazyReference.java:97)
	at com.atlassian.jira.concurrent.ResettableLazyReference.getOrCreate(ResettableLazyReference.java:89)
	at com.atlassian.jira.index.DefaultIndexEngine$ReferenceHolder.apply(DefaultIndexEngine.java:283)
	at com.atlassian.jira.index.DefaultIndexEngine.write(DefaultIndexEngine.java:150)
	at com.atlassian.jira.index.DefaultIndex.perform(DefaultIndex.java:28)
	at com.atlassian.jira.index.QueueingIndex$Task.perform(QueueingIndex.java:215)
	at com.atlassian.jira.index.QueueingIndex$Task.index(QueueingIndex.java:226)
	at com.atlassian.jira.index.QueueingIndex$Task.run(QueueingIndex.java:194)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.lucene.index.CorruptIndexException: file mismatch, expected id=3szs4k6fwb774nmibsoiezrps, got=2e4x84h22ssxc50oywtt09wfw (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/atlassian/application-data/jira/caches/indexesV1/comments/_3rh8f.si")))
	at org.apache.lucene.codecs.CodecUtil.checkIndexHeaderID(CodecUtil.java:351)
	at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:256)
	at org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.read(Lucene70SegmentInfoFormat.java:95)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:360)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1121)
	at com.atlassian.jira.index.MonitoringIndexWriter.<init>(MonitoringIndexWriter.java:43)
	at com.atlassian.jira.index.MonitoringIndexWriter.create(MonitoringIndexWriter.java:58)
	at com.atlassian.jira.index.WriterWrapper$1.get(WriterWrapper.java:86)
	... 17 more
	Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (8465308b). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/atlassian/application-data/jira/caches/indexesV1/comments/_3rh8f.si")))
		at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:463)
		at org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.read(Lucene70SegmentInfoFormat.java:266)
		... 24 more
  • This "file mismatch" can happen on one or more Lucene documents - however, the restore would not continue after the first hit. So in general, we would see just one document getting reported:
  • Count of the exception:
$ grep -c CorruptIndexException atlassian-jira.log* | grep -v ':0$'
atlassian-jira.log:10591
atlassian-jira.log.1:35102
  • All the errors reference the same Lucene index file:
$ grep 'CorruptIndexException' atlassian-jira.log | awk -F'path=' '{print $2}' | sort | uniq -c
  10591 "/atlassian/application-data/jira/caches/indexesV1/comments/_3rh8f.si")))

$ grep 'CorruptIndexException' atlassian-jira.log.1 | awk -F'path=' '{print $2}' | sort | uniq -c
  35102 "/atlassian/application-data/jira/caches/indexesV1/comments/_3rh8f.si")))


  • In some cases, the index corruption logs may look like one of the following:
2024-08-14 00:14:45,524-0400 main ERROR      [c.a.jira.cluster.DefaultClusterManager] Current node: node1. Couldn't recover index even though it had been found in shared. Current list of other nodes: [nodeX, nodeX+1, nodeX+2]
com.atlassian.jira.util.RuntimeIOException: org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/var/lib/jira/atlassian/application-data/jira/caches/indexesV2/comments/segments_xxxx")))
2024-08-14 00:14:53,151-0400 JiraTaskExecutionThread-1 INFO anonymous     [c.a.j.index.request.DefaultReindexRequestManager] Re-indexing started
2024-08-14 00:14:53,152-0400 JiraTaskExecutionThread-1 INFO anonymous     [c.a.j.util.index.CompositeIndexLifecycleManager] Reindex All starting...
2024-08-14 00:14:53,164-0400 JiraTaskExecutionThread-1 INFO anonymous     [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Created node re-index service, paused=true, running period=5sec, delay=10sec
...
2024-08-14 00:14:55,543-0400 JiraTaskExecutionThread-1 ERROR anonymous     [c.a.j.util.index.CompositeIndexLifecycleManager] Reindexing encountered an error for one of the indexers. Reindex all continues. Error for indexer: DefaultIndexManager: paths: [/var/lib/jira/atlassian/application-data/jira/caches/indexesV2/comments, /var/lib/jira/atlassian/application-data/jira/caches/indexesV2/issues, /var/lib/jira/atlassian/application-data/jira/caches/indexesV2/changes, /var/lib/jira/atlassian/application-data/jira/caches/indexesV2/worklogs]
com.atlassian.jira.util.RuntimeIOException: org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/var/lib/jira/atlassian/application-data/jira/caches/indexesV2/comments/segments_xxxx")))
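  • For the "file mismatch" variant, it can also help to confirm whether every occurrence reports the same ID mismatch. A sketch of such a check (the log file name is only an example; adjust it to your environment):
$ grep -o 'expected id=[^,]*, got=[^ ]*' atlassian-jira.log | sort | uniq -c
  10591 expected id=3szs4k6fwb774nmibsoiezrps, got=2e4x84h22ssxc50oywtt09wfw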


Cause

The index restore fails because the node cannot read one or more files of the current index.

The exact footprint of the issue can vary depending on where the file mismatch is found. It can seem counter-intuitive that the process fails when corruption is detected in the existing index: when restoring from a good snapshot, the premise should be that something is wrong with the current index, so such errors ought to be ignored. However, this is where the bug lies: instead of discarding the unreadable files and continuing with the snapshot, the restore process aborts.

Additional Info:

File corruption is a complex and challenging topic. The bug mentioned above is very specific to Jira's index snapshot recovery and is only one aspect of a much broader concern. In particular, file corruption surfacing as a "CorruptIndexException: file mismatch" in a Lucene-based search engine can be especially difficult to trace to its root cause. Our research has focused specifically on this exception, and we offer the following notes in hopes of shedding some light on the matter.

When a "CorruptIndexException: file mismatch" exception occurs, it is typically due to a mismatch between the segments of an index, such that the data being read is not the same as the data that was previously written. This discrepancy is identified by the checksum mismatch e.g., "expected id=3szs4k6fwb774nmibsoiezrps, got=2e4x84h22ssxc50oywtt09wfw". While there are many possible reasons for this issue, including data corruption due to hardware or software failures, network connectivity problems, file system issues such as disk errors, or concurrent modifications by multiple threads or processes, the cause can be difficult to pin down definitively.

Even in the exception itself, Lucene suggests several possibilities:

 Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (8465308b). possibly transient resource issue, or a Lucene or JVM bug 

There are various potential causes of data corruption in a storage subsystem, including filesystem bugs, kernel bugs, drive firmware bugs, and incompatible RAID controllers. Additionally, faulty hardware such as RAM, a RAID controller, or a drive can also be responsible.

Although professional systems employ protective mechanisms to prevent corruption, it is impossible to guarantee 100% reliable data storage. Despite best efforts, silent corruption can occur without warning, making it nearly impossible to create a system that is completely immune to it.

Apache Lucene employs a straightforward yet effective method for identifying corruption that may have been missed by lower-level systems. Each relevant file in a Lucene index includes a CRC32 checksum in its footer, which is adept at detecting the random corruption that commonly occurs on disk. Because checksum verification requires reading each file in its entirety, it is time-consuming and resource-intensive, and is therefore not performed often. Certain situations, such as index segment merging and index recovery, can trigger Lucene to validate the checksum. Nevertheless, a checksum mismatch is a reliable indicator that the data Lucene read is not the data it previously wrote.
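As an illustration, checksum verification can also be triggered manually with Lucene's CheckIndex tool, run against the affected index directory while nothing is writing to it (i.e., with Jira stopped on the node, or against a copy of the directory). The jar location, Lucene version, and index path below are assumptions; adjust them to your installation:

$ # Example only - locate the lucene-core jar bundled with your Jira installation
$ java -cp /opt/atlassian/jira/atlassian-jira/WEB-INF/lib/lucene-core-7.3.0.jar \
    org.apache.lucene.index.CheckIndex \
    /atlassian/application-data/jira/caches/indexesV1/comments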

However, silent corruption poses a unique challenge, as it rarely results in any visible indication of corruption beyond a checksum mismatch. The CorruptIndexException is typically only triggered by an unusual event, such as the rare occasions when it is necessary to read an entire file; this provides an opportunity to verify the checksum and identify any corruption that may have occurred. Unfortunately, this method cannot determine the cause of the corruption or when it occurred, as it may have been introduced long before it was detected. Lucene's file writing and checksumming process is simple, sequential, and widely used, but the underlying system calls it employs are complex, concurrent, and variable. As a result, infrastructure-related factors are often the root cause of corruption and are beyond the application's control.


Solution

Setting aside the Lucene file corruption itself (see the Additional Info section), as an application owner you should still be able to restore the index snapshot to get the instance back on its feet. Until the bug is fixed, a couple of potential workarounds are:

  • Stop Jira on the node, then delete the folders below from the local node's caches/indexesV1 or caches/indexesV2 directory (depending on your Jira version and the error message), and restart Jira to trigger a snapshot restore (see the sketch after this list):

    • changes
    • comments
    • entities
    • issues
    • plugins
    • worklogs
  • Or, take the node out of the load balancer rotation and run a lock and re-index (foreground re-index) on it.
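Below is a minimal sketch of the first workaround on a Linux node. The install path, local home path, and indexesV1 layout are assumptions; adjust them to your environment and to whichever index directory (indexesV1 or indexesV2) appears in your error message. Moving the folders aside rather than deleting them keeps a copy in case further analysis is needed:

$ # Stop Jira on the affected node (the command depends on how Jira is installed/managed)
$ /opt/atlassian/jira/bin/stop-jira.sh
$ # Move the index sub-directories out of the local index cache
$ cd /atlassian/application-data/jira/caches/indexesV1
$ mkdir -p /tmp/index-backup && mv changes comments entities issues plugins worklogs /tmp/index-backup/
$ # Start Jira again; the node should restore the index from the snapshot on startup
$ /opt/atlassian/jira/bin/start-jira.sh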

