How to troubleshoot and optimize the Full Reindex in Jira Data Center
Platform Notice: Data Center - This article applies to Atlassian products on the Data Center platform.
Note that this knowledge base article was created for the Data Center version of the product. Data Center knowledge base articles for non-Data Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
This article is meant to be a follow-up to How to increase the speed of full reindex in Jira Server and Data Center.
It offers some insights and techniques Admins can use to:
- Understand why Jira's Full Reindex is slow
- Optimize it so it takes full advantage of the infrastructure and parallelism to Reindex all the Issues as quickly as possible.
When a Jira node is performing a Full Reindex, it reports "maintenance" to the /status URL probe and doesn't process end-user requests, so it's desirable to have it finish as soon as possible, using as much of the available resources as possible.
This article was written based on Atlassian Support's experience and is provided as-is.
Going beyond this to further optimize the Full Reindex falls within the expertise and scope of Atlassian certified Solution Partners.
Known Reindex performance issues
Optimal reindex speed is observed on the latest Jira 9.4.x releases and on 9.12.6.
Issue | Summary | Affects | Fixed on |
---|---|---|---|
 | May increase up to 4× the Full Reindex time observed on Jira 8.20. | 9.0.0 – 9.4.5, 9.5.0 – 9.6.99 | 9.4.6 – 9.4.99, 9.7.0 – 9.99.99 |
 | May double the Full Reindex time observed on 9.4.6. | 9.5.0 – 9.12.5, 9.13.0 – 9.14.99 | 9.12.6 – 9.12.99, 9.15.0 – 9.99.99 |
The ".99" version is to illustrate "all that came after on this range", so "9.4.6 – 9.4.99" includes 9.4.8, 9.4.10, 9.4.16; and "9.0.0 – 9.4.5" includes 9.1.1, 9.3.2, 9.4.1, etc.
Environment
All versions of Jira Data Center 8 and 9. Both Jira Software and Jira Service Management.
Solution
Jira's Full Reindex is very I/O intensive: many Threads read a lot of data from the DB at the same time and write it to memory and disk. The main factors that can contribute to low Full Reindex performance are:
- Custom Field suboptimal configuration (expensive global context fields)
- Memory shortage on the Jira node (JVM Heap)
- DB underperformance
- Local storage underperformance (Index folders on the local Jira home folder)
To address that, we may follow these general steps:
- Check the relevant logs
- Confirm there is no memory (Heap) shortage on the JVM
- Assess and optimize Custom Fields contexts
- Increase the Index Threads and DB connections as needed
- Adjust Lucene RAM Buffer
- Check for filesystem underperformance
- Check for DB underperformance
- Increase the Reindex batch sizes
- Limiting factors
While these are general guidelines, your instance details (custom field types, amount and distribution of data, etc.) and infrastructure may be such that you have to revert some changes (because they weren't beneficial or were even harmful) or rely on just some of the advice instead of all of it. This requires several iterations of running a Full Reindex, analyzing the data, changing the configuration and repeating.
It is possible to further optimize beyond what's described here, but the config changes start affecting operations beyond the Full Reindex and may not grant benefits that justify the complexity added to the instance. Atlassian certified Solution Partners may be able to advise further through their professional services, though.
1. Relevant logs
After a Full Reindex there are a few log entries that are very useful for performance assessment and tuning:
grep -E -a -h " main | JiraTaskExecutionThread-[0-9]+ " application-logs/atlassian-jira.log* | grep -E "Reindex All starting|Reindex All COMPLETED|Canned Response.*FOREGROUND reindex all|ReindexAll took|Index backup started" | sort
(change the file path to the correct one; application-logs/atlassian-jira.log is the path inside an unzipped Support Zip)
2023-12-27 12:46:44,589-0500 JiraTaskExecutionThread-1 ··· Reindex All starting...
2023-12-27 12:46:44,693-0500 JiraTaskExecutionThread-1 ··· Canned Response 'REINDEX ALL EVENT' is about to start FOREGROUND reindex all
2023-12-27 12:46:44,884-0500 JiraTaskExecutionThread-1 ··· Canned Response REINDEX ALL EVENT has finished FOREGROUND reindex all
2023-12-27 18:19:43,620-0500 JiraTaskExecutionThread-1 ··· ReindexAll took: 19978969 ms in foreground, index size is 14 GB
2023-12-27 18:19:47,878-0500 JiraTaskExecutionThread-1 ··· Index backup started. Requesting node: ···, currentNode: ···
2023-12-27 18:20:00,152-0500 JiraTaskExecutionThread-1 ··· Reindex All COMPLETED without any errors. Total time: 19990862ms. Reindex run: 453
(··· is redacted info stripped out for better legibility)
The bulk of the Reindex, and what this article aims at improving, is the time between the "Canned Response finished" (line 3) and the "ReindexAll took" (line 4).
The time between lines 4 and 5 is what it takes to reindex "shared entities" like filters, dashboards, etc. Some customers with thousands of Filters and Dashboards may spend considerable time on this, but the bulk of the Reindex time should be between lines 3 and 4.
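As a quick worked example from the sample above: the foreground phase took 19978969 ms, i.e. 19978969 / 1000 / 3600 ≈ 5.55 hours (about 5 h 33 min), which matches the gap between the 12:46:44 and 18:19:43 timestamps on lines 3 and 4.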
2. Eliminate memory pressure
The Full Reindex reads from the DB, parses the data and stores it into the Lucene memory (which is shortly after flushed to disk). This means the Full Reindex makes intensive use of the "Eden" portion of the JVM Heap — a lot of very short lived objects.
We should load the GC logs into tools like GC Viewer (Best practices for performance troubleshooting tools) to confirm there's no Heap shortage and GC throughput is high (>= 97%) even during Full Reindex — and no Full GC is ever observed. A Full GC event during the Full Reindex will compromise the data we'd analyze and we need to get rid of it first:
- Restart a Jira node and kick off a Full Reindex right after it's online (even better if you can leave it out of the LB and access it through an alternate Tomcat port)
- If a Full GC still occurs, we may need to increase the Heap memory (first make sure the -XX:+UseG1GC JVM option is set)
If we have Thread dumps taken during the Full Reindex, check whether any GC Threads show high CPU: that indicates GC overhead, which needs to be addressed before we can assertively move forward.
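As a minimal sketch of where these settings usually live (the path and sizing below are illustrative assumptions, not recommendations; size the Heap based on your own GC analysis):

# <jira-install>/bin/setenv.sh (illustrative values)
JVM_MINIMUM_MEMORY="8192m"
JVM_MAXIMUM_MEMORY="8192m"
# Ensure the G1 collector is enabled
JVM_SUPPORT_RECOMMENDED_ARGS="-XX:+UseG1GC"

A Jira restart is required for changes to setenv.sh to take effect.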
3. Optimize Custom Fields context
Perhaps the most important piece of evidence from the Full Reindex is the "field indexing cost" report printed in the atlassian-jira.log after each successful Full Reindex.
This report lists the most expensive fields to reindex and provides two important pieces of information: the count (the number of Issues this field was indexed on) and the average reindex time.
Even if a Field isn't used in the Project, it's still being indexed, and if it's a calculated, dynamic or scripted field (each 3rd party app names them differently), it will greatly impact reindex time regardless.
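As an illustrative calculation (hypothetical numbers): a custom field indexed on 1,000,000 Issues at an average cost of 2 ms per Issue adds roughly 1,000,000 × 2 ms ≈ 33 minutes of indexing work for that single field; even split among 20 Index Threads, that's still more than a minute and a half of the total Reindex time.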
You may find out which custom fields these are by querying the DB:
select * from customfield where id in (14501, 14502, 14503, 14504);
Or by running the query from this article:
If "native" Jira fields rank high, we may suspect underlying underperformance in the filesystem or DB (addressed in the next items).
Actions
You may benefit from cleaning up the instance, especially by reducing the context of the most expensive custom fields first. The articles below may help you with this effort:
- Clean up your Jira instance
- Advanced cleanup
- How to list all fields and screens in use by a Project in Jira
Archiving Issues is always a good option to decrease the Full Reindex duration and even improve performance on other regular end-user operations in Jira that rely on Index searches.
4. Index Threads and DB connections
Jira's Full Reindex (and, since Jira 9, the Index delta catch-up on startup) is parallelized across a number of Index Threads we can configure. Background Reindex, on the other hand, is single-threaded.
Since the Full Reindex is not CPU intensive, we may benefit from having more Index Threads reading from the DB at once and writing to memory and disk (Lucene disk flush is single-threaded, though). The more we can parallelize, the better.
As a starting point, you can check the number of CPU cores available to Jira and set the Reindex Threads to double that amount:
<available-processors>24</available-processors>
This <available-processors> line is present inside the Support Zip: /application-properties/application.xml.
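A quick sketch for checking this value (paths are illustrative; adjust to your unzipped Support Zip or node):

# From an unzipped Support Zip
grep -o "<available-processors>[0-9]*</available-processors>" application-properties/application.xml
# Or directly on the Jira node
nproc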
Based on this, we could set the Index Threads to 50 (roughly double the 24 cores). We'd edit the jira-config.properties file located in the Jira home folder:
jira.index.issue.threads = 50
If the file doesn't exist, you can create it. If the property already exists, update it; otherwise, add it on a new line. See Edit the jira-config.properties file in Jira server for more.
Now we need enough DB connections to support this many Index Threads. On the dbconfig.xml file in the Jira home folder, update the pool-max-size to at least double the number of Index Threads:
<pool-max-size>100</pool-max-size>
Remember that since Jira 8.0 we advise at least 40 for the pool-max-size. It's OK to have more, though you may run into bottlenecks with less.
Depending on the instance (custom field types, etc.), we could even go beyond 2× the CPU cores, but generally this, along with the other tuning items below, is enough to considerably drop the Reindex time. Going beyond 2× the CPU requires a more detailed Thread dump and CPU load analysis.
5. Lucene RAM Buffer
Each Index Thread creates and stores a lot of Lucene Documents in memory during Reindex, and Lucene flushes this data to the disk every 5 minutes or when its RAM Buffer is full.
Lucene's RAM Buffer is 1024MB by default since Jira 8.0 and this may not be enough depending on how many Custom Fields you have and how much data you have stored in them.
If you notice messages like this in the atlassian-jira.log during the Full Reindex (they're only printed during the Full Reindex, actually), you should double its size:
2023-12-17 02:15:52,700-0500 ClusterMessageHandlerServiceThread:thread-1 WARN [c.a.jira.index.MonitoringIndexWriter] [lucene-stats] Detected frequent flushes (every 14 millis) of lucene index which is below warning limit: jira.index.warn.flush.min.interval.millis=1000 millis. This may affect the foreground indexing performance. Please visit https://confluence.atlassian.com/x/w0VwOQ for more information.
This message is printed if the average flush interval within the 5-minute window was below 1,000 milliseconds. The presence of this log message may indicate the RAM Buffer isn't big enough for the Reindex, and this may be compromising disk performance.
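A quick way to check for these warnings (path is illustrative; point it at your log folder or unzipped Support Zip):

grep -a -c "Detected frequent flushes" application-logs/atlassian-jira.log*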
We can increase it by updating or adding this to the jira-config.properties:
jira.index.batch.maxrambuffermb = 2048
This Buffer is taken from the JVM Heap, which is why it's important to confirm the Heap isn't already under too much pressure (topic #2).
Reducing the fields context or archiving Issues can also considerably contribute to easing the pressure on the Lucene RAM Buffer.
Increasing the number of Index Threads may also increase the pressure on Lucene RAM Buffer (as there will be more Threads filling up the Buffer at the same time).
6. Filesystem underperformance
Jira uses the local home folder for its Indexes, and the Full Reindex is very disk intensive. Improving the technology of the local storage will not only benefit the Full Reindex but also general performance in Jira, as every search, gadget, board, WebHook and many Automations rely on the Index as a data source.
This article on Troubleshooting performance with Jira Stats can help us here.
We'll look into the "timeToAddMillis" metric, which we expect to be around 1 ms. Anything above that indicates potential filesystem underperformance.
We can also download the JAR from Test disk access speed for a Java application, run it against the local Index folder, and compare the results with the benchmark in the article.
We can also parse the atlassian-jira.log for the relevant JIRA-STATS entries.
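A minimal sketch for this (assuming the stats lines carry the timeToAddMillis key mentioned above; adjust the path to your unzipped Support Zip):

grep -a "JIRA-STATS" application-logs/atlassian-jira.log* | grep -a "timeToAddMillis"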
7. Database underperformance
Database underperformance is a very common cause of Reindex underperformance, too.
First, the DBA team should monitor the DB's CPU during the Full Reindex and assess whether there are any maintenance or optimization tasks that could improve Jira's performance.
Again, the JIRA-STATS may help us identify potential DB issues:
We'd expect these stats to be around:
- getIssueVersionMillis: 5 ms
- incrementIssueVersionMillis: 10 ms
- latencyNanos: as low as possible (3 ms is already suspicious)
When DB performance impacts the Jira Reindex, we won't see any single query take much CPU or run for long; instead, an increase of just a few milliseconds per query cascades into a considerably longer overall duration. This article dives into the DB latency impact on Jira operations:
All these stats take into account: Java time + network time + DB time + network time + Java time. The latencyNanos stat is collected every minute and the query is select * from productlicense, which returns a small and stable amount of rows and data. So network time may interfere as well, and so may Full GC cycles (the stat may report a longer time because the Thread was paused so the JVM could free up Heap memory).
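As a rough, hypothetical illustration of how a few milliseconds cascade: if reindexing each Issue takes about 10 round trips to the DB and latency increases by just 2 ms, an instance with 1,000,000 Issues accumulates roughly 1,000,000 × 10 × 2 ms ≈ 5.5 hours of extra waiting; even spread across 50 Index Threads, that's still several extra minutes of Full Reindex time.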
If Jira's running on PostgreSQL or MS SQL Server, you'll need to perform some DB optimization tasks regularly.
For PostgreSQL:
VACUUM FULL;
ANALYZE VERBOSE;
REINDEX DATABASE <jira-db-name>;
See Optimize and Improve PostgreSQL Performance with VACUUM, ANALYZE, and REINDEX.
For MS SQL Server:
- Daily for hot tables: UPDATE STATISTICS <table.name>;
- Weekly for the whole Jira DB: UPDATE STATISTICS with fullscan;
Please don't underestimate or disregard this: we keep receiving consistent reports of Full Reindex times dropping by 50% or more when these exact commands are executed (even in cases where autovacuum is enabled).
8. Increase the Reindex batch sizes
Jira's Full Reindex works by having a single Thread load a batch of Issues from the DB into memory (just their Ids) and spawn a number of Index Threads to consume this batch in parallel (going to the DB for data and creating the Index documents). The Index Threads consume this batch by fetching a number of Issues from it at every iteration (50 is the default value), until there are no more Issues left in the batch. Then they die or sleep until the batch is replenished again.
By design, the single "producer" Thread waits until the loaded batch is depleted and all Index Threads have completed their work. This may lead to excessive waiting time, while we could have had the Index Threads working for longer if a bigger batch was loaded every time, i.e. incur this waiting time every 40,000 Issues instead of every 4,000.
This waiting time happens every so often, and we can make better use of it by bumping the batch sizes to benefit more from the parallel phase of the Reindex and rely less on the synchronized or single-threaded phase.
We can update or add these lines into the jira-config.properties:
jira.index.background.batch.size = 40000
jira.index.issue.maxqueuesize = 40000
jira.index.sharedentity.maxqueuesize = 40000
These 3 configs should always have the same number. Their default is 4000.
You can start with 10× that, 40000, and bump it to 80000 or more until you notice no more benefit from increasing it (see the exact formula below). The tradeoff is that more Heap memory will be consumed during the Full Reindex, and the pressure on the DB and filesystem (Lucene RAM Buffer) will also increase, as there will be more Index Threads fetching data and processing it in parallel.
A good starting point is 40,000 (10× the default) and you can increase it by tweaking the "iteration" variable on this equation:
Index Threads × 50 × Iterations = Batch size
The more iterations we allow, the more we'll rely on the Index Threads parallel processing and the more we'll demand from all resources: CPU, Heap memory, disk, DB CPU and network.
The defaults on Jira 8 and 9, as an example, are: 20 × 50 × 4 = 4000. Four iterations seems too low for most instances (we have JRASERVER-76819 to increase this), so you can start off with 10 or 20 iterations and adjust from there at every Full Reindex execution.
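For example, plugging in the numbers used earlier in this article: with 50 Index Threads and a batch size of 40,000, the equation gives 50 × 50 × Iterations = 40,000, i.e. 16 iterations per batch; bumping the batch size to 80,000 would allow 32 iterations.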
9. Limiting factors
While we may want the Full Reindex to finish as quickly as possible, optimizing it this much will likely put pressure on the node CPU, JVM Heap memory, local storage, Database load and even the network (node to database). If you start noticing bottlenecks or performance degradation as a side effect on other nodes, or in the Full Reindex duration itself, you may revert the most recent changes and try others, if unable to improve the infrastructure.