Understand the index process in Jira server
Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.
Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
How does JIRA indexing work?
JIRA uses Apache Lucene, a third-party search library, to power Dashboards, Issue Search, Reports, and Boards.
- Basic concept of Lucene:
- Documents:
- A Document is the unit of search and index. An index consists of one or more Documents.
- Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.
- A Lucene Document doesn't necessarily have to be a document in the common English usage of the word. For example, if you're creating a Lucene index of a database table of users, then each user would be represented in the index as a Lucene Document.
- Fields:
- A Document consists of one or more Fields. A Field is simply a name-value pair. For example, a Field commonly found in applications is title. In the case of a title Field, the field name is title and the value is the title of that content item.
- Indexing in Lucene thus involves creating Documents comprising one or more Fields, and adding these Documents to an IndexWriter.
- Searching:
- Searching requires an index to have already been built. Lucene achieves fast search responses because it searches an index rather than the text itself.
- This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
- Searching involves creating a Query (usually via a QueryParser) and handing this Query to an IndexSearcher, which returns a list of Hits.
- Queries:
- Lucene has its own mini-language for performing searches. Read more about the Lucene Query Syntax.
- The Lucene query language lets the user specify which field(s) to search on, give more weight to particular fields (boosting), combine clauses with boolean operators (AND, OR, NOT), and more.
- JIRA uses Lucene for its indexing. More specifically:
- When an issue is created or modified in JIRA, a Lucene Document is created that contains the fields from that issue as well as some additional calculated data that is useful for searching. Whenever JIRA indexes an issue, it passes the issue and its associated Lucene Document to the plugin, which can modify the Lucene Document however it likes before returning the names of any fields it added. That Lucene Document is then added to the Lucene index, replacing any previous Document for that issue in the index.
- Additional Lucene Documents are created for Issue Comments, Work Logs and Change History entries.
- Depending upon the changes made to the issue, none, some or all of the Issue Comments, Work Logs and Change History Documents may be regenerated, replacing previous Documents.
- The Lucene index may need to be regenerated if it becomes corrupt or if configuration changes mean the indexed data needs to be updated. The simplest example of a configuration change that requires a reindex is adding a calculated custom field, because the result of the calculation must be indexed before it can be used for searching. Adding or removing the visibility of a custom field from a project also requires a reindex. Strictly, only the affected projects need reindexing, but JIRA does not automatically detect which projects those are and simply recommends a full reindex.
- Configuration changes never require Comments, Work Logs and Change History to be reindexed, so a background reindex does not reindex these entries. (Some of them may still be reindexed for issues that are updated while the reindex is running, to keep the index correct.)
- The database is the canonical source of truth for JIRA, and the indexing process reads it with SQL to build the index. Custom fields, especially those supplied by plugins, may retrieve data in other ways and include it in the index. Keep in mind that retrieving data from a remote system can have a disastrous impact on indexing performance.
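The ideas above (Documents made of Fields, an inverted index, and a Document that replaces the previous one for the same issue) can be sketched in a few lines of Python. This is only an illustrative toy, not Lucene's or JIRA's actual implementation: the `TinyIndex` class, its method names, and the `DEMO-1` / `DEMO-2` issue keys are all hypothetical.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical): (field, term) -> set of issue keys."""

    def __init__(self):
        self.documents = {}               # issue key -> {field name: value}
        self.postings = defaultdict(set)  # (field, term) -> issue keys

    def _terms(self, fields):
        # Split every field value into lowercase terms (a very naive analyzer).
        for name, value in fields.items():
            for term in str(value).lower().split():
                yield (name, term)

    def index(self, key, fields):
        """Add a Document, replacing any previous Document for the same issue."""
        old = self.documents.get(key)
        if old is not None:
            for posting in self._terms(old):
                self.postings[posting].discard(key)
        self.documents[key] = dict(fields)
        for posting in self._terms(fields):
            self.postings[posting].add(key)

    def search(self, field, term):
        """Look the term up in the index instead of scanning every Document."""
        return sorted(self.postings.get((field, term.lower()), set()))


idx = TinyIndex()
idx.index("DEMO-1", {"summary": "Login page fails", "status": "Open"})
idx.index("DEMO-2", {"summary": "Search page is slow", "status": "Open"})
# Modifying DEMO-1 re-indexes it; the new Document replaces the old one.
idx.index("DEMO-1", {"summary": "Login page fails", "status": "Resolved"})

print(idx.search("summary", "page"))  # ['DEMO-1', 'DEMO-2']
print(idx.search("status", "open"))   # ['DEMO-2']
```

The second search no longer returns DEMO-1 because its earlier Document, including the old status term, was removed when the issue was re-indexed.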
How does Lucene use RAM?
- JIRA processes one issue at a time and uses a fairly small amount of memory while building the Lucene Documents.
- Lucene stores its index on disk in file segments that may be 1-2 GB in size, and it merges and/or restructures these segments as Documents are added and deleted. This usually uses large amounts of memory.
- When searching, large amounts of memory can also be used, particularly when sorting large result sets, although in the JIRA 5.x and JIRA 6.x time frames the memory requirements in these areas were greatly reduced for many search scenarios.
- For fast searching, Lucene loads certain data structures entirely into RAM:
- Field cache, which is used under-the-hood when you sort by a field, takes some amount of per-Document RAM depending on the field type (String is by far the worst). This is loaded the first time you sort on that field.
- Norms, which encode the a-priori Lucene Document boost computed at indexing time, including length normalization and any boosting the application does, consume 1 byte per searched field per Lucene Document. For example, if your application searches 3 different fields, such as body, title and abstract, then that requires 3 bytes of RAM per Lucene Document. These are loaded on demand the first time that field is searched.
- Deletions, if present, consume 1 bit per Lucene Document and are loaded during IndexReader construction.
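As a rough worked example of the per-Document figures above, the following Python sketch estimates the RAM taken by norms and deletions. The one-million-Document index size is a made-up figure for illustration, and the function names are hypothetical. The field cache is left out because its cost depends on the field type.

```python
def norms_bytes(num_documents, searched_fields):
    """Norms: 1 byte per searched field per Document."""
    return num_documents * searched_fields


def deletions_bytes(num_documents):
    """Deletions: 1 bit per Document, rounded up to whole bytes."""
    return (num_documents + 7) // 8


docs = 1_000_000  # hypothetical number of Lucene Documents
fields = 3        # e.g. body, title and abstract

print(norms_bytes(docs, fields))  # 3000000 bytes, roughly 2.9 MiB
print(deletions_bytes(docs))      # 125000 bytes, roughly 122 KiB
```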
How do I tell how big the index is?
Navigate to the JIRA_HOME directory and run the following (on Linux):
du -h caches/
This should return something like the following:
8.0K caches/indexesV2/plugins
244K caches/indexesV2/changes
124K caches/indexesV2/comments
416K caches/indexesV2/issues
8.0K caches/indexesV2/worklogs
56K caches/indexesV2/entities/searchrequest
56K caches/indexesV2/entities/portalpage
116K caches/indexesV2/entities
920K caches/indexesV2
4.0K caches/tmp_attachments
928K caches
In this case the index (caches/indexesV2) is 920K, and the caches directory as a whole is 928K.
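If du is not available (for example on Windows), a similar breakdown can be approximated with a short script. The sketch below is an assumption-laden alternative rather than an Atlassian-provided tool: the function names are hypothetical, and it sums apparent file sizes, so its numbers will differ somewhat from the disk usage that du reports.

```python
import os


def directory_size_bytes(root):
    """Sum the apparent sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double counting
                total += os.path.getsize(path)
    return total


def index_sizes(jira_home):
    """Return {subdirectory name: bytes} for each directory under caches/indexesV2."""
    index_root = os.path.join(jira_home, "caches", "indexesV2")
    return {
        entry.name: directory_size_bytes(entry.path)
        for entry in os.scandir(index_root)
        if entry.is_dir()
    }
```

Calling `index_sizes` with the path to your JIRA_HOME directory returns one entry per index directory, such as issues, comments, changes and worklogs.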
Which part of the process consumes the most resources?
- Generally speaking, building the Lucene Documents takes the most resources when indexing an issue, and reading data from the database can be a very substantial part of this. This is highly influenced by which plugins are installed and how they store and index data.
- The indexing process places high demands on three resource areas:
- CPU and memory
- Database
- File system holding the Lucene files
- Any of the above areas can become a bottleneck, for example long ping times or poor response from the database, or a slow file system (e.g. a NAS). If indexing performance is a concern, all of these should be looked at.
- Background reindexing uses just a single thread and in theory should trundle along in the background without significant impact on the rest of JIRA. On large systems under heavy load with a high rate of issue searches, however, the impact may be significant, and it may be wise to perform the re-index during a period of lower activity, e.g. in the middle of the night.
What type of contention, or locking / blocking is anticipated during reindexing?
- For background reindexing, none. JIRA will read all the issues in the system and re-build the index entries for each issue as it goes. The background reindex runs very much as if a user had just updated an issue, causing it to be reindexed.
- For a foreground reindex, total. The old index is deleted completely and built again from scratch.
- In a cluster, a background reindex will cause a small pause on secondary nodes when the full reindex is being swapped into place after being copied from the node that performed the reindex.
If I cancel the re-index, can I still use JIRA?
Lock JIRA and rebuild index - you can't stop this reindex. If you restart JIRA during this process, JIRA will not be usable, the index will be corrupted, and you will need to run the reindex again.
Background re-index - After you cancel a re-index, all issues already processed will be in their correct state and any not yet processed will be in the same state as before the re-index was started. If a re-index was not actually needed, starting and then cancelling a re-index will do no harm at all. If a configuration change means a re-index is required, then the re-index will still be required and should be allowed to run to completion at some time in the near future.
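The effect of cancelling a background re-index can be illustrated with a small simulation. This is a simplified model under stated assumptions, not JIRA's code: the index is a plain dictionary, and `background_reindex`, `cancel_after_two` and the issue keys are hypothetical names.

```python
def background_reindex(issues, index, build_document, should_cancel):
    """Re-index issues one at a time; stop early if should_cancel() is true.

    Returns the number of issues processed before stopping.
    """
    processed = 0
    for key, issue in issues.items():
        if should_cancel():
            break
        index[key] = build_document(issue)  # replaces the old entry in place
        processed += 1
    return processed


# Hypothetical data: the index holds entries built under an old configuration.
issues = {"DEMO-1": "a", "DEMO-2": "b", "DEMO-3": "c"}
index = {key: "old:" + value for key, value in issues.items()}

calls = {"n": 0}


def cancel_after_two():
    # Simulates an administrator cancelling after two issues were processed.
    calls["n"] += 1
    return calls["n"] > 2


done = background_reindex(issues, index, lambda v: "new:" + v, cancel_after_two)
print(done)   # 2
print(index)  # {'DEMO-1': 'new:a', 'DEMO-2': 'new:b', 'DEMO-3': 'old:c'}
```

The processed entries are up to date and the remaining one is unchanged, which is why a re-index that was genuinely required still has to be run to completion later.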