How to disable indexing of attachments
Indexing large Microsoft Excel or PowerPoint documents can sometimes cause problems, and reindexing may log Unknown Ptg warning messages, which are harmless. There is already a request to suppress these warnings, which come from the POI library when it re-indexes unreadable documents.
The error is usually not serious, but it can cause problems when large attachments are used, so you may want to disable indexing for a particular attachment type.
To do this, use one of the methods described below.
In Confluence 6.2.2 we made some changes to protect your site from out of memory errors while indexing large attachments, including a configurable file size check that runs before text extraction and indexing begin. Before disabling attachment indexing completely, see Configuring Attachment Size to find out how this works, as you may be able to adjust the limits to suit your site.
Method 1: Using the Administration Console
- Go to Confluence Admin > Manage Add-ons.
- Toward the middle of the screen is a pulldown menu that probably says User Installed. Change it to All Add-ons.
- Scroll down to Attachment Extractors under System Add-ons.
- Expand Attachment Extractors.
- Click the + sign next to "1 of 1 modules enabled".
- Hover over the PDF Content Extractor and a disable button will appear.
- Click the disable button.
- Scroll down to the Office Connector plugin.
- Expand the Office Connector plugin.
- Expand "x out of x modules enabled".
- Disable the following modules:
  - Word Content Extractor
  - Word XML Content Extractor
  - Powerpoint 97 Content Extractor
  - Powerpoint 2007 Content Extractor
  - Excel 97 Content Extractor
  - Excel 2007 Content Extractor
Search will then ignore the contents of all attachments whose type corresponds to a disabled module.
Please note that the bundled modules are enabled again after a restart. For a more permanent solution, use Method 2.
Method 2: Editing the atlassian-plugin.xml files of plugins
You need to modify the content of the atlassian-plugin.xml file in the following JAR files and comment out the relevant file type extractor:
- confluence-attachment-extractors-x.x.jar (for PDF)
- OfficeConnector-x.x.jar (for Office files)
Both of these JAR files are located in the
If you are unfamiliar with modifying JAR files, please refer to the How to edit files in Confluence JAR files document for further information.
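Because a JAR file is a ZIP archive, the edit can also be scripted. The following is a minimal Python sketch, not an official Atlassian tool: the function name write_patched_jar is hypothetical, and it assumes that atlassian-plugin.xml sits at the root of the archive and that the JAR is not signed. It writes a patched copy and leaves the original JAR untouched, so you can keep it as a backup.

```python
import zipfile


def write_patched_jar(src_jar, dst_jar, new_descriptor,
                      entry_name="atlassian-plugin.xml"):
    """Copy src_jar to dst_jar, replacing one entry with new_descriptor.

    Hypothetical helper: assumes the plugin descriptor is stored at the
    archive root under entry_name, and that the JAR is unsigned.
    """
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == entry_name:
                data = new_descriptor          # the edited descriptor
            else:
                data = src.read(info.filename)  # every other entry, unchanged
            dst.writestr(info, data)
```

Stop Confluence and back up the original JAR before swapping in the patched copy.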
You can identify file type extractors in atlassian-plugin.xml files by the occurrence of ContentExtractor in their extractor elements (see the key and class attributes in the example below). When the ContentExtractor for a file type is disabled, all files of that type become unsearchable.
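To check which extractors are currently active in a descriptor, a short Python sketch such as the one below can help. It is an illustration only: the function name active_extractor_keys is hypothetical, and it assumes extractors are declared as extractor elements whose key attribute contains ContentExtractor. Extractors that are already commented out are skipped, because the standard XML parser ignores comments.

```python
import xml.etree.ElementTree as ET
from typing import List


def active_extractor_keys(xml_text: str) -> List[str]:
    """Return the keys of all uncommented <extractor> elements.

    Assumes (hypothetically) that every file type extractor has a key
    attribute containing the text "ContentExtractor".
    """
    root = ET.fromstring(xml_text)
    return [
        element.get("key")
        for element in root.iter("extractor")
        if "ContentExtractor" in (element.get("key") or "")
    ]
```

Pass in the text of an atlassian-plugin.xml file; any key that is missing from the result is either commented out or not defined in that file.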
The example below shows the pdfContentExtractor commented out, which prevents PDF attachments from being indexed.
<atlassian-plugin key="com.atlassian.confluence.plugins.attachmentExtractors" name="Attachment Extractors">
    <plugin-info>
        <description>This plugin extracts searchable text from various attachment types.</description>
        <version>1.1</version>
        <vendor name="Atlassian Pty Ltd" url="http://www.atlassian.com/"/>
    </plugin-info>
    <!--
    <extractor name="PDF Content Extractor" key="pdfContentExtractor" class="com.atlassian.bonnie.search.extractor.PdfContentExtractor" priority="1100">
        <description>Indexes contents of PDF files</description>
    </extractor>
    -->
</atlassian-plugin>
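The same edit can be made programmatically. The sketch below is one possible approach rather than an official method: the function name comment_out_extractor is hypothetical, it relies on a regular expression rather than a full XML parser, and it assumes the extractor element does not itself contain a double hyphen (which is not allowed inside an XML comment).

```python
import re


def comment_out_extractor(xml_text: str, key: str) -> str:
    """Wrap the <extractor> element with the given key in an XML comment.

    Hypothetical helper: returns the text unchanged if the element is
    already commented out, and raises ValueError if no element matches.
    """
    pattern = re.compile(
        r'<extractor\b[^>]*\bkey="' + re.escape(key) + r'"[^>]*?'
        r'(?:/>|>.*?</extractor>)',
        re.DOTALL,
    )
    match = pattern.search(xml_text)
    if match is None:
        raise ValueError(f"no <extractor> with key {key!r} found")
    before = xml_text[:match.start()]
    # The element is already inside a comment if the most recent "<!--"
    # before it has not been closed by a "-->".
    if before.rfind("<!--") > before.rfind("-->"):
        return xml_text
    return before + "<!-- " + match.group(0) + " -->" + xml_text[match.end():]
```

For example, calling comment_out_extractor(descriptor_text, "pdfContentExtractor") on the uncommented descriptor produces the equivalent of the example above.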
The following table shows the file type extractors in the atlassian-plugin.xml of the OfficeConnector-x.x.jar file, which need to be commented out to prevent indexing:
Type of attachment
File Type Extractor
Word 97/2007 (
PowerPoint 97 (
PowerPoint 2007 (
Excel 97 (
Excel 2007 (