How to disable indexing of attachments

Purpose

Confluence can run into problems when indexing large Microsoft Excel or PowerPoint documents, and reindexing may log Unknown Ptg warning messages, which are harmless. There is already a request to suppress these warnings, which the POI library emits when re-indexing unreadable documents.

These errors are usually not serious, but they can cause problems when large attachments are involved, so you may want to disable indexing for a particular attachment type.

To do this, you can use one of the methods described below.

In Confluence 6.2.2 we made some changes to protect your site from out of memory errors while indexing large attachments, including a configurable file size check that runs before text extraction and indexing begin. See Configuring Attachment Size to find out how this works before disabling attachment indexing completely, as you may be able to adjust the limits to suit your site.

 

Solution

Method 1: Using the Administration Console

  1. Go to Confluence Admin > Manage Add-ons.
  2. Near the middle of the screen is a drop-down menu that is probably set to User Installed. Change it to All Add-ons.
  3. Scroll down to Attachment Extractors under System Add-ons
    1. Expand Attachment Extractors
    2. Click the + sign next to "1 of 1 modules enabled"
    3. Hover over the PDF Content Extractor and a disable button will appear. 
    4. Click the disable button.
  4. Scroll down to Office Connector plugin
    1. Expand Office Connector plugin
    2. Expand x out of x modules enabled
    3. Disable the following modules:
      1. Word Content Extractor
      2. Word XML Content Extractor
      3. PowerPoint 97 Content Extractor
      4. PowerPoint 2007 Content Extractor
      5. Excel 97 Content Extractor
      6. Excel 2007 Content Extractor

Search will then ignore the contents of attachments whose type corresponds to a disabled module.

Please note that the bundled modules will be enabled again after a restart. For a more permanent solution, use Method 2.

Method 2: Editing the atlassian-plugin.xml files of plugins

You need to modify the content of the atlassian-plugin.xml file in the following JAR files and comment out the relevant file type extractor:

  • confluence-attachment-extractors-x.x.jar (for PDF) or
  • OfficeConnector-x.x.jar (for Office files)

Both of these JAR files are located in the confluence\WEB-INF\atlassian-bundled-plugins directory.

If you are unfamiliar with modifying JAR files, please refer to the How to edit files in Confluence JAR files document for further information.

You can identify file type extractors in atlassian-plugin.xml files by the occurrence of ContentExtractor in their key attribute.
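
To check which extractors a plugin JAR currently declares before you edit it, you can read the descriptor straight from the archive. The following is a minimal Python sketch, not an Atlassian tool. It assumes atlassian-plugin.xml sits at the root of the JAR, and the function name list_content_extractors is hypothetical. Extractors that are already commented out are not reported, because the XML parser skips comments.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET


def list_content_extractors(jar_path):
    """Return (key, name) pairs for active extractor modules in a plugin JAR.

    jar_path may be a filesystem path or a file-like object.
    Assumption: atlassian-plugin.xml is stored at the root of the JAR.
    """
    with zipfile.ZipFile(jar_path) as jar:
        with jar.open("atlassian-plugin.xml") as descriptor:
            root = ET.parse(descriptor).getroot()
    # Keep only <extractor> elements whose key mentions ContentExtractor;
    # commented-out elements never reach the parsed tree.
    return [
        (element.get("key"), element.get("name"))
        for element in root.iter("extractor")
        if "ContentExtractor" in (element.get("key") or "")
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to the JAR as the first command-line argument.
    for key, name in list_content_extractors(sys.argv[1]):
        print(f"{key}: {name}")
```

Run it with the path to a copy of the JAR as its only argument. Each printed key is a candidate to comment out in atlassian-plugin.xml.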

Once the ContentExtractor for a file type is disabled, all files of that type become unsearchable.

The example below shows the pdfContentExtractor commented out, which prevents PDF attachments from being indexed.

<atlassian-plugin key="com.atlassian.confluence.plugins.attachmentExtractors" name="Attachment Extractors">
    <plugin-info>
        <description>This plugin extracts searchable text from various attachment types.</description>
        <version>1.1</version>
        <vendor name="Atlassian Pty Ltd" url="http://www.atlassian.com/"/>
    </plugin-info>

    <!--
    <extractor name="PDF Content Extractor" key="pdfContentExtractor" class="com.atlassian.bonnie.search.extractor.PdfContentExtractor" priority="1100">
        <description>Indexes contents of PDF files</description>
    </extractor>
    -->

</atlassian-plugin>
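
If you need to repeat this edit after each upgrade, the commenting-out step can be scripted. The Python sketch below is only an illustration under stated assumptions. The helper name comment_out_extractor is hypothetical. It expects the extractor to be written with separate opening and closing tags, as in the example above, and it assumes the element contains no "--" sequence, which is not allowed inside an XML comment. It works on the text of atlassian-plugin.xml after you have extracted the file from the JAR.

```python
import re


def comment_out_extractor(xml_text, key):
    """Return xml_text with the <extractor> element for `key` wrapped in a comment.

    Assumptions: the element uses an explicit </extractor> closing tag and
    contains no "--" sequence. Text that is already commented out is
    returned unchanged.
    """
    pattern = re.compile(
        r'([ \t]*)(<extractor\b[^>]*\bkey="' + re.escape(key) + r'"[^>]*>.*?</extractor>)',
        re.DOTALL,
    )
    match = pattern.search(xml_text)
    if match is None:
        raise ValueError(f"no extractor with key {key!r} found")

    # Leave the text alone if the element already sits inside a comment,
    # because XML comments cannot be nested.
    for comment in re.finditer(r"<!--.*?-->", xml_text, re.DOTALL):
        if comment.start() < match.start() < comment.end():
            return xml_text

    indent, element = match.group(1), match.group(2)
    replacement = f"{indent}<!--\n{indent}{element}\n{indent}-->"
    return xml_text[:match.start()] + replacement + xml_text[match.end():]
```

For example, comment_out_extractor(text, "pdfContentExtractor") produces a descriptor equivalent to the one shown above. Write the result back to atlassian-plugin.xml and repackage the JAR as described in How to edit files in Confluence JAR files.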

The following table shows the file type extractors in the atlassian-plugin.xml of the OfficeConnector-x.x.jar file, which require commenting out to prevent indexing:

Type of attachment

File Type Extractor

Word 97/2007 (.doc and .docx)

<extractor name="Word Content Extractor" key="wordContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.word.WordTextExtractor" priority="1099">
    <description>Indexes contents of Word 97/2007 files</description>
</extractor>

PowerPoint 97 (.ppt)

<extractor name="PowerPoint 97 Content Extractor" key="ppt97ContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.powerpoint.PowerPointTextExtractor" priority="1099">
    <description>Indexes contents of PowerPoint 97 files</description>
</extractor>

PowerPoint 2007 (.pptx)

<extractor name="PowerPoint 2007 Content Extractor" key="ppt2k7ContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.powerpoint.PowerPointXMLTextExtractor" priority="1099">
    <description>Indexes contents of PowerPoint 2007 files</description>
</extractor>

Excel 97 (.xls)

<extractor name="Excel 97 Content Extractor" key="excel97ContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.excel.ExcelTextExtractor" priority="1099">
    <description>Indexes contents of Excel 97 files</description>
</extractor>

Excel 2007 (.xlsx)

<extractor name="Excel 2007 Content Extractor" key="excel2k7ContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.excel.ExcelXMLTextExtractor" priority="1099">
    <description>Indexes contents of Excel 2007 files</description>
</extractor>
Last modified on Aug 3, 2017
