Document Contents Are Not Searchable

Symptoms

The following errors appear in the Confluence logs:

2012-06-29 14:41:00,327 WARN [scheduler_Worker-2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Error indexing attachment (Attachment: My_PDF_Examplem.pdf v.2 (8912924) admin)
com.atlassian.bonnie.search.extractor.ExtractorException: Error getting content of PDF document
        at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:66)
        at com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor.addFields(BaseAttachmentContentExtractor.java:40)
        at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:36)
        at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
        at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:97)
        at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
...
Caused by: java.io.IOException: Error: Expected an integer type, actual=''
        at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1310)
        at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:81)
        at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:449)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1112)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:591)
        at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:45)
        ... 30 more

Cause

Confluence cannot index some attachments. Either the affected files are corrupt, or Confluence is running out of memory (OOM) during the indexing task. In the trace above, the PDF parser fails while reading the document, which points to a malformed file.
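
To see which attachments are affected, the log can be scanned for the WARN line shown under Symptoms. The following is a minimal sketch, not an Atlassian tool: the class name FailedAttachmentScanner and its method are hypothetical, and the pattern assumes every failure is logged in exactly the format of the sample line above. The path to the log file is passed as an argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists attachments that failed to index, based on log lines.
public class FailedAttachmentScanner {

    // Assumes the WARN line format from the Symptoms section:
    // "... Error indexing attachment (Attachment: <name> v.<version> (<number>) <user>)"
    private static final Pattern FAILED = Pattern.compile(
            "Error indexing attachment \\(Attachment: (.+? v\\.\\d+ \\(\\d+\\))");

    // Returns each distinct "<name> v.<version> (<number>)" entry in first-seen order.
    static Set<String> findFailedAttachments(List<String> logLines) {
        Set<String> failed = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = FAILED.matcher(line);
            if (m.find()) {
                failed.add(m.group(1));
            }
        }
        return failed;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java FailedAttachmentScanner <path-to-log-file>");
            return;
        }
        // Reads the whole file as UTF-8; very large logs may need a streaming approach.
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        findFailedAttachments(lines).forEach(System.out::println);
    }
}
```

For the sample log line above, this prints My_PDF_Examplem.pdf v.2 (8912924). Each listed file can then be opened in a PDF viewer to check whether it is actually corrupt before deciding to disable attachment indexing.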

Workaround

  1. Disable indexing of attachments by following the instructions in How to disable indexing of attachments. Confluence will then stop indexing attachment contents, so those contents will no longer appear in search results. Attachment titles will still be indexed and searchable.
  2. Once attachment indexing is disabled, rebuild the content indexes from scratch.

Last modified on Sep 28, 2016
