Document Contents Are Not Searchable
Symptoms
The following errors appear in the logs:
2012-06-29 14:41:00,327 WARN [scheduler_Worker-2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Error indexing attachment (Attachment: My_PDF_Examplem.pdf v.2 (8912924) admin)
com.atlassian.bonnie.search.extractor.ExtractorException: Error getting content of PDF document
at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:66)
at com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor.addFields(BaseAttachmentContentExtractor.java:40)
at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:36)
at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:97)
at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
...
Caused by: java.io.IOException: Error: Expected an integer type, actual=''
at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1310)
at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:81)
at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:449)
at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1112)
at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:591)
at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:45)
... 30 more
Cause
Confluence cannot index some attachments. The affected files may be corrupt, or Confluence may be running out of memory (OOM) during the indexing task.
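To check whether the failures are limited to a few possibly corrupt files, it can help to list which attachments the log complains about. The following is a minimal sketch, not an official Atlassian tool. It assumes the WARN line has the same layout as the example in the Symptoms section, and it treats the number in parentheses as the attachment's ID. The function name failed_attachments and the log file name in the usage comment are illustrative assumptions.

```python
import re
import sys
from collections import Counter

# Pattern modelled on the WARN line shown under Symptoms. The exact layout
# may differ between Confluence versions (assumption).
ATTACHMENT_ERROR = re.compile(
    r"Error indexing attachment \(Attachment: "
    r"(?P<name>.+?) v\.(?P<version>\d+) \((?P<id>\d+)\)"
)


def failed_attachments(log_lines):
    """Count indexing failures per (file name, version, id) tuple."""
    failures = Counter()
    for line in log_lines:
        match = ATTACHMENT_ERROR.search(line)
        if match:
            key = (match["name"], int(match["version"]), int(match["id"]))
            failures[key] += 1
    return failures


if __name__ == "__main__":
    # Hypothetical usage: python failed_attachments.py atlassian-confluence.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (name, version, att_id), count in failed_attachments(log).most_common():
            print(f"{count:4d}  {name} (v.{version}, id {att_id})")
```

If the same few files account for every failure, those files are likely candidates for corruption. Failures spread across many unrelated attachments point more towards a memory problem.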
Workaround
- Disable attachment indexing by following the instructions in How to disable indexing of attachments. Confluence will then stop indexing attachment contents, so those contents will no longer appear in search results. Attachment titles, however, will still be indexed and searchable.
- After disabling attachment indexing, rebuild the content indexes from scratch.
Last modified on Sep 28, 2016