Content of larger Office files not searchable

Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.

Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Problem

For Office documents (*.xlsx, *.pptx, *.docx) that exceed certain size thresholds, Confluence does not extract the text, so their content cannot be found in searches. Some of the attachment size limits for indexing are outlined in a related knowledge base article.

As the knowledge base article referenced above explains, extracting attachment content for indexing is memory-intensive and can cause out-of-memory errors when large files are uploaded. The size limit is a safeguard built into Confluence to prevent this.

Diagnosis

  • Set the logging level of the following packages to DEBUG:
com.atlassian.confluence.internal.index.attachment
com.atlassian.confluence.internal.index
com.atlassian.confluence.search.lucene
com.atlassian.bonnie.search.extractor
  • After reproducing the issue, look for log entries like the ones below. Confluence reports that the document is too big for text extraction, which explains why the contents of these files are not searchable.
2021-03-06 18:25:56,021 DEBUG [attachment-text-extraction-worker-1] [internal.index.attachment.DefaultAttachmentExtractedTextManager] getContent Can't read extracted text of attachment 884741
2021-03-06 18:25:56,055 WARN [attachment-text-extraction-worker-1] [confluence.impl.hibernate.ConfluenceHibernateTransactionManager] doRollback Performing rollback. Transactions:
  ->[com.atlassian.confluence.internal.index.attachment.AttachmentTextExtractionFunction.apply]: PROPAGATION_REQUIRES_NEW,ISOLATION_DEFAULT (Session #674982162)
 -- referer: http://localhost:8090/pages/resumedraft.action?draftId=884737&draftShareId=ca5fced8-89dd-4fff-8660-4aa3d2903ce3& | url: /rest/documentConversion/latest/conversion/thumbnail/results | traceId: e9c077ee7d66b117 | userName: admin
2021-03-06 18:25:56,056 DEBUG [Caesium-1-1] [search.lucene.extractor.AttachmentExtractedTextExtractor] addFields Error when extracting text for 884741
java.util.concurrent.CompletionException: java.lang.RuntimeException: com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of PowerPoint document: Document too big for text extraction, bailing out
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of PowerPoint document: Document too big for text extraction, bailing out
    at com.atlassian.confluence.extra.officeconnector.index.AbstractAttachmentExtractor.extract(AbstractAttachmentExtractor.java:33)
    at com.atlassian.confluence.internal.index.attachment.DelegatingAttachmentTextExtractor.lambda$extract$1(DelegatingAttachmentTextExtractor.java:35)
    at java.util.Optional.flatMap(Optional.java:241)
    at com.atlassian.confluence.internal.index.attachment.DelegatingAttachmentTextExtractor.extract(DelegatingAttachmentTextExtractor.java:35)
    at com.atlassian.confluence.internal.index.attachment.AttachmentTextExtractionFunction.apply(AttachmentTextExtractionFunction.java:70)
    at com.atlassian.confluence.internal.index.attachment.AttachmentTextExtractionFunction.apply(AttachmentTextExtractionFunction.java:22)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:295)
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:98)


  • We also have a known issue: CONFSERVER-58824
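
When the log is large, the entries described above can be picked out with a small script. The following is a minimal sketch, not an Atlassian tool: the function name `find_oversized_attachments` is hypothetical, and the two patterns are taken from the sample log entries in this article, so other Confluence versions may word the messages differently.

```python
import re

# Patterns copied from the sample log entries in this article (assumption:
# your Confluence version logs the same wording).
ID_PATTERN = re.compile(r"Error when extracting text for (\d+)")
TOO_BIG_PATTERN = re.compile(
    r"Error reading content of (\w+) document: "
    r"Document too big for text extraction"
)


def find_oversized_attachments(lines):
    """Return (attachment_id, document_type) pairs for skipped attachments."""
    results = []
    pending_id = None
    for line in lines:
        id_match = ID_PATTERN.search(line)
        if id_match:
            # Remember the attachment ID; the reason follows in the exception.
            pending_id = id_match.group(1)
            continue
        big_match = TOO_BIG_PATTERN.search(line)
        if big_match and pending_id is not None:
            results.append((pending_id, big_match.group(1)))
            # Reset so the repeated "Caused by:" line is not counted twice.
            pending_id = None
    return results
```

Passing the lines of the application log (typically `atlassian-confluence.log` in the logs folder of the Confluence home directory) returns the IDs of the attachments that were skipped, together with the document type named in the exception.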


Solution

You can override the thresholds by setting a Java system property and restarting your instance.

Note: The examples below use a value of roughly 6 MB. Test the change in a lower (non-production) environment before rolling it out to Production. Be mindful when increasing the limit: as noted above, text extraction is a resource-intensive operation.

Extension   System property key                              Example
.docx       officeconnector.textextract.word.docxmaxsize     -Dofficeconnector.textextract.word.docxmaxsize=6052413
.xlsx       officeconnector.excel.extractor.maxlength        -Dofficeconnector.excel.extractor.maxlength=6052413
.pptx       officeconnector.powerpoint.extractor.maxlength   -Dofficeconnector.powerpoint.extractor.maxlength=6052413
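
To avoid typos when overriding more than one limit, the flags can be generated from the property keys listed above. This is a minimal sketch under stated assumptions: the helper name `jvm_flags` is hypothetical, and the value is treated as a plain positive integer. The article does not state the unit; 6052413 being roughly 6 MB suggests bytes.

```python
# Property keys copied from this article; nothing else is assumed about
# how Confluence interprets them.
PROPERTY_KEYS = {
    ".docx": "officeconnector.textextract.word.docxmaxsize",
    ".xlsx": "officeconnector.excel.extractor.maxlength",
    ".pptx": "officeconnector.powerpoint.extractor.maxlength",
}


def jvm_flags(limits):
    """Build a space-separated string of -D flags.

    `limits` maps a file extension (".docx", ".xlsx", ".pptx") to the
    new limit as a positive integer.
    """
    flags = []
    for extension, value in limits.items():
        if extension not in PROPERTY_KEYS:
            raise ValueError(f"No known property key for {extension!r}")
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"Limit for {extension} must be a positive integer")
        flags.append(f"-D{PROPERTY_KEYS[extension]}={value}")
    return " ".join(flags)


print(jvm_flags({".pptx": 6052413}))
```

The resulting string is then added to the JVM arguments of the instance. On Linux installations this is commonly done through `CATALINA_OPTS` in `bin/setenv.sh`, but check Atlassian's documentation on configuring system properties for your platform.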

Description: Unable to search content in PPT files when the file uses design templates or contains images
Product: Confluence




Last modified on Dec 7, 2021
