Content of larger Office files not searchable
Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.
Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Problem
For Office documents (*.xlsx, *.pptx, *.docx) that exceed certain size thresholds, Confluence can't extract the text, so their content doesn't appear in search results. Some of the attachment size limits for indexing are outlined here:
As per the knowledge base article linked above, extracting attachment content for indexing is memory intensive and can cause out-of-memory errors when large files are uploaded. The size limit is a safeguard built into Confluence to prevent this.
Diagnosis
- Set the following logging packages to DEBUG:
com.atlassian.confluence.internal.index.attachment
com.atlassian.confluence.internal.index
com.atlassian.confluence.search.lucene
com.atlassian.bonnie.search.extractor
- After reproducing the issue, the logs contain entries like the ones below, where Confluence reports that the document is too big for text extraction. This explains why the contents of these files are not searchable.
2021-03-06 18:25:56,021 DEBUG [attachment-text-extraction-worker-1] [internal.index.attachment.DefaultAttachmentExtractedTextManager] getContent Can't read extracted text of attachment 884741
2021-03-06 18:25:56,055 WARN [attachment-text-extraction-worker-1] [confluence.impl.hibernate.ConfluenceHibernateTransactionManager] doRollback Performing rollback. Transactions:
->[com.atlassian.confluence.internal.index.attachment.AttachmentTextExtractionFunction.apply]: PROPAGATION_REQUIRES_NEW,ISOLATION_DEFAULT (Session #674982162)
-- referer: http://localhost:8090/pages/resumedraft.action?draftId=884737&draftShareId=ca5fced8-89dd-4fff-8660-4aa3d2903ce3& | url: /rest/documentConversion/latest/conversion/thumbnail/results | traceId: e9c077ee7d66b117 | userName: admin
2021-03-06 18:25:56,056 DEBUG [Caesium-1-1] [search.lucene.extractor.AttachmentExtractedTextExtractor] addFields Error when extracting text for 884741
java.util.concurrent.CompletionException: java.lang.RuntimeException: com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of PowerPoint document: Document too big for text extraction, bailing out
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of PowerPoint document: Document too big for text extraction, bailing out
at com.atlassian.confluence.extra.officeconnector.index.AbstractAttachmentExtractor.extract(AbstractAttachmentExtractor.java:33)
at com.atlassian.confluence.internal.index.attachment.DelegatingAttachmentTextExtractor.lambda$extract$1(DelegatingAttachmentTextExtractor.java:35)
at java.util.Optional.flatMap(Optional.java:241)
at com.atlassian.confluence.internal.index.attachment.DelegatingAttachmentTextExtractor.extract(DelegatingAttachmentTextExtractor.java:35)
at com.atlassian.confluence.internal.index.attachment.AttachmentTextExtractionFunction.apply(AttachmentTextExtractionFunction.java:70)
at com.atlassian.confluence.internal.index.attachment.AttachmentTextExtractionFunction.apply(AttachmentTextExtractionFunction.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:295)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:98)
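To check a longer log for the same failure, the pattern in the excerpt above can be searched for automatically. The following is a minimal Python sketch, not an Atlassian tool: it assumes the log lines follow the format shown above (an "Error when extracting text for <id>" entry followed by a "Document too big for text extraction" exception line), and the function name `find_oversized_attachments` is made up for illustration.

```python
import re

# Both patterns are copied from the log excerpt above.
ERROR_LINE = re.compile(r"Error when extracting text for (\d+)")
TOO_BIG = "Document too big for text extraction"


def find_oversized_attachments(lines):
    """Return attachment IDs whose extraction failed because the file was too big."""
    oversized = []
    current_id = None  # ID from the most recent "Error when extracting text" entry
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            current_id = match.group(1)
            continue
        if current_id is not None and TOO_BIG in line:
            if current_id not in oversized:
                oversized.append(current_id)
            # Ignore the repeated "Caused by:" line of the same stack trace.
            current_id = None
    return oversized


if __name__ == "__main__":
    sample = [
        "2021-03-06 18:25:56,056 DEBUG [Caesium-1-1] "
        "[search.lucene.extractor.AttachmentExtractedTextExtractor] "
        "addFields Error when extracting text for 884741",
        "java.util.concurrent.CompletionException: java.lang.RuntimeException: "
        "com.atlassian.bonnie.search.extractor.ExtractorException: Error reading "
        "content of PowerPoint document: Document too big for text extraction, bailing out",
    ]
    print(find_oversized_attachments(sample))
```

To run it against a real log, pass the lines of the application log file (for example, `open(path)` for wherever your instance writes its log) instead of the sample list.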
Solution
You can override the thresholds by setting a Java system property and restarting your instance.
Note: The value of 6 MB is used here only as an example. Be sure to test the change in a lower (non-production) environment before rolling it out to Production. Also, be careful when increasing the limit: as noted earlier in this article, text extraction is a resource-intensive operation.
Extension | System prop key | Example
---|---|---
.docx | |
.xlsx | |
.pptx | |
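As a worked example of the 6 MB value from the note above, the sketch below converts megabytes to a byte count and formats it as a `-D` JVM argument. It rests on two assumptions that this article does not confirm: that the properties take a value in bytes, and that 1 MB means 1024 * 1024 bytes. The key `example.placeholder.key` is a made-up placeholder, not a real Confluence property; substitute the key from the "System prop key" column for your file type.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes to bytes, assuming 1 MB = 1024 * 1024 bytes."""
    return megabytes * 1024 * 1024


def jvm_flag(prop_key: str, megabytes: int) -> str:
    """Build a -D<key>=<bytes> JVM argument for the given property key."""
    return f"-D{prop_key}={mb_to_bytes(megabytes)}"


if __name__ == "__main__":
    # "example.placeholder.key" is a hypothetical placeholder, NOT a real property.
    print(jvm_flag("example.placeholder.key", 6))
    # prints: -Dexample.placeholder.key=6291456
```

The resulting argument would then be added to the JVM options of your instance before the restart.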