How to exclude extracted text from attachments directory backup

This guide only applies to Confluence Server or Data Center 6.5 and later.

Purpose

When a file is uploaded to Confluence, its text is extracted and indexed so that people can search for the content of a file, not just the filename. From Confluence 6.5, we store this extracted text in the filesystem alongside the attached file, so that when that file needs to be reindexed (for example, when the page it's attached to changes), we don't need to re-extract the content of the file. We only re-extract the content when a new version of the file is uploaded, and we store extracted text only for the latest version of the file, not for earlier versions.

The files containing the extracted text are generally quite small, but over time this can add up to a lot of additional files, and increase the total size of the attachments directory backup (part of your home / shared home directory). For this reason, you might want to exclude these files when backing up your attachments directory.
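Before deciding whether to exclude these files, it can help to see how many there are and how much space their content takes up. The following is a minimal sketch, not an official Atlassian tool: the function names and the ATTACHMENTS_DIR variable are hypothetical, and the default path is an assumption that you should adjust to your own home / shared home directory.

```shell
#!/bin/sh
# Assumed location of the attachments directory; override via the environment.
ATTACHMENTS_DIR="${ATTACHMENTS_DIR:-./shared/attachments}"

# Hypothetical helper: print the number of extracted text files under directory $1.
count_extracted() {
    find "$1" -type f -name '*.extracted_text' | wc -l | tr -d ' '
}

# Hypothetical helper: print the total content size, in bytes, of those files.
# This reads every matching file, which is acceptable because they are small.
bytes_extracted() {
    find "$1" -type f -name '*.extracted_text' -exec cat {} + | wc -c | tr -d ' '
}

# Only report when the assumed directory actually exists.
if [ -d "$ATTACHMENTS_DIR" ]; then
    echo "extracted text files: $(count_extracted "$ATTACHMENTS_DIR")"
    echo "total bytes:          $(bytes_extracted "$ATTACHMENTS_DIR")"
fi
```

The byte total reflects file content, not disk usage, so the space actually consumed on disk may be somewhat higher because of filesystem block sizes.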

Solution

To exclude these files from your backup, you can rely on the file extension, which is always .extracted_text. For example, the following Unix shell command backs up the attachments directory without including the extracted text files.

$ tar -czf attachments.tar.gz --exclude '*.extracted_text' ./shared/attachments
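To confirm that the exclusion worked, you can list the archive contents and count any entries that end in .extracted_text. The sketch below wraps the backup command above in a function and adds that check. The function names, the ATTACHMENTS_DIR and ARCHIVE variables, and their default values are hypothetical, and the script assumes a tar that supports --exclude (GNU tar and bsdtar both do).

```shell
#!/bin/sh
# Assumed paths; override via the environment to match your installation.
ATTACHMENTS_DIR="${ATTACHMENTS_DIR:-./shared/attachments}"
ARCHIVE="${ARCHIVE:-attachments.tar.gz}"

# Hypothetical helper: archive directory $1 into $2, skipping extracted text files.
backup_attachments() {
    tar -czf "$2" --exclude='*.extracted_text' "$1"
}

# Hypothetical helper: print how many archive entries are extracted text files.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_extracted_in_archive() {
    tar -tzf "$1" | grep -c '\.extracted_text$' || true
}

# Only run the backup when the assumed directory actually exists.
if [ -d "$ATTACHMENTS_DIR" ]; then
    backup_attachments "$ATTACHMENTS_DIR" "$ARCHIVE"
    echo "extracted text files in archive: $(count_extracted_in_archive "$ARCHIVE")"
fi
```

A count of 0 indicates that no extracted text files were included in the backup.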

The extracted text files are also not included when you perform a space or site export.