SAXException error when running Content Anonymizer for Confluence
Summary
Atlassian may request an XML backup to troubleshoot bugs in Confluence. To keep customer data from leaking, the Content Anonymizer tool can be used to clean the backup data (entities.xml). However, some special characters may cause a SAXException during cleaning.
For example, a special character (code 55357, i.e. 0xD83D, the first half of the UTF-16 surrogate pair that encodes a smiling-face emoji) caused the error below.
$ java -jar confluence-export-cleaner-1.1-jar-with-dependencies.jar entities.xml cleaned.xml
2021-04-14 21:40:12,157 INFO Starting to clean export file 'entities.xml'. This may take a few minutes.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXException: Cannot output character with code 55357 in the encoding UTF-8' within a CDATA section javax.xml.transform.TransformerException: Cannot output character with code 55357 in the encoding UTF-8' within a CDATA section
Cause
The Anonymizer tool cannot handle certain special characters (such as a smiling-face emoji) in the Confluence backup file (entities.xml).
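To check whether a backup is likely to be affected, the file can be scanned before running the anonymizer. The following is a minimal Python sketch, not part of any Atlassian tool. It assumes that entities.xml is UTF-8 encoded and that the offending characters are those outside the Basic Multilingual Plane (code points above U+FFFF, which is where emoji live). The function names are hypothetical.

```python
def find_supplementary_chars(lines):
    """Yield (line_number, column, code_point) for every character above U+FFFF.

    Assumption: characters outside the Basic Multilingual Plane (for example
    U+1F600, whose UTF-16 form is the surrogate pair 0xD83D 0xDE00, and
    0xD83D == 55357) are the ones that trigger the SAXException.
    """
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0xFFFF:
                yield line_no, col, ord(ch)


def scan_file(path):
    """Scan a UTF-8 text file line by line, so large backups are not loaded at once."""
    # errors="replace" keeps the scan going if the file contains invalid bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        yield from find_supplementary_chars(handle)
```

Calling `scan_file("entities.xml")` and printing each result gives the positions to inspect. An empty result suggests the error has a different cause than the one assumed here.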
Solution
If entities.xml is small, the special characters can be removed manually in a text editor.
However, if the file is too large to edit directly, the following method can be used.
Download the tool for removing special characters: atlassian-xml-cleaner-0.1.jar
Run it to remove the special characters:
java -jar atlassian-xml-cleaner-0.1.jar entities.xml > entities-clean.xml
Then run the anonymizer tool on the cleaned file (entities-clean.xml).
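For readers who want to see what such a clean-up step amounts to, here is a minimal Python sketch. It is not the implementation of atlassian-xml-cleaner-0.1.jar, which may remove a different set of characters. It assumes a UTF-8 file and simply drops every character above U+FFFF. The function names are hypothetical.

```python
def strip_supplementary_chars(text):
    """Return text without characters above U+FFFF (assumed to be the offending ones)."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, dropping characters above U+FFFF."""
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(strip_supplementary_chars(line))
```

A call such as `clean_file("entities.xml", "entities-clean.xml")` produces a file that can be passed to the anonymizer in place of the original. Dropped characters are lost from the backup, so keep the original file.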
Reference
The tool for cleaning special characters was originally written for Jira. For details, see: Removing invalid characters from XML backups.