SAXException error when running the Content Anonymizer for Confluence
Atlassian may request an XML backup to troubleshoot bugs in Confluence. To keep customer data from leaking, the Content Anonymizer tool can be used to clean the backup data (entities.xml). However, some special characters may cause a SAXException during cleaning.
For example, a special character (code 55357, the first UTF-16 code unit of a smiling-face emoji) caused the error below.
$ java -jar confluence-export-cleaner-1.1-jar-with-dependencies.jar entities.xml cleaned.xml
2021-04-14 21:40:12,157 INFO Starting to clean export file 'entities.xml'. This may take a few minutes.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXException: Cannot output character with code 55357 in the encoding UTF-8' within a CDATA section
javax.xml.transform.TransformerException: Cannot output character with code 55357 in the encoding UTF-8' within a CDATA section
The anonymizer tool cannot handle special characters (such as the smiling-face emoji) in the Confluence backup file (entities.xml).
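The code 55357 in the error message is not a complete character on its own. The following minimal Python sketch shows that it is the first of the two UTF-16 code units that make up an emoji. It assumes the smiling face is U+1F600, because the error output does not say which emoji was involved. The helper name utf16_code_units is made up for this illustration.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+1F600 is an assumed example; the error output only reports code 55357.
units = utf16_code_units("\U0001F600")
print(units)  # [55357, 56832]
```

Every character from U+1F400 to U+1F7FF shares the same first code unit, 55357 (0xD83D), so the number in the error message does not identify which emoji is in the file.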
If entities.xml is small, the special characters can be removed manually in a text editor.
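To find the characters to remove by hand, a small scan can report where they are. The sketch below is only an illustration and not part of any Atlassian tool. It assumes the emoji are stored literally as UTF-8 in entities.xml (not as numeric character references) and that only characters above U+FFFF cause the problem. The function name find_supplementary_chars is hypothetical.

```python
def find_supplementary_chars(path, limit=20):
    """Return up to `limit` (line, column, code_point) tuples for characters above U+FFFF."""
    hits = []
    # errors="replace" keeps the scan going if the file contains malformed UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            for column, char in enumerate(line, start=1):
                if ord(char) > 0xFFFF:  # outside the Basic Multilingual Plane
                    hits.append((line_number, column, ord(char)))
                    if len(hits) >= limit:
                        return hits
    return hits
```

Calling find_supplementary_chars("entities.xml") reads the file one line at a time and returns the positions of the first matches, which can then be looked up in the editor.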
However, if the file is too large to edit directly, the method below can be used.
Download the tool for removing special characters: atlassian-xml-cleaner-0.1.jar
Run it to remove the special characters:
java -jar atlassian-xml-cleaner-0.1.jar entities.xml > entities-clean.xml
Then run the anonymizer tool on the cleaned file (entities-clean.xml).
The special-character cleaning tool was originally written for Jira. For details, see: Removing invalid characters from XML backups.
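For readers who want to see what this kind of filtering amounts to, here is a rough Python sketch. It is not the implementation of atlassian-xml-cleaner, whose behavior is not documented here. It assumes the goal is to drop characters that are not allowed in XML 1.0 and, as a further assumption, every character above U+FFFF (where emoji live). The names strip_unwanted and clean_file are hypothetical.

```python
import re

# Keep tab, newline, carriage return, and U+0020..U+D7FF / U+E000..U+FFFD.
# Everything else is removed, including control characters and, by assumption,
# all characters above U+FFFF such as emoji.
_UNWANTED = re.compile("[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD]")


def strip_unwanted(text):
    """Return text with the unwanted characters removed."""
    return _UNWANTED.sub("", text)


def clean_file(source_path, target_path):
    """Copy source_path to target_path line by line, filtering each line."""
    # newline="" leaves the original line endings untouched.
    with open(source_path, encoding="utf-8", errors="replace", newline="") as source, \
         open(target_path, "w", encoding="utf-8", newline="") as target:
        for line in source:
            target.write(strip_unwanted(line))
```

Because the file is processed one line at a time, memory use stays low even for a large entities.xml. The removed characters are lost from the cleaned copy, so keep the original backup.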