Database Restore/Migration fails - SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence

Symptoms

After running the database migration wizard or taking a backup with the Bitbucket backup client, restoring the generated XML backup fails with the following error:

2020-07-13 07:39:41,634 INFO         Initializing
2020-07-13 07:39:43,253 INFO         Unpacking bitbucket-20200709-172649-440.tar to /media/atl/bitbucket
2020-07-13 10:25:18,721 INFO         Validating database before restore
2020-07-13 10:25:20,450 INFO         Restoring database schema definition
2020-07-13 10:25:33,482 INFO         Restoring database data
2020-07-13 10:25:40,133 ERROR        bitbucket-20200709-172649-440.tar could not be restored
com.atlassian.stash.internal.backup.liquibase.LiquibaseDataAccessException: SAX parsing error while parsing backup file; nested exception is org.xml.sax.SAXParseException; lineNumber: 10888480; columnNumber: 36; Invalid byte 2 of 4-byte UTF-8 sequence.
	at com.atlassian.stash.internal.backup.liquibase.DefaultLiquibaseMigrationDao.parse(DefaultLiquibaseMigrationDao.java:229)
	at com.atlassian.stash.internal.backup.liquibase.DefaultLiquibaseMigrationDao.scan(DefaultLiquibaseMigrationDao.java:215)
	... 10 more frames available in the log file

Cause

A database migration uses the same classes and logic as the Bitbucket backup client: it takes an XML backup of the current database schema and data, then restores that backup into the target database.

The XML backup is generated successfully, but reading it back relies on the third-party Apache Xerces XML parser, which contains an unresolved bug that can produce this error on particularly large XML backups containing 4-byte UTF-8 sequences. Once the read buffer is exhausted, the next 4-byte UTF-8 character parsed hits an off-by-one error, resulting in the error above.
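The boundary condition described above can be illustrated with a minimal Python sketch. It does not reproduce the Xerces bug itself; it only shows, under the assumption of a fixed-size read buffer, how a 4-byte UTF-8 sequence can straddle two reads and why a decoder has to carry the incomplete bytes over to the next read. The function name decode_in_chunks and the chunk size are hypothetical and chosen for illustration.

```python
import codecs

def decode_in_chunks(data: bytes, chunk_size: int) -> str:
    """Decode UTF-8 bytes that arrive in fixed-size chunks, like a buffered reader."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # The incremental decoder holds back an incomplete trailing sequence
        # and completes it when the next chunk arrives.
        parts.append(decoder.decode(chunk))
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

text = "ab\U0001F600cd"        # U+1F600 takes 4 bytes in UTF-8
data = text.encode("utf-8")    # 8 bytes in total

# With 3-byte chunks, the first chunk ends after byte 1 of the 4-byte sequence.
assert decode_in_chunks(data, 3) == text

# Decoding the first chunk on its own fails, because the sequence is cut short.
try:
    data[:3].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False
assert naive_ok is False
```

A parser that mishandles the carried-over bytes at such a boundary would report a malformed sequence even though the file is valid UTF-8, which is consistent with the behavior described here.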

As a result, this issue is most likely to be seen in large XML backups that also contain a wide variety of characters encoded as 4-byte UTF-8 sequences. These characters generally include less common CJK characters, various historic scripts, mathematical symbols, and emojis.
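To make the "4-byte UTF-8 sequence" criterion concrete: UTF-8 uses four bytes exactly for code points above U+FFFF. The short Python sketch below checks a few sample characters; the helper names and the sample set are illustrative assumptions, not part of any Bitbucket tooling.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def needs_four_bytes(ch: str) -> bool:
    """True for code points above U+FFFF, which UTF-8 encodes in 4 bytes."""
    return ord(ch) > 0xFFFF

# Sample characters, written as escapes so the source stays ASCII.
samples = {
    "Latin letter A": "A",
    "Common CJK character U+4E2D": "\u4e2d",
    "Less common CJK character U+20000": "\U00020000",
    "Mathematical script capital A U+1D49C": "\U0001D49C",
    "Emoji U+1F600": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s), 4-byte: {needs_four_bytes(ch)}")
```

Only the last three samples are the kind of character that can trigger the error; everyday Latin text and common CJK characters are encoded in three bytes or fewer.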

Workaround

Short of resolving the bug in the above XML parser or switching to a different XML parsing utility, the options for getting past this issue come down to either:

  1. Reducing the overall amount of content within the XML backup to prevent the read buffer from becoming exhausted.
    • This is not recommended: the threshold varies with the number of 4-byte UTF-8 characters in the XML backup, so it may not be clear how much data you would need to remove (and from where in the XML backup) to get past this error.
  2. Removing or substituting these 4-byte UTF-8 characters in the XML backup before restoring it into the target database.
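As a rough illustration of what the second option amounts to, the Python sketch below replaces every character above U+FFFF with U+FFFD (the Unicode replacement character, which takes three bytes in UTF-8). This is an assumption-laden sketch and not the implementation of any Atlassian tool; the function names, the choice of replacement character, and the file names in the comment are hypothetical.

```python
import re

# Matches any code point above U+FFFF, i.e. any 4-byte UTF-8 character.
SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")

def replace_supplementary(text: str, replacement: str = "\uFFFD"):
    """Return (cleaned_text, number_of_replacements)."""
    return SUPPLEMENTARY.subn(replacement, text)

def clean_file(src_path: str, dst_path: str, replacement: str = "\uFFFD") -> int:
    """Stream src_path to dst_path line by line, substituting 4-byte characters."""
    total = 0
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            cleaned, count = replace_supplementary(line, replacement)
            dst.write(cleaned)
            total += count
    return total

# Hypothetical usage:
#   replaced = clean_file("stash-data.xml", "stash-data-clean.xml")
```

Substituting changes the stored data (for example, an emoji in a comment becomes a placeholder), so keep the original backup until the restore has been verified.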

We recommend choosing the second option, as this will minimize the number of changes that need to be made to the XML backup to allow the restore to succeed. These are the steps that can be performed to easily remove these characters:

  1. Download the JAR file: atlassian-xml-cleaner-0.1.jar
  2. Open a command prompt and locate the XML or ZIP backup file on your computer, ensuring that it is extracted if it's within a ZIP file. In this example, we will use stash-data.xml.
  3. Run the cleaner as shown: 

    $ java -jar atlassian-xml-cleaner-0.1.jar stash-data.xml > stash-data-clean.xml
  4. This will create a copy of stash-data.xml as stash-data-clean.xml with the invalid characters removed. 
  5. Copy the stash-data-clean.xml file into another directory, rename it back to stash-data.xml, and create a new ZIP with the updated stash-data.xml file.

After performing the above steps to produce an updated backup .ZIP file, follow the standard process for restoring the backup to the desired database using the Bitbucket backup client.
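Before repackaging, it can be useful to confirm that the cleaned file no longer contains any 4-byte UTF-8 characters. The Python sketch below scans lines of text and reports the first few remaining occurrences; it is an independent check written for illustration, not part of the backup client or the cleaner, and the function name scan_lines and the limit parameter are hypothetical.

```python
import re

# Matches any code point above U+FFFF, i.e. any 4-byte UTF-8 character.
SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")

def scan_lines(lines, limit: int = 20):
    """Return up to `limit` tuples of (line_number, column, code_point)."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for match in SUPPLEMENTARY.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            # Column is a 1-based character offset within the line.
            hits.append((line_no, match.start() + 1, code_point))
            if len(hits) >= limit:
                return hits
    return hits

# Hypothetical usage against the cleaned file from the steps above:
#   with open("stash-data-clean.xml", encoding="utf-8", newline="") as f:
#       remaining = scan_lines(f)
#   An empty list means no 4-byte characters remain.
```

The reported column is a character offset and may not match the columnNumber in the parser's error message, which can be counted differently.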

Resolution

The ultimate resolution to this issue is being tracked in the following bug ticket:

Last modified on Jul 14, 2020
