Configuring Character Encoding
This page explains the encoding settings that are applicable in Confluence and how they relate to application behaviour.
To avoid problems with character encoding, make sure the encodings used across the different components of your system are the same. In general, always set all character encodings to UTF-8:
- Confluence – described below.
- Database – see Configuring Database Character Encoding.
- Application server – see Configuring URL Encoding on Tomcat Application Server
Configuring the Confluence character encoding
By default, Confluence uses UTF-8 character encoding to deliver its pages.
Note: While it is possible to change the character encoding, we recommend that you leave this as it is unless you are certain of what you are doing.
In summary: Changing the Confluence character encoding will change your HTTP request and response encoding and your filesystem encoding as used by exports and Velocity templates.
To change the Confluence character encoding via the UI:
Choose the cog icon, then choose General Configuration under Confluence Administration.
Choose General Configuration in the left-hand panel.
Enter the new character encoding of your choice in the text box next to Encoding.
Note: At runtime, the character encoding is available in
More details about character encoding
There are three places where character encoding matters to Confluence:
- Database encoding - usually the most important; it is where almost all user data is stored.
- Filesystem encoding - important for attachment storage (pre-2.2), reading Velocity templates and writing exported files.
- HTTP request and response encoding - important for form parsing, correct rendering by the browser and browser interpretation of encoded URLs.
Problems generally arise when Confluence thinks one of the above encodings is different from what it actually is. For example, Confluence might believe the database is using ISO-8859-1 encoding, when in fact it is UTF-8 encoded.
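The effect of such a mismatch can be sketched in a few lines of Java. This is only an illustration of the general mechanism, not Confluence code; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when bytes written in one
// encoding are read back using a different one.
class EncodingMismatchDemo {

    // Encode as UTF-8, then decode as if the bytes were ISO-8859-1.
    static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode and decode with the same encoding: the text survives intact.
    static String readCorrectly(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        // The accented character occupies two bytes in UTF-8, so the
        // mismatched read produces two wrong characters in its place.
        System.out.println(misread(original).length());       // 5
        System.out.println(readCorrectly(original).length()); // 4
    }
}
```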
In certain cases (for example, Microsoft Windows), it might not be possible to use a fully Unicode filesystem (that is, a default Windows installation does not support Unicode filenames properly). If so, keep UTF-8 for the other two and be aware that your operating system might have limitations around international attachments (pre-2.2), backup and restore of international data, etc.
Java character encoding
Java always uses the multibyte UTF-16 character encoding for all String data*. This means that each of the encodings above defines how, at that particular point, characters are converted between Java's native UTF-16 format and some other format that the browser, filesystem or database might understand.
So when a request comes in to Confluence, we convert it from the request encoding to UTF-16. Then we store that data into the database, converting from UTF-16 to the database's encoding. Retrieving information from the database and sending it back to the browser is the same process in the opposite direction.
* A Java char represents a single Unicode code point from the Basic Multilingual Plane (BMP), encoded as UTF-16. Pairs of chars (surrogate pairs) are used for characters beyond U+FFFF.
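The distinction between chars, code points and encoded bytes can be checked directly in Java. The following is a minimal sketch using only standard library calls; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares UTF-16 chars, Unicode code points
// and the number of bytes produced by different encodings.
class Utf16Demo {

    // Number of UTF-16 chars held by the String.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes after converting from UTF-16 to the given encoding.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String bmp = "\u00e9";            // U+00E9, inside the BMP
        String beyond = "\uD83D\uDE00";   // U+1F600, beyond U+FFFF

        System.out.println(charCount(bmp));         // 1
        System.out.println(charCount(beyond));      // 2 (a surrogate pair)
        System.out.println(codePointCount(beyond)); // 1

        System.out.println(encodedLength(bmp, StandardCharsets.UTF_8));      // 2
        System.out.println(encodedLength(bmp, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength(beyond, StandardCharsets.UTF_8));   // 4
    }
}
```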
Confluence character encoding
The Confluence character encoding is used in the following parts of the system:
- ConfluenceWebWorkConfiguration sets webwork.i18n.encoding to this encoding, which WebWork uses in the response Content-Type header.
- AbstractEncodingFilter sets the HTTP request encoding to this encoding. This seems unnecessary, since the Content-Type header from the client should include the encoding used. This affects form submissions and file uploads.
- VelocityUtils reads in Velocity templates using this encoding when reading templates from disk.
- AbstractXmlExporter creates its output using this encoding.
- GeneralUtil uses this encoding when doing URLEncode and URLDecode. Different browsers have different support for character sets in URLs, so it's uncertain how much benefit this provides.
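To see why the encoding matters for URL encoding and decoding, the sketch below uses the standard java.net.URLEncoder and java.net.URLDecoder classes. It does not reproduce how GeneralUtil is implemented; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: the same character produces different URL
// escapes depending on the encoding that is used.
class UrlEncodingDemo {

    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + charsetName, e);
        }
    }

    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String accented = "\u00e9"; // e with an acute accent

        System.out.println(encode(accented, "UTF-8"));      // %C3%A9
        System.out.println(encode(accented, "ISO-8859-1")); // %E9

        // Decoding with a different encoding from the one used to encode
        // yields two characters instead of the original one.
        System.out.println(decode("%C3%A9", "ISO-8859-1").length()); // 2
    }
}
```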
See Configuring Confluence Character Encoding (described above).
The database encoding is the responsibility of your JDBC drivers. The drivers are responsible for reading and writing from the database in its native encoding and translating this data to and from Java Strings (which are UTF-16). For some drivers, such as MySQL, you must set Unicode encoding explicitly in the JDBC URL. For others, the driver is smart enough to determine the database encoding automatically.
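For MySQL, the encoding is typically requested through the useUnicode and characterEncoding connection properties of the Connector/J driver. The sketch below only builds such a URL; the host name, database name and helper class are placeholders, and the exact parameters you need may depend on your driver version.

```java
// Hypothetical helper: builds a MySQL JDBC URL that asks the driver to
// use Unicode (UTF-8) when talking to the database.
class MySqlJdbcUrlExample {

    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=true&characterEncoding=utf8";
    }

    public static void main(String[] args) {
        // "localhost" and "confluence" are placeholder values.
        System.out.println(jdbcUrl("localhost", "confluence"));
        // jdbc:mysql://localhost/confluence?useUnicode=true&characterEncoding=utf8
    }
}
```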
Ideally, your database itself should be in a Unicode encoding (and we recommend doing this for the simplest configuration), but that is not necessary as long as:
- the database encoding supports all the characters you want to store in Confluence
- your JDBC drivers can properly convert from the database encoding to UTF-16 and vice-versa.
The filesystem encoding is mostly ignored by Confluence, except for the cases where the configuration setting above plays a part (exports, Velocity). When attachments are uploaded, they are written as a stream of bytes directly to the filesystem. It is the same when they are downloaded: the bytes from the file InputStream are written directly to the HTTP response.
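Copying raw bytes involves no character encoding at all, which is why attachment content is unaffected by these settings. A minimal sketch of such a byte-for-byte copy, using in-memory streams in place of real files and HTTP responses (the class name is hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: copies bytes without ever converting them
// to characters, so no encoding is applied.
class ByteStreamCopy {

    static void copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[8192];
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper that runs the copy on in-memory data.
    static byte[] copyBytes(byte[] source) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(source), out);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // These bytes are not valid UTF-8, yet they pass through unchanged.
        byte[] data = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF};
        System.out.println(java.util.Arrays.equals(data, copyBytes(data))); // true
    }
}
```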
In some places in Confluence, we use the default filesystem encoding as determined by the JVM and stored in the file.encoding system property (it can be overridden by setting this property at startup). This encoding is used by the Java InputStreamReader and OutputStreamWriter classes by default. This encoding should probably never be used; for consistent results across all filesystem access we should be using the encoding set in the General Configuration.
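The difference between relying on the JVM default and naming the encoding explicitly can be sketched as follows. This is an illustration with standard java.io classes and in-memory streams, not Confluence code; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: passes the charset explicitly instead of
// falling back to the JVM default.
class ExplicitCharsetIo {

    // Write text through an OutputStreamWriter with an explicit charset.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read text through an InputStreamReader with an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The default varies between machines and JVM versions.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        byte[] bytes = encode("caf\u00e9", StandardCharsets.UTF_8);
        System.out.println(bytes.length); // 5
        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals("caf\u00e9")); // true
    }
}
```

With the charset passed explicitly, the result does not depend on how the JVM was started.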
In certain cases we explicitly hard-code the encoding used to read or write data to the filesystem. Two important examples are:
- importing Mbox mailboxes which are known to be ISO-8859-1
- Confluence Bandana config files are always stored as UTF-8.
Some application servers, Tomcat for example, have an encoding setting that modifies Confluence URLs before they reach the application. This can prevent access to international pages and attachments (really anything with international characters in the URL). See configuring your Application Server URL encoding.
Problems with character encodings
If Confluence has the wrong idea about encoding for one of the above, it manifests itself in different ways:
- Incorrect database encoding - user data is corrupted between being saved to and retrieved from the database. This often shows up only after a delay, because we cache data as it is written to the database and only later retrieve the corrupted copy from the database.
- Incorrect/non-Unicode filesystem encoding - international filenames break attachment download/upload/removal (pre-2.2); exports break with international content or attachments.
- Incorrect HTTP encoding - the browser selects the wrong encoding, so characters render incorrectly; changing the browser encoding manually makes the page render properly. URLs that link to pages or attachments with non-ASCII characters are broken.
- Mac users please note that MacRoman encoding is compatible with UTF-8. You do not need to change your encoding settings if you are already using MacRoman.
- This is a good article by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)