To trigger data exports through the REST API, you’ll need:
- A valid Confluence Data Center license
- Systems Administrator global permissions
There are a number of security and performance impacts you’ll need to consider before getting started.
If you need to filter out data based on security and confidentiality, this must be done after the data is exported.
Exported files are saved in your shared home directory, so you’ll also want to check this is secured appropriately.
When scheduling your exports, we recommend that you:
- Limit the amount of data exported using the `fromDate` parameter, as a date further in the past will export more data, resulting in a longer data export (see the sketch after this list).
- Schedule exports during hours of low activity, or on a node with no activity, if you do observe any performance degradation during the export.
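For example, a scheduled job might trigger the export during a quiet window and pass a `fromDate` only a week in the past. A minimal sketch using Python's requests library, with a placeholder base URL and token (the `/export` endpoint itself is described under Performing the data export below):

```python
from datetime import datetime, timedelta, timezone
import requests

BASE_URL = "https://myexamplesite.com"  # placeholder
HEADERS = {"Authorization": "Bearer ABCD1234",  # placeholder personal access token
           "X-Atlassian-Token": "no-check"}

# A fromDate one week in the past keeps the export small; dates further
# back export more data and take longer.
from_date = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%dT%H:%M:%SZ")

resp = requests.post(
    f"{BASE_URL}/rest/datapipeline/latest/export",
    params={"fromDate": from_date},  # requests URL-encodes the value
    headers=HEADERS,
)
print(resp.status_code, resp.text)
```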
To give you an idea of how long an export takes, here are the results from one of our internal tests:

| Entity | Number | Approximate export duration |
|------------------|------------|-----------------------------|
| Pages | 25 million | 12 hours |
| Comments | 15 million | 1 hour |
| Analytics events | 20 million | 2 hours |

The total export time was around 16 hours.
Test performance vs production
The data presented here is based on our own internal testing. The actual duration and impact of data export on your own environment will likely differ depending on your infrastructure, configuration, and load.
Our tests were conducted on a single node Data Center instance in AWS:
- EC2 instance type:
- RDS instance type:
Performing the data export
To trigger a data export, use the `/export` REST API endpoint. The `fromDate` parameter limits the amount of data exported; that is, only data on entities created or updated after the `fromDate` value will be exported. If you trigger the export without the `fromDate` parameter, all data from the last 365 days will be exported.

If your application is configured to use a context path, such as `/confluence`, remember to include this in the base URL of your requests.
The `/export` REST API endpoint has three methods: POST (start an export), GET (check the status of an export), and DELETE (cancel an export).
When you use the POST method, specify a `fromDate` value. This parameter only accepts date values set in ISO 8601 format (yyyy-MM-ddTHH:mmTZD), for example `2020-12-30T22:01+01:00` (you'll need to use URL encoding in your request, for example `2020-12-30T22%3A01%2B01%3A00`).
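If you want to generate the encoded value programmatically, Python's standard library can do it; a minimal sketch:

```python
from urllib.parse import quote

# URL-encode an ISO 8601 timestamp for use in the fromDate query parameter
print(quote("2020-12-30T22:01+01:00", safe=""))
# -> 2020-12-30T22%3A01%2B01%3A00
```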
Here is an example request, using cURL and a personal access token for authentication:
```bash
curl -H "Authorization: Bearer ABCD1234" -H "X-Atlassian-Token: no-check" \
  -X POST https://myexamplesite.com/rest/datapipeline/latest/export?fromDate=2020-10-22T01:30:11Z
```
"X-Atlassian-Token: no-check" header is only required for Confluence. You can omit this for Jira.
The POST request has the following responses:
- Data export started
- Another data export is already running
- Data export failed due to an inconsistent index
If this occurs, you may need to reindex and then retry the data export.
Alternatively, you can force a data export using the `forceExport=true` query parameter. When you force an export on an inconsistent index, the response warns you that the data might be incomplete.
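Put together, a client might retry a failed export as a forced one. A rough sketch, reusing the placeholder base URL and token from above; the status code for the inconsistent-index failure is an assumption here, so match it to the response you actually receive:

```python
import requests

BASE_URL = "https://myexamplesite.com"  # placeholder
HEADERS = {"Authorization": "Bearer ABCD1234", "X-Atlassian-Token": "no-check"}
params = {"fromDate": "2020-10-22T01:30:11Z"}

resp = requests.post(f"{BASE_URL}/rest/datapipeline/latest/export",
                     params=params, headers=HEADERS)

if resp.status_code == 422:  # assumption: export failed due to an inconsistent index
    # Force the export, accepting that the data might be incomplete
    params["forceExport"] = "true"
    resp = requests.post(f"{BASE_URL}/rest/datapipeline/latest/export",
                         params=params, headers=HEADERS)

print(resp.status_code, resp.text)
```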
The GET request returns a 200 code, but the response will be different depending on what stage the export is in:
- Before you start the first export
- During an export
- After a successful export
- After a cancellation request, but before the export is cancelled
- After an export is cancelled
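To wait for an export to finish, you can poll this endpoint. A minimal sketch; the status field and its values are assumptions based on the stages listed above, so adjust them to the response body your instance returns:

```python
import time
import requests

BASE_URL = "https://myexamplesite.com"  # placeholder
HEADERS = {"Authorization": "Bearer ABCD1234", "X-Atlassian-Token": "no-check"}

while True:
    resp = requests.get(f"{BASE_URL}/rest/datapipeline/latest/export", headers=HEADERS)
    body = resp.json()
    status = body.get("status")  # assumption: a "status" field reports the stage
    print(status)
    if status not in ("STARTED", "CANCELLATION_REQUESTED"):
        break  # completed, cancelled, failed, or not yet started
    time.sleep(60)  # poll once a minute
```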
The DELETE request cancels an ongoing export and has the following responses:
- Request discarded, because there is no ongoing export

If a node running an export is shut down gracefully, the export is automatically marked as CANCELLED. However, if the JVM is not notified after a crash or hardware-level failure occurs, the export process may get locked. This means you'll need to manually mark the export as CANCELLED by making a DELETE request. This releases the process lock, allowing you to perform another data export.
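Releasing the lock is then a single request, for example (same placeholder base URL and token as above):

```python
import requests

BASE_URL = "https://myexamplesite.com"  # placeholder
HEADERS = {"Authorization": "Bearer ABCD1234", "X-Atlassian-Token": "no-check"}

# Mark the stuck export as CANCELLED so a new export can be started
resp = requests.delete(f"{BASE_URL}/rest/datapipeline/latest/export", headers=HEADERS)
print(resp.status_code)
```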
Configuring the data export
You can configure the format of the exported data using the following system properties.

| Property | Description |
|----------|-------------|
| `plugin.data.pipeline.embedded.line.break.preserve` | Specifies whether embedded line breaks should be preserved in the output files. Line breaks can be problematic for some tools such as Hadoop. Set to `false` by default, which means that line breaks are escaped. |
| `plugin.data.pipeline.embedded.line.break.escape.char` | Escaping character for embedded line breaks. By default, we'll print `\n` for every embedded line break. |
Check the status of an export
You can check the status of an export and view when your last export ran from within your application’s admin console. To view data export status, go to General Configuration > Data pipeline. The export status will be one of the following:
- Not started - no export is currently running
- Started - the export is currently running
- Completed - the export has completed
- Cancellation requested - a cancellation request has been sent
- Cancelled - the export was cancelled
- Failed - the export failed.
For help resolving failed or cancelled exports, see Data pipeline troubleshooting.
Each time you perform a data export, we assign a numerical job ID to the task (starting with 1 for your first ever data export). This job ID is used in the file name and location of the files containing your exported data.
Location of exported files
Exported data is saved as separate CSV files. The files are saved to the following directory:
- `<shared-home>/data-pipeline/export/<job-id>` if you run Confluence in a cluster
- `<local-home>/data-pipeline/export/<job-id>` if you are using non-clustered Confluence

Within the `<job-id>` directory you will find the CSV files containing your exported data.
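For example, to see what a given job produced, you could list the directory; the shared-home path below is a placeholder, and the job ID is whichever export you're inspecting:

```python
from pathlib import Path

# Placeholder shared-home path; substitute your own <shared-home> and job ID
export_dir = Path("/var/atlassian/confluence-shared-home/data-pipeline/export/1")

for csv_file in sorted(export_dir.glob("*.csv")):
    print(csv_file.name, csv_file.stat().st_size, "bytes")
```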
To load and transform the data in these files, you'll need to understand the schema. See Data pipeline export schema.
Sample Spark and Hadoop import configurations
If you have an existing Spark or Hadoop instance, use the following reference configurations to import your data for further transformation.
```python
%python
# File location
file_location = "/FileStore/**/export_2020_09_24T03_32_18Z.csv"

# Automatically set data type for columns
infer_schema = "true"
# Skip first row as it's a header
first_row_is_header = "true"
# Ignore multiline within double quotes
multiline_support = "true"

# The applied options are for CSV files. For other file types, these will be ignored.
# Note the escape & quote options for RFC 4180 compliant files.
df = spark.read.format("csv") \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("multiLine", multiline_support) \
  .option("quote", "\"") \
  .option("escape", "\"") \
  .option("encoding", "UTF-8").load(file_location)

display(df)
```
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS some_db.datapipeline_export (
  `page_id` string,
  `instance_url` string,
  `space_key` string,
  `page_url` string,
  `page_type` string,
  `page_title` string,
  `page_status` string,
  `page_content` string,
  `page_parent_id` string,
  `labels` string,
  `page_version` string,
  `creator_id` string,
  `last_modifier_id` string,
  `created_date` string,
  `updated_date` string,
  `last_update_description` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar' = '\\',
  'quoteChar' = '"',
  'separatorChar' = ','
)
LOCATION 's3://my-data-pipeline-bucket/test-exports/'
TBLPROPERTIES ('has_encrypted_data'='false');
```