Data pipeline

This feature is available with a Confluence Data Center license.

Data pipeline provides an easy way to export data from your Jira or Confluence site, and feed it into your existing data platform (like Tableau or PowerBI). This allows you to:
  • generate richer reports and visualizations of site activity
  • better understand how your teams are using your application
  • make better decisions on optimizing the use of Jira or Confluence in your organization

You can trigger a data export of the current state data through the REST API, and view the status of your exports in your application’s admin console. Data will be exported in CSV format. You can only perform one data export at a time.

For a detailed reference of the exported data's schema, see Data pipeline export schema.

Data pipeline is available in Data Center editions of:

  • Jira 8.14 and later
  • Confluence 7.12 and later


Requirements

To trigger data exports through the REST API, you'll need a valid Confluence Data Center license and System Administrator global permissions.

Considerations

There are a number of security and performance impacts you’ll need to consider before getting started.

Security

The export will include all data, including PII (Personally Identifiable Information) and restricted content. This is to provide you with as much data as possible, so you can filter and transform to generate the insights you’re after.

If you need to filter out data based on security and confidentiality, this must be done after the data is exported.

Exported files are saved in your shared home directory, so you’ll also want to check this is secured appropriately.

Performance impact

Exporting data can take a long time in large instances. We intentionally export data at a limited rate to keep any performance impact to your site under a 5% threshold. It’s important to note that there is no impact to performance unless an export is in progress.

When scheduling your exports, we recommend that you:

  • Limit the amount of data exported using the fromDate parameter, as a date further in the past will export more data, resulting in a longer data export.
  • Schedule exports during hours of low activity, or on a node with no activity, if you do observe any performance degradation during the export.

Our test results showed the following approximate durations for the export.

Entity              Number        Approximate export duration
Users               100,000       8 minutes
Spaces              15,000        12 minutes
Pages               25 million    12 hours
Comments            15 million    1 hour
Analytics events    20 million    2 hours

The total export time was around 16 hours. 

Test performance vs production

The data presented here is based on our own internal testing. The actual duration and impact of data export on your own environment will likely differ depending on your infrastructure, configuration, and load. 

Our tests were conducted on a single node Data Center instance in AWS:

  • EC2 instance type: c5.4xlarge
  • RDS instance type: db.m5.4xlarge

Performing the data export

To export the current state data, use the /export REST API endpoint:
https://<base-url>/rest/datapipeline/latest/export?fromDate=<yyyy-MM-ddTHH:mmTZD>

The fromDate parameter limits the amount of data exported. That is, only data on entities created or updated after the fromDate value will be exported.

If you trigger the export without the fromDate parameter, all data from the last 365 days will be exported. 

If your application is configured to use a context path, such as /jira or /confluence, remember to include this in the <base-url>.

The /export REST API endpoint has three methods:


POST method

When you use the POST method, specify a fromDate value. This parameter only accepts date values set in ISO 8601 format (yyyy-MM-ddTHH:mmTZD). For example:

  • 2020-12-30T23:01Z

  • 2020-12-30T22:01+01:00
    (you'll need to use URL encoding in your request, for example 2020-12-30T22%3A01%2B01%3A00; see the sketch after this list)
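
As a sketch of what that encoding looks like, here is the same example timestamp encoded with Python's standard library:

# URL-encode an ISO 8601 fromDate containing ':' and '+' characters.
from urllib.parse import quote

from_date = "2020-12-30T22:01+01:00"
print(quote(from_date, safe=""))  # 2020-12-30T22%3A01%2B01%3A00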

Here is an example request, using cURL and a personal access token for authentication:

curl -H "Authorization:Bearer ABCD1234" -H "X-Atlassian-Token: no-check" 
-X POST https://myexamplesite.com/rest/datapipeline/latest/
export?fromDate=2020-10-22T01:30:11Z

The "X-Atlassian-Token: no-check" header is only required for Confluence. You can omit this for Jira.

The POST request has the following responses:

202 - Data export started. For example:

{
  "startTime":"2021-03-03T12:08:24.045+11:00",
  "nodeId":"node1",
  "jobId":124,
  "status":"STARTED",
  "config":{
     "exportFrom":"2020-03-03T12:08:24.036+11:00",
     "forcedExport":false
  }
}

409 - Another data export is already running:

{
  "startTime":"2021-03-03T12:08:24.045+11:00",
  "nodeId":"node1",
  "jobId":124,
  "status":"STARTED",
  "config":{
     "exportFrom":"2020-03-03T12:08:24.036+11:00",
     "forcedExport":false
  }
}

422 - Data export failed due to an inconsistent index:

{
  "startTime": "2021-01-13T09:01:01.917+11:00",
  "completedTime": "2021-01-13T09:01:01.986+11:00",
  "nodeId": "node2",
  "jobId": 56,
  "status": "FAILED",
  "config": {
    "exportFrom": "2020-07-17T08:00:00+10:00",
    "forcedExport": false
  },
  "errors": [
    {
      "key": "export.pre.validation.failed",
      "message": "Inconsistent index used for export job."
    }
  ]
}

If this occurs, you may need to reindex and then retry the data export.

Alternatively, you can force a data export using the forceExport=true query parameter. However, forcing an export on an inconsistent index could result in incomplete data.

The following response is returned when you force an export on an inconsistent index, to warn you that the data might be incomplete:

{
  "startTime": "2021-01-13T09:01:42.696+11:00",
  "nodeId": "node2",
  "jobId": 57,
  "status": "STARTED",
  "config": {
    "exportFrom": "2020-07-17T08:01:00+10:00",
    "forcedExport": true
  },
  "warnings": [
    {
      "key": "export.pre.validation.failed",
      "message": "Inconsistent index used for export job."
    }
  ]
}
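
If you do decide to force the export, the only change to the request is the extra query parameter. A minimal Python sketch, assuming the same placeholder site URL and personal access token as in the earlier example:

# Force a data export despite an inconsistent index (the exported data may be incomplete).
import requests

BASE_URL = "https://myexamplesite.com"   # placeholder site URL
TOKEN = "ABCD1234"                       # placeholder personal access token

response = requests.post(
    f"{BASE_URL}/rest/datapipeline/latest/export",
    params={"fromDate": "2020-10-22T01:30:11Z", "forceExport": "true"},
    headers={"Authorization": f"Bearer {TOKEN}", "X-Atlassian-Token": "no-check"},
)
print(response.status_code, response.json())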

GET method

The GET request returns a 200 code, but the response will be different depending on what stage the export is in:

Before you start the first export:
{}

During an export:

{
  "startTime": "2020-11-01T06-35-41-577+11",
  "nodeId": "node1",
  "jobId": 125,
  "status": "STARTED"
  "config":{
     "exportFrom":"2020-03-03T12:08:24.036+11:00",
     "forcedExport":false
  }
}

After a successful export:

{
  "startTime":"2021-03-03T12:08:24.045+11:00",
  "completedTime":"2021-03-03T12:08:24.226+11:00",
  "nodeId":"node3",
  "jobId":125,
  "status":"COMPLETED",
  "config": {
    "exportFrom":"2020-03-03T12:08:24.036+11:00",
    "forcedExport":false 
  },
  "statistics" {
    "exportedEntities":23,
    "writtenRows":54
  }
}

After a cancellation request, but before the export is actually cancelled:

{
  "startTime":"2021-03-03T12:08:24.045+11:00",
  "completedTime":"2021-03-03T12:08:24.226+11:00",
  "nodeId":"Node1",
  "jobId":125,
  "status":"CANCELLATION_REQUESTED",
  "config": {
    "exportFrom":"2020-03-03T12:08:24.036+11:00",
    "forcedExport":false 
  }
}

After an export is cancelled:

{
  "startTime": "2020-11-02T04-20-34-007+11",
  "cancelledTime": "2020-11-02T04-24-21-717+11",
  "completedTime": "2020-11-02T04-24-21-717+11",
  "nodeId":"node2",
  "jobId":125,
  "status":"CANCELLED",
  "config": {
    "exportFrom":"2020-03-03T12:08:24.036+11:00",
    "forcedExport":false 
  },
  "statistics" {
    "exportedEntities":23,
    "writtenRows":12
  }
}
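
Because only one export can run at a time, it can be useful to poll this endpoint until the job reaches a terminal state. A rough sketch, again assuming a placeholder site URL and personal access token:

# Poll the export status until it finishes; exports on large sites can take hours.
import time
import requests

BASE_URL = "https://myexamplesite.com"   # placeholder site URL
TOKEN = "ABCD1234"                       # placeholder personal access token

while True:
    status = requests.get(
        f"{BASE_URL}/rest/datapipeline/latest/export",
        headers={"Authorization": f"Bearer {TOKEN}"},
    ).json()
    state = status.get("status")  # an empty {} response means no export has ever run
    if state in (None, "COMPLETED", "CANCELLED", "FAILED"):
        print(status)
        break
    time.sleep(60)  # poll sparingly; the export rate is already throttled server-side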

DELETE method

The DELETE request has the following responses:

200 - Cancellation accepted:

{
  "status": "OK",
  "message": "Cancellation request successfully received.
 Currently running export job will be stopped shortly."
}
409 - Request discarded because there is no ongoing export:

{
  "status": "WARNING",
  "message": "Cancellation request aborted. There is no
export job running to cancel."
}

Automatic cancellation

If a node running a data export is gracefully shut down, the export will be automatically marked as CANCELLED.

However, if the JVM is not notified after a crash or hardware-level failure occurs, the export process may get locked. This means you'll need to manually mark the export as CANCELLED by making a DELETE request. This releases the process lock, allowing you to perform another data export.
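
As a minimal sketch of that DELETE call (placeholder site URL and token again):

# Manually mark a stuck export as CANCELLED to release the process lock.
import requests

BASE_URL = "https://myexamplesite.com"   # placeholder site URL
TOKEN = "ABCD1234"                       # placeholder personal access token

response = requests.delete(
    f"{BASE_URL}/rest/datapipeline/latest/export",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(response.status_code, response.json())  # 200 means the cancellation was accepted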

Configuring the data export

You can configure the format of the export data using the following system properties.

plugin.data.pipeline.embedded.line.break.preserve
Default value: false

Specifies whether embedded line breaks should be preserved in the output files. Line breaks can be problematic for some tools, such as Hadoop. This property is set to false by default, which means that line breaks are escaped.

plugin.data.pipeline.embedded.line.break.escape.char
Default value: \\n

Escaping character for embedded line breaks. By default, we'll print \n for every embedded line break.

Check the status of an export

You can check the status of an export and view when your last export ran from within your application’s admin console. To view data export status, go to Administration > General Configuration > Data pipeline.

There are a number of export statuses:
  • Not started - no export is currently running
  • Started - the export is currently running
  • Completed - the export has completed
  • Cancellation requested - a cancellation request has been sent
  • Cancelled - the export was cancelled
  • Failed - the export failed

For help resolving failed or cancelled exports, see Data pipeline troubleshooting.

Output files

Each time you perform a data export, we assign a numerical job ID to the task (starting with 1 for your first ever data export). This job ID is used in the file names and location of the files containing your exported data.

Location of exported files

Exported data is saved as separate CSV files. The files are saved to the following directory:

  • <shared-home>/data-pipeline/export/<job-id> if you run Confluence in a cluster
  • <local-home>/data-pipeline/export/<job-id> if you are using non-clustered Confluence

Within the <job-id> directory you will see the following files:

  • users_job<job_id>_<timestamp>.csv 

  • spaces_job<job_id>_<timestamp>.csv

  • pages_job<job_id>_<timestamp>.csv

  • comments_job<job_id>_<timestamp>.csv

  • analytics_events_job<job_id>_<timestamp>.csv

To load and transform the data in these files, you'll need to understand the schema. See Data pipeline export schema.
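
For a quick, informal look at an exported file before wiring up a full pipeline, you can load it with pandas; this is just a sketch with a placeholder file name, and pandas' default CSV parsing handles the quoted, multi-line fields used in the export:

# Inspect one of the exported CSV files; column names follow the data pipeline export schema.
import pandas as pd

csv_path = "pages_job1_20210303120824.csv"  # placeholder; substitute a real exported file
df = pd.read_csv(csv_path, encoding="utf-8")

print(df.columns.tolist())
print(df.head())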

Sample Spark and Hadoop import configurations

If you have an existing Spark or Hadoop instance, use the following references to configure how to import your data for further transformation.


Spark / Databricks

%python
# File location
file_location = "/FileStore/**/export_2020_09_24T03_32_18Z.csv" 

# Automatically set data type for columns
infer_schema = "true"
# Skip first row as it's a header
first_row_is_header = "true"
# Ignore multiline within double quotes
multiline_support = "true"

# The applied options are for CSV files. For other file types, these will be ignored.
# Note the escape and quote options for RFC 4180 compliant files.
df = spark.read.format("csv") \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("multiLine", multiline_support) \
  .option("quote", "\"") \
  .option("escape", "\"") \
  .option("encoding", "UTF-8").load(file_location)

display(df)

Hadoop

CREATE EXTERNAL TABLE IF NOT EXISTS some_db.datapipeline_export (
  `page_id` string,
  `instance_url` string,
  `space_key` string,
  `page_url` string,
  `page_type` string,
  `page_title` string,
  `page_status` string,
  `page_content` string,
  `page_parent_id` string,
  `labels` string,
  `page_version` string,
  `creator_id` string,
  `last_modifier_id` string,
  `created_date` string,
  `updated_date` string,
  `last_update_description` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "escapeChar" = "\\",
  'quoteChar' = '"',
  'separatorChar' = ','
) LOCATION 's3://my-data-pipeline-bucket/test-exports/'
TBLPROPERTIES ('has_encrypted_data'='false');

Troubleshooting issues with data exports

Exports can fail for a number of reasons, for example if your search index isn’t up to date. For guidance on common failures, and how to resolve them, see Data pipeline troubleshooting in our knowledge base. 
