Data pipeline

This feature is available with a Confluence Data Center license.

Data pipeline provides an easy way to export data from Jira, Confluence, or Bitbucket, and feed it into your existing data platform (like Tableau or PowerBI). This allows you to:
  • generate richer reports and visualizations of site activity
  • better understand how your teams are using your application
  • make better decisions on optimizing the use of Jira or Confluence in your organization

You can trigger a data export in your application’s admin console or through the REST API. Data will be exported in CSV format. You can only perform one data export at a time.

For a detailed reference of the exported data's schema, see Data pipeline export schema.

Data pipeline is available in Data Center editions of:

  • Jira 8.14 and later
  • Confluence 7.12 and later
  • Bitbucket 7.13 and later

On this page:

Requirements

To trigger data exports through the REST API, you’ll need:

Considerations

There are a number of security and performance impacts you’ll need to consider before getting started.

Security

The export will include all data, including PII (Personally Identifiable Information) and restricted content. This is to provide you with as much data as possible, so you can filter and transform to generate the insights you’re after.

If you need to filter out data based on security and confidentiality, this must be done after the data is exported.

Exported files are saved in your shared home directory, so you’ll also want to check this is secured appropriately. 

Export performance

Exporting data can take a long time in large instances. We intentionally export data at a limited rate to keep any performance impact to your site under a 5% threshold. It’s important to note that there is no impact to performance unless an export is in progress.

When scheduling your exports, we recommend that you:

  • Limit the amount of data exported using the fromDate parameter, as a date further in the past will export more data, resulting in a longer data export.
  • Schedule exports during hours of low activity, or on a node with no activity, if you do observe any performance degradation during the export.

Our test results showed the following approximate durations for the export...

NumberApproximate export duration
Users100,0008 minutes
Spaces15,00012 minutes
Pages25 million12 hours
Comments15 million1 hour
Analytics events20 million2 hours

The total export time was around 16 hours. 

Test performance VS production

The data presented here is based on our own internal testing. The actual duration and impact of data export on your own environment will likely differ depending on your infrastructure, configuration, and load. 

Our tests were conducted on a single node Data Center instance in AWS:

  • EC2 instance type: c5.4xlarge
  • RDS instance type: db.m5.4xlarge

Access the data pipeline

To access the data pipeline go toAdministration  > General Configuration > Data pipeline.

Schedule regular exports

The way to get the most value out of the data pipeline is to schedule regular exports. The data pipeline performs a full export every time, so if you have a large site, you may want to only export once a week.

To set the export schedule:

  1. From the Data pipeline screen, select Schedule settings.
  2. Select the Schedule regular exports checkbox.
  3. Select the date to include data from. Data from before this date won’t be included. This is usually set to 12 months or less.
  4. Choose how often to repeat the export.
  5. Select a time to start the export. You may want to schedule the export to happen outside working hours.
  6. Select the Schema version to use (if more than one schema is available).
  7. Save your schedule.

Timezones and recurring exports

We use your server timezone to schedule exports (or system timezone if you’ve overridden the server time in the application). The export schedule isn’t updated if you change your timezone. If you do need to change the timezone, you’ll need to edit the schedule and re-enter the export time.

You can schedule exports to happen as often as you need. If you choose to export on multiple days, the first export will occur on the nearest day after you save the schedule. Using the example in the screenshot above, if you set up your schedule on Thursday, the first export would occur on Saturday, and the second export on Monday. We don’t wait for the start of the week.

Export schema

The export schema defines the structure of the export. We version the schema so that you know your export will have the same structure as previous exports. This helps you avoid problems if you’ve built dashboards or reports based on this data.

We only introduce new schema versions for breaking changes, such as removing a field, or if the way the data is structured changes. New fields are simply added to the latest schema version.

Older schema versions will be marked as ‘deprecated’, and may be removed in future versions. You can still export using these versions, just be aware we won’t update them with any new fields.

Check the status of an export

You can check the status of an export and view when your last export ran from the data pipeline screen. 

The Export details table will show the most recent exports, and the current status.

Select   > View details to see the full details of the export in JSON format. Details include the export parameters, status, and any errors returned if the export failed.

For help resolving failed or cancelled exports, see Data pipeline troubleshooting

Cancel an export

To cancel an export while it is in progress:
  • Go to the Data pipeline screen.
  • Select  next to the export, and choose Cancel export.
  • Confirm you want to cancel the export.

It can take a few minutes for the processes to be terminated. Any files already written will remain in the export directory. You can delete these files if you don’t need them.

Exclude projects from the export

You can exclude spaces from the export by adding them to an opt-out list. This is useful if you don’t need to report on that particular space, or if it contains sensitive content that you’d prefer not to export.

To add spaces to the opt-out list, make a POST request to <base-url>/rest/datapipeline/1.0/config/optout and pass the space keys as follows.

{ 
 "type": "SPACE", 
 "keys": ["HR","TEST"] 
}

These spaces will be excluded from all future exports.

For full details, including how to remove spaces from the opt-out list, refer to the Data pipeline REST API reference

Automatic data export cancellations

If you shut down a node running a data export, the export will be cancelled. However, if the JVM is not notified after a crash or hardware-level failure, the export process may get locked. This means you'll need to manually mark the export as cancelled (through the UI, or via the REST API by making a DELETE request). This releases the process lock, allowing you to perform another data export.

Configuring the data export

You can configure the format of the export data using the following system properties.

Default valueDescription
plugin.data.pipeline.embedded.line.break.preserve
false

Specifies whether embedded line breaks should be preserved in the output files. Line breaks can be problematic for some tools such as Hadoop.

This property is set to False by default, which means that line breaks are escaped.

plugin.data.pipeline.embedded.line.break.escape.char
\\n

Escaping character for embedded line breaks. By default, we'll print \n for every embedded line break.

plugin.data.pipeline.minimum.usable.disk.space.after.export
5GB

To prevent you from running out of disk space, the data pipeline will check before and during an export that there is at least 5GB free disk space.

Set this property, in gigabytes, to increase or decrease the limit. To disable this check, set this property to -1 (not recommended).

Use the data pipeline REST API

You can use the data pipeline REST API to export data.

To start a data pipeline export, make a POST request to <base-url>/rest/datapipeline/latest/export.

Here is an example request, using cURL and a personal access token for authentication:

curl -H "Authorization:Bearer ABCD1234" -H "X-Atlassian-Token: no-check" 
-X POST https://myexamplesite.com/rest/datapipeline/latest/
export?fromDate=2020-10-22T01:30:11Z

You can also use the API to check the status, change the export location, and schedule or cancel an export. 

For full details, refer to the Data pipeline REST API reference

Output files

Each time you perform a data export, we assign a numerical job ID to the task (starting with 1 for your first ever data export). This job ID is used in the file name, and location of the files containing your exported data. 

Location of exported files

Exported data is saved as separate CSV files. The files are saved to the following directory:

  • <shared-home>/data-pipeline/export/<job-id> if you run Confluence in a cluster
  • <local-home>/data-pipeline/export/<job-id> you are using non-clustered Confluence

Within the <job-id> directory you will see the following files:

  • users_job<job_id>_<schema_version>_<timestamp>.csv 

  • spaces_job<job_id>_<schema_version>_<timestamp>.csv

  • pages_job<job_id>_<schema_version>_<timestamp>.csv

  • comments_job<job_id>_<schema_version>_<timestamp>.csv

  • analytics_events_job<job_id>_<schema_version>_<timestamp>.csv

To load and transform the data in these files, you'll need to understand the schema. See Data pipeline export schema.

Set a custom export path

By default, the data pipeline exports the files to the home directory, but you can use the REST API to set a custom export path.

To change the root export path, make a PUT request to <base-url>/rest/datapipeline/1.0/config/export-path.

In the body of the request pass the absolute path to your preferred directory. 

For full details, including how to revert back to the default path, refer to the Data pipeline REST API reference

Sample Spark and Hadoop import configurations

If you have an existing Spark or Hadoop instance, use the following references to configure how to import your data for further transformation.


Spark / Databricks...

%python
# File location
file_location = "/FileStore/**/export_2020_09_24T03_32_18Z.csv" 

# Automatically set data type for columns
infer_schema = "true"
# Skip first row as it's a header
first_row_is_header = "true"
# Ignore multiline within double quotes
multiline_support = "true"

# The applied options are for CSV files. For other file types, these will be ignored. Note escape & quote options for RFC-4801 compliant files
df = spark.read.format("csv") \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("multiLine", multiline_support) \
  .option("quote", "\"") \
  .option("escape", "\"") \
  .option("encoding", "UTF-8").load(file_location)

display(df)

Hadoop...

CREATE EXTERNAL TABLE IF NOT EXISTS some_db.datapipeline_export (
  `page_id` string,
  `instance_url` string,
  `space_key` string,
  `page_url` string,
  `page_type` string,
  `page_title` string,
  `page_status` string,
  `page_content` string,
  `page_parent_id` string,
  `labels` string,
  `page_version` string,
  `creator_id` string,
  `last_modifier_id` string,
  `created_date` string,
  `updated_date` string,
  `last_update_description` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "escapeChar" = "\\",
  'quoteChar' = '"',
  'separatorChar' = ','
) LOCATION 's3://my-data-pipeline-bucket/test-exports/'
TBLPROPERTIES ('has_encrypted_data'='false');

Troubleshooting issues with data exports

Exports can fail for a number of reasons, for example if your search index isn’t up to date. For guidance on common failures, and how to resolve them, see Data pipeline troubleshooting in our knowledge base. 

Last modified on Nov 29, 2022

Was this helpful?

Yes
No
Provide feedback about this article

In this section

Powered by Confluence and Scroll Viewport.