
This feature is available with a Bitbucket Data Center license.

Data pipeline provides an easy way to export data from Jira, Confluence, or Bitbucket, and feed it into your existing data platform (like Tableau or PowerBI). This allows you to:

generate richer reports and visualizations of site activity
better understand how your teams are using your application
make better decisions on optimizing the use of Jira, Confluence, or Bitbucket in your organization

You can trigger a data export of the current state data through the REST API, and view the status of your exports in your application’s admin console. Data will be exported in CSV format. You can only perform one data export at a time.

For a detailed reference of the exported data's schema, see Data pipeline export schema.

Data pipeline is available in Data Center editions of:

Jira 8.14 and later
Confluence 7.12 and later
Bitbucket 7.13 and later

Requirements

To trigger data exports through the REST API, you’ll need:

A valid Bitbucket Data Center license
Bitbucket system admin global permissions

Considerations

There are a number of security and performance impacts you’ll need to consider before getting started.

Security

The export will include all data, including PII (Personally Identifiable Information) and restricted content. This is to provide you with as much data as possible, so you can filter and transform to generate the insights you’re after.

If you need to filter out data based on security and confidentiality, this must be done after the data is exported.

Exported files are saved in your shared home directory, so you’ll also want to check this is secured appropriately.

Performance impact

Exporting data is a resource-intensive process impacting application nodes, your database, and indexes. In our internal testing, we observed performance degradation in all product functions on the node actively performing an export.

To minimize the risk of performance problems, we strongly recommend that you:

Perform the data export during hours of low activity, or on a node with no activity.
Limit the amount of data exported through the fromDate parameter, as a date further in the past will export more data, resulting in a longer data export.

Our test results showed the following approximate durations for the export.

Amount of data	Approximate export duration
Small data set: 27 million commits, 250,000 pull requests, 1.5 million pull request activity records, 6,500 repositories, 2,000 users	10 hours
Large data set: 207 million commits, 1 million pull requests, 6.8 million pull request activity records, 52,000 repositories, 25,000 users	35 hours

Test performance vs production

The performance data presented here is based on our own internal testing. The actual duration and impact of a data export on your own environment will likely differ depending on:

your infrastructure, configuration, and load
the amount of pull request activity to be exported.

Our tests were conducted on Data Center instances in AWS:

Small - EC2 instance type m5d.4xlarge and RDS instance type db.m4.4xlarge
Large - EC2 instance type c5.2xlarge and RDS instance type db.m5.large

We intentionally export data quite slowly to keep any performance degradation under a 5% threshold. If you run Bitbucket in a cluster, you could use your load balancer to redirect traffic away from the node performing the export.

Performing the data export

Use the data pipeline REST API to export data.

If your application is configured to use a context path, such as /bitbucket, remember to include this in the <base-url> in the examples below.

To export the current state data, make a POST request to <base-url>/rest/datapipeline/latest/export.

Use the fromDate parameter to limit the data exported to just entities created or updated after the fromDate value.

This parameter only accepts date values set in ISO 8601 format (yyyy-MM-ddTHH:mmTZD). For example:

2020-12-30T23:01Z
2020-12-30T22:01+01:00
(you'll need to use URL encoding in your request, for example 2020-12-30T22%3A01%2B01%3A00)

If you trigger an export without the fromDate parameter, all data from the last 365 days will be exported.
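As a rough illustration of the fromDate format and URL encoding described above, the following Python sketch renders a timezone-aware datetime at minute precision and percent-encodes it. The function names format_from_date and encode_from_date are hypothetical helpers, not part of the data pipeline API.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote


def format_from_date(moment: datetime) -> str:
    """Render an aware datetime as yyyy-MM-ddTHH:mmTZD (minute precision)."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("fromDate needs an explicit time zone")
    if offset == timedelta(0):
        zone = "Z"
    else:
        sign = "+" if offset > timedelta(0) else "-"
        minutes = abs(offset) // timedelta(minutes=1)
        zone = f"{sign}{minutes // 60:02d}:{minutes % 60:02d}"
    return moment.strftime("%Y-%m-%dT%H:%M") + zone


def encode_from_date(moment: datetime) -> str:
    """Percent-encode the value so ':' and '+' survive in a query string."""
    return quote(format_from_date(moment), safe="")
```

For example, a datetime of 22:01 on 30 December 2020 at UTC+01:00 is rendered as 2020-12-30T22:01+01:00 before encoding.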

Here is an example request, using cURL and a personal access token for authentication:

curl -H "Authorization: Bearer ABCD1234" -H "X-Atlassian-Token: no-check" \
  -X POST "https://myexamplesite.com/rest/datapipeline/latest/export?fromDate=2020-10-22T01:30Z"

The "X-Atlassian-Token: no-check" header is only required for Confluence. You can omit this for Jira.
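If you prefer to script the request rather than use cURL, a minimal Python sketch of the same call might look like this. It only builds the request object; build_export_request is a hypothetical helper name, and the base URL and token are the placeholder values from the cURL example.

```python
import urllib.request
from urllib.parse import urlencode


def build_export_request(base_url, token, from_date=None):
    """Build (but do not send) the POST request that starts an export."""
    url = base_url.rstrip("/") + "/rest/datapipeline/latest/export"
    if from_date is not None:
        # urlencode percent-encodes ':' and '+' in the ISO 8601 value.
        url += "?" + urlencode({"fromDate": from_date})
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "X-Atlassian-Token": "no-check",
        },
    )
```

Passing the result to urllib.request.urlopen would send it. Note that urlopen raises urllib.error.HTTPError for 4xx responses, so the non-success responses to this request surface as exceptions rather than return values.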

The POST request returns the following responses.

Code	Description

202	Data export started. For example:

{
  "startTime": "2021-03-03T12:08:24.045+11:00",
  "nodeId": "node1",
  "jobId": 124,
  "status": "STARTED",
  "config": {
    "exportFrom": "2020-03-03T12:08:24.036+11:00",
    "forcedExport": false
  },
  "rootExportPath": "/path/data-pipeline/export"
}

409	Another data export is already running:

{
  "startTime": "2021-03-03T12:08:24.045+11:00",
  "nodeId": "node1",
  "jobId": 124,
  "status": "STARTED",
  "config": {
    "exportFrom": "2020-03-03T12:08:24.036+11:00",
    "forcedExport": false
  },
  "rootExportPath": "/path/data-pipeline/export"
}

422	Data export failed due to an inconsistent index:

{
  "startTime": "2021-01-13T09:01:01.917+11:00",
  "completedTime": "2021-01-13T09:01:01.986+11:00",
  "nodeId": "node2",
  "jobId": 56,
  "status": "FAILED",
  "config": {
    "exportFrom": "2020-07-17T08:00:00+10:00",
    "forcedExport": false
  },
  "errors": [
    {
      "key": "export.pre.validation.failed",
      "message": "Inconsistent index used for export job."
    }
  ],
  "rootExportPath": "/path/data-pipeline/export"
}

If this occurs, you may need to reindex and then retry the data export.

Alternatively, you can force a data export using the forceExport=true query parameter. However, forcing an export on an inconsistent index could result in incomplete data.

The following response is returned when you force an export on an inconsistent index, to warn you that the data might be incomplete.

{
  "startTime": "2021-01-13T09:01:42.696+11:00",
  "nodeId": "node2",
  "jobId": 57,
  "status": "STARTED",
  "config": {
    "exportFrom": "2020-07-17T08:01:00+10:00",
    "forcedExport": true
  },
  "warnings": [
    {
      "key": "export.pre.validation.failed",
      "message": "Inconsistent index used for export job."
    }
  ],
  "rootExportPath": "/path/data-pipeline/export"
}
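The responses above share one JSON shape, so a script can summarize them with a single function. The sketch below is one possible way to do that, assuming only the fields shown in the samples (jobId, nodeId, status, errors, warnings); summarize_export_response is a hypothetical name.

```python
import json


def summarize_export_response(status_code: int, body: str) -> str:
    """Turn an export response into a one-line, human-readable summary."""
    job = json.loads(body)
    job_id = job.get("jobId")
    if status_code == 409:
        return f"job {job_id} is already running on node {job.get('nodeId')}"
    if status_code == 422:
        errors = "; ".join(e["message"] for e in job.get("errors", []))
        return f"job {job_id} failed validation: {errors}"
    status = job.get("status", "UNKNOWN")
    warnings = "; ".join(w["message"] for w in job.get("warnings", []))
    if warnings:
        # A forced export on an inconsistent index starts, but carries warnings.
        return f"job {job_id} {status} with warnings: {warnings}"
    return f"job {job_id} {status}"
```

Checking for warnings separately from the status code means a forced export is flagged as potentially incomplete even though the job itself has started.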

Automatic data export cancellations

If a node running a data export is gracefully shut down, the export will be automatically marked as CANCELLED.

However, if the JVM is not notified after a crash or hardware-level failure occurs, the export process may get locked. This means you'll need to manually mark the export as CANCELLED by making a DELETE request. This releases the process lock, allowing you to perform another data export.
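This page does not state which URL the DELETE request should target. The sketch below assumes it is the same <base-url>/rest/datapipeline/latest/export endpoint used to start an export; check this against the REST API reference for your version before relying on it. build_cancel_request is a hypothetical helper name.

```python
import urllib.request


def build_cancel_request(base_url, token):
    """Build (but do not send) a DELETE request to cancel a locked export."""
    # ASSUMPTION: cancellation uses the same endpoint as starting an export.
    url = base_url.rstrip("/") + "/rest/datapipeline/latest/export"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={
            "Authorization": f"Bearer {token}",
            "X-Atlassian-Token": "no-check",
        },
    )
```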

Configuring the data export

You can configure the format of the export data through the following configuration properties.

Property	Default value	Description
plugin.data.pipeline.embedded.line.break.preserve	`false`	Specifies whether embedded line breaks should be preserved in the output files. Line breaks can be problematic for some tools such as Hadoop. This property is set to `false` by default, which means that line breaks are escaped.
plugin.data.pipeline.embedded.line.break.escape.char	`\\n`	Escaping character for embedded line breaks. By default, we'll print `\n` for every embedded line break.
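With the default settings, an embedded line break reaches the CSV file as the two characters \ and n. If a downstream tool needs the original line breaks back, something like the following Python sketch could restore them while reading. It is a naive replacement: a field that legitimately contains a backslash followed by n would also be converted. The function names are hypothetical.

```python
import csv
import io


def restore_line_breaks(value: str, escape: str = "\\n") -> str:
    """Replace the escape sequence with a real line break."""
    return value.replace(escape, "\n")


def read_rows(csv_text: str, escape: str = "\\n"):
    """Parse exported CSV text into dicts, restoring embedded line breaks."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    return [
        dict(zip(header, (restore_line_breaks(cell, escape) for cell in row)))
        for row in reader
    ]
```

If you change plugin.data.pipeline.embedded.line.break.escape.char, pass the same value as the escape argument.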

The following additional properties only apply to Bitbucket.

Property	Default value	Description
plugin.data.pipeline.bitbucket.export.personal.forked.repository.commits	`false`	Specifies whether commits from forked repositories in personal projects should be exported. Set this property to `true` to include commits from forked repositories in personal projects.
plugin.data.pipeline.bitbucket.export.build.statuses	`false`	Specifies whether build statuses should be included in the export. Exporting build statuses can take a significant amount of time if you have a lot of builds. Set this property to `true` to export build statuses.
plugin.data.pipeline.bitbucket.commit.queue.polling.timeout.seconds	`20`	Time, in seconds, to wait for the first commit from the git process. You should only need to change this if you see a `CommitStreamingException` (this error is usually caused by another underlying problem).
plugin.data.pipeline.bitbucket.commit.git.execution.timeout.seconds	`3600`	Sets the idle and execution timeout, in seconds, for the git rev-list command. You should only need to change this if you see an "an error occurred while executing an external process: process timed out" error.
plugin.data.pipeline.bitbucket.export.pull.request.activities	`true`	Specifies whether historical pull request activity data should be included in the export. Exporting activity data will significantly increase your export duration. Set this property to `false` to exclude pull request activity from your export.

Check the status of an export

You can check the status of an export and view when your last export ran from within your application’s admin console. To view data export status:

Go to Administration > System.
Select Data pipeline.

There are a number of export statuses:

Not started - no export is currently running
Started - the export is currently running
Completed - the export has completed
Cancellation requested - a cancellation request has been sent
Cancelled - the export was cancelled
Failed - the export failed.

For help resolving failed or cancelled exports, see Data pipeline troubleshooting.

Output files

Each time you perform a data export, we assign a numerical job ID to the task (starting with 1 for your first ever data export). This job ID is used in both the file names and the location of the files containing your exported data.

Location of exported files

Exported data is saved as separate CSV files. The files are saved to the following directory:

<shared-home>/data-pipeline/export/<job-id> if you run Bitbucket in a cluster
<local-home>/shared/data-pipeline/export/<job-id> if you are using non-clustered Bitbucket.

Within the <job-id> directory you will see the following files:

build_statuses_<job_id>_<timestamp>.csv
commits_<job_id>_<timestamp>.csv
pull_request_activities_<job_id>_<timestamp>.csv
pull_requests_<job_id>_<timestamp>.csv
repositories_<job_id>_<timestamp>.csv
users_<job_id>_<timestamp>.csv

To load and transform the data in these files, you'll need to understand the schema. See Data pipeline export schema.
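To pick up the files for a given job in a script, you could match on the naming pattern listed above. This is a sketch only: find_export_files is a hypothetical helper, and the regular expression assumes nothing about the timestamp beyond it being the last part of the file name.

```python
import re
from pathlib import Path

# Matches names such as commits_<job_id>_<timestamp>.csv
_EXPORT_FILE = re.compile(
    r"^(?P<entity>[a-z_]+)_(?P<job_id>\d+)_(?P<timestamp>.+)\.csv$"
)


def find_export_files(root_export_path, job_id: int):
    """Map entity name (for example 'commits') to its CSV path for one job."""
    job_dir = Path(root_export_path) / str(job_id)
    found = {}
    for path in sorted(job_dir.glob("*.csv")):
        match = _EXPORT_FILE.match(path.name)
        if match and int(match.group("job_id")) == job_id:
            found[match.group("entity")] = path
    return found
```

The root_export_path argument corresponds to the rootExportPath value returned when the export is started.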

Set a custom export path

By default, the data pipeline exports the files to the shared home directory, but you can use the REST API to set a custom export path.

To change the root export path, make a PUT request to <base-url>/rest/datapipeline/1.0/config/export-path.

In the body of the request pass the absolute path to your preferred directory, for example:

{
  "path": "/tmp/new/path"
}

The PUT request returns the following response:

Code	Sample response
200	If the path is writable and accepted: `{ "exportPath":"/tmp/new/path/data-pipeline/export", "customPathSet":true }`

To check the export path, make a GET request to <base-url>/rest/datapipeline/1.0/config/export-path.

The GET request returns the following responses.

Code	Sample response
200	When custom path set: `{ "exportPath":"/tmp/example/pipeline", "customPathSet":true }`
200	When custom path not set, the default shared home path is returned: `{ "exportPath":"/shared/home/export/path", "customPathSet":false }`

Revert to the default export path

To revert to the default path, make a DELETE request to <base-url>/rest/datapipeline/1.0/config/export-path.
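The three export-path calls above (PUT, GET, and DELETE) differ only in method and body, so they can be sketched with one Python helper. This assumes the same bearer-token authentication as the export request and a JSON Content-Type header for the PUT body; neither is confirmed on this page for these endpoints. build_export_path_request is a hypothetical name.

```python
import json
import urllib.request

EXPORT_PATH_ENDPOINT = "/rest/datapipeline/1.0/config/export-path"


def build_export_path_request(base_url, token, method, path=None):
    """Build (but do not send) a GET, PUT, or DELETE export-path request."""
    if method not in {"GET", "PUT", "DELETE"}:
        raise ValueError(f"unsupported method: {method}")
    if (method == "PUT") != (path is not None):
        raise ValueError("a path is required for PUT and only for PUT")
    # ASSUMPTION: bearer-token authentication, as in the export example.
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if method == "PUT":
        data = json.dumps({"path": path}).encode("utf-8")
        # ASSUMPTION: the endpoint expects a JSON content type.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + EXPORT_PATH_ENDPOINT,
        data=data,
        method=method,
        headers=headers,
    )
```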

Sample Spark and Hadoop import configurations

If you have an existing Spark or Hadoop instance, use the following references to configure how to import your data for further transformation:

Spark/Databricks

%python
# File location
file_location = "/FileStore/**/export_2020_09_24T03_32_18Z.csv" 

# Automatically set data type for columns
infer_schema = "true"
# Skip first row as it's a header
first_row_is_header = "true"
# Ignore multiline within double quotes
multiline_support = "true"

# The applied options are for CSV files. For other file types, these will be ignored. Note escape & quote options for RFC 4180 compliant files
df = spark.read.format("csv") \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("multiLine", multiline_support) \
  .option("quote", "\"") \
  .option("escape", "\"") \
  .option("encoding", "UTF-8").load(file_location)

display(df)

Hadoop

CREATE EXTERNAL TABLE IF NOT EXISTS some_db.datapipeline_export (
  `repository_id` string, 
  `instance_url` string,
  `url` string,
  `repository_name` string,
  `description` string,
  `hierarchy_id` string,
  `origin` string,
  `project_id` string,
  `project_key` string,
  `project_name` string,
  `project_type` string,
  `forkable` string,
  `fork` string,
  `public` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "escapeChar" = "\\",
  'quoteChar' = '"',
  'separatorChar' = ','
) LOCATION 's3://my-data-pipeline-bucket/test-exports/'
TBLPROPERTIES ('has_encrypted_data'='false');

Troubleshooting failed exports

Exports can fail for a number of reasons, for example if your search index isn’t up to date. For guidance on common failures, and how to resolve them, see Data pipeline troubleshooting in our knowledge base.
