Data pipeline
Requirements
To trigger data exports through the REST API, you’ll need:
- A valid Bitbucket Data Center license
- Bitbucket system admin global permissions
Considerations
There are a number of security and performance impacts you’ll need to consider before getting started.
Security
If you need to filter out data based on security and confidentiality, this must be done after the data is exported.
Exported files are saved in your shared home directory, so you'll also want to check that this location is secured appropriately.
Performance impact
To minimize the risk of performance problems, we strongly recommend that you:
- Perform the data export during hours of low activity, or on a node with no activity.
- Limit the amount of data exported through the `fromDate` parameter, as a date further in the past will export more data, resulting in a longer data export.
Amount of data | Approximate export duration
---|---
Small data set | 10 hours
Large data set | 35 hours
Test performance vs production
The performance data presented here is based on our own internal testing. The actual duration and impact of a data export on your own environment will likely differ depending on:
- your infrastructure, configuration, and load
- the amount of pull request activity to be exported.
Our tests were conducted on Data Center instances in AWS:
- Small - EC2 instance type `m5d.4xlarge` and RDS instance type `db.m4.4xlarge`
- Large - EC2 instance type `c5.2xlarge` and RDS instance type `db.m5.large`
We intentionally export data quite slowly to keep any performance degradation under a 5% threshold. If you run Bitbucket in a cluster, you could use your load balancer to redirect traffic away from the node performing the export.
Performing the data export
If your application is configured to use a context path, such as `/bitbucket`, remember to include this in the `<base-url>` in the examples below.
To export the current state data, make a `POST` request to `<base-url>/rest/datapipeline/latest/export`.
Use the `fromDate` parameter to limit the data exported to just entities created or updated after the `fromDate` value.
This parameter only accepts date values set in ISO 8601 format (yyyy-MM-ddTHH:mmTZD). For example:
- 2020-12-30T23:01Z
- 2020-12-30T22:01+01:00 (you'll need to use URL encoding in your request, for example `2020-12-30T22%3A01%2B01%3A00`)
If you trigger an export without the `fromDate` parameter, all data from the last 365 days will be exported.
Here is an example request, using cURL and a personal access token for authentication:

```
curl -H "Authorization: Bearer ABCD1234" -H "X-Atlassian-Token: no-check" \
  -X POST "https://myexamplesite.com/rest/datapipeline/latest/export?fromDate=2020-10-22T01:30:11Z"
```
The "X-Atlassian-Token: no-check"
header is only required for Confluence. You can omit this for Jira.
The `POST` request returns a response indicating whether the export started successfully.
Automatic data export cancellations
If a node running a data export is gracefully shut down, the export will be automatically marked as CANCELLED. However, if the JVM is not notified after a crash or hardware-level failure occurs, the export process may get locked. This means you'll need to manually mark the export as CANCELLED by making a `DELETE` request. This releases the process lock, allowing you to perform another data export.
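As an illustration, a manual cancellation with cURL might look like the sketch below. It assumes the `DELETE` is sent to the same export endpoint used to start the job, and reuses the hypothetical token from the earlier example; check the REST API reference for your version for the exact path:

```
# Release the lock on a stuck export by marking it CANCELLED
curl -H "Authorization: Bearer ABCD1234" \
  -X DELETE https://myexamplesite.com/rest/datapipeline/latest/export
```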
Configuring the data export
You can configure the format of the export data through the following configuration properties.
Property | Default value | Description
---|---|---
`plugin.data.pipeline.embedded.line.break.preserve` | `false` | Specifies whether embedded line breaks should be preserved in the output files. Line breaks can be problematic for some tools, such as Hadoop. This property is set to `false` by default, which means embedded line breaks are escaped.
`plugin.data.pipeline.embedded.line.break.escape.char` | `\n` | Escaping character for embedded line breaks. By default, we'll print `\n` for every embedded line break.
The following additional properties only apply to Bitbucket.
Property | Default value | Description
---|---|---
`plugin.data.pipeline.bitbucket.export.personal.forked.repository.commits` | `false` | Specifies whether commits from forked repositories in personal projects should be exported. Set this property to `true` to include them.
`plugin.data.pipeline.bitbucket.export.build.statuses` | `false` | Specifies whether build statuses should be included in the export. Exporting build statuses can take a significant amount of time if you have a lot of builds. Set this property to `true` to include them.
`plugin.data.pipeline.bitbucket.commit.queue.polling.timeout.seconds` | `20` | Time, in seconds, it takes to receive the first commit from the git process. You should only need to change this if you see timeout errors while commits are being exported.
`plugin.data.pipeline.bitbucket.commit.git.execution.timeout.seconds` | `3600` | Sets the idle and execution timeout for the git ref-list command. You should only need to change this if you see the error "an error occurred while executing an external process: process timed out".
`plugin.data.pipeline.bitbucket.export.pull.request.activities` | `true` | Specifies whether historical data about pull request activity should be included in the export. Exporting activity data will significantly increase your export duration. Set this property to `false` to exclude it.
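As a sketch, assuming these are set like other Bitbucket system properties in `<bitbucket-home>/shared/bitbucket.properties` (a restart is required for changes to take effect), the configuration might look like this; the values below are examples, not recommendations:

```
# <bitbucket-home>/shared/bitbucket.properties
# Example values only: include build statuses, skip pull request activity
plugin.data.pipeline.bitbucket.export.build.statuses=true
plugin.data.pipeline.bitbucket.export.pull.request.activities=false
```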
Check the status of an export
You can check the status of an export and view when your last export ran from within your application’s admin console. To view data export status:
- Go to Administration > System.
- Select Data pipeline.
The export status is shown on this screen, and will be one of the following:
- Not started - no export is currently running
- Started - the export is currently running
- Completed - the export has completed
- Cancellation requested - a cancellation request has been sent
- Cancelled - the export was cancelled
- Failed - the export failed.
For help resolving failed or cancelled exports, see Data pipeline troubleshooting.
Output files
Each time you perform a data export, we assign a numerical job ID to the task (starting with 1 for your first ever data export). This job ID is used in the file names and location of the files containing your exported data.
Location of exported files
Exported data is saved as separate CSV files. The files are saved to the following directory:
- `<shared-home>/data-pipeline/export/<job-id>` if you run Bitbucket in a cluster
- `<local-home>/shared/data-pipeline/export/<job-id>` if you are using non-clustered Bitbucket
Within the `<job-id>` directory you will see the following files:
- `build_statuses_<job_id>_<timestamp>.csv`
- `commits_<job_id>_<timestamp>.csv`
- `pull_request_activities_<job_id>_<timestamp>.csv`
- `pull_requests_<job_id>_<timestamp>.csv`
- `repositories_<job_id>_<timestamp>.csv`
- `users_<job_id>_<timestamp>.csv`
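For example, if your first export has job ID 1, a clustered instance would end up with a layout like this (`<timestamp>` stands in for the timestamp portion of each file name):

```
<shared-home>/data-pipeline/export/1/
    build_statuses_1_<timestamp>.csv
    commits_1_<timestamp>.csv
    pull_request_activities_1_<timestamp>.csv
    pull_requests_1_<timestamp>.csv
    repositories_1_<timestamp>.csv
    users_1_<timestamp>.csv
```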
To load and transform the data in these files, you'll need to understand the schema. See Data pipeline export schema.
Set a custom export path
To change the root export path, make a `PUT` request to `<base-url>/rest/datapipeline/1.0/config/export-path`.
In the body of the request, pass the absolute path to your preferred directory, for example:

```
{
  "path": "/tmp/new/path"
}
```
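Put together, such a request might look like this with cURL (reusing the hypothetical site and token from the earlier export example):

```
# Set a custom root export path
curl -H "Authorization: Bearer ABCD1234" -H "Content-Type: application/json" \
  -X PUT -d '{"path": "/tmp/new/path"}' \
  https://myexamplesite.com/rest/datapipeline/1.0/config/export-path
```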
The `PUT` request returns a response indicating whether the path was updated successfully.
To check the export path, make a `GET` request to `<base-url>/rest/datapipeline/1.0/config/export-path`.
The `GET` request returns the current export path.
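For example, with cURL (which sends a `GET` request by default, so no method flag is needed):

```
# Check the current export path
curl -H "Authorization: Bearer ABCD1234" \
  https://myexamplesite.com/rest/datapipeline/1.0/config/export-path
```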
Revert to the default export path
To revert to the default path, make a `DELETE` request to `<base-url>/rest/datapipeline/1.0/config/export-path`.
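Again with cURL, reverting might look like:

```
# Revert to the default export path
curl -H "Authorization: Bearer ABCD1234" \
  -X DELETE https://myexamplesite.com/rest/datapipeline/1.0/config/export-path
```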
Sample Spark and Hadoop import configurations
If you have an existing Spark or Hadoop instance, use the following references to configure how to import your data for further transformation:
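For example, with Spark's SQL CLI you might register an exported file as a queryable table along these lines. This is a sketch only: the table name, path, and options are illustrative, and the columns should follow the data pipeline export schema.

```
# A sketch: expose exported commit data to Spark SQL using the built-in
# CSV reader. Point the path glob at your own <job-id> directory.
# multiLine is only needed if embedded line breaks are preserved
# (see the configuration properties above).
spark-sql -e "
  CREATE TABLE bitbucket_commits
  USING csv
  OPTIONS (
    path '/var/atlassian/bitbucket/shared/data-pipeline/export/1/commits_1_*.csv',
    header 'true',
    multiLine 'true'
  )"
```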