Extract the latest changes from data shares

Did you know you can extract data from a specific point in time with data shares? This means you don’t need to fully extract all your tables every time. This article explains how to extract only the changes made since your last extraction, or since a point in time of your choice.

Extract the latest changes from the Atlassian Data Lake with data shares

Data shares, powered by the Databricks Delta Sharing protocol, support filtering so you only receive data added or updated since a specific date. This is possible with the Python and Spark connectors (check the GitHub community for the latest status of these connectors).

Our recommended method is to use predicate hints to filter the data on the row_refreshed_at column. This column is available on all tables except the jira_issue_enhanced_table table, and it lets you pull rows that have been updated or added since a specific date. When this filter is applied, only the files containing rows updated on or after the provided date are extracted.

Example using the Python connector:

import json
import delta_sharing

# Set up client and get all tables
profile_file = "~/Path/To/credentials.share"
client = delta_sharing.SharingClient(profile_file)
tables = client.list_all_tables()

# Set up filtering
delta_column_name = "row_refreshed_at"
last_refreshed_date = "2024-01-01" # or use an ISO 8601 timestamp such as 2022-01-01T00:00:00Z
json_predicate_hints = {
    "op": "greaterThanOrEqual",
    "children": [
        {"op": "column", "name": delta_column_name, "valueType": "timestamp"},
        {"op": "literal", "value": last_refreshed_date, "valueType": "timestamp"}
    ]
}
json_predicate_hints_string = json.dumps(json_predicate_hints, indent=2)

# Load each table's data, keeping only the files with rows refreshed on or after the date
for table in tables:
    table_url = profile_file + f"#{table.share}.{table.schema}.{table.name}"
    df = delta_sharing.load_as_pandas(
        url=table_url,
        jsonPredicateHints=json_predicate_hints_string,
    )
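
If you use the Spark connector instead, a similar filter can be applied after loading the table. Below is a minimal sketch, assuming an active SparkSession with the Delta Sharing Spark package available and a placeholder table URL; whether the filter is pushed down to the server as a predicate hint depends on the connector version, so check the GitHub community for the latest status.

import delta_sharing

profile_file = "~/Path/To/credentials.share"
# Placeholder table coordinates; replace with a real share, schema, and table name
table_url = profile_file + "#<share>.<schema>.<table>"

# Load the shared table as a Spark DataFrame and keep only recently refreshed rows
df = delta_sharing.load_as_spark(table_url)
latest_rows = df.where("row_refreshed_at >= '2024-01-01'")
latest_rows.show()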

Versions are available for seven days

As of January 2025, we increased the version retention from 48 hours to 7 days.

You might notice that the protocol suggests using the version predicate. However, we recommend using row_refreshed_at, because we keep versions for only seven days. After seven days, versions are no longer available in the Atlassian Data Lake: we run optimization processes on the data, and those processes delete older versions.

Use the row_refreshed_at column, which is always available, to ensure your extracts and processes are accurate.
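
For example, one simple pattern is to record the timestamp of each successful extraction and pass it as last_refreshed_date on the next run. Below is a minimal sketch; the state file and helper functions are hypothetical, not part of the delta-sharing connector.

import json
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("last_extraction.json")  # hypothetical local state file

def read_last_refreshed_date(default="1970-01-01T00:00:00Z"):
    # Timestamp saved by the previous run, or a default that extracts everything
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_refreshed_date"]
    return default

def save_last_refreshed_date():
    # Call after a successful extraction so the next run only pulls newer rows
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    STATE_FILE.write_text(json.dumps({"last_refreshed_date": now}))

Because row_refreshed_at never expires, this works no matter how long ago the previous extraction ran.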

Resolving the FAILED_READ_FILE.DBR_FILE_NOT_EXIST error

If you encounter the error below in your script, make sure you’re only using row_refreshed_at and not another predicate.

[FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file {S3 File Link}.
[DELTA_CHANGE_DATA_FILE_NOT_FOUND] {S3 File Link} referenced in the transaction log cannot be found.
This can occur when data has been manually deleted from the file system rather than using the table `DELETE` statement.
This request appears to be targeting Change Data Feed, if that is the case, this error can occur when the change data 
file is out of the retention period and has been deleted by the `VACUUM` statement.
For more information, see https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions SQLSTATE: KD001

You’ll get this error when you attempt to read a version that’s no longer available. As mentioned previously, we remove versions after seven days. We recommend using the filter predicate row_refreshed_at to pull deltas.
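
For illustration, here is a minimal sketch of the two call patterns, assuming the Python connector and a placeholder table URL; the version-based read is the kind of request that can fail once the requested version has aged out of retention.

import json
import delta_sharing

profile_file = "~/Path/To/credentials.share"
table_url = profile_file + "#<share>.<schema>.<table>"  # placeholder table URL

# Risky: reading from a specific version can fail with the error above once
# that version is older than seven days and its files have been removed.
# df = delta_sharing.load_as_pandas(table_url, version=1)

# Recommended: filter on row_refreshed_at with a predicate hint instead.
hints = json.dumps({
    "op": "greaterThanOrEqual",
    "children": [
        {"op": "column", "name": "row_refreshed_at", "valueType": "timestamp"},
        {"op": "literal", "value": "2024-01-01T00:00:00Z", "valueType": "timestamp"},
    ],
})
df = delta_sharing.load_as_pandas(url=table_url, jsonPredicateHints=hints)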

No support for the jira_issue_enhanced_table table

The jira_issue_enhanced_table table is a materialized view in our schema. It combines multiple tables to simplify reporting with BI tools directly from the Atlassian Data Lake.

This materialized view is recalculated in full every time its data is refreshed (every 3 to 6 hours), so all of its rows are rewritten on each refresh.

As a result, there is no reliable way to get only the latest changes from this table with predicates; you need to extract it in full each time, as shown in the sketch below.
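
A minimal sketch that skips the predicate hints for this table while still filtering every other one, assuming the same credentials file and the json_predicate_hints_string built in the earlier Python example:

import delta_sharing

profile_file = "~/Path/To/credentials.share"
client = delta_sharing.SharingClient(profile_file)
tables = client.list_all_tables()

for table in tables:
    table_url = profile_file + f"#{table.share}.{table.schema}.{table.name}"
    if table.name == "jira_issue_enhanced_table":
        # Materialized view: no reliable delta filter, so extract it in full
        df = delta_sharing.load_as_pandas(url=table_url)
    else:
        # All other tables support filtering on row_refreshed_at
        df = delta_sharing.load_as_pandas(
            url=table_url,
            jsonPredicateHints=json_predicate_hints_string,  # built in the earlier example
        )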

When rows of data don’t match the predicates

To maximize efficiency when transferring large amounts of data, the protocol extracts whole Data Lake files. As a result, it returns every file that contains at least one row matching the predicates, even if that file also contains rows that don’t.

For example, suppose the Data Lake has the three files below, each containing rows with these row_refreshed_at values:

File 1        File 2        File 3
2024-12-25    2024-12-25    2024-11-01
2024-12-25    2024-11-01    2024-11-01

Let’s say our code asks for rows where row_refreshed_at is on or after '2024-12-25'. The protocol will return File 1 and File 2 because they both contain at least one matching row, so File 2's 2024-11-01 row is included even though it doesn’t match the predicate.
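
If you need only the matching rows, you can re-apply the same filter locally after loading. A minimal sketch, reusing df, delta_column_name, and last_refreshed_date from the earlier Python example:

import pandas as pd

# Drop the extra rows that arrived only because their file also held matching rows
cutoff = pd.to_datetime(last_refreshed_date, utc=True)
df = df[pd.to_datetime(df[delta_column_name], utc=True) >= cutoff]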

Scenarios that return all data in tables

We run operations on the Atlassian Data Lake that can refresh the data, which in turn updates the row_refreshed_at columns.

The most common scenarios include:

  • Loading the historical data when we release new columns or issue fields

  • Migrating data between regions following a data residency request

  • Fixing the data when we detect an error
