Extract the latest changes from data shares
Did you know you can extract data from a specific point in time with data shares? This means you don’t need to fully extract all your tables every time. This article guides you through extracting only the latest changes since the last time you extracted data, or since a point in time of your choice.
Extract the latest changes from the Atlassian Data Lake with data shares
Data shares, powered by the Databricks Delta Sharing protocol, support filtering so you only receive rows since a specific date. This is possible with the Python and Spark connectors (check the delta-sharing GitHub repository for the latest status of these connectors).
Our recommended method is to use predicate hints to filter the data on the row_refreshed_at column. This column is available on all tables except for the jira_issue_enhanced_table table. You can use the row_refreshed_at column to pull rows that have been updated or added since a specific date. When this filter is applied, only the files containing rows updated after the provided date are extracted.
Example using the Python connector:
import json
import delta_sharing

# Set up client and get all tables
profile_file = "~/Path/To/credentials.share"
client = delta_sharing.SharingClient(profile_file)
tables = client.list_all_tables()

# Set up filtering on the row_refreshed_at column
delta_column_name = "row_refreshed_at"
last_refreshed_date = "2024-01-01"  # or alternatively, use an ISO 8601 format such as 2022-01-01T00:00:00Z
json_predicate_hints = {
    "op": "greaterThanOrEqual",
    "children": [
        {"op": "column", "name": delta_column_name, "valueType": "timestamp"},
        {"op": "literal", "value": last_refreshed_date, "valueType": "timestamp"}
    ]
}
json_predicate_hints_string = json.dumps(json_predicate_hints, indent=2)

# Load the data, passing the predicate hints to each table read
for table in tables:
    table_url = profile_file + f"#{table.share}.{table.schema}.{table.name}"
    dataFrame = delta_sharing.load_as_pandas(
        url=table_url,
        jsonPredicateHints=json_predicate_hints_string
    )
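To extract only the latest changes since your last run, you can persist the highest row_refreshed_at value you received and reuse it as the starting point next time. The sketch below is one way to do this, assuming a hypothetical local state file (last_refreshed_date.txt); it is not part of the connector itself, and you may need to adjust the timestamp formatting to the ISO 8601 form shown above.

from pathlib import Path

STATE_FILE = Path("last_refreshed_date.txt")  # hypothetical local state file

# Read the previous watermark; fall back to a full-history date on the first run
last_refreshed_date = (
    STATE_FILE.read_text().strip() if STATE_FILE.exists() else "1970-01-01T00:00:00Z"
)

# ... run the extraction shown above using last_refreshed_date ...

# After a successful run, store the newest timestamp seen so the next run
# only pulls rows refreshed after it (format it as ISO 8601 if needed)
if not dataFrame.empty:
    STATE_FILE.write_text(str(dataFrame["row_refreshed_at"].max()))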
Versions are available for seven days
As of January 2025, we increased the version retention from 48 hours to 7 days.
You might notice that the protocol suggests using the version predicate. However, we recommend using row_refreshed_at, because we keep versions for only seven days. After seven days, versions are no longer available in the Atlassian Data Lake: we run optimization processes on the data, and older versions are deleted as part of them. Use the row_refreshed_at column, which is always available, to ensure your extracts and processes are accurate.
Resolving the FAILED_READ_FILE.DBR_FILE_NOT_EXIST error
If you encounter the error below in your script, make sure you’re only using row_refreshed_at and not another predicate.
[FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file {S3 File Link}.
[DELTA_CHANGE_DATA_FILE_NOT_FOUND] {S3 File Link} referenced in the transaction log cannot be found.
This can occur when data has been manually deleted from the file system rather than using the table `DELETE` statement.
This request appears to be targeting Change Data Feed, if that is the case, this error can occur when the change data
file is out of the retention period and has been deleted by the `VACUUM` statement.
For more information, see https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions SQLSTATE: KD001
You’ll get this error when you attempt to read a version that’s no longer available. As mentioned previously, we remove versions after seven days. We recommend using the filter predicate row_refreshed_at to pull deltas.
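If your pipeline reads versions or the Change Data Feed anyway, one defensive pattern is to catch this failure and fall back to the row_refreshed_at predicate. The sketch below assumes the connector’s load_table_changes_as_pandas reader and reuses the variables from the earlier example; the exact exception type depends on the connector version, so it catches broadly and inspects the message.

try:
    # Change Data Feed read; fails once the requested version is older than seven days
    dataFrame = delta_sharing.load_table_changes_as_pandas(
        url=table_url,
        starting_timestamp=last_refreshed_date
    )
except Exception as error:
    if "DBR_FILE_NOT_EXIST" in str(error) or "DELTA_CHANGE_DATA_FILE_NOT_FOUND" in str(error):
        # The underlying files were removed by optimization; fall back to the
        # row_refreshed_at predicate shown earlier
        dataFrame = delta_sharing.load_as_pandas(
            url=table_url,
            jsonPredicateHints=json_predicate_hints_string
        )
    else:
        raise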
No support for the jira_issue_enhanced_table table
The jira_issue_enhanced_table table is a materialized view in our schema. It combines multiple tables to simplify reporting for BI tools querying the Atlassian Data Lake directly. The materialized view is recalculated every time its data is refreshed (every 3 to 6 hours), so each refresh effectively rewrites every row. As a result, there is no reliable way to pull only the latest changes from this table with predicates.
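If your pipeline loops over every shared table, one option is to extract this view in full while keeping the predicate hints for everything else. A minimal sketch building on the earlier example:

for table in tables:
    table_url = profile_file + f"#{table.share}.{table.schema}.{table.name}"
    if table.name == "jira_issue_enhanced_table":
        # Materialized view: predicates don't help, so extract it in full
        dataFrame = delta_sharing.load_as_pandas(url=table_url)
    else:
        # Regular table: only pull rows refreshed since the last extraction
        dataFrame = delta_sharing.load_as_pandas(
            url=table_url,
            jsonPredicateHints=json_predicate_hints_string
        )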
When rows of data don’t respect the predicates
The protocol transfers whole Data Lake files to maximize efficiency when moving large amounts of data. As a result, it returns every file that contains at least one row matching the predicates, so some of the rows you receive may fall outside your filter.
For example, suppose the Data Lake has the three files below, where each cell is the row_refreshed_at value of a row stored in that file:
| File 1 | File 2 | File 3 |
|---|---|---|
| 2024-12-25 | 2024-12-25 | 2024-11-01 |
| 2024-12-25 | 2024-11-01 | 2024-11-01 |
Let’s say our code asks for rows where row_refreshed_at is on or after '2024-12-25'. The protocol will return File 1 and File 2 because they both contain at least one row that matches the predicate, even though File 2 also contains a row from 2024-11-01. File 3 is skipped because none of its rows match. If you need exact results, re-apply the filter on your side, as shown below.
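Because the predicate hint is a file-level optimization rather than a row-level guarantee, you may want to re-apply the same condition to the DataFrame after loading. A minimal pandas sketch, continuing the earlier example:

import pandas as pd

# Re-apply the predicate locally so rows from older dates in the returned
# files are dropped
cutoff = pd.to_datetime(last_refreshed_date, utc=True)
mask = pd.to_datetime(dataFrame["row_refreshed_at"], utc=True) >= cutoff
dataFrame = dataFrame[mask]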
Scenarios that return all data in tables
We run operations on the Atlassian Data Lake that can refresh the data, which in turn updates the row_refreshed_at columns.
The most common scenarios include:
Loading the historical data when we release new columns or issue fields
Migrating data between regions following a data residency request
Fixing the data when we detect an error
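Because these operations can touch every row, a delta pipeline should treat each extract as an upsert on the table’s key rather than a plain append, so a full refresh doesn’t create duplicates. A minimal pandas sketch, assuming a hypothetical issue_id key column and an existing local copy of the table:

import pandas as pd

# existing_data: local copy of the table; dataFrame: rows from the latest extract
combined = pd.concat([existing_data, dataFrame], ignore_index=True)

# Keep only the most recently refreshed row per key so re-delivered rows
# overwrite the old ones ("issue_id" is a hypothetical key; use your table's actual key)
combined = (
    combined.sort_values("row_refreshed_at")
    .drop_duplicates(subset="issue_id", keep="last")
)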