Upsert migration #47

Open

adampolomski wants to merge 176 commits into upsert-main from upsert

Conversation

@adampolomski
Collaborator

Description

The description of the main changes of your pull request

Related Issue(s)

Documentation

niltecedu and others added 30 commits September 11, 2025 19:05
…elta-io#3737)

# Description
This PR aims to add per-column parquet encoding support similar to what
pyarrow has right now. Since the pyarrow engine is being deprecated, this is
the best way to resolve the issue. It adds an encoding spec matching what
the parquet spec defines.

https://parquet.apache.org/docs/file-format/data-pages/encodings/

Thanks to the ghost/deleted user who carried out most of the work, but
it never got merged in.

# Related Issue(s)

- closes delta-io#3319


# Documentation

This is implementing

https://parquet.apache.org/docs/file-format/data-pages/encodings/

in ColumnProperties and, as part of this, it also adds an Encoding enum.
It uses the same format as the other ColumnProperties arguments.
PLAIN_DICTIONARY and RLE_DICTIONARY could not be implemented due to some
issues and will be looked into later on.

Here is a code snippet showing how to use the delta tables and how to
compare against the existing implementation:

```python
import os
import pandas as pd
import numpy as np

from deltalake import write_deltalake, WriterProperties, ColumnProperties, DeltaTable
from pathlib import Path

import pyarrow.parquet as pq
import pyarrow as pa

# Make some fake time series data
TOTAL_ROWS = 10_000_000
timestamps = pd.date_range(start=pd.Timestamp.now(), periods=TOTAL_ROWS, freq="5ms")
timeline = np.linspace(0, len(timestamps), len(timestamps))
print("Generating data...")
pat = pa.Table.from_pandas(
    pd.DataFrame(
        {
            # timestamp (auto-generated)
            "timestamp": timestamps,
            # Timeseries data as float32
            "timeseries_data": (10 * np.sin(2 * np.pi * 50 * timeline)).astype(
                np.float32
            ),
            "timeseries_int_data": (1000 * np.sin(2 * np.pi * 50 * timeline)).astype(
                np.int32
            ),
            # 1 minute partitions
            "partition_label": timestamps.strftime("%H%M"),
        }
    )
)
print("Data generated.")
output_path_normal = "example_deltalake"
write_deltalake(
    output_path_normal,
    data=pat,
    partition_by=["partition_label"],
    # Enabled compression for equivalent comparison, dictionary enabled leads to a larger file size
    writer_properties=WriterProperties(
        compression="ZSTD",
        compression_level=1,
        default_column_properties=ColumnProperties(dictionary_enabled=True),
    ),
    mode="overwrite",
    # Can't specify per-column encoding
)
print("Wrote normal delta table.")



output_path_encoded = "encoded_example_deltalake"
write_deltalake(
    output_path_encoded,
    data=pat,
    partition_by=["partition_label"],
    # Enabled compression for equivalent comparison, dictionary enabled leads to a larger file size
    writer_properties=WriterProperties(
        compression="ZSTD",
        compression_level=1,
        column_properties={
            "timestamp": ColumnProperties(dictionary_enabled=False,encoding="DELTA_BINARY_PACKED"),
            "timeseries_data": ColumnProperties(dictionary_enabled=False,encoding="BYTE_STREAM_SPLIT"),
            "timeseries_int_data": ColumnProperties(dictionary_enabled=False,encoding="DELTA_BINARY_PACKED"),
            "partition_label": ColumnProperties(dictionary_enabled=False,encoding="DELTA_BINARY_PACKED"),
        },
    ),
    mode="overwrite",
    # Per-column encodings are specified above via column_properties
)
print("Wrote encoded delta table.")

output_path_default_encoded = "example_pyarrow_delta_default_encoding"
pq.write_to_dataset(
    pat,
    output_path_default_encoded,
    partition_cols=["partition_label"],
    use_dictionary=False,
    use_byte_stream_split=True,
    compression="ZSTD",
    compression_level=1,
)


output_path_delta_specifc_encoded = "example_pyarrow_delta_specifc_col_encoding"
pq.write_to_dataset(
    pat,
    output_path_delta_specifc_encoded,
    partition_cols=["partition_label"],
    # Ability to specify column encodings here
    use_dictionary=False,
    use_byte_stream_split=False,
    column_encoding={
        "timestamp": "DELTA_BINARY_PACKED",
        "timeseries_data": "BYTE_STREAM_SPLIT",
        "timeseries_int_data": "DELTA_BINARY_PACKED",
        "partition_label": "DELTA_BINARY_PACKED",
    },
    compression="ZSTD",
    compression_level=1,
)
print("Wrote delta table with pyarrow column encodings.")


def get_folder_size(folder):
    return ByteSize(
        sum(file.stat().st_size for file in Path(folder).rglob("*"))
    ).megabytes


class ByteSize(int):
    _KB = 1024
    _suffixes = "B", "KB", "MB", "GB", "TB", "PB"

    def __new__(cls, *args, **kwargs):
        return super().__new__(cls, *args, **kwargs)

    def __init__(self, *args, **kwargs):
        self.bytes = self.B = int(self)
        self.kilobytes = self.KB = self / self._KB**1
        self.megabytes = self.MB = self / self._KB**2
        self.gigabytes = self.GB = self / self._KB**3
        self.terabytes = self.TB = self / self._KB**4
        self.petabytes = self.PB = self / self._KB**5
        *suffixes, last = self._suffixes
        suffix = next(
            (suffix for suffix in suffixes if 1 < getattr(self, suffix) < self._KB),
            last,
        )
        self.readable = suffix, getattr(self, suffix)

        super().__init__()

    def __str__(self):
        return self.__format__(".2f")

print(DeltaTable(output_path_encoded).to_pandas())

print(f"The File size of delta table is {get_folder_size(output_path_normal)} MB")
print(f"The File size of delta table with parquet encoding is {get_folder_size(output_path_default_encoded)} MB")
print(
    f"The File size of delta table with pyarrow default column encodings is {get_folder_size(output_path_default_encoded)} MB"
)
print(
    f"The File size of delta table with pyarrow specific column encodings is {get_folder_size(output_path_delta_specifc_encoded)} MB"
)


print("Deleting the folders now...")
import shutil

shutil.rmtree(output_path_normal)
shutil.rmtree(output_path_encoded)
shutil.rmtree(output_path_default_encoded)
shutil.rmtree(output_path_delta_specifc_encoded)
print("Deleted the folders.")


``` 


I have run this script in my dev environment (Ubuntu 24.04 on WSL2,
Python 3.12.3).

The outputs came to:

The File size of delta table is 122.21567821502686 MB
The File size of delta table with parquet encoding is 25.041666984558105 MB
The File size of delta table with pyarrow default column encodings is 25.041666984558105 MB
The File size of delta table with pyarrow specific column encodings is 20.985719680786133 MB

That is roughly a 75% improvement with mixed data. If you pursue
further optimisations you can get down to 95%-98% (if all INT data).
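
To confirm which encodings were actually applied, you can inspect the parquet
metadata of one of the written files. A small follow-up sketch (run it before
the cleanup step at the end of the snippet above; the glob over the output
folder is an assumption, not part of the original script):

```python
import glob

import pyarrow.parquet as pq

# Pick one parquet file written into the encoded delta table above.
parquet_file = glob.glob("encoded_example_deltalake/**/*.parquet", recursive=True)[0]

# Each column chunk records the encodings the writer actually used.
row_group = pq.ParquetFile(parquet_file).metadata.row_group(0)
for i in range(row_group.num_columns):
    column = row_group.column(i)
    print(column.path_in_schema, column.encodings)
```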

---------

Signed-off-by: Nilesh Zagade <Nilesh.Zagade@ukpowernetworks.co.uk>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
# Description
Just generally check for non-200 response codes and return them
instead of a "token missing" error, which is not helpful.

# Related Issue(s)
Fixes delta-io#3739

---------

Signed-off-by: Stephen Carman <shcarman@gmail.com>
…se it via SQL through DataSink

Signed-off-by: Florian Valeye <florian.valeye@gmail.com>
This comes from an error report in Slack. Since url::ParseError gives us
a few different variants, we can use those to determine whether something is
"file pathy" rather than just an outright incorrect URL.

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
This change had far-reaching consequences due to the excessive use of
history() in our own tests.

Fixes delta-io#3753

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
# Description
The description of the main changes of your pull request

# Related Issue(s)
- apache/datafusion#16799

# Documentation
As part of preparing for the DataFusion 50 release I am making a PR to
upgrade delta.rs

Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
… of execution

After some debate, we decided that SessionConfig is sufficient to pull
through rather than SessionContext. Due to the wacky nature of some of
our code paths in this operation, the signatures of a number of internal
functions had to change to allow SessionConfig to be pulled all the way
down into the ZOrderExecContext.

Fixes delta-io#3751

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
…o#3769)

# Description
Add `StorageConfig` when building `UnityCatalog` to allow catalog
credentials.

# Related Issue(s)
- closes delta-io#3757

Signed-off-by: Florian Valeye <florian.valeye@gmail.com>
# Description
- closes delta-io#3765

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
With the recent MSRV change for the newest datafusion release, finally
time to upgrade some of these.

In the course of doing this, I decided to get rid of the unnecessary
`maplit` crate dependency here. Making HashMaps isn't that hard 😄

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Committed a fix upstream
[here](bnjbvr/cargo-machete#199)

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: juejinyuxitu <juejinyuxitu@outlook.com>
There were a number of symbols only used during `datafusion` usage

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
This dependency comes from the chrono org, so I kinda trust it, but it's
not something we should have in the hierarchy unless it's being used

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
The StructType public API change was pretty invasive, oof.

Fixes delta-io#3774

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
The kernel level is doing this validation now

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
get the delta table row count
Get the number of rows from the underlying list of parquet files

Signed-off-by: ohadmata <ohad@pecan.ai>
Get the number of records from the add actions
Since the count is approximate, I made sure to update the docstrings for
the function to helpfully inform users.

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
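
For illustration, a minimal sketch of approximating the row count from the add
actions via the Python API (the table path is hypothetical, and this is not
necessarily the exact helper added here):

```python
from deltalake import DeltaTable

dt = DeltaTable("example_deltalake")

# One row per active data file; `num_records` comes from the file statistics,
# so the resulting count is approximate in the sense described above.
add_actions = dt.get_add_actions(flatten=True).to_pandas()
approx_row_count = int(add_actions["num_records"].sum())
print(f"Approximate row count: {approx_row_count}")
```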
The check that caught this wasn't apparently "required" when I had
refactored our repo rules a month or so ago.

I had also made this change locally, there might even be video evidence
of it 😄 but I must not have correctly pushed it

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
…utation

The problem here is that when partition columns are present the
RecordBatchWriter can/will mutate its internal schema in a way that
becomes incompatible with the Delta table it was originally created
with.

I am quite fascinated by this bug because it only manifested when going
from a development environment to production.

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
…ma evolution

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
# Description
Inside the function `table.cleanup_metadata`, look for the last
checkpoint before the requested minimum cutoff time and version, and only
delete files before that checkpoint. This fixes the bug below and
makes sure that versions inside the retention period are still loadable.
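
A minimal sketch of that cutoff rule (pure logic on hypothetical log entries,
not the actual delta-rs implementation):

```python
def versions_safe_to_delete(entries, cutoff_version):
    """entries: list of (version, is_checkpoint) tuples sorted by version."""
    # Find the last checkpoint at or before the requested cutoff.
    last_checkpoint = None
    for version, is_checkpoint in entries:
        if is_checkpoint and version <= cutoff_version:
            last_checkpoint = version
    if last_checkpoint is None:
        # No checkpoint before the cutoff: keep everything so the table stays loadable.
        return []
    # Only log entries strictly before that checkpoint are safe to remove.
    return [version for version, _ in entries if version < last_checkpoint]


entries = [(0, False), (1, False), (2, True), (3, False), (4, True), (5, False)]
print(versions_safe_to_delete(entries, cutoff_version=3))  # [0, 1]
```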

# Related Issue(s)
Fixes delta-io#3692

---------

Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Updates the requirements on [sqlparser](https://github.com/apache/datafusion-sqlparser-rs) to permit the latest version.
- [Changelog](https://github.com/apache/datafusion-sqlparser-rs/blob/main/CHANGELOG.md)
- [Commits](apache/datafusion-sqlparser-rs@v0.56.0...v0.59.0)

---
updated-dependencies:
- dependency-name: sqlparser
  dependency-version: 0.59.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
…lta-io#3793)

# Description
Add a step to clean the cache entries of GitHub actions:
- Integrate cleanup-cache job into dev_pr.yml workflow  
- Delete up to 100 cache entries per closed PR targeting main
- Prevent cache thrashing with proper error handling and logging

# Documentation
-
https://docs.github.com/en/actions/how-tos/manage-workflow-runs/manage-caches#force-deleting-cache-entries

Signed-off-by: Florian Valeye <florian.valeye@gmail.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Updates the requirements on [foyer](https://github.com/foyer-rs/foyer) to permit the latest version.
- [Release notes](https://github.com/foyer-rs/foyer/releases)
- [Changelog](https://github.com/foyer-rs/foyer/blob/main/CHANGELOG.md)
- [Commits](foyer-rs/foyer@v0.17.2...v0.20.0)

---
updated-dependencies:
- dependency-name: foyer
  dependency-version: 0.20.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
adampolomski and others added 27 commits February 16, 2026 12:59
…ing execution

Remove cache calls on large DataFrames while keeping small result caching:
- Keep conflicts_df cache (small: join keys + file paths only)
- Remove implicit materializations from target_df, filtered_target_df, non_conflicting_target, and result_df
- All large DataFrames now use lazy streaming execution
- Add schema normalization (cast Dictionary to Utf8) for file path column to fix compatibility
- Add helper method find_conflicts_keys_only for clean anti-join logic

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
The target_df parameter was not used in the function body - it only selects
keys from self.source. Removed the parameter and updated the call site.

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
… detection

Instead of caching the conflicts DataFrame to work around DataFusion's Dictionary
encoding schema mismatch, implement manual join logic:
- Collect target DataFrame with join keys + file paths (small result)
- Collect distinct source join keys (small result)
- Perform join in memory using HashSet for efficient lookup
- Extract file paths that have matching keys

This avoids materializing large DataFrames while still handling the schema
inconsistency by working entirely in memory on small, already-collected data.

Memory impact: Only materializes join keys + file paths (one row per conflicting
file), not full row data. Much more efficient than caching full DataFrames.

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
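
A minimal sketch of that in-memory matching step (hypothetical shapes for the
already-collected data, not the delta-rs internals):

```python
def conflicting_file_paths(target_rows, source_keys):
    """target_rows: (join_key, file_path) pairs collected from the target;
    source_keys: distinct join keys collected from the source."""
    key_set = set(source_keys)  # HashSet-style lookup for efficient matching
    # Any target file that contains at least one matching key is a conflict.
    return sorted({path for key, path in target_rows if key in key_set})


target_rows = [(1, "part-0.parquet"), (2, "part-0.parquet"), (3, "part-1.parquet")]
print(conflicting_file_paths(target_rows, source_keys=[2, 7]))  # ['part-0.parquet']
```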
The previous approach incorrectly materialized the entire target DataFrame which
could be billions of rows. The corrected approach:

1. Keeps target_df and source lazy (not materialized)
2. Performs inner join in DataFusion (lazy operation)
3. Selects only minimal columns (join keys + file path, not full rows)
4. Collects ONLY the join result which is small (only conflicting rows)

Memory footprint: For a table with billions of rows but only thousands of
conflicts, we materialize only thousands of rows with minimal columns, not
billions of full rows.

The join result is inherently small because it contains only rows where join
keys match between source and target (actual conflicts).

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
…rics

- Changed extract_conflicting_filenames to extract_conflicts_dataframe to return a DataFrame
- Added extract_file_paths_from_conflicts to extract file paths from the cached DataFrame
- Cache the conflicts DataFrame for reuse in multiple places
- Added num_conflicting_records field to UpsertMetrics
- Count and report conflicting records in metrics
- Updated tests to verify num_conflicting_records metric

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
