feat(rust): add upsert operation by adampolomski · Pull Request #37 · relativityone/delta-rs

adampolomski · 2025-09-09T22:54:47Z

Description

Implements a first, simplified version of the upsert operation.

Related Issue(s)

Documentation

FREEZE! Looks like we were _really_ close to full multi-threaded support before, and just needed to sprinkle a little macro magic on the pyclass definition.

See [pyo3 docs](https://pyo3.rs/v0.23.0/class/thread-safety)

Fixes delta-io#3594

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version.

- [Release notes](https://github.com/la10736/rstest/releases)
- [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md)
- [Commits](la10736/rstest@v0.25.0...v0.26.1)

---
updated-dependencies:
- dependency-name: rstest
  dependency-version: 0.26.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

…ing execution

Remove cache calls on large DataFrames while keeping small result caching:

- Keep conflicts_df cache (small: join keys + file paths only)
- Remove implicit materializations from target_df, filtered_target_df, non_conflicting_target, and result_df
- All large DataFrames now use lazy streaming execution
- Add schema normalization (cast Dictionary to Utf8) for file path column to fix compatibility
- Add helper method find_conflicts_keys_only for clean anti-join logic

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

The target_df parameter was not used in the function body - it only selects keys from self.source. Removed the parameter and updated the call site. Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

This reverts commit 9a5a201.

… detection

Instead of caching the conflicts DataFrame to work around DataFusion's Dictionary encoding schema mismatch, implement manual join logic:

- Collect target DataFrame with join keys + file paths (small result)
- Collect distinct source join keys (small result)
- Perform join in memory using HashSet for efficient lookup
- Extract file paths that have matching keys

This avoids materializing large DataFrames while still handling the schema inconsistency by working entirely in memory on small, already-collected data.

Memory impact: Only materializes join keys + file paths (one row per conflicting file), not full row data. Much more efficient than caching full DataFrames.

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
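The in-memory join described in this commit message could look roughly like the sketch below. It is a minimal, standard-library-only illustration, not the code from the PR: the `TargetKeyRow` type and the `conflicting_file_paths` function are hypothetical names, and the join key is assumed to be a single string column, whereas the real operation works on DataFusion record batches and may join on several columns.

```rust
use std::collections::{BTreeSet, HashSet};

/// One row of the small, already-collected target projection:
/// a join key plus the path of the data file that holds the row.
/// Hypothetical type for illustration only; not taken from the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TargetKeyRow {
    pub join_key: String,
    pub file_path: String,
}

/// Returns the distinct file paths whose rows share a join key with the source.
/// Assumes both inputs are small enough to hold in memory.
pub fn conflicting_file_paths(
    target: &[TargetKeyRow],
    source_keys: &[String],
) -> BTreeSet<String> {
    // Build the lookup set once so each target row costs one hash probe.
    let source: HashSet<&str> = source_keys.iter().map(String::as_str).collect();

    target
        .iter()
        // Keep only target rows whose key also occurs in the source.
        .filter(|row| source.contains(row.join_key.as_str()))
        // Project to the file path; the BTreeSet removes duplicates
        // and yields the paths in sorted order.
        .map(|row| row.file_path.clone())
        .collect()
}
```

Calling `conflicting_file_paths(&target_rows, &source_keys)` yields each affected file once, however many of its rows match. A `BTreeSet` is used for the result only to make the order deterministic; a `HashSet` would serve equally well if order does not matter.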

The previous approach incorrectly materialized the entire target DataFrame which could be billions of rows. The corrected approach:

1. Keeps target_df and source lazy (not materialized)
2. Performs inner join in DataFusion (lazy operation)
3. Selects only minimal columns (join keys + file path, not full rows)
4. Collects ONLY the join result which is small (only conflicting rows)

Memory footprint: For a table with billions of rows but only thousands of conflicts, we materialize only thousands of rows with minimal columns, not billions of full rows. The join result is inherently small because it contains only rows where join keys match between source and target (actual conflicts).

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

…rics

- Changed extract_conflicting_filenames to extract_conflicts_dataframe to return a DataFrame
- Added extract_file_paths_from_conflicts to extract file paths from the cached DataFrame
- Cache the conflicts DataFrame for reuse in multiple places
- Added num_conflicting_records field to UpsertMetrics
- Count and report conflicting records in metrics
- Updated tests to verify num_conflicting_records metric

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
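The idea of computing the conflicts once and reusing them for both the file list and the metric can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `ConflictRow`, `UpsertMetricsSketch`, `extract_file_paths`, and `summarize_conflicts` are hypothetical names, a plain slice stands in for the cached DataFrame, and only the `num_conflicting_records` field is taken from the commit message.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for one row of the conflicts result
/// (join key plus the file that contains the conflicting record).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConflictRow {
    pub join_key: String,
    pub file_path: String,
}

/// Reduced, hypothetical metrics struct. Only `num_conflicting_records`
/// is named in the commit message; the real struct may carry more fields.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct UpsertMetricsSketch {
    pub num_conflicting_records: usize,
}

/// Derives the distinct file paths from an already-computed conflicts result.
pub fn extract_file_paths(conflicts: &[ConflictRow]) -> BTreeSet<String> {
    conflicts.iter().map(|row| row.file_path.clone()).collect()
}

/// Uses the same conflicts result twice: once for the file list and once
/// for the record count, so the conflicts are not recomputed.
pub fn summarize_conflicts(
    conflicts: &[ConflictRow],
) -> (BTreeSet<String>, UpsertMetricsSketch) {
    let files = extract_file_paths(conflicts);
    let metrics = UpsertMetricsSketch {
        num_conflicting_records: conflicts.len(),
    };
    (files, metrics)
}
```

The record count and the file count can differ: several conflicting records may live in the same file, which is why the sketch reports them separately.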

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

rtyler and others added 30 commits July 24, 2025 07:12

docs: fix daft links which were broken by Daft, oops

f6ddd32

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

chore: bump Python for a fix release

765e129

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

fix: ensure openssl-sys doesn't creep into the dependency via the ker…

2846081

…nel default engine Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

chore: Update to DataFusion 49.0.0

a26b311

Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

chore: Update to Rust version 1.85

f7df763

Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

chore: finish datafusion 49 upgrade

3215340

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

fix: coerce polars.Array into a suitable Arrow list type

62bf411

Ideally polars wouldn't be giving us `item` as the list field name, but it's more important to Just Work ™️ than be pedantic about these things Fixes delta-io#3566 Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

fix: make the docs link checking more useful/less faily

da75777

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

fix: avoid parsing generationExpressions as JSON

b5c470b

These are not supposed to be JSON strings as per protocol Fixes delta-io#3326 Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

feat: build musl wheels upon release

07b34c7

Fixes delta-io#3399 Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

chore: bump version for release

bc0fba8

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

Add keep_versions parameter to vacuum command for python

a51d16e

Signed-off-by: Corwin Joy <corwin.joy@gmail.com>

chore: remove deprecated use of kernel's Table

8b4810a

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

chore: use pytest-xdist for speeding up python tests

5ac08eb

The additional benefit of running these tests in parallel is that more racey/timing related test failures are cropping up for me. Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

fix: aws special paths encoding

4c44133

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

fix: handle checking partition filters in array/list when converting …

9f631b5

…to pyarrow dataset Signed-off-by: Sam Meyer-Reed <smeyerreed@gmail.com>

Fix formatting

9ec167e

Signed-off-by: Sam Meyer-Reed <smeyerreed@gmail.com>

ci: run integration tests against next branches

267034f

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

feat: support converting parquet with non-microsecond timestamps to d…

9aa87f4

…eltalake Signed-off-by: Sam Meyer-Reed <smeyerreed@gmail.com>

refactor schema field conversion functions

6a5ffff

Signed-off-by: Sam Meyer-Reed <smeyerreed@gmail.com>

chore: fmt

3354e73

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>

fix: use RFC3896 percent encoding

fa2478b

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>

chore: bump polars version

cca1410

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>

test: increase ASCII table range

30e155f

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>

chore: remove print

ca2dfff

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>

chore: fmt

2117620

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>

chore: cleanup

774ffb9

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>

chore: set protocol url docs

eebf4cd

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>

adampolomski and others added 24 commits February 16, 2026 12:59

feat: handle schema reorderings

0d7a3c4

feat: handle schema reorderings

802979c

feat: handle cross-partition upserts

2c596b4

feat: multi-partition test case reworked

b232f56

Initial plan

b502aa2

feat: removed useless caching

d184051

Revert "feat: removed useless caching"

b1d8f1e

This reverts commit 9a5a201.

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.git…

ca26695

…hub.com>

feat: removed unnecessary columns

1f97484

feat: review comments

37b1058

feat: fetch only distinct files

6f5fd7e

feat: fetch only distinct files

d49058c

feat: optimised join

3f5c407

Initial plan

c076d37

Remove unnecessary clone when counting conflicts DataFrame

216d16a

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

feat: moved stuff around

0bfd6ce

feat: optimised join

51edbc1

feat: optimised join

2b4ea1e

chore: rebase to main (v0.51)

2976758

mandrush force-pushed the upsert branch from 7233ee0 to 2976758 on February 16, 2026 12:11

github-actions bot added the binding/python, delta-inspect, documentation, proofs, and ci labels on Feb 16, 2026

adampolomski wants to merge 368 commits into v0.26.2-main from upsert

20 participants
