feat(rust): add upsert operation#37
Open
adampolomski wants to merge 368 commits intov0.26.2-mainfrom
Open
Conversation
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
FREEZE! Looks like we were _really_ close to full multi-threaded support before, and just needed to sprinkle a little macro magic on the pyclass definition. See [pyo3 docs](https://pyo3.rs/v0.23.0/class/thread-safety) Fixes delta-io#3594 Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
…nel default engine Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version. - [Release notes](https://github.com/la10736/rstest/releases) - [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md) - [Commits](la10736/rstest@v0.25.0...v0.26.1) --- updated-dependencies: - dependency-name: rstest dependency-version: 0.26.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
Ideally polars wouldn't be giving us `item` as the list field name, but it's more important to Just Work ™️ than be pedantic about these things Fixes delta-io#3566 Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
These are not supposed to be JSON strings as per protocol Fixes delta-io#3326 Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Fixes delta-io#3399 Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: Corwin Joy <corwin.joy@gmail.com>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
The additional benefit of running these tests in parallel is that more racey/timing related test failures are cropping up for me. Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: Robert Pack <robstar.pack@gmail.com>
…to pyarrow dataset Signed-off-by: Sam Meyer-Reed <smeyerreed@gmail.com>
Signed-off-by: Sam Meyer-Reed <smeyerreed@gmail.com>
Signed-off-by: Robert Pack <robstar.pack@gmail.com>
…eltalake Signed-off-by: Sam Meyer-Reed <smeyerreed@gmail.com>
Signed-off-by: Sam Meyer-Reed <smeyerreed@gmail.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com> Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com> Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
…ing execution Remove cache calls on large DataFrames while keeping small result caching: - Keep conflicts_df cache (small: join keys + file paths only) - Remove implicit materializations from target_df, filtered_target_df, non_conflicting_target, and result_df - All large DataFrames now use lazy streaming execution - Add schema normalization (cast Dictionary to Utf8) for file path column to fix compatibility - Add helper method find_conflicts_keys_only for clean anti-join logic Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
The target_df parameter was not used in the function body - it only selects keys from self.source. Removed the parameter and updated the call site. Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
This reverts commit 9a5a201.
… detection Instead of caching the conflicts DataFrame to work around DataFusion's Dictionary encoding schema mismatch, implement manual join logic: - Collect target DataFrame with join keys + file paths (small result) - Collect distinct source join keys (small result) - Perform join in memory using HashSet for efficient lookup - Extract file paths that have matching keys This avoids materializing large DataFrames while still handling the schema inconsistency by working entirely in memory on small, already-collected data. Memory impact: Only materializes join keys + file paths (one row per conflicting file), not full row data. Much more efficient than caching full DataFrames. Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
The previous approach incorrectly materialized the entire target DataFrame which could be billions of rows. The corrected approach: 1. Keeps target_df and source lazy (not materialized) 2. Performs inner join in DataFusion (lazy operation) 3. Selects only minimal columns (join keys + file path, not full rows) 4. Collects ONLY the join result which is small (only conflicting rows) Memory footprint: For a table with billions of rows but only thousands of conflicts, we materialize only thousands of rows with minimal columns, not billions of full rows. The join result is inherently small because it contains only rows where join keys match between source and target (actual conflicts). Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
…rics - Changed extract_conflicting_filenames to extract_conflicts_dataframe to return a DataFrame - Added extract_file_paths_from_conflicts to extract file paths from the cached DataFrame - Cache the conflicts DataFrame for reuse in multiple places - Added num_conflicting_records field to UpsertMetrics - Count and report conflicting records in metrics - Updated tests to verify num_conflicting_records metric Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Related Issue(s)
Documentation