
PoC: Iceberg incremental read support #495

Open
cbb330 wants to merge 5 commits into linkedin:main from cbb330:chbush/incremental-read-e2e-tests

Conversation

@cbb330 (Collaborator) commented Mar 11, 2026

Summary

PoC validating that Iceberg's incremental read APIs work correctly on the OpenHouse stack. Users have asked about incrementally consuming data between snapshots for OH tables — this PR confirms the feature is supported and documents version-specific behavior and UX caveats.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

New Features / Tests:

IncrementalReadTest.java (Spark 3.1 & 3.5) — tests IncrementalAppendScan via the DataFrame API:

spark.read.format("iceberg")
  .option("start-snapshot-id", startId)
  .option("end-snapshot-id", endId)
  .load("catalog.db.table")

  • Basic incremental read between two snapshots
  • Single-snapshot-range precision
  • Multi-snapshot spanning reads
  • Overwrite in range: Iceberg 1.2 throws UnsupportedOperationException; Iceberg 1.5 silently skips non-append snapshots
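
The range semantics in the list above can be sketched with a toy model. This is not Iceberg code: `IncrementalRangeSketch`, `Snap`, and `incrementalRows` are hypothetical names invented for illustration, and the model assumes a linear snapshot history. It relies on Iceberg's documented option semantics, where `start-snapshot-id` is exclusive and `end-snapshot-id` is inclusive, and it mirrors the two overwrite behaviors this PR reports (Iceberg 1.2 throws, Iceberg 1.5 skips non-append snapshots).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of an incremental append scan over a linear snapshot history (not Iceberg code). */
final class IncrementalRangeSketch {

  /** Hypothetical stand-in for a snapshot: its id, its operation, and the rows it added. */
  static final class Snap {
    final long id;
    final String operation;
    final List<String> addedRows;

    Snap(long id, String operation, String... addedRows) {
      this.id = id;
      this.operation = operation;
      this.addedRows = Arrays.asList(addedRows);
    }
  }

  /**
   * Returns rows appended after startId (exclusive) up to and including endId.
   * skipNonAppends = false models the Iceberg 1.2 behavior reported in this PR (throw);
   * skipNonAppends = true models the Iceberg 1.5 behavior (skip the snapshot silently).
   */
  static List<String> incrementalRows(
      List<Snap> history, long startId, long endId, boolean skipNonAppends) {
    List<String> rows = new ArrayList<>();
    boolean afterStart = false;
    for (Snap s : history) {
      if (afterStart) {
        if ("append".equals(s.operation)) {
          rows.addAll(s.addedRows);
        } else if (!skipNonAppends) {
          throw new UnsupportedOperationException(
              "Found " + s.operation + " snapshot " + s.id + " in range");
        }
      }
      // The start snapshot itself is excluded, so the flag flips only after it is visited.
      if (s.id == startId) {
        afterStart = true;
      }
      if (s.id == endId) {
        break;
      }
    }
    return rows;
  }
}
```

With a history of append(1, "a"), append(2, "b"), overwrite(3, "c"), append(4, "d"), the range 1 to 2 yields only "b". The range 1 to 4 yields "b" and "d" when non-appends are skipped, and throws when they are not.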

ChangelogViewTest.java (Spark 3.5 only) — tests create_changelog_view stored procedure (Iceberg 1.4+):

CALL catalog.system.create_changelog_view(
  table => 'db.tbl',
  options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'))

  • Appends, overwrites, deletes, net changes, multi-snapshot span
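
Since the procedure is invoked through a SQL string, a test or client may want to assemble the CALL statement from snapshot ids. The following is a minimal sketch of one way to do that. `ChangelogViewCallSketch`, `buildCall`, and `quote` are hypothetical helpers, not part of Iceberg, Spark, or this PR. Rather than guess at escaping rules, the sketch rejects values containing quotes or backslashes.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that renders a create_changelog_view CALL statement as a string. */
final class ChangelogViewCallSketch {

  /** Builds the CALL text; options are rendered in the map's iteration order. */
  static String buildCall(String catalog, String table, Map<String, String> options) {
    StringJoiner opts = new StringJoiner(", ", "map(", ")");
    for (Map.Entry<String, String> e : options.entrySet()) {
      opts.add(quote(e.getKey()));
      opts.add(quote(e.getValue()));
    }
    return "CALL " + catalog + ".system.create_changelog_view("
        + "table => " + quote(table) + ", options => " + opts + ")";
  }

  /** Wraps a value in single quotes, refusing input that would need escaping. */
  static String quote(String value) {
    if (value.contains("'") || value.contains("\\")) {
      throw new IllegalArgumentException("Unsupported character in: " + value);
    }
    return "'" + value + "'";
  }
}
```

Passing a `LinkedHashMap` with `start-snapshot-id` set to `1` and `end-snapshot-id` set to `2` reproduces the statement shown above on one line, which can then be handed to `spark.sql(...)`.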

Key UX caveats for users:

  1. DataFrame API only — no Spark SQL syntax for incremental reads
  2. IncrementalAppendScan is append-only; use create_changelog_view for overwrites/deletes
  3. create_changelog_view is only available on Spark 3.5 (Iceberg 1.5+)
  4. Does not work on views — only base Iceberg tables

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

All tests pass:

  • IncrementalReadTest: 4/4 on Spark 3.1, 4/4 on Spark 3.5
  • ChangelogViewTest: 5/5 on Spark 3.5
  • Spotless formatting applied

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

cbb330 added 2 commits March 11, 2026 13:35
Tests cover the DataFrame API with start-snapshot-id / end-snapshot-id
options for IncrementalAppendScan across both Spark versions:
- Incremental read between two snapshots
- Single snapshot range precision
- Multi-snapshot spanning reads
- Overwrite in range (Iceberg 1.2 rejects; 1.5 skips non-appends)

Tests the create_changelog_view stored procedure for CDC/incremental
change tracking, available in Iceberg 1.5 (Spark 3.5 only):
- Appends: verifies INSERT change types between snapshots
- Overwrite: captures both DELETE (old rows) and INSERT (new rows)
- Delete: verifies DELETE change type for removed rows (v2 format)
- Net changes: collapses intermediate changes with identifier columns
- Multi-snapshot span: changelog across multiple append snapshots
@cbb330 force-pushed the chbush/incremental-read-e2e-tests branch from cdfb5f7 to 59493dc March 11, 2026 21:22
@cbb330 changed the title from "Add e2e tests for Iceberg incremental read" to "Proof of concept for Iceberg incremental read" Mar 11, 2026
@cbb330 changed the title from "Proof of concept for Iceberg incremental read" to "PoC: E2e tests for Iceberg incremental read support" Mar 11, 2026
cbb330 added 2 commits March 11, 2026 15:30
Spark 3.1: use assertThrows for Iceberg 1.2 UnsupportedOperationException
Spark 3.5: add IncrementalReadTest verifying Iceberg 1.5 skips non-appends

Validate exact row count, all change types/values for each id,
and that unchanged id=2 does not appear. Fix misleading comment
about net UPDATE when compute_updates is disabled.
@cbb330 changed the title from "PoC: E2e tests for Iceberg incremental read support" to "PoC: Iceberg incremental read support" Mar 11, 2026