PoC: Iceberg incremental read support by cbb330 · Pull Request #495 · linkedin/openhouse

cbb330 · 2026-03-11T06:49:39Z

Summary

PoC validating that Iceberg's incremental read APIs work correctly on the OpenHouse stack. Users have asked about incrementally consuming data between snapshots of OpenHouse (OH) tables; this PR confirms the feature is supported and documents version-specific behavior and UX caveats.

Changes

New Features / Tests:

IncrementalReadTest.java (Spark 3.1 & 3.5) — tests IncrementalAppendScan via the DataFrame API:

spark.read.format("iceberg")
  .option("start-snapshot-id", startId)
  .option("end-snapshot-id", endId)
  .load("catalog.db.table")

Basic incremental read between two snapshots
Single-snapshot-range precision
Multi-snapshot spanning reads
Overwrite in range: Iceberg 1.2 throws UnsupportedOperationException; Iceberg 1.5 silently skips non-append snapshots
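
The range semantics and the version difference in the overwrite case can be sketched with a small, self-contained model. This is not Iceberg or OpenHouse code: the class IncrementalAppendModel, the Snapshot type, and the strictAppendOnly flag are hypothetical names introduced for illustration only. It assumes a linear snapshot history and follows Iceberg's documented convention that start-snapshot-id is exclusive and end-snapshot-id is inclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the behavior described in this PR; not Iceberg code.
final class IncrementalAppendModel {

    enum Op { APPEND, OVERWRITE, DELETE }

    // One committed snapshot: its id, its operation, and the rows it added.
    static final class Snapshot {
        final long id;
        final Op op;
        final List<String> addedRows;

        Snapshot(long id, Op op, List<String> addedRows) {
            this.id = id;
            this.op = op;
            this.addedRows = addedRows;
        }
    }

    /**
     * Collects rows appended after startId (exclusive) up to endId (inclusive).
     * strictAppendOnly = true mimics the Iceberg 1.2 behavior noted in this PR
     * (reject a non-append snapshot in range); false mimics Iceberg 1.5 (skip it).
     * Assumes history is ordered oldest to newest and contains both ids.
     */
    static List<String> incrementalAppend(
            List<Snapshot> history, long startId, long endId, boolean strictAppendOnly) {
        List<String> rows = new ArrayList<>();
        boolean inRange = false;
        for (Snapshot s : history) {
            if (inRange) {
                if (s.op == Op.APPEND) {
                    rows.addAll(s.addedRows);
                } else if (strictAppendOnly) {
                    throw new UnsupportedOperationException(
                            "Non-append snapshot in range: " + s.id);
                }
                // Otherwise the non-append snapshot is skipped silently.
            }
            if (s.id == startId) {
                inRange = true; // the start snapshot itself is excluded
            }
            if (s.id == endId) {
                break; // the end snapshot was already processed above
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Snapshot> history = Arrays.asList(
                new Snapshot(1, Op.APPEND, Arrays.asList("a")),
                new Snapshot(2, Op.APPEND, Arrays.asList("b")),
                new Snapshot(3, Op.OVERWRITE, Arrays.asList("x")),
                new Snapshot(4, Op.APPEND, Arrays.asList("c")));

        System.out.println(incrementalAppend(history, 1, 2, true));  // [b]
        System.out.println(incrementalAppend(history, 1, 4, false)); // [b, c]
    }
}
```

In this model, calling incrementalAppend(history, 1, 4, true) throws UnsupportedOperationException because snapshot 3 is an overwrite, which is the case the Spark 3.1 test asserts on.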

ChangelogViewTest.java (Spark 3.5 only) — tests create_changelog_view stored procedure (Iceberg 1.4+):

CALL catalog.system.create_changelog_view(
  table => 'db.tbl',
  options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'))

Appends, overwrites, deletes, net changes, multi-snapshot span
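
The change types these tests check for can be illustrated with a toy model. This is a rough sketch under stated assumptions, not Iceberg's implementation: ChangelogModel and its Snapshot type are hypothetical, rows are plain strings, and the "@n" suffix only stands in for the ordering that the real view exposes through its _change_ordinal column. Net-change collapsing and update computation are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified, hypothetical model of changelog rows; not Iceberg's implementation.
final class ChangelogModel {

    // One snapshot in the requested range: the rows it removed and the rows it added.
    static final class Snapshot {
        final long id;
        final List<String> removed;
        final List<String> added;

        Snapshot(long id, List<String> removed, List<String> added) {
            this.id = id;
            this.removed = removed;
            this.added = added;
        }
    }

    // Emits one "<changeType> <row> @<ordinal>" entry per removed or added row.
    // The ordinal is the position of the snapshot within the range.
    static List<String> changelog(List<Snapshot> range) {
        List<String> out = new ArrayList<>();
        int ordinal = 0;
        for (Snapshot s : range) {
            for (String row : s.removed) {
                out.add("DELETE " + row + " @" + ordinal);
            }
            for (String row : s.added) {
                out.add("INSERT " + row + " @" + ordinal);
            }
            ordinal++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        List<Snapshot> range = Arrays.asList(
                // append: only new rows
                new Snapshot(10, none, Arrays.asList("id=1,v=a")),
                // overwrite: old rows removed, new rows added
                new Snapshot(11, Arrays.asList("id=1,v=a"), Arrays.asList("id=1,v=b")),
                // delete: only removed rows
                new Snapshot(12, Arrays.asList("id=1,v=b"), none));

        for (String change : changelog(range)) {
            System.out.println(change);
        }
        // INSERT id=1,v=a @0
        // DELETE id=1,v=a @1
        // INSERT id=1,v=b @1
        // DELETE id=1,v=b @2
    }
}
```

An append yields only INSERT rows, an overwrite yields a DELETE for each old row plus an INSERT for each new row, and a delete yields only DELETE rows, which matches the cases listed above.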

Key UX caveats for users:

DataFrame API only — no Spark SQL syntax for incremental reads
IncrementalAppendScan is append-only; use create_changelog_view for overwrites/deletes
create_changelog_view is only available on the Spark 3.5 build, which runs Iceberg 1.5; the Spark 3.1 build runs Iceberg 1.2
Incremental reads do not work on views, only on base Iceberg tables

Testing Done

Manually tested on local docker setup. Please include commands run, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

All tests pass:

IncrementalReadTest: 4/4 on Spark 3.1, 4/4 on Spark 3.5
ChangelogViewTest: 5/5 on Spark 3.5
Spotless formatting applied

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

Tests cover the DataFrame API with start-snapshot-id / end-snapshot-id options for IncrementalAppendScan across both Spark versions:
- Incremental read between two snapshots
- Single snapshot range precision
- Multi-snapshot spanning reads
- Overwrite in range (Iceberg 1.2 rejects; 1.5 skips non-appends)

Tests the create_changelog_view stored procedure for CDC/incremental change tracking, available in Iceberg 1.5 (Spark 3.5 only):
- Appends: verifies INSERT change types between snapshots
- Overwrite: captures both DELETE (old rows) and INSERT (new rows)
- Delete: verifies DELETE change type for removed rows (v2 format)
- Net changes: collapses intermediate changes with identifier columns
- Multi-snapshot span: changelog across multiple append snapshots

cbb330 added 2 commits March 11, 2026 13:35

cbb330 force-pushed the chbush/incremental-read-e2e-tests branch from cdfb5f7 to 59493dc Compare March 11, 2026 21:22

cbb330 changed the title ~~Add e2e tests for Iceberg incremental read~~ Proof of concept for Iceberg incremental read Mar 11, 2026

cbb330 changed the title ~~Proof of concept for Iceberg incremental read~~ PoC: E2e tests for Iceberg incremental read support Mar 11, 2026

cbb330 added 2 commits March 11, 2026 15:30

Split overwrite test into version-specific assertions (Spark 3.1 & 3.5)

5c13cd4

Spark 3.1: use assertThrows for Iceberg 1.2 UnsupportedOperationException
Spark 3.5: add IncrementalReadTest verifying Iceberg 1.5 skips non-appends

Strengthen assertions in testChangelogViewWithNetChanges

1f2d51d

Validate exact row count, all change types/values for each id, and that unchanged id=2 does not appear. Fix misleading comment about net UPDATE when compute_updates is disabled.

cbb330 changed the title ~~PoC: E2e tests for Iceberg incremental read support~~ PoC: Iceberg incremental read support Mar 11, 2026

Exclude Spark 3.1 overwrite test from Spark 3.5 (Iceberg version difference)

22baa29

cbb330 wants to merge 5 commits into linkedin:main from cbb330:chbush/incremental-read-e2e-tests
