Skip to content

fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result writes performance#4293

Open
kunwp1 wants to merge 3 commits intoapache:mainfrom
kunwp1:chris-fix-4289
Open

fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result writes performance#4293
kunwp1 wants to merge 3 commits intoapache:mainfrom
kunwp1:chris-fix-4289

Conversation

@kunwp1
Copy link
Contributor

@kunwp1 kunwp1 commented Mar 13, 2026

What changes were proposed in this PR?

This PR addresses #4289 by optimizing IcebergDocument.hasNext() to minimize redundant catalog round-trips. By introducing a guard condition, we ensure seekToUsableFile() and its subsequent catalog calls are only triggered when the current record iterator is fully exhausted.

  1. If the current file has more records, return true immediately.
  2. Only if the current file is exhausted, check usableFileIterator.
  3. Only if usableFileIterator is also empty, call seekToUsableFile().

Any related issues, documentation, discussions?

Fix #4289

How was this PR tested?

  1. Import and use Untitled workflow (9).json.
  2. Use a CSV file containing 1M records.
  3. Set storage.iceberg.table.commit.batch-size to 1M (matching the total record count).
  4. Compare the performance before fix and after fix. For me it was 2m 45s vs 36s.

Was this PR authored or co-authored using generative AI tooling?

No.

@kunwp1 kunwp1 requested a review from bobbai00 March 13, 2026 21:49
@kunwp1 kunwp1 self-assigned this Mar 13, 2026
@chenlica
Copy link
Contributor

@Xiao-zhen-Liu @bobbai00 Please review it.

@bobbai00 bobbai00 changed the title fix: Optimized IcebergDocument.hasNext() to reduce catalog round-trips fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result writes performance Mar 14, 2026
Copy link
Contributor

@bobbai00 bobbai00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scala side LGTM! Can you also check the python side's iceberg document's corresponding logic, change it and teset it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Excessive Catalog Round-trips in IcebergDocument.hasNext()

3 participants