Summary
The current implementation of IcebergDocument.hasNext() triggers a remote catalog lookup (PostgreSQL) and a storage listing (S3) for every single tuple while processing the final Parquet file in a snapshot. This leads to significant latency and unnecessary load on the metadata database.
Background
The IcebergDocument manages a usableFileIterator to support live-streaming (concurrent read/write). The current logic follows these steps:
- When the current file is empty, it pulls the next file from usableFileIterator.
- When usableFileIterator is empty, it calls seekToUsableFile() to check the catalog for newly committed files.
Problem
The check for usableFileIterator.isEmpty() occurs before checking if the current file still has records. As a result, as soon as the reader starts the last known file in the list, usableFileIterator becomes empty, triggering seekToUsableFile() for every subsequent call to hasNext().
Example Scenario
A result set consists of file1 and file2 (4,096 rows each).
- During file1: usableFileIterator contains [file2]. hasNext() returns true.
- During file2: usableFileIterator is now empty. hasNext() is called 4,096 times to read the rows. Because the iterator is empty, 4,096 network requests are made to PostgreSQL and S3 to seek new files, even though the reader is still busy processing the current file.
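The scenario can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: EagerSeekReader, committedFiles, and the seekCalls counter are hypothetical stand-ins, not the real IcebergDocument code, and each increment of seekCalls represents one catalog and storage round trip. Only the order of the checks in hasNext() is meant to mirror the behavior described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical in-memory model of the problematic ordering; not the real IcebergDocument.
class EagerSeekReader implements Iterator<Integer> {
    private final List<List<Integer>> committedFiles; // stands in for the catalog and S3
    private int filesSeen = 0;
    int seekCalls = 0; // each increment models one PostgreSQL + S3 round trip

    private Iterator<List<Integer>> usableFileIterator = Collections.emptyIterator();
    private Iterator<Integer> currentRecords = Collections.emptyIterator();

    EagerSeekReader(List<List<Integer>> committedFiles) {
        this.committedFiles = committedFiles;
    }

    // Builds fileCount files, each holding rowsPerFile dummy rows.
    static List<List<Integer>> files(int fileCount, int rowsPerFile) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < fileCount; f++) {
            List<Integer> rows = new ArrayList<>();
            for (int r = 0; r < rowsPerFile; r++) {
                rows.add(r);
            }
            result.add(rows);
        }
        return result;
    }

    // Models the remote lookup: returns only files not handed out before.
    private Iterator<List<Integer>> seekToUsableFile() {
        seekCalls++;
        List<List<Integer>> fresh =
            new ArrayList<>(committedFiles.subList(filesSeen, committedFiles.size()));
        filesSeen = committedFiles.size();
        return fresh.iterator();
    }

    @Override
    public boolean hasNext() {
        // Problematic ordering: the file list is checked before the current records.
        if (!usableFileIterator.hasNext()) {
            usableFileIterator = seekToUsableFile();
        }
        while (!currentRecords.hasNext() && usableFileIterator.hasNext()) {
            currentRecords = usableFileIterator.next().iterator();
        }
        return currentRecords.hasNext();
    }

    @Override
    public Integer next() {
        return currentRecords.next(); // throws NoSuchElementException when exhausted
    }

    public static void main(String[] args) {
        EagerSeekReader reader = new EagerSeekReader(files(2, 4096));
        int rows = 0;
        while (reader.hasNext()) {
            reader.next();
            rows++;
        }
        // prints "rows=8192 seekCalls=4097"
        System.out.println("rows=" + rows + " seekCalls=" + reader.seekCalls);
    }
}
```

In this model the reader seeks once at initialization and then once per hasNext() call made after file2 becomes the current file, which gives 4,096 + 1 lookups for 8,192 rows.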
Proposed Fix
Add a guard condition to ensure that seekToUsableFile() is only invoked when the current record iterator is actually exhausted.
- If the current file has more records, return true immediately.
- Only if the current file is exhausted, check usableFileIterator.
- Only if usableFileIterator is also empty, call seekToUsableFile().
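The three steps above can be sketched as follows. This is a minimal illustration, not the actual patch: GuardedSeekReader, committedFiles, and seekCalls are hypothetical names, the catalog is modeled as an in-memory list, and each increment of seekCalls stands for one catalog and storage round trip. The loop in hasNext() is an added assumption so that empty files are skipped rather than ending the read early.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical in-memory model of the guarded ordering; not the real IcebergDocument.
class GuardedSeekReader implements Iterator<Integer> {
    private final List<List<Integer>> committedFiles; // stands in for the catalog and S3
    private int filesSeen = 0;
    int seekCalls = 0; // each increment models one PostgreSQL + S3 round trip

    private Iterator<List<Integer>> usableFileIterator = Collections.emptyIterator();
    private Iterator<Integer> currentRecords = Collections.emptyIterator();

    GuardedSeekReader(List<List<Integer>> committedFiles) {
        this.committedFiles = committedFiles;
    }

    // Builds fileCount files, each holding rowsPerFile dummy rows.
    static List<List<Integer>> files(int fileCount, int rowsPerFile) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < fileCount; f++) {
            List<Integer> rows = new ArrayList<>();
            for (int r = 0; r < rowsPerFile; r++) {
                rows.add(r);
            }
            result.add(rows);
        }
        return result;
    }

    // Models the remote lookup: returns only files not handed out before.
    private Iterator<List<Integer>> seekToUsableFile() {
        seekCalls++;
        List<List<Integer>> fresh =
            new ArrayList<>(committedFiles.subList(filesSeen, committedFiles.size()));
        filesSeen = committedFiles.size();
        return fresh.iterator();
    }

    @Override
    public boolean hasNext() {
        // Step 1: the current file still has records, so no lookup is needed.
        if (currentRecords.hasNext()) {
            return true;
        }
        while (true) {
            // Step 2: the current file is exhausted, so consult the known file list.
            if (!usableFileIterator.hasNext()) {
                // Step 3: the list is empty too, so only now ask the catalog.
                usableFileIterator = seekToUsableFile();
                if (!usableFileIterator.hasNext()) {
                    return false; // nothing newly committed
                }
            }
            currentRecords = usableFileIterator.next().iterator();
            if (currentRecords.hasNext()) {
                return true;
            }
            // The file was empty: loop and try the next one.
        }
    }

    @Override
    public Integer next() {
        return currentRecords.next(); // throws NoSuchElementException when exhausted
    }

    public static void main(String[] args) {
        GuardedSeekReader reader = new GuardedSeekReader(files(2, 4096));
        int rows = 0;
        while (reader.hasNext()) {
            reader.next();
            rows++;
        }
        // prints "rows=8192 seekCalls=2"
        System.out.println("rows=" + rows + " seekCalls=" + reader.seekCalls);
    }
}
```

With two files of 4,096 rows, this model performs two lookups: one on the first hasNext() call and one after the last record has been consumed.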
Impact
- Without Fix: >4,096 catalog/S3 calls (Total Rows in last file + 1).
- With Fix: 2 calls (one at initialization, one at the very end when all records are truly exhausted).
- Result: A significant reduction in catalog and S3 requests, from one per row of the last file to a constant number per read.