Fix PollUtils.poll to use wall-clock time for timeout instead of accumulating sleep durations #1000
Merged
khandelwal-ayush merged 2 commits into linkedin:master on Feb 26, 2026
akshayrai reviewed on Feb 24, 2026
akshayrai (Collaborator) left a comment:
Nice. Can you include some unit tests for this?
kanishkjaiswal2015 approved these changes on Feb 24, 2026
akshayrai approved these changes on Feb 26, 2026
akshayrai (Collaborator) left a comment:
Let's document the impact and what improvements we observe
Summary
PollUtils.poll tracks elapsed time by summing the Thread.sleep(periodMs) durations instead of measuring actual wall-clock time. Time spent executing the supplier (network calls, schema registry lookups, database queries) is not counted toward the timeout. As a result, the effective timeout is inflated by a factor of roughly 1 + (supplier execution time / sleep interval).
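To make the failure mode concrete, here is a minimal sketch of an accumulator-based polling loop. It is an assumed reconstruction of the pattern described above, not the actual PollUtils source; the class name, `pollAccumulating`, and `simulateIo` are hypothetical names introduced for illustration.

```java
import java.util.function.BooleanSupplier;

public class AccumulatingPollSketch {

    /** Stand-in for supplier I/O such as a failing schema registry call (hypothetical helper). */
    static void simulateIo(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Assumed shape of the pre-fix loop: elapsedMs grows only by the nominal
     * sleep period, so time spent inside cond.getAsBoolean() is never counted.
     */
    static boolean pollAccumulating(BooleanSupplier cond, long periodMs, long timeoutMs) {
        long elapsedMs = 0;
        while (true) {
            if (cond.getAsBoolean()) {
                return true;
            }
            if (elapsedMs >= timeoutMs) {
                return false;
            }
            try {
                Thread.sleep(periodMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            elapsedMs += periodMs; // supplier time is missing from this sum
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        boolean met = pollAccumulating(() -> { simulateIo(20); return false; }, 5, 50);
        long wallMs = (System.nanoTime() - start) / 1_000_000;
        // Nominal cost: 11 supplier calls x 20 ms + 10 sleeps x 5 ms = 270 ms for a 50 ms timeout.
        System.out.println("met=" + met + ", configured timeout=50 ms, wall-clock=" + wallMs + " ms");
    }
}
```

With a 20 ms supplier and a 5 ms period, the loop always makes 11 supplier calls before giving up, regardless of how long each call takes, which is why the wall-clock duration scales with supplier latency.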
This was discovered during MySQL bootstrap stream creation, where the connector is configured with periodMs=10ms and timeoutMs=600,000ms (10 minutes). Each schema registry call takes approximately 165ms to fail, so each iteration takes ~175ms of wall-clock time while only 10ms is counted toward the timeout. After 41,184 iterations and approximately 2 hours of continuous polling, PollUtils had tracked only ~6.9 minutes of "elapsed" time, still well short of the 10-minute timeout. At that rate the timeout takes approximately 3 hours of wall-clock time to fire (60,000 iterations × 175ms = 175 minutes), a 17.5x inflation.
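The figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this description (10ms period, 600,000ms timeout, ~165ms per failed call, 41,184 observed iterations); the class and method names are hypothetical, and the calculation ignores the one extra supplier call made before the final timeout check.

```java
public class TimeoutInflation {

    /** Wall-clock ms until an accumulator-based timeout fires, given per-call supplier latency. */
    static long wallClockToTimeoutMs(long periodMs, long timeoutMs, long supplierMs) {
        long iterations = timeoutMs / periodMs;       // sleeps needed to "accumulate" the timeout
        return iterations * (periodMs + supplierMs);  // each iteration really costs sleep + supplier
    }

    /** Time the accumulator believes has passed after the given number of iterations. */
    static long trackedMs(long iterations, long periodMs) {
        return iterations * periodMs;
    }

    public static void main(String[] args) {
        long periodMs = 10, timeoutMs = 600_000, supplierMs = 165;

        long wallMs = wallClockToTimeoutMs(periodMs, timeoutMs, supplierMs); // 10,500,000 ms
        double inflation = (double) wallMs / timeoutMs;                      // 17.5

        long observedIterations = 41_184;
        long tracked = trackedMs(observedIterations, periodMs);              // 411,840 ms
        long actual = observedIterations * (periodMs + supplierMs);          // 7,207,200 ms

        System.out.printf("time to timeout: %.1f min, inflation: %.1fx%n", wallMs / 60_000.0, inflation);
        System.out.printf("tracked: %.2f min, actual: %.2f min%n", tracked / 60_000.0, actual / 60_000.0);
    }
}
```

The tracked value works out to about 6.86 minutes against roughly 120 minutes of real polling, matching the observation reported above.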
This affects approximately 23 production call sites across the codebase, including DatastreamRestClient (12 sites covering datastream lifecycle operations), schema registry polling (7+ sites across MySQL, Oracle, TiDB connectors), Kafka commit retries, DMS REST.li polling, and Coordinator state transitions. Any caller whose supplier performs I/O is subject to the same inflation.
The fix replaces the elapsedMs accumulator with a System.currentTimeMillis() snapshot taken at method entry, and the timeout check becomes a wall-clock comparison against that snapshot. Both the boolean overload (used by predicate-based callers) and the Optional overload (used by supplier-based callers) are fixed. Callers with near-instantaneous predicates see no behavioral change, since their wall-clock time already closely matched the tracked elapsed time.
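A minimal sketch of the wall-clock approach is shown below. It is not the merged diff: the class name, method signatures, interrupt handling, and the `simulateIo` helper are assumptions made for illustration. Only the use of a System.currentTimeMillis() snapshot at method entry comes from the description above.

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class WallClockPollSketch {

    /** Stand-in for supplier I/O (hypothetical helper). */
    static void simulateIo(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Boolean overload (assumed signature): delegates to the Optional overload. */
    static boolean poll(BooleanSupplier cond, long periodMs, long timeoutMs) {
        return poll(cond::getAsBoolean, Boolean::booleanValue, periodMs, timeoutMs).isPresent();
    }

    /** Optional overload (assumed signature): returns the first value that satisfies cond. */
    static <T> Optional<T> poll(Supplier<T> supplier, Predicate<T> cond, long periodMs, long timeoutMs) {
        long startMs = System.currentTimeMillis(); // snapshot taken at method entry
        while (true) {
            T value = supplier.get();
            if (cond.test(value)) {
                return Optional.of(value);
            }
            // Wall-clock comparison: supplier time and sleep time both count.
            if (System.currentTimeMillis() - startMs >= timeoutMs) {
                return Optional.empty();
            }
            try {
                Thread.sleep(periodMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        boolean met = poll(() -> { simulateIo(20); return false; }, 5, 50);
        long wallMs = System.currentTimeMillis() - start;
        System.out.println("met=" + met + ", configured timeout=50 ms, wall-clock=" + wallMs + " ms");
    }
}
```

One design note: System.currentTimeMillis() can jump if the system clock is adjusted, so System.nanoTime() is a common alternative for measuring elapsed intervals. The sketch keeps currentTimeMillis() to match the fix as described.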
Testing Done
Will deploy and test