Fix PollUtils.poll to use wall-clock time for timeout instead of accumulating sleep durations #1000

Merged
khandelwal-ayush merged 2 commits into linkedin:master from khandelwal-ayush:fix-poll-utils
Feb 26, 2026

@khandelwal-ayush
Collaborator

Summary

PollUtils.poll tracks elapsed time by summing Thread.sleep(periodMs) durations instead of measuring actual wall-clock time. Time spent executing the supplier (network calls, schema registry lookups,
database queries) is not counted toward the timeout. As a result, the effective timeout is inflated by a factor of roughly (supplier execution time + periodMs) / periodMs: the slower the supplier is relative to the sleep interval, the longer the real timeout becomes.
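The accumulation pattern described above can be sketched as follows. This is a minimal illustration, not the actual PollUtils source: the class name `SleepAccumulatingPoll`, the method signature, and the order of the checks are assumptions made for the example.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of the bug; not the actual PollUtils source.
final class SleepAccumulatingPoll {
  private SleepAccumulatingPoll() { }

  /**
   * Polls until cond returns true or the tracked elapsed time reaches timeoutMs.
   * Only the sleep duration is added to elapsedMs, so time spent inside
   * cond.getAsBoolean() never counts toward the timeout.
   */
  static boolean poll(BooleanSupplier cond, long periodMs, long timeoutMs)
      throws InterruptedException {
    long elapsedMs = 0;
    while (true) {
      if (cond.getAsBoolean()) {   // may block on I/O for much longer than periodMs
        return true;
      }
      if (elapsedMs >= timeoutMs) {
        return false;
      }
      Thread.sleep(periodMs);
      elapsedMs += periodMs;       // counts the sleep only, not the supplier call
    }
  }
}
```

In this sketch, the number of supplier calls before giving up depends only on timeoutMs / periodMs, regardless of how long each call takes.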

This was discovered during MySQL bootstrap stream creation where the connector is configured with periodMs=10ms and timeoutMs=600,000ms (10 minutes). Each schema registry call takes approximately 165ms to
fail, making each iteration ~175ms of wall-clock time while only 10ms is counted toward the timeout. After 41,184 iterations and approximately 2 hours of continuous polling, PollUtils had tracked only ~6.9
minutes of "elapsed" time, still well short of the 10-minute timeout. The actual time required to trigger the timeout is approximately 175 minutes (about 3 hours), a roughly 17.5x inflation.
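The arithmetic behind these figures can be checked with a small calculation. This is a sketch only: `TimeoutInflation` and its methods are hypothetical helpers, and the model assumes every iteration costs exactly one supplier call plus one sleep.

```java
// Hypothetical helper for illustration; the input figures come from the description above.
final class TimeoutInflation {
  private TimeoutInflation() { }

  /** Iterations needed before the summed sleeps reach the timeout (ceiling division). */
  static long iterationsToTimeout(long periodMs, long timeoutMs) {
    return (timeoutMs + periodMs - 1) / periodMs;
  }

  /** Approximate real time spent before the sleep-only accumulator reaches the timeout. */
  static long wallClockMsToTimeout(long periodMs, long supplierMs, long timeoutMs) {
    return iterationsToTimeout(periodMs, timeoutMs) * (periodMs + supplierMs);
  }

  /** Ratio of real time per iteration to counted time per iteration. */
  static double inflationFactor(long periodMs, long supplierMs) {
    return (double) (periodMs + supplierMs) / periodMs;
  }

  public static void main(String[] args) {
    long periodMs = 10, supplierMs = 165, timeoutMs = 600_000;
    System.out.println(iterationsToTimeout(periodMs, timeoutMs));              // 60000
    System.out.println(wallClockMsToTimeout(periodMs, supplierMs, timeoutMs)); // 10500000 ms = 175 min
    System.out.println(inflationFactor(periodMs, supplierMs));                 // 17.5
    System.out.println(41_184 * periodMs);                                     // 411840 ms, about 6.9 min
  }
}
```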

This affects approximately 23 production call sites across the codebase including DatastreamRestClient (12 sites covering datastream lifecycle operations), schema registry polling (7+ sites across MySQL,
Oracle, TiDB connectors), Kafka commit retries, DMS REST.li polling, and Coordinator state transitions. Any caller whose supplier performs I/O is subject to the same inflation.

The fix replaces the elapsedMs accumulator with a System.currentTimeMillis() snapshot taken at method entry. The timeout check becomes a wall-clock comparison. Both the boolean overload (used by
predicate-based callers) and the Optional overload (used by supplier-based callers) are fixed. Behavior is unchanged for callers with near-instantaneous predicates, since their wall-clock time already closely
matched the tracked elapsed time.
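A wall-clock version of the loop might look like the sketch below. It is not the merged diff: the class name `WallClockPoll`, both signatures, and the use of `Optional.ofNullable` are assumptions made for illustration. Only the idea of a start-time snapshot compared against `System.currentTimeMillis()` comes from the description above.

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical sketch of the wall-clock approach; not the actual PollUtils source.
final class WallClockPoll {
  private WallClockPoll() { }

  /** Boolean variant: returns true if cond succeeds before timeoutMs of real time passes. */
  static boolean poll(BooleanSupplier cond, long periodMs, long timeoutMs)
      throws InterruptedException {
    long startMs = System.currentTimeMillis();   // snapshot at method entry
    while (true) {
      if (cond.getAsBoolean()) {
        return true;
      }
      // Time spent inside cond is now included in the comparison.
      if (System.currentTimeMillis() - startMs >= timeoutMs) {
        return false;
      }
      Thread.sleep(periodMs);
    }
  }

  /** Optional variant: returns the first supplied value accepted by cond, or empty on timeout. */
  static <T> Optional<T> poll(Supplier<T> supplier, Predicate<T> cond,
                              long periodMs, long timeoutMs) throws InterruptedException {
    long startMs = System.currentTimeMillis();
    while (true) {
      T value = supplier.get();
      if (cond.test(value)) {
        return Optional.ofNullable(value);
      }
      if (System.currentTimeMillis() - startMs >= timeoutMs) {
        return Optional.empty();
      }
      Thread.sleep(periodMs);
    }
  }
}
```

One design consideration worth noting: `System.currentTimeMillis()` can jump if the system clock is adjusted, whereas `System.nanoTime()` is intended for measuring elapsed time. The description above names `currentTimeMillis()`, so the sketch follows it.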

Testing Done

Will deploy and test

Collaborator

@akshayrai left a comment

Nice. Can you include some unit tests for this?

Collaborator

@akshayrai left a comment

Let's document the impact and what improvements we observe

@khandelwal-ayush khandelwal-ayush merged commit 11a7ca6 into linkedin:master Feb 26, 2026
1 of 2 checks passed

3 participants