[feat] Add retransmission mechanism for MooncakeStoreClient#94
Open
0oshowero0 wants to merge 3 commits into
Open
[feat] Add retransmission mechanism for MooncakeStoreClient#940oshowero0 wants to merge 3 commits into
MooncakeStoreClient#940oshowero0 wants to merge 3 commits into
Conversation
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
CLA Signature Pass0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds retry logic to MooncakeStoreClient.get() to make transient MooncakeStore transfer failures less likely to crash training jobs or silently corrupt returned values.
Changes:
- Added retry-with-backoff for the tensor retrieval path (
batch_get_into) by retrying only the failed subset of keys. - Added retry-with-backoff for the non-tensor retrieval path (
get_batch) by detecting failures viab""and retrying only the failed subset. - Increased
BATCH_SIZE_LIMITfrom 200 to 400.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+37
to
+40
| BATCH_SIZE_LIMIT: int = 400 | ||
| MAX_WORKER_THREADS = 4 | ||
| MAX_RETRIES = 3 | ||
| RETRY_DELAY_SECONDS = 1.0 |
Comment on lines
+259
to
+275
| for attempt in range(1, MAX_RETRIES + 1): | ||
| # Reuse the originally allocated pointers; no need to allocate/register new buffers. | ||
| retry_ptrs = [batch_buffer_ptrs[i] for i in current_failed_indices] | ||
| retry_nbytes = [batch_nbytes[i] for i in current_failed_indices] | ||
|
|
||
| retry_codes = self._store.batch_get_into(current_failed_keys, retry_ptrs, retry_nbytes) | ||
|
|
||
| next_failed_indices = [] | ||
| next_failed_keys = [] | ||
| next_failed_codes = [] | ||
|
|
||
| for i, ret in enumerate(retry_codes): | ||
| if ret < 0: | ||
| next_failed_indices.append(current_failed_indices[i]) | ||
| next_failed_keys.append(current_failed_keys[i]) | ||
| next_failed_codes.append(ret) | ||
|
|
Comment on lines
+313
to
+327
| for attempt in range(1, MAX_RETRIES + 1): | ||
| retry_results = self._store.get_batch(current_failed_keys) | ||
|
|
||
| next_failed_keys = [] | ||
| next_failed_indices = [] | ||
|
|
||
| for i, result in enumerate(retry_results): | ||
| original_idx = current_failed_indices[i] | ||
| if result == b"": | ||
| next_failed_keys.append(current_failed_keys[i]) | ||
| next_failed_indices.append(original_idx) | ||
| else: | ||
| # Write the successfully retried value back to its original slot immediately. | ||
| raw_results[original_idx] = result | ||
|
|
Comment on lines
+337
to
+346
| else: | ||
| # All retries exhausted. | ||
| # FIXME: raise error here when we can distinguish transmission failures from empty values | ||
| logger.error( | ||
| f"get_batch failed for keys {current_failed_keys} after retrying {MAX_RETRIES} times. " | ||
| f"Please validate if the values corresponding to these keys are `None` during put." | ||
| ) | ||
|
|
||
| return results, indexes | ||
| deserialized_results = [pickle.loads(result) if result != b"" else None for result in raw_results] | ||
| return deserialized_results, indexes |
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
CLA Signature Pass0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
CLA Signature Pass0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
|
|
||
| retry_results = self._store.batch_upsert_from( | ||
| current_failed_keys, retry_ptrs, retry_sizes, config=self.replica_config | ||
| ) |
| retry_ptrs = [batch_buffer_ptrs[i] for i in current_failed_indices] | ||
| retry_nbytes = [batch_nbytes[i] for i in current_failed_indices] | ||
|
|
||
| retry_codes = self._store.batch_get_into(current_failed_keys, retry_ptrs, retry_nbytes) |
Comment on lines
+392
to
+400
| for attempt in range(1, MAX_RETRIES + 1): | ||
| retry_results = self._store.get_batch(current_failed_keys) | ||
|
|
||
| next_failed_keys = [] | ||
| next_failed_indices = [] | ||
|
|
||
| for i, result in enumerate(retry_results): | ||
| original_idx = current_failed_indices[i] | ||
| if result == b"": |
Comment on lines
+37
to
+40
| BATCH_SIZE_LIMIT: int = 400 | ||
| MAX_WORKER_THREADS = 4 | ||
| MAX_RETRIES = 3 | ||
| RETRY_DELAY_SECONDS = 1.0 |
Comment on lines
+419
to
+423
| else: | ||
| # All retries exhausted. | ||
| raise RuntimeError( | ||
| f"get_batch failed for keys {current_failed_keys} after retrying {MAX_RETRIES} times." | ||
| ) |
Collaborator
|
Perhaps we need to add a debug mode to verify the consistency of data transmission and reception? WDYT? |
Collaborator
Author
|
I believe it should be done by the communication backend |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
When using
MooncakeStoreas TQ's backend, we observe occasional transmission errors during verl e2e runs:These
Connection timed out/TRANSFER_FAIL(error: -800) errors are transient network issues that typically resolve on a subsequent attempt. However, the previous client implementation had no retry logic whatsoever:RuntimeError, failing the entire batch and crashing the training job.get_batchreturnsb""for keys that encountered a transfer failure, and the client blindly passed these empty bytes throughpickle.loads(... if result != b"" else None), treating them as legitimateNonevalues.This leads to silent content corruption. A training worker could proceed with corrupted or missing data without ever knowing that a transmission failure occurred, compromising model correctness.
This PR addresses all failure modes on both the read (
get) and write (put) paths by adding controlled retries that isolate failed keys and attempt retransmission before giving up.Summary
This PR introduces a retry mechanism for transient failures in
MooncakeStoreClient, covering both read (get) and write (put) operations, for both tensor and non-tensor data paths.Previously, the client had zero tolerance for transient errors:
_get_tensors_thread_worker): a single key failure (ret < 0) would immediately raiseRuntimeError, causing the entire batch to fail._get_bytes_thread_worker): no failure detection at all. Empty byte strings (b"") — which MooncakeStore returns on transmission failures — were silently deserialized asNone, making it impossible for callers to distinguish between "value is None" and "transfer failed"._put_tensors_thread_worker): any single key returning a non-zero status would immediately abort the entire batch withRuntimeError._put_bytes_thread_worker): a singleupsert_batchfailure would immediately abort the batch withRuntimeError.This change adds up to 3 retries with 1-second backoff across all four paths. For paths that expose per-key status codes, only the failed subset of keys is retried on each attempt.
Future Work
b""heuristic in_get_bytes_thread_workerwith proper per-key error codes once MooncakeStore exposes them, then upgrade the exhausted-retry path fromlogger.errortoraise RuntimeError.upsert_batchandget_batch, switch the bytes write/read paths from whole-batch retry to per-key selective retry, matching the tensor-path behaviour.