[fix] Increase mooncake_master client_ttl to mitigate client heartbeat timeout #99

Merged
0oshowero0 merged 1 commit into Ascend:main from 0oshowero0:mooncake on May 14, 2026

@0oshowero0
Collaborator

Background

While integrating MooncakeStore with verl, we encountered unexpected connection errors and segment unregistrations. Our investigation traced the issue to Mooncake's heartbeat mechanism being starved or stalled.

Mooncake's heartbeat logic is implemented in C++ and in theory should not be affected by Python-level concurrency models (such as ray, asyncio, or multi-threading/coroutines). However, our analysis indicates that the blockage likely occurs at the C++/Python interface layer (to be further verified): when the Python side is heavily loaded, the interface stalls, which inadvertently blocks the underlying C++ heartbeat threads. Once heartbeats are delayed for longer than the master's client_ttl, the master presumably treats the client as timed out, which is consistent with the segment unregistrations we observed.
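
To illustrate why a longer TTL helps, here is a minimal toy model of TTL-based liveness tracking. It is not Mooncake's implementation: the `ToyMaster` class, the `survives_stall` helper, and the 15-unit stall are all hypothetical, and the time unit is assumed to be seconds. It only shows that a heartbeat gap longer than the TTL expires the client, while the same gap under a larger TTL does not.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for a master that drops silent clients."""

    client_ttl: float  # assumed unit: seconds
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, client_id: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this client.
        self.last_seen[client_id] = now

    def expire(self, now: float) -> list:
        # Drop every client whose last heartbeat is older than client_ttl.
        dead = [c for c, t in self.last_seen.items() if now - t > self.client_ttl]
        for c in dead:
            del self.last_seen[c]
        return dead


def survives_stall(client_ttl: float, stall: float) -> bool:
    """Return True if a client that is silent for `stall` is still registered."""
    master = ToyMaster(client_ttl)
    master.heartbeat("client-0", now=0.0)
    # The client sends nothing for `stall` time units, then the master checks.
    return "client-0" not in master.expire(now=stall)


if __name__ == "__main__":
    for ttl in (10, 30):
        print(f"client_ttl={ttl}, stall=15 -> survives: {survives_stall(ttl, 15)}")
```

In this toy model a stall of 15 expires the client when the TTL is 10 but not when it is 30. A larger TTL only widens the tolerated gap; it does not remove the underlying stall.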


Solution

Increase -client_ttl from 10 to 30 when launching the mooncake_master process.
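
As a rough sketch of what this looks like at launch time, the snippet below builds the startup command with a configurable TTL. The helper names `build_master_cmd` and `start_master` are hypothetical and not part of this repository, and the list contains only the flags visible in this PR's diff; the real command may pass additional arguments.

```python
from __future__ import annotations

import subprocess


def build_master_cmd(client_ttl: int = 30) -> list[str]:
    """Build the mooncake_master command line (hypothetical helper).

    Only the flags shown in this PR's diff are included here.
    """
    return [
        "mooncake_master",
        f"-client_ttl={client_ttl}",
        "-default_kv_lease_ttl=999999",
        "-default_kv_soft_pin_ttl=999999",
    ]


def start_master(client_ttl: int = 30) -> subprocess.Popen:
    """Launch mooncake_master as a child process (hypothetical helper).

    Requires the mooncake_master binary to be on PATH.
    """
    return subprocess.Popen(build_master_cmd(client_ttl))


if __name__ == "__main__":
    # Print the command instead of launching it, so this runs anywhere.
    print(" ".join(build_master_cmd()))
```

Keeping the TTL as a parameter rather than a hard-coded string would make it easier to tune later if 30 turns out to be too short or too long under real load.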

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
Copilot AI review requested due to automatic review settings May 14, 2026 12:12
@ascend-robot

CLA Signature Pass

0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Contributor

Copilot AI left a comment

Pull request overview

This PR adjusts the MooncakeStore auto-initialization path to reduce unexpected Mooncake client disconnects/segment unregistrations under heavy Python-side load by increasing the mooncake_master client TTL.

Changes:

  • Add -client_ttl=30 to the mooncake_master startup command used when MooncakeStore.auto_init is enabled.

Comment on lines 146 to 150
cmd = [
    "mooncake_master",
    "-client_ttl=30",
    "-default_kv_lease_ttl=999999",
    "-default_kv_soft_pin_ttl=999999",
@0oshowero0 0oshowero0 changed the title [fix] Increase mooncake_master client_ttl to prevent client segment unmount [fix] Increase mooncake_master client_ttl to relief client segment unmount May 14, 2026
@0oshowero0 0oshowero0 changed the title [fix] Increase mooncake_master client_ttl to relief client segment unmount [fix] Increase mooncake_master client_ttl to mitigate client heartbeat timeout May 14, 2026
@0oshowero0 0oshowero0 merged commit 768ae3a into Ascend:main May 14, 2026
12 checks passed

3 participants