Add end-to-end profiling instrumentation to CUDA IPC store path by Copilot · Pull Request #372 · hlin99/LMCache

Copilot · 2026-06-17T03:04:56Z

Adds `time.perf_counter()` timing to the CUDA IPC (handle-based) store path to expose per-stage latency for both the vLLM forward thread and the GPU transfer server.

`HandleTransferContext.submit_store()` — `[FWD-IPC]` log

Times each step on the forward thread (the path that blocks vLLM):

- `event.ipc_handle()` serialization
- `_send_request()` MQ round-trip
- `.to_cuda_future()` conversion

[FWD-IPC] req=cmpl-xxx ipc_handle=0.012 send_request=0.841 to_cuda_future=0.203 total=1.056 ms
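The per-stage pattern behind a line like this can be sketched as follows. This is a minimal illustration, not the PR's code: `timed_submit` and the zero-argument stage callables are hypothetical stand-ins for the real `event.ipc_handle()`, `_send_request()`, and `.to_cuda_future()` calls.

```python
import time


def timed_submit(req_id, stages):
    """Run each stage in order and return a [FWD-IPC]-style log line.

    `stages` is an ordered list of (name, zero-arg callable) pairs.
    Both the function and its arguments are hypothetical stand-ins.
    """
    timings_ms = {}
    t_start = time.perf_counter()
    for name, fn in stages:
        t0 = time.perf_counter()
        fn()
        # perf_counter() returns seconds; convert to milliseconds
        timings_ms[name] = (time.perf_counter() - t0) * 1e3
    timings_ms["total"] = (time.perf_counter() - t_start) * 1e3
    fields = " ".join(f"{k}={v:.3f}" for k, v in timings_ms.items())
    return f"[FWD-IPC] req={req_id} {fields} ms"


line = timed_submit(
    "cmpl-demo",
    [
        ("ipc_handle", lambda: None),
        ("send_request", lambda: time.sleep(0.001)),
        ("to_cuda_future", lambda: None),
    ],
)
print(line)
```

Because `total` is measured from its own start timestamp rather than summed, it also captures any overhead between stages.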

`GPUTransferModule.store()` — `[GPU-STORE]` + `[GPU-STORE-CHUNK]` logs

Breaks down every stage in the server-side store:

- `resolve_obj_keys`, `copy_view_block_ids_to_gpu`
- `Event.from_ipc_handle` + `vllm_event.wait`
- `event_bus.publish` / `publish_on_stream`
- `reserve_write`
- Per-chunk kernel launch (`multi_layer_block_kv_transfer`) and `lmcache_memcpy_async_d2h`
- `event.record()` and `submit_callback_to_stream`

[GPU-STORE] req=cmpl-xxx resolve_keys=0.021 copy_block_ids=0.314 event_ipc_wait=0.182 event_publish=0.095 reserve_write=0.412 kernel_loop=48.231 event_record=0.031 submit_cb=2.104 total=51.390 ms (num_chunks=2, num_groups=1)
[GPU-STORE-CHUNK] req=cmpl-xxx chunk_idx=0 kernel=24.103 memcpy_d2h=0.041 ms
[GPU-STORE-CHUNK] req=cmpl-xxx chunk_idx=1 kernel=24.128 memcpy_d2h=0.038 ms
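For reading these logs, a small parser can pull the stage timings out of a `[GPU-STORE]` summary line and point at the dominant stage. This is a hedged sketch: `parse_stage_ms` is a hypothetical helper, and it assumes timings are always printed with a decimal point (as in the sample above), so integer fields such as `num_chunks=2` are skipped.

```python
import re

# Matches name=float pairs; integer-only fields (num_chunks=2) and
# non-numeric fields (req=cmpl-xxx) deliberately do not match.
FIELD = re.compile(r"(\w+)=(\d+\.\d+)")


def parse_stage_ms(line):
    """Return ({stage: ms}, total_ms) for one summary log line."""
    fields = {name: float(value) for name, value in FIELD.findall(line)}
    total = fields.pop("total", None)
    return fields, total


sample = (
    "[GPU-STORE] req=cmpl-xxx resolve_keys=0.021 copy_block_ids=0.314 "
    "event_ipc_wait=0.182 event_publish=0.095 reserve_write=0.412 "
    "kernel_loop=48.231 event_record=0.031 submit_cb=2.104 "
    "total=51.390 ms (num_chunks=2, num_groups=1)"
)

stages, total = parse_stage_ms(sample)
dominant = max(stages, key=stages.get)
print(f"{dominant}: {stages[dominant]:.3f} ms of {total:.3f} ms")
```

On the sample line, the individual stage values add up to the reported total of 51.390 ms, so a large gap between the two in real logs would indicate time spent outside the instrumented stages.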

All timing variables are initialized before the `try` block so edge cases (empty `obj_keys`, all chunks skipped) never produce a `NameError`. No functional logic changed.
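The scoping concern can be illustrated with a minimal sketch. It assumes, for illustration only, that the summary line is emitted from a `finally` clause; `store_with_timing`, its arguments, and the log layout are hypothetical and not taken from `gpu_transfer.py`.

```python
import time


def store_with_timing(req_id, chunks, log=print):
    """Time a per-chunk loop; `chunks` is a list of zero-arg callables."""
    # Initialized before the try block: if `chunks` is empty or a chunk
    # raises early, the summary below still refers to defined names.
    t_kernel_loop = 0.0
    num_chunks = 0
    t_start = time.perf_counter()
    try:
        for chunk in chunks:
            t0 = time.perf_counter()
            chunk()
            t_kernel_loop += (time.perf_counter() - t0) * 1e3
            num_chunks += 1
    finally:
        total = (time.perf_counter() - t_start) * 1e3
        log(
            f"[GPU-STORE] req={req_id} kernel_loop={t_kernel_loop:.3f} "
            f"total={total:.3f} ms (num_chunks={num_chunks})"
        )


# The empty case logs zeros instead of raising NameError.
store_with_timing("cmpl-demo", [])
```

Had `t_kernel_loop` been assigned only inside the loop body, the empty-input call would fail at the log statement rather than in the code being measured.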

- worker_transfer.py: Add `import time` + timing to `HandleTransferContext.submit_store()` with `[FWD-IPC]` log covering ipc_handle, send_request, to_cuda_future, and total ms
- gpu_transfer.py: Add granular timing to `GPUTransferModule.store()` with `[GPU-STORE]` summary log and per-chunk `[GPU-STORE-CHUNK]` logs covering kernel launch and memcpy_d2h

Initial plan

5c23520

Copilot AI assigned Copilot and hlin99 Jun 17, 2026

Copilot started work on behalf of hlin99 June 17, 2026 03:05 View session

Copilot AI added 2 commits June 17, 2026 03:11

Fix timing variable scoping: initialize before try block to avoid NameError risk

5683dd0

Copilot AI changed the title ~~[WIP] Add comprehensive profiling instrumentation to CUDA IPC transfer path~~ Add end-to-end profiling instrumentation to CUDA IPC store path Jun 17, 2026

Copilot finished work on behalf of hlin99 June 17, 2026 03:13

Copilot AI requested a review from hlin99 June 17, 2026 03:13

Copilot wants to merge 3 commits into copilot/ww24-pr-async-again from copilot/add-comprehensive-profiling-instrumentation
