Skip to content

Add end-to-end profiling instrumentation to CUDA IPC store path#372

Draft
Copilot wants to merge 3 commits into
copilot/ww24-pr-async-againfrom
copilot/add-comprehensive-profiling-instrumentation
Draft

Add end-to-end profiling instrumentation to CUDA IPC store path#372
Copilot wants to merge 3 commits into
copilot/ww24-pr-async-againfrom
copilot/add-comprehensive-profiling-instrumentation

Conversation

Copilot AI commented Jun 17, 2026

Copy link
Copy Markdown

Adds time.perf_counter() timing to the CUDA IPC (handle-based) store path to expose per-stage latency for both the vLLM forward thread and the GPU transfer server.

HandleTransferContext.submit_store()[FWD-IPC] log

Times each step on the forward thread (the path that blocks vLLM):

  • event.ipc_handle() serialization
  • _send_request() MQ round-trip
  • .to_cuda_future() conversion
[FWD-IPC] req=cmpl-xxx ipc_handle=0.012 send_request=0.841 to_cuda_future=0.203 total=1.056 ms

GPUTransferModule.store()[GPU-STORE] + [GPU-STORE-CHUNK] logs

Breaks down every stage in the server-side store:

  • resolve_obj_keys, copy_view_block_ids_to_gpu
  • Event.from_ipc_handle + vllm_event.wait
  • event_bus.publish / publish_on_stream
  • reserve_write
  • Per-chunk kernel launch (multi_layer_block_kv_transfer) and lmcache_memcpy_async_d2h
  • event.record() and submit_callback_to_stream
[GPU-STORE] req=cmpl-xxx resolve_keys=0.021 copy_block_ids=0.314 event_ipc_wait=0.182 event_publish=0.095 reserve_write=0.412 kernel_loop=48.231 event_record=0.031 submit_cb=2.104 total=51.390 ms (num_chunks=2, num_groups=1)
[GPU-STORE-CHUNK] req=cmpl-xxx chunk_idx=0 kernel=24.103 memcpy_d2h=0.041 ms
[GPU-STORE-CHUNK] req=cmpl-xxx chunk_idx=1 kernel=24.128 memcpy_d2h=0.038 ms

All timing variables are initialized before the try block so edge cases (empty obj_keys, all chunks skipped) never produce a NameError. No functional logic changed.

Copilot AI added 2 commits June 17, 2026 03:11
- worker_transfer.py: Add import time + timing to HandleTransferContext.submit_store()
  with [FWD-IPC] log covering ipc_handle, send_request, to_cuda_future, and total ms
- gpu_transfer.py: Add granular timing to GPUTransferModule.store() with [GPU-STORE]
  summary log and per-chunk [GPU-STORE-CHUNK] logs covering kernel launch and memcpy_d2h
Copilot AI changed the title [WIP] Add comprehensive profiling instrumentation to CUDA IPC transfer path Add end-to-end profiling instrumentation to CUDA IPC store path Jun 17, 2026
Copilot AI requested a review from hlin99 June 17, 2026 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants