Please consider the sample code below.
In the torch profiler trace, you can see that there are
- sync events, even though everything is executed on the same stream
- MemCpy (Device -> Pageable) and MemCpy (Device -> Pinned), even though nothing ever leaves the GPU
- as a result, gaps in the GPU stream: the GPU goes idle, waiting
This kills performance if the decompression is interleaved with any other GPU ops.
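The decoded result itself does live on the GPU, so the Device -> Host copies appear to be internal to nvcomp. Quick sanity check (uses the `decode` helper and `compressed` from the repro below, and assumes the decoded nvcomp array exposes `to_dlpack()` like the encoded one does):

```python
# Assumes the repro below has run (decode/compressed defined); checks that the
# decoded nvcomp array really is device memory.
import torch
from torch.utils.dlpack import from_dlpack

decoded = decode(*compressed)
t = from_dlpack(decoded.to_dlpack())
print(t.device)  # cuda:0 - yet the trace still shows Device -> Host copies
```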
This is what it should look like when I remove the nvcomp decode operation:
- GPU operations are scheduled
- the command buffer is full
- no gaps
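That baseline is just the loop from the repro below with the decode call removed (file name here is arbitrary):

```python
# Baseline: identical setup to the full repro below, decode() removed.
with TorchProfiler("baseline.json", enabled=True):
    for _ in tqdm.tqdm(range(10000)):
        # decode(*compressed)  # removed -> no syncs, no memcpys, no gaps
        torch._int_mm(act, weight.T)
```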
```python
import torch
import tqdm
from nvidia import nvcomp
from torch import Tensor
from torch.utils.dlpack import from_dlpack

from modules.util.profiling_util import TorchProfiler

device = "cuda"
act = torch.randint(-127, 127, (3072, 3072), device=device, dtype=torch.int8)
weight = torch.randint(-127, 127, (3072, 3072), device=device, dtype=torch.int8)

stream = torch.cuda.current_stream().cuda_stream
codec = nvcomp.Codec(algorithm="Zstd", uncomp_chunk_size=2048, cuda_stream=stream)

def encode(x: Tensor):
    # Compress on the GPU, then copy the result into a torch tensor via DLPack.
    array = nvcomp.as_array(x)
    compressed = codec.encode(array).cuda()
    tensor = from_dlpack(compressed.to_dlpack()).clone()
    return tensor, x.shape, x.dtype

def decode(compressed_tensor: Tensor, shape: list[int], dtype: torch.dtype):
    # shape/dtype would be needed to reinterpret the output; unused in this repro.
    compressed_array = nvcomp.as_array(compressed_tensor, cuda_stream=stream)
    return codec.decode(compressed_array)

compressed = encode(weight)
with TorchProfiler("decompress.json", enabled=True):
    for _ in tqdm.tqdm(range(10000)):
        decode(*compressed)
        # note: this is just another GPU operation to demonstrate the sync.
        # it doesn't even use the result of decode:
        torch._int_mm(act, weight.T)
```
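`TorchProfiler` comes from my own codebase; it's nothing special, roughly a thin wrapper like this around `torch.profiler` (a sketch, not the exact implementation), so the repro doesn't depend on anything private:

```python
# Rough stand-in for modules.util.profiling_util.TorchProfiler, to make the
# repro self-contained: profile CPU+CUDA and export a Chrome trace.
from contextlib import contextmanager

import torch
from torch.profiler import ProfilerActivity, profile

@contextmanager
def TorchProfiler(path: str, enabled: bool = True):
    if not enabled:
        yield
        return
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        yield
    prof.export_chrome_trace(path)
```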
This might be related to NVIDIA/nvcomp#105, but I'm not entirely sure, because I use a different API.
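For completeness, the offending events can be counted straight from the exported Chrome trace (a sketch; the exact event names may differ between torch/kineto versions):

```python
# Count sync/copy events in the decompress.json trace from the repro above.
import json

with open("decompress.json") as f:
    trace = json.load(f)

for needle in ("Memcpy DtoH", "cudaStreamSynchronize", "cudaMemcpyAsync"):
    n = sum(1 for e in trace["traceEvents"] if needle in (e.get("name") or ""))
    print(f"{needle}: {n}")
```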