Please consider the sample code below.
In the torch profiler trace, you can see that there are
- sync events, even though everything is executed on the same stream
- MemCpy (Device -> Pageable) and MemCpy (Device -> Pinned), even though nothing ever leaves the GPU
- as a result, gaps in the GPU stream: the GPU goes idle, waiting
This kills performance if the decompression is interleaved with any other GPU ops.
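The decoded result itself does live on the GPU, so the Device -> Host copies appear to be internal to nvcomp. Quick sanity check (uses the `decode` helper and `compressed` from the repro below, and assumes the decoded nvcomp array exposes `to_dlpack()` like the encoded one does):

```python
# Assumes the repro below has run (decode/compressed defined); checks that the
# decoded nvcomp array really is device memory.
import torch
from torch.utils.dlpack import from_dlpack

decoded = decode(*compressed)
t = from_dlpack(decoded.to_dlpack())
print(t.device)  # cuda:0 - yet the trace still shows Device -> Host copies
```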
This is what it should look like when I remove the nvcomp decode operation:
- GPU operations are scheduled
- the command buffer is full
- no gaps
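That baseline is just the loop from the repro below with the decode call removed (file name here is arbitrary):

```python
# Baseline: identical setup to the full repro below, decode() removed.
with TorchProfiler("baseline.json", enabled=True):
    for _ in tqdm.tqdm(range(10000)):
        # decode(*compressed)  # removed -> no syncs, no memcpys, no gaps
        torch._int_mm(act, weight.T)
```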
```python
import torch
import tqdm
from nvidia import nvcomp
from torch import Tensor
from torch.utils.dlpack import from_dlpack

from modules.util.profiling_util import TorchProfiler

device = "cuda"
act = torch.randint(-127, 127, (3072, 3072), device=device, dtype=torch.int8)
weight = torch.randint(-127, 127, (3072, 3072), device=device, dtype=torch.int8)

stream = torch.cuda.current_stream().cuda_stream
codec = nvcomp.Codec(algorithm="Zstd", uncomp_chunk_size=2048, cuda_stream=stream)

def encode(x: Tensor):
    # Compress on the GPU, then copy the result into a torch tensor via DLPack.
    array = nvcomp.as_array(x)
    compressed = codec.encode(array).cuda()
    tensor = from_dlpack(compressed.to_dlpack()).clone()
    return tensor, x.shape, x.dtype

def decode(compressed_tensor: Tensor, shape: list[int], dtype: torch.dtype):
    # shape/dtype would be needed to reinterpret the output; unused in this repro.
    compressed_array = nvcomp.as_array(compressed_tensor, cuda_stream=stream)
    return codec.decode(compressed_array)

compressed = encode(weight)
with TorchProfiler("decompress.json", enabled=True):
    for _ in tqdm.tqdm(range(10000)):
        decode(*compressed)
        # note: this is just another GPU operation to demonstrate the sync.
        # it doesn't even use the result of decode:
        torch._int_mm(act, weight.T)
```
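`TorchProfiler` comes from my own codebase; it's nothing special, roughly a thin wrapper like this around `torch.profiler` (a sketch, not the exact implementation), so the repro doesn't depend on anything private:

```python
# Rough stand-in for modules.util.profiling_util.TorchProfiler, to make the
# repro self-contained: profile CPU+CUDA and export a Chrome trace.
from contextlib import contextmanager

import torch
from torch.profiler import ProfilerActivity, profile

@contextmanager
def TorchProfiler(path: str, enabled: bool = True):
    if not enabled:
        yield
        return
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        yield
    prof.export_chrome_trace(path)
```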
This might be related to NVIDIA/nvcomp#105, but I'm not entirely sure, because I use a different API.
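For completeness, the offending events can be counted straight from the exported Chrome trace (a sketch; the exact event names may differ between torch/kineto versions):

```python
# Count sync/copy events in the decompress.json trace from the repro above.
import json

with open("decompress.json") as f:
    trace = json.load(f)

for needle in ("Memcpy DtoH", "cudaStreamSynchronize", "cudaMemcpyAsync"):
    n = sum(1 for e in trace["traceEvents"] if needle in (e.get("name") or ""))
    print(f"{needle}: {n}")
```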