`cudaErrorMisalignedAddress` caused by `nvcompBatchedCascadedDecompressAsync`

Hello everyone, I met some problem when I tried to let cudf support non-Parquet standard Compressions (e.g. CASCADED from nvComp).

When I tried to let cudf able to read/write DEFLATE compressed Parquet, everything works find. But when I tried to replicate the same with CASCADED , the writing seems works fine, while I met an alignment error:

```
(cudf_dev) qchen@dgx01:\~/GPUFileFormat-cudf/cpp/examples/parquet_io$ srun --ntasks=1 --cpus-per-task=16 --gres=gpu:1 --pty ./build/parquet_io lineitem-sf1-l_orderkey-INT-duckdb.parquet CASCADED-PLAIN-tpch1-l_orderkey-nocomp.parquet PLAIN CASCADED

Reading lineitem-sf1-l_orderkey-INT-duckdb.parquet…
Note: Not timing the initial parquet read as it may include
times for nvcomp, cufile loading and RMM growth.

Decompression scratch size: 0 bytes
Writing CASCADED-PLAIN-tpch1-l_orderkey-nocomp.parquet with encoding, compression and no page stats..
nvcompBatchedCascadedCompressAsync called
Elapsed Time: 62ms

Reading CASCADED-PLAIN-tpch1-l_orderkey-nocomp.parquet…
Using batched_decompress_get_temp_size_sync for temp size calculation
Decompression scratch size: 0 bytes
Cascaded decompression alignments -
terminate called after throwing an instance of ‘thrust::system::system_error’
what():  reduce failed to synchronize: cudaErrorMisalignedAddress: misaligned address
```

Here’s the output from compute-sanitizer:

```
========= Invalid __global__ read of size 4 bytes
=========     at void <unnamed>::cascaded_decompression_kernel_type_check<(int)8, unsigned long, (int)128, (int)4096>(int, const void *const *, const T2 *, void *const *, const T2 *, T2 *, nvcompStatus_t *)+0x2b0
=========     by thread (2,0,0) in block (1,0,0)
=========     Access at 0x7f493e1ab72a is misaligned
=========     and is inside the nearest allocation at 0x7f493e000000 of size 20.915.191.808 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaLaunchKernel [0x20df9d] in libnvcomp.so.5
=========         Host Frame: nvcomp::cascadedDecompressAsyncPart2(void const* const*, unsigned long const*, unsigned long const*, unsigned long*, unsigned long, void* const*, nvcompStatus_t*, CUstream_st*) [0xc1937] in libnvcomp.so.5
=========         Host Frame: nvcompBatchedCascadedDecompressAsync [0x1019f5] in libnvcomp.so.5
=========         Host Frame: cudf::io::detail::nvcomp::batched_decompress(cudf::io::detail::nvcomp::compression_type, cudf::device_span<cudf::device_span<unsigned char const, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::device_span<unsigned char, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::io::detail::codec_exec_result, 18446744073709551615ul>, unsigned long, unsigned long, rmm::cuda_stream_view) [0xc86041] in libcudf.so
=========         Host Frame: cudf::io::detail::decompress(cudf::io::compression_type, cudf::device_span<cudf::device_span<unsigned char const, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::device_span<unsigned char, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::io::detail::codec_exec_result, 18446744073709551615ul>, unsigned long, unsigned long, rmm::cuda_stream_view) [0xc926b1] in libcudf.so
=========         Host Frame: cudf::io::parquet::detail::decompress_page_data(cudf::host_span<cudf::io::parquet::detail::ColumnChunkDesc const, 18446744073709551615ul>, cudf::host_span<cudf::io::parquet::detail::PageInfo, 18446744073709551615ul>, cudf::host_span<cudf::io::parquet::detail::PageInfo, 18446744073709551615ul>, cudf::host_span<bool const, 18446744073709551615ul>, rmm::cuda_stream_view, rmm::detail::cccl_async_resource_ref<cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible> >) [0xf1c29c] in libcudf.so
=========         Host Frame: cudf::io::parquet::detail::reader_impl::setup_next_subpass(cudf::io::parquet::detail::reader_impl::read_mode) [0xf1129d] in libcudf.so
=========         Host Frame: cudf::io::parquet::detail::reader_impl::read() [0xf09384] in libcudf.so
=========         Host Frame: cudf::io::parquet::detail::reader::read() [0xefdf0f] in libcudf.so
=========         Host Frame: cudf::io::read_parquet(cudf::io::parquet_reader_options const&, rmm::cuda_stream_view, rmm::detail::cccl_async_resource_ref<cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible> >) [0xcca9fe] in libcudf.so
=========         Host Frame: read_parquet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) in parquet_io.cpp:51 [0xaced] in parquet_io
=========         Host Frame: main in parquet_io.cpp:169 [0x732a] in parquet_io
========= 
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  reduce failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 12835 errors
========= ERROR SUMMARY: 12735 errors were not printed. Use --print-limit option to adjust the number of printed errors
srun: error: dgx01.lab.dm.informatik.tu-darmstadt.de: task 0: Exited with exit code 6
```

I suspect there might be an issue with buffer allocation, alignment requirements, or pointer arithmetic when passing data to the cascadedDecompressAsync function, but I haven’t been able to pinpoint it. Any guidance or insights would be greatly appreciated!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`cudaErrorMisalignedAddress` caused by `nvcompBatchedCascadedDecompressAsync` #296

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

cudaErrorMisalignedAddress caused by nvcompBatchedCascadedDecompressAsync #296

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`cudaErrorMisalignedAddress` caused by `nvcompBatchedCascadedDecompressAsync` #296