Skip to content

cudaErrorMisalignedAddress caused by nvcompBatchedCascadedDecompressAsync #296

@qbacpey

Description

@qbacpey

Hello everyone, I met some problem when I tried to let cudf support non-Parquet standard Compressions (e.g. CASCADED from nvComp).

When I tried to let cudf able to read/write DEFLATE compressed Parquet, everything works find. But when I tried to replicate the same with CASCADED , the writing seems works fine, while I met an alignment error:

(cudf_dev) qchen@dgx01:\~/GPUFileFormat-cudf/cpp/examples/parquet_io$ srun --ntasks=1 --cpus-per-task=16 --gres=gpu:1 --pty ./build/parquet_io lineitem-sf1-l_orderkey-INT-duckdb.parquet CASCADED-PLAIN-tpch1-l_orderkey-nocomp.parquet PLAIN CASCADED

Reading lineitem-sf1-l_orderkey-INT-duckdb.parquet…
Note: Not timing the initial parquet read as it may include
times for nvcomp, cufile loading and RMM growth.

Decompression scratch size: 0 bytes
Writing CASCADED-PLAIN-tpch1-l_orderkey-nocomp.parquet with encoding, compression and no page stats..
nvcompBatchedCascadedCompressAsync called
Elapsed Time: 62ms

Reading CASCADED-PLAIN-tpch1-l_orderkey-nocomp.parquet…
Using batched_decompress_get_temp_size_sync for temp size calculation
Decompression scratch size: 0 bytes
Cascaded decompression alignments -
terminate called after throwing an instance of ‘thrust::system::system_error’
what():  reduce failed to synchronize: cudaErrorMisalignedAddress: misaligned address

Here’s the output from compute-sanitizer:

========= Invalid __global__ read of size 4 bytes
=========     at void <unnamed>::cascaded_decompression_kernel_type_check<(int)8, unsigned long, (int)128, (int)4096>(int, const void *const *, const T2 *, void *const *, const T2 *, T2 *, nvcompStatus_t *)+0x2b0
=========     by thread (2,0,0) in block (1,0,0)
=========     Access at 0x7f493e1ab72a is misaligned
=========     and is inside the nearest allocation at 0x7f493e000000 of size 20.915.191.808 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaLaunchKernel [0x20df9d] in libnvcomp.so.5
=========         Host Frame: nvcomp::cascadedDecompressAsyncPart2(void const* const*, unsigned long const*, unsigned long const*, unsigned long*, unsigned long, void* const*, nvcompStatus_t*, CUstream_st*) [0xc1937] in libnvcomp.so.5
=========         Host Frame: nvcompBatchedCascadedDecompressAsync [0x1019f5] in libnvcomp.so.5
=========         Host Frame: cudf::io::detail::nvcomp::batched_decompress(cudf::io::detail::nvcomp::compression_type, cudf::device_span<cudf::device_span<unsigned char const, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::device_span<unsigned char, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::io::detail::codec_exec_result, 18446744073709551615ul>, unsigned long, unsigned long, rmm::cuda_stream_view) [0xc86041] in libcudf.so
=========         Host Frame: cudf::io::detail::decompress(cudf::io::compression_type, cudf::device_span<cudf::device_span<unsigned char const, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::device_span<unsigned char, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::io::detail::codec_exec_result, 18446744073709551615ul>, unsigned long, unsigned long, rmm::cuda_stream_view) [0xc926b1] in libcudf.so
=========         Host Frame: cudf::io::parquet::detail::decompress_page_data(cudf::host_span<cudf::io::parquet::detail::ColumnChunkDesc const, 18446744073709551615ul>, cudf::host_span<cudf::io::parquet::detail::PageInfo, 18446744073709551615ul>, cudf::host_span<cudf::io::parquet::detail::PageInfo, 18446744073709551615ul>, cudf::host_span<bool const, 18446744073709551615ul>, rmm::cuda_stream_view, rmm::detail::cccl_async_resource_ref<cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible> >) [0xf1c29c] in libcudf.so
=========         Host Frame: cudf::io::parquet::detail::reader_impl::setup_next_subpass(cudf::io::parquet::detail::reader_impl::read_mode) [0xf1129d] in libcudf.so
=========         Host Frame: cudf::io::parquet::detail::reader_impl::read() [0xf09384] in libcudf.so
=========         Host Frame: cudf::io::parquet::detail::reader::read() [0xefdf0f] in libcudf.so
=========         Host Frame: cudf::io::read_parquet(cudf::io::parquet_reader_options const&, rmm::cuda_stream_view, rmm::detail::cccl_async_resource_ref<cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible> >) [0xcca9fe] in libcudf.so
=========         Host Frame: read_parquet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) in parquet_io.cpp:51 [0xaced] in parquet_io
=========         Host Frame: main in parquet_io.cpp:169 [0x732a] in parquet_io
========= 
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  reduce failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 12835 errors
========= ERROR SUMMARY: 12735 errors were not printed. Use --print-limit option to adjust the number of printed errors
srun: error: dgx01.lab.dm.informatik.tu-darmstadt.de: task 0: Exited with exit code 6

I suspect there might be an issue with buffer allocation, alignment requirements, or pointer arithmetic when passing data to the cascadedDecompressAsync function, but I haven’t been able to pinpoint it. Any guidance or insights would be greatly appreciated!

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions