@Shekharrajak commented Dec 28, 2025

Handle Spark's full serialization format (12-byte header + bits) in merge_filter() to support mixed plans where Spark runs the partial aggregate and Comet runs the final one. The fix automatically detects the format and extracts the bits data accordingly.

Fixes #2889: Bloom filter intermediate aggregate buffers are not compatible between Spark and Comet

Rationale for this change

- Spark's serialize() returns the full format: a 12-byte header (version + numHashFunctions + numWords) followed by the bits data.
- Comet's state_as_bytes() returns the bits data only.
- When a Spark partial aggregate sends the full format, Comet's merge_filter() expects bits-only input, causing a mismatch.

Ref https://github.com/apache/spark/blob/master/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java#L99

Ref https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L219

Spark format: BloomFilterImpl.writeTo() writes the version and numHashFunctions (4 + 4 bytes), then BitArray.writeTo() writes numWords (4 bytes) followed by the bits.
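
For reference, a minimal Rust sketch of that layout. The function name is hypothetical and not part of Comet; on the Spark side these fields are written through a big-endian DataOutputStream.

```rust
// Hypothetical sketch of the byte layout Spark produces, for illustration only.
// BloomFilterImpl.writeTo() emits version and numHashFunctions, then
// BitArray.writeTo() emits numWords and the long words, all big-endian.
fn spark_serialized_bloom_filter(
    version: i32,
    num_hash_functions: i32,
    words: &[i64],
) -> Vec<u8> {
    let mut buf = Vec::with_capacity(12 + words.len() * 8);
    buf.extend_from_slice(&version.to_be_bytes()); // 4 bytes: version
    buf.extend_from_slice(&num_hash_functions.to_be_bytes()); // 4 bytes: numHashFunctions
    buf.extend_from_slice(&(words.len() as i32).to_be_bytes()); // 4 bytes: numWords
    for w in words {
        buf.extend_from_slice(&w.to_be_bytes()); // 8 bytes per word: the bits data
    }
    buf
}
```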

What changes are included in this PR?

- Detects the Spark format (buffer size = 12 + expected_bits_size)
- Extracts the bits data by skipping the 12-byte header when the Spark format is detected
- Returns the bits as-is when the buffer is already in the Comet format (see the sketch below)
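
A minimal sketch of that detection logic, assuming a hypothetical extract_bits helper rather than Comet's actual merge_filter() signature:

```rust
// Hypothetical helper illustrating the format detection described above.
const SPARK_HEADER_LEN: usize = 12; // version (4) + numHashFunctions (4) + numWords (4)

fn extract_bits(buf: &[u8], expected_bits_size: usize) -> &[u8] {
    if buf.len() == SPARK_HEADER_LEN + expected_bits_size {
        // Spark full format: skip the 12-byte header to reach the bits data.
        &buf[SPARK_HEADER_LEN..]
    } else {
        // Comet format: the buffer already contains only the bits data.
        buf
    }
}
```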

How are these changes tested?

Spark SQL test
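
For illustration, a hedged Rust round-trip sketch using the two hypothetical helpers above; the PR's actual coverage is the Spark SQL test, not this unit test.

```rust
// Round-trip sketch: a Spark-format buffer has its header skipped, while a
// Comet-format (bits-only) buffer passes through unchanged.
#[test]
fn spark_header_is_skipped() {
    let words = vec![0i64, i64::MAX];
    let spark_buf = spark_serialized_bloom_filter(1, 3, &words);
    let bits = extract_bits(&spark_buf, words.len() * 8);
    assert_eq!(bits.len(), words.len() * 8);

    let comet_buf = bits.to_vec();
    assert_eq!(extract_bits(&comet_buf, comet_buf.len()), &comet_buf[..]);
}
```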
