Skip to content

feat: C++ binary deserialization for CLP connector handles#159

Open
20001020ycx wants to merge 20 commits into
release-0.297-edge-10-clp-connectorfrom
feat/2026-04-27-binary-deserialization-cpp
Open

feat: C++ binary deserialization for CLP connector handles#159
20001020ycx wants to merge 20 commits into
release-0.297-edge-10-clp-connectorfrom
feat/2026-04-27-binary-deserialization-cpp

Conversation

@20001020ycx
Copy link
Copy Markdown

@20001020ycx 20001020ycx commented Apr 27, 2026

Summary

Implement C++ binary deserialization for all 5 CLP connector handle types (ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle, ClpTransactionHandle) under ConnectorProtocol.

Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.

Background & Motivation

#158 added Java codecs that serialize connector handles to binary via a customSerializedValue field in the JSON message sent from coordinator to worker. This PR adds the C++ deserialization counterpart: when the native Velox worker receives a customSerializedValue, the bytes are routed to ClpConnectorProtocol::deserialize() and reconstructed into C++ structs.

Some notes on the implementation:

Previously ClpConnectorProtocol was defined via ConnectorProtocolTemplate<...>. This PR follows the pattern from prestodb#26650 by making it directly inherit from ConnectorProtocol. (The reason we don't reuse ConnectorProtocolTemplate<...> is that its deserializeTemplate is not yet usable — will consult with upstream developers on this.) This means we must implement all communication helpers ourselves: to_json/from_json for the JSON path, serialize/ deserialize for the binary path. Note, to_json and serialize can be just a VELOX_NYI stubs because worker only concerns with the deserialization, serialization is performed at Java codex.

The from_json implementations are copied from the chevron-generated presto_protocol_clp.cpp to preserve the JSON fallback path. Tim leaves these as VELOX_NYI stubs; however, our forked Presto/Velox shall support both JSON and binary serialization, pending the user's configuration of use-connector-provided-serialization-codecs. When we eventually migrate to the standalone shared library, the JSON fallback will no longer be needed. At that point to_json/from_json will become VELOX_NYI stubs (matching Tim's pattern), and only deserialize() will remain.

Note for Reviewers

This PR contains two commits. Please only review changes from commit 2 that is related to Clp connector. Commit 1 is under review by Tim in upstream Presto; we include it here so CI can run:

  1. Upstream integration — cherry-picked from feat(native): Add customSerializedValue routing for binary connector deserialization prestodb/presto#27652 (not yet merged to upstream). Adds the routing from deserializeTemplate to ConnectorProtocol::deserialize().
  2. CLP deserialization — C++ deserialization changes for CLP handles.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • All unit tests pass
  • End to end test
    • Query: SELECT CLP_GET_JSON_STRING() from clp.default.default limit 100
    • Set up: official clp package v0.12.0, Presto with the changes in this PR, Velox with deserialization of customSerializedValue , and minio as the storage.
    • The result is returned as expected with the binary serialization.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added CLP (Columnar Log Processing) connector protocol support, enabling connectivity to CLP data sources.
    • Enhanced connector protocol framework with improved binary serialization and deserialization capabilities for more efficient data handling.
  • Tests

    • Comprehensive test coverage added for CLP connector protocol operations.

Implement ConnectorCodecProvider with DataOutputStream-based binary
codecs for all 5 CLP handle types (ColumnHandle, Split, TableHandle,
TableLayoutHandle, TransactionHandle). This enables the C++ Velox
worker to deserialize handles via customSerializedValue without
depending on the Presto JSON protocol type system.
…ch style

Rename ClpCodecProvider → ClpConnectorCodecProvider to follow Tim's
TpchConnectorCodecProvider naming convention. Consolidate all 5 codec
classes into inner static classes within a single file. Adopt Tim's
variable naming style (cast to type-specific variable) and error
handling pattern (UncheckedIOException with plain try blocks).

Follows the approach proposed in prestodb#26650.
Move each codec class out of ClpConnectorCodecProvider into its own
file. ClpConnectorCodecProvider now only provides the connector-level
factory methods. Each codec still follows Tim's style from prestodb#26650
(cast to type-specific variable, UncheckedIOException, plain try).
…remove byte fixtures

- Delete TestClpCodecByteFixtures.java
- Rename TestClpCodecRoundtrip → TestClpConnectorCodecProvider
- Change kql in testSplitRoundtripArchiveWithKql to LEVEL:ERROR
- Remove testTableLayoutHandleMinimalRoundtrip
- Remove testCodecProviderReturnsAllCodecs
- Remove testEmptyStringRoundtrip
- Remove testEmptyOptionalRoundtrip
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a concrete CLP connector protocol with binary deserialization, extends ConnectorProtocol with new binary serialize/deserialize APIs for several handle types, and adds a JSON fast-path that Base64-decodes customSerializedValue and delegates to connector-specific binary deserializers for multiple connector handle types.

Changes

Cohort / File(s) Summary
Build & CLP protocol
presto-native-execution/presto_cpp/presto_protocol/CMakeLists.txt, presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h, presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp
Added ClpConnectorProtocol.cpp to build. Replaced the template alias with a concrete ClpConnectorProtocol class and implemented binary deserialize methods for ColumnHandle, ConnectorSplit, ConnectorTableHandle, ConnectorTableLayoutHandle, and ConnectorTransactionHandle (big-endian reads, optional fields handling).
Core protocol interface
presto-native-execution/presto_cpp/presto_protocol/core/ConnectorProtocol.h
Extended ConnectorProtocol with pure virtual serialize/deserialize for ColumnHandle, ConnectorPartitioningHandle, and ConnectorIndexHandle; updated ConnectorProtocolTemplate to override and forward these to existing template helpers.
Core generated logic & fast-path
presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpp, presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.h
Added a customSerializedValue JSON fast-path that Base64-decodes payloads and calls connector protocol deserialize, short-circuiting existing from_json flows. Introduced stub serialize/deserialize static methods (VELOX_NYI) for ColumnHandle, ConnectorPartitioningHandle, and ConnectorIndexHandle in the header.
Handle specializations (JSON fast-paths & stubs)
presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.hpp.inc, .../ColumnHandle.cpp.inc, .../ConnectorPartitioningHandle.hpp.inc, .../ConnectorPartitioningHandle.cpp.inc, .../ConnectorIndexHandle.hpp.inc, .../ConnectorIndexHandle.cpp.inc, .../ConnectorSplit.cpp.inc, .../ConnectorTableHandle.cpp.inc, .../ConnectorTableLayoutHandle.cpp.inc, .../ConnectorTransactionHandle.cpp.inc, .../ConnectorInsertTableHandle.cpp.inc, .../ConnectorOutputTableHandle.cpp.inc, .../ConnectorDeleteTableHandle.cpp.inc
For many connector handle types added customSerializedValue support in from_json: validate non-internal $ types, Base64-decode, and delegate to connector protocol deserialize, returning early. Added JSON conversion declarations and VELOX_NYI static serialize/deserialize stubs for new handle types (column/partitioning/index).
Tests
presto-native-execution/presto_cpp/presto_protocol/tests/CMakeLists.txt, presto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp
Added test build entry and new unit tests exercising CLP binary deserialization for ColumnHandle, ConnectorTableHandle, ConnectorTableLayoutHandle (with/without optional fields), ConnectorSplit (various types), and ConnectorTransactionHandle decoding.

Sequence Diagram

sequenceDiagram
    participant Client
    participant JSONParser as JSON Parser
    participant Validator as Type Validator
    participant Base64Dec as Base64 Decoder
    participant ConnProtocol as Connector Protocol
    participant HandleObj as Handle Object

    Client->>JSONParser: from_json(jsonData, handle)
    alt customSerializedValue present
        JSONParser->>Validator: Extract and check `@type` (non-internal)
        alt invalid (starts with $)
            Validator-->>JSONParser: VELOX_CHECK fails
        else valid external type
            JSONParser->>Base64Dec: Decode customSerializedValue
            Base64Dec-->>JSONParser: Binary payload
            JSONParser->>ConnProtocol: deserialize(binaryPayload, handle)
            ConnProtocol->>HandleObj: Parse binary fields (big-endian reads, optionals)
            HandleObj-->>ConnProtocol: Populated handle
            ConnProtocol-->>JSONParser: handle populated
            JSONParser-->>Client: Return populated handle
        end
    else customSerializedValue absent
        JSONParser->>ConnProtocol: from_json(jsonData, handle)
        ConnProtocol-->>JSONParser: handle populated
        JSONParser-->>Client: Return populated handle
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.54% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding C++ binary deserialization for CLP connector handles, which is the primary objective of this PR.
Description check ✅ Passed The description comprehensively covers all required template sections: Summary, Background & Motivation, implementation notes, reviewer guidance with commit details, a filled checklist, and validation results demonstrating the feature works end-to-end.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/2026-04-27-binary-deserialization-cpp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java`:
- Around line 39-60: The codec uses ClpSplit.SplitType.ordinal() when writing
and reads back by indexing SplitType.values()[ordinal], which couples wire
format to enum order and can throw ArrayIndexOutOfBoundsException on bad data;
update ClpSplitCodec.serialize to write a stable identifier (e.g., the enum name
via writeUTF(split.getType().name())) and update deserialize to read that
identifier and convert back using SplitType.valueOf(name) inside a try/catch
that throws a clear IOException/UncheckedIOException on invalid value, or if you
prefer keep ordinals then validate the int read against
SplitType.values().length and throw a descriptive IOException before indexing;
apply these changes in the serialize/deserialize methods referencing
ClpSplitCodec, serialize, deserialize, and ClpSplit.SplitType.

In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.java`:
- Around line 24-35: Ensure the codec enforces the empty-payload contract: in
ClpTransactionHandleCodec.serialize(ConnectorTransactionHandle handle) validate
that handle == ClpTransactionHandle.INSTANCE (or is instance of
ClpTransactionHandle and equal to the singleton) and throw an informative
IllegalArgumentException if not; in ClpTransactionHandleCodec.deserialize(byte[]
bytes) validate that bytes.length == 0 and throw an IllegalArgumentException (or
appropriate Presto exception) when non-empty payloads are received, otherwise
return ClpTransactionHandle.INSTANCE.

In
`@presto-clp/src/test/java/com/facebook/presto/plugin/clp/codec/TestClpConnectorCodecProvider.java`:
- Around line 95-118: Add two negative unit tests in
TestClpConnectorCodecProvider that assert deserialize rejects malformed
payloads: (1) create a ClpSplitCodec-encoded byte array with an invalid
split-type ordinal and assert ClpSplitCodec.deserialize throws (or returns an
error) instead of succeeding; (2) create a payload that includes a non-empty
transaction payload when decoded and assert ClpSplitCodec.deserialize rejects
it. Use the ClpSplitCodec.deserialize and ClpSplit/ClpSplit.SplitType symbols to
locate encoding/decoding logic and assert the expected exception/validation
failure for each malformed case to lock the protocol guards.

In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Around line 57-66: The deserializers (e.g., ClpConnectorProtocol::deserialize
that constructs ClpColumnHandle and the other deserialize overloads for table,
split, transaction handles) currently ignore trailing bytes and must validate
that the read offset consumed all input; after all readUTF/readX calls in each
deserialize (e.g., after populating
ClpColumnHandle::columnName/originalColumnName/columnType) add a check if
(offset != data.size()) throw std::runtime_error("Trailing bytes in
ClpConnectorProtocol deserialize"); to fail-fast on malformed or
forward-compatible payloads so any extra/unexpected bytes are rejected rather
than silently ignored.
- Around line 23-33: The readUTF function currently copies raw bytes rather than
decoding Java's modified UTF-8; update readUTF in ClpConnectorProtocol.cpp to
(1) read the 2-byte big-endian length as before, then iterate over the next len
bytes decoding Java-modified-UTF: treat the two-byte sequence 0xC0 0x80 as
U+0000, decode standard 1-, 2-, and 3-byte modified-UTF sequences to Unicode
code points, handle surrogate pairs (decode high and low UTF-16 surrogates
emitted as two 3-byte sequences and combine into a single supplementary code
point), and re-encode resulting code points into proper UTF-8 bytes appended
into the result std::string; keep existing bounds checks using offset and ensure
offset advances exactly by len before returning.
- Around line 23-40: The code in readUTF and readInt uses reinterpret_cast to
load multi-byte integers (violating alignment/aliasing) and __builtin_bswap*
which assumes a particular host endianness; replace those with byte-wise loads
and portable assembly: in readUTF, read the two length bytes individually from
data[offset] and data[offset+1], assemble a uint16_t length in big-endian
(length = (b0<<8)|(b1)), advance offset, bounds-check length, then copy the UTF
bytes; in readInt, read four bytes individually and assemble an int32_t in
big-endian ((b0<<24)|(b1<<16)|(b2<<8)|b3)), advance offset; keep existing
VELOX_CHECK_LE checks and use uint8_t temporaries to avoid unaligned
dereferences. Ensure signedness conversion is handled when producing int32_t.

In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h`:
- Around line 74-222: The CLP protocol currently only supports binary
serialization (the to_json/from_json/serialize overrides in
ClpConnectorProtocol.h throw VELOX_NYI) so before enabling this protocol you
must update the Java coordinator to populate the customSerializedValue field on
all CLP handle types; coordinate deployment so every path that creates
ConnectorTableHandle, ConnectorTableLayoutHandle, ConnectorInsertTableHandle,
ConnectorOutputTableHandle, ConnectorSplit, ConnectorTransactionHandle,
ConnectorDeleteTableHandle, ConnectorIndexHandle,
ConnectorDistributedProcedureHandle, ColumnHandle, and
ConnectorPartitioningHandle sets customSerializedValue, run integration tests to
validate workers receive those binary payloads, and optionally add a defensive
startup/runtime validation in the C++ worker to detect missing
customSerializedValue and fail fast with a clear message referencing
ClpConnectorProtocol.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ca7f75ed-b4dc-46a8-82f1-207ff3e723f6

📥 Commits

Reviewing files that changed from the base of the PR and between d10e6a4 and 128d69f.

📒 Files selected for processing (11)
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnector.java
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpConnectorCodecProvider.java
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.java
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableLayoutHandleCodec.java
  • presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.java
  • presto-clp/src/test/java/com/facebook/presto/plugin/clp/codec/TestClpConnectorCodecProvider.java
  • presto-native-execution/presto_cpp/presto_protocol/CMakeLists.txt
  • presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp
  • presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h

@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch from 128d69f to 87153d6 Compare April 28, 2026 20:47
…HandleCodec

Add wire format comments to each codec class cross-referencing the C++
ClpConnectorProtocol::deserialize() counterpart. Remove redundant
comments from ClpTransactionHandleCodec.
@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch 4 times, most recently from 5bbc99c to 80ad8a2 Compare April 29, 2026 18:23
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp (1)

23-40: ⚠️ Potential issue | 🔴 Critical

Implement a real Java writeUTF reader and remove the unaligned integer loads.

Line 23 treats the writeUTF payload as already-UTF-8 and just copies the bytes, but Java uses modified UTF-8 here. That means \u0000 and supplementary characters will deserialize incorrectly. On top of that, Lines 26-39 still read uint16_t / uint32_t through reinterpret_cast from arbitrary byte offsets, which is undefined behaviour on strict-alignment targets and bakes in a host-endianness assumption. Please switch these helpers to byte-wise big-endian reads and a proper modified-UTF decoder before relying on this wire format.

What does Java DataOutputStream.writeUTF encode for NUL characters and supplementary characters (modified UTF-8), and is reading uint16_t/uint32_t from arbitrary char-buffer offsets via reinterpret_cast well-defined in C++?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`
around lines 23 - 40, readUTF currently assumes payload is standard UTF-8 and
also uses reinterpret_cast-based reads which are undefined on unaligned buffers;
change readUTF to parse the Java modified-UTF-8 format per DataInput.readUTF
(read the 2-byte big-endian length via byte-wise reads, then decode
modified-UTF-8 where NUL is encoded as 0xC0 0x80 and supplementary chars are
represented via surrogate pairs encoded as six bytes) producing a proper UTF-8
std::string (decode into UTF-16 code units then to UTF-8), and replace the
reinterpret_cast-based integer reads in readInt (and any similar helpers) with
explicit byte-wise big-endian assembly (e.g., (b0<<24)|(b1<<16)|(b2<<8)|b3)
while advancing offset safely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h`:
- Around line 67-80: ClpConnectorProtocol is missing the required override for
ColumnHandle::serialize causing the class to remain abstract; add a method with
the signature serialize(const std::shared_ptr<ColumnHandle>&, std::string&)
const override to ClpConnectorProtocol (next to the existing deserialize,
to_json, from_json for ColumnHandle) and implement it either by performing the
actual serialization used by other handle types or stubbing with VELOX_NYI
(matching the pattern used by the other serialize overrides in this file) so the
class compiles and can be instantiated/registered.

In
`@presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpp`:
- Around line 1134-1143: Centralize the repeated "customSerializedValue" branch
by extracting the guard/decode/dispatch logic into a shared helper used by the
codegen template (e.g., a function like handleCustomSerializedValue(json j,
const std::string& type, PrestoProtocolType& p) or a generator template partial)
so every overload calls that helper instead of duplicating the block; the helper
should check j.contains("customSerializedValue"), perform the
VELOX_CHECK(!type.empty() && type[0] != '$', ...), decode with
velox::encoding::Base64::decode(j["customSerializedValue"].get<std::string>()),
call getConnectorProtocol(type).deserialize(binaryData, p), and indicate (via
return bool or early return) that the branch was handled, then replace
occurrences in functions that currently inline the block (the ones around
getConnectorProtocol/type/deserialize uses) to call this helper.
- Around line 6227-6233: The ColumnHandle customSerializedValue path uses
j["@type"] directly for connector dispatch
(getConnectorProtocol(type).deserialize) but lacks the internal-handle guard
used elsewhere; modify the block that handles customSerializedValue for
ColumnHandle to check the extracted type (from j["@type"]) and reject or error
out when it represents an internal handle (e.g., starts with '$' or matches the
same internal-handle predicate used in other branches) before calling
getConnectorProtocol(type).deserialize, mirroring the guard and error handling
used in the other new branches to avoid routing internal handles into connector
protocol resolution.

In
`@presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc`:
- Around line 24-30: The current customSerializedValue branch directly reads
j["@type"] and calls getConnectorProtocol(type).deserialize(...), bypassing the
shared subtype-resolution and internal-type guard; update the ColumnHandle
deserialization branch to first run the existing subtype resolution/validation
routine (call getSubclassKey(...) or the shared helper used by other handle
deserializers) to obtain/validate the concrete subtype key, then use that
resolved key when calling
getConnectorProtocol(resolvedKey).deserialize(binaryData, p); this preserves the
common parsing/validation flow and reenables the internal-type guard while
keeping the customSerializedValue base64 decode and deserialize steps intact.

---

Duplicate comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Around line 23-40: readUTF currently assumes payload is standard UTF-8 and
also uses reinterpret_cast-based reads which are undefined on unaligned buffers;
change readUTF to parse the Java modified-UTF-8 format per DataInput.readUTF
(read the 2-byte big-endian length via byte-wise reads, then decode
modified-UTF-8 where NUL is encoded as 0xC0 0x80 and supplementary chars are
represented via surrogate pairs encoded as six bytes) producing a proper UTF-8
std::string (decode into UTF-16 code units then to UTF-8), and replace the
reinterpret_cast-based integer reads in readInt (and any similar helpers) with
explicit byte-wise big-endian assembly (e.g., (b0<<24)|(b1<<16)|(b2<<8)|b3)
while advancing offset safely.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 91d7b2f6-521d-4dc7-aa45-842df98d585c

📥 Commits

Reviewing files that changed from the base of the PR and between 87153d6 and 80ad8a2.

📒 Files selected for processing (19)
  • presto-native-execution/presto_cpp/presto_protocol/CMakeLists.txt
  • presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp
  • presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h
  • presto-native-execution/presto_cpp/presto_protocol/core/ConnectorProtocol.h
  • presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpp
  • presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.h
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.hpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorDeleteTableHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorIndexHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorIndexHandle.hpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorInsertTableHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorOutputTableHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorPartitioningHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorPartitioningHandle.hpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorSplit.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTableHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTableLayoutHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTransactionHandle.cpp.inc

@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch 3 times, most recently from ced2480 to 02e75a4 Compare April 29, 2026 19:50
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Around line 25-33: The helper readers (readUTF, readInt, readBoolean)
currently perform in.read(...) calls without validating the stream, which allows
truncated input to produce partial/uninitialized data; update each function to
verify the stream read succeeded by checking in.fail() or in.gcount() after
every read and handle errors by throwing a descriptive exception (e.g.,
std::runtime_error) when the number of bytes read does not match the expected
size; specifically: in readUTF, after reading the length field verify
sizeof(length) bytes were read, then after reading the string bytes verify
gcount() == length before returning (still apply folly::Endian::big to length),
and in readInt/readBoolean verify their single/multi-byte reads succeeded before
converting/returning the value.

In
`@presto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp`:
- Around line 137-152: The test TestSplitDeserializationArchive in
ClpConnectorProtocolTest omits the kqlQuery presence boolean so the deserializer
reads past the payload; update the test payload construction (after
writeInt(oss, 0) // ARCHIVE) to write the kqlQuery presence flag (use the same
helper used elsewhere, e.g., writeBool or writeByte for kqlQueryPresent) so the
wire format matches ClpSplit's expected layout, then run protocol.deserialize
into handle and assert as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 777a9e39-c819-490b-a0a0-ef471a571dff

📥 Commits

Reviewing files that changed from the base of the PR and between 80ad8a2 and 02e75a4.

📒 Files selected for processing (5)
  • presto-native-execution/presto_cpp/presto_protocol/CMakeLists.txt
  • presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp
  • presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h
  • presto-native-execution/presto_cpp/presto_protocol/tests/CMakeLists.txt
  • presto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp

@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch from 02e75a4 to 7aa894a Compare April 29, 2026 20:08
Cherry-picked from upstream PR prestodb#27652 (Tim's approach for Tpch connector).

Adds serialize/deserialize pure virtuals to ConnectorProtocol for all
handle types and customSerializedValue routing in .cpp.inc files.
@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch from 7aa894a to b883177 Compare April 29, 2026 20:30
@20001020ycx 20001020ycx requested a review from gibber9809 April 29, 2026 21:25
Extract writeTableHandle/readTableHandle so ClpTableLayoutHandleCodec
can share the same DataOutputStream/DataInputStream instead of
duplicating the table handle serialization fields.
Add instanceof checks before casting in serialize() for clearer error
messages on type mismatch. Add trailing-byte checks in deserialize()
to reject unexpected extra data after valid encoded fields.
Java's writeUTF uses Modified UTF-8 which encodes \0 and supplementary
characters differently from standard UTF-8. Since clp-s assumes real
UTF-8 for string comparisons, using writeUTF would break searchability
for strings containing non-ASCII content.

Add CodecUtils with writeUtf8String/readUtf8String helpers that use
StandardCharsets.UTF_8, keeping the same wire format (2-byte BE length
prefix + encoded bytes).
@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch 2 times, most recently from 14625fc to bab7a69 Compare April 30, 2026 19:52
@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch from bab7a69 to 9c3bf3e Compare April 30, 2026 20:01
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (2)
presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc (1)

24-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the shared subtype-resolution path before custom binary deserialization.

At Line 24, this fast-path bypasses getSubclassKey(...) and directly trusts j["@type"]. That diverges from the other handle deserializers, skips the internal $... type guard, and can produce inconsistent parse failures when @type is missing or malformed.

Patch to align `ColumnHandle` with the common deserialization pattern
 void from_json(const json& j, std::shared_ptr<ColumnHandle>& p) {
-  if (j.contains("customSerializedValue")) {
-    String type = j["@type"];
-    std::string binaryData =
-        velox::encoding::Base64::decode(j["customSerializedValue"].get<std::string>());
-    getConnectorProtocol(type).deserialize(binaryData, p);
-    return;
-  }
   String type;
   try {
     type = p->getSubclassKey(j);
   } catch (json::parse_error& e) {
     throw ParseError(std::string(e.what()) + " ColumnHandle  ColumnHandle");
   }
+  if (j.contains("customSerializedValue")) {
+    VELOX_CHECK(
+        !type.empty() && type[0] != '$',
+        "Internal handle type '{}' should not have customSerializedValue",
+        type);
+    std::string binaryData =
+        velox::encoding::Base64::decode(j["customSerializedValue"].get<std::string>());
+    getConnectorProtocol(type).deserialize(binaryData, p);
+    return;
+  }
   getConnectorProtocol(type).from_json(j, p);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc`
around lines 24 - 30, The customSerializedValue branch in ColumnHandle.cpp.inc
bypasses the shared subtype resolution and directly reads j["@type"], which can
miss the internal $... guard used by getSubclassKey; change this branch to first
call getSubclassKey(j) (or the same shared subtype-resolution helper used by
other handle deserializers) to obtain the canonical connector/type key, then
pass that key to getConnectorProtocol(...) and call deserialize on the decoded
binary; ensure you still decode j["customSerializedValue"] via Base64 before
calling getConnectorProtocol(...).
presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp (1)

66-78: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject trailing bytes (and non-empty transaction payload) after deserialization.

Current decode paths accept extra bytes silently. That masks malformed payloads and schema drift. Also, transaction handle is documented as empty payload but currently doesn’t enforce that.

Suggested fix pattern
 namespace {
+void ensureFullyConsumed(
+    std::istringstream& in,
+    const std::string& binaryData,
+    const char* handleName) {
+  VELOX_CHECK(
+      in.tellg() == static_cast<std::streampos>(binaryData.size()),
+      "Trailing bytes in {} payload",
+      handleName);
+}
 } // namespace

 void ClpConnectorProtocol::deserialize(
     const std::string& binaryData,
     std::shared_ptr<ColumnHandle>& proto) const {
@@
   handle->_type = "clp";
+  ensureFullyConsumed(in, binaryData, "ClpColumnHandle");
   proto = handle;
 }

Apply the same check in split/table/table-layout paths, and in transaction path add:

+  VELOX_CHECK(binaryData.empty(), "ClpTransactionHandle payload must be empty");

Also applies to: 87-102, 110-120, 130-152, 157-165

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`
around lines 66 - 78, Ensure deserialization rejects trailing bytes and enforces
empty transaction payloads: after reading expected fields in
ClpConnectorProtocol::deserialize (and the analogous deserialize implementations
at the other ranges), check that the input stream has no remaining data (e.g.,
verify in.rdbuf()->in_avail() == 0 or in.peek() == EOF) and if not, throw/return
an error (e.g., throw std::runtime_error with a clear message about trailing
bytes). For the transaction-path deserialize, additionally enforce that
binaryData is empty (or stream contains zero bytes) and fail if any payload is
present. Update all referenced deserialize methods (the one creating
ClpColumnHandle and the other deserialize functions mentioned in the review) to
perform this post-read validation and emit a clear error when extra bytes are
found.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Line 28: Change the boolean style used in the VELOX_CHECK conditions to the
repository-preferred form: replace occurrences of "!in.fail()" with "false ==
in.fail()" (and similarly any other negated checks like "!someExpr" in the same
function) so the VELOX_CHECK calls (e.g., the VELOX_CHECK that currently reads
VELOX_CHECK(!in.fail(), ...) and the other instances noted around the same
block) use the "false == <expression>" style consistently.
- Around line 94-96: Validate the integer returned by readInt(in) before casting
to SplitType: after reading typeOrdinal in ClpConnectorProtocol.cpp, check that
it maps to a known SplitType value (e.g., within the enum's defined ordinal
range or via a helper like isValidSplitType) and if it's not valid, fail fast
(throw an exception or return an error) instead of assigning handle->type;
update the code paths that use typeOrdinal/handle->type (reading code around
readInt, typeOrdinal, and handle->type) to perform this guard so invalid wire
values cannot produce an undefined enum state.

In
`@presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpp`:
- Around line 1135-1136: Replace negated emptiness checks using the '!' operator
with the repo-preferred form 'false == type.empty()' in the VELOX_CHECK guards.
Specifically, change occurrences like VELOX_CHECK(!type.empty() && type[0] !=
'$', ...) to use VELOX_CHECK(false == type.empty() && type[0] != '$', ...), and
apply the same pattern for all other instances where '!type.empty()' appears in
VELOX_CHECK (notably the other occurrences reported). Keep the rest of each
condition (e.g., checks against type[0]) unchanged so behavior is preserved.

In
`@presto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp`:
- Around line 63-65: The test verifies ClpColumnHandle names but omits asserting
the serialized columnType; update the test in ClpConnectorProtocolTest (where
clpHandle is inspected) to also EXPECT_EQ clpHandle->columnType with the
expected type value for the column you serialized (e.g., the same ColumnType
used when constructing/serializing the ClpColumnHandle); place this assertion
alongside the existing EXPECT_EQ checks for columnName and originalColumnName so
the deserialized columnType is validated as well.
- Around line 133-134: The test dereferences optional pointer members directly
(clpHandle->kqlQuery and clpHandle->metadataFilterQuery); add explicit
null/value assertions before comparing: first ASSERT_TRUE(clpHandle->kqlQuery)
(or ASSERT_TRUE(clpHandle->kqlQuery.has_value()) if using std::optional) and
ASSERT_TRUE(clpHandle->metadataFilterQuery) and then keep the EXPECT_EQ checks
comparing *clpHandle->kqlQuery to "level == 'error'" and
*clpHandle->metadataFilterQuery to "timestamp > 1000" so failures are
deterministic and clearly reported.

---

Duplicate comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Around line 66-78: Ensure deserialization rejects trailing bytes and enforces
empty transaction payloads: after reading expected fields in
ClpConnectorProtocol::deserialize (and the analogous deserialize implementations
at the other ranges), check that the input stream has no remaining data (e.g.,
verify in.rdbuf()->in_avail() == 0 or in.peek() == EOF) and if not, throw/return
an error (e.g., throw std::runtime_error with a clear message about trailing
bytes). For the transaction-path deserialize, additionally enforce that
binaryData is empty (or stream contains zero bytes) and fail if any payload is
present. Update all referenced deserialize methods (the one creating
ClpColumnHandle and the other deserialize functions mentioned in the review) to
perform this post-read validation and emit a clear error when extra bytes are
found.

In
`@presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc`:
- Around line 24-30: The customSerializedValue branch in ColumnHandle.cpp.inc
bypasses the shared subtype resolution and directly reads j["@type"], which can
miss the internal $... guard used by getSubclassKey; change this branch to first
call getSubclassKey(j) (or the same shared subtype-resolution helper used by
other handle deserializers) to obtain the canonical connector/type key, then
pass that key to getConnectorProtocol(...) and call deserialize on the decoded
binary; ensure you still decode j["customSerializedValue"] via Base64 before
calling getConnectorProtocol(...).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 39a54195-a6ee-4d9c-b0ee-9d13956282ac

📥 Commits

Reviewing files that changed from the base of the PR and between 02e75a4 and bab7a69.

📒 Files selected for processing (21)
  • presto-native-execution/presto_cpp/presto_protocol/CMakeLists.txt
  • presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp
  • presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h
  • presto-native-execution/presto_cpp/presto_protocol/core/ConnectorProtocol.h
  • presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpp
  • presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.h
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.hpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorDeleteTableHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorIndexHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorIndexHandle.hpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorInsertTableHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorOutputTableHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorPartitioningHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorPartitioningHandle.hpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorSplit.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTableHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTableLayoutHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTransactionHandle.cpp.inc
  • presto-native-execution/presto_cpp/presto_protocol/tests/CMakeLists.txt
  • presto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp

@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch from 9c3bf3e to a9fe2dc Compare April 30, 2026 20:58
Implement binary deserialization for all 5 CLP connector handle types
(ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle,
ClpTransactionHandle) by inheriting ConnectorProtocol and overriding
deserialize().

Reads binary bytes (Java DataOutputStream big-endian wire format from
#158) directly into existing protocol structs using std::istringstream
and folly::Endian::big.

Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.
@20001020ycx 20001020ycx force-pushed the feat/2026-04-27-binary-deserialization-cpp branch from a9fe2dc to 01291e9 Compare April 30, 2026 21:01
…CodecUtils.java

Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
Copy link
Copy Markdown

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, just a few questions/nits to clear up. Only reviewed changes from the second commit per sidebar.

handle->columnName = readUTF(in);
handle->originalColumnName = readUTF(in);
handle->columnType = readUTF(in);
handle->_type = "clp";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just judging by naming, it seems like _type is supposed to be private? If so, should we assign it in the constructor of ClpColumnHandle instead of here, since it doesn't seem to depend on the serialized data?

Copy link
Copy Markdown
Author

@20001020ycx 20001020ycx May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_type is a public member inherited from a base class for all protocol handle types in Presto's protocol layer.

In the current fork version, the ClpColumnHandle's constructor are auto-generated by the Presto Native Worker Protocol code generation (chevron templates, we briefly discussed this component in #158 ), so we can't currently customize the constructor. This is the exact motivation we have to use binary serialization: to decouple the Presto Native Worker Protocol codegen from the communication from coordinator to worker.

Once we are migrated into the standalone plugin where we have the freedom of defining these types ourselves, I agree the assignment should live in the constructor. Will make a note on this. Thanks for the good catch.

if (readBoolean(in)) {
handle->kqlQuery = std::make_shared<String>(readUTF(in));
}
handle->_type = "clp";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, wondering if this should be in the constructor of ClpSplit.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto :)

20001020ycx and others added 5 commits May 1, 2026 14:06
…lp/ClpConnectorProtocol.cpp

Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
…lp/ClpConnectorProtocol.cpp

Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
…x/presto into feat/2026-04-27-binary-deserialization-cpp
…om:y-scope/presto into feat/2026-04-27-binary-deserialization-cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants