feat: C++ binary deserialization for CLP connector handles#159
feat: C++ binary deserialization for CLP connector handles#15920001020ycx wants to merge 20 commits into
Conversation
Implement ConnectorCodecProvider with DataOutputStream-based binary codecs for all 5 CLP handle types (ColumnHandle, Split, TableHandle, TableLayoutHandle, TransactionHandle). This enables the C++ Velox worker to deserialize handles via customSerializedValue without depending on the Presto JSON protocol type system.
…ch style Rename ClpCodecProvider → ClpConnectorCodecProvider to follow Tim's TpchConnectorCodecProvider naming convention. Consolidate all 5 codec classes into inner static classes within a single file. Adopt Tim's variable naming style (cast to type-specific variable) and error handling pattern (UncheckedIOException with plain try blocks). Follows the approach proposed in prestodb#26650.
Move each codec class out of ClpConnectorCodecProvider into its own file. ClpConnectorCodecProvider now only provides the connector-level factory methods. Each codec still follows Tim's style from prestodb#26650 (cast to type-specific variable, UncheckedIOException, plain try).
…remove byte fixtures - Delete TestClpCodecByteFixtures.java - Rename TestClpCodecRoundtrip → TestClpConnectorCodecProvider - Change kql in testSplitRoundtripArchiveWithKql to LEVEL:ERROR - Remove testTableLayoutHandleMinimalRoundtrip - Remove testCodecProviderReturnsAllCodecs - Remove testEmptyStringRoundtrip - Remove testEmptyOptionalRoundtrip
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a concrete CLP connector protocol with binary deserialization, extends ConnectorProtocol with new binary serialize/deserialize APIs for several handle types, and adds a JSON fast-path that Base64-decodes Changes
Sequence DiagramsequenceDiagram
participant Client
participant JSONParser as JSON Parser
participant Validator as Type Validator
participant Base64Dec as Base64 Decoder
participant ConnProtocol as Connector Protocol
participant HandleObj as Handle Object
Client->>JSONParser: from_json(jsonData, handle)
alt customSerializedValue present
JSONParser->>Validator: Extract and check `@type` (non-internal)
alt invalid (starts with $)
Validator-->>JSONParser: VELOX_CHECK fails
else valid external type
JSONParser->>Base64Dec: Decode customSerializedValue
Base64Dec-->>JSONParser: Binary payload
JSONParser->>ConnProtocol: deserialize(binaryPayload, handle)
ConnProtocol->>HandleObj: Parse binary fields (big-endian reads, optionals)
HandleObj-->>ConnProtocol: Populated handle
ConnProtocol-->>JSONParser: handle populated
JSONParser-->>Client: Return populated handle
end
else customSerializedValue absent
JSONParser->>ConnProtocol: from_json(jsonData, handle)
ConnProtocol-->>JSONParser: handle populated
JSONParser-->>Client: Return populated handle
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java`:
- Around line 39-60: The codec uses ClpSplit.SplitType.ordinal() when writing
and reads back by indexing SplitType.values()[ordinal], which couples wire
format to enum order and can throw ArrayIndexOutOfBoundsException on bad data;
update ClpSplitCodec.serialize to write a stable identifier (e.g., the enum name
via writeUTF(split.getType().name())) and update deserialize to read that
identifier and convert back using SplitType.valueOf(name) inside a try/catch
that throws a clear IOException/UncheckedIOException on invalid value, or if you
prefer keep ordinals then validate the int read against
SplitType.values().length and throw a descriptive IOException before indexing;
apply these changes in the serialize/deserialize methods referencing
ClpSplitCodec, serialize, deserialize, and ClpSplit.SplitType.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.java`:
- Around line 24-35: Ensure the codec enforces the empty-payload contract: in
ClpTransactionHandleCodec.serialize(ConnectorTransactionHandle handle) validate
that handle == ClpTransactionHandle.INSTANCE (or is instance of
ClpTransactionHandle and equal to the singleton) and throw an informative
IllegalArgumentException if not; in ClpTransactionHandleCodec.deserialize(byte[]
bytes) validate that bytes.length == 0 and throw an IllegalArgumentException (or
appropriate Presto exception) when non-empty payloads are received, otherwise
return ClpTransactionHandle.INSTANCE.
In
`@presto-clp/src/test/java/com/facebook/presto/plugin/clp/codec/TestClpConnectorCodecProvider.java`:
- Around line 95-118: Add two negative unit tests in
TestClpConnectorCodecProvider that assert deserialize rejects malformed
payloads: (1) create a ClpSplitCodec-encoded byte array with an invalid
split-type ordinal and assert ClpSplitCodec.deserialize throws (or returns an
error) instead of succeeding; (2) create a payload that includes a non-empty
transaction payload when decoded and assert ClpSplitCodec.deserialize rejects
it. Use the ClpSplitCodec.deserialize and ClpSplit/ClpSplit.SplitType symbols to
locate encoding/decoding logic and assert the expected exception/validation
failure for each malformed case to lock the protocol guards.
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Around line 57-66: The deserializers (e.g., ClpConnectorProtocol::deserialize
that constructs ClpColumnHandle and the other deserialize overloads for table,
split, transaction handles) currently ignore trailing bytes and must validate
that the read offset consumed all input; after all readUTF/readX calls in each
deserialize (e.g., after populating
ClpColumnHandle::columnName/originalColumnName/columnType) add a check if
(offset != data.size()) throw std::runtime_error("Trailing bytes in
ClpConnectorProtocol deserialize"); to fail-fast on malformed or
forward-compatible payloads so any extra/unexpected bytes are rejected rather
than silently ignored.
- Around line 23-33: The readUTF function currently copies raw bytes rather than
decoding Java's modified UTF-8; update readUTF in ClpConnectorProtocol.cpp to
(1) read the 2-byte big-endian length as before, then iterate over the next len
bytes decoding Java-modified-UTF: treat the two-byte sequence 0xC0 0x80 as
U+0000, decode standard 1-, 2-, and 3-byte modified-UTF sequences to Unicode
code points, handle surrogate pairs (decode high and low UTF-16 surrogates
emitted as two 3-byte sequences and combine into a single supplementary code
point), and re-encode resulting code points into proper UTF-8 bytes appended
into the result std::string; keep existing bounds checks using offset and ensure
offset advances exactly by len before returning.
- Around line 23-40: The code in readUTF and readInt uses reinterpret_cast to
load multi-byte integers (violating alignment/aliasing) and __builtin_bswap*
which assumes a particular host endianness; replace those with byte-wise loads
and portable assembly: in readUTF, read the two length bytes individually from
data[offset] and data[offset+1], assemble a uint16_t length in big-endian
(length = (b0<<8)|(b1)), advance offset, bounds-check length, then copy the UTF
bytes; in readInt, read four bytes individually and assemble an int32_t in
big-endian ((b0<<24)|(b1<<16)|(b2<<8)|b3)), advance offset; keep existing
VELOX_CHECK_LE checks and use uint8_t temporaries to avoid unaligned
dereferences. Ensure signedness conversion is handled when producing int32_t.
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h`:
- Around line 74-222: The CLP protocol currently only supports binary
serialization (the to_json/from_json/serialize overrides in
ClpConnectorProtocol.h throw VELOX_NYI) so before enabling this protocol you
must update the Java coordinator to populate the customSerializedValue field on
all CLP handle types; coordinate deployment so every path that creates
ConnectorTableHandle, ConnectorTableLayoutHandle, ConnectorInsertTableHandle,
ConnectorOutputTableHandle, ConnectorSplit, ConnectorTransactionHandle,
ConnectorDeleteTableHandle, ConnectorIndexHandle,
ConnectorDistributedProcedureHandle, ColumnHandle, and
ConnectorPartitioningHandle sets customSerializedValue, run integration tests to
validate workers receive those binary payloads, and optionally add a defensive
startup/runtime validation in the C++ worker to detect missing
customSerializedValue and fail fast with a clear message referencing
ClpConnectorProtocol.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ca7f75ed-b4dc-46a8-82f1-207ff3e723f6
📒 Files selected for processing (11)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnector.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpConnectorCodecProvider.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableLayoutHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.javapresto-clp/src/test/java/com/facebook/presto/plugin/clp/codec/TestClpConnectorCodecProvider.javapresto-native-execution/presto_cpp/presto_protocol/CMakeLists.txtpresto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpppresto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h
128d69f to
87153d6
Compare
…HandleCodec Add wire format comments to each codec class cross-referencing the C++ ClpConnectorProtocol::deserialize() counterpart. Remove redundant comments from ClpTransactionHandleCodec.
5bbc99c to
80ad8a2
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp (1)
23-40:⚠️ Potential issue | 🔴 CriticalImplement a real Java
writeUTFreader and remove the unaligned integer loads.Line 23 treats the
writeUTFpayload as already-UTF-8 and just copies the bytes, but Java uses modified UTF-8 here. That means\u0000and supplementary characters will deserialize incorrectly. On top of that, Lines 26-39 still readuint16_t/uint32_tthroughreinterpret_castfrom arbitrary byte offsets, which is undefined behaviour on strict-alignment targets and bakes in a host-endianness assumption. Please switch these helpers to byte-wise big-endian reads and a proper modified-UTF decoder before relying on this wire format.What does Java DataOutputStream.writeUTF encode for NUL characters and supplementary characters (modified UTF-8), and is reading uint16_t/uint32_t from arbitrary char-buffer offsets via reinterpret_cast well-defined in C++?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp` around lines 23 - 40, readUTF currently assumes payload is standard UTF-8 and also uses reinterpret_cast-based reads which are undefined on unaligned buffers; change readUTF to parse the Java modified-UTF-8 format per DataInput.readUTF (read the 2-byte big-endian length via byte-wise reads, then decode modified-UTF-8 where NUL is encoded as 0xC0 0x80 and supplementary chars are represented via surrogate pairs encoded as six bytes) producing a proper UTF-8 std::string (decode into UTF-16 code units then to UTF-8), and replace the reinterpret_cast-based integer reads in readInt (and any similar helpers) with explicit byte-wise big-endian assembly (e.g., (b0<<24)|(b1<<16)|(b2<<8)|b3) while advancing offset safely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.h`:
- Around line 67-80: ClpConnectorProtocol is missing the required override for
ColumnHandle::serialize causing the class to remain abstract; add a method with
the signature serialize(const std::shared_ptr<ColumnHandle>&, std::string&)
const override to ClpConnectorProtocol (next to the existing deserialize,
to_json, from_json for ColumnHandle) and implement it either by performing the
actual serialization used by other handle types or stubbing with VELOX_NYI
(matching the pattern used by the other serialize overrides in this file) so the
class compiles and can be instantiated/registered.
In
`@presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpp`:
- Around line 1134-1143: Centralize the repeated "customSerializedValue" branch
by extracting the guard/decode/dispatch logic into a shared helper used by the
codegen template (e.g., a function like handleCustomSerializedValue(json j,
const std::string& type, PrestoProtocolType& p) or a generator template partial)
so every overload calls that helper instead of duplicating the block; the helper
should check j.contains("customSerializedValue"), perform the
VELOX_CHECK(!type.empty() && type[0] != '$', ...), decode with
velox::encoding::Base64::decode(j["customSerializedValue"].get<std::string>()),
call getConnectorProtocol(type).deserialize(binaryData, p), and indicate (via
return bool or early return) that the branch was handled, then replace
occurrences in functions that currently inline the block (the ones around
getConnectorProtocol/type/deserialize uses) to call this helper.
- Around line 6227-6233: The ColumnHandle customSerializedValue path uses
j["@type"] directly for connector dispatch
(getConnectorProtocol(type).deserialize) but lacks the internal-handle guard
used elsewhere; modify the block that handles customSerializedValue for
ColumnHandle to check the extracted type (from j["@type"]) and reject or error
out when it represents an internal handle (e.g., starts with '$' or matches the
same internal-handle predicate used in other branches) before calling
getConnectorProtocol(type).deserialize, mirroring the guard and error handling
used in the other new branches to avoid routing internal handles into connector
protocol resolution.
In
`@presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc`:
- Around line 24-30: The current customSerializedValue branch directly reads
j["@type"] and calls getConnectorProtocol(type).deserialize(...), bypassing the
shared subtype-resolution and internal-type guard; update the ColumnHandle
deserialization branch to first run the existing subtype resolution/validation
routine (call getSubclassKey(...) or the shared helper used by other handle
deserializers) to obtain/validate the concrete subtype key, then use that
resolved key when calling
getConnectorProtocol(resolvedKey).deserialize(binaryData, p); this preserves the
common parsing/validation flow and reenables the internal-type guard while
keeping the customSerializedValue base64 decode and deserialize steps intact.
---
Duplicate comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Around line 23-40: readUTF currently assumes payload is standard UTF-8 and
also uses reinterpret_cast-based reads which are undefined on unaligned buffers;
change readUTF to parse the Java modified-UTF-8 format per DataInput.readUTF
(read the 2-byte big-endian length via byte-wise reads, then decode
modified-UTF-8 where NUL is encoded as 0xC0 0x80 and supplementary chars are
represented via surrogate pairs encoded as six bytes) producing a proper UTF-8
std::string (decode into UTF-16 code units then to UTF-8), and replace the
reinterpret_cast-based integer reads in readInt (and any similar helpers) with
explicit byte-wise big-endian assembly (e.g., (b0<<24)|(b1<<16)|(b2<<8)|b3)
while advancing offset safely.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 91d7b2f6-521d-4dc7-aa45-842df98d585c
📒 Files selected for processing (19)
presto-native-execution/presto_cpp/presto_protocol/CMakeLists.txtpresto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpppresto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.hpresto-native-execution/presto_cpp/presto_protocol/core/ConnectorProtocol.hpresto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpppresto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.hpresto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.hpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorDeleteTableHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorIndexHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorIndexHandle.hpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorInsertTableHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorOutputTableHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorPartitioningHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorPartitioningHandle.hpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorSplit.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTableHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTableLayoutHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTransactionHandle.cpp.inc
ced2480 to
02e75a4
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Around line 25-33: The helper readers (readUTF, readInt, readBoolean)
currently perform in.read(...) calls without validating the stream, which allows
truncated input to produce partial/uninitialized data; update each function to
verify the stream read succeeded by checking in.fail() or in.gcount() after
every read and handle errors by throwing a descriptive exception (e.g.,
std::runtime_error) when the number of bytes read does not match the expected
size; specifically: in readUTF, after reading the length field verify
sizeof(length) bytes were read, then after reading the string bytes verify
gcount() == length before returning (still apply folly::Endian::big to length),
and in readInt/readBoolean verify their single/multi-byte reads succeeded before
converting/returning the value.
In
`@presto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp`:
- Around line 137-152: The test TestSplitDeserializationArchive in
ClpConnectorProtocolTest omits the kqlQuery presence boolean so the deserializer
reads past the payload; update the test payload construction (after
writeInt(oss, 0) // ARCHIVE) to write the kqlQuery presence flag (use the same
helper used elsewhere, e.g., writeBool or writeByte for kqlQueryPresent) so the
wire format matches ClpSplit's expected layout, then run protocol.deserialize
into handle and assert as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 777a9e39-c819-490b-a0a0-ef471a571dff
📒 Files selected for processing (5)
presto-native-execution/presto_cpp/presto_protocol/CMakeLists.txtpresto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpppresto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.hpresto-native-execution/presto_cpp/presto_protocol/tests/CMakeLists.txtpresto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp
02e75a4 to
7aa894a
Compare
Cherry-picked from upstream PR prestodb#27652 (Tim's approach for Tpch connector). Adds serialize/deserialize pure virtuals to ConnectorProtocol for all handle types and customSerializedValue routing in .cpp.inc files.
7aa894a to
b883177
Compare
Extract writeTableHandle/readTableHandle so ClpTableLayoutHandleCodec can share the same DataOutputStream/DataInputStream instead of duplicating the table handle serialization fields.
Add instanceof checks before casting in serialize() for clearer error messages on type mismatch. Add trailing-byte checks in deserialize() to reject unexpected extra data after valid encoded fields.
Java's writeUTF uses Modified UTF-8 which encodes \0 and supplementary characters differently from standard UTF-8. Since clp-s assumes real UTF-8 for string comparisons, using writeUTF would break searchability for strings containing non-ASCII content. Add CodecUtils with writeUtf8String/readUtf8String helpers that use StandardCharsets.UTF_8, keeping the same wire format (2-byte BE length prefix + encoded bytes).
14625fc to
bab7a69
Compare
bab7a69 to
9c3bf3e
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (2)
presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc (1)
24-30:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the shared subtype-resolution path before custom binary deserialization.
At Line 24, this fast-path bypasses
getSubclassKey(...)and directly trustsj["@type"]. That diverges from the other handle deserializers, skips the internal$...type guard, and can produce inconsistent parse failures when@typeis missing or malformed.Patch to align `ColumnHandle` with the common deserialization pattern
void from_json(const json& j, std::shared_ptr<ColumnHandle>& p) { - if (j.contains("customSerializedValue")) { - String type = j["@type"]; - std::string binaryData = - velox::encoding::Base64::decode(j["customSerializedValue"].get<std::string>()); - getConnectorProtocol(type).deserialize(binaryData, p); - return; - } String type; try { type = p->getSubclassKey(j); } catch (json::parse_error& e) { throw ParseError(std::string(e.what()) + " ColumnHandle ColumnHandle"); } + if (j.contains("customSerializedValue")) { + VELOX_CHECK( + !type.empty() && type[0] != '$', + "Internal handle type '{}' should not have customSerializedValue", + type); + std::string binaryData = + velox::encoding::Base64::decode(j["customSerializedValue"].get<std::string>()); + getConnectorProtocol(type).deserialize(binaryData, p); + return; + } getConnectorProtocol(type).from_json(j, p); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc` around lines 24 - 30, The customSerializedValue branch in ColumnHandle.cpp.inc bypasses the shared subtype resolution and directly reads j["@type"], which can miss the internal $... guard used by getSubclassKey; change this branch to first call getSubclassKey(j) (or the same shared subtype-resolution helper used by other handle deserializers) to obtain the canonical connector/type key, then pass that key to getConnectorProtocol(...) and call deserialize on the decoded binary; ensure you still decode j["customSerializedValue"] via Base64 before calling getConnectorProtocol(...).presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp (1)
66-78:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject trailing bytes (and non-empty transaction payload) after deserialization.
Current decode paths accept extra bytes silently. That masks malformed payloads and schema drift. Also, transaction handle is documented as empty payload but currently doesn’t enforce that.
Suggested fix pattern
namespace { +void ensureFullyConsumed( + std::istringstream& in, + const std::string& binaryData, + const char* handleName) { + VELOX_CHECK( + in.tellg() == static_cast<std::streampos>(binaryData.size()), + "Trailing bytes in {} payload", + handleName); +} } // namespace void ClpConnectorProtocol::deserialize( const std::string& binaryData, std::shared_ptr<ColumnHandle>& proto) const { @@ handle->_type = "clp"; + ensureFullyConsumed(in, binaryData, "ClpColumnHandle"); proto = handle; }Apply the same check in split/table/table-layout paths, and in transaction path add:
+ VELOX_CHECK(binaryData.empty(), "ClpTransactionHandle payload must be empty");Also applies to: 87-102, 110-120, 130-152, 157-165
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp` around lines 66 - 78, Ensure deserialization rejects trailing bytes and enforces empty transaction payloads: after reading expected fields in ClpConnectorProtocol::deserialize (and the analogous deserialize implementations at the other ranges), check that the input stream has no remaining data (e.g., verify in.rdbuf()->in_avail() == 0 or in.peek() == EOF) and if not, throw/return an error (e.g., throw std::runtime_error with a clear message about trailing bytes). For the transaction-path deserialize, additionally enforce that binaryData is empty (or stream contains zero bytes) and fail if any payload is present. Update all referenced deserialize methods (the one creating ClpColumnHandle and the other deserialize functions mentioned in the review) to perform this post-read validation and emit a clear error when extra bytes are found.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Line 28: Change the boolean style used in the VELOX_CHECK conditions to the
repository-preferred form: replace occurrences of "!in.fail()" with "false ==
in.fail()" (and similarly any other negated checks like "!someExpr" in the same
function) so the VELOX_CHECK calls (e.g., the VELOX_CHECK that currently reads
VELOX_CHECK(!in.fail(), ...) and the other instances noted around the same
block) use the "false == <expression>" style consistently.
- Around line 94-96: Validate the integer returned by readInt(in) before casting
to SplitType: after reading typeOrdinal in ClpConnectorProtocol.cpp, check that
it maps to a known SplitType value (e.g., within the enum's defined ordinal
range or via a helper like isValidSplitType) and if it's not valid, fail fast
(throw an exception or return an error) instead of assigning handle->type;
update the code paths that use typeOrdinal/handle->type (reading code around
readInt, typeOrdinal, and handle->type) to perform this guard so invalid wire
values cannot produce an undefined enum state.
In
`@presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpp`:
- Around line 1135-1136: Replace negated emptiness checks using the '!' operator
with the repo-preferred form 'false == type.empty()' in the VELOX_CHECK guards.
Specifically, change occurrences like VELOX_CHECK(!type.empty() && type[0] !=
'$', ...) to use VELOX_CHECK(false == type.empty() && type[0] != '$', ...), and
apply the same pattern for all other instances where '!type.empty()' appears in
VELOX_CHECK (notably the other occurrences reported). Keep the rest of each
condition (e.g., checks against type[0]) unchanged so behavior is preserved.
In
`@presto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp`:
- Around line 63-65: The test verifies ClpColumnHandle names but omits asserting
the serialized columnType; update the test in ClpConnectorProtocolTest (where
clpHandle is inspected) to also EXPECT_EQ clpHandle->columnType with the
expected type value for the column you serialized (e.g., the same ColumnType
used when constructing/serializing the ClpColumnHandle); place this assertion
alongside the existing EXPECT_EQ checks for columnName and originalColumnName so
the deserialized columnType is validated as well.
- Around line 133-134: The test dereferences optional pointer members directly
(clpHandle->kqlQuery and clpHandle->metadataFilterQuery); add explicit
null/value assertions before comparing: first ASSERT_TRUE(clpHandle->kqlQuery)
(or ASSERT_TRUE(clpHandle->kqlQuery.has_value()) if using std::optional) and
ASSERT_TRUE(clpHandle->metadataFilterQuery) and then keep the EXPECT_EQ checks
comparing *clpHandle->kqlQuery to "level == 'error'" and
*clpHandle->metadataFilterQuery to "timestamp > 1000" so failures are
deterministic and clearly reported.
---
Duplicate comments:
In
`@presto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpp`:
- Around line 66-78: Ensure deserialization rejects trailing bytes and enforces
empty transaction payloads: after reading expected fields in
ClpConnectorProtocol::deserialize (and the analogous deserialize implementations
at the other ranges), check that the input stream has no remaining data (e.g.,
verify in.rdbuf()->in_avail() == 0 or in.peek() == EOF) and if not, throw/return
an error (e.g., throw std::runtime_error with a clear message about trailing
bytes). For the transaction-path deserialize, additionally enforce that
binaryData is empty (or stream contains zero bytes) and fail if any payload is
present. Update all referenced deserialize methods (the one creating
ClpColumnHandle and the other deserialize functions mentioned in the review) to
perform this post-read validation and emit a clear error when extra bytes are
found.
In
`@presto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.inc`:
- Around line 24-30: The customSerializedValue branch in ColumnHandle.cpp.inc
bypasses the shared subtype resolution and directly reads j["@type"], which can
miss the internal $... guard used by getSubclassKey; change this branch to first
call getSubclassKey(j) (or the same shared subtype-resolution helper used by
other handle deserializers) to obtain the canonical connector/type key, then
pass that key to getConnectorProtocol(...) and call deserialize on the decoded
binary; ensure you still decode j["customSerializedValue"] via Base64 before
calling getConnectorProtocol(...).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 39a54195-a6ee-4d9c-b0ee-9d13956282ac
📒 Files selected for processing (21)
presto-native-execution/presto_cpp/presto_protocol/CMakeLists.txtpresto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.cpppresto-native-execution/presto_cpp/presto_protocol/connector/clp/ClpConnectorProtocol.hpresto-native-execution/presto_cpp/presto_protocol/core/ConnectorProtocol.hpresto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpppresto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.hpresto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ColumnHandle.hpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorDeleteTableHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorIndexHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorIndexHandle.hpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorInsertTableHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorOutputTableHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorPartitioningHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorPartitioningHandle.hpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorSplit.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTableHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTableLayoutHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/core/special/ConnectorTransactionHandle.cpp.incpresto-native-execution/presto_cpp/presto_protocol/tests/CMakeLists.txtpresto-native-execution/presto_cpp/presto_protocol/tests/ClpConnectorProtocolTest.cpp
9c3bf3e to
a9fe2dc
Compare
Implement binary deserialization for all 5 CLP connector handle types (ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle, ClpTransactionHandle) by inheriting ConnectorProtocol and overriding deserialize(). Reads binary bytes (Java DataOutputStream big-endian wire format from #158) directly into existing protocol structs using std::istringstream and folly::Endian::big. Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.
a9fe2dc to
01291e9
Compare
…CodecUtils.java Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
gibber9809
left a comment
There was a problem hiding this comment.
Nice work, just a few questions/nits to clear up. Only reviewed changes from the second commit per sidebar.
| handle->columnName = readUTF(in); | ||
| handle->originalColumnName = readUTF(in); | ||
| handle->columnType = readUTF(in); | ||
| handle->_type = "clp"; |
There was a problem hiding this comment.
Just judging by naming, it seems like _type is supposed to be private? If so, should we assign it in the constructor of ClpColumnHandle instead of here, since it doesn't seem to depend on the serialized data?
There was a problem hiding this comment.
_type is a public member inherited from a base class for all protocol handle types in Presto's protocol layer.
In the current fork version, the ClpColumnHandle's constructor are auto-generated by the Presto Native Worker Protocol code generation (chevron templates, we briefly discussed this component in #158 ), so we can't currently customize the constructor. This is the exact motivation we have to use binary serialization: to decouple the Presto Native Worker Protocol codegen from the communication from coordinator to worker.
Once we are migrated into the standalone plugin where we have the freedom of defining these types ourselves, I agree the assignment should live in the constructor. Will make a note on this. Thanks for the good catch.
| if (readBoolean(in)) { | ||
| handle->kqlQuery = std::make_shared<String>(readUTF(in)); | ||
| } | ||
| handle->_type = "clp"; |
There was a problem hiding this comment.
Ditto, wondering if this should be in the constructor of ClpSplit.
…lp/ClpConnectorProtocol.cpp Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
…lp/ClpConnectorProtocol.cpp Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
…x/presto into feat/2026-04-27-binary-deserialization-cpp
…om:y-scope/presto into feat/2026-04-27-binary-deserialization-cpp
Summary
Implement C++ binary deserialization for all 5 CLP connector handle types (
ClpColumnHandle,ClpSplit,ClpTableHandle,ClpTableLayoutHandle,ClpTransactionHandle) underConnectorProtocol.Follows Tim's
TpchConnectorProtocolpattern from prestodb#26650.Background & Motivation
#158 added Java codecs that serialize connector handles to binary via a
customSerializedValuefield in the JSON message sent from coordinator to worker. This PR adds the C++ deserialization counterpart: when the native Velox worker receives acustomSerializedValue, the bytes are routed toClpConnectorProtocol::deserialize()and reconstructed into C++ structs.Some notes on the implementation:
Previously
ClpConnectorProtocolwas defined viaConnectorProtocolTemplate<...>. This PR follows the pattern from prestodb#26650 by making it directly inherit fromConnectorProtocol. (The reason we don't reuseConnectorProtocolTemplate<...>is that itsdeserializeTemplateis not yet usable — will consult with upstream developers on this.) This means we must implement all communication helpers ourselves:to_json/from_jsonfor the JSON path,serialize/deserializefor the binary path. Note,to_jsonandserializecan be just aVELOX_NYIstubs because worker only concerns with the deserialization, serialization is performed at Java codex.The
from_jsonimplementations are copied from the chevron-generatedpresto_protocol_clp.cppto preserve the JSON fallback path. Tim leaves these asVELOX_NYIstubs; however, our forked Presto/Velox shall support both JSON and binary serialization, pending the user's configuration ofuse-connector-provided-serialization-codecs. When we eventually migrate to the standalone shared library, the JSON fallback will no longer be needed. At that pointto_json/from_jsonwill becomeVELOX_NYIstubs (matching Tim's pattern), and onlydeserialize()will remain.Note for Reviewers
This PR contains two commits. Please only review changes from commit 2 that is related to Clp connector. Commit 1 is under review by Tim in upstream Presto; we include it here so CI can run:
deserializeTemplatetoConnectorProtocol::deserialize().Checklist
breaking change.
Validation performed
SELECT CLP_GET_JSON_STRING() from clp.default.default limit 100customSerializedValue, and minio as the storage.Summary by CodeRabbit
Release Notes
New Features
Tests