enhancement(codecs): replace serde_arrow with arrow-json#24661
enhancement(codecs): replace serde_arrow with arrow-json#24661benjamin-awd wants to merge 30 commits intovectordotdev:masterfrom
serde_arrow with arrow-json#24661Conversation
Replace manual `.map_err(|source| ArrowEncodingError::Variant { source })`
closures with idiomatic `.context(VariantSnafu)` calls throughout the
Arrow encoding module.
The schema is always required — passing None just returned an error. Encode this invariant in the type system by taking SchemaRef directly, eliminating the NoSchemaProvided error variant and the runtime check.
Replace the manual `From<std::io::Error>` impl with snafu's `#[snafu(context(false))]` attribute, which auto-generates the same conversion.
Remove the intermediate `Vec<&str>` allocation in `validate_non_nullable_fields` by iterating over the filtered schema fields directly.
…nfigurable toggle Move required-field validation from RequestBuilder::pre_encode to the sink's stream pipeline so event finalizers are properly marked Rejected on failure. Add `validate_schema` setting (default true) to ArrowStreamSerializerConfig to allow disabling per-batch validation for throughput. Remove unused pre_encode hook from RequestBuilder trait. Also remove DataType::Binary support from the Arrow encoder since Vector's Value::Bytes is always UTF-8 and cannot produce true binary.
Replace Option<Vec<String>> with a descriptive type alias for non-nullable field names that must be present before encoding.
The derived Default set validate_schema to false (bool's default), but the intended default is true. The serde default_true only applies during deserialization, not when using Default::default() in Rust code.
Clarify that the field is non-nullable in the Arrow schema.
…ink config The field was defined on `ArrowStreamSerializerConfig` but only consumed by the ClickHouse sink to decide whether to extract required fields. The encoder itself never read it. Moving it to `ClickhouseConfig` aligns ownership with usage.
serde_arrow with arrow-jsonserde_arrow with arrow-json
pront
left a comment
There was a problem hiding this comment.
Thank you for addressing all review comments. The build_record_batch() implementation looks much better now!
…ilder Add `on_encode_error` hook to the `RequestBuilder` trait so sinks can update event finalizers before metadata is dropped on encoding failure. Without this, finalizers default to `Delivered` even when encoding fails, causing incorrect batch status reporting. Implement the hook for ClickHouse to mark batches as `Rejected` on null-constraint violations.
|
The failing integration test is a bit of a weird one:
I could probably remove the test or modify it so that it doesn't check the Attempted a fix in 1426e1f -- let me know if that works. |
pront
left a comment
There was a problem hiding this comment.
Checked this out locally and got:
error[E0433]: failed to resolve: use of unresolved module or unlinked crate `toml`
--> lib/codecs/src/encoding/transformer.rs:303:13
Hmm this requires some thought, I need to consider alternative designs. In the meantime, if you want to unblock this PR, you can create an issue for this (i.e. add ability to modify event finalizers when building requests), modify the test failing case and add this new issue link as a comment on the hacky test code. |
1ba24be to
b209f6a
Compare
serde_arrow with arrow-jsonserde_arrow with arrow-json
Head branch was pushed to by a user without write access
I think this might be a pre-existing bug, I can reproduce by running |
We submitted a fix: #24766 |
# Conflicts: # src/sinks/clickhouse/integration_tests.rs
Summary
This PR replaces
serde_arrowwitharrow-jsonin the Arrow encoder - this removes a lot of the manual workarounds that we implemented i.e. custom timestamp conversion (which involves cloning LogEvents and looking for timestamp fields within the schema), post-processing and string-matching on error messages.arrow-jsonhandles RFC 3339 timestamps natively and avoids using an extra dependency outside of the native arrow ecosystem.Vector configuration
How did you test this PR?
Existing tests + ran locally
Change Type
Is this a breaking change?
Does this PR include user facing changes?
There are likely some differences, but
no-changeloglabel to this PR.References
Related: #24124, #24409
Notes
@vectordotdev/vectorto reach out to us regarding this PR.pre-pushhook, please see this template.make fmtmake check-clippy(if there are failures it's possible some of them can be fixed withmake clippy-fix)make testgit merge origin masterandgit push.Cargo.lock), pleaserun
make build-licensesto regenerate the license inventory and commit the changes (if any). More details here.