enhancement(codecs): replace serde_arrow with arrow-json#24661

Open
benjamin-awd wants to merge 30 commits into vectordotdev:master from benjamin-awd:use-arrow-json

@benjamin-awd (Contributor) commented Feb 16, 2026

Summary

This PR replaces serde_arrow with arrow-json in the Arrow encoder. This removes many of the manual workarounds we had implemented, e.g. custom timestamp conversion (which involved cloning LogEvents and looking for timestamp fields within the schema), post-processing, and string-matching on error messages. arrow-json handles RFC 3339 timestamps natively and avoids an extra dependency outside the native Arrow ecosystem.

Vector configuration

sinks:
  clickhouse:
    type: clickhouse
    inputs: [process]
    database: ...
    table: ...
    endpoint: ${CLICKHOUSE_HOST}:8123
    compression: zstd
    format: arrow_stream
    batch_encoding:
      codec: arrow_stream
      allow_nullable_fields: true

How did you test this PR?

Existing tests + ran locally

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

There are likely some differences, but

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Related: #24124, #24409

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.

Replace manual `.map_err(|source| ArrowEncodingError::Variant { source })`
closures with idiomatic `.context(VariantSnafu)` calls throughout the
Arrow encoding module.

The schema is always required; passing None just returned an error.
Encode this invariant in the type system by taking SchemaRef directly,
eliminating the NoSchemaProvided error variant and the runtime check.

Replace the manual `From<std::io::Error>` impl with snafu's
`#[snafu(context(false))]` attribute, which auto-generates the
same conversion.

Remove the intermediate `Vec<&str>` allocation in
`validate_non_nullable_fields` by iterating over the filtered schema
fields directly.
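
Two of the refactors above (taking the schema directly instead of an `Option`, and iterating over the filtered fields without an intermediate `Vec`) can be sketched roughly as follows. This is a std-only illustration, not the PR's code: `Field`, `Schema`, and `SchemaRef` are hypothetical stand-ins for the Arrow types, and `first_missing_required` is an invented name rather than the real `validate_non_nullable_fields`.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for Arrow's `Field`, `Schema`, and `SchemaRef`.
#[derive(Debug)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
}

#[derive(Debug)]
pub struct Schema {
    pub fields: Vec<Field>,
}

pub type SchemaRef = Arc<Schema>;

/// Returns the first non-nullable field that is absent from `present`.
///
/// Taking `&SchemaRef` rather than `Option<SchemaRef>` means there is no
/// "no schema provided" case left to handle at runtime.
pub fn first_missing_required<'a>(schema: &'a SchemaRef, present: &[&str]) -> Option<&'a str> {
    schema
        .fields
        .iter()
        // Only non-nullable fields are required.
        .filter(|field| !field.nullable)
        .map(|field| field.name.as_str())
        // Stop at the first missing name; no `Vec<&str>` is collected.
        .find(|name| !present.iter().any(|p| *p == *name))
}
```

Because the iterator chain is lazy, the check stops at the first missing field and allocates nothing, which is the effect the commit message describes.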
…nfigurable toggle

Move required-field validation from RequestBuilder::pre_encode to the
sink's stream pipeline so event finalizers are properly marked Rejected
on failure. Add `validate_schema` setting (default true) to
ArrowStreamSerializerConfig to allow disabling per-batch validation for
throughput. Remove unused pre_encode hook from RequestBuilder trait.
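
As a rough illustration of the behaviour described above, the sketch below checks required fields on a batch before encoding and marks every event `Rejected` when one is missing, with a `validate_schema` flag that skips the check. All types here (`Status`, `Event`) and the function `check_required_fields` are hypothetical simplifications; Vector's real events and finalizers are more involved, and rejecting the whole batch is an assumption.

```rust
use std::collections::HashMap;

// Hypothetical, simplified delivery status carried by each event.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Delivered,
    Rejected,
}

// Hypothetical event: a flat map of fields plus its delivery status.
pub struct Event {
    pub fields: HashMap<String, String>,
    pub status: Status,
}

/// Checks that every event carries all `required` fields.
/// When `validate_schema` is false the check is skipped entirely.
pub fn check_required_fields(
    batch: &mut [Event],
    required: &[&str],
    validate_schema: bool,
) -> Result<(), String> {
    if !validate_schema {
        return Ok(());
    }
    // Find the first required field missing from any event in the batch.
    let missing = batch.iter().find_map(|event| {
        required
            .iter()
            .copied()
            .find(|name| !event.fields.contains_key(*name))
    });
    match missing {
        Some(name) => {
            // Assumption: the whole batch is rejected on a violation.
            for event in batch.iter_mut() {
                event.status = Status::Rejected;
            }
            Err(format!("Null value for non-nullable field '{name}'"))
        }
        None => Ok(()),
    }
}
```

Running the check in the stream pipeline, while the events are still mutable, is what lets the status be updated before anything is dropped.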
Also remove DataType::Binary support from the Arrow encoder since
Vector's Value::Bytes is always UTF-8 and cannot produce true binary.

Replace Option<Vec<String>> with a descriptive type alias for
non-nullable field names that must be present before encoding.

The derived Default set validate_schema to false (bool's default),
but the intended default is true. The serde default_true only applies
during deserialization, not when using Default::default() in Rust code.

Clarify that the field is non-nullable in the Arrow schema.
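
The `Default` pitfall mentioned in the commits above can be reproduced in a few lines. This is a minimal std-only sketch; `DerivedConfig` and `FixedConfig` are hypothetical names, not the PR's config types, and the serde attribute is only referenced in a comment.

```rust
// Deriving `Default` uses `bool::default()`, which is `false`.
// A serde attribute such as `#[serde(default = "default_true")]` would only
// affect deserialization, not `DerivedConfig::default()`.
#[derive(Debug, Default, PartialEq)]
pub struct DerivedConfig {
    pub validate_schema: bool,
}

#[derive(Debug, PartialEq)]
pub struct FixedConfig {
    pub validate_schema: bool,
}

// A manual impl keeps the in-code default aligned with the intended `true`.
impl Default for FixedConfig {
    fn default() -> Self {
        Self { validate_schema: true }
    }
}
```

With these definitions, `DerivedConfig::default().validate_schema` is `false` while `FixedConfig::default().validate_schema` is `true`.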
@benjamin-awd benjamin-awd requested a review from a team as a code owner February 16, 2026 15:59
@github-actions github-actions bot added the domain: sinks Anything related to the Vector's sinks label Feb 16, 2026
…ink config

The field was defined on `ArrowStreamSerializerConfig` but only consumed
by the ClickHouse sink to decide whether to extract required fields. The
encoder itself never read it. Moving it to `ClickhouseConfig` aligns
ownership with usage.
@benjamin-awd benjamin-awd changed the title from "enhancement(codecs): replace serde_arrow with arrow-json" to "enhancement(codecs)!: replace serde_arrow with arrow-json" Feb 19, 2026
@pront (Member) left a comment

Thank you for addressing all review comments. The build_record_batch() implementation looks much better now!

@pront pront enabled auto-merge February 23, 2026 16:00
@pront pront added this pull request to the merge queue Feb 23, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 23, 2026
…ilder

Add `on_encode_error` hook to the `RequestBuilder` trait so sinks can
update event finalizers before metadata is dropped on encoding failure.
Without this, finalizers default to `Delivered` even when encoding
fails, causing incorrect batch status reporting.

Implement the hook for ClickHouse to mark batches as `Rejected` on
null-constraint violations.
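
The idea behind the hook can be sketched as a trait method with a no-op default that a sink overrides. This is a simplified, std-only illustration under stated assumptions: the `RequestBuilder` trait below is a hypothetical miniature, not Vector's actual trait, and `Metadata`, `BatchStatus`, `build_request`, `DefaultHook`, and `RejectOnError` are invented for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BatchStatus {
    Delivered,
    Rejected,
}

// Hypothetical stand-in for per-batch metadata that owns the finalizers.
pub struct Metadata {
    pub status: BatchStatus,
}

// Hypothetical miniature of a request builder; `None` models a null value.
pub trait RequestBuilder {
    fn encode(&self, values: &[Option<String>]) -> Result<Vec<u8>, String>;

    /// Called on encoding failure, before the metadata is dropped.
    /// The default does nothing, so the status stays `Delivered`.
    fn on_encode_error(&self, _metadata: &mut Metadata, _error: &str) {}
}

fn encode_non_nullable(values: &[Option<String>]) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for value in values {
        match value {
            Some(v) => out.extend_from_slice(v.as_bytes()),
            None => return Err("Null value for non-nullable field".to_string()),
        }
    }
    Ok(out)
}

/// Builder that keeps the default (no-op) hook.
pub struct DefaultHook;
impl RequestBuilder for DefaultHook {
    fn encode(&self, values: &[Option<String>]) -> Result<Vec<u8>, String> {
        encode_non_nullable(values)
    }
}

/// Builder that marks the batch as rejected when encoding fails.
pub struct RejectOnError;
impl RequestBuilder for RejectOnError {
    fn encode(&self, values: &[Option<String>]) -> Result<Vec<u8>, String> {
        encode_non_nullable(values)
    }
    fn on_encode_error(&self, metadata: &mut Metadata, _error: &str) {
        metadata.status = BatchStatus::Rejected;
    }
}

/// Encodes a batch and reports the status the metadata ends up with.
pub fn build_request<B: RequestBuilder>(
    builder: &B,
    values: &[Option<String>],
) -> (Option<Vec<u8>>, BatchStatus) {
    let mut metadata = Metadata { status: BatchStatus::Delivered };
    match builder.encode(values) {
        Ok(payload) => (Some(payload), metadata.status),
        Err(error) => {
            builder.on_encode_error(&mut metadata, &error);
            (None, metadata.status)
        }
    }
}
```

In this sketch, a batch containing a `None` value yields `Delivered` with `DefaultHook` and `Rejected` with `RejectOnError`, mirroring the incorrect and corrected batch status described in the commit message.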
@benjamin-awd (Contributor, Author) commented Feb 23, 2026

The failing integration test is a bit of a weird one:

2026-02-23T16:39:49.215866Z ERROR codecs::internal_events: Schema constraint violation. error=Null value for non-nullable field 'required_field' error_code="encoding_null_constraint" error_type="encoder_failed" stage="sending"
2026-02-23T16:39:49.216060Z ERROR vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Schema constraint violation."
2026-02-23T16:39:49.216415Z ERROR vector::internal_events::common: Failed to build request. error=SchemaConstraintViolation(Null value for non-nullable field 'required_field') error_type="encoder_failed" stage="processing"

thread 'sinks::clickhouse::integration_tests::test_missing_required_field_emits_null_constraint_error' (36740) panicked at src/sinks/clickhouse/integration_tests.rs:1231:5:
assertion `left == right` failed
  left: Ok(Delivered)
 right: Ok(Rejected)

I could probably remove the test or modify it so that it doesn't check the BatchStatus, but that feels a bit hacky -- I suspect it's related to the emit call moving back to the encoder level.

Attempted a fix in 1426e1f -- let me know if that works.

@pront (Member) left a comment

Checked this out locally and got:

error[E0433]: failed to resolve: use of unresolved module or unlinked crate `toml`
   --> lib/codecs/src/encoding/transformer.rs:303:13

@pront (Member) commented Feb 23, 2026

> The failing integration test is a bit of a weird one:
>
> 2026-02-23T16:39:49.215866Z ERROR codecs::internal_events: Schema constraint violation. error=Null value for non-nullable field 'required_field' error_code="encoding_null_constraint" error_type="encoder_failed" stage="sending"
> 2026-02-23T16:39:49.216060Z ERROR vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Schema constraint violation."
> 2026-02-23T16:39:49.216415Z ERROR vector::internal_events::common: Failed to build request. error=SchemaConstraintViolation(Null value for non-nullable field 'required_field') error_type="encoder_failed" stage="processing"
> thread 'sinks::clickhouse::integration_tests::test_missing_required_field_emits_null_constraint_error' (36740) panicked at src/sinks/clickhouse/integration_tests.rs:1231:5:
> assertion `left == right` failed
>   left: Ok(Delivered)
>  right: Ok(Rejected)
>
> I could probably remove the test or modify it so that it doesn't check the BatchStatus, but that feels a bit hacky -- I suspect it's related to the emit call moving back to the encoder level.
>
> Attempted a fix in 1426e1f -- let me know if that works.

Hmm, this requires some thought; I need to consider alternative designs. In the meantime, if you want to unblock this PR, you can create an issue for this (i.e. add the ability to modify event finalizers when building requests), modify the failing test case, and add the new issue link as a comment on the hacky test code.

@pront pront changed the title from "enhancement(codecs)!: replace serde_arrow with arrow-json" to "enhancement(codecs): replace serde_arrow with arrow-json" Feb 24, 2026
@pront pront enabled auto-merge February 24, 2026 20:59
auto-merge was automatically disabled February 25, 2026 07:25

Head branch was pushed to by a user without write access

@benjamin-awd (Contributor, Author) commented Feb 25, 2026

> Checked this out locally and got:
>
> error[E0433]: failed to resolve: use of unresolved module or unlinked crate `toml`
>    --> lib/codecs/src/encoding/transformer.rs:303:13

I think this might be a pre-existing bug, I can reproduce by running cargo test -p codecs against master

@pront pront enabled auto-merge February 25, 2026 16:03
@pront pront added this pull request to the merge queue Feb 25, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Feb 25, 2026
@pront (Member) commented Feb 25, 2026

> > Checked this out locally and got:
> >
> > error[E0433]: failed to resolve: use of unresolved module or unlinked crate `toml`
> >    --> lib/codecs/src/encoding/transformer.rs:303:13
>
> I think this might be a pre-existing bug, I can reproduce by running cargo test -p codecs against master

We submitted a fix: #24766
This PR can go in after the fix is merged.

# Conflicts:
#	src/sinks/clickhouse/integration_tests.rs