feat(aws_s3 sink): Add Apache Parquet encoding support #24706
Open
szibis wants to merge 13 commits into vectordotdev:master from
Conversation
This was referenced Feb 21, 2026
…a modes: Add ParquetSerializer that reuses the Arrow record batch building logic to encode Vec<Event> as complete Parquet files. Supports five compression codecs (None, Snappy, Zstd, Gzip, Lz4) and two schema modes (Relaxed drops extra fields silently, Strict rejects them with an error).
…zerConfig: Add Parquet variant to BatchSerializer enum and BatchSerializerConfig, extending the batch encoding infrastructure to support Parquet output. Rename build() to build_batch_serializer() returning BatchSerializer directly, simplifying the API for all batch serializer consumers. Update ClickHouse sink to use the new method signature.
Refactor S3 sink to use EncoderKind, enabling both traditional framed encoding and batch-based columnar formats. Add batch_encoding config option (behind codecs-parquet feature) that supports Parquet output with automatic .parquet file extension and internal compression bypass.
Replace internal-only Arrow Schema with user-facing ParquetSchemaField and ParquetFieldType types that can be deserialized from TOML/JSON config. Supports boolean, int32/64, float32/64, utf8, binary, timestamp, and date types. Add config deserialization tests.
Box large enum variants (BatchSerializer::Parquet, EncoderKind::Batch) to reduce size difference between variants. Derive Default for ParquetSerializerConfig instead of manual impl. Apply rustfmt.
Add S3 Parquet integration test validating end-to-end encoding with LocalStack (magic bytes, row count, schema columns). Add Criterion benchmarks comparing Parquet (Snappy/Zstd/None) vs NDJSON baseline at 10/100/1000 event batch sizes.
…eness: Add NDJSON+Snappy compressed baseline to Parquet benchmarks for apples-to-apples comparison. Fix non-exhaustive match in ClickHouse config for new Parquet variant.
- Eliminate double-buffering: write ArrowWriter directly into output BytesMut instead of intermediate Vec + put_slice copy
- Wrap WriterProperties in Arc to avoid deep-cloning HashMap per batch
- Remove redundant nullable schema transformation (to_arrow_schema already creates nullable fields)
- Pre-build HashSet of schema field names at construction for O(1) strict-mode lookups instead of O(N) linear scan
- Add 6 tests covering optimization invariants
Add Parquet encoding section to aws_s3.cue how_it_works with TOML and YAML examples. Keep CUE high-level per reviewer feedback from vectordotdev#24372 (field-level docs belong in Rust source, not CUE). Add docs::examples metadata to configurable fields and improve doc comments on ParquetSerializerConfig.
075f136 to 31596cb
Hi there, Thanks for your PR. Adding a …
Summary
Adds native Parquet encoding to the S3 sink so users can write columnar files directly queryable by Athena, Trino, and Spark — no ETL pipeline needed.
Uses the `BatchEncoder`/`EncoderKind` infrastructure from #24124 and shares `build_record_batch()` with the ClickHouse Arrow IPC path. The whole thing is behind a `codecs-parquet` feature gate, so zero cost when you don't use it.

Configuration
TOML
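A sketch of what the TOML configuration could look like, based on the options described in this PR (five compression codecs, relaxed/strict schema modes, an explicit field list). The key names `batch_encoding`, `schema_mode`, `schema`, `name`, and `type` are assumptions for illustration, not copied from the PR diff:

```toml
[sinks.s3]
type = "aws_s3"
bucket = "my-bucket"
compression = "none"   # Parquet compresses internally; sink-level compression stays "none"

[sinks.s3.batch_encoding.parquet]
compression = "snappy"   # none | snappy | zstd | gzip | lz4
schema_mode = "relaxed"  # relaxed drops extra fields silently; strict rejects them

[[sinks.s3.batch_encoding.parquet.schema]]
name = "message"
type = "utf8"

[[sinks.s3.batch_encoding.parquet.schema]]
name = "timestamp"
type = "timestamp_millis"
```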
YAML
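A YAML sketch of the same configuration; the key names (`batch_encoding`, `schema_mode`, `schema`) are assumptions for illustration, not copied from the PR diff:

```yaml
sinks:
  s3:
    type: aws_s3
    bucket: my-bucket
    compression: none        # Parquet compresses internally
    batch_encoding:
      parquet:
        compression: snappy
        schema_mode: relaxed
        schema:
          - name: message
            type: utf8
          - name: timestamp
            type: timestamp_millis
```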
Supported field types: `utf8`, `int32`, `int64`, `float32`, `float64`, `boolean`, `date32`, `timestamp_millis`, `timestamp_micros`, `binary`, `large_utf8`

Design notes
Schema is explicit — defined in config, not inferred at runtime. This gives type safety and avoids per-event reflection.
Compression is internal to Parquet, not at the S3 level. Parquet compresses per column page, which gives much better ratios than compressing the whole file. The config validates this at startup: if you set `batch_encoding.parquet` with `compression != none`, you get a clear error.

Two schema modes: relaxed silently drops event fields not in the schema; strict rejects such events with an error.
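A minimal sketch of that startup check; the function name and parameters are hypothetical, not the PR's actual code:

```rust
// Sketch: reject sink-level compression when Parquet batch encoding is
// configured, since Parquet compresses its own column pages internally.
fn validate_compression(parquet_batch_encoding: bool, sink_compression: &str) -> Result<(), String> {
    if parquet_batch_encoding && sink_compression != "none" {
        return Err(format!(
            "`batch_encoding.parquet` requires sink-level `compression = \"none\"` \
             (got \"{sink_compression}\"); set compression inside the Parquet encoder instead"
        ));
    }
    Ok(())
}
```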
The encoding pipeline:
```mermaid
graph LR
    A[Vec of Event] --> B[serde_arrow]
    B --> C[Arrow RecordBatch]
    C --> D[ArrowWriter]
    D --> E[Parquet bytes in BytesMut]
    subgraph "Shared with ClickHouse"
        B
        C
    end
    subgraph "Parquet-specific"
        D
        E
    end
```

How it fits into the S3 sink:
```mermaid
graph TD
    A[Source] --> B[Transform Pipeline]
    B --> C{batch_encoding?}
    C -->|parquet| D[ParquetSerializer::encode]
    C -->|none| E[Framed Encoder NDJSON/JSON/etc]
    D --> F[S3 Request Builder]
    E --> F
    F --> G[.parquet / .json extension]
    G --> H[S3 PutObject]
    H --> I[Athena / Spark / Trino Query]
```

Performance
Benchmarks (Criterion, 50 samples, 10s measurement)
The NDJSON+Snappy column applies Snappy after JSON encoding — the fair apples-to-apples comparison since Parquet Snappy includes compression in the encoding step.
The interesting number — Parquet Snappy vs NDJSON+Snappy at scale:
There's roughly 25 us of fixed overhead (Arrow schema init, writer setup, RecordBatch creation) that amortizes away at larger batch sizes. At 1000 events Parquet is only 1.33x slower, and at production batch sizes (5k-10k) this converges toward ~1.1x.
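The amortization argument can be sketched with a toy cost model; all numbers below are illustrative, not benchmark data:

```rust
// Toy model: encode time = fixed setup overhead + per-event cost.
// The ratio against a baseline with no fixed overhead shrinks as the batch
// grows, which is why Parquet converges toward the NDJSON baseline at scale.
fn slowdown_ratio(fixed_us: f64, per_event_us: f64, baseline_per_event_us: f64, n: u32) -> f64 {
    let n = n as f64;
    (fixed_us + per_event_us * n) / (baseline_per_event_us * n)
}
```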
The tradeoff is worth it: Parquet output is 70-90% smaller than NDJSON and natively queryable without any transformation.
Hot path optimizations
The serializer hot path was profiled and optimized before submission:
- `ArrowWriter` writes directly into the output `BytesMut` via `.writer()` instead of going through an intermediate `Vec` + `put_slice` copy
- `WriterProperties` is wrapped in `Arc` instead of deep-cloning a `HashMap` on every batch
- `to_arrow_schema()` already creates nullable fields, so the second pass that did the same thing was removed
- A `HashSet` of schema field names is pre-built at construction instead of a linear scan per field per event

These are structural improvements that pay off at scale (large batches, wide schemas, strict mode) rather than in micro-benchmarks with 4-column, 1000-event batches.
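The strict-mode lookup optimization above can be sketched like this; the type and method names are illustrative, not the PR's actual API:

```rust
use std::collections::HashSet;

// Sketch: pre-build a HashSet of schema field names once at construction,
// so strict mode can check each event field with an O(1) lookup instead of
// an O(N) linear scan over the schema.
struct StrictSchema {
    field_names: HashSet<String>,
}

impl StrictSchema {
    fn new(fields: &[&str]) -> Self {
        Self {
            field_names: fields.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Returns the first event field not present in the schema, if any.
    fn reject_extra<'a>(&self, event_fields: &[&'a str]) -> Option<&'a str> {
        event_fields
            .iter()
            .copied()
            .find(|f| !self.field_names.contains(*f))
    }
}
```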
Memory
Parquet uses roughly 2x peak memory per batch due to the Arrow intermediate representation. At 1000 events that's ~400 KB (Snappy) vs ~200 KB (NDJSON). The `ParquetSerializer` struct itself is ~200 bytes. When the feature gate is off, there's zero overhead.

Compression: internal vs external

Parquet compression happens inside `encode()`; the sink-level `compression` option stays `none`.

Overall tradeoffs
Compatibility
Fully backward compatible.
`batch_encoding` is an optional field that defaults to `None`, so existing configs are untouched.

API surface changes:
- `BatchSerializerConfig`: new `Parquet(...)` variant
- `BatchSerializer`: new `Parquet(Box<ParquetSerializer>)` variant
- `S3SinkConfig`: new `batch_encoding: Option<BatchSerializerConfig>` field
- `build_record_batch()`: `pub` → `pub(crate)`
- `build_batch_serializer()`: `Box<BatchEncoder>`
- `BatchEncoder`/`EncoderKind`: `Box`-wrapped

When the `codecs-parquet` feature is disabled, the `batch_encoding` field is compile-time eliminated. When it's enabled but not configured, the existing framed encoder path runs with zero change.

Integration test (LocalStack S3)
End-to-end test sends 10 events through the full sink pipeline, downloads the S3 object, and validates:
- `.parquet` file extension — PASS
- `PAR1` magic bytes — PASS
- `message`, `host` columns in schema — PASS

Test plan
- Unit tests in `parquet.rs` (serialization, compression variants, schema modes, edge cases, optimization invariants)
- `cargo clippy` clean across default, `codecs-parquet`, and `codecs-parquet,codecs-arrow` feature combinations

Future work
All Parquet infrastructure (`BatchSerializerConfig`, `BatchEncoder`, `EncoderKind`, `ParquetSerializer`) lives in the shared codecs crate. Adding `batch_encoding` to GCS and Azure Blob sinks should be straightforward — both use the same `RequestBuilder<(Key, Vec<Event>)>` pattern as S3, so the change is mostly wiring up the existing `EncoderKind` dispatch. The file sink would need more work due to its per-event streaming architecture.

Change Type
Is this a breaking change?
Does this PR include user facing changes?
Related
`serde_arrow` with `arrow-json` #24661 — either can merge first, the second rebases