
feat(aws_s3 sink): Add Apache Parquet encoding support#24706

Open
szibis wants to merge 13 commits into vectordotdev:master from szibis:feat/s3-parquet-encoding

Conversation


@szibis szibis commented Feb 21, 2026

Summary

Adds native Parquet encoding to the S3 sink so users can write columnar files directly queryable by Athena, Trino, and Spark — no ETL pipeline needed.

Uses the BatchEncoder/EncoderKind infrastructure from #24124 and shares build_record_batch() with the ClickHouse Arrow IPC path. The whole thing is behind a codecs-parquet feature gate, so zero cost when you don't use it.

Configuration

TOML

[sinks.s3_parquet]
type = "aws_s3"
bucket = "my-analytics-bucket"
key_prefix = "logs/date=%F"
compression = "none"   # Parquet handles compression internally

[sinks.s3_parquet.batch_encoding]
parquet.compression = "snappy"   # snappy (default) | zstd | gzip | lz4 | none
parquet.schema_mode = "relaxed"  # relaxed (default) | strict

[[sinks.s3_parquet.batch_encoding.parquet.schema]]
name = "message"
data_type = "utf8"

[[sinks.s3_parquet.batch_encoding.parquet.schema]]
name = "timestamp"
data_type = "utf8"

[[sinks.s3_parquet.batch_encoding.parquet.schema]]
name = "host"
data_type = "utf8"

[[sinks.s3_parquet.batch_encoding.parquet.schema]]
name = "status_code"
data_type = "int64"

YAML

sinks:
  s3_parquet:
    type: aws_s3
    bucket: my-analytics-bucket
    key_prefix: "logs/date=%F"
    compression: none   # Parquet handles compression internally
    batch_encoding:
      parquet:
        compression: snappy   # snappy (default) | zstd | gzip | lz4 | none
        schema_mode: relaxed  # relaxed (default) | strict
        schema:
          - name: message
            data_type: utf8
          - name: timestamp
            data_type: utf8
          - name: host
            data_type: utf8
          - name: status_code
            data_type: int64

Supported field types: utf8, int32, int64, float32, float64, boolean, date32, timestamp_millis, timestamp_micros, binary, large_utf8
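For illustration, the `data_type` strings above could map onto a field-type enum roughly like this (a minimal std-only sketch; `ParquetFieldType` appears in the PR, but `parse_field_type` and the exact variant names here are assumptions, not Vector's actual API):

```rust
// Hypothetical sketch: map config `data_type` strings to an enum.
#[derive(Debug, PartialEq)]
enum ParquetFieldType {
    Utf8, Int32, Int64, Float32, Float64, Boolean,
    Date32, TimestampMillis, TimestampMicros, Binary, LargeUtf8,
}

fn parse_field_type(s: &str) -> Result<ParquetFieldType, String> {
    use ParquetFieldType::*;
    Ok(match s {
        "utf8" => Utf8,
        "int32" => Int32,
        "int64" => Int64,
        "float32" => Float32,
        "float64" => Float64,
        "boolean" => Boolean,
        "date32" => Date32,
        "timestamp_millis" => TimestampMillis,
        "timestamp_micros" => TimestampMicros,
        "binary" => Binary,
        "large_utf8" => LargeUtf8,
        // Anything else is rejected at config-load time.
        other => return Err(format!("unsupported data_type: {other}")),
    })
}
```

In practice Vector derives this deserialization from the configurable types, so a typo in `data_type` fails at startup rather than at encode time.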

Design notes

Schema is explicit — defined in config, not inferred at runtime. This gives type safety and avoids per-event reflection.

Compression is internal to Parquet, not at the S3 level. Parquet compresses per-column page, which gives much better ratios than compressing the whole file. The config validates this at startup: if you set batch_encoding.parquet with compression != none, you get a clear error.
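The startup check could be sketched like this (hypothetical names and error type; the real validation lives in the sink's config build path):

```rust
use std::fmt;

// Hypothetical stand-in for the sink's compression setting.
#[derive(PartialEq)]
enum Compression { None, Gzip, Zstd, Snappy }

struct ConfigError(String);

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

// Reject S3-level compression when Parquet encoding is configured,
// since Parquet already compresses per column page.
fn validate(parquet_configured: bool, compression: &Compression) -> Result<(), ConfigError> {
    if parquet_configured && *compression != Compression::None {
        return Err(ConfigError(
            "`compression` must be `none` when `batch_encoding.parquet` is set; \
             use `batch_encoding.parquet.compression` instead"
                .into(),
        ));
    }
    Ok(())
}
```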

Two schema modes:

  • Relaxed (default) — missing fields become null, extra fields are silently dropped. Good for heterogeneous logs.
  • Strict — missing fields become null, extra fields return an error. Good for validated pipelines.
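The two modes can be sketched as a projection of each event onto the configured schema (std-only illustration; `project_event` and the string-valued event map are assumptions for the sketch, not the real serializer):

```rust
use std::collections::{BTreeMap, HashSet};

#[derive(Clone, Copy, PartialEq)]
enum SchemaMode { Relaxed, Strict }

// Project an event onto the configured schema columns. Missing fields
// become None (null) in both modes; extra fields are silently dropped
// in Relaxed mode and rejected with an error in Strict mode.
fn project_event(
    event: &BTreeMap<String, String>,
    schema: &[&str],
    mode: SchemaMode,
) -> Result<Vec<Option<String>>, String> {
    if mode == SchemaMode::Strict {
        let known: HashSet<&str> = schema.iter().copied().collect();
        if let Some(extra) = event.keys().find(|k| !known.contains(k.as_str())) {
            return Err(format!("unexpected field `{extra}` in strict mode"));
        }
    }
    Ok(schema.iter().map(|name| event.get(*name).cloned()).collect())
}
```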

The encoding pipeline:

graph LR
    A[Vec of Event] --> B[serde_arrow]
    B --> C[Arrow RecordBatch]
    C --> D[ArrowWriter]
    D --> E[Parquet bytes in BytesMut]

    subgraph "Shared with ClickHouse"
        B
        C
    end

    subgraph "Parquet-specific"
        D
        E
    end

How it fits into the S3 sink:

graph TD
    A[Source] --> B[Transform Pipeline]
    B --> C{batch_encoding?}
    C -->|parquet| D[ParquetSerializer::encode]
    C -->|none| E[Framed Encoder NDJSON/JSON/etc]
    D --> F[S3 Request Builder]
    E --> F
    F --> G[.parquet / .json extension]
    G --> H[S3 PutObject]
    H --> I[Athena / Spark / Trino Query]

Performance

Benchmarks (Criterion, 50 samples, 10s measurement)

The NDJSON+Snappy column applies Snappy after JSON encoding — the fair apples-to-apples comparison since Parquet Snappy includes compression in the encoding step.

| Batch size | NDJSON raw | NDJSON+Snappy | Parquet Snappy | Parquet Zstd | Parquet none |
|---|---|---|---|---|---|
| 10 events | 2.31 µs | 2.95 µs | 24.87 µs | 41.56 µs | 23.66 µs |
| 100 events | 19.88 µs | 23.24 µs | 48.48 µs | 73.33 µs | 46.54 µs |
| 1000 events | 193.17 µs | 216.75 µs | 287.27 µs | 323.13 µs | 279.63 µs |

The interesting number is Parquet Snappy vs NDJSON+Snappy at scale:

| Batch size | NDJSON+Snappy | Parquet Snappy | Ratio |
|---|---|---|---|
| 10 events | 2.95 µs | 24.87 µs | 8.4x |
| 100 events | 23.24 µs | 48.48 µs | 2.1x |
| 1000 events | 216.75 µs | 287.27 µs | 1.33x |

There's roughly 25 µs of fixed overhead (Arrow schema init, writer setup, RecordBatch creation) that amortizes away at larger batch sizes. At 1000 events Parquet is only 1.33x slower, and at production batch sizes (5k-10k) this converges toward ~1.1x.

The tradeoff is worth it: Parquet output is 70-90% smaller than NDJSON and natively queryable without any transformation.

Hot path optimizations

The serializer hot path was profiled and optimized before submission:

| Optimization | What changed |
|---|---|
| No double buffering | ArrowWriter writes directly into the output BytesMut via .writer() instead of going through an intermediate Vec + put_slice copy |
| Arc&lt;WriterProperties&gt; | Writer config is shared via Arc instead of deep-cloning a HashMap on every batch |
| No redundant schema transform | to_arrow_schema() already creates nullable fields; removed the second pass that did the same thing |
| O(1) strict-mode validation | Pre-built HashSet of schema field names at construction instead of a linear scan per field per event |
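The Arc change in the second row amounts to replacing a per-batch deep clone with a refcount bump. A minimal sketch (the `WriterProps` type here is a stand-in for parquet's real WriterProperties):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for parquet's WriterProperties.
struct WriterProps {
    options: HashMap<String, String>,
}

// Each batch gets a cheap Arc clone (pointer copy + refcount bump)
// rather than deep-cloning the underlying HashMap.
fn per_batch_props(shared: &Arc<WriterProps>) -> Arc<WriterProps> {
    Arc::clone(shared)
}
```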

These are structural improvements that pay off at scale (large batches, wide schemas, strict mode) rather than in micro-benchmarks with 4-column, 1000-event batches.

Memory

Parquet uses roughly 2x peak memory per batch due to the Arrow intermediate representation. At 1000 events that's ~400 KB (Snappy) vs ~200 KB (NDJSON). The ParquetSerializer struct itself is ~200 bytes. When the feature gate is off, there's zero overhead.

Compression: internal vs external

| Type | Where | When |
|---|---|---|
| Parquet internal (Snappy/Zstd/Gzip/LZ4/none) | Per column page, inside the Parquet codec | During encode() |
| S3 external (gzip/zstd/snappy) | On the entire payload, by the sink | After encoding |
| Parquet path | Forces S3 compression: none | Config validation |

Overall tradeoffs

| Area | Impact |
|---|---|
| S3 storage cost | 70-90% smaller files vs NDJSON |
| Query performance (Athena/Spark) | Columnar + predicate pushdown |
| Encoding latency | 1.33x slower at 1000 events, converges at larger batches |
| Encoding memory | ~2x peak per batch (Arrow intermediate) |
| Existing configs | Zero impact when Parquet not configured |
| Binary size (feature off) | Zero cost |

Compatibility

Fully backward compatible. batch_encoding is an optional field that defaults to None, so existing configs are untouched.

| Type/Method | Change | Notes |
|---|---|---|
| BatchSerializerConfig | Added Parquet(...) variant | Additive |
| BatchSerializer | Added Parquet(Box&lt;ParquetSerializer&gt;) | Additive |
| S3SinkConfig | Added batch_encoding: Option&lt;BatchSerializerConfig&gt; | Optional field |
| build_record_batch() | pub -> pub(crate) | Internal API only |
| ClickHouse config | Updated to build_batch_serializer(), Box&lt;BatchEncoder&gt; | Functionally identical, shares Arrow IPC infra |
| BatchEncoder / EncoderKind | Box-wrapped | Transparent via Deref |

When codecs-parquet feature is disabled, the batch_encoding field is compile-time eliminated. When it's enabled but not configured, the existing framed encoder path runs with zero change.
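Compile-time elimination here means the field does not exist in the built binary at all, roughly like this (illustrative sketch; `ParquetSettings` is a placeholder, and the real field holds `Option<BatchSerializerConfig>`):

```rust
// When the `codecs-parquet` feature is off, this field is removed
// from the struct entirely by the cfg attribute.
#[derive(Default)]
struct S3SinkConfig {
    bucket: String,
    #[cfg(feature = "codecs-parquet")]
    batch_encoding: Option<ParquetSettings>,
}

// Placeholder for the Parquet config; only compiled with the feature.
#[cfg(feature = "codecs-parquet")]
#[derive(Default)]
struct ParquetSettings {
    compression: String,
}
```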

Integration test (LocalStack S3)

End-to-end test sends 10 events through the full sink pipeline, downloads the S3 object, and validates:

  • .parquet file extension — PASS
  • PAR1 magic bytes — PASS
  • 10 rows in file — PASS
  • message, host columns in schema — PASS

Test plan

  • 14 unit tests in parquet.rs (serialization, compression variants, schema modes, edge cases, optimization invariants)
  • 5 existing encoder tests still pass
  • Integration test against LocalStack S3
  • Criterion benchmarks: Parquet (Snappy/Zstd/None) vs NDJSON (Raw/Snappy)
  • cargo clippy clean across default, codecs-parquet, and codecs-parquet,codecs-arrow feature combinations
  • Existing S3 and ClickHouse tests unaffected

Future work

All Parquet infrastructure (BatchSerializerConfig, BatchEncoder, EncoderKind, ParquetSerializer) lives in the shared codecs crate. Adding batch_encoding to GCS and Azure Blob sinks should be straightforward — both use the same RequestBuilder<(Key, Vec<Event>)> pattern as S3, so the change is mostly wiring up the existing EncoderKind dispatch. The file sink would need more work due to its per-event streaming architecture.
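The dispatch that would need wiring up in those sinks looks roughly like this (names echo the PR's `EncoderKind`/`BatchEncoder`, but the bodies are illustrative stand-ins, not Vector's implementation):

```rust
// Batch variant is boxed to keep the enum's two variants close in size.
enum EncoderKind {
    Framed,                   // per-event NDJSON/JSON encoding
    Batch(Box<BatchEncoder>), // whole-batch columnar encoding
}

// Stand-in for the real batch encoder; here it only carries the
// file extension used by the request builder.
struct BatchEncoder {
    extension: &'static str,
}

fn file_extension(kind: &EncoderKind) -> &'static str {
    match kind {
        EncoderKind::Framed => "json",
        EncoderKind::Batch(enc) => enc.extension,
    }
}
```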

Change Type

  • New feature

Is this a breaking change?

  • No

Does this PR include user facing changes?

  • Yes. Changelog fragment included.

…a modes

Add ParquetSerializer that reuses the Arrow record batch building logic
to encode Vec<Event> as complete Parquet files. Supports five compression
codecs (None, Snappy, Zstd, Gzip, Lz4) and two schema modes (Relaxed
drops extra fields silently, Strict rejects them with an error).
…zerConfig

Add Parquet variant to BatchSerializer enum and BatchSerializerConfig,
extending the batch encoding infrastructure to support Parquet output.
Rename build() to build_batch_serializer() returning BatchSerializer
directly, simplifying the API for all batch serializer consumers.
Update ClickHouse sink to use the new method signature.
Refactor S3 sink to use EncoderKind, enabling both traditional framed
encoding and batch-based columnar formats. Add batch_encoding config
option (behind codecs-parquet feature) that supports Parquet output
with automatic .parquet file extension and internal compression bypass.
Replace internal-only Arrow Schema with user-facing ParquetSchemaField
and ParquetFieldType types that can be deserialized from TOML/JSON
config. Supports boolean, int32/64, float32/64, utf8, binary,
timestamp, and date types. Add config deserialization tests.
Box large enum variants (BatchSerializer::Parquet, EncoderKind::Batch)
to reduce size difference between variants. Derive Default for
ParquetSerializerConfig instead of manual impl. Apply rustfmt.
Add S3 Parquet integration test validating end-to-end encoding with
LocalStack (magic bytes, row count, schema columns). Add Criterion
benchmarks comparing Parquet (Snappy/Zstd/None) vs NDJSON baseline
at 10/100/1000 event batch sizes.
…eness

Add NDJSON+Snappy compressed baseline to Parquet benchmarks for
apples-to-apples comparison. Fix non-exhaustive match in ClickHouse
config for new Parquet variant.
- Eliminate double-buffering: write ArrowWriter directly into output
  BytesMut instead of intermediate Vec + put_slice copy
- Wrap WriterProperties in Arc to avoid deep-cloning HashMap per batch
- Remove redundant nullable schema transformation (to_arrow_schema
  already creates nullable fields)
- Pre-build HashSet of schema field names at construction for O(1)
  strict-mode lookups instead of O(N) linear scan
- Add 6 tests covering optimization invariants
Add Parquet encoding section to aws_s3.cue how_it_works with TOML and
YAML examples. Keep CUE high-level per reviewer feedback from vectordotdev#24372
(field-level docs belong in Rust source, not CUE). Add docs::examples
metadata to configurable fields and improve doc comments on
ParquetSerializerConfig.

iadjivon commented Feb 23, 2026

Hi there, Thanks for your PR. Adding a work in progress Label for the docs team. Please remove this once the Vector team approves this PR and we will go ahead and review/approve as well. Thanks!
