
feat(aws_s3 sink): Add Apache Parquet encoding support#24706

Open
szibis wants to merge 13 commits into vectordotdev:master from szibis:feat/s3-parquet-encoding

Conversation


@szibis szibis commented Feb 21, 2026

Summary

Adds native Parquet encoding to the S3 sink so users can write columnar files directly queryable by Athena, Trino, and Spark — no ETL pipeline needed.

Uses the BatchEncoder/EncoderKind infrastructure from #24124 and shares build_record_batch() with the ClickHouse Arrow IPC path. The whole thing is behind a codecs-parquet feature gate, so zero cost when you don't use it.

Configuration

TOML

[sinks.s3_parquet]
type = "aws_s3"
bucket = "my-analytics-bucket"
key_prefix = "logs/date=%F"
compression = "none"   # Parquet handles compression internally

[sinks.s3_parquet.batch_encoding]
parquet.compression = "snappy"   # snappy (default) | zstd | gzip | lz4 | none
parquet.schema_mode = "relaxed"  # relaxed (default) | strict

[[sinks.s3_parquet.batch_encoding.parquet.schema]]
name = "message"
data_type = "utf8"

[[sinks.s3_parquet.batch_encoding.parquet.schema]]
name = "timestamp"
data_type = "utf8"

[[sinks.s3_parquet.batch_encoding.parquet.schema]]
name = "host"
data_type = "utf8"

[[sinks.s3_parquet.batch_encoding.parquet.schema]]
name = "status_code"
data_type = "int64"

YAML

sinks:
  s3_parquet:
    type: aws_s3
    bucket: my-analytics-bucket
    key_prefix: "logs/date=%F"
    compression: none   # Parquet handles compression internally
    batch_encoding:
      parquet:
        compression: snappy   # snappy (default) | zstd | gzip | lz4 | none
        schema_mode: relaxed  # relaxed (default) | strict
        schema:
          - name: message
            data_type: utf8
          - name: timestamp
            data_type: utf8
          - name: host
            data_type: utf8
          - name: status_code
            data_type: int64

Supported field types: utf8, int32, int64, float32, float64, boolean, date32, timestamp_millis, timestamp_micros, binary, large_utf8
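For illustration, the `data_type` strings above could map onto a field-type enum roughly like this (a minimal std-only sketch; `ParquetFieldType` appears in the PR, but `parse_field_type` and the exact variant names here are assumptions, not Vector's actual API):

```rust
// Hypothetical sketch: map config `data_type` strings to an enum.
#[derive(Debug, PartialEq)]
enum ParquetFieldType {
    Utf8, Int32, Int64, Float32, Float64, Boolean,
    Date32, TimestampMillis, TimestampMicros, Binary, LargeUtf8,
}

fn parse_field_type(s: &str) -> Result<ParquetFieldType, String> {
    use ParquetFieldType::*;
    Ok(match s {
        "utf8" => Utf8,
        "int32" => Int32,
        "int64" => Int64,
        "float32" => Float32,
        "float64" => Float64,
        "boolean" => Boolean,
        "date32" => Date32,
        "timestamp_millis" => TimestampMillis,
        "timestamp_micros" => TimestampMicros,
        "binary" => Binary,
        "large_utf8" => LargeUtf8,
        // Anything else is rejected at config-load time.
        other => return Err(format!("unsupported data_type: {other}")),
    })
}
```

In practice Vector derives this deserialization from the configurable types, so a typo in `data_type` fails at startup rather than at encode time.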

Design notes

Schema is explicit — defined in config, not inferred at runtime. This gives type safety and avoids per-event reflection.

Compression is internal to Parquet, not at the S3 level. Parquet compresses per-column page, which gives much better ratios than compressing the whole file. The config validates this at startup: if you set batch_encoding.parquet with compression != none, you get a clear error.
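The startup check could be sketched like this (hypothetical names and error type; the real validation lives in the sink's config build path):

```rust
use std::fmt;

// Hypothetical stand-in for the sink's compression setting.
#[derive(PartialEq)]
enum Compression { None, Gzip, Zstd, Snappy }

struct ConfigError(String);

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

// Reject S3-level compression when Parquet encoding is configured,
// since Parquet already compresses per column page.
fn validate(parquet_configured: bool, compression: &Compression) -> Result<(), ConfigError> {
    if parquet_configured && *compression != Compression::None {
        return Err(ConfigError(
            "`compression` must be `none` when `batch_encoding.parquet` is set; \
             use `batch_encoding.parquet.compression` instead"
                .into(),
        ));
    }
    Ok(())
}
```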

Two schema modes:

  • Relaxed (default) — missing fields become null, extra fields are silently dropped. Good for heterogeneous logs.
  • Strict — missing fields become null, extra fields return an error. Good for validated pipelines.
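The two modes can be sketched as a projection of each event onto the configured schema (std-only illustration; `project_event` and the string-valued event map are assumptions for the sketch, not the real serializer):

```rust
use std::collections::{BTreeMap, HashSet};

#[derive(Clone, Copy, PartialEq)]
enum SchemaMode { Relaxed, Strict }

// Project an event onto the configured schema columns. Missing fields
// become None (null) in both modes; extra fields are silently dropped
// in Relaxed mode and rejected with an error in Strict mode.
fn project_event(
    event: &BTreeMap<String, String>,
    schema: &[&str],
    mode: SchemaMode,
) -> Result<Vec<Option<String>>, String> {
    if mode == SchemaMode::Strict {
        let known: HashSet<&str> = schema.iter().copied().collect();
        if let Some(extra) = event.keys().find(|k| !known.contains(k.as_str())) {
            return Err(format!("unexpected field `{extra}` in strict mode"));
        }
    }
    Ok(schema.iter().map(|name| event.get(*name).cloned()).collect())
}
```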

The encoding pipeline:

graph LR
    A[Vec of Event] --> B[serde_arrow]
    B --> C[Arrow RecordBatch]
    C --> D[ArrowWriter]
    D --> E[Parquet bytes in BytesMut]

    subgraph "Shared with ClickHouse"
        B
        C
    end

    subgraph "Parquet-specific"
        D
        E
    end

How it fits into the S3 sink:

graph TD
    A[Source] --> B[Transform Pipeline]
    B --> C{batch_encoding?}
    C -->|parquet| D[ParquetSerializer::encode]
    C -->|none| E[Framed Encoder NDJSON/JSON/etc]
    D --> F[S3 Request Builder]
    E --> F
    F --> G[.parquet / .json extension]
    G --> H[S3 PutObject]
    H --> I[Athena / Spark / Trino Query]

Performance

Benchmarks (Criterion, 50 samples, 10s measurement)

The NDJSON+Snappy column applies Snappy after JSON encoding — the fair apples-to-apples comparison since Parquet Snappy includes compression in the encoding step.

| Batch size | NDJSON raw | NDJSON+Snappy | Parquet Snappy | Parquet Zstd | Parquet none |
|---|---|---|---|---|---|
| 10 events | 2.31 µs | 2.95 µs | 24.87 µs | 41.56 µs | 23.66 µs |
| 100 events | 19.88 µs | 23.24 µs | 48.48 µs | 73.33 µs | 46.54 µs |
| 1000 events | 193.17 µs | 216.75 µs | 287.27 µs | 323.13 µs | 279.63 µs |

The interesting number is Parquet Snappy vs NDJSON+Snappy at scale:

| Batch size | NDJSON+Snappy | Parquet Snappy | Ratio |
|---|---|---|---|
| 10 events | 2.95 µs | 24.87 µs | 8.4x |
| 100 events | 23.24 µs | 48.48 µs | 2.1x |
| 1000 events | 216.75 µs | 287.27 µs | 1.33x |

There's roughly 25 µs of fixed overhead (Arrow schema init, writer setup, RecordBatch creation) that amortizes away at larger batch sizes. At 1000 events Parquet is only 1.33x slower, and at production batch sizes (5k-10k) this converges toward ~1.1x.

The tradeoff is worth it: Parquet output is 70-90% smaller than NDJSON and natively queryable without any transformation.

Hot path optimizations

The serializer hot path was profiled and optimized before submission:

| Optimization | What changed |
|---|---|
| No double buffering | ArrowWriter writes directly into the output BytesMut via .writer() instead of going through an intermediate Vec + put_slice copy |
| Arc&lt;WriterProperties&gt; | Writer config is shared via Arc instead of deep-cloning a HashMap on every batch |
| No redundant schema transform | to_arrow_schema() already creates nullable fields; removed the second pass that did the same thing |
| O(1) strict-mode validation | Pre-built HashSet of schema field names at construction instead of a linear scan per field per event |
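The Arc change in the second row amounts to replacing a per-batch deep clone with a refcount bump. A minimal sketch (the `WriterProps` type here is a stand-in for parquet's real WriterProperties):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for parquet's WriterProperties.
struct WriterProps {
    options: HashMap<String, String>,
}

// Each batch gets a cheap Arc clone (pointer copy + refcount bump)
// rather than deep-cloning the underlying HashMap.
fn per_batch_props(shared: &Arc<WriterProps>) -> Arc<WriterProps> {
    Arc::clone(shared)
}
```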

These are structural improvements that pay off at scale (large batches, wide schemas, strict mode) rather than in micro-benchmarks with 4-column, 1000-event batches.

Memory

Parquet uses roughly 2x peak memory per batch due to the Arrow intermediate representation. At 1000 events that's ~400 KB (Snappy) vs ~200 KB (NDJSON). The ParquetSerializer struct itself is ~200 bytes. When the feature gate is off, there's zero overhead.

Compression: internal vs external

| Type | Where | When |
|---|---|---|
| Parquet internal (Snappy/Zstd/Gzip/LZ4/none) | Per column page, inside the Parquet codec | During encode() |
| S3 external (gzip/zstd/snappy) | On the entire payload, by the sink | After encoding |
| Parquet path | Forces S3 compression: none | Config validation |

Overall tradeoffs

| Area | Impact |
|---|---|
| S3 storage cost | 70-90% smaller files vs NDJSON |
| Query performance (Athena/Spark) | Columnar + predicate pushdown |
| Encoding latency | 1.33x slower at 1000 events, converges at larger batches |
| Encoding memory | ~2x peak per batch (Arrow intermediate) |
| Existing configs | Zero impact when Parquet not configured |
| Binary size (feature off) | Zero cost |

Compatibility

Fully backward compatible. batch_encoding is an optional field that defaults to None, so existing configs are untouched.

| Type/Method | Change | Notes |
|---|---|---|
| BatchSerializerConfig | Added Parquet(...) variant | Additive |
| BatchSerializer | Added Parquet(Box&lt;ParquetSerializer&gt;) | Additive |
| S3SinkConfig | Added batch_encoding: Option&lt;BatchSerializerConfig&gt; | Optional field |
| build_record_batch() | pub -> pub(crate) | Internal API only |
| ClickHouse config | Updated to build_batch_serializer(), Box&lt;BatchEncoder&gt; | Functionally identical, shares Arrow IPC infra |
| BatchEncoder / EncoderKind | Box-wrapped | Transparent via Deref |

When codecs-parquet feature is disabled, the batch_encoding field is compile-time eliminated. When it's enabled but not configured, the existing framed encoder path runs with zero change.
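Compile-time elimination here means the field does not exist in the built binary at all, roughly like this (illustrative sketch; `ParquetSettings` is a placeholder, and the real field holds `Option<BatchSerializerConfig>`):

```rust
// When the `codecs-parquet` feature is off, this field is removed
// from the struct entirely by the cfg attribute.
#[derive(Default)]
struct S3SinkConfig {
    bucket: String,
    #[cfg(feature = "codecs-parquet")]
    batch_encoding: Option<ParquetSettings>,
}

// Placeholder for the Parquet config; only compiled with the feature.
#[cfg(feature = "codecs-parquet")]
#[derive(Default)]
struct ParquetSettings {
    compression: String,
}
```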

Integration test (LocalStack S3)

End-to-end test sends 10 events through the full sink pipeline, downloads the S3 object, and validates:

  • .parquet file extension — PASS
  • PAR1 magic bytes — PASS
  • 10 rows in file — PASS
  • message, host columns in schema — PASS

Test plan

  • 14 unit tests in parquet.rs (serialization, compression variants, schema modes, edge cases, optimization invariants)
  • 5 existing encoder tests still pass
  • Integration test against LocalStack S3
  • Criterion benchmarks: Parquet (Snappy/Zstd/None) vs NDJSON (Raw/Snappy)
  • cargo clippy clean across default, codecs-parquet, and codecs-parquet,codecs-arrow feature combinations
  • Existing S3 and ClickHouse tests unaffected

Future work

All Parquet infrastructure (BatchSerializerConfig, BatchEncoder, EncoderKind, ParquetSerializer) lives in the shared codecs crate. Adding batch_encoding to GCS and Azure Blob sinks should be straightforward — both use the same RequestBuilder<(Key, Vec<Event>)> pattern as S3, so the change is mostly wiring up the existing EncoderKind dispatch. The file sink would need more work due to its per-event streaming architecture.
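The dispatch that would need wiring up in those sinks looks roughly like this (names echo the PR's `EncoderKind`/`BatchEncoder`, but the bodies are illustrative stand-ins, not Vector's implementation):

```rust
// Batch variant is boxed to keep the enum's two variants close in size.
enum EncoderKind {
    Framed,                   // per-event NDJSON/JSON encoding
    Batch(Box<BatchEncoder>), // whole-batch columnar encoding
}

// Stand-in for the real batch encoder; here it only carries the
// file extension used by the request builder.
struct BatchEncoder {
    extension: &'static str,
}

fn file_extension(kind: &EncoderKind) -> &'static str {
    match kind {
        EncoderKind::Framed => "json",
        EncoderKind::Batch(enc) => enc.extension,
    }
}
```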

Change Type

  • New feature

Is this a breaking change?

  • No

Does this PR include user facing changes?

  • Yes. Changelog fragment included.

…a modes

Add ParquetSerializer that reuses the Arrow record batch building logic
to encode Vec<Event> as complete Parquet files. Supports five compression
codecs (None, Snappy, Zstd, Gzip, Lz4) and two schema modes (Relaxed
drops extra fields silently, Strict rejects them with an error).
…zerConfig

Add Parquet variant to BatchSerializer enum and BatchSerializerConfig,
extending the batch encoding infrastructure to support Parquet output.
Rename build() to build_batch_serializer() returning BatchSerializer
directly, simplifying the API for all batch serializer consumers.
Update ClickHouse sink to use the new method signature.
Refactor S3 sink to use EncoderKind, enabling both traditional framed
encoding and batch-based columnar formats. Add batch_encoding config
option (behind codecs-parquet feature) that supports Parquet output
with automatic .parquet file extension and internal compression bypass.
Replace internal-only Arrow Schema with user-facing ParquetSchemaField
and ParquetFieldType types that can be deserialized from TOML/JSON
config. Supports boolean, int32/64, float32/64, utf8, binary,
timestamp, and date types. Add config deserialization tests.
Box large enum variants (BatchSerializer::Parquet, EncoderKind::Batch)
to reduce size difference between variants. Derive Default for
ParquetSerializerConfig instead of manual impl. Apply rustfmt.
Add S3 Parquet integration test validating end-to-end encoding with
LocalStack (magic bytes, row count, schema columns). Add Criterion
benchmarks comparing Parquet (Snappy/Zstd/None) vs NDJSON baseline
at 10/100/1000 event batch sizes.
…eness

Add NDJSON+Snappy compressed baseline to Parquet benchmarks for
apples-to-apples comparison. Fix non-exhaustive match in ClickHouse
config for new Parquet variant.
- Eliminate double-buffering: write ArrowWriter directly into output
  BytesMut instead of intermediate Vec + put_slice copy
- Wrap WriterProperties in Arc to avoid deep-cloning HashMap per batch
- Remove redundant nullable schema transformation (to_arrow_schema
  already creates nullable fields)
- Pre-build HashSet of schema field names at construction for O(1)
  strict-mode lookups instead of O(N) linear scan
- Add 6 tests covering optimization invariants
Add Parquet encoding section to aws_s3.cue how_it_works with TOML and
YAML examples. Keep CUE high-level per reviewer feedback from vectordotdev#24372
(field-level docs belong in Rust source, not CUE). Add docs::examples
metadata to configurable fields and improve doc comments on
ParquetSerializerConfig.

iadjivon commented Feb 23, 2026

Hi there, Thanks for your PR. Adding a work in progress Label for the docs team. Please remove this once the Vector team approves this PR and we will go ahead and review/approve as well. Thanks!
