
feat: Add CLP connector. #14

Merged
wraymo merged 31 commits into
y-scope:presto-0.293-clp-connector from
wraymo:clp_integration_new
Jul 10, 2025


wraymo commented Jun 25, 2025

Description

Overview

The current Presto–CLP connector PR introduces the coordinator-side implementation, along with a placeholder (dummy) worker implementation. Detailed information about the overall design is available in the corresponding RFC. This Velox PR focuses on the worker-side logic.

The Velox CLP connector enables query execution on CLP archives. The Velox worker receives split information and the associated KQL query from the Presto coordinator. For each split, it executes the KQL query against the relevant CLP archive to find matching messages and stores their indices.

To support lazy evaluation, the implementation creates lazy vectors that wrap a CLP column reader and the list of matching indices. When accessed during query execution, these vectors load and decode only the necessary data on demand.

Core Classes

ClpDataSource

This class extends DataSource and implements the addSplit and next methods. During initialization, it records the KQL query and archive source (S3 or local), then traverses the output type to map Presto fields to CLP projection fields. Only ARRAY(VARCHAR) and primitive leaf fields like BIGINT, DOUBLE, BOOLEAN and VARCHAR are projected.
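
The output-type traversal described above can be illustrated with a small Python model. This is only a sketch under assumptions, not the connector's C++ code: the nested-dict representation of a row type, the dotted path naming, the `collect_projections` helper, and the choice to silently skip unsupported leaf types are all hypothetical. The set of projectable types lists only the ones named in this description.

```python
# Leaf types named in the description as projectable (the real list may be longer).
PROJECTABLE = {"BIGINT", "DOUBLE", "BOOLEAN", "VARCHAR", "ARRAY(VARCHAR)"}


def collect_projections(row_type, prefix=""):
    """Walk a row type (modelled as a nested dict) and return (path, type) leaf pairs.

    Nested dicts stand in for nested ROW types; strings stand in for leaf types.
    """
    projections = []
    for name, field_type in row_type.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(field_type, dict):
            # Nested row: recurse and keep building the dotted path.
            projections.extend(collect_projections(field_type, path))
        elif field_type in PROJECTABLE:
            projections.append((path, field_type))
        # Assumption: any other leaf type is skipped rather than rejected.
    return projections


output_type = {
    "timestamp": "BIGINT",
    "user": {"name": "VARCHAR", "tags": "ARRAY(VARCHAR)"},
    "payload": "ARRAY(BIGINT)",
}
print(collect_projections(output_type))
```

In this model the `payload` column yields no projection because `ARRAY(BIGINT)` is not among the listed types.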

When a split is added, a ClpCursor is created with the archive path and input source. The query is parsed and simplified into an AST. On next, the cursor finds matching row indices and, if any exist, returns a row vector composed of lazy vectors, which load data as needed during execution.

ClpCursor

This class manages the execution of a query over a CLP-S archive. It handles parsing and validation, loading schemas and archives, setting up projection fields, and filtering results. In CLP-S, records are partitioned by schema. ClpCursor uses ClpQueryRunner to initialize the execution context for each schema and evaluate the filters. It skips archives where dictionary lookups for string filters return no matches, and within an archive it scans only the schemas relevant to the query. For example, consider a log dataset with the following records:

{"a": "Hello", "b": 2}
{"a": "World", "b": 0, "c": false}
{"a": "World", "c": true}

The three log messages have different schemas. If we run the KQL query "a: World AND b: 0", the cursor skips loading the third message because its schema does not match the query (it has no b field). If the query is "a: random AND b: 0", the cursor does not even scan the first two records, because "random" cannot be found in the dictionary.
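
The two pruning steps in this example can be sketched as a toy Python model. It is an illustration under assumptions, not CLP's API: the `build_archive` and `search` helpers, the in-memory grouping by sorted field names, and the single set used as a string dictionary are all hypothetical, and only AND-ed exact-match filters are modelled.

```python
RECORDS = [
    {"a": "Hello", "b": 2},
    {"a": "World", "b": 0, "c": False},
    {"a": "World", "c": True},
]


def build_archive(records):
    """Group records by schema (sorted field names) and collect string values."""
    tables = {}
    dictionary = set()
    for record in records:
        schema = tuple(sorted(record))
        tables.setdefault(schema, []).append(record)
        dictionary.update(v for v in record.values() if isinstance(v, str))
    return tables, dictionary


def search(tables, dictionary, conjuncts):
    """Evaluate AND-ed exact-match filters; return (matches, scanned_schemas)."""
    # Archive-level pruning: a string literal missing from the dictionary
    # means no record can match, so nothing is scanned.
    for _, value in conjuncts:
        if isinstance(value, str) and value not in dictionary:
            return [], []
    matches, scanned = [], []
    for schema, rows in tables.items():
        # Schema-level pruning: skip schemas that lack a queried field.
        if any(field not in schema for field, _ in conjuncts):
            continue
        scanned.append(schema)
        matches.extend(r for r in rows if all(r[f] == v for f, v in conjuncts))
    return matches, scanned


tables, dictionary = build_archive(RECORDS)
print(search(tables, dictionary, [("a", "World"), ("b", 0)]))
print(search(tables, dictionary, [("a", "random"), ("b", 0)]))
```

For the first query the model scans the two schemas that contain a b field and never touches the third record; for the second it returns before scanning any schema.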

ClpQueryRunner

This class extends the generic CLP QueryRunner to support ordered projection and row filtering. It initializes projected column readers and returns filtered row indices for each batch.
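
As a rough illustration of returning filtered row indices batch by batch, here is a minimal Python sketch. The `filtered_indices_per_batch` helper is hypothetical, and reporting indices relative to the whole table (rather than to each batch) is an assumption made for this example, not a statement about ClpQueryRunner.

```python
def filtered_indices_per_batch(rows, predicate, batch_size):
    """Yield, for each batch, the indices of rows that satisfy the predicate.

    Indices are positions in the full row list (an assumption of this sketch).
    """
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield [start + offset for offset, row in enumerate(batch) if predicate(row)]


rows = [{"b": 2}, {"b": 0}, {"b": 0}, {"b": 5}, {"b": 0}]
for indices in filtered_indices_per_batch(rows, lambda r: r["b"] == 0, batch_size=2):
    print(indices)
```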

ClpVectorLoader

In CLP, values are decoded and read from a BaseColumnReader. ClpVectorLoader is a custom Velox VectorLoader that loads vectors from CLP column readers. It supports integers, floats, booleans, strings, and arrays of strings. Lazy vectors use it to load data on demand from the previously stored row indices.
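
The load-on-demand idea can be shown with a small Python analogue. It is a sketch only: `ColumnReader` and `LazyColumn` are hypothetical stand-ins for a CLP column reader and a Velox lazy vector, and caching the decoded values after the first access is an assumption of this example.

```python
class ColumnReader:
    """Stand-in for a column reader that decodes one value per call."""

    def __init__(self, values):
        self._values = values
        self.decoded = 0  # number of decode calls, to make laziness visible

    def extract(self, index):
        self.decoded += 1
        return self._values[index]


class LazyColumn:
    """Holds a reader and the matching row indices; decodes on first access."""

    def __init__(self, reader, row_indices):
        self._reader = reader
        self._row_indices = row_indices
        self._loaded = None

    def values(self):
        if self._loaded is None:
            # Decode only the rows that passed the filter.
            self._loaded = [self._reader.extract(i) for i in self._row_indices]
        return self._loaded


reader = ColumnReader(["Hello", "World", "World"])
column = LazyColumn(reader, row_indices=[1, 2])
print(reader.decoded)   # nothing decoded until the column is accessed
print(column.values())
print(reader.decoded)   # only the two matching rows were decoded
```

A column that the query never reads is never decoded, which is the benefit the lazy vectors are meant to provide.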

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Summary by CodeRabbit

Release Notes


New Features

  • Added a CLP connector for reading and querying CLP archives from local file systems or S3.
  • Introduced CLP-specific configuration, including storage type selection (FS or S3).
  • Bundled new dependencies required for the CLP connector, managed via CMake.
  • Implemented lazy vector loading for efficient data access in CLP queries.
  • Added support for nested row types and predicate pushdown in CLP queries.

Tests

  • Added comprehensive test suite with example CLP archive data to validate connector functionality and predicate pushdown behavior.

Documentation

  • Updated connector documentation to include the CLP connector and its usage.
  • Added configuration documentation for CLP connector options.



3 participants