feat: Add CLP connector. #14
Merged
wraymo merged 31 commits on Jul 10, 2025
Description
Overview
The current Presto–CLP connector PR introduces the coordinator-side implementation, along with a placeholder (dummy) worker implementation. Detailed information about the overall design is available in the corresponding RFC. This Velox PR focuses on the worker-side logic.
The Velox CLP connector enables query execution on CLP archives. The Velox worker receives split information and the associated KQL query from the Presto coordinator. For each split, it executes the KQL query against the relevant CLP archive to find matching messages and stores their indices.
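The per-split filtering step can be pictured with a minimal sketch. It is not the connector's code: `LogRecord`, `RowFilter`, `findMatchingIndices`, and `matchesWorldAndZero` are hypothetical names introduced for illustration, the filter is a plain callback standing in for a parsed KQL query, and string matching is simplified to exact equality, which may differ from real KQL semantics.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one decoded log record; not a CLP or Velox type.
struct LogRecord {
  std::string a;
  bool hasB; // false when the record's schema has no `b` field
  std::int64_t b;
};

// Hypothetical stand-in for a parsed and simplified KQL filter.
using RowFilter = std::function<bool(const LogRecord&)>;

// Scans the rows of one split and returns the indices of the matching rows.
// Only indices are stored; values can be materialized later, on demand.
std::vector<std::size_t> findMatchingIndices(
    const std::vector<LogRecord>& rows,
    const RowFilter& filter) {
  std::vector<std::size_t> matches;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (filter(rows[i])) {
      matches.push_back(i);
    }
  }
  return matches;
}

// Illustrative equivalent of a query such as `a: World AND b: 0`,
// assuming exact string equality.
bool matchesWorldAndZero(const LogRecord& r) {
  return r.a == "World" && r.hasB && r.b == 0;
}
```

Calling `findMatchingIndices(rows, matchesWorldAndZero)` yields only row positions, so no column values need to be copied until a downstream operator asks for them.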
To support lazy evaluation, the implementation creates lazy vectors that wrap a CLP column reader and the list of matching indices. When accessed during query execution, these vectors load and decode only the necessary data on demand.
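As a rough illustration of the lazy-loading idea, the sketch below wraps a reader callback and a list of matching row indices, and decodes values only on first access. `LazyColumn`, `readValue`, and `isLoaded` are hypothetical names used only here; Velox's actual `LazyVector` and `VectorLoader` interfaces are richer and differ from this.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical illustration only; this is not the Velox LazyVector API.
template <typename T>
class LazyColumn {
 public:
  // `readValue` stands in for a column reader: it decodes the value stored
  // at a given row. `rowIndices` holds the rows selected by the filter.
  LazyColumn(
      std::function<T(std::size_t)> readValue,
      std::vector<std::size_t> rowIndices)
      : readValue_(std::move(readValue)),
        rowIndices_(std::move(rowIndices)) {}

  // Decodes the selected rows on first access and caches the result, so
  // repeated accesses do not decode again.
  const std::vector<T>& values() {
    if (!loaded_) {
      values_.reserve(rowIndices_.size());
      for (std::size_t row : rowIndices_) {
        values_.push_back(readValue_(row));
      }
      loaded_ = true;
    }
    return values_;
  }

  bool isLoaded() const {
    return loaded_;
  }

 private:
  std::function<T(std::size_t)> readValue_;
  std::vector<std::size_t> rowIndices_;
  std::vector<T> values_;
  bool loaded_ = false;
};
```

A column that no operator ever reads is never decoded in this sketch, which is the property the lazy vectors are meant to provide.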
Core Classes
ClpDataSource
This class extends DataSource and implements the addSplit and next methods. During initialization, it records the KQL query and archive source (S3 or local), then traverses the output type to map Presto fields to CLP projection fields. Only ARRAY(VARCHAR) and primitive leaf fields such as BIGINT, DOUBLE, BOOLEAN and VARCHAR are projected.
When a split is added, a ClpCursor is created with the archive path and input source. The query is parsed and simplified into an AST. On next, the cursor finds matching row indices and, if any exist, returns a row vector composed of lazy vectors, which load data as needed during execution.
ClpCursor
This class manages the execution of a query over a CLP-S archive. It handles parsing and validation, loading schemas and archives, setting up projection fields, and filtering results. In CLP-S, records are partitioned by schemas. ClpCursor uses ClpQueryRunner to initialize the execution context for each schema and evaluate the filters. It skips archives where dictionary lookups for string filters return no matches, and within a given archive it scans only the relevant schemas. For example, consider a log dataset with the following records.
The three log messages have varying schemas. If we run the KQL query a: World AND b: 0, it will skip loading the third message because its schema does not match the query (there is no b field). And if the query is a: random AND b: 0, it will even skip scanning the first two records, because random cannot be found in the dictionary.
ClpQueryRunner
This class extends the generic CLP QueryRunner to support ordered projection and row filtering. It initializes projected column readers and returns filtered row indices for each batch.
ClpVectorLoader
In CLP, values are decoded and read from a BaseColumnReader. The ClpVectorLoader is a custom Velox VectorLoader that loads vectors from CLP column readers. It supports integers, floats, booleans, strings, and arrays of strings. Lazy vectors use it to load data on demand from the previously stored row indices.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Release Notes
New Features
Tests
Documentation