ci: restructure workflows to use unified builder image on self-hosted runners #112
jackluo923 wants to merge 17 commits into
…ilds
Change Docker image tagging from static :dev tags to dynamic branch-based tags,
and update all workflow conditions to accept any branch with the
release-0.293-clp-connector-snapshot prefix.
Changes:
- Docker image tags now use actual branch name via type=ref,event=branch
  * Before: ghcr.io/repo/prestissimo-worker:dev
  * After: ghcr.io/repo/prestissimo-worker:release-0.293-clp-connector-snapshot-nov29
- Push conditions changed from exact match to prefix match using startsWith()
- Cancel-in-progress logic updated to exclude all snapshot prefix branches
- PR title checks now accept release-0.293-clp-connector-snapshot* pattern
Example: Pushing to branch "release-0.293-clp-connector-snapshot-nov29" will:
- Build and push images tagged with :release-0.293-clp-connector-snapshot-nov29
- Run all CI jobs to completion without cancellation
- Enable PR title validation for PRs targeting this branch
This allows multiple snapshot branches to coexist with unique image tags.
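The prefix-match rule and branch-based tag described in this commit can be sketched in a few lines of Python. This is only an illustration of the logic, not the workflow itself: the function names `should_push` and `image_ref` are hypothetical, and the `repo` value is the placeholder used in the commit message.

```python
# Prefix taken from the commit message above.
SNAPSHOT_PREFIX = "release-0.293-clp-connector-snapshot"


def should_push(branch: str) -> bool:
    """Prefix match, mirroring the startsWith() push condition."""
    return branch.startswith(SNAPSHOT_PREFIX)


def image_ref(repo: str, image: str, branch: str) -> str:
    """Tag the image with the branch name instead of a static :dev tag."""
    return f"ghcr.io/{repo}/{image}:{branch}"


if __name__ == "__main__":
    branch = "release-0.293-clp-connector-snapshot-nov29"
    if should_push(branch):
        print(image_ref("repo", "prestissimo-worker", branch))
```

Because the match is on the prefix, the original exact branch name still qualifies, so existing snapshot branches keep working.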
Add alternative provider implementations for metadata and split discovery,
enabling flexible deployment configurations without MySQL dependency.
New providers:
- ClpYamlMetadataProvider: Reads table/column metadata from YAML config files
instead of querying MySQL database. Supports nested schemas and polymorphic
types through hierarchical YAML structure.
- ClpPinotSplitProvider: Queries Apache Pinot database for archive file paths
via HTTP/JSON API instead of MySQL. Determines split types based on file
extensions (.clp.zst for IR, otherwise ARCHIVE).
Configuration:
- Add clp.metadata-provider-type=YAML option for YAML-based metadata
- Add clp.split-provider-type=PINOT option for Pinot-based split discovery
- Add clp.metadata-yaml-path config for YAML metadata file location
The YAML metadata provider parses a two-level structure:
1. Main metadata file: maps connector → schema → table → schema_file_path
2. Individual table schema files: define column names and types
This provides deployment flexibility - users can now choose between:
- Database-driven (MySQL for both metadata and splits)
- Hybrid (YAML metadata + Pinot splits)
- Or any combination based on their infrastructure
Dependencies: Adds jackson-dataformat-yaml and snakeyaml 2.1
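The extension-based split typing described for ClpPinotSplitProvider can be illustrated with a small Python sketch. The names `SplitType` and `determine_split_type` are stand-ins chosen for illustration; the actual provider is Java code and may differ in detail.

```python
from enum import Enum


class SplitType(Enum):
    IR = "IR"
    ARCHIVE = "ARCHIVE"


def determine_split_type(path: str) -> SplitType:
    """Files ending in .clp.zst are IR splits; everything else is an ARCHIVE."""
    return SplitType.IR if path.endswith(".clp.zst") else SplitType.ARCHIVE
```

For example, `determine_split_type("logs/app.clp.zst")` yields `SplitType.IR`, while a path without that suffix falls through to `SplitType.ARCHIVE`.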
Ensure all timestamp literals are converted to nanoseconds before generating KQL and SQL
filters for pushdown, regardless of their original precision (milliseconds, microseconds, etc.).
Problem:
Presto supports multiple timestamp precisions (TIMESTAMP with millisecond precision,
TIMESTAMP_MICROSECONDS, etc.), but CLP's query engine expects timestamps in a consistent
nanosecond format. Previously, timestamp literals were passed through without precision
normalization, causing incorrect results when different timestamp types were used in queries.
Solution:
Add timestamp normalization logic in ClpFilterToKqlConverter:
- tryEnsureNanosecondTimestamp(): Detects TIMESTAMP and TIMESTAMP_MICROSECONDS types
- ensureNanosecondTimestamp(): Converts timestamp values to nanoseconds using:
  1. Extract epoch seconds via TimestampType.getEpochSecond()
  2. Extract nanosecond fraction via TimestampType.getNanos()
  3. Convert to total nanoseconds: SECONDS.toNanos(seconds) + nanosFraction
Applied to:
- BETWEEN operator: Both lower and upper bounds
- Comparison operators: All timestamp literal values (=, !=, <, >, <=, >=)
This ensures consistent timestamp representation in both KQL filters (sent to CLP engine)
and SQL filters (used for metadata filtering), regardless of the timestamp precision used
in the original Presto query.
Example:
Query: WHERE timestamp BETWEEN TIMESTAMP '2025-01-01' AND TIMESTAMP '2025-01-02'
Before: Literal values passed as milliseconds or microseconds
After: Both bounds normalized to nanosecond representation
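The three conversion steps (epoch seconds, nanosecond fraction, sum) can be checked with a short Python sketch. It mirrors the arithmetic only; the helper names `to_nanos`, `millis_to_nanos`, and `micros_to_nanos` are hypothetical and are not the Java methods named in the commit.

```python
NANOS_PER_SECOND = 1_000_000_000


def to_nanos(value: int, units_per_second: int) -> int:
    """Convert an epoch timestamp of the given precision to epoch nanoseconds."""
    # Step 1 and 2: split into whole seconds and the sub-second remainder.
    # divmod floors, so pre-1970 (negative) values keep a non-negative fraction.
    seconds, fraction = divmod(value, units_per_second)
    nanos_fraction = fraction * (NANOS_PER_SECOND // units_per_second)
    # Step 3: total nanoseconds = seconds in nanos + nanosecond fraction.
    return seconds * NANOS_PER_SECOND + nanos_fraction


def millis_to_nanos(millis: int) -> int:
    return to_nanos(millis, 1_000)


def micros_to_nanos(micros: int) -> int:
    return to_nanos(micros, 1_000_000)
```

With this normalization, a millisecond literal and a microsecond literal for the same instant produce the same nanosecond value, which is the property the pushdown filters rely on.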
Implement a plan optimizer that pushes down LIMIT + ORDER BY operations to reduce split
scanning when querying on timestamp metadata columns.
The optimizer detects TopN patterns in query plans and uses archive metadata (timestamp
bounds, message counts) to select only the minimum set of archives needed to guarantee
correct results.
Key features:
- Detects and rewrites TopN → [Project] → [Filter] → TableScan patterns
- Pushes TopN specifications through ClpTableLayoutHandle
- Handles overlapping archives by merging into non-overlapping groups
- Supports both ASC (earliest-first) and DESC (latest-first) ordering
- Uses worst-case analysis to ensure correctness when archive ranges overlap
Implementation:
- ClpTopNSpec: Data structure for TopN specifications (limit + orderings)
- ClpComputePushDown: Enhanced plan optimizer with TopN detection
- ClpMySqlSplitProvider: Archive selection logic using metadata queries
- Comprehensive test coverage for various TopN scenarios
Performance:
For queries like "SELECT * FROM logs ORDER BY timestamp DESC LIMIT 100", this eliminates
scanning of archives that cannot contain the top results, significantly reducing I/O.
Implement TopN pushdown optimization in ClpPinotSplitProvider to minimize archive scanning
when executing LIMIT + ORDER BY queries on timestamp metadata columns.
The optimization uses archive metadata (timestamp bounds and message counts) to select only
the minimum set of archives needed to guarantee correct TopN results, significantly
reducing I/O for time-range queries.
Implementation details:
Archive selection algorithm:
- Fetches archive metadata (file path, creation time, modification time, message count)
- Merges overlapping archives into non-overlapping groups by timestamp ranges
- For DESC ordering: includes newest group + older groups until limit is covered
- For ASC ordering: includes oldest group + newer groups until limit is covered
- Uses worst-case analysis to ensure correctness when archives overlap
Code structure:
- ClpPinotSplitProvider.listSplits(): Detects TopN specs and routes to optimized path
- selectTopNArchives(): Implements the archive selection algorithm
- toArchiveGroups(): Merges overlapping archives into logical components
- ArchiveMeta: Represents archive metadata with validation
- ArchiveGroup: Represents merged archive groups for overlap handling
ClpTopNSpec enhancements:
- Made fields private with proper encapsulation
- Added @JsonProperty annotations to getters for correct serialization
- Maintains immutability for thread-safety
Code quality improvements:
- Proper exception handling with context-aware error messages
- Input validation in constructors (bounds checking, null checks)
- Extracted determineSplitType() helper to eliminate duplication
- Made fields final where immutable (pinotDatabaseUrl, ArchiveMeta fields)
- Improved logging with table names and SQL queries for debugging
- Better encapsulation: private fields, getters-only access
Performance impact:
For queries like "SELECT * FROM logs ORDER BY timestamp DESC LIMIT 100", this eliminates
scanning of archives outside the time range of interest, providing substantial I/O
reduction for large archive sets.
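The archive selection algorithm described above (merge overlapping archives into groups, then take groups from the relevant end until the limit is covered) might look roughly like the following Python sketch. It is a simplified illustration, not the Java implementation: the `Archive` fields and the function names are assumptions, and it assumes message counts are exact and that no additional filter reduces the number of matching messages per archive.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Archive:
    path: str
    min_ts: int          # lower timestamp bound of the archive
    max_ts: int          # upper timestamp bound of the archive
    message_count: int


def to_groups(archives: List[Archive]) -> List[List[Archive]]:
    """Merge archives whose timestamp ranges overlap into groups, oldest first."""
    groups: List[List[Archive]] = []
    group_max = 0
    for archive in sorted(archives, key=lambda a: (a.min_ts, a.max_ts)):
        if groups and archive.min_ts <= group_max:
            groups[-1].append(archive)
            group_max = max(group_max, archive.max_ts)
        else:
            groups.append([archive])
            group_max = archive.max_ts
    return groups


def select_top_n(archives: List[Archive], limit: int, descending: bool) -> List[Archive]:
    """Pick whole groups from the newest (DESC) or oldest (ASC) end until
    their combined message count covers the limit."""
    groups = to_groups(archives)
    if descending:
        groups.reverse()
    selected: List[Archive] = []
    covered = 0
    for group in groups:
        # Worst case: any archive in an overlapping group may hold a top row,
        # so a group is always taken as a whole.
        selected.extend(group)
        covered += sum(a.message_count for a in group)
        if covered >= limit:
            break
    return selected
```

Taking whole groups is what makes the result safe under overlap: once the selected groups hold at least `limit` messages, every remaining group lies strictly on the far side of the ordering and cannot contribute to the top results.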
Implements dynamic schema discovery from YAML metadata files, allowing the CLP connector
to support multiple schemas beyond just "default" (e.g., clp.dev.table, clp.prod.table).
Key changes:
- Add default listSchemaNames() method to ClpMetadataProvider interface for DRY principle
- Implement dynamic schema discovery in ClpYamlMetadataProvider with thread-safe caching
- Update ClpMetadata to delegate schema listing to metadata providers
- Add comprehensive error handling with graceful fallback to default schema
- Optimize performance with double-checked locking and ObjectMapper reuse
- Add 15+ unit tests covering all edge cases and error scenarios
- Fix checkstyle violations in UberClpPinotSplitProvider and ClpPinotSplitProvider
The implementation is backward compatible - existing single-schema setups continue to work
without configuration changes. Multi-schema support is activated automatically when the
YAML metadata file contains multiple schema entries.
Example YAML structure:
```yaml
clp:
default:
logs: /path/to/default/logs.yaml
dev:
test_logs: /path/to/dev/logs.yaml
prod:
production_logs: /path/to/prod/logs.yaml
```
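As a rough illustration of how schema names could be derived from this structure, the Python sketch below works on an already-parsed mapping (the dictionary a YAML parser would produce for the example above). The function name, the `catalog` parameter, and the exact fallback conditions are assumptions made for illustration; the commit only states that discovery falls back to the default schema on errors.

```python
from collections.abc import Mapping
from typing import Any, List

DEFAULT_SCHEMA = "default"


def list_schema_names(metadata: Any, catalog: str = "clp") -> List[str]:
    """Return the schema names under the catalog entry, or just the default
    schema when the catalog entry is missing or malformed."""
    if not isinstance(metadata, Mapping):
        return [DEFAULT_SCHEMA]
    schemas = metadata.get(catalog)
    if not isinstance(schemas, Mapping) or not schemas:
        return [DEFAULT_SCHEMA]
    return list(schemas.keys())


if __name__ == "__main__":
    parsed = {
        "clp": {
            "default": {"logs": "/path/to/default/logs.yaml"},
            "dev": {"test_logs": "/path/to/dev/logs.yaml"},
            "prod": {"production_logs": "/path/to/prod/logs.yaml"},
        }
    }
    print(list_schema_names(parsed))
```

A single-schema file containing only a `default` entry produces the same result as before the change, which is consistent with the backward-compatibility claim.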
Performance optimizations:
- Thread-safe caching prevents repeated YAML parsing
- Double-checked locking pattern for lazy initialization
- Reused ObjectMapper instances reduce object creation overhead
- Synchronized table schema map updates ensure thread safety
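The lazy, thread-safe caching pattern listed above can be sketched in Python as follows. The class name `LazySchemaCache` and the loader callback are hypothetical; the Java provider presumably applies the same double-checked idea to its parsed YAML, but its exact structure is not shown in this commit message.

```python
import threading
from typing import Callable, List, Optional


class LazySchemaCache:
    """Loads a value once on first use; later calls return the cached value."""

    def __init__(self, loader: Callable[[], List[str]]):
        self._loader = loader
        self._lock = threading.Lock()
        self._value: Optional[List[str]] = None

    def get(self) -> List[str]:
        value = self._value
        if value is None:                 # first check, without the lock
            with self._lock:
                if self._value is None:   # second check, under the lock
                    self._value = self._loader()
                value = self._value
        return value
```

The unlocked first check keeps the common path cheap once the cache is filled, while the second check under the lock ensures the loader (here standing in for YAML parsing) runs only once even when several threads arrive together.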
Testing:
- All 11 tests in TestClpYamlMetadataProvider pass
- All 4 tests in TestClpMultiSchema pass
- Fixed testListSchemaNamesNoCatalogField to handle missing catalog gracefully
…restodb#26254)
## Description
Fix flaky tests that use `experimental.spiller-spill-path` as `/tmp/presto/spills/` in query runner
## Motivation and Context
prestodb#25890
## Impact
<!---Describe any public API or user-facing feature change or any performance impact-->
## Test Plan
<!---Please fill in how you tested your change-->
## Contributor checklist
- [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
## Release Notes
Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below.
```
== NO RELEASE NOTE ==
```
Upgrade joda-time to 2.13.1 to support new timezone data including America/Coyhaique.
Update tests to use dates where timezone gaps still exist in the updated tzdata:
- testLocallyUnrepresentableDateLiterals: 1970-01-01 -> 1932-04-01
- testLocallyUnrepresentableTimeLiterals: use 2017-04-02 or 2012-04-01
Cherry-picked from upstream: prestodb/presto@0a504d20e6
… runners
Key changes:
- Add comprehensive CI architecture documentation explaining:
  - Terminology (presto, prestocpp, prestissimo)
  - Unified builder image strategy for ephemeral runners
  - Job dependency graph
  - Comparison with upstream (prestodb/presto) CI
  - Performance benefits of pre-warmed ccache and Maven deps
- Consolidate presto-java8/java17 jobs into single matrix-based `presto` job
  - ARTIFACT_JAVA_VERSION controls which version uploads artifacts/images (default: '8')
- Add prestissimo image building to prestocpp workflow
  - Downloads artifacts from build job, packages into runtime image
  - Uses same tagging strategy as presto image (immutable + SNAPSHOT tags)
- Centralize IMAGE_VERSION_TYPE configuration (set to 'BETA')
  - Applied to both presto and prestissimo images
- Document artifacts (presto-server, presto-cli, presto-native-build) and Docker images (unified-builder, presto, prestissimo)
Duplicate, updating PR #110 instead
Caution: Review failed. The pull request is closed.
Walkthrough
This PR introduces a comprehensive CI/CD restructuring with a unified builder image strategy featuring dependency-based caching, extensive CLP plugin enhancements including YAML metadata support and Pinot split providers with TopN pushdown optimization, and test infrastructure improvements with UUID-based spill path isolation to prevent file collisions.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Summary
Restructure CI workflows to use a unified builder image on self-hosted runners, replacing the branch-specific CI in `release-0.293-clp-connector-snapshot`.
Comparison: Three CI Approaches
[Comparison table; surviving cells: `prestodb/presto-native-dependency:0.293-...` (C++ deps only, no caches); `release-0.293-clp-connector-snapshot`; `0.293-BETA-20250522140509-484b00e`; `release-0.293-clp-connector-snapshot*` branches]
Design Improvements
1. Self-Hosted Runners with Pre-Built Dependencies
Switching from GitHub-hosted to self-hosted ephemeral runners with a unified builder image:
Benefits:
2. Branch-Agnostic CI
The old CI in `release-0.293-clp-connector-snapshot` only works for that specific branch.
The new CI works on any branch, enabling feature branch testing and multi-version support.
3. Automatic Version Inference
Images are automatically tagged with meaningful versions extracted from `pom.xml`:
- `<version>-<TYPE>-<timestamp>-<hash>` (e.g., `0.293-BETA-20250527143021-62303d8`)
- `<version>-<TYPE>-SNAPSHOT` (e.g., `0.293-BETA-SNAPSHOT`)
No manual version management required. Each commit produces a uniquely identifiable image.
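The two tag shapes can be reproduced with a short Python sketch. It assumes the timestamp is formatted as `YYYYMMDDHHMMSS` and the hash is the first seven characters of the commit SHA, both inferred from the example tag rather than stated by the PR; the function name `image_tags` and the sample SHA are hypothetical, and reading the version out of `pom.xml` is not shown.

```python
from datetime import datetime, timezone
from typing import Tuple


def image_tags(version: str, version_type: str, built_at: datetime, commit_sha: str) -> Tuple[str, str]:
    """Return (per-commit tag, moving SNAPSHOT tag) for an image build."""
    timestamp = built_at.strftime("%Y%m%d%H%M%S")   # assumed timestamp layout
    short_hash = commit_sha[:7]                     # assumed short-hash length
    per_commit = f"{version}-{version_type}-{timestamp}-{short_hash}"
    snapshot = f"{version}-{version_type}-SNAPSHOT"
    return per_commit, snapshot


if __name__ == "__main__":
    built_at = datetime(2025, 5, 27, 14, 30, 21, tzinfo=timezone.utc)
    print(image_tags("0.293", "BETA", built_at, "62303d8aaaaaaaaaaaaa"))
```

The per-commit tag gives traceability back to a build time and commit, while the SNAPSHOT tag gives consumers a stable name that always points at the latest build of that version and type.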
4. Configurable Version Type
A single configuration point, `IMAGE_VERSION_TYPE` (set to `BETA`), controls the image version type.
It is applied consistently to both presto and prestissimo images.
5. Unified Builder Image with Hash-Based Tagging
The builder image is fundamentally different from upstream's `presto-native-dependency`.
The builder image tag is computed from dependency file hashes.
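One way to derive a tag from dependency file hashes is sketched below in Python. Which files feed the hash, the choice of SHA-256, and the 12-character tag length are all assumptions made for illustration; the PR does not spell them out here, and `builder_image_tag` is a hypothetical name.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def builder_image_tag(dependency_files: Iterable[Path], length: int = 12) -> str:
    """Hash the names and contents of the dependency files into a short tag."""
    digest = hashlib.sha256()
    # Sort so the tag does not depend on the order the files are listed in.
    for path in sorted(dependency_files):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:length]
```

With a content-derived tag, a builder image only needs rebuilding when one of the hashed files changes; an unchanged set of dependency files maps to a tag that already exists in the registry.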
Performance Benefits
ccache Pre-warming
The builder image contains a pre-warmed ccache from a previous successful build. Incremental builds achieve high cache hit rates, dramatically reducing C++ compilation time.
Artifact-Based Image Building
Instead of rebuilding prestissimo from scratch for the Docker image:
Limitations of Current CI (release-0.293-clp-connector-snapshot)
Branch Lock-in: Only works for `release-0.293-clp-connector-snapshot*` branches
Hardcoded Dependencies: Builder image version is hardcoded to upstream's `prestodb/presto-native-dependency:0.293-...` from Docker Hub. This image is built by upstream prestodb/presto CI and requires manual updates when dependencies change. The snapshot branch is pinned to an older 0.293 version while upstream master has moved to 0.296+
No Version Tracking: Images tagged only by branch name, no timestamp/hash for traceability
Slow Image Builds: Rebuilds everything from scratch for Docker images (~1 hour)
External Service Dependency: Relies on Apache Infrastructure stash service for ccache
Limited Resources: GitHub-hosted runners have limited CPU/RAM compared to dedicated hardware
Maintenance Benefits
Job Dependency Graph
Artifacts and Images
Artifacts (1-day retention)
Docker Images (ghcr.io)
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Chores