🎉 Add Owl lightweight pipeline runner #6038
@codex review
Quick links (staging server):
Login:
chart-diff: ✅ No charts for review.
data-diff: ✅ No differences found
Automatically updated datasets matching excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included.
Edited: 2026-05-19 16:15:46 UTC
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54d3105be0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
for parquet_path in sorted(data_dir.rglob("*.parquet")):
    rel = parquet_path.relative_to(data_dir)
    parts = rel.with_suffix("").parts  # e.g. ("worldbank", "population")
    if len(parts) != 2:
Handle Owl's catalog dataset layout when indexing
When build_catalog() is used after owl run, it will skip every normal Owl output: Dataset.run() saves catalog datasets under data/garden/<namespace>/<version>/<dataset>/... via ds.add(table), while this loop only looks for *.parquet and then discards anything whose relative path is not exactly two components like source/name. With the default catalog format this finds no files at all, and even if parquet output is enabled the relative path has five components, so the generated catalog is empty for the datasets this commit writes.
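A minimal sketch of how the indexing loop could accept both layouts, assuming the five-component relative path described in this comment (garden/<namespace>/<version>/<dataset>/<table>). The names `classify_table_path` and `index_tables` are hypothetical and are not part of Owl or owid.catalog, and the set of table suffixes is an assumption about which formats may be on disk.

```python
from pathlib import Path, PurePosixPath

# Assumption: tables may be stored as Parquet or Feather files.
TABLE_SUFFIXES = {".parquet", ".feather"}


def classify_table_path(rel):
    """Describe a table path relative to the data dir, or return None if unknown."""
    parts = PurePosixPath(rel).with_suffix("").parts
    if len(parts) == 2:
        # Flat layout handled by the current loop: <source>/<name>
        source, name = parts
        return {"layout": "flat", "source": source, "name": name}
    if len(parts) == 5:
        # Catalog layout: <channel>/<namespace>/<version>/<dataset>/<table>
        channel, namespace, version, dataset, table = parts
        return {
            "layout": "catalog",
            "channel": channel,
            "namespace": namespace,
            "version": version,
            "dataset": dataset,
            "table": table,
        }
    return None


def index_tables(data_dir):
    """Collect index entries for every recognised table file under data_dir."""
    data_dir = Path(data_dir)
    entries = []
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file() or path.suffix not in TABLE_SUFFIXES:
            continue
        entry = classify_table_path(path.relative_to(data_dir).as_posix())
        if entry is not None:
            entries.append(entry)
    return entries
```

Paths that match neither shape, such as the four-component snapshot paths, are skipped rather than raising, which keeps the existing behaviour for anything the index does not understand.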
26697c9 to ef63275
Summary
Adds Owl, a lightweight pipeline runner that can live alongside the existing OWID ETL without taking over the etl package name.
The key design choice is that Owl uses its own code tree (owl_steps/) but writes outputs into the existing ETL artifact layout, so existing tooling can consume them:
- data/snapshots/<namespace>/<snapshot-version>/<dataset>__<snapshot>.parquet
- data/garden/<namespace>/<version>/<dataset>/
Owl step versions use Python-friendly folders such as v20260416, translated to ETL-style versions such as 2026-04-16.
Example layout:
What is included
- owid-owl under lib/owl
- .venv/bin/owl
- make owl
- biodiversity/cherry_blossom/v20260416
- space/near_earth_asteroids/v20260416
How to try it
Expected generated outputs include:
A second owl run should report the datasets/actions as up to date.
Validation performed
- .venv/bin/owl snapshot
- .venv/bin/owl run
- .venv/bin/owl run again to check staleness detection
- .venv/bin/owl --help
- .venv/bin/etl --help to confirm the existing ETL CLI still resolves
Takeover notes
This is intended as a first working prototype rather than a final framework API.
Important implementation details:
- owl.project.parse_step_file() owns the mapping from owl_steps/<namespace>/<dataset>/vYYYYMMDD/step.py to ETL path parts.
- owl.snapshot.Snapshot writes raw data into data/snapshots/ as Parquet files.
- owl.dataset.Dataset writes OWID catalog datasets with owid.catalog.Dataset.create_empty(...).add(...).save().
- … datasets:/columns: shape from the prototype and maps column role into VariableMeta.additional_info.
- … .cache/owl/stamps/ so they do not pollute data/.
Likely follow-ups:
- … sync command or rely entirely on existing ETL publish/sync tooling.
- … @Dataset(channel="...") is enough for non-garden outputs or whether channel belongs in folder/config.
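For reviewers who want a concrete picture of the path mapping described under Takeover notes, here is a rough sketch of how a step file path could be translated into ETL path parts and a garden output directory. It is not the code in owl.project.parse_step_file(); the function names `owl_version_to_etl`, `step_path_to_parts` and `garden_dir` are hypothetical, and the strict five-component check is an assumption about the layout.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def owl_version_to_etl(folder):
    """Translate an Owl version folder such as 'v20260416' into '2026-04-16'."""
    match = re.fullmatch(r"v(\d{4})(\d{2})(\d{2})", folder)
    if match is None:
        raise ValueError(f"not an Owl version folder: {folder!r}")
    year, month, day = (int(group) for group in match.groups())
    # date() rejects impossible dates such as month 13.
    return date(year, month, day).isoformat()


def step_path_to_parts(step_file):
    """Split owl_steps/<namespace>/<dataset>/vYYYYMMDD/step.py into ETL path parts."""
    parts = PurePosixPath(step_file).parts
    if len(parts) != 5 or parts[0] != "owl_steps" or parts[4] != "step.py":
        raise ValueError(f"unexpected step path: {step_file!r}")
    _, namespace, dataset, version_folder, _ = parts
    return {
        "namespace": namespace,
        "dataset": dataset,
        "version": owl_version_to_etl(version_folder),
    }


def garden_dir(step_file):
    """Build the garden output directory; note that version precedes dataset here."""
    p = step_path_to_parts(step_file)
    return f"data/garden/{p['namespace']}/{p['version']}/{p['dataset']}"
```

The sketch highlights that the component order differs between the two trees: the step tree places the dataset before the version, while the garden layout places the version before the dataset.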