
🎉 Add Owl lightweight pipeline runner #6038

Draft
Marigold wants to merge 5 commits into master from feature/owl-lightweight-pipeline

Marigold (Collaborator) commented May 6, 2026

Summary

Adds Owl, a lightweight pipeline runner that can live alongside the existing OWID ETL without taking over the etl package name.

The key design choice is that Owl uses its own code tree (owl_steps/) but writes outputs into the existing ETL artifact layout, so existing tooling can consume them:

  • snapshots → data/snapshots/<namespace>/<snapshot-version>/<dataset>__<snapshot>.parquet
  • datasets → data/garden/<namespace>/<version>/<dataset>/

Owl step versions use Python-friendly folders such as v20260416, translated to ETL-style versions such as 2026-04-16.
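A minimal sketch of that version translation, for illustration only. The helper names `folder_to_etl_version` and `etl_version_to_folder` are hypothetical and are not taken from the Owl package; the sketch assumes the folder name is always `v` followed by an eight-digit `YYYYMMDD` date.

```python
import re
from datetime import date

# Hypothetical pattern: "v" followed by YYYYMMDD, e.g. "v20260416".
_VERSION_RE = re.compile(r"^v(\d{4})(\d{2})(\d{2})$")


def folder_to_etl_version(folder: str) -> str:
    """Translate an Owl folder name such as 'v20260416' into '2026-04-16'."""
    match = _VERSION_RE.match(folder)
    if match is None:
        raise ValueError(f"not an Owl version folder: {folder!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() raises ValueError for impossible dates such as month 13.
    return date(year, month, day).isoformat()


def etl_version_to_folder(version: str) -> str:
    """Translate an ETL-style version such as '2026-04-16' into 'v20260416'."""
    return "v" + date.fromisoformat(version).strftime("%Y%m%d")
```

Going through `datetime.date` rather than slicing the string means a typo such as `v20261340` fails early instead of producing a bogus version directory.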

Example layout:

owl_steps/biodiversity/cherry_blossom/v20260416/step.py
owl_steps/biodiversity/cherry_blossom/v20260416/meta.yml

data/garden/biodiversity/2026-04-16/cherry_blossom/
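The mapping in this example can be sketched as follows. This is not the code behind `owl.project.parse_step_file()`; `output_dir_for_step` is a hypothetical name, and the sketch assumes step files always sit exactly at `owl_steps/<namespace>/<dataset>/vYYYYMMDD/step.py`. Note that the dataset and version segments swap places between the code tree and the data tree.

```python
from pathlib import PurePosixPath


def output_dir_for_step(step_file: str, channel: str = "garden") -> str:
    """Map an Owl step file path to its ETL-style output directory (sketch)."""
    parts = PurePosixPath(step_file).parts
    if len(parts) != 5 or parts[0] != "owl_steps" or parts[4] != "step.py":
        raise ValueError(f"unexpected step path: {step_file!r}")
    _, namespace, dataset, folder, _ = parts
    digits = folder[1:]
    if not folder.startswith("v") or len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"unexpected version folder: {folder!r}")
    # v20260416 -> 2026-04-16
    version = f"{digits[:4]}-{digits[4:6]}-{digits[6:]}"
    # Code tree is namespace/dataset/version; data tree is namespace/version/dataset.
    return f"data/{channel}/{namespace}/{version}/{dataset}/"


print(output_dir_for_step("owl_steps/biodiversity/cherry_blossom/v20260416/step.py"))
```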

What is included

  • New local editable package: owid-owl under lib/owl
  • New CLI entry point: .venv/bin/owl
  • New Make target: make owl
  • Two example Owl steps:
    • biodiversity/cherry_blossom/v20260416
    • space/near_earth_asteroids/v20260416

How to try it

.venv/bin/owl snapshot
.venv/bin/owl run
.venv/bin/owl run biodiversity/cherry_blossom
.venv/bin/owl run biodiversity/cherry_blossom/2026-04-16

Expected generated outputs include:

data/garden/biodiversity/2026-04-16/cherry_blossom/index.json
data/garden/biodiversity/2026-04-16/cherry_blossom/cherry_blossom.feather
data/garden/space/2026-04-16/near_earth_asteroids/index.json

A second owl run should report the datasets/actions as up to date.

Validation performed

  • .venv/bin/owl snapshot
  • .venv/bin/owl run
  • .venv/bin/owl run a second time, to check staleness detection
  • .venv/bin/owl --help
  • .venv/bin/etl --help to confirm the existing ETL CLI still resolves
  • Commit hook ran successfully, including lint/format/type checks

Takeover notes

This is intended as a first working prototype rather than a final framework API.

Important implementation details:

  • owl.project.parse_step_file() owns the mapping from owl_steps/<namespace>/<dataset>/vYYYYMMDD/step.py to ETL path parts.
  • owl.snapshot.Snapshot writes raw data into data/snapshots/ as Parquet files.
  • owl.dataset.Dataset writes OWID catalog datasets with owid.catalog.Dataset.create_empty(...).add(...).save().
  • Owl metadata currently supports the lightweight datasets: / columns: shape from the prototype, and maps each column's role into VariableMeta.additional_info.
  • Action stamps are kept under .cache/owl/stamps/ so they do not pollute data/.
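For readers taking this over, one way stamp-based staleness detection can work is sketched below. This is an illustration under assumptions, not Owl's actual stamp format: it assumes a stamp is a small text file holding a SHA-256 fingerprint of an action's input files, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(inputs: list[Path]) -> str:
    """Hash the names and contents of all input files (assumed stamp content)."""
    digest = hashlib.sha256()
    for path in sorted(inputs):
        digest.update(path.name.encode())
        digest.update(b"\0")  # separator so name and content cannot run together
        digest.update(path.read_bytes())
    return digest.hexdigest()


def is_up_to_date(stamp: Path, inputs: list[Path]) -> bool:
    """An action is up to date when its stamp matches the current inputs."""
    return stamp.exists() and stamp.read_text() == fingerprint(inputs)


def write_stamp(stamp: Path, inputs: list[Path]) -> None:
    """Record the current fingerprint after the action has run successfully."""
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(fingerprint(inputs))
```

Under this scheme a runner would call `is_up_to_date()` before an action and `write_stamp()` after it, so an unchanged second run is skipped. Keeping the stamps in a cache directory also means deleting that directory forces a full rebuild without touching data/.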

Likely follow-ups:

  • Decide whether Owl should keep its own sync command or rely entirely on existing ETL publish/sync tooling.
  • Tighten metadata mapping from Owl-style metadata to full catalog metadata.
  • Decide whether @Dataset(channel="...") is enough for non-garden outputs or whether channel belongs in folder/config.
  • Add focused unit tests for path parsing, version translation, staleness, and catalog output shape.
  • Decide whether example steps should stay in this PR or move to docs/examples once the package shape is reviewed.

Marigold (Collaborator, Author) commented May 6, 2026

@codex review

owidbot (Contributor) commented May 6, 2026

Quick links (staging server):

Site Dev Site Preview Admin Wizard Docs Docs Preview

Login: ssh owid@staging-site-feature-owl-lightweight-pipe

chart-diff: ✅ No charts for review.
data-diff: ✅ No differences found

Automatically updated datasets matching excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included

Edited: 2026-05-19 16:15:46 UTC
Execution time: 6.03 seconds

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54d3105be0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/owl/owl/cli.py Outdated
Comment on lines +313 to +316
for parquet_path in sorted(data_dir.rglob("*.parquet")):
    rel = parquet_path.relative_to(data_dir)
    parts = rel.with_suffix("").parts  # e.g. ("worldbank", "population")
    if len(parts) != 2:

[P2] Handle Owl's catalog dataset layout when indexing

When build_catalog() is used after owl run, it will skip every normal Owl output: Dataset.run() saves catalog datasets under data/garden/<namespace>/<version>/<dataset>/... via ds.add(table), while this loop only looks for *.parquet and then discards anything whose relative path is not exactly two components like source/name. With the default catalog format this finds no files at all, and even if parquet output is enabled the relative path has five components, so the generated catalog is empty for the datasets this commit writes.
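One possible direction for addressing this, sketched under assumptions and not taken from the PR: discover datasets by their `index.json` marker rather than by `*.parquet` files. The function name `find_catalog_datasets` is hypothetical, and the sketch assumes `data_dir` is the top-level data/ directory and that every catalog dataset directory contains an `index.json`, as in the expected outputs listed in the PR description.

```python
from pathlib import Path


def find_catalog_datasets(data_dir: Path) -> list[tuple[str, ...]]:
    """Return (channel, namespace, version, dataset) for each dataset found."""
    found = []
    for index_path in sorted(data_dir.rglob("index.json")):
        # Directory holding index.json, relative to data_dir,
        # e.g. ("garden", "biodiversity", "2026-04-16", "cherry_blossom").
        parts = index_path.relative_to(data_dir).parent.parts
        if len(parts) != 4:
            continue  # not a <channel>/<namespace>/<version>/<dataset> directory
        found.append(parts)
    return found
```

Keying on the dataset directory rather than on individual table files keeps the index independent of whether tables are stored as Feather or Parquet.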

Marigold force-pushed the feature/owl-lightweight-pipeline branch from 26697c9 to ef63275 on May 14, 2026, 12:02