A toolkit for Data SRE — bring SLOs, error budgets, and oncall runbooks to data pipelines.
Data quality tools (Soda, Great Expectations, dbt tests, Elementary) excel at rule-based validation: "this column should never be null", "this row count should be within range". They give you checks.
What's missing is the SRE-grade reliability layer on top:
- SLO + error budget as the unit of conversation, not raw checks
- Burn rate alerts so you only page when reliability is actually at risk
- Lineage-aware evaluation (OpenLineage-driven, not just static rules)
- Runbooks and oncall flow treated as first-class artifacts
data-sre-toolkit aims to be the missing piece: a vendor-neutral toolkit that lets data teams adopt the SRE playbook for batch and streaming data systems.
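To make "error budget as the unit of conversation" concrete, here is a minimal sketch of the underlying arithmetic. It is an illustration only, not the toolkit's API: the function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the sketch assumes a time-based SLI where each minute in the window counts as either good or bad.

```python
from datetime import timedelta


def error_budget_minutes(objective_pct: float, window: timedelta) -> float:
    """Minutes of allowed unreliability for a time-based SLO (hypothetical helper)."""
    if not 0 < objective_pct < 100:
        raise ValueError("objective_pct must be between 0 and 100 (exclusive)")
    window_minutes = window.total_seconds() / 60
    return window_minutes * (100 - objective_pct) / 100


def budget_remaining(objective_pct: float, window: timedelta, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(objective_pct, window)
    return 1 - bad_minutes / budget


# A 99.0% objective over 30 days allows 1% of 43,200 minutes to be "bad".
budget = error_budget_minutes(99.0, timedelta(days=30))
remaining = budget_remaining(99.0, timedelta(days=30), bad_minutes=108)
print(f"budget: {budget:.0f} min, remaining: {remaining:.0%}")
# prints "budget: 432 min, remaining: 75%"
```

Framed this way, a stale table is discussed as "we have spent a quarter of this month's budget" rather than as a list of failed checks.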
| | Soda Core / GE / dbt tests | Elementary / DataKitchen | data-sre-toolkit |
|---|---|---|---|
| Primary unit | Check / expectation | Anomaly / test | SLI → SLO → error budget |
| Alert model | Pass / fail | Anomaly score | Burn rate (multi-window) |
| Lineage | — | Limited | OpenLineage native |
| Oncall workflow | — | — | Runbook + escalation path |
| Error budget policy | — | — | First-class |
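
The "Burn rate (multi-window)" alert model in the table can be sketched as follows. This is a simplified illustration in the style of the Google SRE workbook, not the project's implementation: `WindowStats`, `burn_rate`, and `should_page` are hypothetical names, and the sketch assumes the SLI is reported as good/total event counts per lookback window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Good/total SLI event counts observed over one lookback window."""
    good: int
    total: int


def burn_rate(stats: WindowStats, objective_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if stats.total == 0:
        return 0.0  # assumption: no observations means no evidence of burn
    error_rate = 1 - stats.good / stats.total
    allowed = (100 - objective_pct) / 100
    return error_rate / allowed


def should_page(long: WindowStats, short: WindowStats,
                objective_pct: float, threshold: float) -> bool:
    """Fire only when BOTH windows burn budget faster than the threshold."""
    return (burn_rate(long, objective_pct) >= threshold
            and burn_rate(short, objective_pct) >= threshold)


last_hour = WindowStats(good=50, total=60)

# Still failing in the last 5 minutes: both windows exceed 14.4x, so page.
still_burning = should_page(last_hour, WindowStats(good=4, total=5), 99.0, 14.4)

# Recovered in the last 5 minutes: the short window is clean, so stay quiet.
recovered = should_page(last_hour, WindowStats(good=5, total=5), 99.0, 14.4)
```

The long window shows that the burn is significant, and the short window shows that it is still happening, which is what keeps a pass/fail blip from paging anyone. A burn rate of 14.4 sustained for 1 hour of a 30-day (720-hour) window consumes 14.4 / 720 = 2% of the error budget.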
This project is pre-alpha. The MVP scope (M1) covers the SLO schema and the dbt adapter. APIs will change. See ROADMAP.md.
- Data SLOs explained — SLI / SLO / error budget for data, mapped from Google's SRE workbook
- ADR-0001: Why this exists — design rationale
    # slo.yml - declarative Data SLO definition
    version: 1
    slos:
      - name: orders_freshness
        description: orders mart must be fresh within 30min of source
        sli:
          type: freshness
          target: dbt:mart_orders
          threshold: 30m
        objective: 99.0  # %
        window: 30d
        burn_rate_alerts:
          - severity: page
            long_window: 1h
            short_window: 5m
            burn_rate: 14.4  # 2% budget in 1h

See CONTRIBUTING.md. Issues and PRs are welcome, even (especially) "this is the wrong abstraction" feedback at this stage.
If you find a vulnerability, please follow the process in SECURITY.md.
Apache License 2.0 — see NOTICE for attributions.