diff --git a/REVIEW-NOTES-databricks-pipelines.md b/REVIEW-NOTES-databricks-pipelines.md
new file mode 100644
index 0000000..592200c
--- /dev/null
+++ b/REVIEW-NOTES-databricks-pipelines.md
@@ -0,0 +1,220 @@
+# `databricks-pipelines` Skill — Review Notes
+
+This document accompanies the audit and rework of `skills/databricks-pipelines/` after the initial port from `ai-dev-kit/databricks-skills/databricks-spark-declarative-pipelines`. It exists to make the merge review easy: every change has a "what was broken" framing and a "why this fix is correct" rationale, with Databricks docs cited where the behavior was non-obvious.
+
+The skill is consumed by an LLM in a fresh session — every duplication is a place we can drift, every conflict is a place the model emits wrong code. The goal of this pass was: **no duplicated information, no conflicting instructions, dense and SDP-specific only** (strip general SQL / streaming knowledge the model already has).
+
+## Summary of categories
+
+| Category | Items |
+|---|---|
+| Workflow restructure (DAB vs CLI iteration) | 4 |
+| Reference-file deduping (parent stubs) | 1 |
+| API-table reorganization (description, deduped deprecation column) | 1 |
+| Canonical create JSON (dev defaults, retry overrides) | 1 |
+| SQL syntax conflicts (legacy DLT `FROM STREAM table` vs `STREAM(table)`) | ~25 sites |
+| `CREATE OR REFRESH` consistency | ~12 sites |
+| Legacy API surfaced as current | 5 |
+| Broken / placeholder links | 12 |
+| Typos | 3 |
+
+## 1. Workflow restructure — A / B / C now self-contained
+
+**Before:**
+- SKILL.md had a "Choose Your Workflow" table that listed A/B/C, then later a "Scaffolding a New Pipeline Project" section that only documented A (standalone `pipelines init`). B (existing bundle) and C (CLI iteration) had no entry-point.
+- All workflow content was crammed into a single `references/workflows.md`.
+- The skill had "Running Pipelines" + "Development Workflow" sections that were silently DAB-only — agents working in the CLI/no-bundle path got bundle commands.
+
+**After:**
+- SKILL.md "Choose Your Workflow" has three compact blocks (A/B/C), each with a one-liner CLI sketch and a link to detail.
+- Split `workflows.md` into two named files: `references/1-project-initialization-with-dab.md` (A + B, bundle paths) and `references/2-rapid-iteration-with-cli.md` (C, no-bundle path). Names self-document: file 1 covers "with DAB", file 2 is "rapid iteration with CLI" — the contrast is explicit.
+- The new "Running a Pipeline" section in SKILL.md explicitly splits A/B (bundle deploy + run) and C (workspace import + start-update). Refresh-modes (selective vs full-refresh data-loss warning) and the polling rule are stated once with links to detail.
+- The merged a-d-k `3-advanced-configuration.md` content lives in the existing `pipeline-configuration.md` — no new file added since the field-by-field JSON reference was already there. Avoided a duplicate.
+
+## 2. Canonical create JSON now matches a-d-k's tuned defaults
+
+**Before:**
+- `pipeline-configuration.md` showed a minimal create JSON with `serverless`, `continuous: false`, `channel: PREVIEW`, `libraries`. No mention of `development: true` or the retry overrides.
+- Agents creating a demo pipeline would hit the platform's default retry behavior (5 update retries × 2 flow retries) on a syntactically broken pipeline. A single typo could waste 10+ min retrying with the same root cause.
+
+**After:**
+- Canonical create JSON now includes `development: true` plus `pipelines.numUpdateRetryAttempts: "0"` + `pipelines.maxFlowRetryAttempts: "0"`. A doomed update fails fast (~30s) instead of retrying for 10+ min.
+- A clear note labels these as iteration defaults, with a one-line callout that production pipelines should drop the retry overrides (so the platform's retry defaults absorb transient infra failures). Avoids "demo settings leaking into prod."
+- Same canonical create is mirrored in `2-rapid-iteration-with-cli.md` Step 3 and the Python SDK alternative — both now reference the per-field rationale in `pipeline-configuration.md` instead of duplicating it.
+- The `"Development mode"` variant snippet in `pipeline-configuration.md` was reframed as `"Production mode (remove dev defaults)"` since the canonical IS dev mode — the relevant delta is the prod conversion, not the dev one.
+
+## 3. References reorganized — no parent stubs
+
+**Before:**
+- The skill had a `references/streaming-table.md`, `materialized-view.md`, `temporary-view.md`, `view.md`, `auto-loader.md`, `auto-cdc.md`, `expectations.md`, `sink.md`, `foreach-batch-sink.md` — each one a ~15-line stub that just linked to `-python.md` and `-sql.md` variants.
+- The SKILL.md API tables linked these parents using subdirectory paths (`streaming-table/streaming-table-python.md`) that **didn't exist** — every Skill (Py) / Skill (SQL) link in the API tables was broken.
+- The decision tree, common traps, multi-schema patterns, and feature-coverage notes were partly duplicated between the parent stubs and the lang files.
+
+**After:**
+- Deleted all 9 parent stubs.
+- SKILL.md tables now link directly to `references/streaming-table-python.md` / `references/streaming-table-sql.md` etc. All Skill (Py) / Skill (SQL) links are valid.
+- Standalone content that lived only in the parent stubs (format-options pointers from `auto-loader.md`) was inlined into the SKILL.md API Reference list.
+- Same cleanup applied to a final stub `write-spark-declarative-pipelines.md` (8 lines, just two links) — deleted, SKILL.md updated.
+
+## 4. API tables: added Description column, removed Deprecation column, hoisted DLT migration
+
+**Before:**
+- API tables (Dataset / Flow & Sink / CDC / Quality / Reading / Schema) had a `Python (deprecated)` and `SQL (deprecated)` column on every row. The deprecated column duplicated information that belongs in one canonical place. Agents had to read both columns side-by-side, but the recommendation in every row was always "use the modern one."
+- No description column — rows said "Streaming Table" / "`@dp.table()` returning streaming DF" but didn't explain *what* it is, just the syntax.
+- The `Import / Module APIs` table was a separate deprecation grid.
+
+**After:**
+- Added a `Description` column to all 6 API tables. Rows now answer "what is this feature" in one line before showing syntax.
+- Removed the `Python (deprecated)` / `SQL (deprecated)` columns from API tables.
+- Added a single `## Legacy DLT Syntax — always migrate` table at the end of the API Reference section. One canonical place lists every legacy form (`import dlt`, `@dlt.*`, `dlt.read*`, `dlt.apply_changes`, `LIVE.` prefix, `CREATE LIVE TABLE`, `APPLY CHANGES INTO`, `partition_cols`, `input_file_name()`, `target=` parameter) → modern equivalent, with a "read `dlt-migration.md` first" instruction.
+- Special-case carve-out for `CREATE TEMPORARY LIVE VIEW`: the docs explicitly retain `CREATE LIVE VIEW` because `CREATE TEMPORARY VIEW` does NOT support `CONSTRAINT` clauses. The Legacy DLT table calls this out so the model doesn't naively migrate it. Source: [Databricks Lakeflow Pipelines docs - CREATE TEMPORARY VIEW](https://docs.databricks.com/aws/en/ldp/developer/ldp-sql-ref-create-temporary-view).
+
+## 5. SDP-specific traps merged in from a-d-k
+
+After filtering against existing DAS content (avoiding duplicates), these gaps from a-d-k were merged in:
+
+### `CREATE OR REPLACE` is not SDP
+- Agents trained on standard SQL try `CREATE OR REPLACE STREAMING TABLE`. SDP rejects this — the keyword is `CREATE OR REFRESH`.
+- Added to SKILL.md Common Issues table with the exact error → fix mapping.
+
+### `dbfs:` prefix required on Volume paths
+- `databricks fs ls /Volumes/...` fails; the CLI requires `dbfs:` even though it's a UC Volume.
+- Added to SKILL.md Common Issues table.
+
+### `CLUSTER BY` type rules
+- SDP doesn't pre-validate cluster keys. Pipeline runs, then fails on first write with `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED` for BOOLEAN / ARRAY / MAP / STRUCT / BINARY columns.
+- Added to SKILL.md Common Issues. Points at the full type rules already documented in `performance.md` instead of duplicating them.
+
+### Common Issues table (5 new rows beyond what already existed)
+Added concrete error → fix rows for:
+- `Cannot create streaming table from batch query` → `FROM STREAM read_files(...)` (not `FROM read_files(...)`)
+- `Column not found` at ingest time → `schemaHints` misalignment
+- Streaming reads fail with parser error → file vs table-to-table STREAM forms
+- Pipeline stuck `INITIALIZING` for serverless → normal, don't kill
+- MV doesn't incrementally refresh → serverless + `delta.enableRowTracking = true`
+
+### SCD2 column-name pitfall
+- Agents query `WHERE START_AT IS NULL` and get "column not found". Lakeflow uses double-underscore: `__START_AT` / `__END_AT`. Added to Common Issues.
+
+### `error.exceptions[0].message` extraction (the highest-leverage fix)
+- The polling jq pattern in `2-rapid-iteration-with-cli.md` previously read `.message` from event-log entries — which is just "Update X is FAILED", useless.
+- The real cause lives in `error.exceptions[0].message` (nested). Updated the jq to extract both summary and exception body. Added a self-check ("if you only see 'Update X is FAILED', you're not extracting `error.exceptions[0].message` — fix the jq and re-run").
+- Without this fix, the entire FAILED-debugging instruction was returning nothing useful.
+
+### Debug-trace-upstream protocol
+- On data validation failures, trace upstream: bronze empty = path / files missing, silver empty = filter too aggressive, gold wrong counts = aggregation / duplicate keys. Added to Step 6 of `2-rapid-iteration-with-cli.md`.
+
+### Gold-layer preserve-dimensions rule
+- When aggregating to Gold, agents tend to over-aggregate and lose dimensions analysts need for filters. Added a one-paragraph rule in SKILL.md Pipeline Structure: "if a dashboard is mentioned, every filter on it needs to be a column in the underlying Gold table."
+
+### Names: SDP / LDP / DLT equivalence
+- Users mention SDP, LDP, Lakeflow Declarative Pipelines, and (older) DLT interchangeably. Added a one-line equivalence note to Common Traps so the model recognizes all four are the same product.
+
+## 6. SQL syntax conflicts fixed — `FROM STREAM table_name` was inconsistent everywhere
+
+**Before:**
+- Per Databricks SDP SQL docs: `FROM STREAM(table_name)` (with parens, function form) is the canonical pattern for reading a sibling table. `FROM STREAM table_name` (without parens, legacy DLT bareword form) still parses but is the older syntax.
+- The skill was inconsistent: SKILL.md Common Issues line said "prefer function form", but SKILL.md API Reading Data table showed the legacy bareword form, and ~25 example code blocks across `streaming-patterns.md`, `kafka.md`, `dlt-migration.md`, `performance.md` used the legacy bareword form.
+
+**After:**
+- Normalized all SQL examples to the function form `FROM STREAM(table_name)`.
+- Function calls (`STREAM read_files(...)`, `STREAM read_stream(...)`) correctly use the no-extra-parens form, matching what `auto-loader-sql.md` Pattern 1 already does and per the Databricks doc example for `CREATE STREAMING TABLE` reading from `read_files`.
+- Fixed `auto-loader-sql.md` patterns that incorrectly wrapped `STREAM(read_files(...))` with extra parens.
+
+Source: [Databricks Lakeflow Pipelines docs - CREATE STREAMING TABLE](https://docs.databricks.com/aws/en/ldp/developer/ldp-sql-ref-create-streaming-table) shows both forms (`FROM STREAM(customers_bronze)` for tables and `FROM STREAM read_files(...)` for functions).
+
+## 7. `CREATE OR REFRESH` consistency — ~12 examples were bare `CREATE`
+
+**Before:**
+- SKILL.md Common Traps rule says: "Prefer `CREATE OR REFRESH` over bare `CREATE` for SQL dataset definitions" — but `materialized-view-sql.md` Patterns 1-5, `streaming-table-sql.md` Patterns 2/3/4/5/7, `temporary-view-sql.md` Patterns 1-2, and `view-sql.md` Pattern 1 all use bare `CREATE MATERIALIZED VIEW` or `CREATE STREAMING TABLE`.
+
+**After:**
+- Added `OR REFRESH` to every applicable SQL example in those four files. `CREATE VIEW` (persistent UC view) was left alone — it's a different SQL feature and `OR REFRESH` doesn't apply.
+
+## 8. Legacy API surfaced as current — three places
+
+**Before:**
+- `temporary-view-python.md` line 5 said `@dp.temporary_view() (preferred) / @dp.view() (alias) / @dlt.view() (deprecated)` — claiming `@dp.view()` is a current alias.
+- `expectations-python.md` listed `@dp.view()` and `@dlt.view()` as decorators expectations can be applied to, and had a "With Views" pattern using `@dp.view(...)` as the recommended form.
+- `auto-cdc-python.md` lines 5 and 44 listed legacy and current API names as if they were all current: `**dp.create_auto_cdc_flow() / dp.apply_changes() / dlt.create_auto_cdc_flow() / dlt.apply_changes()**`.
+
+**After:**
+- Verified against the Databricks docs ([temporary_view reference](https://docs.databricks.com/aws/en/ldp/developer/ldp-python-ref-view)): only `@dp.temporary_view()` is documented as current. `@dp.view()` doesn't appear anywhere in the modern API reference. `@dlt.view` is described as the older form.
+- `temporary-view-python.md` updated: drops the "(alias)" framing for `@dp.view`, marks it legacy.
+- `expectations-python.md` updated: removed `@dp.view` from the current-decorator list, renamed the "With Views" pattern to "With Temporary Views" using `@dp.temporary_view`.
+- `auto-cdc-python.md` updated: headers now show only the modern API (`dp.create_auto_cdc_flow()`, `dp.create_auto_cdc_from_snapshot_flow()`). Legacy aliases are mentioned in a one-liner pointing at the SKILL.md Legacy DLT table.
+
+## 9. `sequence_by` accepts string or Column — was misdocumented
+
+**Before:**
+- SKILL.md Legacy DLT note and `dlt-migration.md` "Key differences" both said `sequence_by` "takes a Column object (`col("...")`) not a string" — but `auto-cdc-python.md` correctly documents it as `sequence_by (str | Column)` accepting both.
+
+**After:**
+- SKILL.md and `dlt-migration.md` corrected: both string column name and `col(...)` work; `col(...)` is more idiomatic. The Common Issues row in `dlt-migration.md` was updated to say "if you hit a type error, check that the column exists in the source" rather than recommending `col()` as the only valid form.
+
+## 10. `partition_cols` legacy flag
+
+**Before:**
+- SKILL.md Table/Schema Features API table line 168 marks `partition_cols` / `PARTITIONED BY` as "Legacy fixed partitioning. Prefer Liquid Clustering."
+- SKILL.md Legacy DLT Syntax line 192 also lists it as legacy.
+- BUT `streaming-table-python.md` and `materialized-view-python.md` list `partition_cols=[""]` as a parameter with the description "Columns to partition the table by" — no legacy warning. Agents picking up the lang file without reading SKILL.md would treat it as a current option.
+
+**After:**
+- Both lang files now flag `partition_cols` as "Legacy — prefer `cluster_by` (Liquid Clustering) for new tables" with a link to `performance.md`. This doesn't remove the parameter (it still works) but makes the recommendation visible at the parameter list.
+
+## 11. Broken & placeholder links — 12 fixes
+
+| File | Problem | Fix |
+|---|---|---|
+| `sink-python.md:27` | `[streaming-table-python.md](../streaming-table/streaming-table-python.md)` — nonexistent subdir | Changed to `(streaming-table-python.md)` |
+| `dlt-migration.md:444-446` | Pointed at `python/1-syntax-basics.md` / `python/4-cdc-patterns.md` / `sql/4-cdc-patterns.md` — a-d-k pre-flatten layout | Updated to `python-basics.md` / `auto-cdc-python.md` / `auto-cdc-sql.md` |
+| `streaming-table-python.md:3, 8, 236` | "the `materializedView` API guide" placeholders | Replaced with `[materialized-view-python.md](materialized-view-python.md)` |
+| `streaming-table-sql.md:3, 8, 133, 287` | Multiple placeholder names (`materializedView`, `autoCdc`, `expectations` API guide) | Replaced with actual file links |
+| `materialized-view-python.md:3, 77, 189` | "the `streamingTable` API guide" placeholders | Replaced with `[streaming-table-python.md](streaming-table-python.md)` |
+| `materialized-view-sql.md:3, 45, 179` | Same placeholders | Replaced with file links |
+| `auto-loader-python.md:18` | "look up `streamingTable` guide" placeholder | Replaced with file link |
+| `temporary-view-sql.md:82` | "the 'expectations' API guide" placeholder | Replaced with file link |
+| `kafka.md:207` | Referenced deleted parent stub `sink.md` | Updated to `sink-python.md` |
+| `streaming-patterns.md:5, 87, 444` | Referenced deleted parents | Updated to `-python.md` / `-sql.md` |
+| `scd-2-querying.md:5` | Referenced deleted `auto-cdc.md` parent | Updated to per-language files |
+| `pipeline-configuration.md:272` | Linked to deleted `workflows.md` for polling pattern | Updated to `2-rapid-iteration-with-cli.md` anchor |
+
+## 12. Reference structure consistency — Traps vs Issues
+
+**Before:**
+- The merge ended up with overlapping `Common Traps` and `Common Issues` sections in SKILL.md. Some rows belonged to both (e.g. `MV incremental refresh` requirement); others mixed design-time decisions with error-message lookups.
+
+**After:**
+- Clear split:
+  - **Common Traps** = design-time decisions (when the user asks for X, respond with Y). Examples: dataset-type selection, intermediate-logic patterns, recommend-ONE-clear-approach.
+  - **Common Issues** = error message → fix mappings (concrete CLI / runtime failures). Examples: `Cannot create streaming table from batch query`, `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED`, `dbfs:` prefix.
+- Moved 3 rows from Traps to Issues (`CREATE OR REPLACE`, `dbfs:` prefix, `CLUSTER BY` type rule — all are error→fix, not design rules).
+- Merged 1 trap into existing issue row (`MV incremental refresh` serverless detail folded into the row-tracking issue row).
+- Each row now lives in exactly one section.
+
+## 13. SKILL.md typos
+
+- "When ensure" → "When unsure" (Language Selection section)
+- "exploration / demo scafolding" → "exploration / demo scaffolding" (Choose Your Workflow section)
+- "Option C: Rapid CLI Iteration, interactive demo creation" → "Option C: Rapid CLI Iteration"
+
+## Acceptable remaining duplication
+
+Some duplication is intentional and kept:
+
+- **Polling rule** mentioned in SKILL.md "Running a Pipeline", `1-project-initialization-with-dab.md` "Running a Pipeline (Workflow A/B)", and detailed in `2-rapid-iteration-with-cli.md` Step 4. The first two link to the third — only one place has the actual jq code. ✅ Single source of truth, two index entries.
+- **`Sinks are Python-only`** appears in the decision tree, Flow & Sink API table, Common Traps, API Reference list, Platform Constraints, sink-python.md, foreach-batch-sink-python.md, kafka.md. Each is contextually justified — agents will hit at least one of these while looking at sinks. Kept as-is; not worth deduping.
+- **Incremental refresh requirements** are in SKILL.md Common Issues, `materialized-view-python.md`, `materialized-view-sql.md`, and `performance.md`. Each restates the rule in its context (one-liner index → operational detail → perf-framing).
+- **`skipChangeCommits`** appears in 4-5 places as quick examples. Acceptable.
+
+## Authoritative sources used
+
+- [Databricks Lakeflow Pipelines docs (entry)](https://docs.databricks.com/aws/en/ldp/)
+- [CREATE STREAMING TABLE (pipelines)](https://docs.databricks.com/aws/en/ldp/developer/ldp-sql-ref-create-streaming-table) — `STREAM(table)` vs `STREAM read_files(...)` syntax
+- [CREATE TEMPORARY VIEW (pipelines)](https://docs.databricks.com/aws/en/ldp/developer/ldp-sql-ref-create-temporary-view) — `CONSTRAINT` clauses not supported; `CREATE LIVE VIEW` retained for expectations on temp views
+- [temporary_view (Python)](https://docs.databricks.com/aws/en/ldp/developer/ldp-python-ref-view) — `@dp.view` not documented as current; `@dlt.view` is older form
+- [Expectations](https://docs.databricks.com/aws/en/ldp/expectations) — supported on STs, MVs, and temp views
+
+## Manifest
+
+Validated with `python3 scripts/skills.py validate` → "Everything is up to date."
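Reviewer aside on the `error.exceptions[0].message` fix described in section 5 of the notes: the following is a minimal Python sketch of the same extraction the notes attribute to the updated jq, not the skill's actual code. The event shape (a top-level `message` plus a nested `error.exceptions` list) is taken from the field names in the notes; the function name `extract_failure` and the sample event contents are hypothetical.

```python
import json
from typing import Optional, Tuple


def extract_failure(event: dict) -> Tuple[str, Optional[str]]:
    """Return (summary, root_cause) for one event-log entry.

    summary    is the top-level message, e.g. "Update X is FAILED".
    root_cause is error.exceptions[0].message, or None when absent.
    """
    summary = event.get("message", "")
    # Tolerate a missing or null "error" object and an empty exceptions list.
    exceptions = (event.get("error") or {}).get("exceptions") or []
    root_cause = exceptions[0].get("message") if exceptions else None
    return summary, root_cause


# Hypothetical event; only the field names come from the review notes.
sample_event = json.loads("""
{
  "message": "Update abc123 is FAILED",
  "error": {"exceptions": [{"message": "Table or view not found: bronze_orders"}]}
}
""")

summary, root_cause = extract_failure(sample_event)
print(summary)
print(root_cause)
```

Returning both values mirrors the self-check in the notes: if only the summary comes back and the root cause is `None`, the nested field was not read (or the event carried no exception).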
diff --git a/manifest.json b/manifest.json
index eb4b047..062879e 100644
--- a/manifest.json
+++ b/manifest.json
@@ -111,22 +111,19 @@
 "agents/openai.yaml",
 "assets/databricks.png",
 "assets/databricks.svg",
+ "references/1-project-initialization-with-dab.md",
+ "references/2-rapid-iteration-with-cli.md",
 "references/auto-cdc-python.md",
 "references/auto-cdc-sql.md",
- "references/auto-cdc.md",
 "references/auto-loader-python.md",
 "references/auto-loader-sql.md",
- "references/auto-loader.md",
 "references/dlt-migration.md",
 "references/expectations-python.md",
 "references/expectations-sql.md",
- "references/expectations.md",
 "references/foreach-batch-sink-python.md",
- "references/foreach-batch-sink.md",
 "references/kafka.md",
 "references/materialized-view-python.md",
 "references/materialized-view-sql.md",
- "references/materialized-view.md",
 "references/options-avro.md",
 "references/options-csv.md",
 "references/options-json.md",
@@ -139,19 +136,13 @@
 "references/python-basics.md",
 "references/scd-2-querying.md",
 "references/sink-python.md",
- "references/sink.md",
 "references/sql-basics.md",
 "references/streaming-patterns.md",
 "references/streaming-table-python.md",
 "references/streaming-table-sql.md",
- "references/streaming-table.md",
 "references/temporary-view-python.md",
 "references/temporary-view-sql.md",
- "references/temporary-view.md",
- "references/view-sql.md",
- "references/view.md",
- "references/workflows.md",
- "references/write-spark-declarative-pipelines.md"
+ "references/view-sql.md"
 ]
 },
 "databricks-serverless-migration": {
diff --git a/skills/databricks-pipelines/SKILL.md b/skills/databricks-pipelines/SKILL.md
index 78e05b3..a42ed69 100644
--- a/skills/databricks-pipelines/SKILL.md
+++ b/skills/databricks-pipelines/SKILL.md
@@ -48,23 +48,35 @@ User request → What kind of output?

 ## Common Traps

-- **"Create a table"** without specifying type → ask whether the source is streaming or batch
-- **Materialized View from streaming source** is an error → use a Streaming Table instead, or switch to a batch read
-- **Streaming Table from batch source** is an error → use a Materialized View instead, or switch to a streaming read
-- **Aggregation over streaming table** → use a Materialized View with batch read (`spark.read.table` / `SELECT FROM` without `STREAM`), NOT a Streaming Table. This is the correct pattern for Gold layer aggregation.
-- **Aggregation over batch/historical data** → use a Materialized View, not a Streaming Table. MVs recompute or incrementally refresh aggregates to stay correct; STs are append-only and don't recompute when source data changes.
-- **Preprocessing before Auto CDC** → use a Temporary View to filter/transform the source before feeding into the CDC flow. SQL: the CDC flow reads from the view via `STREAM(view_name)`. Python: use `spark.readStream.table("view_name")`.
-- **Intermediate logic → default to Temporary View** → Use a Temporary View for intermediate/preprocessing logic, even when reused by multiple downstream tables. Only consider a Private MV/ST (`private=True` / `CREATE PRIVATE ...`) when the computation is expensive and materializing once would save significant reprocessing.
-- **View vs Temporary View** → Persistent Views publish to Unity Catalog (SQL only), Temporary Views are pipeline-private
-- **Union of streams** → use multiple Append Flows. Do NOT present UNION as an alternative — it is an anti-pattern for streaming sources.
-- **Changing dataset type** → cannot change ST→MV or MV→ST without manually dropping the existing table first. Full refresh does NOT help. Rename the new dataset instead.
-- **SQL `OR REFRESH`** → Prefer `CREATE OR REFRESH` over bare `CREATE` for SQL dataset definitions. Both work identically, but `OR REFRESH` is the idiomatic convention. For PRIVATE datasets: `CREATE OR REFRESH PRIVATE STREAMING TABLE` / `CREATE OR REFRESH PRIVATE MATERIALIZED VIEW`.
-- **Kafka/Event Hubs sink serialization** → The `value` column is mandatory. Use `to_json(struct(*)) AS value` to serialize the entire row as JSON. Read the sink skill for details.
-- **Multi-column sequencing** in Auto CDC → SQL: `SEQUENCE BY STRUCT(col1, col2)`. Python: `sequence_by=struct("col1", "col2")`. Read the auto-cdc skill for details.
-- **Auto CDC supports TRUNCATE** (SCD Type 1 only) → SQL: `APPLY AS TRUNCATE WHEN condition`. Python: `apply_as_truncates=expr("condition")`. Do NOT say truncate is unsupported.
-- **Python-only features** → Sinks, ForEachBatch Sinks, CDC from snapshots, and custom data sources are Python-only. When the user is working in SQL, explicitly clarify this and suggest switching to Python.
-- **MV incremental refresh** → Materialized Views on **serverless** pipelines support automatic incremental refresh for aggregations. Mention the serverless requirement when discussing incremental refresh.
-- **Recommend ONE clear approach** → Present a single recommended approach. Do NOT present anti-patterns or significantly inferior alternatives — it confuses users. Only mention alternatives if they are genuinely viable for different trade-offs.
+- **Names** → SDP = LDP = Lakeflow Declarative Pipelines = (formerly) DLT. All interchangeable when the user mentions them.
+- **"Create a table" without specifying type** → ask whether the source is streaming or batch. Streaming source → Streaming Table; batch source → Materialized View. Mismatched pairs error at validation.
+- **Aggregation over a streaming source** → use a Materialized View with a batch read (`spark.read.table` / `SELECT FROM` without `STREAM`). STs are append-only and don't recompute aggregates when source rows change; MVs do.
+- **Intermediate logic** → default to a Temporary View. Even for shared logic reused by multiple downstream tables. Use a Private MV/ST (`private=True` / `CREATE PRIVATE ...`) only when materializing once saves significant reprocessing. For preprocessing before Auto CDC, the temp view is required — the CDC flow reads from `STREAM(view_name)` (SQL) or `spark.readStream.table("view_name")` (Python).
+- **Union of streams** → use multiple Append Flows. UNION across streaming sources is an anti-pattern.
+- **Changing dataset type** → cannot change ST→MV or MV→ST in place. Full refresh does NOT help. Drop the existing table manually or rename the new dataset.
+- **`CREATE OR REFRESH` vs `CREATE`** → both parse for SQL datasets, but `CREATE OR REFRESH` is the idiomatic convention. For PRIVATE datasets: `CREATE OR REFRESH PRIVATE STREAMING TABLE` / `... MATERIALIZED VIEW`.
+- **Kafka/Event Hubs sink serialization** → the `value` column is mandatory; serialize the row with `to_json(struct(*)) AS value`. See [sink-python.md](references/sink-python.md).
+- **Multi-column Auto CDC sequencing** → SQL: `SEQUENCE BY STRUCT(col1, col2)`. Python: `sequence_by=struct("col1", "col2")`. See the auto-cdc references.
+- **Auto CDC TRUNCATE** (SCD Type 1 only) → SQL: `APPLY AS TRUNCATE WHEN condition`. Python: `apply_as_truncates=expr("condition")`. Do NOT claim truncate is unsupported.
+- **Python-only features** → Sinks, ForEachBatch Sinks, CDC from snapshots, and custom data sources are Python-only. When the user is working in SQL, clarify this and suggest switching to Python.
+- **Recommend ONE clear approach** → present a single recommended path. Don't list anti-patterns or inferior alternatives — they confuse. Only mention alternatives when they genuinely offer different trade-offs.
+
+## Common Issues
+
+Error → cause/fix mappings agents hit constantly. For DAB-bundle vs CLI-iteration deploy issues, see the workflow-specific reference files.
+
+| Error / symptom | Cause / fix |
+|-----------------|-------------|
+| Rejection of `CREATE OR REPLACE STREAMING TABLE` / `MATERIALIZED VIEW` | `CREATE OR REPLACE` is standard SQL, NOT SDP. Use `CREATE OR REFRESH STREAMING TABLE` / `CREATE OR REFRESH MATERIALIZED VIEW`. |
+| CLI errors on `databricks fs ls /Volumes/...` | The `dbfs:` prefix is required even for UC Volume paths: `databricks fs ls dbfs:/Volumes////`. |
+| `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED` at first write | A `CLUSTER BY` column is BOOLEAN / ARRAY / MAP / STRUCT / BINARY. SDP doesn't pre-validate — verify with `DESCRIBE` before submitting. Cluster keys must be numeric / string / date / timestamp. Full type rules in [references/performance.md](references/performance.md#cluster-key-data-types). |
+| `Cannot create streaming table from batch query` | In a streaming-table query you wrote `FROM read_files(...)` (batch). Use `FROM STREAM read_files(...)` so Auto Loader kicks in. |
+| `Column not found` at ingest time | `schemaHints` don't match the actual file schema. `DESCRIBE` a sample file and align the hints. |
+| Streaming reads fail with parser error | Use `FROM STREAM read_files(...)` for file ingestion and `FROM stream(table)` (or `FROM STREAM table_name` — legacy DLT, prefer function form) for table-to-table streams. Don't mix. |
+| Pipeline stuck `INITIALIZING` for serverless | Normal — first run takes a few minutes for cold start. Don't kill it. |
+| Materialized View doesn't incrementally refresh | Automatic incremental refresh for aggregations requires **serverless** + Delta row tracking on the source (`delta.enableRowTracking = true`). Without both, falls back to full recompute. Mention the serverless requirement when the user asks about incremental refresh. |
+| SCD2 query returns nothing / "column not found" on `START_AT` | Lakeflow uses `__START_AT` / `__END_AT` (double underscore). Current rows: `WHERE __END_AT IS NULL`. |
+| `error.exceptions[0].message` missing from your events output | Your `jq` is reading `.message` (which is just "Update X is FAILED"). Read `error.exceptions[0].message` for the real cause — see [2-rapid-iteration-with-cli.md](references/2-rapid-iteration-with-cli.md#step-4-start-an-update-and-poll-that-update). |

 ## Publishing Modes

@@ -75,259 +87,177 @@ Pipelines use a **default catalog and schema** configured in the pipeline settin
 - **LIVE prefix**: Deprecated. Ignored in the default publishing mode.
 - When reading or defining datasets within the pipeline, use the dataset name only — do NOT use fully-qualified names unless the pipeline already does so or the user explicitly requests a different target catalog/schema.

-## Comprehensive API Reference
+## API Reference

-**MANDATORY:** Before implementing, editing, or suggesting any code for a feature, you MUST read the linked reference file for that feature. NO exceptions — always look up the reference before writing code.
+**Before writing pipeline code for any feature, read the linked reference file.** Each table below maps the feature to the exact API and to the detail file for that (feature, language).

-Some features require reading multiple skills together:
+Some features sit on top of others — read both:

-- **Auto Loader** → also read the streaming-table skill (Auto Loader produces a streaming DataFrame, so the target is a streaming table) and look up format-specific options for the file format being loaded
-- **Auto CDC** → also read the streaming-table skill (Auto CDC always targets a streaming table)
-- **Sinks** → also read the streaming-table skill (sinks use streaming append flows)
-- **Expectations** → also read the corresponding dataset definition skill to ensure constraints are correctly placed
+- **Auto Loader** / **Auto CDC** / **Sinks** target a streaming table → also read [streaming-table-python.md](references/streaming-table-python.md) / [streaming-table-sql.md](references/streaming-table-sql.md).
+- **Expectations** attach to a dataset → also read the dataset definition file (streaming-table / materialized-view / temporary-view).

 ### Dataset Definition APIs

-| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) |
-| -------------------------- | ------------------------------------ | ------------------------------------- | ------------------------------------------- | ----------------------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------------- |
-| Streaming Table | `@dp.table()` returning streaming DF | `@dlt.table()` returning streaming DF | `CREATE OR REFRESH STREAMING TABLE` | `CREATE STREAMING LIVE TABLE` | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) |
-| Materialized View | `@dp.materialized_view()` | `@dlt.table()` returning batch DF | `CREATE OR REFRESH MATERIALIZED VIEW` | `CREATE LIVE TABLE` (batch) | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) |
-| Temporary View | `@dp.temporary_view()` | `@dlt.view()`, `@dp.view()` | `CREATE TEMPORARY VIEW` | `CREATE TEMPORARY LIVE VIEW` | [temporary-view-python](temporary-view/temporary-view-python.md) | [temporary-view-sql](temporary-view/temporary-view-sql.md) |
-| Persistent View (UC) | N/A — SQL only | — | `CREATE VIEW` | — | — | [view-sql](view/view-sql.md) |
-| Streaming Table (explicit) | `dp.create_streaming_table()` | `dlt.create_streaming_table()` | `CREATE OR REFRESH STREAMING TABLE` (no AS) | — | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) |
+| Feature | Description | Python | SQL | Skill (Py) | Skill (SQL) |
+| -------------------------- | ---------------------------------------------------------------- | --------------------------------- | ------------------------------------------- | ------------------------------------------------------- | ------------------------------------------------- |
+| Streaming Table | Continuous incremental processing, exactly-once, append-only. | `@dp.table()` returning streaming DF | `CREATE OR REFRESH STREAMING TABLE` | [streaming-table-python](references/streaming-table-python.md) | [streaming-table-sql](references/streaming-table-sql.md) |
+| Materialized View | Physically stored query result, incrementally refreshed. | `@dp.materialized_view()` | `CREATE OR REFRESH MATERIALIZED VIEW` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) |
+| Temporary View | Pipeline-private, not persisted to Unity Catalog. | `@dp.temporary_view()` | `CREATE TEMPORARY VIEW` | [temporary-view-python](references/temporary-view-python.md) | [temporary-view-sql](references/temporary-view-sql.md) |
+| Persistent View (UC) | Published to UC; query runs on access (no storage). | N/A — SQL only | `CREATE VIEW` | — | [view-sql](references/view-sql.md) |
+| Streaming Table (explicit) | Empty target, populated by separate flows (Append Flow, AUTO CDC). | `dp.create_streaming_table()` | `CREATE OR REFRESH STREAMING TABLE` (no AS) | [streaming-table-python](references/streaming-table-python.md) | [streaming-table-sql](references/streaming-table-sql.md) |

 ### Flow and Sink APIs

-| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) |
-| ---------------------------- | ---------------------------- | ----------------------------- | -------------------------------------- | ---------------- | ---------------------------------------------------------------------------- | ------------------------------------------------------------- |
-| Append Flow | `@dp.append_flow()` | `@dlt.append_flow()` | `CREATE FLOW ... INSERT INTO` | — | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) |
-| Backfill Flow | `@dp.append_flow(once=True)` | `@dlt.append_flow(once=True)` | `CREATE FLOW ... INSERT INTO ...
ONCE` | — | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) | -| Sink (Delta/Kafka/EH/custom) | `dp.create_sink()` | `dlt.create_sink()` | N/A — Python only | — | [sink-python](sink/sink-python.md) | — | -| ForEachBatch Sink | `@dp.foreach_batch_sink()` | — | N/A — Python only | — | [foreach-batch-sink-python](foreach-batch-sink/foreach-batch-sink-python.md) | — | +| Feature | Description | Python | SQL | Skill (Py) | Skill (SQL) | +| ---------------------------- | ---------------------------------------------------------------- | ---------------------------- | -------------------------------------- | ----------------------------------------------------------- | ------------------------------------------------- | +| Append Flow | Fan-in: multiple sources → one streaming table. Use instead of UNION. | `@dp.append_flow()` | `CREATE FLOW ... INSERT INTO` | [streaming-table-python](references/streaming-table-python.md) | [streaming-table-sql](references/streaming-table-sql.md) | +| Backfill Flow | One-time historical load + ongoing live stream into same table. | `@dp.append_flow(once=True)` | `CREATE FLOW ... INSERT INTO ... ONCE` | [streaming-table-python](references/streaming-table-python.md) | [streaming-table-sql](references/streaming-table-sql.md) | +| Sink (Delta/Kafka/EH/custom) | Write streaming output to external Delta / Kafka / Event Hubs. | `dp.create_sink()` | N/A — Python only | [sink-python](references/sink-python.md) | — | +| ForEachBatch Sink | Custom per-batch Python logic (merge/upsert, multi-destination). Public Preview. 
| `@dp.foreach_batch_sink()` | N/A — Python only | [foreach-batch-sink-python](references/foreach-batch-sink-python.md) | — | ### CDC APIs -| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) | -| ---------------------------- | ----------------------------------------- | ------------------------------------------- | ------------------------------- | ------------------------------------ | ---------------------------------------------- | ---------------------------------------- | -| Auto CDC (streaming source) | `dp.create_auto_cdc_flow()` | `dlt.apply_changes()`, `dp.apply_changes()` | `AUTO CDC INTO ... FROM STREAM` | `APPLY CHANGES INTO ... FROM STREAM` | [auto-cdc-python](auto-cdc/auto-cdc-python.md) | [auto-cdc-sql](auto-cdc/auto-cdc-sql.md) | -| Auto CDC (periodic snapshot) | `dp.create_auto_cdc_from_snapshot_flow()` | `dlt.apply_changes_from_snapshot()` | N/A — Python only | — | [auto-cdc-python](auto-cdc/auto-cdc-python.md) | — | +| Feature | Description | Python | SQL | Skill (Py) | Skill (SQL) | +| ---------------------------- | -------------------------------------------------------------------- | ------------------------------------------- | ------------------------------- | ----------------------------------------- | ------------------------------------ | +| Auto CDC (streaming source) | SCD Type 1 (overwrite) or Type 2 (history) from a CDC feed. | `dp.create_auto_cdc_flow()` | `AUTO CDC INTO ... FROM STREAM` | [auto-cdc-python](references/auto-cdc-python.md) | [auto-cdc-sql](references/auto-cdc-sql.md) | +| Auto CDC (periodic snapshot) | Compare consecutive full snapshots to detect changes. | `dp.create_auto_cdc_from_snapshot_flow()` | N/A — Python only | [auto-cdc-python](references/auto-cdc-python.md) | — | + +For querying SCD Type 2 history tables (`__START_AT` / `__END_AT`, point-in-time, joining facts with historical dimensions), see [scd-2-querying.md](references/scd-2-querying.md). 
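The three SCD Type 2 query shapes named above (current state, point-in-time, fact-to-dimension join) can be sketched in plain Python to show the interval logic. This is only an illustration, not the Spark or SQL API: the `HISTORY` rows, the `customer_id` key, the helper names, and the half-open `[__START_AT, __END_AT)` rule with an open end marking the current version are all assumptions to check against `scd-2-querying.md`.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for an SCD Type 2 history table.
# Only the `__START_AT` / `__END_AT` column names come from the text above;
# the rows, the key column and the interval rule are illustrative assumptions.
HISTORY = [
    {"customer_id": 1, "tier": "silver",
     "__START_AT": datetime(2024, 1, 1), "__END_AT": datetime(2024, 6, 1)},
    {"customer_id": 1, "tier": "gold",
     "__START_AT": datetime(2024, 6, 1), "__END_AT": None},
]


def current_rows(history):
    """Current state: versions whose __END_AT is still open (None)."""
    return [row for row in history if row["__END_AT"] is None]


def as_of(history, key, ts):
    """Point-in-time: the version of `key` that was valid at `ts`, or None."""
    for row in history:
        if row["customer_id"] != key:
            continue
        started = row["__START_AT"] <= ts
        not_ended = row["__END_AT"] is None or ts < row["__END_AT"]
        if started and not_ended:
            return row
    return None


def join_fact(fact, history):
    """Attach the dimension version that was valid at the fact's event time."""
    dim = as_of(history, fact["customer_id"], fact["event_time"])
    return {**fact, "tier": dim["tier"] if dim else None}
```

In this sketch a fact stamped exactly at a version boundary resolves to the newer version, because the end bound is exclusive.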
### Data Quality APIs -| Feature | Python (current) | Python (deprecated) | SQL (current) | Skill (Py) | Skill (SQL) | -| ------------------ | ---------------------------- | ----------------------------- | ------------------------------------------------------ | ---------------------------------------------------------- | ---------------------------------------------------- | -| Expect (warn) | `@dp.expect()` | `@dlt.expect()` | `CONSTRAINT ... EXPECT (...)` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect or drop | `@dp.expect_or_drop()` | `@dlt.expect_or_drop()` | `CONSTRAINT ... EXPECT (...) ON VIOLATION DROP ROW` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect or fail | `@dp.expect_or_fail()` | `@dlt.expect_or_fail()` | `CONSTRAINT ... EXPECT (...) ON VIOLATION FAIL UPDATE` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect all (warn) | `@dp.expect_all({})` | `@dlt.expect_all({})` | Multiple `CONSTRAINT` clauses | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect all or drop | `@dp.expect_all_or_drop({})` | `@dlt.expect_all_or_drop({})` | Multiple constraints with `DROP ROW` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect all or fail | `@dp.expect_all_or_fail({})` | `@dlt.expect_all_or_fail({})` | Multiple constraints with `FAIL UPDATE` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | +| Feature | Description | Python | SQL | Skill (Py) | Skill (SQL) | +| ------------------ | ------------------------------------------ | ---------------------------- | ------------------------------------------------------ | 
----------------------------------------------------- | ----------------------------------------------- | +| Expect (warn) | Log violations, keep all rows. | `@dp.expect()` | `CONSTRAINT ... EXPECT (...)` | [expectations-python](references/expectations-python.md) | [expectations-sql](references/expectations-sql.md) | +| Expect or drop | Drop violating rows. | `@dp.expect_or_drop()` | `CONSTRAINT ... EXPECT (...) ON VIOLATION DROP ROW` | [expectations-python](references/expectations-python.md) | [expectations-sql](references/expectations-sql.md) | +| Expect or fail | Fail the pipeline on first violation. | `@dp.expect_or_fail()` | `CONSTRAINT ... EXPECT (...) ON VIOLATION FAIL UPDATE` | [expectations-python](references/expectations-python.md) | [expectations-sql](references/expectations-sql.md) | +| Expect all (warn) | Multiple constraints at once, warn only. | `@dp.expect_all({})` | Multiple `CONSTRAINT` clauses | [expectations-python](references/expectations-python.md) | [expectations-sql](references/expectations-sql.md) | +| Expect all or drop | Multiple constraints, drop on violation. | `@dp.expect_all_or_drop({})` | Multiple constraints with `DROP ROW` | [expectations-python](references/expectations-python.md) | [expectations-sql](references/expectations-sql.md) | +| Expect all or fail | Multiple constraints, fail on violation. 
| `@dp.expect_all_or_fail({})` | Multiple constraints with `FAIL UPDATE` | [expectations-python](references/expectations-python.md) | [expectations-sql](references/expectations-sql.md) | ### Reading Data APIs -| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) | -| --------------------------------- | ---------------------------------------------- | --------------------------------------------------- | ------------------------------------------------ | ---------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------- | -| Batch read (pipeline dataset) | `spark.read.table("name")` | `dp.read("name")`, `dlt.read("name")` | `SELECT ... FROM name` | `SELECT ... FROM LIVE.name` | — | — | -| Streaming read (pipeline dataset) | `spark.readStream.table("name")` | `dp.read_stream("name")`, `dlt.read_stream("name")` | `SELECT ... FROM STREAM name` | `SELECT ... FROM STREAM LIVE.name` | — | — | -| Auto Loader (cloud files) | `spark.readStream.format("cloudFiles")` | — | `STREAM read_files(...)` | — | [auto-loader-python](auto-loader/auto-loader-python.md) | [auto-loader-sql](auto-loader/auto-loader-sql.md) | -| Kafka source | `spark.readStream.format("kafka")` | — | `STREAM read_kafka(...)` | — | — | — | -| Kinesis source | `spark.readStream.format("kinesis")` | — | `STREAM read_kinesis(...)` | — | — | — | -| Pub/Sub source | `spark.readStream.format("pubsub")` | — | `STREAM read_pubsub(...)` | — | — | — | -| Pulsar source | `spark.readStream.format("pulsar")` | — | `STREAM read_pulsar(...)` | — | — | — | -| Event Hubs source | `spark.readStream.format("kafka")` + EH config | — | `STREAM read_kafka(...)` + EH config | — | — | — | -| JDBC / Lakehouse Federation | `spark.read.format("postgresql")` etc. 
| — | Direct table ref via federation catalog | — | — | — | -| Custom data source | `spark.read[Stream].format("custom")` | — | N/A — Python only | — | — | — | -| Static file read (batch) | `spark.read.format("json"\|"csv"\|...).load()` | — | `read_files(...)` (no STREAM) | — | — | — | -| Skip upstream change commits | `.option("skipChangeCommits", "true")` | — | `read_stream("name", skipChangeCommits => true)` | — | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) | +| Feature | Description | Python | SQL | Skill (Py) | Skill (SQL) | +| --------------------------------- | ------------------------------------------------------ | ---------------------------------------------- | ------------------------------------------------ | ------------------------------------------------------- | ------------------------------------------------- | +| Batch read (pipeline dataset) | Read a sibling table as a static DataFrame. | `spark.read.table("name")` | `SELECT ... FROM name` | — | — | +| Streaming read (pipeline dataset) | Read a sibling table as a streaming DataFrame. | `spark.readStream.table("name")` | `SELECT ... FROM STREAM name` | — | — | +| Auto Loader (cloud files) | Incrementally ingest new files from cloud storage. | `spark.readStream.format("cloudFiles")` | `STREAM read_files(...)` | [auto-loader-python](references/auto-loader-python.md) | [auto-loader-sql](references/auto-loader-sql.md) | +| Kafka source | Streaming read from Kafka topic. | `spark.readStream.format("kafka")` | `STREAM read_kafka(...)` | [kafka](references/kafka.md) | [kafka](references/kafka.md) | +| Kinesis source | Streaming read from AWS Kinesis. | `spark.readStream.format("kinesis")` | `STREAM read_kinesis(...)` | — | — | +| Pub/Sub source | Streaming read from GCP Pub/Sub. | `spark.readStream.format("pubsub")` | `STREAM read_pubsub(...)` | — | — | +| Pulsar source | Streaming read from Apache Pulsar. 
| `spark.readStream.format("pulsar")` | `STREAM read_pulsar(...)` | — | — | +| Event Hubs source | Streaming read from Azure Event Hubs (Kafka protocol). | `spark.readStream.format("kafka")` + EH config | `STREAM read_kafka(...)` + EH config | [kafka](references/kafka.md) | [kafka](references/kafka.md) | +| JDBC / Lakehouse Federation | Batch read from external systems via federation. | `spark.read.format("postgresql")` etc. | Direct table ref via federation catalog | — | — | +| Custom data source | User-defined Python data source. | `spark.read[Stream].format("custom")` | N/A — Python only | — | — | +| Static file read (batch) | One-shot load of files (no incremental tracking). | `spark.read.format("json"\|"csv"\|...).load()` | `read_files(...)` (no STREAM) | — | — | +| Skip upstream change commits | Ignore CDC commits on the upstream table. | `.option("skipChangeCommits", "true")` | `read_stream("name", skipChangeCommits => true)` | [streaming-table-python](references/streaming-table-python.md) | [streaming-table-sql](references/streaming-table-sql.md) | ### Table/Schema Feature APIs -| Feature | Python (current) | SQL (current) | Skill (Py) | Skill (SQL) | -| ---------------------------- | ----------------------------------------------------- | --------------------------------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------------- | -| Liquid clustering | `cluster_by=[...]` | `CLUSTER BY (col1, col2)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Auto liquid clustering | `cluster_by_auto=True` | `CLUSTER BY AUTO` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Partition columns | `partition_cols=[...]` | `PARTITIONED BY (col1, col2)` | 
[materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Table properties | `table_properties={...}` | `TBLPROPERTIES (...)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Explicit schema | `schema="col1 TYPE, ..."` | `(col1 TYPE, ...) AS` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Generated columns | `schema="..., col TYPE GENERATED ALWAYS AS (expr)"` | `col TYPE GENERATED ALWAYS AS (expr)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Row filter (Public Preview) | `row_filter="ROW FILTER fn ON (col)"` | `WITH ROW FILTER fn ON (col)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Column mask (Public Preview) | `schema="..., col TYPE MASK fn USING COLUMNS (col2)"` | `col TYPE MASK fn USING COLUMNS (col2)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Private dataset | `private=True` | `CREATE PRIVATE ...` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | - -### Import / Module APIs - -| Current | Deprecated | Notes | -| ------------------------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | -| `from pyspark import pipelines as dp` | `import dlt` | Both work. Prefer `dp`. Do NOT change existing `dlt` imports. 
| -| `spark.read.table()` / `spark.readStream.table()` | `dp.read()` / `dp.read_stream()` / `dlt.read()` / `dlt.read_stream()` | Deprecated reads still work. Prefer `spark.*`. | -| — | `LIVE.` prefix | Fully deprecated. NEVER use. Causes errors in newer pipelines. | -| — | `CREATE LIVE TABLE` / `CREATE LIVE VIEW` | Fully deprecated. Use `CREATE STREAMING TABLE` / `CREATE MATERIALIZED VIEW` / `CREATE TEMPORARY VIEW`. | - -## Language-specific guides - - -Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables / DLT) is a framework for building batch and streaming data pipelines. - -## Migrating from DLT - -If you have an existing DLT pipeline (`import dlt`, `@dlt.table`, `dlt.read(...)`, `dlt.apply_changes(...)`) and want to move to SDP, see [references/dlt-migration.md](references/dlt-migration.md). It covers both migration paths — DLT Python → SDP Python (`from pyspark import pipelines as dp`) and DLT Python → SDP SQL — with side-by-side conversions for the table decorators, reads, expectations, CDC/SCD, and partitioning → liquid clustering. - -## Choose Your Workflow - -Three project shapes exist — pick before scaffolding: - -| Situation | Workflow | -|-----------|----------| -| New standalone pipeline project with its own bundle | **A. Standalone bundle** | -| Pipeline added to an existing DAB project | **B. Existing bundle** | -| Quick prototyping, no bundle (yet) | **C. Rapid CLI iteration** | - -Default to A for production-bound work and C for exploration. Full details, generated structures, polling patterns, and edit/re-upload flow in [references/workflows.md](references/workflows.md). 
+| Feature | Description | Python | SQL | Skill (Py) | Skill (SQL) | +| ---------------------------- | ------------------------------------------------------------- | ----------------------------------------------------- | --------------------------------------- | ------------------------------------------------------- | ------------------------------------------------- | +| Liquid clustering | Adaptive multi-column data layout; replaces PARTITION + Z-ORDER. Prefer Auto clustering when possible | `cluster_by=[...]` | `CLUSTER BY (col1, col2)` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | +| Auto liquid clustering | Databricks picks clustering keys from query patterns. | `cluster_by_auto=True` | `CLUSTER BY AUTO` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | +| Partition columns | Legacy fixed partitioning. Prefer Liquid Clustering. | `partition_cols=[...]` | `PARTITIONED BY (col1, col2)` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | +| Table properties | Delta table properties (auto-optimize, CDF, retention). | `table_properties={...}` | `TBLPROPERTIES (...)` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | +| Explicit schema | Declare column types up front (vs inferred). | `schema="col1 TYPE, ..."` | `(col1 TYPE, ...) AS` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | +| Generated columns | Columns computed from other columns at write time. 
| `schema="..., col TYPE GENERATED ALWAYS AS (expr)"` | `col TYPE GENERATED ALWAYS AS (expr)` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | +| Row filter (Public Preview) | UC fine-grained access: filter rows by a function. | `row_filter="ROW FILTER fn ON (col)"` | `WITH ROW FILTER fn ON (col)` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | +| Column mask (Public Preview) | UC fine-grained access: mask a column with a function. | `schema="..., col TYPE MASK fn USING COLUMNS (col2)"` | `col TYPE MASK fn USING COLUMNS (col2)` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | +| Private dataset | Materialized intermediate not published to UC. | `private=True` | `CREATE PRIVATE ...` | [materialized-view-python](references/materialized-view-python.md) | [materialized-view-sql](references/materialized-view-sql.md) | + +### Legacy DLT Syntax — always migrate + +The tables above show **only the modern API**. If you see any of the following in user code, it is the legacy DLT syntax — **always migrate to the modern form**, do not extend it. Read [references/dlt-migration.md](references/dlt-migration.md) before suggesting changes so the conversion is correct (especially around `apply_changes` → `create_auto_cdc_flow` semantics and `partition_cols` → `cluster_by`). + +| If you see… | …it's DLT. Migrate to | +| --------------------------------------------------------------------------- | -------------------------------------------------------------------- | +| `import dlt` | `from pyspark import pipelines as dp` | +| `@dlt.table(...)`, `@dlt.view(...)`, `@dlt.append_flow(...)`, `@dlt.expect*` | Same decorators on `dp.*` (e.g. `@dp.table`, `@dp.expect_or_drop`). 
| +| `dlt.read("name")` / `dlt.read_stream("name")` | `spark.read.table("name")` / `spark.readStream.table("name")` | +| `dp.read(...)` / `dp.read_stream(...)` | Also legacy — use `spark.read.table(...)` / `spark.readStream.table(...)`. | +| `dlt.apply_changes(...)` / `dp.apply_changes(...)` | `dp.create_auto_cdc_flow(...)`. `sequence_by` accepts a column name (string) or `col(...)`; `stored_as_scd_type` is integer `2` for Type 2 or string `"1"` for Type 1. | +| `dlt.apply_changes_from_snapshot(...)` | `dp.create_auto_cdc_from_snapshot_flow(...)` | +| `dlt.create_streaming_table(...)` | `dp.create_streaming_table(...)` | +| `LIVE.` prefix in SQL | Bare name (`SELECT FROM name`, `SELECT FROM STREAM name`). `LIVE.` will error in modern pipelines. | +| `CREATE LIVE TABLE` / `CREATE STREAMING LIVE TABLE` | `CREATE OR REFRESH MATERIALIZED VIEW` / `CREATE OR REFRESH STREAMING TABLE`. | +| `CREATE TEMPORARY LIVE VIEW` (a.k.a. `CREATE LIVE VIEW`) | `CREATE TEMPORARY VIEW`. **Exception**: `CREATE TEMPORARY VIEW` does NOT support `CONSTRAINT` clauses for expectations — for the rare case where you need expectations on a temp view, `CREATE LIVE VIEW` is retained. See [temporary-view-sql.md](references/temporary-view-sql.md#using-expectations-with-temporary-views) and [expectations-sql.md](references/expectations-sql.md). | +| `APPLY CHANGES INTO ... FROM STREAM ...` (SQL) | `AUTO CDC INTO ... FROM STREAM ...` | +| `partition_cols=[...]` / `PARTITIONED BY (...)` + `ZORDER` | `cluster_by=[...]` / `CLUSTER BY (...)` (Liquid Clustering). | +| `input_file_name()` | `_metadata.file_path` (SQL) / `F.col("_metadata.file_path")` (Python). | +| `target=...` parameter on `create_streaming_table` / pipeline config | `schema=...` | ## Language Selection (Python vs SQL) -Decide before scaffolding — the choice picks template files (`.py` vs `.sql`) and which reference docs apply. Both can coexist, but pick a primary. 
+Decide before scaffolding — the choice picks template files (`.py` vs `.sql`) and which reference docs apply. Both can coexist, but pick a primary. When unsure, default to SQL for simplicity. | User signal | Pick | |-------------|------| | "Python pipeline", UDF, pandas, ML inference, pyspark | **Python** | | "SQL pipeline", "SQL files" | **SQL** | -| "Simple pipeline", "create a table", "an aggregation" | **SQL** (simpler) | +| "Simple pipeline", "create a table", "an aggregation" | **SQL** (simpler, use it as default) | | Complex parameterized logic, custom UDFs, ML | **Python** | If ambiguous, ask. Stick with the chosen language unless the user explicitly switches. -## Scaffolding a New Pipeline Project - -The newer `databricks pipelines init` is focused on pipeline projects: - -```bash -databricks pipelines init --output-dir . --config-file init-config.json -``` - -`init-config.json`: - -```json -{ - "project_name": "my_pipeline", - "initial_catalog": "prod_catalog", - "use_personal_schema": "no", - "initial_language": "sql" -} -``` - -The template-based `databricks bundle init lakeflow-pipelines` also works: - -```bash -databricks bundle init lakeflow-pipelines --config-file <(echo '{"project_name": "my_pipeline", "language": "python", "serverless": "yes"}') --profile < /dev/null -``` - -Field constraints: - -- `project_name`: letters, numbers, underscores only -- `language` / `initial_language`: `python` or `sql` (lowercase) - - SQL: Recommended for straightforward transformations (filters, joins, aggregations) - - Python: Recommended for complex logic (custom UDFs, ML, advanced processing) - -See [references/workflows.md](references/workflows.md) for the full generated structure, `databricks.yml` essentials, and per-target catalog/schema patterns. - -After scaffolding, create `CLAUDE.md` and `AGENTS.md` in the project directory. These files are essential to provide agents with guidance on how to work with the project. 
Use this content: - -``` -# Declarative Automation Bundles Project - -This project uses Declarative Automation Bundles (formerly Databricks Asset Bundles) for deployment. +## Choose Your Workflow -## Prerequisites +Three project shapes exist — pick before scaffolding. Default to A for production-bound work and C for exploration / demo scaffolding. -Install the Databricks CLI (>= v0.288.0) if not already installed: -- macOS: `brew tap databricks/tap && brew install databricks` -- Linux: `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh` -- Windows: `winget install Databricks.DatabricksCLI` +- **A: Standalone new pipeline project (DAB)** — pipeline IS the project, no existing `databricks.yml`. Scaffold with `databricks pipelines init --output-dir . --config-file init-config.json`. → [1-project-initialization-with-dab.md](references/1-project-initialization-with-dab.md) +- **B: Pipeline in an existing bundle (DAB)** — `databricks.yml` already exists. Add a `resources/.pipeline.yml` pointing at `src/`. → [1-project-initialization-with-dab.md#workflow-b-pipeline-in-existing-bundle](references/1-project-initialization-with-dab.md#workflow-b-pipeline-in-existing-bundle) +- **C: Rapid CLI iteration (no bundle)** — prototyping. `databricks pipelines create / start-update / list-pipeline-events`; formalise into a bundle later if the work goes to production. → [2-rapid-iteration-with-cli.md](references/2-rapid-iteration-with-cli.md) -Verify: `databricks -v` +## Pipeline Structure -## For AI Agents +- Follow the medallion pattern (Bronze → Silver → Gold) unless the user says otherwise. Keep it simple by default — just a few tables. +- One dataset per file, named after the dataset. Transformation files live in `src/` or `transformations/`. +- **Gold layer: preserve key business dimensions.** When aggregating into Gold, keep the dimensions analysts will filter / slice by (location, department, product line, customer segment, time period). 
Over-aggregating loses information that can't be recovered downstream. If a dashboard is mentioned, every filter on it needs to be a column in the Gold table. Easier to aggregate further in queries than to recover lost dimensions. -Read the `databricks-core` skill for CLI basics, authentication, and deployment workflow. -Read the `databricks-pipelines` skill for pipeline-specific guidance. -If skills are not available, install them: `databricks aitools install` -``` - -## Pipeline Structure +## Running a Pipeline -- Follow the medallion architecture pattern (Bronze → Silver → Gold) unless the user specifies otherwise -- Use the convention of 1 dataset per file, named after the dataset -- Place transformation files in a `src/` or `transformations/` folder +Picking the right run command depends on the workflow chosen above. -``` -my-pipeline-project/ -├── databricks.yml # Bundle configuration -├── resources/ -│ ├── my_pipeline.pipeline.yml # Pipeline definition -│ └── my_pipeline_job.job.yml # Scheduling job (optional) -└── src/ - ├── my_table.py (or .sql) # One dataset per file - ├── another_table.py (or .sql) - └── ... -``` - -## Scheduling Pipelines - -To schedule a pipeline, add a job that triggers it in `resources/.job.yml`: - -```yaml -resources: - jobs: - my_pipeline_job: - trigger: - periodic: - interval: 1 - unit: DAYS - tasks: - - task_key: refresh_pipeline - pipeline_task: - pipeline_id: ${resources.pipelines.my_pipeline.id} -``` +- **Workflow A / B (DAB)** — Code changes only take effect after `databricks bundle deploy`. Always deploy before any run, dry run, or selective refresh. 
+  ```bash
+  databricks bundle validate --profile
+  databricks bundle deploy -t dev --profile
+  databricks bundle run -t dev --profile
+  databricks pipelines get --profile # status
+  ```
+  → Full DAB run + iteration details: [references/1-project-initialization-with-dab.md#running-a-pipeline-workflow-a--b](references/1-project-initialization-with-dab.md#running-a-pipeline-workflow-a--b)
-## Running Pipelines
+- **Workflow C (CLI, no bundle)** — Upload files to the workspace, then drive the pipeline directly. Re-upload after every code change.
+  ```bash
+  databricks workspace import-dir ./my_pipeline /Workspace/Users//my_pipeline --overwrite
+  databricks pipelines start-update
+  ```
+  → Full CLI run + polling pattern: [references/2-rapid-iteration-with-cli.md](references/2-rapid-iteration-with-cli.md)
-**You must deploy before running.** In local development, code changes only take effect after `databricks bundle deploy`. Always deploy before any run, dry run, or selective refresh.
+**Refresh modes (both workflows):**
-- Selective refresh is preferred when you only need to run one table. For selective refresh it is important that dependencies are already materialized.
-- **Full refresh is the most expensive and dangerous option, and can lead to data loss**, so it should be used only when really necessary. Always suggest this as a follow-up that the user explicitly needs to select.
+- **Selective refresh** is preferred when you only need to run one table. Dependencies must already be materialized.
+- **Full refresh** is the most expensive and dangerous option and **can lead to data loss** (it reprocesses streaming sources from scratch, destroying streaming state). Use only when really necessary. Always suggest it as a follow-up the user must explicitly approve.
-## Development Workflow +**Always poll the update**, not top-level pipeline state — see the polling rationale in [2-rapid-iteration-with-cli.md#step-4-start-an-update-and-poll-that-update](references/2-rapid-iteration-with-cli.md#step-4-start-an-update-and-poll-that-update). Same rule applies to bundle runs. -1. **Validate**: `databricks bundle validate --profile ` -2. **Deploy**: `databricks bundle deploy -t dev --profile ` -3. **Run pipeline**: `databricks bundle run -t dev --profile ` -4. **Check status**: `databricks pipelines get --pipeline-id --profile ` +## Reference Index -## Pipeline API Reference +Project & lifecycle: -Detailed reference guides for each pipeline API. **Read the relevant guide before writing pipeline code.** +- [1-project-initialization-with-dab.md](references/1-project-initialization-with-dab.md) — Workflows A and B. +- [2-rapid-iteration-with-cli.md](references/2-rapid-iteration-with-cli.md) — Workflow C; start-update + polling + error-extraction. +- [pipeline-configuration.md](references/pipeline-configuration.md) — Full create/update JSON reference + variant snippets + multi-schema + platform constraints. +- [performance.md](references/performance.md) — Liquid Clustering, state management, joins, pre-aggregation, monitoring. +- [dlt-migration.md](references/dlt-migration.md) — DLT → SDP conversions. 
-### Project & Lifecycle +Cross-cutting patterns: -- [Workflows](references/workflows.md) — Standalone bundle / existing bundle / rapid CLI iteration; language selection; `pipelines init`; start-update + poll-the-update pattern; edit/re-upload/restart flow -- [Pipeline Configuration](references/pipeline-configuration.md) — Full JSON config reference (top-level, clusters, event_log, notifications, configuration, restart_window, environment) + variant snippets (dev mode, non-serverless, continuous, notifications, autoscaling, custom event log, serverless Python deps) + multi-schema patterns + platform constraints -- [Performance Tuning](references/performance.md) — Liquid Clustering by layer (bronze/silver/gold), key-type rules, state-management strategies for streaming, join optimization, pre-aggregation, monitoring -- [Migrating from DLT](references/dlt-migration.md) — Side-by-side conversions (decorators, reads, expectations, CDC/SCD, partitioning → liquid clustering) +- [streaming-patterns.md](references/streaming-patterns.md) — Dedup, windowed aggregations, late data, rescue-data quarantine, anomaly detection, lag monitoring. +- [scd-2-querying.md](references/scd-2-querying.md) — Current-state, point-in-time, joining facts with historical dims. +- [kafka.md](references/kafka.md) — Kafka / Event Hubs ingestion. -### Datasets, Flows & Quality +Auto Loader format-specific options: [JSON](references/options-json.md) · [CSV](references/options-csv.md) · [XML](references/options-xml.md) · [Parquet](references/options-parquet.md) · [Avro](references/options-avro.md) · [Text](references/options-text.md) · [ORC](references/options-orc.md). 
-- [Write Spark Declarative Pipelines](references/write-spark-declarative-pipelines.md) — Core syntax and rules ([Python](references/python-basics.md), [SQL](references/sql-basics.md)) -- [Streaming Tables](references/streaming-table.md) — Continuous data stream processing ([Python](references/streaming-table-python.md), [SQL](references/streaming-table-sql.md)) -- [Materialized Views](references/materialized-view.md) — Physically stored query results with incremental refresh ([Python](references/materialized-view-python.md), [SQL](references/materialized-view-sql.md)) -- [Views](references/view.md) — Reusable query logic published to Unity Catalog ([SQL](references/view-sql.md)) -- [Temporary Views](references/temporary-view.md) — Pipeline-private views ([Python](references/temporary-view-python.md), [SQL](references/temporary-view-sql.md)) -- [Auto Loader](references/auto-loader.md) — Incrementally ingest files from cloud storage ([Python](references/auto-loader-python.md), [SQL](references/auto-loader-sql.md)) -- [Kafka Ingestion](references/kafka.md) — Read from Kafka / Event Hubs with JSON parsing, Secrets-based auth -- [Auto CDC](references/auto-cdc.md) — Process Change Data Capture feeds, SCD Type 1 & 2 ([Python](references/auto-cdc-python.md), [SQL](references/auto-cdc-sql.md)) -- [SCD Type 2 Querying](references/scd-2-querying.md) — Current-state views, point-in-time queries, joining facts with historical dimensions -- [Streaming Patterns](references/streaming-patterns.md) — Deduplication, windowed aggregations (tumbling/multi-size/session), late-arriving data, rescue-data quarantine, monitoring lag, anomaly detection -- [Expectations](references/expectations.md) — Define and enforce data quality constraints ([Python](references/expectations-python.md), [SQL](references/expectations-sql.md)) -- [Sinks](references/sink.md) — Write to Kafka, Event Hubs, external Delta tables ([Python](references/sink-python.md)) -- [ForEachBatch 
Sinks](references/foreach-batch-sink.md) — Custom streaming sink with per-batch Python logic ([Python](references/foreach-batch-sink-python.md)) +Dataset, flow, CDC, expectation, Auto Loader, and sink references are listed per (feature, language) in the [API Reference tables above](#api-reference). diff --git a/skills/databricks-pipelines/references/workflows.md b/skills/databricks-pipelines/references/1-project-initialization-with-dab.md similarity index 50% rename from skills/databricks-pipelines/references/workflows.md rename to skills/databricks-pipelines/references/1-project-initialization-with-dab.md index 671157c..792ef52 100644 --- a/skills/databricks-pipelines/references/workflows.md +++ b/skills/databricks-pipelines/references/1-project-initialization-with-dab.md @@ -1,37 +1,17 @@ -# Pipeline Project Workflows +# Project Initialization with DAB -Three workflows for building Spark Declarative Pipelines, depending on what already exists in the project and how much DAB scaffolding the user wants. +Two DAB-based workflows for creating Spark Declarative Pipelines: -## Choose Your Workflow +- **Workflow A**: Standalone new project (the pipeline *is* the project). +- **Workflow B**: Adding a pipeline to an existing bundle (the pipeline is part of a larger app + jobs + dashboards). -| Situation | Workflow | -|-----------|----------| -| New, standalone pipeline project with its own bundle | **A. Standalone bundle** | -| Pipeline added to an existing DAB project | **B. Existing bundle** | -| Quick prototyping, no bundle (yet) | **C. Rapid CLI iteration** | - -If the user is unsure, default to A for production-bound work and C for exploration. - ---- - -## Language Selection (Python vs SQL) - -Decide before scaffolding — the choice picks the template files (`.py` vs `.sql`) and pulls in different reference docs. Both languages can coexist in the same project, but pick one primary. 
- -| User signal | Pick | -|-------------|------| -| "Python pipeline", "use Python", UDF, pandas, ML inference, pyspark | **Python** | -| "SQL pipeline", "SQL files", "use SQL" | **SQL** | -| "Create a simple pipeline", "create a table", "an aggregation" | **SQL** (simpler) | -| Complex parameterized logic, custom UDFs, ML, advanced processing | **Python** | - -If the request is ambiguous, ask. Stick with the chosen language unless the user explicitly switches. +For prototyping without a bundle, see [2-rapid-iteration-with-cli.md](2-rapid-iteration-with-cli.md). --- ## Workflow A: Standalone Bundle (`pipelines init`) -Use when the user wants a new project where the pipeline *is* the project. +Use when the user wants a new project where the pipeline *is* the project (no existing `databricks.yml`). ### Non-interactive (recommended for agents) @@ -65,6 +45,18 @@ databricks pipelines init --output-dir . Prompts for the same fields. +### Alternative: `databricks bundle init lakeflow-pipelines` + +The older template-based scaffolding also works: + +```bash +databricks bundle init lakeflow-pipelines \ + --config-file <(echo '{"project_name": "my_pipeline", "language": "python", "serverless": "yes"}') \ + --profile < /dev/null +``` + +Both produce DAB-shaped projects; `pipelines init` is the newer, more focused command. + ### Generated structure ``` @@ -91,18 +83,6 @@ project_root/ 3. Edit `resources/_etl.pipeline.yml` for pipeline-level settings (serverless on by default). 4. `databricks bundle validate` → `databricks bundle deploy [-t ]` → `databricks bundle run `. -### Alternative: `databricks bundle init lakeflow-pipelines` - -The older template-based scaffolding also works: - -```bash -databricks bundle init lakeflow-pipelines \ - --config-file <(echo '{"project_name": "my_pipeline", "language": "python", "serverless": "yes"}') \ - --profile < /dev/null -``` - -Both produce DAB-shaped projects; `pipelines init` is the newer, more focused command. 
- ### `databricks.yml` essentials ```yaml @@ -160,9 +140,36 @@ resources: - --editable ${workspace.file_path} ``` +### Scheduling Pipelines + +To schedule a pipeline, add a job that triggers it in `resources/.job.yml`: + +```yaml +resources: + jobs: + my_pipeline_job: + trigger: + periodic: + interval: 1 + unit: DAYS + tasks: + - task_key: refresh_pipeline + pipeline_task: + pipeline_id: ${resources.pipelines.my_pipeline.id} +``` + + ### Python project dependencies -Python projects ship a `pyproject.toml`. Runtime deps go in `[project].dependencies`; dev-only in `[project.optional-dependencies].dev`. The `--editable ${workspace.file_path}` line in the pipeline resource installs the package on serverless compute at deploy time. +Python projects ship a standard `pyproject.toml`. Runtime deps in `[project].dependencies`, dev-only in `[project.optional-dependencies].dev` (e.g. `databricks-connect>=15.4,<15.5`, `pytest`, `ruff`). The `--editable ${workspace.file_path}` line in the pipeline resource installs the package on serverless compute at deploy time. + +### Multi-environment workflow + +```bash +databricks bundle deploy # dev (default target) — resources prefixed [dev ] +databricks bundle deploy --target prod # prod — no prefix, schedules active +databricks bundle run customer_pipeline_etl [--target prod] +``` --- @@ -208,140 +215,45 @@ The pipeline picks up the bundle's existing targets / variables / permissions. --- -## Workflow C: Rapid CLI Iteration (no bundle) - -Use for prototyping when bundle scaffolding would slow the user down. Skip when the work is production-bound — workflow A or B is better long-term. - -### Step 1: Write files locally - -`.sql` or `.py` files in a folder. See [python-basics.md](python-basics.md) or [sql-basics.md](sql-basics.md) for syntax. - -### Step 2: Upload to workspace - -```bash -databricks workspace import-dir ./my_pipeline /Workspace/Users//my_pipeline -``` - -Re-upload with `--overwrite` after every code change. 
- -### Step 3: Create the pipeline - -```bash -databricks pipelines create --json '{ - "name": "my_pipeline", - "catalog": "my_catalog", - "schema": "my_schema", - "serverless": true, - "continuous": false, - "channel": "PREVIEW", - "libraries": [{"glob": {"include": "/Workspace/Users//my_pipeline/**"}}] -}' -``` - -`libraries` field notes: - -- `"glob"` — directory of files. Recommended. -- `"file"` — single `.sql` / `.py`. A `"file"` pointing at a folder fails with `Paths must end with .py or .sql`. -- `"notebook"` — **deprecated**, never use. - -Use enumerated `"file"` entries instead of `"glob"` only when explicit ordering matters. - -Capture the returned `pipeline_id`. +## Running a Pipeline (Workflow A / B) -### Step 4: Start an update and poll *that update* +**You must deploy before running.** In local development, code changes only take effect after `databricks bundle deploy`. Always deploy before any run, dry run, or selective refresh. -```bash -UPDATE_ID=$(databricks pipelines start-update | jq -r .update_id) -# Or with full refresh (destructive on streaming state — omit for incremental): -# UPDATE_ID=$(databricks pipelines start-update --full-refresh | jq -r .update_id) - -while :; do - STATE=$(databricks pipelines get-update "$UPDATE_ID" | jq -r '.update.state') - echo "$(date +%H:%M:%S) update=$UPDATE_ID state=$STATE" - case "$STATE" in COMPLETED|FAILED|CANCELED) break;; esac - sleep 30 -done -``` - -**Why poll the update, not the pipeline.** Top-level pipeline `state` flips back to `RUNNING` on `RETRY_ON_FAILURE`, so a loop watching the pipeline (or `latest_updates[0]`) can spin past a real `FAILED` update forever. Poll the captured `update_id` and stop on the first terminal state — including `FAILED`. - -**On `FAILED`**: read the events log, don't re-run. +### Development workflow ```bash -databricks pipelines list-pipeline-events \ - | jq '[.[] | select(.level=="ERROR") | {event_type, message: (.message // "")[0:300]}] | .[0:5]' -``` +# 1. 
Validate the bundle config +databricks bundle validate --profile -If the pipeline is already `RUNNING`, `start-update` queues the new update. Force-stop with `databricks pipelines stop ` first if needed. +# 2. Deploy to a target (dev is default) +databricks bundle deploy -t dev --profile -### Step 5: Edit → re-upload → restart +# 3. Trigger the pipeline +databricks bundle run -t dev --profile -```bash -# Re-upload (whole dir) -databricks workspace import-dir ./my_pipeline /Workspace/Users//my_pipeline --overwrite - -# Or a single file -databricks workspace import /Workspace/Users//my_pipeline/gold.sql \ - --file ./my_pipeline/gold.sql --format RAW --overwrite - -# Restart -databricks pipelines start-update +# 4. Check status (capture the update_id from step 3 and poll it — not top-level state) +databricks pipelines get --profile +databricks pipelines get-update --profile ``` -**Use `--format RAW`** for raw `.sql` / `.py` FILE entries. `--format SOURCE --language SQL|PYTHON` uploads a workspace *notebook* — and **notebooks are deprecated for pipelines**. Mixing the two on the same path fails with `Cannot overwrite the asset ... due to type mismatch (asked: NOTEBOOK, actual: FILE)`. +For the rationale on polling the update (not the pipeline) and the FAILED-extraction `jq` pattern, see [2-rapid-iteration-with-cli.md#step-4-start-an-update-and-poll-that-update](2-rapid-iteration-with-cli.md#step-4-start-an-update-and-poll-that-update). It applies to bundle runs too. -### Step 6: Validate output data +### Refresh modes -Even on `COMPLETED`, verify the data: +- **Selective refresh** is preferred when only one table needs to run. Dependencies must already be materialized. +- **Full refresh** is the most expensive option and **can lead to data loss** — it reprocesses streaming sources from scratch and destroys streaming state. Use only when necessary, and always surface it as a follow-up the user must explicitly approve. 
CLI: `databricks bundle run --full-refresh-all` or `--refresh ` for selective. -```bash -databricks experimental aitools tools discover-schema \ - my_catalog.my_schema.bronze_orders \ - my_catalog.my_schema.silver_orders \ - my_catalog.my_schema.gold_summary -``` +### Editing pipeline code -Returns columns/types, 5 sample rows, total row count, and null counts per column per table. - -Check for: empty tables (ingestion or filter problems), unexpected row counts (broken joins), missing columns (schema mismatch), nulls in key columns (data quality). - -### Python SDK alternative - -```python -from databricks.sdk import WorkspaceClient -import time - -w = WorkspaceClient() - -pipeline = w.pipelines.create( - name="my_pipeline", - catalog="my_catalog", - schema="my_schema", - serverless=True, - continuous=False, - libraries=[{"glob": {"include": "/Workspace/Users//my_pipeline/**"}}], - development=True, -) - -update = w.pipelines.start_update( - pipeline_id=pipeline.pipeline_id, - full_refresh=True, -) - -while True: - u = w.pipelines.get_update(pipeline_id=pipeline.pipeline_id, - update_id=update.update_id).update - if str(u.state) in ("COMPLETED", "FAILED", "CANCELED"): - print(f"Update {u.update_id}: {u.state}") - break - time.sleep(10) -``` +Edit `.sql` / `.py` files under `src/`, then re-run `databricks bundle deploy` + `databricks bundle run`. Bundle deploy uploads changed files as raw `FILE` entries. Don't mix `databricks workspace import --format SOURCE` into a bundle-managed pipeline — that creates a NOTEBOOK entry and subsequent bundle deploys fail with `type mismatch (asked: FILE, actual: NOTEBOOK)`. --- ## Migrating from a Manual Folder Structure -If the user already has `bronze/`, `silver/`, `gold/` folders without a bundle, migrate to workflow A by wrapping them in a `databricks.yml` and a pipeline resource pointing at the existing folders via a `glob`. No file moves required — the medallion folders work as-is under `transformations/**`. 
+If the user already has `bronze/`, `silver/`, `gold/` folders without a bundle, migrate to Workflow A by wrapping them in a `databricks.yml` and a pipeline resource pointing at the existing folders via a `glob`. No file moves required — the medallion folders work as-is under `transformations/**`. + +For detailed pipeline configuration options (development mode, continuous, custom event log, notifications, Python deps, classic clusters), see [pipeline-configuration.md](pipeline-configuration.md). --- @@ -355,3 +267,4 @@ If the user already has `bronze/`, `silver/`, `gold/` folders without a bundle, | Files deploy but pipeline doesn't pick them up | Glob pattern in `libraries` doesn't match — re-check `include` path relative to the resource file | | `Bundle validation failed: Invalid schema` | `databricks bundle validate`, check YAML indentation (spaces, not tabs) | | Files deploy but pipeline config stale | `databricks bundle deploy --force` | +| `Authentication error` on deploy | `databricks configure --host https://.cloud.databricks.com` or set `DATABRICKS_HOST` / `DATABRICKS_TOKEN` | diff --git a/skills/databricks-pipelines/references/2-rapid-iteration-with-cli.md b/skills/databricks-pipelines/references/2-rapid-iteration-with-cli.md new file mode 100644 index 0000000..c9053f6 --- /dev/null +++ b/skills/databricks-pipelines/references/2-rapid-iteration-with-cli.md @@ -0,0 +1,157 @@ +# Rapid Iteration with CLI (no DAB) + +Use the `databricks pipelines` CLI to create, run, and iterate on a pipeline **without managing a bundle**. Fastest path for prototyping. Production-bound work belongs in a bundle — see [1-project-initialization-with-dab.md](1-project-initialization-with-dab.md). + +**Default to serverless.** Only use classic clusters if the user explicitly requires R, Spark RDD APIs, or JAR libraries. + +--- + +## Step 1: Write pipeline files locally + +`.sql` or `.py` files in a folder. 
See [python-basics.md](python-basics.md) or [sql-basics.md](sql-basics.md) for syntax. + +## Step 2: Upload to the workspace + +```bash +databricks workspace import-dir ./my_pipeline /Workspace/Users//my_pipeline +``` + +Re-upload with `--overwrite` after every code change. + +## Step 3: Create the pipeline + +```bash +databricks pipelines create --json '{ + "name": "my_pipeline", + "catalog": "my_catalog", + "schema": "my_schema", + "serverless": true, + "continuous": false, + "development": true, + "channel": "PREVIEW", + "configuration": { + "pipelines.numUpdateRetryAttempts": "0", + "pipelines.maxFlowRetryAttempts": "0" + }, + "libraries": [{"glob": {"include": "/Workspace/Users//my_pipeline/**"}}] +}' +``` + +These flags are the canonical dev/iteration defaults — fail fast. **Tuned for demo / iteration.** For production pipelines, drop `"development"` and the two `pipelines.*RetryAttempts` overrides so the platform's retry defaults (5 / 2) can absorb transient infra failures. Per-field rationale in [pipeline-configuration.md#canonical-create-dev--iteration-defaults](pipeline-configuration.md#canonical-create-dev--iteration-defaults). + +`libraries`: use `"glob"` for a directory (recommended for medallion folders), `"file"` for a single `.sql`/`.py` (folder paths fail with `Paths must end with .py or .sql`), or enumerated `"file"` entries when ordering matters. `"notebook"` is deprecated — never use. + +```json +"libraries": [ + {"file": {"path": "/Workspace/.../bronze/ingest_orders.sql"}}, + {"file": {"path": "/Workspace/.../silver/clean_orders.sql"}} +] +``` + +Capture the returned `pipeline_id`. 
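The dev-versus-production split in Step 3 can be sketched as a small payload builder. This is an illustration under stated assumptions, not part of the skill: `build_create_payload`, its `production` flag, and the example workspace path are hypothetical. The field names are copied from the create JSON in Step 3.

```python
import json


def build_create_payload(name, catalog, schema, workspace_root, production=False):
    """Return a dict shaped like the Step 3 create JSON (hypothetical helper)."""
    payload = {
        "name": name,
        "catalog": catalog,
        "schema": schema,
        "serverless": True,
        "continuous": False,
        "channel": "PREVIEW",
        # A glob picks up every .sql / .py file under the uploaded folder.
        "libraries": [{"glob": {"include": f"{workspace_root}/**"}}],
    }
    if not production:
        # Iteration defaults: development mode plus zero retries, so a broken
        # update fails fast instead of retrying with the same root cause.
        payload["development"] = True
        payload["configuration"] = {
            "pipelines.numUpdateRetryAttempts": "0",
            "pipelines.maxFlowRetryAttempts": "0",
        }
    return payload


# The workspace path below is a made-up placeholder for illustration only.
example_root = "/Workspace/Users/someone@example.com/my_pipeline"
dev_payload = build_create_payload("my_pipeline", "my_catalog", "my_schema", example_root)
prod_payload = build_create_payload(
    "my_pipeline", "my_catalog", "my_schema", example_root, production=True
)
print(json.dumps(dev_payload, indent=2))
```

The printed JSON is the kind of string that would be passed to `databricks pipelines create --json`. Leaving out the overrides for production lets the platform's own retry defaults apply.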
+ +## Step 4: Start an update and poll *that update* + +```bash +UPDATE_ID=$(databricks pipelines start-update | jq -r .update_id) +# Or with full refresh (destructive on streaming state — omit for incremental): +# UPDATE_ID=$(databricks pipelines start-update --full-refresh | jq -r .update_id) + +while :; do + STATE=$(databricks pipelines get-update "$UPDATE_ID" | jq -r '.update.state') + echo "$(date +%H:%M:%S) update=$UPDATE_ID state=$STATE" + case "$STATE" in COMPLETED|FAILED|CANCELED) break;; esac + sleep 30 +done +``` + +**Why poll the update, not the pipeline.** Top-level pipeline `state` flips back to `RUNNING` on `RETRY_ON_FAILURE`, so a loop watching the pipeline (or `latest_updates[0]`) can spin past a real `FAILED` update forever. Poll the captured `update_id` and stop on the first terminal state — including `FAILED`. + +**On `FAILED`**: read the events log, don't re-run. **The real error is in `error.exceptions[0].message`, not in the top-level `.message`** — that one just says "Update X is FAILED", which is useless. Extract both: + +```bash +databricks pipelines list-pipeline-events \ + | jq '[.[] | select(.level=="ERROR") | { + event_type, + summary: (.message // "")[0:200], + exception: ((.error.exceptions[0].message // "no exception body") | .[0:800]) + }] | .[0:5]' +``` + +If you only see "Update X is FAILED" in your output, you're not extracting `error.exceptions[0].message` — fix the jq and re-run. + +If the pipeline is already `RUNNING`, `start-update` queues the new update. Force-stop with `databricks pipelines stop ` first if needed. 
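For scripted workflows, the same extraction as the `jq` filter above can be sketched in Python. This is an illustrative equivalent, not something the skill ships: `summarize_errors` is a hypothetical helper, and it assumes the events have already been parsed from the CLI's JSON output into a list of dicts (the same top-level array the `jq` filter iterates over). The sample events are invented for the demonstration.

```python
def summarize_errors(events, limit=5):
    """Keep ERROR events and surface the first exception message for each."""
    rows = []
    for event in events:
        if event.get("level") != "ERROR":
            continue
        # The useful detail lives in error.exceptions[0].message, which may be absent.
        exceptions = (event.get("error") or {}).get("exceptions") or []
        exception = exceptions[0].get("message") if exceptions else None
        rows.append({
            "event_type": event.get("event_type"),
            "summary": (event.get("message") or "")[:200],
            "exception": (exception or "no exception body")[:800],
        })
    return rows[:limit]


# Invented sample events, shaped like the fields the jq filter reads.
sample_events = [
    {"level": "INFO", "event_type": "update_progress", "message": "Update abc is RUNNING."},
    {"level": "ERROR", "event_type": "update_progress", "message": "Update abc is FAILED."},
    {
        "level": "ERROR",
        "event_type": "flow_progress",
        "message": "Flow 'silver_orders' has FAILED.",
        "error": {"exceptions": [{"message": "Column 'order_ts' cannot be resolved"}]},
    },
]

for row in summarize_errors(sample_events):
    print(row)
```

An entry whose `exception` reads "no exception body" is the uninformative top-level failure event. The entry that carries a real exception message is the one to act on.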
+ +## Step 5: Edit → re-upload → restart + +```bash +# Re-upload (whole dir) +databricks workspace import-dir ./my_pipeline /Workspace/Users//my_pipeline --overwrite + +# Or a single file +databricks workspace import /Workspace/Users//my_pipeline/gold.sql \ + --file ./my_pipeline/gold.sql --format RAW --overwrite + +# Restart +databricks pipelines start-update +``` + +**Use `--format RAW`** for raw `.sql` / `.py` FILE entries. `--format SOURCE --language SQL|PYTHON` uploads a workspace *notebook* — and **notebooks are deprecated for pipelines**. Mixing the two on the same path fails with `Cannot overwrite the asset ... due to type mismatch (asked: NOTEBOOK, actual: FILE)`. + +## Step 6: Validate output data + +Even on `COMPLETED`, verify the data: + +```bash +databricks experimental aitools tools discover-schema \ + my_catalog.my_schema.bronze_orders \ + my_catalog.my_schema.silver_orders \ + my_catalog.my_schema.gold_summary +``` + +Returns columns/types, 5 sample rows, total row count, and null counts per column per table. + +Check for: empty tables (ingestion or filter problems), unexpected row counts (broken joins), missing columns (schema mismatch), nulls in key columns (data quality). + +**If validation reveals problems**, trace upstream: run `discover-schema` on the source table of the problematic dataset, then *its* source, until you hit the layer where the issue originates. Bronze empty = source path wrong or files missing; silver empty = filter too aggressive or join condition mismatched; gold wrong counts = aggregation/grouping bug or duplicate keys in source. + +--- + +## Quick reference: CLI commands + +### Pipeline lifecycle + +| Command | Description | +|---------|-------------| +| `databricks pipelines create --json '{...}'` | Create a new pipeline. | +| `databricks pipelines get ` | Pipeline details and current status. | +| `databricks pipelines update --json '{...}'` | Update pipeline config. 
| +| `databricks pipelines delete ` | Delete the pipeline. | +| `databricks pipelines list-pipelines` | List all pipelines. | + +### Run management + +| Command | Description | +|---------|-------------| +| `databricks pipelines start-update ` | Start a triggered update. | +| `databricks pipelines start-update --full-refresh` | Start with full refresh (destructive on streaming state). | +| `databricks pipelines stop ` | Stop a running pipeline. | +| `databricks pipelines list-pipeline-events ` | Event log (errors live here). | +| `databricks pipelines list-updates ` | Recent runs. | +| `databricks pipelines get-update ` | Status of a specific update (use this for polling). | + +### Supporting commands + +| Command | Description | +|---------|-------------| +| `databricks workspace import-dir` | Upload files/folders to the workspace. | +| `databricks workspace import` | Upload a single file. | +| `databricks workspace list` | List workspace files. | +| `databricks experimental aitools tools discover-schema` | Schema + row counts + sample data + null counts. | +| `databricks experimental aitools tools query` | Run ad-hoc SQL. | + +--- + +## Python SDK alternative + +Same JSON shape via `databricks.sdk.WorkspaceClient`: `w.pipelines.create(name=..., catalog=..., schema=..., serverless=True, continuous=False, development=True, channel="PREVIEW", configuration={...}, libraries=[...])`. Capture `pipeline.pipeline_id`. Trigger with `w.pipelines.start_update(pipeline_id=..., full_refresh=...)` and poll `w.pipelines.get_update(pipeline_id=..., update_id=update.update_id).update.state` until it hits `COMPLETED`/`FAILED`/`CANCELED`. Prefer the CLI for interactive setup; the SDK is for programmatic / scripted workflows. 
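The polling step described above can be sketched independently of the SDK. This is a minimal illustration under stated assumptions: `wait_for_update` and its `get_state` / `sleep` parameters are hypothetical names, and the fake state sequence stands in for a real update so the block runs without a workspace.

```python
import time

# Terminal states named in the polling guidance above.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_update(get_state, interval_seconds=30, sleep=time.sleep):
    """Poll a single update until it reaches a terminal state; return that state."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            # Stop on the first terminal state, including FAILED.
            return state
        sleep(interval_seconds)


# Stand-in for a real update: replays a fixed, invented sequence of states.
_fake_states = iter(["WAITING_FOR_RESOURCES", "RUNNING", "FAILED"])
final_state = wait_for_update(lambda: next(_fake_states), sleep=lambda _seconds: None)
print(final_state)
```

With the SDK, `get_state` would wrap `w.pipelines.get_update(pipeline_id=..., update_id=...).update.state` for the captured `update_id`. The SDK reports state as an enum-like value, so it probably needs normalizing to a plain string before the comparison; treat that as an assumption to verify against the installed SDK version.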
diff --git a/skills/databricks-pipelines/references/auto-cdc-python.md b/skills/databricks-pipelines/references/auto-cdc-python.md index 0b30181..df3c05c 100644 --- a/skills/databricks-pipelines/references/auto-cdc-python.md +++ b/skills/databricks-pipelines/references/auto-cdc-python.md @@ -1,214 +1,131 @@ -Auto CDC in Spark Declarative Pipelines processes change data capture (CDC) events from streaming sources or snapshots. +# Auto CDC (Python) -**API Reference:** +CDC from streaming events (`dp.create_auto_cdc_flow`) or periodic snapshots (`dp.create_auto_cdc_from_snapshot_flow`). Both write into a pre-created streaming table. -**dp.create_auto_cdc_flow() / dp.apply_changes() / dlt.create_auto_cdc_flow() / dlt.apply_changes()** -Applies CDC operations (inserts, updates, deletes) from a streaming source to a target table. Supports SCD Type 1 (latest) and Type 2 (history). Does NOT return a value - call at top level without assignment. +Use streaming when CDC events arrive continuously (transaction logs, Kafka, Delta change feeds). Use snapshot when the source is a full dump compared to the previous state (daily extracts, batch exports). + +Legacy aliases `dp.apply_changes()` / `dp.apply_changes_from_snapshot()` still parse but should be migrated (see [SKILL.md Legacy DLT Syntax](../SKILL.md#legacy-dlt-syntax--always-migrate)). + +For querying SCD Type 2 history tables, see [scd-2-querying.md](scd-2-querying.md). + +## `dp.create_auto_cdc_flow(...)` + +Call at top level — does NOT return a value. 
```python dp.create_auto_cdc_flow( - target="", - source="", - keys=["key1", "key2"], - sequence_by="", - ignore_null_updates=False, - apply_as_deletes=None, - apply_as_truncates=None, - column_list=None, - except_column_list=None, - stored_as_scd_type=1, - track_history_column_list=None, - track_history_except_column_list=None, - name=None, - once=False + target="", # required — pre-created via dp.create_streaming_table() + source="", # required — string name (table or @dp.temporary_view) + keys=["key1", "key2"], # required — primary key columns + sequence_by="", # required — string col name, or col("ts"), or struct("ts","id") + stored_as_scd_type=1, # 1 (default) = latest values; 2 = history with __START_AT/__END_AT + ignore_null_updates=False, # NULL values won't overwrite non-NULL existing + apply_as_deletes=None, # expr("op = 'D'") or "op = 'D'" + apply_as_truncates=None, # SCD Type 1 only + column_list=None, # include list — mutually exclusive with except_column_list + except_column_list=None, # exclude list + track_history_column_list=None, # SCD Type 2: cols that trigger new history rows + track_history_except_column_list=None, # SCD Type 2: cols that DON'T trigger new history rows + name=None, # flow name (multiple flows to one target) + once=False, ) ``` -Parameters: - -- `target` (str): Target table name (must exist, create with `dp.create_streaming_table()`). **Required.** -- `source` (str): Source table name with CDC events. **Required.** -- `keys` (list): Primary key columns for row identification. **Required.** -- `sequence_by` (str | Column): Column for ordering events (timestamp, version). **Required.** Accepts a string column name or a `Column` expression. For multi-column sequencing, use `struct("col1", "col2")` to order by multiple columns. -- `ignore_null_updates` (bool): If True, NULL values won't overwrite existing non-NULL values -- `apply_as_deletes` (str or Column): Expression identifying delete operations. 
Use `expr("op = 'D'")` (Column) or `"op = 'D'"` (string). -- `apply_as_truncates` (str or Column): Expression identifying truncate operations. Use `expr("op = 'TRUNCATE'")` (Column) or `"op = 'TRUNCATE'"` (string). -- `column_list` (list): Columns to include (mutually exclusive with `except_column_list`) -- `except_column_list` (list): Columns to exclude -- `stored_as_scd_type` (int): `1` for latest values (default), `2` for full history with `__START_AT`/`__END_AT` columns -- `track_history_column_list` (list): For SCD Type 2, columns to track history for (others use Type 1) -- `track_history_except_column_list` (list): For SCD Type 2, columns to exclude from history tracking -- `name` (str): Flow name (for multiple flows to same target) -- `once` (bool): Process once and stop (default: False) - -**dp.create_auto_cdc_from_snapshot_flow() / dp.apply_changes_from_snapshot() / dlt.create_auto_cdc_from_snapshot_flow() / dlt.apply_changes_from_snapshot()** -Applies CDC from full snapshots by comparing to previous state. Automatically infers inserts, updates, deletes. +`source` must be a table/view identifier (string) — NOT a DataFrame. To pre-filter, define a `@dp.temporary_view()` and reference its name. Don't materialize a streaming table just for filtering — temp view is preferred. + +## `dp.create_auto_cdc_from_snapshot_flow(...)` ```python dp.create_auto_cdc_from_snapshot_flow( - target="", - source=, - keys=["key1", "key2"], - stored_as_scd_type=1, - track_history_column_list=None, - track_history_except_column_list=None + target="", + source="", # OR callable (see below) + keys=["product_id"], + stored_as_scd_type=1, + track_history_column_list=None, + track_history_except_column_list=None, ) ``` -Parameters: +`source` accepts a string (most common — name of a table holding the latest snapshot) or a callable for historical snapshot replay: -- `target` (str): Target table name (must exist). 
**Required.** -- `source` (str or callable): **Required.** Can be one of: - - **String**: Source table name containing the full snapshot (most common) - - **Callable**: Function for processing historical snapshots with type `SnapshotAndVersionFunction = Callable[[SnapshotVersion], SnapshotAndVersion]` - - `SnapshotVersion = Union[int, str, float, bytes, datetime.datetime, datetime.date, decimal.Decimal]` - - `SnapshotAndVersion = Optional[Tuple[DataFrame, SnapshotVersion]]` - - Function receives the latest processed snapshot version (or None for first run) - - Must return `None` when no more snapshots to process - - Must return tuple of `(DataFrame, SnapshotVersion)` for next snapshot to process - - Snapshot version is used to track progress and must be comparable/orderable -- `keys` (list): Primary key columns. **Required.** -- `stored_as_scd_type` (int): `1` for latest (default), `2` for history -- `track_history_column_list` (list): Columns to track history for (SCD Type 2) -- `track_history_except_column_list` (list): Columns to exclude from history tracking +```python +def next_snapshot_and_version(latest_version: Optional[int]) -> Optional[Tuple[DataFrame, int]]: + # Receives the last processed snapshot version (None on first run). + # Return (DataFrame, version) for the next snapshot, or None when caught up. + if latest_version is None: + return (spark.read.load("products_v1.csv"), 1) + return None +``` -**Use create_auto_cdc_flow when:** Processing streaming CDC events from transaction logs, Kafka, Delta change feeds -**Use create_auto_cdc_from_snapshot_flow when:** Processing periodic full snapshots (daily dumps, batch extracts) +Version must be a comparable scalar (`int`, `str`, `float`, `bytes`, `datetime`, `date`, `Decimal`). 
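The callable contract above extends naturally to replaying several historical snapshots in order. The following sketch is hypothetical: `make_snapshot_source` and its `load` parameter are illustrative names, and the loader returns plain strings here so the block runs without Spark. In a pipeline, `load` would return a DataFrame for the given version.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def make_snapshot_source(versions: Sequence[int], load: Callable[[int], Any]):
    """Build a callable that hands out snapshots in ascending version order."""
    ordered = sorted(versions)

    def next_snapshot_and_version(latest_version: Optional[int]) -> Optional[Tuple[Any, int]]:
        # latest_version is None on the first run; otherwise pick the first
        # version strictly greater than the one already processed.
        for version in ordered:
            if latest_version is None or version > latest_version:
                return (load(version), version)
        return None  # caught up: no more snapshots to process

    return next_snapshot_and_version


# Stand-in loader for illustration; a real one would read a snapshot DataFrame.
source = make_snapshot_source([3, 1, 2], load=lambda version: f"snapshot-{version}")
print(source(None))
print(source(1))
print(source(3))
```

The returned function is what would be passed as `source=` to `dp.create_auto_cdc_from_snapshot_flow`. Sorting up front relies on the versions being comparable, which is the requirement stated above.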
-**Common Patterns:** +## Patterns -**Pattern 1: Basic CDC flow from streaming source** +### Basic (SCD Type 1) ```python -# Step 1: Create target table dp.create_streaming_table(name="users") - -# Step 2: Define CDC flow (source must be a table name) -dp.create_auto_cdc_flow( - target="users", - source="user_changes", - keys=["user_id"], - sequence_by="updated_at" -) +dp.create_auto_cdc_flow(target="users", source="user_changes", + keys=["user_id"], sequence_by="updated_at") ``` -**Pattern 2: CDC flow with upstream transformation** +### With pre-filtering via temp view ```python -# Step 1: Define view with transformation (source preprocessing) @dp.temporary_view() def filtered_user_changes(): - return ( - spark.readStream.table("raw_user_changes") - .filter("user_id IS NOT NULL") - ) + return spark.readStream.table("raw_user_changes").filter("user_id IS NOT NULL") -# Step 2: Create target table dp.create_streaming_table(name="users") - -# Step 3: Define CDC flow using the view as source -dp.create_auto_cdc_flow( - target="users", - source="filtered_user_changes", # References the view name - keys=["user_id"], - sequence_by="updated_at" -) -# Note: Use distinct names for view and target for clarity -# Note: If "raw_user_changes" is defined in the pipeline and no additional transformations or expectations are needed, -# source="raw_user_changes" can be used directly +dp.create_auto_cdc_flow(target="users", source="filtered_user_changes", + keys=["user_id"], sequence_by="updated_at") ``` -**Pattern 3: CDC with explicit deletes and truncates** +### Explicit deletes + truncates + ignore-null ```python from pyspark.sql.functions import expr -dp.create_streaming_table(name="orders") - dp.create_auto_cdc_flow( - target="orders", - source="order_events", - keys=["order_id"], + target="orders", source="order_events", keys=["order_id"], sequence_by="event_timestamp", apply_as_deletes=expr("operation = 'DELETE'"), - apply_as_truncates=expr("operation = 'TRUNCATE'"), - 
ignore_null_updates=True + apply_as_truncates=expr("operation = 'TRUNCATE'"), # SCD Type 1 only + ignore_null_updates=True, ) ``` -**Pattern 4: SCD Type 2 (Historical tracking)** +### SCD Type 2 with selective history tracking ```python -dp.create_streaming_table(name="customer_history") - dp.create_auto_cdc_flow( - target="customer_history", - source="source.customer_changes", - keys=["customer_id"], - sequence_by="changed_at", - stored_as_scd_type=2 # Track full history + target="accounts", source="account_changes", keys=["account_id"], + sequence_by="modified_at", + stored_as_scd_type=2, + track_history_column_list=["balance", "status"], # only these trigger new history rows ) -# Target will include __START_AT and __END_AT columns ``` -**Pattern 5: Snapshot-based CDC (Simple - table source)** +Use `track_history_except_column_list=[...]` for the inverse. -```python -dp.create_streaming_table(name="products") +### Snapshot-based (table source) +```python @dp.materialized_view(name="product_snapshot") def product_snapshot(): return spark.read.table("source.daily_product_dump") -dp.create_auto_cdc_from_snapshot_flow( - target="products", - source="product_snapshot", # String table name - most common - keys=["product_id"], - stored_as_scd_type=1 -) -``` - -**Pattern 6: Snapshot-based CDC (Advanced - callable for historical snapshots)** - -```python dp.create_streaming_table(name="products") - -# Define a callable to process historical snapshots sequentially -def next_snapshot_and_version(latest_snapshot_version: Optional[int]) -> Tuple[DataFrame, Optional[int]]: - if latest_snapshot_version is None: - return (spark.read.load("products.csv"), 1) - else: - return None - dp.create_auto_cdc_from_snapshot_flow( - target="products", - source=next_snapshot_and_version, # Callable function for historical processing - keys=["product_id"], - stored_as_scd_type=1 -) -``` - -**Pattern 7: Selective column tracking** - -```python -dp.create_streaming_table(name="accounts") - 
-dp.create_auto_cdc_flow( - target="accounts", - source="account_changes", - keys=["account_id"], - sequence_by="modified_at", - stored_as_scd_type=2, - track_history_column_list=["balance", "status"], # Only track history for these columns - ignore_null_updates=True + target="products", source="product_snapshot", + keys=["product_id"], stored_as_scd_type=1, ) ``` -**KEY RULES:** +## Key rules -- Create target with `dp.create_streaming_table()` before defining CDC flow -- `dp.create_auto_cdc_flow()` does NOT return a value - call it at top level without assigning to a variable -- `source` must be a table name (string) - use `@dp.temporary_view()` to preprocess/filter/transform data before CDC processing. A temporary view is the **preferred** approach for source preprocessing (not a streaming table) -- SCD Type 2 adds `__START_AT` and `__END_AT` columns for validity tracking -- When specifying the schema of the target table for SCD Type 2, you must also include the `__START_AT` and `__END_AT` columns with the same data type as the `sequence_by` field -- Legacy names (`apply_changes`, `apply_changes_from_snapshot`) are equivalent but deprecated - prefer `create_auto_cdc_*` variants +- Create the target with `dp.create_streaming_table()` first. +- `dp.create_auto_cdc_flow()` does NOT return a value — call at top level. +- `source` is a string table/view name, never a DataFrame. Pre-process via `@dp.temporary_view()`. +- SCD Type 2 adds `__START_AT` / `__END_AT` columns with the same type as `sequence_by`. If you supply an explicit target schema, include them. +- `sequence_by` accepts string column name OR `col("ts")` — both work. Use `struct("ts", "id")` for multi-column ordering. 
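The ordering rule behind `sequence_by` (and `struct("ts", "id")` for tie-breaking) can be pictured with a small pure-Python sketch. It assumes SCD Type 1 semantics only, ignores deletes and truncates, and uses a hypothetical helper `latest_per_key`; it is an illustration of the ordering idea, not how the pipeline engine is implemented.

```python
from typing import Dict, Hashable, Iterable, Sequence


def latest_per_key(
    events: Iterable[dict], key: str, sequence_by: Sequence[str]
) -> Dict[Hashable, dict]:
    """Keep, for each key, the event with the greatest sequence tuple."""
    latest: Dict[Hashable, dict] = {}
    for event in events:
        # Tuples compare field by field: first column first, later columns break ties.
        seq = tuple(event[col] for col in sequence_by)
        current = latest.get(event[key])
        if current is None or seq > tuple(current[col] for col in sequence_by):
            latest[event[key]] = event
    return latest


events = [
    {"user_id": 1, "ts": 20, "id": 7, "name": "new"},
    {"user_id": 1, "ts": 10, "id": 3, "name": "old"},       # arrives late, older ts
    {"user_id": 1, "ts": 20, "id": 9, "name": "tiebreak"},  # same ts, higher id
]
result = latest_per_key(events, key="user_id", sequence_by=("ts", "id"))
```

Arrival order does not decide the outcome here; the sequence tuple does, which is the reason a tiebreaker column is worth adding when timestamps can collide.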
diff --git a/skills/databricks-pipelines/references/auto-cdc-sql.md b/skills/databricks-pipelines/references/auto-cdc-sql.md index 851aa69..95ae01b 100644 --- a/skills/databricks-pipelines/references/auto-cdc-sql.md +++ b/skills/databricks-pipelines/references/auto-cdc-sql.md @@ -1,76 +1,69 @@ -Auto CDC in Declarative Pipelines processes change data capture (CDC) events from streaming sources. +# Auto CDC (SQL) -**API Reference:** +`AUTO CDC INTO` processes CDC events from a streaming source into a target streaming table. SCD Type 1 (latest) or Type 2 (history). The target must be pre-created. -**CREATE FLOW ... AS AUTO CDC INTO** -Applies CDC operations (inserts, updates, deletes) from a streaming source to a target table. Supports SCD Type 1 (latest) and Type 2 (history). Must be used with a pre-created streaming table. +> SQL only supports CDC from streaming sources (`AUTO CDC INTO`). For periodic-snapshot CDC, use Python's `dp.create_auto_cdc_from_snapshot_flow()` — see [auto-cdc-python.md](auto-cdc-python.md). + +## Syntax ```sql CREATE OR REFRESH STREAMING TABLE ; CREATE FLOW AS AUTO CDC INTO -FROM -KEYS (, ) +FROM STREAM() +KEYS (, , ...) [IGNORE NULL UPDATES] [APPLY AS DELETE WHEN ] -[APPLY AS TRUNCATE WHEN ] -SEQUENCE BY -[COLUMNS { | * EXCEPT ()}] -[STORED AS {SCD TYPE 1 | SCD TYPE 2}] -[TRACK HISTORY ON { | * EXCEPT ()}] +[APPLY AS TRUNCATE WHEN ] -- SCD Type 1 only +SEQUENCE BY +[COLUMNS { | * EXCEPT ()}] +[STORED AS {SCD TYPE 1 | SCD TYPE 2}] -- default Type 1 +[TRACK HISTORY ON { | * EXCEPT ()}] -- SCD Type 2 only ``` -Parameters: +Clause notes: + +- `FROM STREAM(...)` accepts only a table/view identifier — **NOT a subquery**. Pre-filter via a temp view if needed. +- `KEYS` — required primary key columns for row identification. +- `IGNORE NULL UPDATES` — NULL values won't overwrite existing non-NULL values. +- `APPLY AS DELETE WHEN` / `APPLY AS TRUNCATE WHEN` — order matters in the SQL: put both **before** `SEQUENCE BY` or the parser fails. 
+- `SEQUENCE BY` — single column, or `STRUCT(ts_col, tiebreaker_col)` for multi-column ordering. +- `COLUMNS * EXCEPT (...)` — only list columns that exist in the source (omit `_rescued_data` unless bronze rescued data). +- `STORED AS SCD TYPE 2` adds `__START_AT` and `__END_AT` system columns to the target. If you supply an explicit target schema, include them with the same type as `SEQUENCE BY`. +- `TRACK HISTORY ON cols` — Type 2 only; only listed columns trigger new history rows. Others get in-place Type-1 updates. -- `target_table` (identifier): Target table name (must exist, create with `CREATE OR REFRESH STREAMING TABLE`). **Required.** -- `flow_name` (identifier): Identifier for the created flow. **Required.** -- `source` (identifier or expression): Streaming source with CDC events. Use `STREAM()` to read with streaming semantics. **Required.** -- `KEYS` (column list): Primary key columns for row identification. **Required.** -- `IGNORE NULL UPDATES` (optional): If specified, NULL values won't overwrite existing non-NULL values -- `APPLY AS DELETE WHEN` (optional): Condition identifying delete operations (e.g., `operation = 'DELETE'`) -- `APPLY AS TRUNCATE WHEN` (optional): Condition identifying truncate operations (supported only for SCD Type 1) -- `SEQUENCE BY` (column or struct): Column for ordering events (timestamp, version). **Required.** For multi-column sequencing, use `SEQUENCE BY STRUCT(timestamp_col, id_col)` to order by the first field first, then break ties with subsequent fields. -- `COLUMNS` (optional): Columns to include or exclude (use `column1, column2` or `* EXCEPT (column1, column2)`) -- `STORED AS` (optional): `SCD TYPE 1` for latest values (default), `SCD TYPE 2` for full history with `__START_AT`/`__END_AT` columns -- `TRACK HISTORY ON` (optional): For SCD Type 2, columns to track history for (others use Type 1) +For querying Type 2 history tables, see [scd-2-querying.md](scd-2-querying.md). 
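The `STORED AS SCD TYPE 2` and `TRACK HISTORY ON` notes above can be pictured with a small pure-Python sketch. It is a simplified model under stated assumptions: the helper `scd_type2_history` is hypothetical, events carry full rows, deletes are ignored, and an open row is marked with `__END_AT = None`. It is meant to show the shape of the target table, not to reproduce the engine.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Sequence


def scd_type2_history(
    events: Iterable[dict],
    key: str,
    sequence_by: str,
    track_history_on: Optional[Sequence[str]] = None,
) -> List[dict]:
    """Build history rows with __START_AT / __END_AT from ordered change events."""
    history: List[dict] = []
    open_rows: Dict[Hashable, dict] = {}  # key value -> currently open history row
    for event in sorted(events, key=lambda e: e[sequence_by]):
        data = dict(event)
        tracked = (
            list(track_history_on)
            if track_history_on is not None
            else [c for c in data if c not in (key, sequence_by)]
        )
        current = open_rows.get(event[key])
        if current is not None and all(current[c] == data[c] for c in tracked):
            # No tracked column changed: update the open row in place (Type 1 style).
            current.update(data)
            continue
        if current is not None:
            # Close the previous version at the sequence value of the new change.
            current["__END_AT"] = event[sequence_by]
        row = {**data, "__START_AT": event[sequence_by], "__END_AT": None}
        history.append(row)
        open_rows[event[key]] = row
    return history


changes = [
    {"account_id": 1, "ts": 1, "balance": 100, "last_login": "a"},
    {"account_id": 1, "ts": 2, "balance": 100, "last_login": "b"},  # untracked change
    {"account_id": 1, "ts": 3, "balance": 150, "last_login": "b"},  # tracked change
]
history = scd_type2_history(
    changes, key="account_id", sequence_by="ts", track_history_on=["balance"]
)
```

Because `__START_AT` and `__END_AT` are filled from the sequence value, they naturally share its type, which matches the schema note above.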
-**Common Patterns:** +## Patterns -**Pattern 1: Basic CDC flow from streaming source** +### Basic (SCD Type 1, default) ```sql --- Step 1: Create target table CREATE OR REFRESH STREAMING TABLE users; --- Step 2: Define CDC flow using STREAM() for streaming semantics CREATE FLOW user_flow AS AUTO CDC INTO users FROM STREAM(user_changes) KEYS (user_id) SEQUENCE BY updated_at; ``` -**Pattern 2: CDC with source filtering via temporary view** +### Pre-filter via temporary view (when the source needs transformation) ```sql --- Step 1: Create temporary view to filter/transform source data CREATE OR REFRESH TEMPORARY VIEW filtered_changes AS SELECT * FROM source_table WHERE status = 'active'; --- Step 2: Create target table CREATE OR REFRESH STREAMING TABLE active_records; --- Step 3: Define CDC flow reading from the temporary view CREATE FLOW active_flow AS AUTO CDC INTO active_records FROM STREAM(filtered_changes) KEYS (record_id) SEQUENCE BY updated_at; ``` -**Pattern 3: CDC with explicit deletes** +### Explicit deletes + ignore NULL updates ```sql -CREATE OR REFRESH STREAMING TABLE orders; - CREATE FLOW order_flow AS AUTO CDC INTO orders FROM STREAM(order_events) KEYS (order_id) @@ -79,104 +72,32 @@ APPLY AS DELETE WHEN operation = 'DELETE' SEQUENCE BY event_timestamp; ``` -**Pattern 4: SCD Type 2 (Historical tracking)** +### SCD Type 2 (full history) ```sql -CREATE OR REFRESH STREAMING TABLE customer_history; - CREATE FLOW customer_flow AS AUTO CDC INTO customer_history FROM STREAM(customer_changes) KEYS (customer_id) SEQUENCE BY changed_at STORED AS SCD TYPE 2; --- Target will include __START_AT and __END_AT columns -``` - -**Pattern 5: Multi-column sequencing** - -```sql -CREATE OR REFRESH STREAMING TABLE events; - -CREATE FLOW event_flow AS AUTO CDC INTO events -FROM STREAM(event_changes) -KEYS (event_id) -SEQUENCE BY STRUCT(event_timestamp, event_id) -STORED AS SCD TYPE 1; ``` -**Pattern 6: Selective column inclusion** +Variants: `TRACK HISTORY ON balance, 
status` (only those columns trigger new rows) or `TRACK HISTORY ON * EXCEPT (last_login, view_count)` (track everything except). -```sql -CREATE OR REFRESH STREAMING TABLE accounts; +### Selective columns -CREATE FLOW account_flow AS AUTO CDC INTO accounts -FROM STREAM(account_changes) -KEYS (account_id) -SEQUENCE BY modified_at -COLUMNS account_id, balance, status -STORED AS SCD TYPE 1; -``` +`COLUMNS account_id, balance, status` (include list) or `COLUMNS * EXCEPT (internal_notes, temp_field)` (exclude list). -**Pattern 7: Selective column exclusion** +### Multi-column sequencing ```sql -CREATE OR REFRESH STREAMING TABLE products; - -CREATE FLOW product_flow AS AUTO CDC INTO products -FROM STREAM(product_changes) -KEYS (product_id) -SEQUENCE BY updated_at -COLUMNS * EXCEPT (internal_notes, temp_field); +SEQUENCE BY STRUCT(event_timestamp, event_id) -- order by ts first, break ties with id ``` -**Pattern 8: SCD Type 2 with selective history tracking** +### TRUNCATE support (SCD Type 1 only) ```sql -CREATE OR REFRESH STREAMING TABLE accounts; - -CREATE FLOW account_flow AS AUTO CDC INTO accounts -FROM STREAM(account_changes) -KEYS (account_id) -IGNORE NULL UPDATES -SEQUENCE BY modified_at -STORED AS SCD TYPE 2 -TRACK HISTORY ON balance, status; --- Only balance and status changes create new history records -``` - -**Pattern 9: SCD Type 2 with history tracking exclusion** - -```sql -CREATE OR REFRESH STREAMING TABLE accounts; - -CREATE FLOW account_flow AS AUTO CDC INTO accounts -FROM STREAM(account_changes) -KEYS (account_id) -SEQUENCE BY modified_at -STORED AS SCD TYPE 2 -TRACK HISTORY ON * EXCEPT (last_login, view_count); --- Track history on all columns except last_login and view_count -``` - -**Pattern 10: Truncate support (SCD Type 1 only)** - -```sql -CREATE OR REFRESH STREAMING TABLE inventory; - -CREATE FLOW inventory_flow AS AUTO CDC INTO inventory -FROM STREAM(inventory_events) -KEYS (product_id) APPLY AS TRUNCATE WHEN operation = 'TRUNCATE' SEQUENCE BY 
event_timestamp STORED AS SCD TYPE 1; ``` - -**KEY RULES:** - -- Create target with `CREATE OR REFRESH STREAMING TABLE` before defining CDC flow -- `source` must be a streaming source for safe CDC change processing. Use `STREAM()` to read an existing table/view with streaming semantics -- The `STREAM()` function accepts ONLY a table/view identifier - NOT a subquery. Define source data as a separate streaming table or temporary view first, then reference it in the flow -- SCD Type 2 adds `__START_AT` and `__END_AT` columns for validity tracking -- When specifying the schema of the target table for SCD Type 2, you must also include the `__START_AT` and `__END_AT` columns with the same data type as the `SEQUENCE BY` field -- Legacy `APPLY CHANGES INTO` API is equivalent but deprecated - prefer `AUTO CDC INTO` -- `AUTO CDC FROM SNAPSHOT` is only available in Python, not in SQL. SQL only supports `AUTO CDC INTO` for processing CDC events from streaming sources. diff --git a/skills/databricks-pipelines/references/auto-cdc.md b/skills/databricks-pipelines/references/auto-cdc.md deleted file mode 100644 index 3bdad12..0000000 --- a/skills/databricks-pipelines/references/auto-cdc.md +++ /dev/null @@ -1,25 +0,0 @@ -# Auto CDC (apply_changes) in Spark Declarative Pipelines - -The `apply_changes` API enables processing Change Data Capture (CDC) feeds to automatically handle inserts, updates, and deletes in target tables. 
- -## Key Concepts - -Auto CDC in Spark Declarative Pipelines: - -- Automatically processes CDC operations (INSERT, UPDATE, DELETE) -- Supports SCD Type 1 (update in place) and Type 2 (historical tracking) -- Handles ordering of changes via sequence columns -- Deduplicates CDC records - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [auto-cdc-python.md](auto-cdc-python.md) -- **SQL**: [auto-cdc-sql.md](auto-cdc-sql.md) - -## Reading SCD Type 2 Tables - -For querying the history tables produced by SCD Type 2 (`__START_AT` / `__END_AT`), point-in-time queries, change analysis, and joining facts with historical dimensions, see [scd-2-querying.md](scd-2-querying.md). - -**Note**: The API is also known as `applyChanges` in some contexts. diff --git a/skills/databricks-pipelines/references/auto-loader-python.md b/skills/databricks-pipelines/references/auto-loader-python.md index 251361a..7e364f7 100644 --- a/skills/databricks-pipelines/references/auto-loader-python.md +++ b/skills/databricks-pipelines/references/auto-loader-python.md @@ -1,133 +1,63 @@ -Auto Loader (`cloudFiles`) is recommended for ingesting from cloud storage. +# Auto Loader (Python) -**Basic Syntax:** +`spark.readStream.format("cloudFiles")` for incremental ingestion from cloud storage. Returns a streaming DataFrame; use inside `@dp.table()` or `@dp.append_flow()`. ```python @dp.table() def my_table(): - return ( - spark.readStream.format("cloudFiles") - .option("cloudFiles.format", "json") # or csv, parquet, etc. 
- .load("s3://bucket/path") - ) + return (spark.readStream.format("cloudFiles") + .option("cloudFiles.format", "json") # json, csv, parquet, avro, orc, xml, text, binaryFile + .load("s3://bucket/path")) ``` -**Critical Spark Declarative Pipelines + Auto Loader Rules:** - -- Databricks automatically manages `cloudFiles.schemaLocation` and checkpoint - do NOT specify these -- Auto Loader returns a streaming DataFrame - general API guidelines for `streamingTable` apply (MANDATORY to look up `streamingTable` guide) - - Can be used in either a streaming `@dp.table()` / `@dlt.table()` or via `@dp.append_flow()` / `@dlt.append_flow()` - - Use `spark.readStream` not `spark.read` for streaming ingestion -- If manually specifying a schema, include the rescued data column (default `_rescued_data STRING`, configurable via `rescuedDataColumn` option) -- Common Schema Options: - - `cloudFiles.inferColumnTypes`: Enable type inference (default: strings for JSON/CSV/XML) - - `cloudFiles.schemaHints`: Optionally specify known column types (e.g., `"id int, name string"`) -- File detection: File notification mode recommended for scalability - -**Common Auto Loader Options** -Below are all format agnostic options for Auto Loader. 
- -Common Auto Loader Options - -| Option | Type | Notes | -| ---------------------------------------- | --------------- | ---------------------------------- | -| cloudFiles.allowOverwrites | Boolean | | -| cloudFiles.backfillInterval | Interval String | | -| cloudFiles.cleanSource | String | | -| cloudFiles.cleanSource.retentionDuration | Interval String | | -| cloudFiles.cleanSource.moveDestination | String | | -| cloudFiles.format | String | | -| cloudFiles.includeExistingFiles | Boolean | | -| cloudFiles.inferColumnTypes | Boolean | | -| cloudFiles.maxBytesPerTrigger | Byte String | | -| cloudFiles.maxFileAge | Interval String | | -| cloudFiles.maxFilesPerTrigger | Integer | | -| cloudFiles.partitionColumns | String | | -| cloudFiles.schemaEvolutionMode | String | | -| cloudFiles.schemaHints | String | | -| cloudFiles.schemaLocation | String | DO NOT SET - managed automatically | -| cloudFiles.useStrictGlobber | Boolean | | -| cloudFiles.validateOptions | Boolean | | - -Directory Listing Options - -| Option | Type | -| -------------------------------- | ------ | -| cloudFiles.useIncrementalListing | String | - -File Notification Options - -| Option | Type | -| ------------------------------- | ------------------- | -| cloudFiles.fetchParallelism | Integer | -| cloudFiles.pathRewrites | JSON String | -| cloudFiles.resourceTag | Map(String, String) | -| cloudFiles.useManagedFileEvents | Boolean | -| cloudFiles.useNotifications | Boolean | - -AWS-Specific Options - -| Option | Type | -| ---------------------------- | ------ | -| cloudFiles.region | String | -| cloudFiles.queueUrl | String | -| cloudFiles.awsAccessKey | String | -| cloudFiles.awsSecretKey | String | -| cloudFiles.roleArn | String | -| cloudFiles.roleExternalId | String | -| cloudFiles.roleSessionName | String | -| cloudFiles.stsEndpoint | String | -| databricks.serviceCredential | String | - -Azure-Specific Options - -| Option | Type | -| ---------------------------- | ------ | -| 
cloudFiles.resourceGroup | String | -| cloudFiles.subscriptionId | String | -| cloudFiles.clientId | String | -| cloudFiles.clientSecret | String | -| cloudFiles.connectionString | String | -| cloudFiles.tenantId | String | -| cloudFiles.queueName | String | -| databricks.serviceCredential | String | - -GCP-Specific Options - -| Option | Type | -| ---------------------------- | ------ | -| cloudFiles.projectId | String | -| cloudFiles.client | String | -| cloudFiles.clientEmail | String | -| cloudFiles.privateKey | String | -| cloudFiles.privateKeyId | String | -| cloudFiles.subscription | String | -| databricks.serviceCredential | String | - -Generic File Format Options - -| Option | Type | -| -------------------------------- | ---------------- | -| ignoreCorruptFiles | Boolean | -| ignoreMissingFiles | Boolean | -| modifiedAfter | Timestamp String | -| modifiedBefore | Timestamp String | -| pathGlobFilter / fileNamePattern | String | -| recursiveFileLookup | Boolean | - -Format-Specific Options - -For detailed format-specific options, refer to these files: - -- **[JSON Options](options-json.md)**: Options for reading JSON files -- **[CSV Options](options-csv.md)**: Options for reading CSV files -- **[Parquet Options](options-parquet.md)**: Options for reading Parquet files -- **[Avro Options](options-avro.md)**: Options for reading Avro files -- **[ORC Options](options-orc.md)**: Options for reading ORC files -- **[XML Options](options-xml.md)**: Options for reading XML files -- **[Text Options](options-text.md)**: Options for reading text files - -See the linked format option files for specific documentation. - -**Auto Loader documentation:** -MANDATORY: Look up the official Databricks documentation for detailed information on any specific cloudFiles (Auto Loader) option before use. Each option has extensive documentation. No exceptions. +## Rules + +- **Don't set `cloudFiles.schemaLocation`** — the pipeline manages schema location and checkpoint automatically. 
+- Use `spark.readStream` (streaming), not `spark.read` (batch). Auto Loader is streaming by definition. +- If you provide an explicit `schema=`, include the rescued-data column (default name `_rescued_data STRING`; configurable via `rescuedDataColumn` option). +- **Look up the official Databricks docs for any option before use** — every option has subtle semantics not captured here. + +## Schema handling + +- `cloudFiles.inferColumnTypes` — enable type inference (default: all-string for JSON/CSV/XML). +- `cloudFiles.schemaHints` — partial typing, e.g. `"id INT, amount DECIMAL(10,2)"`. +- `cloudFiles.schemaEvolutionMode` — how to handle new columns (`addNewColumns`, `rescue`, `failOnNewColumns`, `none`). +- Quarantine malformed rows via the rescued-data pattern in [streaming-patterns.md#rescue-data-quarantine](streaming-patterns.md#rescue-data-quarantine). + +## Common format-agnostic options + +| Option | Notes | +|---|---| +| `cloudFiles.format` | json / csv / parquet / avro / orc / xml / text / binaryFile | +| `cloudFiles.inferColumnTypes` | Enable type inference | +| `cloudFiles.schemaHints` | Partial schema declaration | +| `cloudFiles.schemaEvolutionMode` | Schema-drift handling | +| `cloudFiles.includeExistingFiles` | Backfill on first run | +| `cloudFiles.allowOverwrites` | Re-process an overwritten file | +| `cloudFiles.maxFilesPerTrigger` / `maxBytesPerTrigger` | Throttle micro-batch size | +| `cloudFiles.maxFileAge` | Skip files older than the threshold | +| `cloudFiles.backfillInterval` | Periodically re-list to catch missed files | +| `cloudFiles.cleanSource` / `.cleanSource.retentionDuration` / `.cleanSource.moveDestination` | Source-side file cleanup | +| `cloudFiles.partitionColumns` | Hive-style partition discovery | +| `cloudFiles.useStrictGlobber` | Strict glob matching | +| `cloudFiles.validateOptions` | Validate options at start | +| `cloudFiles.schemaLocation` | **DO NOT SET** — managed by the pipeline | + +Generic file options (apply to all 
formats): `ignoreCorruptFiles`, `ignoreMissingFiles`, `modifiedAfter`, `modifiedBefore`, `pathGlobFilter` / `fileNamePattern`, `recursiveFileLookup`. + +Listing strategy: + +- **Directory listing** (default for small/medium volumes): `cloudFiles.useIncrementalListing`. +- **File notification** (recommended at scale): `cloudFiles.useNotifications`, `cloudFiles.useManagedFileEvents`, `cloudFiles.fetchParallelism`, `cloudFiles.pathRewrites`, `cloudFiles.resourceTag`. + +## Cloud-specific auth options + +All clouds accept `databricks.serviceCredential` to reference a UC service credential — prefer this over inline keys. + +- **AWS**: `cloudFiles.region`, `cloudFiles.queueUrl`, `cloudFiles.awsAccessKey` / `awsSecretKey`, `cloudFiles.roleArn` / `roleExternalId` / `roleSessionName`, `cloudFiles.stsEndpoint`. +- **Azure**: `cloudFiles.resourceGroup`, `cloudFiles.subscriptionId`, `cloudFiles.clientId` / `clientSecret`, `cloudFiles.connectionString`, `cloudFiles.tenantId`, `cloudFiles.queueName`. +- **GCP**: `cloudFiles.projectId`, `cloudFiles.client`, `cloudFiles.clientEmail`, `cloudFiles.privateKey` / `privateKeyId`, `cloudFiles.subscription`. + +## Format-specific options + +See [JSON](options-json.md), [CSV](options-csv.md), [Parquet](options-parquet.md), [Avro](options-avro.md), [ORC](options-orc.md), [XML](options-xml.md), [Text](options-text.md). diff --git a/skills/databricks-pipelines/references/auto-loader-sql.md b/skills/databricks-pipelines/references/auto-loader-sql.md index 5ebcc33..bc338d0 100644 --- a/skills/databricks-pipelines/references/auto-loader-sql.md +++ b/skills/databricks-pipelines/references/auto-loader-sql.md @@ -1,83 +1,45 @@ -Auto Loader with SQL (`read_files`) is recommended for ingesting from cloud storage. +# Auto Loader (SQL) -**Basic Syntax:** +`read_files()` for incremental ingestion from cloud storage. Use inside a streaming table as `FROM STREAM read_files(...)`. 
```sql --- Using Auto Loader with CREATE STREAMING TABLE +-- In a streaming table definition CREATE OR REFRESH STREAMING TABLE my_table -AS SELECT * FROM STREAM(read_files( - 's3://bucket/path', - format => 'json' -)); +AS SELECT * FROM STREAM read_files('s3://bucket/path', format => 'json'); --- Using Auto Loader directly with CREATE FLOW (no intermediate table needed) -CREATE STREAMING TABLE target_table; +-- Or via a flow into a pre-created target +CREATE OR REFRESH STREAMING TABLE target_table; CREATE FLOW ingest_flow AS INSERT INTO target_table BY NAME -SELECT * FROM STREAM(read_files( - 's3://bucket/path', - format => 'json' -)); +SELECT * FROM STREAM read_files('s3://bucket/path', format => 'json'); ``` -**Critical Spark Declarative Pipelines + Auto Loader Rules:** +## Rules -- **MUST use `STREAM` keyword with `read_files` in streaming contexts** (e.g., `SELECT * FROM STREAM read_files(...)`) -- `inferColumnTypes` defaults to `true` - column types are automatically inferred, no need to specify unless setting to `false` -- Schema inference: Samples data initially to determine structure, then adapts as new data is encountered - - Use `schemaHints` to specify known column types (e.g., `schemaHints => 'id int, name string'`) - - Use `schemaEvolutionMode` to control how schema adapts when encountering new columns -- Unity Catalog pipelines must use external locations when loading files +- `FROM STREAM read_files(...)` (no extra parens around the function) — that's the canonical form for function sources. Without `STREAM`, `read_files` is a batch read and fails inside a streaming table. +- `inferColumnTypes` defaults to `true` for `read_files` (opposite of `cloudFiles` in Python). Set `false` to force string types. +- Use `schemaHints => 'col1 TYPE, ...'` for production tables; `schemaEvolutionMode => '...'` to control schema-drift behavior. +- Unity Catalog pipelines must use external locations to load files. 
+- **Look up the official Databricks docs for any option before use.** -**Common read_files Options** -Below are all format agnostic options for `read_files`. +## Common format-agnostic options -Basic Options +| Option | Notes | +|---|---| +| `format` | json / csv / parquet / avro / orc / xml / text / binaryFile | +| `inferColumnTypes` | Boolean. Defaults to true. | +| `partitionColumns` | Hive-style partition discovery | +| `schemaHints` | Partial schema declaration | +| `schemaEvolutionMode` | Schema-drift handling | +| `schemaLocation` | Managed automatically — don't set manually | +| `includeExistingFiles` | Backfill on first run | +| `allowOverwrites` | Re-process overwritten files | +| `maxFilesPerTrigger` / `maxBytesPerTrigger` | Throttle micro-batch size | +| `useStrictGlobber` | Strict glob matching | -| Option | Type | -| ------------------ | ------- | -| `format` | String | -| `inferColumnTypes` | Boolean | -| `partitionColumns` | String | -| `schemaHints` | String | -| `useStrictGlobber` | Boolean | +Generic file options: `ignoreCorruptFiles`, `ignoreMissingFiles`, `modifiedAfter`, `modifiedBefore`, `pathGlobFilter` / `fileNamePattern`, `recursiveFileLookup`. 
-Generic File Format Options +## Format-specific options -| Option | Type | -| ------------------------------------ | ---------------- | -| `ignoreCorruptFiles` | Boolean | -| `ignoreMissingFiles` | Boolean | -| `modifiedAfter` | Timestamp String | -| `modifiedBefore` | Timestamp String | -| `pathGlobFilter` / `fileNamePattern` | String | -| `recursiveFileLookup` | Boolean | - -Streaming Options - -| Option | Type | -| ---------------------- | ----------- | -| `allowOverwrites` | Boolean | -| `includeExistingFiles` | Boolean | -| `maxBytesPerTrigger` | Byte String | -| `maxFilesPerTrigger` | Integer | -| `schemaEvolutionMode` | String | -| `schemaLocation` | String | - -Format-Specific Options - -For detailed format-specific options, refer to these files: - -- **[JSON Options](options-json.md)**: Options for reading JSON files -- **[CSV Options](options-csv.md)**: Options for reading CSV files -- **[Parquet Options](options-parquet.md)**: Options for reading Parquet files -- **[Avro Options](options-avro.md)**: Options for reading Avro files -- **[ORC Options](options-orc.md)**: Options for reading ORC files -- **[XML Options](options-xml.md)**: Options for reading XML files -- **[Text Options](options-text.md)**: Options for reading text files - -See the linked format option files for specific documentation. - -**Auto Loader documentation:** -MANDATORY: Look up the official Databricks documentation for detailed information on any specific read_files (Auto Loader) option before use. Each option has extensive documentation. No exceptions. +See [JSON](options-json.md), [CSV](options-csv.md), [Parquet](options-parquet.md), [Avro](options-avro.md), [ORC](options-orc.md), [XML](options-xml.md), [Text](options-text.md). 
diff --git a/skills/databricks-pipelines/references/auto-loader.md b/skills/databricks-pipelines/references/auto-loader.md deleted file mode 100644 index 642031b..0000000 --- a/skills/databricks-pipelines/references/auto-loader.md +++ /dev/null @@ -1,38 +0,0 @@ -# Auto Loader (cloudFiles) - -Auto Loader is the recommended approach for incrementally ingesting data from cloud storage into Delta Lake tables. It automatically processes new files as they arrive in cloud storage. - -## Key Concepts - -Auto Loader (`cloudFiles`) provides: - -- Automatic file discovery and processing -- Schema inference and evolution -- Exactly-once processing guarantees -- Scalable incremental ingestion -- Support for various file formats - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [auto-loader-python.md](auto-loader-python.md) -- **SQL**: [auto-loader-sql.md](auto-loader-sql.md) - -## Related Patterns - -- [Rescue-data quarantine](streaming-patterns.md#rescue-data-quarantine) — route rows where Auto Loader rescued malformed fields to a side table instead of dropping them. -- [Kafka ingestion](kafka.md) — for message-bus sources (Kafka, Event Hubs). -- [Monitoring lag](streaming-patterns.md#monitoring-lag) — track end-to-end freshness. 
- -## Format-Specific Options - -For format-specific configuration options, refer to: - -- **JSON**: [options-json.md](options-json.md) -- **CSV**: [options-csv.md](options-csv.md) -- **XML**: [options-xml.md](options-xml.md) -- **Parquet**: [options-parquet.md](options-parquet.md) -- **Avro**: [options-avro.md](options-avro.md) -- **Text**: [options-text.md](options-text.md) -- **ORC**: [options-orc.md](options-orc.md) diff --git a/skills/databricks-pipelines/references/dlt-migration.md b/skills/databricks-pipelines/references/dlt-migration.md index dbde0d9..76729a1 100644 --- a/skills/databricks-pipelines/references/dlt-migration.md +++ b/skills/databricks-pipelines/references/dlt-migration.md @@ -1,428 +1,148 @@ -# Migration Guide: DLT to SDP +# Migration Guide: DLT → SDP -Guide for migrating from Delta Live Tables (DLT) to Spark Declarative Pipelines (SDP). +Two migration paths: -**Two migration paths:** -1. **DLT Python → SDP Python** (dlt → dp): Same language, new API -2. **DLT Python → SDP SQL**: Change language for simpler pipelines +1. **DLT Python → SDP Python** (`dlt` → `dp`): same language, new API. +2. **DLT Python → SDP SQL**: convert to SQL when the logic is mostly relational. ---- - -## Migration Path 1: DLT Python → SDP Python (dlt → dp) - -Use this when staying with Python but moving to the modern `pyspark.pipelines` API. 
- -### Quick Reference - -| Aspect | Legacy (`dlt`) | Modern (`dp`) | -|--------|---------------|----------------| -| **Import** | `import dlt` | `from pyspark import pipelines as dp` | -| **Table decorator** | `@dlt.table()` | `@dp.table()` | -| **Read table** | `dlt.read("table")` | `spark.read.table("table")` | -| **Read stream** | `dlt.read_stream("table")` | `spark.readStream.table("table")` | -| **CDC/SCD** | `dlt.apply_changes()` | `dp.create_auto_cdc_flow()` | -| **Clustering** | `partition_cols=["date"]` | `cluster_by=["date", "col2"]` | - -### Step-by-Step Migration - -#### Step 1: Update Imports - -```python -# Before -import dlt - -# After -from pyspark import pipelines as dp -``` - -#### Step 2: Update Decorators - -```python -# Before -@dlt.table(name="my_table") - -# After -@dp.table(name="my_table") -``` - -#### Step 3: Update Table Reads - -```python -# Before -@dlt.table(name="silver_events") -def silver_events(): - return dlt.read("bronze_events").filter(...) - -# After -@dp.table(name="silver_events") -def silver_events(): - return spark.read.table("bronze_events").filter(...) -``` +If 80%+ of the pipeline is SQL-expressible (filters, aggregations, joins, CDC, Auto Loader), prefer SDP SQL. Stay in Python when there are complex UDFs, external API calls, custom libraries, or ML inference. -```python -# Before (streaming) -@dlt.table(name="silver_events") -def silver_events(): - return dlt.read_stream("bronze_events").filter(...) - -# After (streaming) -@dp.table(name="silver_events") -def silver_events(): - return spark.readStream.table("bronze_events").filter(...) 
-``` - -#### Step 4: Update Expectations - -```python -# Before -@dlt.table(name="silver") -@dlt.expect_or_drop("valid_id", "id IS NOT NULL") - -# After (identical syntax, just change dlt → dp) -@dp.table(name="silver") -@dp.expect_or_drop("valid_id", "id IS NOT NULL") -``` - -#### Step 5: Update CDC/SCD Operations - -```python -# Before -dlt.create_streaming_table("customers_history") -dlt.apply_changes( - target="customers_history", - source="customers_cdc", - keys=["customer_id"], - sequence_by="event_timestamp", - stored_as_scd_type="2" -) - -# After -from pyspark.sql.functions import col - -dp.create_streaming_table("customers_history") -dp.create_auto_cdc_flow( - target="customers_history", - source="customers_cdc", - keys=["customer_id"], - sequence_by=col("event_timestamp"), # Note: use col() - stored_as_scd_type=2 # Note: integer, not string -) -``` - -**Key differences:** -- `apply_changes()` → `create_auto_cdc_flow()` -- `sequence_by` takes a Column object (`col("...")`) not a string -- `stored_as_scd_type` is integer `2` for Type 2, string `"1"` for Type 1 - -#### Step 6: Update Clustering (Partitioning → Liquid Clustering) - -```python -# Before (legacy partitioning) -@dlt.table( - name="bronze_events", - partition_cols=["event_date"], - table_properties={"pipelines.autoOptimize.zOrderCols": "event_type"} -) - -# After (Liquid Clustering) -@dp.table( - name="bronze_events", - cluster_by=["event_date", "event_type"] -) -``` +--- -### Complete Before/After Example +## Migration Path 1: DLT Python → SDP Python + +### Mapping + +| Concept | Legacy (`dlt`) | Modern (`dp`) | +|---------|---------------|---------------| +| Import | `import dlt` | `from pyspark import pipelines as dp` | +| Streaming table | `@dlt.table()` returning streaming DF | `@dp.table()` returning streaming DF | +| Materialized view | `@dlt.table()` returning batch DF | `@dp.materialized_view()` (preferred) | +| Temporary view | `@dlt.view()` | `@dp.temporary_view()` | +| Read batch | 
`dlt.read("t")` | `spark.read.table("t")` | +| Read stream | `dlt.read_stream("t")` | `spark.readStream.table("t")` | +| Expectations | `@dlt.expect*` | `@dp.expect*` (same names) | +| CDC | `dlt.apply_changes(...)` | `dp.create_auto_cdc_flow(...)` | +| Snapshot CDC | `dlt.apply_changes_from_snapshot(...)` | `dp.create_auto_cdc_from_snapshot_flow(...)` | +| Create empty target | `dlt.create_streaming_table(...)` | `dp.create_streaming_table(...)` | +| Partitioning | `partition_cols=["date"]` | `cluster_by=["date", ...]` (Liquid Clustering) | +| File metadata | `input_file_name()` | `F.col("_metadata.file_path")` | +| Pipeline target | `target=` parameter | `schema=` parameter | +| Read-source prefix | `LIVE.` | Bare name (modern pipelines reject `LIVE.`) | + +### Behavioral changes to watch for in CDC + +- `apply_changes(...)` → `create_auto_cdc_flow(...)`. Same parameters EXCEPT: + - `sequence_by` accepts string OR `col(...)`; either works. + - `stored_as_scd_type` is **integer `2`** for Type 2, **string `"1"`** for Type 1. 
-**Before (DLT):** ```python -import dlt -from pyspark.sql import functions as F - -@dlt.table(name="bronze_orders", partition_cols=["order_date"]) -def bronze_orders(): - return spark.readStream.format("cloudFiles").load("/data/orders") - -@dlt.table(name="silver_orders") -@dlt.expect_or_drop("valid_amount", "amount > 0") -def silver_orders(): - return dlt.read_stream("bronze_orders").filter(F.col("status") == "completed") - -dlt.create_streaming_table("dim_customers") -dlt.apply_changes( - target="dim_customers", - source="customers_cdc", - keys=["customer_id"], - sequence_by="updated_at", - stored_as_scd_type="2" -) -``` +# Legacy +dlt.apply_changes(target="dim_customers", source="customers_cdc", + keys=["customer_id"], sequence_by="updated_at", + stored_as_scd_type="2") -**After (SDP):** -```python -from pyspark import pipelines as dp -from pyspark.sql import functions as F - -@dp.table(name="bronze_orders", cluster_by=["order_date"]) -def bronze_orders(): - return spark.readStream.format("cloudFiles").load("/data/orders") - -@dp.table(name="silver_orders") -@dp.expect_or_drop("valid_amount", "amount > 0") -def silver_orders(): - return spark.readStream.table("bronze_orders").filter(F.col("status") == "completed") - -dp.create_streaming_table("dim_customers") -dp.create_auto_cdc_flow( - target="dim_customers", - source="customers_cdc", - keys=["customer_id"], - sequence_by=F.col("updated_at"), - stored_as_scd_type=2 -) +# Modern +dp.create_auto_cdc_flow(target="dim_customers", source="customers_cdc", + keys=["customer_id"], sequence_by=F.col("updated_at"), + stored_as_scd_type=2) ``` --- ## Migration Path 2: DLT Python → SDP SQL -Use this when simplifying pipelines by converting to SQL. 
+### Streaming table with Auto Loader -### Decision Matrix - -| Feature/Pattern | DLT Python | SDP SQL | Recommendation | -|-----------------|------------|---------|----------------| -| Simple transformations | ✓ | ✓ | **Migrate to SQL** | -| Aggregations | ✓ | ✓ | **Migrate to SQL** | -| Filtering, WHERE clauses | ✓ | ✓ | **Migrate to SQL** | -| CASE expressions | ✓ | ✓ | **Migrate to SQL** | -| SCD Type 1/2 | ✓ | ✓ | **Migrate to SQL** (AUTO CDC) | -| Simple joins | ✓ | ✓ | **Migrate to SQL** | -| Auto Loader | ✓ | ✓ | **Migrate to SQL** (read_files) | -| Streaming sources (Kafka) | ✓ | ✓ | **Migrate to SQL** (read_kafka) | -| Complex Python UDFs | ✓ | ❌ | **Stay in Python** | -| External API calls | ✓ | ❌ | **Stay in Python** | -| Custom libraries | ✓ | ❌ | **Stay in Python** | -| ML model inference | ✓ | ❌ | **Stay in Python** | - -**Rule**: If 80%+ is SQL-expressible, migrate to SDP SQL. If heavy Python logic, stay with Python (use modern `dp` API). - -### Side-by-Side Conversions - -#### Basic Streaming Table - -**DLT Python:** ```python +# DLT Python @dlt.table(name="bronze_sales", comment="Raw sales") def bronze_sales(): - return ( - spark.readStream.format("cloudFiles") - .option("cloudFiles.format", "json") - .load("/Volumes/my_catalog/my_schema/raw/sales") - .withColumn("_ingested_at", F.current_timestamp()) - ) + return (spark.readStream.format("cloudFiles") + .option("cloudFiles.format", "json") + .load("/Volumes/cat/sch/raw/sales") + .withColumn("_ingested_at", F.current_timestamp())) ``` -**SDP SQL:** ```sql +-- SDP SQL CREATE OR REFRESH STREAMING TABLE bronze_sales -COMMENT 'Raw sales' -AS +COMMENT 'Raw sales' AS SELECT *, current_timestamp() AS _ingested_at -FROM STREAM read_files('/Volumes/my_catalog/my_schema/raw/sales', format => 'json'); +FROM STREAM read_files('/Volumes/cat/sch/raw/sales', format => 'json'); ``` -#### Filtering and Transformations +### Filter / cast / select -**DLT Python:** -```python -@dlt.table(name="silver_sales") 
-@dlt.expect_or_drop("valid_amount", "amount > 0") -@dlt.expect_or_drop("valid_sale_id", "sale_id IS NOT NULL") -def silver_sales(): - return ( - dlt.read_stream("bronze_sales") - .withColumn("sale_date", F.to_date("sale_date")) - .withColumn("amount", F.col("amount").cast("decimal(10,2)")) - .select("sale_id", "customer_id", "amount", "sale_date") - ) -``` +DLT Python `dlt.read_stream("bronze_sales").withColumn("amount", ...cast("decimal(10,2)")).filter(...)` becomes: -**SDP SQL:** ```sql CREATE OR REFRESH STREAMING TABLE silver_sales AS -SELECT - sale_id, customer_id, - CAST(amount AS DECIMAL(10,2)) AS amount, - CAST(sale_date AS DATE) AS sale_date -FROM STREAM bronze_sales +SELECT sale_id, customer_id, + CAST(amount AS DECIMAL(10,2)) AS amount, + CAST(sale_date AS DATE) AS sale_date +FROM STREAM(bronze_sales) WHERE amount > 0 AND sale_id IS NOT NULL; ``` -#### SCD Type 2 - -**DLT Python:** -```python -dlt.create_streaming_table("customers_history") - -dlt.apply_changes( - target="customers_history", - source="customers_cdc_clean", - keys=["customer_id"], - sequence_by="event_timestamp", - stored_as_scd_type="2", - track_history_column_list=["*"] -) -``` +### SCD Type 2 -**SDP SQL:** ```sql CREATE OR REFRESH STREAMING TABLE customers_history; CREATE FLOW customers_scd2_flow AS AUTO CDC INTO customers_history -FROM stream(customers_cdc_clean) +FROM STREAM(customers_cdc_clean) KEYS (customer_id) -APPLY AS DELETE WHEN operation = "DELETE" +APPLY AS DELETE WHEN operation = 'DELETE' SEQUENCE BY event_timestamp COLUMNS * EXCEPT (operation, _ingested_at, _source_file) STORED AS SCD TYPE 2; ``` -**Note:** In SQL, put `APPLY AS DELETE WHEN` before `SEQUENCE BY`. Only list columns in `COLUMNS * EXCEPT (...)` that exist in the source. +Put `APPLY AS DELETE WHEN` **before** `SEQUENCE BY`. Only list columns in `COLUMNS * EXCEPT (...)` that exist in the source — `_rescued_data` should only appear if bronze uses rescue data. 
-#### Joins +### Expectations -**DLT Python:** -```python -@dlt.table(name="silver_sales_enriched") -def silver_sales_enriched(): - sales = dlt.read_stream("silver_sales") - products = dlt.read("dim_products") - return sales.join(products, "product_id", "left") -``` +Three options for `@dlt.expect_or_drop("valid_amount", "amount > 0")`: -**SDP SQL:** ```sql -CREATE OR REFRESH STREAMING TABLE silver_sales_enriched AS -SELECT s.*, p.product_name, p.category -FROM STREAM silver_sales s -LEFT JOIN dim_products p ON s.product_id = p.product_id; -``` +-- 1. Constraint (closest equivalent, with metrics) +CREATE OR REFRESH STREAMING TABLE silver_sales ( + CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW +) AS SELECT * FROM STREAM(bronze_sales); -### Handling Expectations - -**DLT Python:** -```python -@dlt.expect_or_drop("valid_amount", "amount > 0") -@dlt.expect_or_fail("critical_id", "id IS NOT NULL") -``` +-- 2. WHERE filter (no metrics, simplest) +... WHERE amount > 0 -**SDP SQL - Basic** (equivalent to expect_or_drop): -```sql -WHERE amount > 0 AND id IS NOT NULL +-- 3. 
Quarantine pattern (full audit trail; route bad rows to a side table) — +-- see streaming-patterns.md#rescue-data-quarantine ``` -**SDP SQL - Quarantine Pattern** (for auditing dropped records): -```sql --- Flag invalid records -CREATE OR REFRESH STREAMING TABLE bronze_data_flagged AS -SELECT *, - CASE WHEN amount <= 0 OR id IS NULL THEN TRUE ELSE FALSE END AS is_invalid -FROM STREAM bronze_data; +### UDFs --- Clean for downstream -CREATE OR REFRESH STREAMING TABLE silver_data_clean AS -SELECT * FROM STREAM bronze_data_flagged WHERE NOT is_invalid; +Simple UDFs (categorisation, math) translate to SQL `CASE`: --- Quarantine for investigation -CREATE OR REFRESH STREAMING TABLE silver_data_quarantine AS -SELECT * FROM STREAM bronze_data_flagged WHERE is_invalid; -``` - -### Handling UDFs - -#### Simple UDFs → SQL CASE - -**DLT Python:** -```python -@F.udf(returnType=StringType()) -def categorize_amount(amount): - if amount > 1000: return "High" - elif amount > 100: return "Medium" - else: return "Low" - -@dlt.table(name="sales_categorized") -def sales_categorized(): - return dlt.read("sales").withColumn("category", categorize_amount(F.col("amount"))) -``` - -**SDP SQL:** ```sql -CREATE OR REFRESH MATERIALIZED VIEW sales_categorized AS SELECT *, - CASE - WHEN amount > 1000 THEN 'High' - WHEN amount > 100 THEN 'Medium' - ELSE 'Low' - END AS category + CASE WHEN amount > 1000 THEN 'High' + WHEN amount > 100 THEN 'Medium' + ELSE 'Low' END AS category FROM sales; ``` -#### Complex UDFs → Stay in Python - -Keep in Python if: -- Complex conditional logic -- External API calls -- Custom algorithms -- ML inference - -Use modern `dp` API instead of `dlt`. 
- ---- - -## Migration Process - -### Step 1: Inventory - -Document: -- Number of tables/views -- Python UDFs (simple vs complex) -- External dependencies -- Expectations and quality rules - -### Step 2: Choose Path - -- **80%+ SQL-expressible** → Migrate to SDP SQL -- **Heavy Python logic** → Migrate to SDP Python (`dp` API) -- **Mixed** → Hybrid (SQL for most, Python for complex) - -### Step 3: Migrate by Layer - -1. **Bronze** (ingestion): `cloudFiles` → `read_files()` or keep `cloudFiles` with `dp` -2. **Silver** (cleansing): `dlt.expect*` → WHERE clause or `dp.expect*` -3. **Gold** (aggregations): Usually straightforward -4. **SCD/CDC**: `apply_changes` → AUTO CDC or `create_auto_cdc_flow` - -### Step 4: Test - -- Run both pipelines in parallel -- Compare outputs for correctness -- Validate performance -- Check quality metrics +Keep complex UDFs (external APIs, custom algorithms, ML inference) in Python with the modern `dp` API. --- -## When NOT to Migrate +## Migration Order (by layer) -**Stay with current approach if:** -1. Pipeline works well and team is comfortable -2. Heavy Python UDF usage (>30% of logic) -3. External API calls required -4. Custom ML model inference -5. Complex stateful operations not expressible in SQL -6. Limited time/resources for migration +1. **Bronze (ingestion)** — `cloudFiles` → `read_files()` (or keep `cloudFiles` with `dp`). +2. **Silver (cleansing)** — `dlt.expect*` → WHERE clause or `dp.expect*`. +3. **Gold (aggregations)** — usually straightforward port. +4. **CDC/SCD** — `apply_changes(...)` → `AUTO CDC INTO` (SQL) or `dp.create_auto_cdc_flow(...)` (Python). -**Key**: DLT and SDP are both fully supported. Migrate for simplicity or new features, not necessity. +Run old and new in parallel during cutover and diff outputs before retiring the old pipeline. 
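The parallel-run comparison described above can be sketched in plain Python. This is a minimal model, assuming each pipeline's output has already been collected into a list of row tuples; `diff_outputs` is a hypothetical helper, not part of any Databricks API. On real tables the same duplicate-aware comparison would more likely run in Spark, for example with `DataFrame.exceptAll` in both directions.

```python
from collections import Counter


def diff_outputs(old_rows, new_rows):
    """Return (only_in_old, only_in_new) as sorted lists, respecting duplicates.

    Hypothetical helper: rows are hashable tuples, so a Counter acts as a
    multiset and a duplicated row that goes missing is still reported.
    """
    old_counts = Counter(old_rows)
    new_counts = Counter(new_rows)
    only_in_old = sorted((old_counts - new_counts).elements())
    only_in_new = sorted((new_counts - old_counts).elements())
    return only_in_old, only_in_new


# Illustrative data only; these rows are made up for the sketch.
legacy = [(1, "completed", 10.0), (2, "completed", 5.5), (2, "completed", 5.5)]
migrated = [(1, "completed", 10.0), (2, "completed", 5.5), (3, "completed", 7.0)]

missing, extra = diff_outputs(legacy, migrated)
print("rows lost in migration:", missing)
print("rows introduced by migration:", extra)
```

Both lists being empty is the signal that the old pipeline can be retired; a set-based comparison would miss the dropped duplicate of row 2 in this example.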
--- @@ -430,18 +150,17 @@ Document: | Issue | Solution | |-------|----------| -| `sequence_by` type error | Use `col("column")` not string in `dp.create_auto_cdc_flow()` | -| UDF doesn't translate | Keep in Python or refactor with SQL built-ins | -| Expectations differ | Use quarantine pattern to audit dropped records | -| Performance degradation | Use `CLUSTER BY` for Liquid Clustering | -| Schema evolution different | Use `mode => 'PERMISSIVE'` in `read_files()` | -| AUTO CDC parse error | Put `APPLY AS DELETE WHEN` before `SEQUENCE BY` | +| `sequence_by` type error | Both string and `col("column")` work — confirm the column exists in the source. | +| `stored_as_scd_type` rejected | Integer `2` for Type 2, string `"1"` for Type 1. Don't quote `2`. | +| UDF doesn't translate cleanly | Keep in Python, or refactor into SQL built-ins. | +| Performance regressed | Replace `partition_cols` with `cluster_by` (Liquid Clustering). | +| Schema evolution different | Use `mode => 'PERMISSIVE'` in `read_files()` or rely on rescued-data column. | +| `AUTO CDC` parse error at `APPLY` | Put `APPLY AS DELETE WHEN` before `SEQUENCE BY`. 
| --- -## Related Documentation +## Related -- **[python/1-syntax-basics.md](python/1-syntax-basics.md)** - Modern `dp` API reference -- **[python/4-cdc-patterns.md](python/4-cdc-patterns.md)** - Python CDC patterns -- **[sql/4-cdc-patterns.md](sql/4-cdc-patterns.md)** - SQL CDC patterns -- **[SKILL.md](../SKILL.md)** - Main skill entry point +- [python-basics.md](python-basics.md) — modern `dp` API reference +- [auto-cdc-python.md](auto-cdc-python.md) / [auto-cdc-sql.md](auto-cdc-sql.md) — full CDC API +- [SKILL.md](../SKILL.md#legacy-dlt-syntax--always-migrate) — Legacy DLT Syntax mapping table diff --git a/skills/databricks-pipelines/references/expectations-python.md b/skills/databricks-pipelines/references/expectations-python.md index 484dc64..7ba5573 100644 --- a/skills/databricks-pipelines/references/expectations-python.md +++ b/skills/databricks-pipelines/references/expectations-python.md @@ -1,81 +1,26 @@ -Expectations apply data quality constraints to Lakeflow Spark Declarative Pipelines tables and views in Python. They use SQL Boolean expressions to validate each record and take actions when constraints are violated. +# Expectations (Python) -## When to Use Expectations +Data-quality constraints stacked above `@dp.materialized_view()` / `@dp.table()` / `@dp.temporary_view()` functions. Each constraint is a SQL Boolean string evaluated per row. -- Apply to `@dp.materialized_view()`/`@dp.table()`/`@dlt.table()`/`@dp.temporary_view()`/`@dp.view()`/`@dlt.view()` decorated functions -- Use on streaming tables, materialized views, or temporary views -- Stack multiple expectation decorators above the dataset function +Legacy `@dlt.expect*` decorators still parse but should be migrated to `@dp.expect*` (same names, same semantics) — see [SKILL.md Legacy DLT Syntax](../SKILL.md#legacy-dlt-syntax--always-migrate). 
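Because the decorator names are unchanged, the `@dlt.expect*` to `@dp.expect*` rename is mechanical. A minimal sketch, assuming the source file is available as a string; `migrate_expect_decorators` is a hypothetical helper, and it deliberately leaves other `dlt` usages (such as `@dlt.table` or `import dlt`) untouched, since those need the wider mapping rather than a prefix swap.

```python
import re

# Matches the six expectation decorators: expect, expect_or_drop, expect_or_fail,
# expect_all, expect_all_or_drop, expect_all_or_fail.
_EXPECT_DECORATOR = re.compile(r"@dlt\.(expect(?:_all)?(?:_or_(?:drop|fail))?)\b")


def migrate_expect_decorators(source: str) -> str:
    """Rewrite legacy @dlt.expect* decorators to @dp.expect* (hypothetical helper)."""
    return _EXPECT_DECORATOR.sub(r"@dp.\1", source)


legacy_source = (
    '@dlt.table(name="silver")\n'
    '@dlt.expect_or_drop("valid_id", "id IS NOT NULL")\n'
    '@dlt.expect_all({"valid_age": "age >= 0"})\n'
    "def silver():\n"
    '    return dlt.read_stream("bronze")\n'
)

print(migrate_expect_decorators(legacy_source))
```

A text rewrite like this is only a first pass; the result still needs review against the mapping table before it is run.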
-## Decorator Types +## Decorators -### Single Expectation Decorators +| Decorator | Action on violation | +|---|---| +| `@dp.expect(name, condition)` | Warn — invalid rows pass through, metrics logged. | +| `@dp.expect_or_drop(name, condition)` | Drop violating rows before write. | +| `@dp.expect_or_fail(name, condition)` | Fail the pipeline atomically on first violation. | +| `@dp.expect_all({name: cond, ...})` | Warn, multiple at once. | +| `@dp.expect_all_or_drop({name: cond, ...})` | Drop, multiple at once. | +| `@dp.expect_all_or_fail({name: cond, ...})` | Fail, multiple at once. | -**@dp.expect(description, constraint)** (or **@dlt.expect(description, constraint)**) +- `name` (str) — unique within the dataset; appears in metrics. +- `condition` (str) — a SQL Boolean expression. Built-ins are fine. **No** Python UDFs, external calls, or subqueries. -- Logs violations but allows invalid records to pass through -- Collects metrics for monitoring +## Patterns -**@dp.expect_or_drop(description, constraint)** (or **@dlt.expect_or_drop(description, constraint)**) - -- Removes invalid records before writing to target -- Logs dropped record metrics - -**@dp.expect_or_fail(description, constraint)** (or **@dlt.expect_or_fail(description, constraint)**) - -- Stops pipeline execution immediately on violation -- Requires manual intervention to resolve - -### Multiple Expectations Decorators - -**@dp.expect_all({description: constraint, ...})** (or **@dlt.expect_all({description: constraint, ...})**) - -- Applies multiple warn-level expectations -- Takes dictionary of description-constraint pairs - -**@dp.expect_all_or_drop({description: constraint, ...})** (or **@dlt.expect_all_or_drop({description: constraint, ...})**) - -- Applies multiple drop-level expectations -- Records dropped if any constraint fails - -**@dp.expect_all_or_fail({description: constraint, ...})** (or **@dlt.expect_all_or_fail({description: constraint, ...})**) - -- Applies multiple fail-level 
expectations -- Pipeline stops if any constraint fails - -## Parameters - -**description** (str, required) - -- Unique identifier for the constraint within the dataset -- Should clearly communicate what is being validated -- Can be reused across different datasets - -**constraint** (str, required) - -- SQL Boolean expression evaluated per record -- Must return true or false -- Cannot contain Python functions or UDFs, external calls, or subqueries -- Cannot include subqueries in constraint logic - -## Usage Examples - -All variants below work on both the `table`, `materialized_view` or `view` decorators. - -### Basic Single Expectation - -```python -@dp.materialized_view() -@dp.expect("valid_price", "price >= 0") -def sales_data(): - return spark.read.table("raw_sales") - -@dp.table() -@dp.expect("valid_price", "price >= 0") -def sales_data(): - return spark.read.table("raw_sales") -``` - -### Drop Invalid Records +### Single decorator ```python @dp.materialized_view() @@ -84,34 +29,25 @@ def customer_contacts(): return spark.read.table("raw_contacts") ``` -### Fail on Critical Violations +`@dp.expect("name", "cond")` (warn) and `@dp.expect_or_fail(...)` (fail) follow the same shape. 
-```python -@dp.materialized_view() -@dp.expect_or_fail("required_id", "customer_id IS NOT NULL") -def customer_master(): - return spark.read.table("raw_customers") -``` - -### Multiple Expectations +### Multiple expectations, same action — use `expect_all` ```python @dp.materialized_view() @dp.expect_all({ - "valid_age": "age >= 0 AND age <= 120", + "valid_age": "age >= 0 AND age <= 120", "valid_country": "country_code IN ('US', 'CA', 'MX')", - "recent_date": "created_date >= '2020-01-01'" + "recent_date": "created_date >= '2020-01-01'", }) def validated_customers(): return spark.read.table("raw_customers") ``` -### Stacking Multiple Decorators +### Multiple expectations, mixed actions — stack decorators ```python -@dp.materialized_view( - comment="Clean customer data with quality checks" -) +@dp.materialized_view(comment="Clean customer data") @dp.expect_or_drop("valid_email", "email LIKE '%@%'") @dp.expect_or_fail("required_id", "id IS NOT NULL") @dp.expect("valid_age", "age BETWEEN 0 AND 120") @@ -119,32 +55,23 @@ def customers_clean(): return spark.read.table("raw_customers") ``` -### With Views +### Temporary view with expectations ```python -@dp.view( - name="high_value_customers", - comment="Customers with total purchases over $1000" -) +@dp.temporary_view(name="high_value_customers") @dp.expect("valid_total", "total_purchases > 0") def high_value_view(): - return spark.read.table("orders") \ - .groupBy("customer_id") \ - .agg(sum("amount").alias("total_purchases")) \ - .filter("total_purchases > 1000") + return (spark.read.table("orders") + .groupBy("customer_id") + .agg(F.sum("amount").alias("total_purchases")) + .filter("total_purchases > 1000")) ``` -## Monitoring - -- View metrics in pipeline UI -- Query the event log for detailed analytics -- Metrics unavailable if pipeline fails or no updates occur - ## Best Practices -- Use unique, descriptive names for each expectation -- Apply `expect_or_fail` for critical business constraints -- Use 
`expect_or_drop` for data cleansing operations -- Use `expect` for monitoring optional quality metrics -- Keep constraint logic simple and SQL-based only -- Group related expectations using `expect_all` variants +- Unique, descriptive names — they appear in metrics. +- `expect_or_fail` for critical business invariants. +- `expect_or_drop` for cleansing operations. +- `expect` (warn) for measuring soft quality without blocking. +- Group same-action constraints in `expect_all*` rather than stacking many decorators. +- Predicate is a SQL string — no Python UDFs, subqueries, external calls. diff --git a/skills/databricks-pipelines/references/expectations-sql.md b/skills/databricks-pipelines/references/expectations-sql.md index cecece3..850ff01 100644 --- a/skills/databricks-pipelines/references/expectations-sql.md +++ b/skills/databricks-pipelines/references/expectations-sql.md @@ -1,155 +1,64 @@ -Expectations apply data quality constraints to Lakeflow Spark Declarative Pipelines tables and views in SQL. They use SQL Boolean expressions to validate each record and take actions when constraints are violated. +# Expectations (SQL) -## When to Use Expectations +Data-quality constraints inside `CREATE OR REFRESH STREAMING TABLE` / `MATERIALIZED VIEW` / `CREATE LIVE VIEW`. Each constraint is a SQL Boolean expression evaluated per row; the action on violation is `(default)` warn, `DROP ROW`, or `FAIL UPDATE`. -- Apply within `CREATE OR REFRESH STREAMING TABLE`, `CREATE OR REFRESH MATERIALIZED VIEW`, or `CREATE LIVE VIEW` statements -- Use as optional clauses in table/view creation statements -- Stack multiple CONSTRAINT clauses (comma-separated) in a single statement +> `CREATE TEMPORARY VIEW` does NOT support `CONSTRAINT` clauses. Use `CREATE LIVE VIEW` for the edge case of "temporary view with expectations" — see [temporary-view-sql.md#using-expectations-with-temporary-views](temporary-view-sql.md#using-expectations-with-temporary-views). 
-**Note on Temporary Views**: Use `CREATE LIVE VIEW` syntax when you need to include expectations with temporary views. The newer `CREATE TEMPORARY VIEW` syntax does not support CONSTRAINT clauses. `CREATE LIVE VIEW` is retained specifically for this use case, even though `CREATE TEMPORARY VIEW` is otherwise preferred for temporary views without expectations. - -## Constraint Syntax - -### Single Expectation (Warn) - -**CONSTRAINT constraint_name EXPECT (condition)** - -- Logs violations but allows invalid records to pass through -- Collects metrics for monitoring -- Invalid records are retained in target dataset - -### Single Expectation (Drop) - -**CONSTRAINT constraint_name EXPECT (condition) ON VIOLATION DROP ROW** - -- Removes invalid records before writing to target -- Logs dropped record metrics -- Invalid records are excluded from target - -### Single Expectation (Fail) - -**CONSTRAINT constraint_name EXPECT (condition) ON VIOLATION FAIL UPDATE** - -- Stops pipeline execution immediately on violation -- Requires manual intervention to resolve -- Transaction rolls back atomically - -### Multiple Expectations - -Multiple CONSTRAINT clauses can be stacked in a single CREATE statement using commas: +## Syntax ```sql -CREATE OR REFRESH STREAMING TABLE table_name( - CONSTRAINT name1 EXPECT (condition1), - CONSTRAINT name2 EXPECT (condition2) ON VIOLATION DROP ROW, - CONSTRAINT name3 EXPECT (condition3) ON VIOLATION FAIL UPDATE +CREATE OR REFRESH STREAMING TABLE table_name ( + CONSTRAINT name1 EXPECT (cond1), -- warn (default) + CONSTRAINT name2 EXPECT (cond2) ON VIOLATION DROP ROW, -- drop violating rows + CONSTRAINT name3 EXPECT (cond3) ON VIOLATION FAIL UPDATE -- fail pipeline on first violation ) AS SELECT ... 
``` -## Parameters - -**constraint_name** (required) - -- Unique identifier for the constraint within the dataset -- Should clearly communicate what is being validated -- Can be reused across different datasets - -**condition** (required) - -- SQL Boolean expression evaluated per record -- Must return true or false -- Can include SQL functions (e.g., year(), date(), CASE statements) -- Cannot contain Python functions or UDFs, external calls, or subqueries - -## Usage Examples - -### Basic Single Expectation - -```sql -CREATE OR REFRESH STREAMING TABLE sales_data( - CONSTRAINT valid_price EXPECT (price >= 0) -) AS -SELECT * FROM STREAM(raw_sales); -``` - -### Drop Invalid Records - -```sql -CREATE OR REFRESH STREAMING TABLE customer_contacts( - CONSTRAINT valid_email EXPECT ( - email IS NOT NULL AND email LIKE '%@%' - ) ON VIOLATION DROP ROW -) AS -SELECT * FROM STREAM(raw_contacts); -``` - -### Fail on Critical Violations - -```sql -CREATE OR REFRESH MATERIALIZED VIEW customer_master( - CONSTRAINT required_id EXPECT (customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE -) AS -SELECT * FROM raw_customers; -``` - -### Multiple Expectations +- `constraint_name` must be unique within the dataset; describes what's validated. +- `condition` is a SQL Boolean expression. Built-in functions (`year(...)`, `current_date()`, `CASE`, ...) are fine. **No** Python UDFs, external calls, or subqueries. +- Multiple `CONSTRAINT` clauses are stacked comma-separated and each can have a different action. +- Action semantics: + - **warn (default)**: violations logged, invalid rows still written to the target. Metrics collected. + - **`DROP ROW`**: violating rows dropped before write. Metrics collected. + - **`FAIL UPDATE`**: first violation fails the pipeline atomically; transaction rolls back. Requires manual fix. 
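The three action semantics above can be modelled in a few lines of plain Python. This is an illustrative sketch only, not how the pipeline engine is implemented: `apply_expectations` is a hypothetical function, predicates are Python callables standing in for SQL Boolean expressions, and SQL `NULL` handling is not modelled.

```python
def apply_expectations(rows, constraints):
    """Model warn / drop / fail on a list of dict rows.

    constraints: list of (name, predicate, action), action in {"warn", "drop", "fail"}.
    Returns (rows_written, violation_counts). Hypothetical helper for illustration.
    """
    violations = {name: 0 for name, _, _ in constraints}
    written = []
    for row in rows:
        keep = True
        for name, predicate, action in constraints:
            if predicate(row):
                continue
            violations[name] += 1
            if action == "fail":
                # Nothing is returned, mirroring an update that rolls back.
                raise ValueError(f"expectation {name!r} violated; update fails")
            if action == "drop":
                keep = False  # row is excluded from the target
            # "warn": violation is counted, row is still written
        if keep:
            written.append(row)
    return written, violations


rows = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": "not-an-email", "age": 40},
    {"id": 3, "email": "c@example.com", "age": 150},
]
constraints = [
    ("valid_email", lambda r: "@" in r["email"], "drop"),
    ("valid_age", lambda r: 0 <= r["age"] <= 120, "warn"),
]

written, violations = apply_expectations(rows, constraints)
print([r["id"] for r in written], violations)
```

In this model row 2 is dropped, row 3 is written despite its age because `valid_age` only warns, and both violations are counted.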
-```sql -CREATE OR REFRESH STREAMING TABLE validated_customers( - CONSTRAINT valid_age EXPECT (age >= 0 AND age <= 120), - CONSTRAINT valid_country EXPECT (country_code IN ('US', 'CA', 'MX')), - CONSTRAINT recent_date EXPECT (created_date >= '2020-01-01') -) AS -SELECT * FROM STREAM(raw_customers); -``` +## Patterns -### Stacking Multiple Constraints with Different Actions +### Mixed actions in one dataset ```sql -CREATE OR REFRESH STREAMING TABLE customers_clean -( - CONSTRAINT valid_email EXPECT (email LIKE '%@%') ON VIOLATION DROP ROW, - CONSTRAINT required_id EXPECT (id IS NOT NULL) ON VIOLATION FAIL UPDATE, - CONSTRAINT valid_age EXPECT (age BETWEEN 0 AND 120) -) -COMMENT "Clean customer data with quality checks" AS -SELECT * FROM STREAM(raw_customers); +CREATE OR REFRESH STREAMING TABLE customers_clean ( + CONSTRAINT valid_email EXPECT (email LIKE '%@%') ON VIOLATION DROP ROW, + CONSTRAINT required_id EXPECT (id IS NOT NULL) ON VIOLATION FAIL UPDATE, + CONSTRAINT valid_age EXPECT (age BETWEEN 0 AND 120) -- warn only +) AS SELECT * FROM STREAM(raw_customers); ``` -### With SQL Functions +### With SQL functions / complex predicates ```sql -CREATE OR REFRESH STREAMING TABLE transactions( - CONSTRAINT valid_date EXPECT (year(transaction_date) >= 2020), - CONSTRAINT non_negative_price EXPECT (price >= 0), - CONSTRAINT valid_purchase_date EXPECT (transaction_date <= current_date()) -) AS -SELECT * FROM STREAM(raw_transactions); -``` - -### Complex Business Logic - -```sql -CREATE OR REFRESH MATERIALIZED VIEW active_subscriptions( - CONSTRAINT valid_subscription_dates EXPECT ( +CREATE OR REFRESH STREAMING TABLE transactions ( + CONSTRAINT valid_date EXPECT (year(transaction_date) >= 2020), + CONSTRAINT non_negative_price EXPECT (price >= 0), + CONSTRAINT recent_purchase EXPECT (transaction_date <= current_date()) +) AS SELECT * FROM STREAM(raw_transactions); + +CREATE OR REFRESH MATERIALIZED VIEW active_subscriptions ( + CONSTRAINT valid_dates EXPECT ( start_date <= 
end_date AND end_date <= current_date() AND start_date >= '2020-01-01' ) ON VIOLATION DROP ROW -) AS -SELECT * FROM subscriptions WHERE status = 'active'; +) AS SELECT * FROM subscriptions WHERE status = 'active'; ``` -### With Temporary Views +### Temporary view + expectation (only via `CREATE LIVE VIEW`) ```sql -CREATE LIVE VIEW high_value_customers( +CREATE LIVE VIEW high_value_customers ( CONSTRAINT valid_total EXPECT (total_purchases > 0) -) -COMMENT "Customers with total purchases over $1000" AS -SELECT - customer_id, - SUM(amount) AS total_purchases +) AS +SELECT customer_id, SUM(amount) AS total_purchases FROM orders GROUP BY customer_id HAVING total_purchases > 1000; @@ -157,15 +66,12 @@ HAVING total_purchases > 1000; ## Monitoring -- View metrics in pipeline UI under the **Data quality** tab -- Query the event log for detailed analytics -- Metrics available for `warn` and `drop` actions -- Metrics unavailable if pipeline fails or no updates occur +Metrics show up in the pipeline UI **Data quality** tab and the event log. Available for `warn` and `DROP ROW` actions. Unavailable if the pipeline fails before completion. ## Best Practices -- Use unique, descriptive names for each constraint -- Apply `ON VIOLATION FAIL UPDATE` for critical business constraints -- Use `ON VIOLATION DROP ROW` for data cleansing operations -- Use default (warn) behavior for monitoring optional quality metrics -- Keep constraint logic simple +- Unique, descriptive constraint names — they appear in metrics. +- `FAIL UPDATE` for critical business invariants (anything that should never reach downstream consumers). +- `DROP ROW` for data-cleansing operations where you accept some loss. +- Default (warn) for soft quality metrics you want to *measure* without blocking. +- Keep the predicate simple — no Python, no subqueries, no UDFs. 
diff --git a/skills/databricks-pipelines/references/expectations.md b/skills/databricks-pipelines/references/expectations.md deleted file mode 100644 index 129a59c..0000000 --- a/skills/databricks-pipelines/references/expectations.md +++ /dev/null @@ -1,19 +0,0 @@ -# Expectations (Data Quality) in Spark Declarative Pipelines - -Expectations enable you to define and enforce data quality constraints on your pipeline tables. - -## Key Concepts - -Expectations in Spark Declarative Pipelines: - -- Define constraints on data quality -- Can drop, fail, or track invalid records -- Support complex validation logic -- Integrated with pipeline monitoring - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [expectations-python.md](expectations-python.md) -- **SQL**: [expectations-sql.md](expectations-sql.md) diff --git a/skills/databricks-pipelines/references/foreach-batch-sink-python.md b/skills/databricks-pipelines/references/foreach-batch-sink-python.md index 17dc80a..a0bba47 100644 --- a/skills/databricks-pipelines/references/foreach-batch-sink-python.md +++ b/skills/databricks-pipelines/references/foreach-batch-sink-python.md @@ -1,121 +1,70 @@ -ForEachBatch sinks in Spark Declarative Pipelines process a stream as micro-batches with custom Python logic. **Public Preview** — this API may change. +# ForEachBatch Sinks (Python, Public Preview) -**When to use:** Use ForEachBatch when built-in sink formats (`delta`, `kafka`) are insufficient: +Process the stream as micro-batches with custom Python logic — for things the built-in `delta` / `kafka` sinks can't do: MERGE/upsert into Delta, fan out to multiple destinations per batch, or write to unsupported targets (JDBC, etc.). 
-- Custom merge/upsert logic into a Delta table -- Writing to multiple destinations per batch -- Writing to unsupported streaming sinks (e.g., JDBC targets) -- Custom per-batch transformations - -**API Reference:** - -**@dp.foreach_batch_sink()** -Decorator that defines a ForEachBatch sink. The decorated function is called for each micro-batch. +## `@dp.foreach_batch_sink(name="...")` ```python -@dp.foreach_batch_sink(name="") +@dp.foreach_batch_sink(name="") # name optional; defaults to function name def my_sink(df, batch_id): - # df: Spark DataFrame with micro-batch data - # batch_id: integer ID for the micro-batch (0 = start of stream or full refresh) - # Access SparkSession via df.sparkSession - pass -``` - -Parameters: - -- `name` (str): Optional. Unique name for the sink within the pipeline. Defaults to function name. - -The decorated function receives: - -- `df` (DataFrame): Spark DataFrame containing data for the current micro-batch -- `batch_id` (int): Integer ID of the micro-batch. Spark increments this for each trigger interval. `0` means start of stream or beginning of a full refresh — the handler should properly handle a full refresh for downstream data sources. - -The handler does not need to return a value. - -**Writing to a ForEachBatch Sink:** - -Use `@dp.append_flow()` with the `target` parameter matching the sink name: - -```python -@dp.append_flow(target="my_sink") -def my_flow(): - return spark.readStream.table("source_table") + # df: micro-batch DataFrame + # batch_id: int, increments per trigger. 0 = first run OR start of full refresh. + # Access SparkSession via df.sparkSession (NOT the module-level `spark`) + ... ``` -**Common Patterns:** +The handler doesn't return a value. Write to a sink via `@dp.append_flow(target="")` — multiple flows can target the same sink, each with its own checkpoint. -**Pattern 1: Merge/upsert into a Delta table** +## Patterns -The target table must already exist before the MERGE runs. 
Create it externally or handle creation in the handler. +### MERGE/upsert into an existing Delta table ```python @dp.foreach_batch_sink(name="upsert_sink") def upsert_sink(df, batch_id): df.createOrReplaceTempView("batch_data") df.sparkSession.sql(""" - MERGE INTO target_catalog.schema.target_table AS target - USING batch_data AS source - ON target.id = source.id + MERGE INTO target_catalog.schema.target_table AS t + USING batch_data AS s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * """) - return @dp.append_flow(target="upsert_sink") def upsert_flow(): return spark.readStream.table("source_events") ``` -**Pattern 2: Write to multiple destinations with idempotent writes** +The target Delta table must exist before the MERGE runs — create it externally or in the handler on `batch_id == 0`. + +### Fan out to multiple destinations (idempotent) -Use `txnVersion`/`txnAppId` for idempotent Delta writes — if a batch partially fails and retries, already-completed writes are safely skipped. +Use `txnVersion` + `txnAppId` so partial-failure retries don't double-write. 
```python -app_id = "my-app-name" # must be unique per application writing to the same table +APP_ID = "my-app-name" # unique per application writing to the same target @dp.foreach_batch_sink(name="multi_target_sink") def multi_target_sink(df, batch_id): + df.persist() # avoid re-reading the source for each destination df.write.format("delta").mode("append") \ - .option("txnVersion", batch_id).option("txnAppId", app_id) \ + .option("txnVersion", batch_id).option("txnAppId", APP_ID) \ .saveAsTable("my_catalog.my_schema.table_a") df.write.format("json").mode("append") \ - .option("txnVersion", batch_id).option("txnAppId", app_id) \ + .option("txnVersion", batch_id).option("txnAppId", APP_ID) \ .save("/tmp/json_target") - return @dp.append_flow(target="multi_target_sink") def multi_target_flow(): return spark.readStream.table("processed_events") ``` -When writing to multiple destinations, use `df.persist()` or `df.cache()` inside the handler to read the source data only once instead of once per destination. - -**Pattern 3: Enrich and write to an external Delta table** - -```python -from pyspark.sql.functions import current_timestamp - -@dp.foreach_batch_sink(name="enriched_sink") -def enriched_sink(df, batch_id): - enriched = df.withColumn("processed_timestamp", current_timestamp()) - enriched.write.format("delta").mode("append") \ - .saveAsTable("my_catalog.my_schema.enriched_events") - return - -@dp.append_flow(target="enriched_sink") -def enriched_flow(): - return spark.readStream.table("source_events") -``` +## Key rules -**KEY RULES:** - -- ForEachBatch sinks are **Python only** and in **Public Preview** -- Designed for streaming queries (`append_flow`) only — not for batch-only pipelines or Auto CDC semantics -- The pipeline does NOT track data written from a ForEachBatch sink — you manage downstream data and retention -- On full refresh, checkpoints reset and `batch_id` restarts from 0. 
Data in your target is NOT automatically cleaned up — you must manually drop or truncate target tables/locations if a clean slate is needed -- Multiple `@dp.append_flow()` decorators can target the same sink — each flow maintains its own checkpoint -- To access SparkSession inside the handler, use `df.sparkSession` (not `spark`) -- ForEachBatch supports all Unity Catalog features — you can write to UC managed or external tables and volumes -- When writing to multiple destinations, use `df.persist()` or `df.cache()` to avoid multiple source reads, and `txnVersion`/`txnAppId` for idempotent Delta writes -- Keep the handler function concise — avoid threading, heavy library dependencies, or large in-memory data manipulations -- **databricks-connect compatibility**: If your pipeline may run on databricks-connect, the handler function must be serializable and must not use `dbutils`. Avoid referencing local objects, classes, or unpickleable resources — use pure Python modules. Move `dbutils` calls (e.g., `dbutils.widgets.get()`) outside the handler and capture values in variables. The pipeline raises a warning in the event log for non-serializable UDFs but does not fail the pipeline. However, non-serializable logic can break at runtime in databricks-connect contexts +- Streaming only — append flows only. No batch DataFrames, no Auto CDC. +- The pipeline does NOT track sink data. On full refresh, checkpoints reset and `batch_id` restarts at 0 but **your target is NOT cleaned up** — truncate/drop manually if you want a clean slate. +- Access the session via `df.sparkSession`, not the module-level `spark`. +- Multiple `@dp.append_flow`s can target the same sink; each maintains its own checkpoint. +- For Delta writes use `txnVersion`/`txnAppId` for idempotency. For multi-destination handlers, `df.persist()` / `df.cache()` to avoid re-reading the source. +- Keep handlers small — no threading, no heavy libraries, no large in-memory work. 
+- **databricks-connect**: the handler must be serializable and must not call `dbutils`. Capture `dbutils.widgets.get(...)` values into variables *outside* the handler. Non-serializable handlers log a warning but may fail at runtime. diff --git a/skills/databricks-pipelines/references/foreach-batch-sink.md b/skills/databricks-pipelines/references/foreach-batch-sink.md deleted file mode 100644 index 348e8c5..0000000 --- a/skills/databricks-pipelines/references/foreach-batch-sink.md +++ /dev/null @@ -1,20 +0,0 @@ -# ForEachBatch Sinks in Spark Declarative Pipelines - -> **Public Preview** — This API may change. - -ForEachBatch sinks process a stream as a series of micro-batches, each handled by a custom Python function. Use when built-in sink formats (Delta, Kafka) are insufficient. - -## When to Use - -- Custom merge/upsert into a Delta table -- Writing to multiple destinations per batch -- Unsupported streaming sinks (e.g., JDBC targets) -- Custom per-batch transformations - -## Language Support - -- **Python only** — SQL does not support ForEachBatch sinks. - -## Implementation Guide - -- **Python**: [foreach-batch-sink-python.md](foreach-batch-sink-python.md) diff --git a/skills/databricks-pipelines/references/kafka.md b/skills/databricks-pipelines/references/kafka.md index 5e776b9..41b859e 100644 --- a/skills/databricks-pipelines/references/kafka.md +++ b/skills/databricks-pipelines/references/kafka.md @@ -1,22 +1,20 @@ # Kafka Ingestion -Ingest from Apache Kafka into streaming tables. Examples in both Python (`spark.readStream.format("kafka")`) and SQL (`read_kafka()`). Same pattern works for Azure Event Hubs via the Kafka protocol — see [Event Hubs](#event-hubs) below. +Ingest from Apache Kafka into a streaming table. Same shape works for Azure Event Hubs (Kafka protocol on port 9093) — only the connection string and SASL config differ. 
-For Kinesis, Pub/Sub, and Pulsar, use the analogous `read_kinesis`, `read_pubsub`, `read_pulsar` functions / Spark formats — same overall shape as below. - ---- +For Kinesis, Pub/Sub, and Pulsar, use the analogous `read_kinesis` / `read_pubsub` / `read_pulsar` SQL functions or `spark.readStream.format("kinesis|pubsub|pulsar")` — same overall shape as below. ## Basic Read -Kafka returns rows with binary `key` and `value` columns plus `topic`, `partition`, `offset`, `timestamp`. Cast to strings (or `from_json` / `from_avro`) downstream. +Kafka returns rows with binary `key` and `value` columns plus `topic`, `partition`, `offset`, `timestamp`. Cast to `STRING`/`BINARY` and parse downstream — don't carry raw bytes. ```sql CREATE OR REFRESH STREAMING TABLE bronze_kafka_events AS SELECT CAST(key AS STRING) AS event_key, CAST(value AS STRING) AS event_value, topic, partition, offset, - timestamp AS kafka_timestamp, - current_timestamp() AS _ingested_at + timestamp AS kafka_timestamp, + current_timestamp() AS _ingested_at FROM read_kafka( bootstrapServers => '${kafka_brokers}', subscribe => 'events-topic', @@ -24,29 +22,9 @@ FROM read_kafka( ); ``` -```python -from pyspark import pipelines as dp -from pyspark.sql import functions as F - -@dp.table(name="bronze_kafka_events") -def bronze_kafka_events(): - kafka_brokers = spark.conf.get("kafka_brokers") - return ( - spark.readStream.format("kafka") - .option("kafka.bootstrap.servers", kafka_brokers) - .option("subscribe", "events-topic") - .option("startingOffsets", "latest") - .load() - .selectExpr( - "CAST(key AS STRING) AS event_key", - "CAST(value AS STRING) AS event_value", - "topic", "partition", "offset", - "timestamp AS kafka_timestamp") - .withColumn("_ingested_at", F.current_timestamp()) - ) -``` +Python equivalent: `spark.readStream.format("kafka").option("kafka.bootstrap.servers", spark.conf.get("kafka_brokers")).option("subscribe", "events-topic").option("startingOffsets", "latest").load().selectExpr(...)` + 
`.withColumn("_ingested_at", F.current_timestamp())`. -**Documentation**: [`read_kafka` function reference](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_kafka). +[`read_kafka` reference](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_kafka). ### Common options @@ -56,15 +34,13 @@ def bronze_kafka_events(): | `subscribe` | Topic name or comma-separated list. | | `subscribePattern` | Regex over topic names (alternative to `subscribe`). | | `startingOffsets` | `"latest"`, `"earliest"`, or JSON per-partition offsets. | -| `endingOffsets` | Only for batch reads — ignored in streaming. | +| `endingOffsets` | Batch reads only; ignored in streaming. | | `maxOffsetsPerTrigger` | Throttle per micro-batch. | -| `failOnDataLoss` | Default `true`. Set `false` only when you accept gaps. | - ---- +| `failOnDataLoss` | Default `true`. `false` only when you accept gaps. | ## Parse JSON Payloads -`value` is a binary/string blob. Extract structured columns with `from_json` (SQL/Python) against an explicit schema. +`value` is a blob. Extract structured columns with `from_json` against an explicit schema — JSON-schema inference from a streaming Kafka source is not supported. 
```sql CREATE OR REFRESH STREAMING TABLE silver_events AS @@ -73,39 +49,17 @@ FROM ( SELECT from_json(event_value, 'event_id STRING, event_type STRING, timestamp TIMESTAMP') AS data, kafka_timestamp, _ingested_at - FROM STREAM bronze_kafka_events + FROM STREAM(bronze_kafka_events) ); ``` -```python -from pyspark.sql.types import StructType, StructField, StringType, TimestampType - -event_schema = StructType([ - StructField("event_id", StringType()), - StructField("event_type", StringType()), - StructField("timestamp", TimestampType()), -]) - -@dp.table(name="silver_events") -def silver_events(): - return ( - spark.readStream.table("bronze_kafka_events") - .withColumn("data", F.from_json("event_value", event_schema)) - .select("data.*", "kafka_timestamp", "_ingested_at") - ) -``` - -**Schema hygiene**: keep the schema in code (Python `StructType` or SQL string), versioned alongside the pipeline. Inferring JSON schema from a streaming Kafka source is not supported — the schema must be explicit. +Python: build a `StructType` and `.withColumn("data", F.from_json("event_value", event_schema)).select("data.*", ...)`. Keep the schema in code, versioned alongside the pipeline. For Avro / Protobuf payloads, swap `from_json` for `from_avro` / `from_protobuf` (with Schema Registry config). Same overall pattern. ---- - ## Authentication -### Databricks Secrets - -Don't put credentials in code or pipeline config literally. Use `{{secrets/scope/key}}` interpolation. +Use `{{secrets/scope/key}}` interpolation in SQL or `dbutils.secrets.get(scope, key)` in Python. Never hard-code credentials. 
```sql -- SASL/PLAIN @@ -121,109 +75,58 @@ FROM read_kafka( ); ``` -```python -@dp.table(name="bronze_kafka_authenticated") -def bronze_kafka_authenticated(): - username = dbutils.secrets.get(scope="kafka", key="username") - password = dbutils.secrets.get(scope="kafka", key="password") - return ( - spark.readStream.format("kafka") - .option("kafka.bootstrap.servers", spark.conf.get("kafka_brokers")) - .option("subscribe", "events-topic") - .option("kafka.security.protocol", "SASL_SSL") - .option("kafka.sasl.mechanism", "PLAIN") - .option("kafka.sasl.jaas.config", - f'org.apache.kafka.common.security.plain.PlainLoginModule required ' - f'username="{username}" password="{password}";') - .load() - ) -``` - -### TLS / mTLS - -For mTLS, additional `kafka.ssl.truststore.*` and `kafka.ssl.keystore.*` options are required. Truststore/keystore files typically come from Unity Catalog volumes; pass file paths via pipeline config. - ---- +For mTLS, add `kafka.ssl.truststore.*` and `kafka.ssl.keystore.*` options pointing at files in a UC volume; pass paths via pipeline config. ## Event Hubs (via Kafka protocol) -Azure Event Hubs speaks the Kafka protocol on port 9093. Use the same Kafka source — only the connection string changes. 
+Same Kafka source — change the connection target and auth: ```sql -FROM read_kafka( - bootstrapServers => '.servicebus.windows.net:9093', - subscribe => '', - `kafka.security.protocol` => 'SASL_SSL', - `kafka.sasl.mechanism` => 'PLAIN', - `kafka.sasl.jaas.config` => - 'org.apache.kafka.common.security.plain.PlainLoginModule required ' || - 'username="$ConnectionString" ' || +bootstrapServers => '.servicebus.windows.net:9093', +subscribe => '', +`kafka.security.protocol` => 'SASL_SSL', +`kafka.sasl.mechanism` => 'PLAIN', +`kafka.sasl.jaas.config` => + 'org.apache.kafka.common.security.plain.PlainLoginModule required ' + 'username="$ConnectionString" ' 'password="{{secrets/eventhub/connection-string}}";' -); ``` -```python -@dp.table(name="bronze_eventhub_events") -def bronze_eventhub_events(): - conn_str = dbutils.secrets.get(scope="eventhub", key="connection-string") - return ( - spark.readStream.format("kafka") - .option("kafka.bootstrap.servers", ".servicebus.windows.net:9093") - .option("subscribe", "") - .option("kafka.security.protocol", "SASL_SSL") - .option("kafka.sasl.mechanism", "PLAIN") - .option("kafka.sasl.jaas.config", - 'org.apache.kafka.common.security.plain.PlainLoginModule required ' - f'username="$ConnectionString" password="{conn_str}";') - .load() - ) -``` - -The username is the literal string `$ConnectionString` and the password is the namespace-level or entity-level connection string (with `SharedAccessKey=…`). - ---- +The username is the literal `$ConnectionString`; the password is the namespace- or entity-level connection string (with `SharedAccessKey=...`). ## Pipeline Configuration -Pass Kafka brokers, topics, and consumer-group identity through pipeline configuration so dev/prod can differ without code changes. +Pass brokers, topics, consumer-group identity through pipeline config so dev/prod differ without code changes. ```yaml -# In resources/.pipeline.yml +# resources/.pipeline.yml resources: pipelines: my_pipeline: - ... 
configuration: kafka_brokers: "broker-1:9092,broker-2:9092,broker-3:9092" kafka_topic: "events-topic" ``` -Read in code with `spark.conf.get("kafka_brokers")` (Python) or `${kafka_brokers}` (SQL). - ---- +Read with `spark.conf.get("kafka_brokers")` (Python) or `${kafka_brokers}` (SQL). -## Writing to Kafka (Sinks) +## Writing to Kafka (sinks) -Sinks are Python-only. Write a payload to Kafka by creating a sink with `format="kafka"` and appending via `@dp.append_flow`. The `value` column is mandatory — use `to_json(struct(*))` to serialize the row. See [sink.md](sink.md) and [sink-python.md](sink-python.md). - ---- +Sinks are Python-only. Create a sink with `format="kafka"` and write via `@dp.append_flow`. The `value` column is mandatory — use `to_json(struct(*))` to serialize the row. See [sink-python.md](sink-python.md). ## Best Practices -1. **Always cast `value` to a usable type** (`STRING`, `BINARY`) and parse with `from_json` / `from_avro` against an explicit schema. Don't carry `value` as bytes downstream. -2. **Add `_ingested_at`** for lag monitoring — see [streaming-patterns.md](streaming-patterns.md#monitoring-lag). -3. **Tune `maxOffsetsPerTrigger`** if downstream operations are bottlenecking. -4. **Don't set `failOnDataLoss = false`** unless you genuinely accept gaps. The default protects against retention-window data loss. -5. **Use the parent `databricks-core` skill** for secret-scope management. - ---- +1. Cast `value` to `STRING` / `BINARY` and parse with `from_json` / `from_avro` against an explicit schema. +2. Add `_ingested_at` — see [streaming-patterns.md#monitoring-lag](streaming-patterns.md#monitoring-lag). +3. Tune `maxOffsetsPerTrigger` if downstream operations bottleneck. +4. Don't set `failOnDataLoss = false` unless you accept retention-window gaps. 
## Common Issues | Issue | Fix | |-------|-----| -| `Unable to find Kafka source` | Confirm `format("kafka")` (Python) / `read_kafka` (SQL) and that the cluster has Kafka client libraries (default on serverless / DBR ML / standard runtimes). | +| `Unable to find Kafka source` | Confirm `format("kafka")` / `read_kafka`; default runtimes have Kafka client libs. | | `Connection refused` / SSL handshake | Verify `bootstrapServers` reachability and `kafka.security.protocol`. | -| Schema for `value` doesn't match | `from_json` returns `NULL` on parse failure — add a quarantine fanout on `data IS NULL` similar to [rescue-data quarantine](streaming-patterns.md#rescue-data-quarantine). | -| Increasing consumer lag | Bottleneck downstream — see [streaming-patterns.md](streaming-patterns.md#monitoring-lag) for lag table; tune cluster size / `maxOffsetsPerTrigger`. | -| `failOnDataLoss` error after a long pause | Kafka topic retention expired the offset checkpoint. Reset the pipeline (full refresh) or start from `earliest`. | +| `from_json` returns NULL | Schema mismatch — quarantine on `data IS NULL` (see [rescue-data quarantine](streaming-patterns.md#rescue-data-quarantine)). | +| Growing consumer lag | Downstream bottleneck — see [streaming-patterns.md#monitoring-lag](streaming-patterns.md#monitoring-lag); tune cluster size / `maxOffsetsPerTrigger`. | +| `failOnDataLoss` error after a pause | Kafka retention expired the offset checkpoint. Full refresh, or start from `earliest`. | diff --git a/skills/databricks-pipelines/references/materialized-view-python.md b/skills/databricks-pipelines/references/materialized-view-python.md index 856ae6f..a0d652a 100644 --- a/skills/databricks-pipelines/references/materialized-view-python.md +++ b/skills/databricks-pipelines/references/materialized-view-python.md @@ -1,144 +1,52 @@ -Materialized Views in Spark Declarative Pipelines enable batch processing of data with full refresh or incremental computation. 
+# Materialized Views (Python) -**NOTE:** This guide focuses on materialized views. For details on streaming tables (incremental processing with `spark.readStream`), use the API guide for `streamingTable` instead. +Batch processing with full refresh or incremental computation. For streaming tables, see [streaming-table-python.md](streaming-table-python.md). For the incremental-refresh operation-support table, see [materialized-view-sql.md](materialized-view-sql.md#incremental-refresh). -**API Reference:** - -**@dp.materialized_view() (Recommended)** -Decorator to define a materialized view. This is the recommended approach for creating materialized views. +## `@dp.materialized_view()` — preferred ```python @dp.materialized_view( - name="", - comment="", - spark_conf={"": ""}, - table_properties={"": ""}, - path="", - partition_cols=[""], - cluster_by_auto=True, - cluster_by=[""], - schema="schema-definition", - row_filter="row-filter-clause", - private=False + name="", + comment="", + spark_conf={...}, + table_properties={...}, + path="", + cluster_by=["", ...], # Liquid Clustering — preferred + cluster_by_auto=True, # let Databricks pick keys + partition_cols=[""], # legacy, prefer cluster_by — see performance.md#liquid-clustering + schema="col1 TYPE, ...", # supports GENERATED ALWAYS AS, MASK clauses, PK/FK constraints + row_filter="ROW FILTER my_catalog.my_schema.func ON (col)", + private=False, # True = pipeline-scoped, not published to UC ) -def my_materialized_view(): - return spark.read.table("source.data") +def my_mv(): + return spark.read.table("source.data") # must be a batch DataFrame ``` -**@dp.table() / @dlt.table() (Alternative for Materialized Views)** -In the older `dlt` module, the `@dlt.table` decorator was used to create both streaming tables and materialized views. 
The `@dp.table()` decorator in the `pyspark.pipelines` module still works in this way, but Databricks recommends using the `@dp.materialized_view()` decorator to create materialized views. Note that `@dp.table()` remains the standard decorator for streaming tables. - -```python -# Still works, but @dp.materialized_view() is preferred for materialized views -@dp.table( - name="", - comment="", - spark_conf={"": ""}, - table_properties={"": ""}, - path="", - partition_cols=[""], - cluster_by_auto=True, - cluster_by=[""], - schema="schema-definition", - row_filter="row-filter-clause", - private=False -) -def my_materialized_view(): - return spark.read.table("source.data") -``` - -Parameters: - -- `name` (str): Table name (defaults to function name) -- `comment` (str): Description for the table -- `spark_conf` (dict): Spark configurations for query execution -- `table_properties` (dict): Delta table properties -- `path` (str): Storage location for table data (defaults to managed location) -- `partition_cols` (list): Columns to partition the table by -- `cluster_by_auto` (bool): Enable automatic liquid clustering -- `cluster_by` (list): Columns to use as clustering keys for liquid clustering -- `schema` (str or StructType): Schema definition (SQL DDL string or StructType) - - Supports generated columns: `"order_datetime STRING, order_day STRING GENERATED ALWAYS AS (dayofweek(order_datetime))"` - - Supports constraints: Primary keys, foreign keys - - Supports column masks: `"ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region)"` -- `row_filter` (str): (Public Preview) A row filter clause that filters rows when fetched from the table. - - Must use syntax: `"ROW FILTER func_name ON (column_name [, ...])"` where `func_name` is a SQL UDF returning `BOOLEAN`. The UDF can be defined in Unity Catalog. - - Rows are filtered out when the function returns `FALSE` or `NULL`. 
- - You can pass table columns or constant literals (`STRING`, numeric, `BOOLEAN`, `INTERVAL`, `NULL`) as arguments. - - The filter is applied as soon as rows are fetched from the data source. - - The function runs with pipeline owner's rights during refresh and invoker's rights during queries (allowing user-context functions like `CURRENT_USER()` and `IS_MEMBER()` for data security). - - Note: Using row filters on source tables forces full refresh of downstream materialized views. - - Note: It is NOT possible to call `CREATE FUNCTION` within a Spark Declarative Pipeline. -- `private` (bool): Restricts table to pipeline scope; prevents metastore publication - -**Materialized View vs Streaming Table:** - -- **Materialized View**: Use `@dp.materialized_view()` decorator with function returning `spark.read...` (batch DataFrame) -- **Streaming Table**: Use `@dp.table()` decorator with function returning `spark.readStream...` (streaming DataFrame) - see the `streamingTable` API guide - -Note: When using `@dp.table()` with a batch DataFrame return type, a materialized view is created. However, `@dp.materialized_view()` is preferred for this use case. The `@dp.table()` decorator remains the standard approach for streaming tables (with streaming DataFrame return type). - -**Incremental Refresh for Materialized Views:** - -Materialized views on **serverless pipelines** support automatic incremental refresh, which processes only changes in underlying data since the last refresh rather than recomputing everything. This significantly reduces compute costs. - -**How it works:** +`@dp.table()` with a batch DataFrame return type also creates a materialized view (legacy DLT shape), but `@dp.materialized_view()` is the recommended decorator. Use `@dp.table` only for streaming tables now. 
-- Lakeflow Spark Declarative Pipelines uses a cost model to determine whether to perform incremental refresh or full recompute -- Incremental refresh processes delta changes and appends to the table -- If incremental refresh is not feasible or more expensive, the system falls back to full recompute automatically +For the detailed semantics of `row_filter` (UC SQL UDF returning BOOLEAN; forces full refresh of downstream MVs; cannot define the UDF inside the pipeline), see [streaming-table-python.md](streaming-table-python.md). -**Requirements for incremental refresh:** +## Incremental refresh -- Must run on **serverless pipelines** (not classic compute) -- Source tables must be Delta tables, materialized views, or streaming tables -- Row-tracking must be enabled on source tables for certain operations (see Notes column) +Requires **serverless** + Delta row tracking on source tables (`delta.enableRowTracking = true`). Falls back to full recompute otherwise. For the supported-operations matrix, see [materialized-view-sql.md](materialized-view-sql.md#incremental-refresh) — same support applies to the Python DataFrame equivalents. -**Supported SQL operations for incremental refresh (use PySpark DataFrame API equivalents in Python):** +For exactly-once semantics (Kafka, Auto Loader), use a streaming table instead. -| SQL Operation | Support | Notes | -| --------------------------- | ------- | ------------------------------------------------------------------------------------------------------- | -| SELECT expressions | Yes | Deterministic built-in functions and immutable UDFs. 
Requires row tracking | -| GROUP BY | Yes | — | -| WITH | Yes | Common table expressions | -| UNION ALL | Yes | Requires row tracking | -| FROM | Yes | Supported base tables include Delta tables, materialized views, and streaming tables | -| WHERE, HAVING | Yes | Requires row tracking | -| INNER JOIN | Yes | Requires row tracking | -| LEFT OUTER JOIN | Yes | Requires row tracking | -| FULL OUTER JOIN | Yes | Requires row tracking | -| RIGHT OUTER JOIN | Yes | Requires row tracking | -| OVER (Window functions) | Yes | Must specify PARTITION BY columns | -| QUALIFY | Yes | — | -| EXPECTATIONS | Partial | Generally supported; exceptions for views with expectations and DROP expectations with NOT NULL columns | -| Non-deterministic functions | Limited | Time functions like `current_date()` supported in WHERE clauses only | -| Non-Delta sources | No | Volumes, external locations, foreign catalogs unsupported | +## Patterns -**Limitations:** - -- Falls back to full recompute when incremental is more expensive or query uses unsupported expressions - -**Best practices:** - -- Enable deletion vectors, row tracking, and change data feed on source tables for optimal incremental refresh -- Design queries with supported operations to leverage incremental refresh -- For exactly-once processing semantics (Kafka, Auto Loader), use streaming tables instead - -**Common Patterns:** - -**Pattern 1: Simple batch transformation** +### Aggregation with clustering ```python -@dp.materialized_view() -def bronze_batch(): - return spark.read.format("parquet").load("/path/to/data") - -@dp.materialized_view() -def silver_batch(): - return spark.read.table("bronze_batch").filter("id IS NOT NULL") +@dp.materialized_view(name="daily_sales_summary", cluster_by=["sale_date", "region"]) +def daily_sales_summary(): + return (spark.read.table("raw.orders") + .withColumn("sale_date", F.to_date("order_timestamp")) + .groupBy("sale_date", "region") + .agg(F.count("*").alias("order_count"), + 
F.sum("amount").alias("total_revenue"))) ``` -**Pattern 2: Schema with generated columns** +### Generated columns ```python @dp.materialized_view( @@ -148,45 +56,29 @@ def silver_batch(): customer_id BIGINT, amount DECIMAL(10,2) """, - cluster_by=["order_day_of_week", "customer_id"] + cluster_by=["order_day_of_week", "customer_id"], ) def orders_with_day(): return spark.read.table("raw.orders") ``` -**Pattern 3: Row filters for data security** +### Row filter / column masking (UC, Public Preview) ```python -# Assumes filter_by_dept is a SQL UDF defined in Unity Catalog that returns BOOLEAN - @dp.materialized_view( name="employees", schema="emp_id INT, emp_name STRING, dept STRING, salary DECIMAL(10,2)", - row_filter="ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept)" + row_filter="ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept)", ) def employees(): return spark.read.table("source.employees") ``` -**Pattern 4: Column masking for sensitive data** - -```python -@dp.materialized_view( - schema=""" - user_id BIGINT, - ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region), - region STRING - """ -) -def users_with_masked_ssn(): - return spark.read.table("raw.users") -``` +Column masking uses `MASK ... USING COLUMNS (...)` inside the `schema=` string — same form as in SQL. -**KEY RULES:** +## Key rules -- Use `@dp.materialized_view()` for materialized views (preferred over `@dp.table()`) -- Materialized views use `spark.read` (batch reads) -- Streaming tables use `spark.readStream` (streaming reads) - see the `streamingTable` API guide -- Never use `.write`, `.save()`, `.saveAsTable()`, or `.toTable()` - Databricks manages writes automatically -- Generated columns, constraints, and masks require schema definition -- Row filters force full refresh of downstream materialized views +- MVs use `spark.read` (batch); streaming tables use `spark.readStream`. +- Never `.write`, `.save()`, `.saveAsTable()`, `.toTable()` — Databricks manages writes. 
+- Generated columns, PK/FK constraints, and MASK clauses require an explicit `schema=`. +- Row filters on source tables force full refresh of downstream MVs. diff --git a/skills/databricks-pipelines/references/materialized-view-sql.md b/skills/databricks-pipelines/references/materialized-view-sql.md index 5851f39..d351da1 100644 --- a/skills/databricks-pipelines/references/materialized-view-sql.md +++ b/skills/databricks-pipelines/references/materialized-view-sql.md @@ -1,187 +1,96 @@ -Materialized Views in Lakeflow Spark Declarative Pipelines enable batch processing of data with full refresh or incremental computation. +# Materialized Views (SQL) -**NOTE:** This guide focuses on materialized views. For details on streaming tables (incremental processing with streaming reads), use the API guide for `streamingTable` instead. +Batch processing with full refresh or incremental computation. For streaming tables (incremental streaming), see [streaming-table-sql.md](streaming-table-sql.md). -**SQL Syntax:** - -**CREATE MATERIALIZED VIEW** -Creates a materialized view for batch data processing. For streaming tables, see the `CREATE STREAMING TABLE` guide. +## Syntax ```sql -CREATE OR REFRESH [PRIVATE] MATERIALIZED VIEW - view_name - [ column_list ] - [ view_clauses ] +CREATE OR REFRESH [PRIVATE] MATERIALIZED VIEW view_name + [ ( col_name col_type [NOT NULL] [COMMENT '...'] [column_constraint | MASK clause] + [, ...] + [, CONSTRAINT name EXPECT (cond) [ON VIOLATION DROP ROW | FAIL UPDATE]] + [, table_constraint] ) ] + [ PARTITIONED BY (col, ...) | CLUSTER BY (col, ...) ] -- prefer CLUSTER BY + [ LOCATION path ] -- Hive metastore only + [ COMMENT '...' ] + [ TBLPROPERTIES (key = value, ...) ] + [ WITH ROW FILTER func_name ON (col, ...) ] AS query - -column_list - ( { column_name column_type column_properties } [, ...] - [ column_constraint ] [, ...] - [ , table_constraint ] [...] 
) - - column_properties - { NOT NULL | COMMENT column_comment | column_constraint | MASK clause } [ ... ] - -view_clauses - { USING DELTA | - PARTITIONED BY (col [, ...]) | - CLUSTER BY clause | - LOCATION path | - COMMENT view_comment | - TBLPROPERTIES clause | - WITH { ROW FILTER clause } } [...] ``` -**Parameters:** - -- `PRIVATE`: Restricts table to pipeline scope; prevents metastore publication -- `view_name`: Unique identifier for the view (fully qualified name including catalog and schema must be unique unless marked PRIVATE) -- `column_list`: Optional schema definition with column names, types, and properties - - `column_name`: Name of the column - - `column_type`: Data type (STRING, BIGINT, DECIMAL, etc.) - - `column_properties`: Column attributes: - - `NOT NULL`: Column cannot contain null values - - `COMMENT column_comment`: Description for the column - - `column_constraint`: Data quality constraints, consult the `expectations` API guide for details. - - `MASK clause`: Column masking syntax `MASK catalog.schema.mask_fn USING COLUMNS (other_column)` (Public Preview) - - `table_constraint`: Informational table-level constraints (Unity Catalog only, **not enforced** by Databricks): - - Look up exact documentation when using - - Note: Constraints are informational metadata for documentation and query optimization hints; data validation must be performed independently -- `view_clauses`: Optional clauses for view configuration: - - `USING DELTA`: Optional format specification (only DELTA supported, can be omitted) - - `PARTITIONED BY (col [, ...])`: Columns for traditional partitioning, mutually exclusive with CLUSTER BY - - `CLUSTER BY clause`: Columns for liquid clustering (optimized query performance) - - `LOCATION path`: Storage path (Hive metastore only) - - `COMMENT view_comment`: Description for the view - - `TBLPROPERTIES clause`: Custom table properties `(key = value [, ...])` - - `WITH ROW FILTER clause`: Row-level security filtering - - Syntax: `ROW 
FILTER func_name ON (column_name [, ...])` (Public Preview) - - `func_name` must be a SQL UDF returning BOOLEAN (can be defined in Unity Catalog) - - Rows are filtered out when function returns FALSE or NULL - - Accepts table columns or constant literals (STRING, numeric, BOOLEAN, INTERVAL, NULL) - - Filter applies when rows are fetched from the data source - - Runs with pipeline owner's rights during refresh and invoker's rights during queries - - Note: Using row filters on source tables forces full refresh of downstream materialized views - - Note: It is NOT possible to call `CREATE FUNCTION` within a Spark Declarative Pipeline. -- `query`: A Spark SQL query that defines the dataset for the table - -**Incremental Refresh for Materialized Views:** - -Materialized views on **serverless pipelines** support automatic incremental refresh, which processes only changes in underlying data since the last refresh rather than recomputing everything. This significantly reduces compute costs. - -**How it works:** - -- Lakeflow Spark Declarative Pipelines uses a cost model to determine whether to perform incremental refresh or full recompute -- Incremental refresh processes delta changes and appends to the table -- If incremental refresh is not feasible or more expensive, the system falls back to full recompute automatically - -**Requirements for incremental refresh:** - -- Must run on **serverless pipelines** (not classic compute) -- Source tables must be Delta tables, materialized views, or streaming tables -- Row-tracking must be enabled on source tables for certain operations (see Notes column) - -**Supported SQL operations for incremental refresh:** - -| SQL Operation | Support | Notes | -| --------------------------- | ------- | ------------------------------------------------------------------------------------------------------- | -| SELECT expressions | Yes | Deterministic built-in functions and immutable UDFs. 
Requires row tracking | -| GROUP BY | Yes | — | -| WITH | Yes | Common table expressions | -| UNION ALL | Yes | Requires row tracking | -| FROM | Yes | Supported base tables include Delta tables, materialized views, and streaming tables | -| WHERE, HAVING | Yes | Requires row tracking | -| INNER JOIN | Yes | Requires row tracking | -| LEFT OUTER JOIN | Yes | Requires row tracking | -| FULL OUTER JOIN | Yes | Requires row tracking | -| RIGHT OUTER JOIN | Yes | Requires row tracking | -| OVER (Window functions) | Yes | Must specify PARTITION BY columns | -| QUALIFY | Yes | — | -| EXPECTATIONS | Partial | Generally supported; exceptions for views with expectations and DROP expectations with NOT NULL columns | -| Non-deterministic functions | Limited | Time functions like `current_date()` supported in WHERE clauses only | -| Non-Delta sources | No | Volumes, external locations, foreign catalogs unsupported | - -**Best practices:** - -- Enable deletion vectors, row tracking, and change data feed on source tables for optimal incremental refresh -- Design queries with supported operations to leverage incremental refresh -- For exactly-once processing semantics (Kafka, Auto Loader), use streaming tables instead - -**Common Patterns:** - -**Pattern 1: Simple batch transformation** +Clause notes (same semantics as streaming tables — see [streaming-table-sql.md](streaming-table-sql.md) for the detailed treatment of `PRIVATE`, `MASK`, `WITH ROW FILTER`, and informational table constraints): -```sql -CREATE MATERIALIZED VIEW bronze_batch -AS SELECT * FROM delta.`/path/to/data`; +- `query` must NOT use `STREAM(...)` — MVs are batch. Streaming reads belong in a streaming table. +- PRIMARY KEY requires explicit `NOT NULL`. +- Generated columns supported via `col TYPE GENERATED ALWAYS AS (expr)`. +- Identity columns, default columns, and explicit `OPTIMIZE` / `VACUUM` are not supported (the pipeline handles maintenance). 
+- Non-column expressions in the SELECT list require explicit aliases. +- Sum aggregates over a nullable column return `0` (not NULL) when only NULLs remain. + +## Incremental refresh + +MVs use incremental refresh automatically when possible. Falls back to full recompute otherwise. + +**Requirements**: serverless pipeline, source is Delta / MV / streaming table, row tracking enabled on sources (for ops marked below). + +| SQL operation | Support | Notes | +|---|---|---| +| `SELECT` expressions | Yes | Deterministic built-ins / immutable UDFs. Requires row tracking. | +| `WHERE`, `HAVING` | Yes | Requires row tracking. | +| `GROUP BY`, `WITH`, `QUALIFY` | Yes | — | +| `UNION ALL` | Yes | Requires row tracking. | +| `INNER` / `LEFT` / `RIGHT` / `FULL OUTER JOIN` | Yes | Requires row tracking. | +| `OVER` (window functions) | Yes | Must specify `PARTITION BY`. | +| Expectations | Partial | Views-with-expectations and `DROP ROW` on `NOT NULL` columns are exceptions. | +| Non-deterministic functions | Limited | `current_date()` etc. allowed in `WHERE` only. | +| Non-Delta sources | No | Volumes, external locations, foreign catalogs not supported. | -CREATE MATERIALIZED VIEW silver_batch -AS SELECT * FROM bronze_batch WHERE id IS NOT NULL; +Enable `delta.enableRowTracking = true`, `delta.enableChangeDataFeed = true`, and deletion vectors on source tables for the best incremental coverage. For exactly-once semantics (Kafka, Auto Loader), use a streaming table instead. 
+ +## Patterns + +### Aggregation with Liquid Clustering + +```sql +CREATE OR REFRESH MATERIALIZED VIEW daily_sales_summary +CLUSTER BY (sale_date, region) +AS +SELECT DATE(order_timestamp) AS sale_date, region, + COUNT(*) AS order_count, SUM(amount) AS total_revenue +FROM raw.orders +GROUP BY DATE(order_timestamp), region; ``` -**Pattern 2: Schema with generated columns** +### Generated column ```sql -CREATE MATERIALIZED VIEW orders_with_day ( - order_datetime STRING, +CREATE OR REFRESH MATERIALIZED VIEW orders_with_day ( + order_datetime STRING, order_day_of_week STRING GENERATED ALWAYS AS (dayofweek(order_datetime)), - customer_id BIGINT, - amount DECIMAL(10,2) + customer_id BIGINT, + amount DECIMAL(10,2) ) CLUSTER BY (order_day_of_week, customer_id) AS SELECT order_datetime, customer_id, amount FROM raw.orders; ``` -**Pattern 3: Row filters for data security** +### Row filter (UC, Public Preview) ```sql --- Assumes filter_by_dept is a SQL UDF defined in Unity Catalog that returns BOOLEAN - -CREATE MATERIALIZED VIEW employees ( - emp_id INT, - emp_name STRING, - dept STRING, - salary DECIMAL(10,2) +CREATE OR REFRESH MATERIALIZED VIEW employees ( + emp_id INT, emp_name STRING, dept STRING, salary DECIMAL(10,2) ) WITH ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept) AS SELECT * FROM source.employees; ``` -**Pattern 4: Column masking for sensitive data** +### Column masking (UC, Public Preview) ```sql -CREATE MATERIALIZED VIEW users_with_masked_ssn ( +CREATE OR REFRESH MATERIALIZED VIEW users_with_masked_ssn ( user_id BIGINT, - ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region), - region STRING + ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region), + region STRING ) AS SELECT user_id, ssn, region FROM raw.users; ``` - -**Pattern 5: Aggregations with liquid clustering** - -```sql -CREATE MATERIALIZED VIEW daily_sales_summary -CLUSTER BY (sale_date, region) -AS -SELECT - DATE(order_timestamp) AS sale_date, - region, - COUNT(*) AS 
order_count, - SUM(amount) AS total_revenue -FROM raw.orders -GROUP BY DATE(order_timestamp), region; -``` - -**KEY RULES:** - -- Materialized views perform batch processing of data -- Streaming tables perform incremental streaming processing - see the `streamingTable` guide -- Identity columns, and default columns are not supported -- Row filters force full refresh of downstream materialized views -- Sum aggregates over nullable columns return zero instead of NULL when only nulls remain (when last non-NULL value is removed) -- Non-column expressions require explicit aliases (column references do not need aliases) -- PRIMARY KEY requires explicit NOT NULL specification to be valid -- OPTIMIZE and VACUUM commands unavailable, Lakeflow Declarative Pipelines handles maintenance automatically -- `CLUSTER BY` is recommended over `PARTITIONED BY` for most use cases -- Table renaming and ownership changes prohibited diff --git a/skills/databricks-pipelines/references/materialized-view.md b/skills/databricks-pipelines/references/materialized-view.md deleted file mode 100644 index e23fa0b..0000000 --- a/skills/databricks-pipelines/references/materialized-view.md +++ /dev/null @@ -1,19 +0,0 @@ -# Materialized Views in Spark Declarative Pipelines - -Materialized views store the results of a query physically, enabling faster query performance for expensive transformations and aggregations. 
- -## Key Concepts - -Materialized views in Spark Declarative Pipelines: - -- Physically store query results -- Are incrementally refreshed when source data changes -- Support complex transformations and aggregations -- Published to Unity Catalog - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [materialized-view-python.md](materialized-view-python.md) -- **SQL**: [materialized-view-sql.md](materialized-view-sql.md) diff --git a/skills/databricks-pipelines/references/performance.md b/skills/databricks-pipelines/references/performance.md index cf82997..2fe135c 100644 --- a/skills/databricks-pipelines/references/performance.md +++ b/skills/databricks-pipelines/references/performance.md @@ -1,44 +1,27 @@ # Performance Tuning -Performance patterns for Spark Declarative Pipelines: Liquid Clustering, state management for streaming, join strategy, query optimization, and pre-aggregation. Examples are shown in both SQL and Python. +Liquid Clustering, state management for streaming, join strategy, query optimization, pre-aggregation. SQL is shown as canonical; Python equivalents use the obvious `@dp.table` + DataFrame translation (`cluster_by=[...]`, `table_properties={...}`). --- ## Liquid Clustering -**Recommended** for data layout. Replaces `PARTITION BY` + `ZORDER`. Adaptive, multi-dimensional, self-optimizing — no more manual `OPTIMIZE`. - -### Basic syntax +**Recommended** for data layout. Replaces `PARTITION BY` + `ZORDER`. Adaptive, multi-dimensional, self-optimizing — no manual `OPTIMIZE` needed. 
```sql CREATE OR REFRESH STREAMING TABLE bronze_events CLUSTER BY (event_type, event_date) -AS -SELECT *, current_timestamp() AS _ingested_at +AS SELECT *, current_timestamp() AS _ingested_at FROM STREAM read_files('/Volumes/cat/sch/raw/events/', format => 'json'); ``` -```python -@dp.table(cluster_by=["event_type", "event_date"]) -def bronze_events(): - return spark.readStream.format("cloudFiles").load("/Volumes/cat/sch/raw/events/") -``` - -### Automatic key selection +Python: `@dp.table(cluster_by=["event_type", "event_date"])`. -```sql -CLUSTER BY (AUTO) -``` - -```python -cluster_by=["AUTO"] -``` - -Use `AUTO` while learning the workload, prototyping, or when access patterns are unclear. Pick keys manually for production once query patterns are stable. +Use `CLUSTER BY (AUTO)` / `cluster_by=["AUTO"]` while learning the workload, prototyping, or when access patterns are unclear. Pick keys manually for production once query patterns are stable. ### Cluster key data types -**Cluster keys must be numeric, string, date, or timestamp.** `BOOLEAN`, `ARRAY`, `MAP`, `STRUCT`, `BINARY` are rejected at runtime with `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED`. Low-cardinality flags also don't benefit from clustering — leave them out. +**Numeric, string, date, or timestamp only.** `BOOLEAN`, `ARRAY`, `MAP`, `STRUCT`, `BINARY` fail at first write with `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED` (no data-skipping stats). Low-cardinality flags also don't benefit from clustering — leave them out. ### Cluster key selection by layer @@ -48,180 +31,64 @@ Use `AUTO` while learning the workload, prototyping, or when access patterns are | **Silver** | `primary_key`, `business_date` | Entity lookups + time-range queries. | | **Gold** | aggregation dimensions | Dashboard filters. | -**Rules of thumb**: -- First key: most-selective filter (e.g. `customer_id`). -- Second key: next-most-common filter (e.g. date). -- Order matters. Most-selective first. 
-- Limit to **4 keys** — diminishing returns beyond that. -- Use `AUTO` if unsure. - -### Bronze example - -```sql -CREATE OR REFRESH STREAMING TABLE bronze_events -CLUSTER BY (event_type, ingestion_date) -TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true') -AS -SELECT *, - current_timestamp() AS _ingested_at, - CAST(current_date() AS DATE) AS ingestion_date -FROM STREAM read_files('/Volumes/cat/sch/raw/events/', format => 'json'); -``` - -```python -@dp.table( - name="bronze_events", - cluster_by=["event_type", "ingestion_date"], - table_properties={"delta.autoOptimize.optimizeWrite": "true"}, -) -def bronze_events(): - return ( - spark.readStream.format("cloudFiles") - .option("cloudFiles.format", "json") - .load("/Volumes/cat/sch/raw/events/") - .withColumn("_ingested_at", F.current_timestamp()) - .withColumn("ingestion_date", F.current_date()) - ) -``` - -### Silver example (clustering for joins + time filters) - -```sql -CREATE OR REFRESH STREAMING TABLE silver_orders -CLUSTER BY (customer_id, order_date) -AS -SELECT order_id, customer_id, product_id, - CAST(amount AS DECIMAL(10,2)) AS amount, -- DECIMAL for monetary - CAST(order_timestamp AS DATE) AS order_date, - order_timestamp -FROM STREAM bronze_orders; -``` - -```python -@dp.table(name="silver_orders", cluster_by=["customer_id", "order_date"]) -def silver_orders(): - return ( - spark.readStream.table("bronze_orders") - .withColumn("order_date", F.to_date("order_timestamp")) - .select("order_id", "customer_id", "product_id", "amount", "order_date") - ) -``` - -### Gold example (clustering on aggregation dimensions) - -```sql -CREATE OR REFRESH MATERIALIZED VIEW gold_sales_summary -CLUSTER BY (product_category, year_month) -AS -SELECT product_category, - DATE_FORMAT(order_date, 'yyyy-MM') AS year_month, - SUM(amount) AS total_sales, - COUNT(*) AS transaction_count, - AVG(amount) AS avg_order_value -FROM silver_orders -GROUP BY product_category, DATE_FORMAT(order_date, 'yyyy-MM'); -``` - -```python 
-@dp.materialized_view(name="gold_sales_summary", cluster_by=["product_category", "year_month"]) -def gold_sales_summary(): - return ( - spark.read.table("silver_orders") - .withColumn("year_month", F.date_format("order_date", "yyyy-MM")) - .groupBy("product_category", "year_month") - .agg(F.sum("amount").alias("total_sales"), - F.count("*").alias("transaction_count"), - F.avg("amount").alias("avg_order_value")) - ) -``` +Rules of thumb: most-selective key first, second-most-common filter second; order matters; cap at 4 keys (diminishing returns beyond). Use `AUTO` if unsure. ### Migrating from `PARTITION BY` + `ZORDER` -Before (legacy): +Replace: ```sql -CREATE OR REFRESH STREAMING TABLE events PARTITIONED BY (date DATE) TBLPROPERTIES ('pipelines.autoOptimize.zOrderCols' = 'user_id,event_type') -AS SELECT ...; ``` -After: +with: ```sql -CREATE OR REFRESH STREAMING TABLE events CLUSTER BY (date, user_id, event_type) -AS SELECT ...; ``` -Typical wins: 20–50% query improvement, no small-file problem, automatic optimization, no manual `OPTIMIZE` job. - -**Keep `PARTITION BY` only for**: regulatory requirements (physical separation), data-lifecycle (need to `DROP` partitions for retention), DBR < 13.3 compatibility, or existing huge tables where migration cost > benefit. +Typical wins: 20–50% query improvement, no small-file problem, automatic optimization. **Keep `PARTITION BY` only for**: regulatory physical separation, data lifecycle requiring `DROP PARTITION`, DBR < 13.3 compatibility, or huge existing tables where migration cost > benefit. 
--- ## Table Properties -### Auto-optimize - ```sql TBLPROPERTIES ( - 'delta.autoOptimize.optimizeWrite' = 'true', - 'delta.autoOptimize.autoCompact' = 'true' + 'delta.autoOptimize.optimizeWrite' = 'true', -- right-size new files on write + 'delta.autoOptimize.autoCompact' = 'true', -- compact small files automatically + 'delta.enableChangeDataFeed' = 'true', -- if downstream needs CDF + 'delta.logRetentionDuration' = '7 days', -- high-volume tables only + 'delta.deletedFileRetentionDuration' = '7 days' -- shortens time-travel window ) ``` -```python -table_properties={ - "delta.autoOptimize.optimizeWrite": "true", - "delta.autoOptimize.autoCompact": "true", -} -``` - -### Change Data Feed - -```sql -TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true') -``` - -Enable when downstream systems need efficient change tracking. +Python: `table_properties={"delta.autoOptimize.optimizeWrite": "true", ...}`. -### Retention (high-volume tables) - -```sql -TBLPROPERTIES ( - 'delta.logRetentionDuration' = '7 days', - 'delta.deletedFileRetentionDuration' = '7 days' -) -``` - -Use for high-volume tables to reduce storage cost. Be careful: short retention windows break time-travel queries beyond the window. +Short retention windows break time-travel queries beyond the window — only set on high-volume tables where storage cost dominates. --- ## Materialized View Refresh ```sql --- Near-real-time CREATE OR REFRESH MATERIALIZED VIEW gold_live_metrics -REFRESH EVERY 5 MINUTES +REFRESH EVERY 5 MINUTES -- or REFRESH EVERY 1 DAY for batch reports AS SELECT metric_name, AVG(metric_value) AS avg_value, MAX(last_updated) AS freshness FROM silver_metrics GROUP BY metric_name; - --- Daily -CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary -REFRESH EVERY 1 DAY -AS SELECT report_date, SUM(amount) AS total_amount - FROM silver_sales GROUP BY report_date; ``` ### Incremental refresh MVs use incremental refresh automatically when possible. 
Requirements: -- Source has Delta row tracking enabled. -- No row-level filters. + +- **Serverless pipeline** (incremental refresh for aggregations is serverless-only). +- Source has Delta row tracking enabled (`delta.enableRowTracking = true`). +- No row-level filters on the source. - Aggregation/expression pattern is supported. -- **Serverless pipeline.** Incremental refresh for aggregations is a serverless feature. + +Falls back to full recompute if any requirement isn't met. --- @@ -231,59 +98,34 @@ Higher cardinality → more state. Watch the combinations in `GROUP BY`. ```sql -- High state: every unique combination creates state -SELECT user_id, product_id, session_id, COUNT(*) AS events -FROM STREAM bronze_events -GROUP BY user_id, product_id, session_id; -- 1M × 10K × 100M — massive +SELECT user_id, product_id, session_id, COUNT(*) +FROM STREAM(bronze_events) +GROUP BY user_id, product_id, session_id; -- 1M × 10K × 100M — massive ``` -### Strategy 1: reduce cardinality +Three strategies to bound state: + +**1. Reduce cardinality** — group by coarser keys. ```sql -- 100 categories instead of 10K products -SELECT user_id, product_category, DATE(event_time) AS event_date, COUNT(*) AS events -FROM STREAM bronze_events -GROUP BY user_id, product_category, DATE(event_time); -``` - -```python -@dp.table(name="user_category_stats") -def user_category_stats(): - return ( - spark.readStream.table("bronze_events") - .groupBy("user_id", "product_category", - F.to_date("event_time").alias("event_date")) - .agg(F.count("*").alias("events")) - ) +GROUP BY user_id, product_category, DATE(event_time) ``` -### Strategy 2: use time windows +**2. Use time windows** — explicit retention boundary. 
```sql -SELECT user_id, window(event_time, '1 hour') AS time_window, COUNT(*) AS events -FROM STREAM bronze_events -GROUP BY user_id, window(event_time, '1 hour'); +GROUP BY user_id, window(event_time, '1 hour') ``` -```python -@dp.table(name="user_hourly_stats") -def user_hourly_stats(): - return ( - spark.readStream.table("bronze_events") - .groupBy("user_id", F.window("event_time", "1 hour")) - .agg(F.count("*").alias("events")) - ) -``` - -### Strategy 3: materialize intermediates (move state to batch) +**3. Materialize daily then aggregate batch monthly** — move state from streaming to batch. ```sql --- Streaming aggregation (maintains state) CREATE OR REFRESH STREAMING TABLE user_daily_stats AS SELECT user_id, DATE(event_time) AS event_date, COUNT(*) AS event_count -FROM STREAM bronze_events +FROM STREAM(bronze_events) GROUP BY user_id, DATE(event_time); --- Batch aggregation on top (no streaming state) CREATE OR REFRESH MATERIALIZED VIEW user_monthly_stats AS SELECT user_id, DATE_TRUNC('month', event_date) AS month, SUM(event_count) AS total_events FROM user_daily_stats @@ -296,85 +138,53 @@ GROUP BY user_id, DATE_TRUNC('month', event_date); ### Stream-to-static (efficient) +Small static dimensions broadcast naturally — no special config needed. 
+ ```sql --- Small static dimension joined to large streaming fact CREATE OR REFRESH STREAMING TABLE sales_enriched AS SELECT s.sale_id, s.product_id, s.amount, p.product_name, p.category -FROM STREAM bronze_sales s +FROM STREAM(bronze_sales) s LEFT JOIN dim_products p ON s.product_id = p.product_id; ``` -```python -@dp.table(name="sales_enriched") -def sales_enriched(): - sales = spark.readStream.table("bronze_sales") - products = spark.read.table("dim_products") # static, broadcastable - return sales.join(products, "product_id", "left") \ - .select("sale_id", "product_id", "amount", "product_name", "category") -``` +Python: `sales = spark.readStream.table("bronze_sales")` / `products = spark.read.table("dim_products")` (static, broadcastable) / `sales.join(products, "product_id", "left")`. **Rule**: keep static dimensions small (< 10K rows) so they broadcast. ### Stream-to-stream (stateful, time-bounded) +Always bound by event-time interval. Without bounds, state grows unbounded. + ```sql --- Time bounds limit state retention CREATE OR REFRESH STREAMING TABLE orders_with_payments AS SELECT o.order_id, o.amount AS order_amount, p.payment_id, p.amount AS payment_amount -FROM STREAM bronze_orders o -INNER JOIN STREAM bronze_payments p +FROM STREAM(bronze_orders) o +INNER JOIN STREAM(bronze_payments) p ON o.order_id = p.order_id AND p.payment_time BETWEEN o.order_time AND o.order_time + INTERVAL 1 HOUR; ``` -```python -@dp.table(name="orders_with_payments") -def orders_with_payments(): - orders = spark.readStream.table("bronze_orders") - payments = spark.readStream.table("bronze_payments") - return orders.join( - payments, - (orders.order_id == payments.order_id) & - (payments.payment_time >= orders.order_time) & - (payments.payment_time <= orders.order_time + F.expr("INTERVAL 1 HOUR")), - "inner", - ) -``` - -Without time bounds, stream-to-stream state grows unbounded. 
+Python: same shape, time-bound predicate as `(p.payment_time >= o.order_time) & (p.payment_time <= o.order_time + F.expr("INTERVAL 1 HOUR"))`. --- ## Query Optimization -### Filter early +**Filter early** — push filters into the streaming read so downstream MV inputs stay small. The anti-pattern is wide-open silver tables filtered later in gold MVs — every row is processed twice. ```sql --- Filter at source CREATE OR REFRESH STREAMING TABLE silver_recent AS -SELECT * -FROM STREAM bronze_events +SELECT * FROM STREAM(bronze_events) WHERE event_date >= CURRENT_DATE() - INTERVAL 7 DAYS; ``` -```python -@dp.table(name="silver_recent") -def silver_recent(): - return (spark.readStream.table("bronze_events") - .filter(F.col("event_date") >= F.current_date() - 7)) -``` - -Pushing filters into the streaming read keeps downstream MV inputs small. The anti-pattern is wide-open silver tables filtered later in gold MVs — every row is processed twice. - -### Select specific columns - -Skip `SELECT *` once schema is stable. Narrowed projections enable column pruning in Delta and shrink wire/state size for stateful operations. +**Skip `SELECT *`** once schema is stable. Narrow projections enable Delta column pruning and shrink wire/state size for stateful operations. --- ## Pre-Aggregation -When the same coarse aggregation is queried frequently, materialize it. Querying the MV is far cheaper than re-aggregating the underlying table. +When the same coarse aggregation is queried frequently, materialize it. 
```sql CREATE OR REFRESH MATERIALIZED VIEW orders_monthly AS @@ -382,20 +192,9 @@ SELECT customer_id, YEAR(order_date) AS year, MONTH(order_date) AS month, SUM(amount) AS total FROM large_orders_table GROUP BY customer_id, YEAR(order_date), MONTH(order_date); - --- Query the MV directly -SELECT * FROM orders_monthly WHERE year = 2024; ``` -```python -@dp.materialized_view(name="orders_monthly") -def orders_monthly(): - return (spark.read.table("large_orders_table") - .groupBy("customer_id", - F.year("order_date").alias("year"), - F.month("order_date").alias("month")) - .agg(F.sum("amount").alias("total"))) -``` +Querying `orders_monthly` is far cheaper than re-aggregating the underlying table. --- @@ -406,7 +205,7 @@ def orders_monthly(): | Startup | Seconds | Minutes | | Scaling | Automatic, instant | Manual / autoscale | | Cost | Pay-per-use | Pay for cluster time | -| Best for | Variable / dev / test / most prod | Steady, very long-running workloads with special requirements | +| Best for | Variable / dev / test / most prod | Steady long-running workloads with special requirements | **Default to serverless.** Switch to classic only when R, Spark RDD APIs, JAR/Maven libraries, or other serverless-incompatible features are required — see [pipeline-configuration.md](pipeline-configuration.md#serverless-limitations-force-classic-clusters). @@ -417,64 +216,12 @@ def orders_monthly(): ```sql SELECT table_name, MAX(event_timestamp) AS latest_event, - CURRENT_TIMESTAMP() AS now, TIMESTAMPDIFF(MINUTE, MAX(event_timestamp), CURRENT_TIMESTAMP()) AS lag_minutes FROM pipeline_monitoring.table_metrics GROUP BY table_name; ``` -Watch for: - -1. Slow streaming tables (high processing lag). -2. Large state operations (high memory). -3. Expensive joins (long batch durations). -4. Small-file accumulation (raise auto-optimize, check write patterns). 
- ---- - -## Complete Example (Python) - -```python -from pyspark import pipelines as dp -from pyspark.sql import functions as F - -@dp.table( - name="bronze_orders", - cluster_by=["order_date"], - table_properties={ - "delta.autoOptimize.optimizeWrite": "true", - "delta.autoOptimize.autoCompact": "true", - }, -) -def bronze_orders(): - return ( - spark.readStream.format("cloudFiles") - .option("cloudFiles.format", "json") - .load("/Volumes/cat/sch/raw/orders/") - .withColumn("_ingested_at", F.current_timestamp()) - .withColumn("order_date", F.to_date("order_timestamp")) - ) - -@dp.table(name="silver_orders", cluster_by=["customer_id", "order_date"]) -@dp.expect_or_drop("valid_amount", "amount > 0") -def silver_orders(): - return ( - spark.readStream.table("bronze_orders") - .filter(F.col("order_date") >= F.current_date() - 90) # filter early - .withColumn("amount", F.col("amount").cast("decimal(10,2)")) - .select("order_id", "customer_id", "amount", "order_date") - ) - -@dp.materialized_view(name="gold_daily_revenue", cluster_by=["order_date"]) -def gold_daily_revenue(): - return ( - spark.read.table("silver_orders") - .groupBy("order_date") - .agg(F.sum("amount").alias("total_revenue"), - F.count("order_id").alias("order_count"), - F.countDistinct("customer_id").alias("unique_customers")) - ) -``` +Watch for slow streaming tables (high processing lag), large state ops (memory), expensive joins (long batch durations), small-file accumulation (raise auto-optimize). --- @@ -484,7 +231,7 @@ def gold_daily_revenue(): |-------|-------------| | Pipeline running slowly | Check clustering keys, state size, join patterns. | | High memory usage | Unbounded state — add time windows, reduce cardinality. | -| Many small files | Enable auto-optimize table properties. | +| Many small files | Enable `delta.autoOptimize.optimizeWrite` + `autoCompact`. | | Expensive queries on large tables | Add clustering on filter columns, build pre-aggregated MVs. 
| -| MV refresh slow | Enable row tracking on source, verify the refresh is actually incremental. | -| `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED` | A cluster key has an unsupported type (BOOLEAN / complex). Replace with a numeric / string / date / timestamp column. | +| MV refresh slow / not incremental | Enable row tracking on source; verify serverless. | +| `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED` | A cluster key is BOOLEAN / ARRAY / MAP / STRUCT / BINARY. Replace with numeric / string / date / timestamp. | diff --git a/skills/databricks-pipelines/references/pipeline-configuration.md b/skills/databricks-pipelines/references/pipeline-configuration.md index f102e57..0be40d5 100644 --- a/skills/databricks-pipelines/references/pipeline-configuration.md +++ b/skills/databricks-pipelines/references/pipeline-configuration.md @@ -4,7 +4,9 @@ JSON field reference for `databricks pipelines create --json '{...}'` and `datab Defaults to **serverless + Unity Catalog**. Don't set `serverless: false` unless the user explicitly needs R, Spark RDD APIs, or JAR / Maven libraries. -## Canonical Create +## Canonical Create (dev / iteration defaults) + +For dev, demo, and iteration work, always pass these fields: ```bash databricks pipelines create --json '{ @@ -13,14 +15,26 @@ databricks pipelines create --json '{ "schema": "my_schema", "serverless": true, "continuous": false, + "development": true, "channel": "PREVIEW", + "configuration": { + "pipelines.numUpdateRetryAttempts": "0", + "pipelines.maxFlowRetryAttempts": "0" + }, "libraries": [{"glob": {"include": "/Workspace/Users//my_pipeline/**"}}] }' ``` -**Always pass `"continuous": false` explicitly.** A continuous pipeline auto-restarts failed updates forever (`cause: RETRY_ON_FAILURE`), burning serverless cost and trapping polling loops. Only set `true` when the user explicitly asks for an always-on streaming pipeline. 
+> **Tuned for demo / iteration.** The `pipelines.*RetryAttempts: "0"` overrides disable retries so a broken update fails fast (~30s) instead of retrying for 10+ min on the same root cause. For production, **drop these overrides** so the platform's retry defaults (5 update / 2 flow) absorb transient infra failures. + +Per-field rationale: -The variant blocks below show only the **deltas** to add or change — don't re-paste the whole JSON. +- **`continuous: false`** — triggered runs. `true` auto-restarts failed updates forever (`cause: RETRY_ON_FAILURE`), burning cost and trapping polling loops. Only `true` when the user explicitly asks for always-on streaming. +- **`development: true`** — faster startup, relaxed validation, no retry-on-failure. Required for any edit/re-run loop. +- **`pipelines.numUpdateRetryAttempts: "0"` + `maxFlowRetryAttempts: "0"`** — belt-and-suspenders against retries. Even with `development`, some configs still retry. Drop for prod. +- **`channel: "PREVIEW"`** — latest features. `"CURRENT"` (default) for production stability. + +Variant snippets below show only the **deltas** to add/replace in the canonical JSON. --- @@ -176,24 +190,31 @@ For continuous pipelines, the 5-hour window when daily restarts may occur: Each block shows what to add to (or replace in) the canonical create JSON. -### Development mode +### Production mode (remove dev defaults) + +The canonical create above is tuned for iteration. For production, **remove** `"development": true` and the two `pipelines.*RetryAttempts` overrides so the platform's retry defaults (5 / 2) can absorb transient infra failures. Add ownership tags: ```json -"development": true, -"tags": {"environment": "development", "owner": "data-team"} +"channel": "CURRENT", +"tags": {"environment": "production", "owner": "data-team"} ``` +Switch `"channel"` to `"CURRENT"` for stable runtime behavior. + ### Non-serverless / dedicated cluster +Required only for R, Spark RDD APIs, or JAR/Maven libraries. 
+ ```json "serverless": false, "photon": true, "edition": "ADVANCED", "clusters": [{ "label": "default", - "num_workers": 4, + "autoscale": {"min_workers": 2, "max_workers": 8, "mode": "ENHANCED"}, // or "num_workers": 4 for fixed "node_type_id": "i3.xlarge", - "custom_tags": {"cost_center": "analytics"} + "spark_conf": {"spark.sql.adaptive.enabled": "true"}, + "custom_tags": {"environment": "production"} }] ``` @@ -204,18 +225,6 @@ Each block shows what to add to (or replace in) the canonical create JSON. "configuration": {"spark.sql.shuffle.partitions": "auto"} ``` -### Production autoscaling cluster - -```json -"clusters": [{ - "label": "default", - "autoscale": {"min_workers": 2, "max_workers": 8, "mode": "ENHANCED"}, - "node_type_id": "i3.xlarge", - "spark_conf": {"spark.sql.adaptive.enabled": "true"}, - "custom_tags": {"environment": "production"} -}] -``` - ### Email notifications ```json @@ -269,55 +278,31 @@ databricks pipelines update --json '{ }' ``` -Then trigger a new run with `databricks pipelines start-update [--full-refresh]`. See [workflows.md](workflows.md#step-4-start-an-update-and-poll-that-update) for the polling pattern — never poll top-level `pipelines get` state for run completion. +Then trigger a new run with `databricks pipelines start-update [--full-refresh]`. See [2-rapid-iteration-with-cli.md](2-rapid-iteration-with-cli.md#step-4-start-an-update-and-poll-that-update) for the polling pattern — never poll top-level `pipelines get` state for run completion. --- ## Multi-Schema Patterns -**Preferred: one pipeline, multiple schemas** via fully-qualified table names. Simpler than running multiple pipelines. +**Preferred: one pipeline, multiple schemas** via fully-qualified table names. Simpler than running multiple pipelines. For trivial cases where all tables share one schema, use name prefixes (`bronze_*`, `silver_*`, `gold_*`). -For trivial cases where all tables go to the same schema, use name prefixes (`bronze_*`, `silver_*`, `gold_*`). 
- -### Same catalog, separate schemas (parameterized) - -Set pipeline defaults to bronze; pull silver/gold schemas from configuration: +Set pipeline defaults to one schema (e.g. bronze); pull the rest from `configuration`: ```python -from pyspark import pipelines as dp -from pyspark.sql.functions import col - -silver_schema = spark.conf.get("silver_schema") -gold_schema = spark.conf.get("gold_schema") -landing_schema = spark.conf.get("landing_schema") +silver_schema = spark.conf.get("silver_schema") # add silver_catalog too for cross-catalog +gold_schema = spark.conf.get("gold_schema") -@dp.table(name="orders_bronze") -def orders_bronze(): - return spark.readStream.table(f"{landing_schema}.orders_raw") +@dp.table(name="orders_bronze") # uses pipeline default schema +def orders_bronze(): ... -@dp.table(name=f"{silver_schema}.orders_clean") -def orders_clean(): - return spark.read.table("orders_bronze").filter(col("order_id").isNotNull()) +@dp.table(name=f"{silver_schema}.orders_clean") # other schema, same catalog +def orders_clean(): ... @dp.materialized_view(name=f"{gold_schema}.orders_by_date") -def orders_by_date(): - return (spark.read.table(f"{silver_schema}.orders_clean") - .groupBy("order_date").count()) -``` - -Pass `silver_schema` / `gold_schema` / `landing_schema` via the pipeline's `configuration` block. - -### Custom catalog AND schema per layer - -For cross-catalog scenarios, use fully-qualified names directly: - -```python -@dp.table(name=f"{silver_catalog}.{silver_schema}.orders_clean") -def orders_clean(): - return spark.read.table("orders_bronze").filter(col("order_id").isNotNull()) +def orders_by_date(): ... ``` -Same approach in SQL via fully-qualified `catalog.schema.table` in `CREATE OR REFRESH ...`. +For cross-catalog: use three-part `f"{cat}.{schema}.{table}"` in `name=`. SQL uses the same fully-qualified form in `CREATE OR REFRESH ...`. 
--- diff --git a/skills/databricks-pipelines/references/python-basics.md b/skills/databricks-pipelines/references/python-basics.md index 0216eba..f393f45 100644 --- a/skills/databricks-pipelines/references/python-basics.md +++ b/skills/databricks-pipelines/references/python-basics.md @@ -1,67 +1,44 @@ -#### Setup +# Python Basics -- `from pyspark import pipelines as dp` (preferred) or `import dlt` (deprecated but still works) is always required on top when doing Python. Prefer `dp` import style unless `dlt` was already imported, don't change existing imports unless explicitly asked. -- The SparkSession object is already available (no need to import it again) - unless in a utility file +## Setup -#### Core Decorators +- `from pyspark import pipelines as dp` — required at the top. Legacy `import dlt` still parses but should be migrated (see [SKILL.md Legacy DLT Syntax](../SKILL.md#legacy-dlt-syntax--always-migrate)). +- `spark` (SparkSession) is pre-imported in pipeline files. In utility modules, import it normally. -- `@dp.materialized_view()` - Materialized views (batch processing, recommended for materialized views) -- `@dp.table()` - Streaming tables (when returning streaming DataFrame) or materialized views (legacy, when returning batch DataFrame) -- `@dp.temporary_view()` - Temporary views (non-materialized, private to pipeline) -- `@dp.expect*()` - Data quality constraints (expect, expect_or_drop, expect_or_fail, expect_all, expect_all_or_drop, expect_all_or_fail) +## Core decorators -#### Core Functions +- `@dp.materialized_view()` — batch table. See [materialized-view-python.md](materialized-view-python.md). +- `@dp.table()` — streaming table when the function returns a streaming DataFrame. (Returns-batch-DataFrame is legacy DLT shape — use `@dp.materialized_view` instead.) See [streaming-table-python.md](streaming-table-python.md). +- `@dp.temporary_view()` — pipeline-scoped view. See [temporary-view-python.md](temporary-view-python.md). 
+- `@dp.expect*()` — quality constraints. See [expectations-python.md](expectations-python.md). +- `@dp.append_flow(target=..., once=...)` — fan multiple sources into one target. See [streaming-table-python.md](streaming-table-python.md). +- `@dp.foreach_batch_sink()` — custom per-batch Python sink (Public Preview). See [foreach-batch-sink-python.md](foreach-batch-sink-python.md). -- `dp.create_streaming_table()` - Continuous processing -- `dp.create_auto_cdc_flow()` - Change data capture -- `dp.create_auto_cdc_from_snapshot_flow()` - Change data capture from database snapshots -- `dp.create_sink()` - Write to alternative targets (Kafka, Event Hubs, external Delta tables) -- `@dp.foreach_batch_sink()` - Custom streaming sink with per-batch Python logic (Public Preview) -- `dp.append_flow()` - Append-only patterns -- `dp.read()`/`dp.read_stream()` - Read from other pipeline datasets (deprecated - always use `spark.read.table()` or `spark.readStream.table()` instead) +## Core functions -#### Critical Rules +- `dp.create_streaming_table()` — empty target for `@dp.append_flow` / `dp.create_auto_cdc_flow`. See [streaming-table-python.md](streaming-table-python.md). +- `dp.create_auto_cdc_flow()` / `dp.create_auto_cdc_from_snapshot_flow()` — CDC. See [auto-cdc-python.md](auto-cdc-python.md). +- `dp.create_sink()` — external Delta / Kafka / Event Hubs sinks. See [sink-python.md](sink-python.md). -- ✅ Dataset functions MUST return Spark DataFrames -- ✅ Use `spark.read.table`/`spark.readStream.table` (NOT dp.read* and NOT dlt.read*) -- ✅ Use `auto_cdc` API (NOT apply_changes) -- ✅ Look up documentation for decorator/function parameters when unsure -- ❌ Do not use star imports -- ❌ NEVER use .collect(), .count(), .toPandas(), .save(), .saveAsTable(), .start(), .toTable() -- ❌ AVOID custom monitoring in dataset definitions -- ❌ Keep functions pure (evaluated multiple times) -- ❌ NEVER use the "LIVE." 
prefix when reading other datasets (deprecated) -- ❌ No arbitrary Python logic in dataset definitions - focus on DataFrame operations only +## Reading datasets -#### Python-Specific Considerations +- Batch sibling table: `spark.read.table("name")`. +- Streaming sibling table: `spark.readStream.table("name")`. +- **Never** use the `LIVE.` prefix — fully deprecated, errors in modern pipelines. +- `dp.read()` / `dp.read_stream()` are legacy — always use `spark.read.table(...)` / `spark.readStream.table(...)`. -**Reading Pipeline Datasets:** +## Critical rules -When reading from other datasets defined in the pipeline, use the dataset's **dataset name directly** - NEVER use the `LIVE.` prefix: +- ✅ Dataset functions return a Spark DataFrame. +- ✅ Use the modern `auto_cdc` API, not `apply_changes`. +- ✅ Look up parameter docs when unsure — many decorators have nuanced options. +- ❌ Never call `.collect()`, `.count()`, `.toPandas()`, `.save()`, `.saveAsTable()`, `.start()`, `.toTable()` inside a dataset function. The pipeline owns the write side. +- ❌ No custom monitoring or side effects in dataset functions — they may be evaluated multiple times. Keep them pure DataFrame definitions. +- ❌ No star imports. -```python -# ✅ CORRECT - use the function name directly -customers = spark.read.table("bronze_customers") -transactions = spark.readStream.table("bronze_transactions") - -# ❌ WRONG - do NOT use "LIVE." prefix (deprecated) -customers = spark.read.table("LIVE.bronze_customers") -transactions = spark.readStream.table("LIVE.bronze_transactions") -``` - -The `LIVE.` prefix is deprecated and should never be used. The pipeline automatically resolves dataset references by dataset name. - -**Streaming vs. 
Batch Semantics:** - -- Use `spark.read.table()` (or deprecated `dp.read()`/`dlt.read()`) for batch processing (materialized views with full refresh or incremental computation) -- Use `spark.readStream.table()` (or deprecated `dp.read_stream()`/`dlt.read_stream()`) for streaming tables to enable continuous incremental processing -- **Materialized views**: Use `@dp.materialized_view()` decorator (recommended) with batch DataFrame (`spark.read`) -- **Streaming tables**: Use `@dp.table()` decorator with streaming DataFrame (`spark.readStream`) -- Note: The `@dp.table()` decorator can create both batch and streaming tables based on return type, but `@dp.materialized_view()` is preferred for materialized views - -#### skipChangeCommits +## `skipChangeCommits` -When a downstream streaming table reads from an upstream streaming table that has updates or deletes (e.g., GDPR compliance, Auto CDC targets), use `skipChangeCommits` to ignore those change commits: +When a downstream streaming table reads from an upstream streaming table that has updates/deletes (GDPR purges, Auto CDC targets), set `skipChangeCommits` to ignore the change commits — without it, they cause errors: ```python @dp.table() diff --git a/skills/databricks-pipelines/references/scd-2-querying.md b/skills/databricks-pipelines/references/scd-2-querying.md index 9a626a1..8f4f1f4 100644 --- a/skills/databricks-pipelines/references/scd-2-querying.md +++ b/skills/databricks-pipelines/references/scd-2-querying.md @@ -1,10 +1,8 @@ # Querying SCD Type 2 Tables -How to read SCD Type 2 history tables produced by Auto CDC: current-state views, point-in-time queries, change analysis, and joining facts with historical dimensions. Examples in both SQL and Python. +How to read SCD Type 2 history tables produced by Auto CDC: current-state views, point-in-time queries, change analysis, and joining facts with historical dimensions. 
SQL is shown as canonical; Python translates via `spark.read.table(...).filter(F.col("__END_AT").isNull())` etc. -For the CDC flow that *writes* these tables, see [auto-cdc.md](auto-cdc.md) and the per-language references. - ---- +For the CDC flow that *writes* these tables, see [auto-cdc-python.md](auto-cdc-python.md) / [auto-cdc-sql.md](auto-cdc-sql.md). ## Temporal Columns @@ -12,76 +10,40 @@ SCD Type 2 tables (from `stored_as_scd_type=2` / `STORED AS SCD TYPE 2`) include | Column | Meaning | |--------|---------| -| `__START_AT` | When this version became effective (typically `sequence_by` value). | +| `__START_AT` | When this version became effective (typically the `sequence_by` value). | | `__END_AT` | When this version expired. `NULL` for the current version. | Both have the same type as the `SEQUENCE BY` / `sequence_by` column (usually `TIMESTAMP`). **Rule of thumb**: `WHERE __END_AT IS NULL` selects only current rows. That's the most common filter — bake it into a materialized view if you query it often. ---- - ## Current State ```sql --- All current records (materialize for repeated use) CREATE OR REFRESH MATERIALIZED VIEW dim_customers_current AS SELECT customer_id, customer_name, email, phone, address, __START_AT AS valid_from FROM dim_customers WHERE __END_AT IS NULL; - --- Single customer current row -SELECT * -FROM dim_customers -WHERE customer_id = '12345' AND __END_AT IS NULL; ``` -```python -@dp.materialized_view(name="dim_customers_current") -def dim_customers_current(): - return ( - spark.read.table("dim_customers") - .filter(F.col("__END_AT").isNull()) - .select("customer_id", "customer_name", "email", "phone", "address", - F.col("__START_AT").alias("valid_from")) - ) -``` - ---- +For a single entity: `WHERE customer_id = '12345' AND __END_AT IS NULL`. ## Point-in-Time Queries -State as it existed on a specific date. The inclusive-lower / exclusive-upper boundary matters — get it right or you'll double-count at the seam between versions. 
+State as it existed on a specific date. **Boundary convention**: `[__START_AT, __END_AT)` — start inclusive, end exclusive. Get this wrong and you'll either drop the seam row or double-count it. ```sql --- Products as of 2024-01-01 CREATE OR REFRESH MATERIALIZED VIEW products_as_of_2024_01_01 AS -SELECT product_id, product_name, price, category, - __START_AT, __END_AT +SELECT product_id, product_name, price, category, __START_AT, __END_AT FROM products_history WHERE __START_AT <= '2024-01-01' AND (__END_AT > '2024-01-01' OR __END_AT IS NULL); ``` -```python -@dp.materialized_view(name="products_as_of_2024_01_01") -def products_as_of_2024_01_01(): - as_of = "2024-01-01" - return ( - spark.read.table("products_history") - .filter(F.col("__START_AT") <= as_of) - .filter((F.col("__END_AT") > as_of) | F.col("__END_AT").isNull()) - ) -``` - -**Boundary convention**: `[__START_AT, __END_AT)` — start is inclusive, end is exclusive. A version with `__END_AT = '2024-01-01'` is *not* the active version on 2024-01-01. 
- ---- - ## Change Analysis -### All versions of one entity (history) +### All versions of one entity ```sql SELECT customer_id, customer_name, email, phone, @@ -93,57 +55,24 @@ WHERE customer_id = '12345' ORDER BY __START_AT DESC; ``` -```python -def customer_history(customer_id: str): - return ( - spark.read.table("dim_customers") - .filter(F.col("customer_id") == customer_id) - .withColumn("days_active", - F.coalesce(F.datediff("__END_AT", "__START_AT"), - F.datediff(F.current_timestamp(), "__START_AT"))) - .orderBy(F.col("__START_AT").desc()) - ) -``` - -### Changes within a time period +### Changes within a period (excluding the original version per entity) ```sql --- Customers who changed during Q1 2024 (excluding the original version) SELECT customer_id, customer_name, __START_AT AS change_timestamp, 'UPDATE' AS change_type FROM dim_customers c WHERE __START_AT BETWEEN '2024-01-01' AND '2024-03-31' - AND __START_AT != ( - SELECT MIN(__START_AT) FROM dim_customers c2 - WHERE c2.customer_id = c.customer_id - ) + AND __START_AT != (SELECT MIN(__START_AT) FROM dim_customers c2 + WHERE c2.customer_id = c.customer_id) ORDER BY __START_AT; ``` -```python -@dp.materialized_view(name="customer_changes_q1_2024") -def customer_changes_q1_2024(): - history = spark.read.table("dim_customers") - first_seen = (history.groupBy("customer_id") - .agg(F.min("__START_AT").alias("first_start"))) - return ( - history.join(first_seen, "customer_id") - .filter(F.col("__START_AT").between("2024-01-01", "2024-03-31")) - .filter(F.col("__START_AT") != F.col("first_start")) - .select("customer_id", "customer_name", - F.col("__START_AT").alias("change_timestamp"), - F.lit("UPDATE").alias("change_type")) - ) -``` - ---- - ## Joining Facts with Historical Dimensions -### As-of-transaction-time (canonical) +### As-of-transaction-time (canonical for revenue-correct gold) -For each fact row, pick the dimension version that was active at the transaction's event time. 
This is the common case for revenue-correct gold tables. +For each fact row, pick the dimension version that was active at the transaction's event time. ```sql CREATE OR REFRESH MATERIALIZED VIEW sales_with_historical_prices AS @@ -159,30 +88,9 @@ INNER JOIN products_history p AND (s.sale_date < p.__END_AT OR p.__END_AT IS NULL); ``` -```python -@dp.materialized_view(name="sales_with_historical_prices") -def sales_with_historical_prices(): - sales = spark.read.table("sales_fact") - products = spark.read.table("products_history") - return ( - sales.join( - products, - (sales.product_id == products.product_id) & - (sales.sale_date >= products.__START_AT) & - ((sales.sale_date < products.__END_AT) | products.__END_AT.isNull()), - "inner", - ) - .select(sales.sale_id, sales.product_id, sales.sale_date, sales.quantity, - products.product_name, - products.price.alias("unit_price_at_sale_time"), - (sales.quantity * products.price).alias("calculated_amount"), - products.category) - ) -``` - ### With the current dimension (ignore history) -For reports that should always reflect today's attribute values (regardless of when the sale happened), join against the current row only. +When attributes are *labels* (always-current product name, region label), not values that drive the math. 
```sql CREATE OR REFRESH MATERIALIZED VIEW sales_with_current_prices AS @@ -196,82 +104,40 @@ INNER JOIN products_history p AND p.__END_AT IS NULL; ``` -```python -@dp.materialized_view(name="sales_with_current_prices") -def sales_with_current_prices(): - sales = spark.read.table("sales_fact") - products_current = spark.read.table("products_history").filter(F.col("__END_AT").isNull()) - return ( - sales.join(products_current, "product_id", "inner") - .select("sale_id", "product_id", "sale_date", "quantity", - sales.amount.alias("amount_at_sale"), - products_current.product_name.alias("current_product_name"), - products_current.price.alias("current_price")) - ) -``` - -**Choosing between the two**: as-of-time for revenue, billing, and audit; current-dim for operational dashboards where attributes are *labels*, not values that drive the math. +**When to use which**: as-of-time for revenue, billing, and audit; current-dim for operational dashboards where attributes are labels. ---- +## Optimization -## Optimization Patterns - -### Pre-filter materialized views - -Querying the full history table for "current" repeatedly is wasteful. 
Bake the `__END_AT IS NULL` filter into an MV: +**Pre-filter into MVs** for repeated queries on history tables: ```sql CREATE OR REFRESH MATERIALIZED VIEW dim_products_current AS SELECT * FROM products_history WHERE __END_AT IS NULL; -CREATE OR REFRESH MATERIALIZED VIEW dim_recent_changes AS -SELECT * FROM products_history -WHERE __START_AT >= CURRENT_DATE() - INTERVAL 90 DAYS; - CREATE OR REFRESH MATERIALIZED VIEW product_change_stats AS -SELECT product_id, - COUNT(*) AS version_count, - MIN(__START_AT) AS first_seen, - MAX(__START_AT) AS last_updated +SELECT product_id, COUNT(*) AS version_count, + MIN(__START_AT) AS first_seen, MAX(__START_AT) AS last_updated FROM products_history GROUP BY product_id; ``` -```python -@dp.materialized_view(name="dim_products_current") -def dim_products_current(): - return spark.read.table("products_history").filter(F.col("__END_AT").isNull()) -``` - -### Cluster on lookup keys + time - -```sql -CREATE OR REFRESH STREAMING TABLE products_history -CLUSTER BY (product_id, __START_AT) -... -``` - -Clustering on `product_id` accelerates entity lookups; adding `__START_AT` helps point-in-time scans. See [performance.md](performance.md#cluster-key-selection-by-layer) for the full layer-by-layer key guide. - ---- +**Cluster the history table on lookup key + time**: `CLUSTER BY (product_id, __START_AT)`. Accelerates both entity lookups and point-in-time scans. See [performance.md#cluster-key-selection-by-layer](performance.md#cluster-key-selection-by-layer). ## Best Practices -1. **Filter `__END_AT IS NULL` for "current"** — never compare `__START_AT` against `MAX(__START_AT)` per entity. It's slower and breaks under concurrent updates. -2. **Use inclusive-lower / exclusive-upper** for point-in-time joins. Mismatched boundaries either drop the seam row or double-count it. -3. **Materialize repeated filters.** A `dim_*_current` MV is cheaper than re-filtering the history table on every downstream read. -4. 
**Make `SEQUENCE BY` high-precision.** Sub-second collisions (multiple changes at the same `updated_at`) cause non-deterministic ordering; prefer microsecond timestamps or compose with a tiebreaker via `STRUCT(timestamp, id)`. -5. **For wide history tables, `TRACK HISTORY ON` only the columns that need versions.** Other columns get Type-1 in-place updates and don't create new history rows. See [auto-cdc-python.md](auto-cdc-python.md) / [auto-cdc-sql.md](auto-cdc-sql.md). - ---- +1. **Filter `__END_AT IS NULL` for "current"** — never compare `__START_AT` against `MAX(__START_AT)` per entity. Slower and breaks under concurrent updates. +2. **Inclusive-lower / exclusive-upper** for point-in-time joins (`__START_AT <= D AND (__END_AT > D OR __END_AT IS NULL)`). +3. **Materialize repeated filters.** A `dim_*_current` MV is cheaper than re-filtering history on every downstream read. +4. **High-precision `SEQUENCE BY`.** Sub-second collisions cause non-deterministic ordering — use microsecond timestamps or `STRUCT(ts, tiebreaker)`. +5. **`TRACK HISTORY ON` only columns that need versions** on wide tables (other columns get Type-1 in-place updates without creating new history rows). ## Common Issues | Issue | Cause / Fix | |-------|-------------| | Multiple rows for the same key | Missing `__END_AT IS NULL` filter. | -| Point-in-time query returns no rows at the boundary | Wrong inclusive/exclusive — use `__START_AT <= D AND (__END_AT > D OR __END_AT IS NULL)`. | -| Point-in-time query double-counts at the boundary | Used `__END_AT >= D` instead of `__END_AT > D`. | +| Point-in-time returns no rows at the boundary | Wrong inclusive/exclusive — use `__START_AT <= D AND (__END_AT > D OR __END_AT IS NULL)`. | +| Point-in-time double-counts at the boundary | Used `__END_AT >= D` instead of `__END_AT > D`. | | Slow temporal join | Materialize current-state MV; cluster history on `(entity_key, __START_AT)`. 
| -| Unexpected duplicates per business key per moment | Multiple changes at the same `sequence_by` value — use a higher-precision sequence column or `STRUCT(ts, tiebreaker)`. | -| `__START_AT` / `__END_AT` columns missing | Source table isn't SCD Type 2 (Type 1 doesn't have temporals). | +| Unexpected duplicates per business key per moment | Multiple changes at the same `sequence_by` value — higher-precision sequence column or `STRUCT(ts, tiebreaker)`. | +| `__START_AT` / `__END_AT` columns missing | Source table isn't SCD Type 2 (Type 1 has no temporal columns). | diff --git a/skills/databricks-pipelines/references/sink-python.md b/skills/databricks-pipelines/references/sink-python.md index f680588..4af8f86 100644 --- a/skills/databricks-pipelines/references/sink-python.md +++ b/skills/databricks-pipelines/references/sink-python.md @@ -1,133 +1,65 @@ -Sinks enable writing pipeline data to alternative targets like event streaming services (Apache Kafka, Azure Event Hubs), external Delta tables, or custom data sources using Python code. Sinks are Python-only and work exclusively with streaming append flows. +# Sinks (Python only) -## Creating Sinks +Sinks write pipeline output to non-pipeline-managed targets: Kafka / Event Hubs topics, externally-managed Delta tables, or volumes. Python-only. Streaming queries only. Only compatible with `@dp.append_flow()`. -**dp.create_sink() / dlt.create_sink()** +For per-batch custom Python logic (merge/upsert, multi-destination), see [foreach-batch-sink-python.md](foreach-batch-sink-python.md). -Defines a sink for writing to alternative targets (Kafka, Event Hubs, external Delta tables). Call at top level before using in append flows. +## `dp.create_sink(...)` + +Call at top level before any `@dp.append_flow` references it. 
```python dp.create_sink( - name="", - format="", - options={"": ""} + name="", # required — referenced as target= in @dp.append_flow + format="", # required — "delta", "kafka", or a custom format + options={...}, # required — format-specific options ) ``` -Parameters: - -- `name` (str): Unique identifier for the sink within the pipeline. Used to reference the sink in append flows. **Required.** -- `format` (str): Output format (`"kafka"`, `"delta"`, or custom format). Determines required options. **Required.** -- `options` (dict): Configuration dictionary with format-specific key-value pairs. Required options depend on the format. **Required.** - -## Writing to Sinks - -After creating a sink, use `@dp.append_flow()` (or `@dlt.append_flow()`) decorator to write streaming data to it. The `target` parameter specifies which sink to write to (must match a sink name created with `dp.create_sink()`). - -For complete documentation on append flows, see [streaming-table-python.md](../streaming-table/streaming-table-python.md). - -## Supported Sink Formats - -### Delta Sinks - -Write to Unity Catalog external/managed tables or file paths. +## Delta sinks -**Options for Unity Catalog tables:** +Write to an externally-managed Delta table or to a UC volume path. Use three-part names for UC tables. 
```python -{ - "tableName": "catalog_name.schema_name.table_name" # Fully qualified table name -} -``` - -**Options for file paths:** - -```python -{ - "path": "/Volumes/catalog_name/schema_name/path/to/data" -} -``` - -**Example:** +# Unity Catalog table +dp.create_sink(name="delta_sink", format="delta", + options={"tableName": "main.sales.transactions"}) +# OR volume path +dp.create_sink(name="delta_sink_path", format="delta", + options={"path": "/Volumes/catalog/schema/transactions"}) -```python -# Create Delta sink with table name -dp.create_sink( - name="delta_sink", - format="delta", - options={"tableName": "main.sales.transactions"} -) - -# Write to sink using append flow @dp.append_flow(name="write_to_delta", target="delta_sink") def write_transactions(): - return spark.readStream.table("bronze_transactions") \ - .select("transaction_id", "customer_id", "amount", "timestamp") + return (spark.readStream.table("bronze_transactions") + .select("transaction_id", "customer_id", "amount", "timestamp")) ``` -### Kafka and Azure Event Hubs Sinks +## Kafka / Event Hubs sinks -Write to Apache Kafka or Azure Event Hubs topics for real-time event streaming. +Same `format="kafka"` for both — only the broker endpoint differs (Event Hubs is `.servicebus.windows.net:9093`). -**Important**: This code works for both Apache Kafka and Azure Event Hubs sinks. - -**Required options:** +The output DataFrame **must** have a `value` column (the serialized payload). Optional output columns: `key`, `partition`, `headers`, `topic`. ```python -{ - "kafka.bootstrap.servers": "host:port", # Kafka/Event Hubs endpoint - "topic": "topic_name", # Target topic - "databricks.serviceCredential": "credential_name" # Unity Catalog service credential -} -``` - -**Authentication**: Use `databricks.serviceCredential` to reference a Unity Catalog service credential for connecting to external cloud services. 
- -**Data format requirements**: - -- The `value` parameter is mandatory for Kafka and Azure Event Hubs sinks -- Optional parameters: `key`, `partition`, `headers`, and `topic` - -**Example (works for both Kafka and Event Hubs):** +dp.create_sink(name="kafka_sink", format="kafka", options={ + "kafka.bootstrap.servers": "kafka-broker:9092", + "topic": "customer_events", + "databricks.serviceCredential": "", # UC service credential +}) -```python -# Define credentials and connection details -credential_name = "" -bootstrap_servers = "kafka-broker:9092" # or "{eh-namespace}.servicebus.windows.net:9093" for Event Hubs -topic_name = "customer_events" - -# Create Kafka/Event Hubs sink -dp.create_sink( - name="kafka_sink", - format="kafka", - options={ - "databricks.serviceCredential": credential_name, - "kafka.bootstrap.servers": bootstrap_servers, - "topic": topic_name - } -) - -# Write to sink with required value parameter @dp.append_flow(name="stream_to_kafka", target="kafka_sink") def kafka_flow(): - return spark.readStream.table("customer_events") \ - .selectExpr( - "cast(customer_id as string) as key", - "to_json(struct(*)) AS value" - ) + return (spark.readStream.table("customer_events") + .selectExpr("cast(customer_id as string) AS key", + "to_json(struct(*)) AS value")) ``` -## Limitations and Considerations - -- Sinks only work with streaming queries and cannot be used with batch DataFrames -- Only compatible with `@dp.append_flow()` decorator -- Full refresh updates don't clean existing sink data - - Reprocessed data will be appended to the sink - - Consider idempotency: Design for duplicate writes since full refresh appends data -- Delta sink table names must be fully qualified (catalog.schema.table), use three-part names for Unity Catalog tables -- Volume file paths are supported as an alternative -- Pipeline expectations cannot be applied to sinks - - Apply data quality checks before writing to sinks - - Validate data in upstream tables/views instead -- 
Sinks are Python-only in Spark Declarative Pipelines, SQL does not support sink creation or usage -- Handle serialization: For Kafka/Event Hubs, convert data to JSON or appropriate format +Use `databricks.serviceCredential` (UC service credential) for auth — don't hard-code keys or use raw `kafka.sasl.*` for sinks. + +## Limitations + +- Streaming queries only; sinks are not compatible with batch DataFrames. +- Only `@dp.append_flow` writes to a sink — no `@dp.table` direct writes. +- Pipeline expectations cannot be attached to a sink. Validate in upstream tables/views. +- Full refresh re-runs the flow and **appends** to the sink (no cleanup of prior writes). Design downstream consumers to be idempotent, or pre-truncate the target manually. +- SQL has no sink support. diff --git a/skills/databricks-pipelines/references/sink.md b/skills/databricks-pipelines/references/sink.md deleted file mode 100644 index cf54ef4..0000000 --- a/skills/databricks-pipelines/references/sink.md +++ /dev/null @@ -1,21 +0,0 @@ -# Sinks in Spark Declarative Pipelines - -Sinks enable writing pipeline data to alternative targets beyond Databricks-managed Delta tables, including event streaming services and external tables. - -## Key Concepts - -Sinks in Spark Declarative Pipelines: - -- Write to event streaming services (Apache Kafka, Azure Event Hubs) -- Write to externally-managed Delta tables (Unity Catalog external/managed tables) -- Enable reverse ETL into systems outside Databricks -- Support custom Python data sources -- Work exclusively with streaming queries and append flows - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [sink-python.md](sink-python.md) - -**Important**: Sinks are only available in Python. SQL does not support sinks in Spark Declarative Pipelines. 
diff --git a/skills/databricks-pipelines/references/sql-basics.md b/skills/databricks-pipelines/references/sql-basics.md index bbbf496..c2dab2a 100644 --- a/skills/databricks-pipelines/references/sql-basics.md +++ b/skills/databricks-pipelines/references/sql-basics.md @@ -1,57 +1,55 @@ -#### Core SQL Statements +# SQL Basics -- `CREATE MATERIALIZED VIEW` - Batch processing with full refresh or incremental computation -- `CREATE STREAMING TABLE` - Continuous incremental processing -- `CREATE TEMPORARY VIEW` - Non-materialized views (pipeline lifetime only) -- `CREATE VIEW` - Non-materialized catalog views (Unity Catalog only) -- `AUTO CDC INTO` - Change data capture flows -- `CREATE FLOW` - Define flows or backfills for streaming tables +## Core statements -#### Message Bus Ingestion Functions +- `CREATE OR REFRESH STREAMING TABLE` — continuous incremental processing. See [streaming-table-sql.md](streaming-table-sql.md). +- `CREATE OR REFRESH MATERIALIZED VIEW` — batch table. See [materialized-view-sql.md](materialized-view-sql.md). +- `CREATE TEMPORARY VIEW` — pipeline-scoped view. See [temporary-view-sql.md](temporary-view-sql.md). +- `CREATE VIEW` — UC-published view. See [view-sql.md](view-sql.md). +- `AUTO CDC INTO` (inside `CREATE FLOW`) — CDC. See [auto-cdc-sql.md](auto-cdc-sql.md). +- `CREATE FLOW ... AS INSERT INTO [ONCE] target_table` — append / backfill flows. See [streaming-table-sql.md](streaming-table-sql.md). 
-- `read_kafka(bootstrapServers => '...', subscribe => '...')` - Apache Kafka -- `read_kinesis(streamName => '...', region => '...')` - AWS Kinesis -- `read_pubsub(subscriptionId => '...', topicId => '...')` - Google Cloud Pub/Sub -- `read_pulsar(serviceUrl => '...', topics => '...')` - Apache Pulsar -- Event Hubs: Use `read_kafka()` with Kafka-compatible Event Hubs config +## Source functions (streaming) -#### Critical Rules +Used as `FROM STREAM read_*(...)` inside a streaming table: -- ✅ Prefer `CREATE OR REFRESH` syntax for defining datasets (bare `CREATE` also works, but `OR REFRESH` is the idiomatic convention) -- ✅ Use `STREAM` keyword when reading sources for streaming tables -- ✅ Use `read_files()` function for Auto Loader (cloud storage ingestion) -- ✅ Look up documentation for statement parameters when unsure -- ❌ NEVER use `LIVE.` prefix when reading other datasets (deprecated) -- ❌ NEVER use `CREATE LIVE TABLE` or `CREATE LIVE VIEW` (deprecated - use `CREATE STREAMING TABLE`, `CREATE MATERIALIZED VIEW`, or `CREATE TEMPORARY VIEW` instead) -- ❌ Do not use `PIVOT` clause (unsupported) +- `read_files(path, format => '...')` — Auto Loader. See [auto-loader-sql.md](auto-loader-sql.md). +- `read_kafka(bootstrapServers => '...', subscribe => '...')` — Kafka. Also covers Event Hubs via Kafka protocol. See [kafka.md](kafka.md). +- `read_kinesis(streamName => '...', region => '...')` — AWS Kinesis. +- `read_pubsub(subscriptionId => '...', topicId => '...')` — GCP Pub/Sub. +- `read_pulsar(serviceUrl => '...', topics => '...')` — Apache Pulsar. -#### SQL-Specific Considerations +## Critical rules -**Streaming vs. Batch Semantics:** +- ✅ Prefer `CREATE OR REFRESH` over bare `CREATE` for SDP datasets (idiomatic convention; both parse). +- ✅ Use `FROM STREAM(table)` (function form with parens) for table sources in streaming tables; `FROM STREAM read_files(...)` (no extra parens) for function sources. 
+- ❌ Never use the `LIVE.` prefix when reading sibling datasets — deprecated, errors in modern pipelines. +- ❌ Never `CREATE LIVE TABLE` / `CREATE STREAMING LIVE TABLE` / `CREATE TEMPORARY LIVE VIEW` — all legacy. (Exception: `CREATE LIVE VIEW` is retained for the edge case of expectations on a temp view — see [temporary-view-sql.md#using-expectations-with-temporary-views](temporary-view-sql.md#using-expectations-with-temporary-views).) +- ❌ Never `CREATE OR REPLACE STREAMING TABLE` — that's standard SQL, not SDP. Use `CREATE OR REFRESH`. +- ❌ `PIVOT` clause is unsupported. -- Omit `STREAM` keyword for materialized views (batch processing) -- Use `STREAM` keyword for streaming tables to enable streaming semantics +## Streaming vs batch -**GROUP BY Best Practices:** +`STREAM(...)` opts in to streaming semantics; omit it for batch reads. Streaming tables require streaming reads. Materialized views require batch reads. -- Prefer `GROUP BY ALL` over explicitly listing individual columns unless the user specifically requests explicit grouping -- Benefits: more maintainable when adding/removing columns, less verbose, reduces risk of missing columns in the GROUP BY clause -- Example: `SELECT category, region, SUM(sales) FROM table GROUP BY ALL` instead of `GROUP BY category, region` +## `GROUP BY ALL` -**Python UDFs:** +Prefer `SELECT category, region, SUM(sales) FROM t GROUP BY ALL` over enumerating the grouping columns — less drift when columns are added/removed, no risk of forgetting a column in the `GROUP BY` clause. -- You can use Python user-defined functions (UDFs) in SQL queries -- UDFs must be defined in Python files before calling them in SQL source files +## Configuration -**Configuration:** +- Reference pipeline config values with `${var_name}` interpolation in SQL files. +- Use `SET key = value;` for Spark-level config. 
-- Use `SET` statements and `${}` string interpolation for dynamic values and Spark configurations +## Python UDFs in SQL -#### skipChangeCommits +UDFs must be declared in a Python file in the pipeline (a `@dp.temporary_view()` is not enough; you need a top-level `spark.udf.register(...)` or a UC SQL UDF). The SQL file can then call them by name. -When a downstream streaming table reads from an upstream streaming table that has updates or deletes, use `skipChangeCommits` to ignore change commits: +## `skipChangeCommits` ```sql CREATE OR REFRESH STREAMING TABLE downstream -AS SELECT * FROM STREAM read_stream("upstream_table", skipChangeCommits => true) +AS SELECT * FROM STREAM read_stream("upstream_table", skipChangeCommits => true); ``` + +Use when reading from a streaming table that has updates/deletes (GDPR purges, Auto CDC targets). Without it, the downstream stream fails when it reaches a change commit. diff --git a/skills/databricks-pipelines/references/streaming-patterns.md b/skills/databricks-pipelines/references/streaming-patterns.md index d7275a5..d54bd34 100644 --- a/skills/databricks-pipelines/references/streaming-patterns.md +++ b/skills/databricks-pipelines/references/streaming-patterns.md @@ -1,8 +1,8 @@ # Streaming Patterns -Patterns for streaming pipelines: deduplication, windowed aggregations, late-arriving data, rescue-data quarantine, monitoring lag, and anomaly detection. Examples are shown in both SQL and Python. +Patterns for streaming pipelines: deduplication, windowed aggregations, late-arriving data, rescue-data quarantine, monitoring lag, anomaly detection. SQL is shown as canonical; Python equivalents use `@dp.table` + `spark.readStream.table(...)` with the obvious DataFrame translation. -For perf-framed treatment of stream-to-stream joins, see [performance.md](performance.md#join-optimization). For Auto Loader API and options, see [auto-loader.md](auto-loader.md). For Kafka ingestion, see [kafka.md](kafka.md).
+For stream-to-stream joins as a perf-framed topic, see [performance.md](performance.md#join-optimization). For Auto Loader, see [auto-loader-python.md](auto-loader-python.md) / [auto-loader-sql.md](auto-loader-sql.md). For Kafka ingestion, see [kafka.md](kafka.md). --- @@ -17,26 +17,14 @@ CREATE OR REFRESH STREAMING TABLE silver_events_dedup AS SELECT event_id, user_id, event_type, event_timestamp, _ingested_at FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_timestamp) AS rn - FROM STREAM bronze_events + FROM STREAM(bronze_events) ) WHERE rn = 1; ``` -```python -from pyspark import pipelines as dp -from pyspark.sql import functions as F -from pyspark.sql.window import Window - -@dp.table(name="silver_events_dedup", cluster_by=["event_date"]) -def silver_events_dedup(): - w = Window.partitionBy("event_id").orderBy("event_timestamp") - return ( - spark.readStream.table("bronze_events") - .withColumn("rn", F.row_number().over(w)) - .filter(F.col("rn") == 1) - .drop("rn") - ) -``` +Python equivalent: `Window.partitionBy("event_id").orderBy("event_timestamp")` + `.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") == 1).drop("rn")`. + +**Simple alternative**: `SELECT DISTINCT` (SQL) / `.dropDuplicates(["event_id"])` (Python). Cheaper for low-cardinality append-only streams; maintains state per unique row. 
### Within a time window (tolerates late arrivals) @@ -44,47 +32,16 @@ def silver_events_dedup(): CREATE OR REFRESH STREAMING TABLE silver_events_dedup AS SELECT event_id, user_id, event_type, event_timestamp, MIN(_ingested_at) AS first_seen_at -FROM STREAM bronze_events +FROM STREAM(bronze_events) GROUP BY event_id, user_id, event_type, event_timestamp, window(event_timestamp, '1 hour'); ``` -```python -@dp.table(name="silver_events_dedup") -def silver_events_dedup(): - return ( - spark.readStream.table("bronze_events") - .groupBy("event_id", "user_id", "event_type", "event_timestamp", - F.window("event_timestamp", "1 hour")) - .agg(F.min("_ingested_at").alias("first_seen_at")) - ) -``` - -### Composite key - -```sql -CREATE OR REFRESH STREAMING TABLE silver_transactions_dedup AS -SELECT transaction_id, customer_id, amount, transaction_timestamp, - MIN(_ingested_at) AS _ingested_at -FROM STREAM bronze_transactions -GROUP BY transaction_id, customer_id, amount, transaction_timestamp; -``` - -```python -@dp.table(name="silver_transactions_dedup") -def silver_transactions_dedup(): - return ( - spark.readStream.table("bronze_transactions") - .groupBy("transaction_id", "customer_id", "amount", "transaction_timestamp") - .agg(F.min("_ingested_at").alias("_ingested_at")) - ) -``` - -**Alternative for simple cases**: `SELECT DISTINCT ...` (SQL) or `.dropDuplicates(["event_id"])` (Python). These are fine for low-cardinality dedup but maintain state per unique row. +Same `GROUP BY` shape generalises to composite-key dedup (just add the key columns to the `GROUP BY`). ### When to use Auto CDC instead -For dedup with sequenced updates (most-recent-wins, deletes, late corrections), use Auto CDC with SCD Type 1 — see [auto-cdc.md](auto-cdc.md). Manual `ROW_NUMBER`/`GROUP BY` dedup is for append-only streams without semantic updates. 
+For dedup with sequenced updates (most-recent-wins, deletes, late corrections), use Auto CDC with SCD Type 1 — see [auto-cdc-python.md](auto-cdc-python.md) / [auto-cdc-sql.md](auto-cdc-sql.md). Manual `ROW_NUMBER` / `GROUP BY` dedup is for append-only streams without semantic updates. --- @@ -99,97 +56,32 @@ SELECT sensor_id, AVG(temperature) AS avg_temperature, MIN(temperature) AS min_temperature, MAX(temperature) AS max_temperature, - COUNT(*) AS event_count -FROM STREAM bronze_sensor_events + COUNT(*) AS event_count +FROM STREAM(bronze_sensor_events) GROUP BY sensor_id, window(event_timestamp, '5 minutes'); ``` -```python -@dp.table(name="silver_sensor_5min", cluster_by=["sensor_id"]) -def silver_sensor_5min(): - return ( - spark.readStream.table("bronze_sensor_events") - .groupBy("sensor_id", F.window("event_timestamp", "5 minutes")) - .agg(F.avg("temperature").alias("avg_temperature"), - F.min("temperature").alias("min_temperature"), - F.max("temperature").alias("max_temperature"), - F.count("*").alias("event_count")) - ) -``` - -### Multiple window sizes (separate tables per granularity) - -```sql -CREATE OR REFRESH STREAMING TABLE gold_sensor_1min AS -SELECT sensor_id, - window(event_timestamp, '1 minute').start AS window_start, - window(event_timestamp, '1 minute').end AS window_end, - AVG(value) AS avg_value, - COUNT(*) AS event_count -FROM STREAM silver_sensor_data -GROUP BY sensor_id, window(event_timestamp, '1 minute'); - -CREATE OR REFRESH STREAMING TABLE gold_sensor_1hour AS -SELECT sensor_id, - window(event_timestamp, '1 hour').start AS window_start, - AVG(value) AS avg_value, - STDDEV(value) AS stddev_value -FROM STREAM silver_sensor_data -GROUP BY sensor_id, window(event_timestamp, '1 hour'); -``` +Python equivalent: `.groupBy("sensor_id", F.window("event_timestamp", "5 minutes")).agg(F.avg(...), F.min(...), F.max(...), F.count("*"))`. 
-```python -@dp.table(name="gold_sensor_1min") -def gold_sensor_1min(): - return ( - spark.readStream.table("silver_sensor_data") - .groupBy("sensor_id", F.window("event_timestamp", "1 minute")) - .agg(F.avg("value").alias("avg_value"), - F.count("*").alias("event_count")) - .select("sensor_id", - F.col("window.start").alias("window_start"), - F.col("window.end").alias("window_end"), - "avg_value", "event_count") - ) - -@dp.table(name="gold_sensor_1hour") -def gold_sensor_1hour(): - return ( - spark.readStream.table("silver_sensor_data") - .groupBy("sensor_id", F.window("event_timestamp", "1 hour")) - .agg(F.avg("value").alias("avg_value"), - F.stddev("value").alias("stddev_value")) - ) -``` +For multiple granularities, define a separate streaming table per window size (e.g. `gold_sensor_1min` + `gold_sensor_1hour`) — same shape, different `window(...)` argument. To expose start/end as columns: `window(event_timestamp, '1 minute').start AS window_start`, `.end AS window_end`. ### Session windows (inactivity-bounded) -Group events into sessions terminated by an inactivity gap: +Group events into sessions terminated by an inactivity gap. 
```sql CREATE OR REFRESH STREAMING TABLE silver_user_sessions AS SELECT user_id, session_window(event_timestamp, '30 minutes') AS session, - MIN(event_timestamp) AS session_start, - MAX(event_timestamp) AS session_end, - COUNT(*) AS event_count, - COLLECT_LIST(event_type) AS event_sequence -FROM STREAM bronze_user_events + MIN(event_timestamp) AS session_start, + MAX(event_timestamp) AS session_end, + COUNT(*) AS event_count, + COLLECT_LIST(event_type) AS event_sequence +FROM STREAM(bronze_user_events) GROUP BY user_id, session_window(event_timestamp, '30 minutes'); ``` -```python -@dp.table(name="silver_user_sessions") -def silver_user_sessions(): - return ( - spark.readStream.table("bronze_user_events") - .groupBy("user_id", F.session_window("event_timestamp", "30 minutes")) - .agg(F.min("event_timestamp").alias("session_start"), - F.max("event_timestamp").alias("session_end"), - F.count("*").alias("event_count"), - F.collect_list("event_type").alias("event_sequence")) - ) -``` +Python: `F.session_window("event_timestamp", "30 minutes")`. ### Window-size guidance @@ -205,35 +97,24 @@ Larger windows = less state pressure but stale results. Pick the smallest window ## Late-Arriving Data -### Use event time, not processing time, for business logic +Use event time (the timestamp in the row), not processing time (`_ingested_at`), as the aggregation key. Keep `_ingested_at` in the schema as a debugging field only.
```sql CREATE OR REFRESH STREAMING TABLE gold_daily_orders AS SELECT CAST(order_timestamp AS DATE) AS order_date, -- event time - COUNT(*) AS order_count, + COUNT(*) AS order_count, SUM(amount) AS total_amount -FROM STREAM silver_orders +FROM STREAM(silver_orders) GROUP BY CAST(order_timestamp AS DATE); ``` -```python -@dp.table(name="gold_daily_orders") -def gold_daily_orders(): - return ( - spark.readStream.table("silver_orders") - .groupBy(F.to_date("order_timestamp").alias("order_date")) # event time - .agg(F.count("*").alias("order_count"), - F.sum("amount").alias("total_amount")) - ) -``` - -Keep `_ingested_at` (processing time) in the schema as a debugging field — never the aggregation key. +Python: `.groupBy(F.to_date("order_timestamp").alias("order_date"))`. --- ## Rescue-Data Quarantine -Pattern: route malformed records to a quarantine table so the clean stream stays clean, but no data is silently dropped. Uses Auto Loader's rescued-data column (`_rescued_data`, default name; configurable via `rescuedDataColumn`). +Route malformed records to a quarantine table so the clean stream stays clean but no data is silently dropped. Uses Auto Loader's `_rescued_data` column (default name; configurable via `rescuedDataColumn`). 
```sql -- Bronze: ingest everything, flag rows where Auto Loader rescued bad fields @@ -243,72 +124,40 @@ SELECT *, _rescued_data IS NOT NULL AS _has_errors FROM STREAM read_files('/Volumes/cat/sch/raw/events/', format => 'json'); --- Quarantine: only the rescued/malformed rows +-- Quarantine and clean streams branch from the flagged bronze CREATE OR REFRESH STREAMING TABLE bronze_quarantine AS -SELECT * FROM STREAM bronze_events WHERE _rescued_data IS NOT NULL; +SELECT * FROM STREAM(bronze_events) WHERE _rescued_data IS NOT NULL; --- Silver: only the clean rows CREATE OR REFRESH STREAMING TABLE silver_clean AS -SELECT * FROM STREAM bronze_events WHERE _rescued_data IS NULL; +SELECT * FROM STREAM(bronze_events) WHERE _rescued_data IS NULL; ``` -```python -@dp.table(name="bronze_events", cluster_by=["ingestion_date"]) -def bronze_events(): - return ( - spark.readStream.format("cloudFiles") - .option("cloudFiles.format", "json") - .option("rescuedDataColumn", "_rescued_data") - .load("/Volumes/cat/sch/raw/events/") - .withColumn("_ingested_at", F.current_timestamp()) - .withColumn("_has_errors", F.col("_rescued_data").isNotNull()) - ) - -@dp.table(name="bronze_quarantine") -def bronze_quarantine(): - return spark.readStream.table("bronze_events").filter("_has_errors = true") - -@dp.table(name="silver_clean") -def silver_clean(): - return spark.readStream.table("bronze_events").filter("_has_errors = false") -``` +Python equivalent uses `.option("rescuedDataColumn", "_rescued_data")` on the Auto Loader read, then two `@dp.table` functions filtering on `_has_errors`. -**When to use**: Schema drift on JSON / CSV ingestion, optional fields that arrive late, downstream tables that can't tolerate nulls in known columns. Pair with an alert on `bronze_quarantine` row growth. +**When to use**: schema drift on JSON / CSV, optional fields that arrive late, downstream tables that can't tolerate nulls in known columns. Alert on `bronze_quarantine` row growth. 
**Alternative**: `@dp.expect_or_drop` / `CONSTRAINT ... ON VIOLATION DROP ROW`. Use expectations when the rule is a value check (`amount > 0`); use rescued-data quarantine when the rule is a schema/parse problem. --- -## Stream-to-Stream Joins (Pattern) +## Stream-to-Stream Joins Always bound the join by event-time interval. Without bounds, state grows unbounded. ```sql CREATE OR REFRESH STREAMING TABLE silver_orders_with_payments AS SELECT o.order_id, o.customer_id, o.order_timestamp, - o.amount AS order_amount, + o.amount AS order_amount, p.payment_id, p.payment_timestamp, p.payment_method, - p.amount AS payment_amount -FROM STREAM bronze_orders o -INNER JOIN STREAM bronze_payments p + p.amount AS payment_amount +FROM STREAM(bronze_orders) o +INNER JOIN STREAM(bronze_payments) p ON o.order_id = p.order_id AND p.payment_timestamp BETWEEN o.order_timestamp AND o.order_timestamp + INTERVAL 1 HOUR; ``` -```python -@dp.table(name="silver_orders_with_payments") -def silver_orders_with_payments(): - orders = spark.readStream.table("bronze_orders") - payments = spark.readStream.table("bronze_payments") - return orders.join( - payments, - (orders.order_id == payments.order_id) & - (payments.payment_timestamp >= orders.order_timestamp) & - (payments.payment_timestamp <= orders.order_timestamp + F.expr("INTERVAL 1 HOUR")), - "inner", - ) -``` +Python equivalent: same join with the time-bound predicate as `(p.payment_timestamp >= o.order_timestamp) & (p.payment_timestamp <= o.order_timestamp + F.expr("INTERVAL 1 HOUR"))`. For stream-to-static (broadcast small dimensions) and perf-tuning, see [performance.md](performance.md#join-optimization). @@ -321,25 +170,13 @@ Streaming `GROUP BY` without windows yields cumulative aggregates per group. 
Wat ```sql CREATE OR REFRESH STREAMING TABLE silver_customer_running_totals AS SELECT customer_id, - SUM(amount) AS total_spent, - COUNT(*) AS transaction_count, - MAX(transaction_timestamp) AS last_transaction_at -FROM STREAM bronze_transactions + SUM(amount) AS total_spent, + COUNT(*) AS transaction_count, + MAX(transaction_timestamp) AS last_transaction_at +FROM STREAM(bronze_transactions) GROUP BY customer_id; ``` -```python -@dp.table(name="silver_customer_running_totals") -def silver_customer_running_totals(): - return ( - spark.readStream.table("bronze_transactions") - .groupBy("customer_id") - .agg(F.sum("amount").alias("total_spent"), - F.count("*").alias("transaction_count"), - F.max("transaction_timestamp").alias("last_transaction_at")) - ) -``` - --- ## Anomaly Detection @@ -356,83 +193,34 @@ SELECT sensor_id, event_timestamp, temperature, WHEN temperature < AVG(temperature) OVER w - 3 * STDDEV(temperature) OVER w THEN 'LOW_OUTLIER' ELSE 'NORMAL' END AS anomaly_flag -FROM STREAM bronze_sensor_events +FROM STREAM(bronze_sensor_events) WINDOW w AS (PARTITION BY sensor_id ORDER BY event_timestamp ROWS BETWEEN 100 PRECEDING AND CURRENT ROW); -- Route anomalies for alerting CREATE OR REFRESH STREAMING TABLE silver_sensor_anomalies AS -SELECT * FROM STREAM silver_sensor_with_anomalies +SELECT * FROM STREAM(silver_sensor_with_anomalies) WHERE anomaly_flag IN ('HIGH_OUTLIER', 'LOW_OUTLIER'); ``` -```python -@dp.table(name="silver_sensor_with_anomalies") -def silver_sensor_with_anomalies(): - w = Window.partitionBy("sensor_id").orderBy("event_timestamp").rowsBetween(-100, 0) - return ( - spark.readStream.table("bronze_sensor_events") - .withColumn("rolling_avg", F.avg("temperature").over(w)) - .withColumn("rolling_stddev", F.stddev("temperature").over(w)) - .withColumn("anomaly_flag", - F.when(F.col("temperature") > F.col("rolling_avg") + 3 * F.col("rolling_stddev"), "HIGH_OUTLIER") - .when(F.col("temperature") < F.col("rolling_avg") - 3 * 
F.col("rolling_stddev"), "LOW_OUTLIER") - .otherwise("NORMAL")) - ) - -@dp.table(name="silver_sensor_anomalies") -def silver_sensor_anomalies(): - return ( - spark.readStream.table("silver_sensor_with_anomalies") - .filter(F.col("anomaly_flag").isin("HIGH_OUTLIER", "LOW_OUTLIER")) - ) -``` - -### Static threshold filtering - -```sql -CREATE OR REFRESH STREAMING TABLE silver_high_value_transactions AS -SELECT transaction_id, customer_id, amount, transaction_timestamp -FROM STREAM bronze_transactions -WHERE amount > 10000; -``` - -```python -@dp.table(name="silver_high_value_transactions") -def silver_high_value_transactions(): - return (spark.readStream.table("bronze_transactions").filter(F.col("amount") > 10000)) -``` +Python: same shape with `Window.partitionBy("sensor_id").orderBy("event_timestamp").rowsBetween(-100, 0)` and `F.when(...).when(...).otherwise(...)`. Static-threshold variants are just `.filter(F.col("amount") > 10000)`. --- ## Monitoring Lag -Track end-to-end freshness by comparing event-time max to processing time. Useful for alerting on ingestion delays from Kafka, Kinesis, or Auto Loader. +Compare event-time max to processing time. Useful for alerting on ingestion delays from Kafka, Kinesis, or Auto Loader. 
```sql CREATE OR REFRESH STREAMING TABLE monitoring_lag AS SELECT 'kafka_events' AS source, - MAX(kafka_timestamp) AS max_event_timestamp, - current_timestamp() AS processing_timestamp, + MAX(kafka_timestamp) AS max_event_timestamp, + current_timestamp() AS processing_timestamp, unix_timestamp(current_timestamp()) - unix_timestamp(MAX(kafka_timestamp)) AS lag_seconds -FROM STREAM bronze_kafka_events +FROM STREAM(bronze_kafka_events) GROUP BY window(kafka_timestamp, '1 minute'); ``` -```python -@dp.table(name="monitoring_lag") -def monitoring_lag(): - return ( - spark.readStream.table("bronze_kafka_events") - .groupBy(F.window("kafka_timestamp", "1 minute")) - .agg(F.lit("kafka_events").alias("source"), - F.max("kafka_timestamp").alias("max_event_timestamp"), - F.current_timestamp().alias("processing_timestamp")) - .withColumn("lag_seconds", - F.unix_timestamp("processing_timestamp") - F.unix_timestamp("max_event_timestamp")) - ) -``` - --- ## Best Practices @@ -441,7 +229,7 @@ def monitoring_lag(): 2. **Deduplicate at silver**, not bronze. Bronze is append-only, silver is clean. 3. **Bound state**: time windows, lower cardinality, materialize intermediates — see [performance.md](performance.md#state-management-for-streaming). 4. **Quarantine, don't drop silently** — route bad rows to a side table for observability. -5. **Use Auto CDC for sequenced updates** instead of building dedup with `ROW_NUMBER` — see [auto-cdc.md](auto-cdc.md). +5. **Use Auto CDC for sequenced updates** instead of building dedup with `ROW_NUMBER` — see [auto-cdc-python.md](auto-cdc-python.md) / [auto-cdc-sql.md](auto-cdc-sql.md). 
--- diff --git a/skills/databricks-pipelines/references/streaming-table-python.md b/skills/databricks-pipelines/references/streaming-table-python.md index 2259cbf..2baba74 100644 --- a/skills/databricks-pipelines/references/streaming-table-python.md +++ b/skills/databricks-pipelines/references/streaming-table-python.md @@ -1,164 +1,85 @@ -Streaming Tables in Spark Declarative Pipelines enable incremental processing of continuously arriving data. +# Streaming Tables (Python) -**NOTE:** This guide focuses on streaming tables. For details on materialized views (batch processing with `spark.read`), use the API guide for `materializedView` instead. +Streaming tables enable incremental processing of continuously arriving data. For materialized views (batch with `spark.read`), see [materialized-view-python.md](materialized-view-python.md). -**API Reference:** - -**@dp.table() / @dlt.table()** -Decorator to define a streaming table or materialized view. Returns streaming table when function returns `spark.readStream`. For materialized views using `spark.read`, see the `materializedView` API guide. 
+## `@dp.table()` — streaming or batch depending on return type ```python @dp.table( - name="", - comment="", - spark_conf={"": ""}, - table_properties={"": ""}, - path="", - partition_cols=[""], - cluster_by_auto=True, - cluster_by=[""], - schema="schema-definition", - row_filter="row-filter-clause", - private=False + name="", + comment="", + spark_conf={...}, + table_properties={...}, + path="", + cluster_by=["", ...], # Liquid Clustering — preferred + cluster_by_auto=True, # let Databricks pick keys + partition_cols=[""], # legacy, prefer cluster_by — see performance.md#liquid-clustering + schema="col1 TYPE, ...", # supports GENERATED ALWAYS AS, MASK clauses, PK/FK constraints + row_filter="ROW FILTER my_catalog.my_schema.func ON (col)", # Public Preview + private=False, # True = pipeline-scoped, not published to UC ) -def my_append_flow(): - return spark.readStream.table("source.data") +def my_table(): + return spark.readStream.table("source.data") # streaming → streaming table + # or spark.read.table(...) 
# batch → materialized view (prefer @dp.materialized_view) ``` -Parameters: - -- `name` (str): Table name (defaults to function name) -- `comment` (str): Description for the table -- `spark_conf` (dict): Spark configurations for query execution -- `table_properties` (dict): Delta table properties -- `path` (str): Storage location for table data (defaults to managed location) -- `partition_cols` (list): Columns to partition the table by -- `cluster_by_auto` (bool): Enable automatic liquid clustering -- `cluster_by` (list): Columns to use as clustering keys for liquid clustering -- `schema` (str or StructType): Schema definition (SQL DDL string or StructType) - - Supports generated columns: `"order_datetime STRING, order_day STRING GENERATED ALWAYS AS (dayofweek(order_datetime))"` - - Supports constraints: Primary keys, foreign keys - - Supports column masks: `"ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region)"` -- `row_filter` (str): (Public Preview) A row filter clause that filters rows when fetched from the table. - - Must use syntax: `"ROW FILTER func_name ON (column_name [, ...])"` where `func_name` is a SQL UDF returning `BOOLEAN`. The UDF can be defined in Unity Catalog. - - Rows are filtered out when the function returns `FALSE` or `NULL`. - - You can pass table columns or constant literals (`STRING`, numeric, `BOOLEAN`, `INTERVAL`, `NULL`) as arguments. - - The filter is applied as soon as rows are fetched from the data source. - - The function runs with pipeline owner's rights during refresh and invoker's rights during queries (allowing user-context functions like `CURRENT_USER()` and `IS_MEMBER()` for data security). - - Note: Using row filters on source tables forces full refresh of downstream materialized views. - - Note: It is NOT possible to call `CREATE FUNCTION` within a Spark Declarative Pipeline. 
-- `private` (bool): Restricts table to pipeline scope; prevents metastore publication - -**dp.create_streaming_table() / dlt.create_streaming_table()** -Creates an empty streaming table as target for CDC flows or append flows. Does NOT return a value - call at top level without assignment. +`row_filter` notes: `func_name` must be a SQL UDF in UC returning BOOLEAN; rows are dropped when it returns FALSE/NULL. Forces full refresh of downstream MVs. Cannot define the UDF inside the pipeline. + +## `dp.create_streaming_table()` — empty target for flows + +Use when one target is fed by multiple `@dp.append_flow`s or by `dp.create_auto_cdc_flow()`. Call at top level; does NOT return a value. ```python dp.create_streaming_table( - name="", - comment="", - spark_conf={"": ""}, - table_properties={"": ""}, - path="", - partition_cols=[""], - cluster_by_auto=True, - cluster_by=[""], - schema="schema-definition", - expect_all={"": ""}, - expect_all_or_drop={"": ""}, - expect_all_or_fail={"": ""}, - row_filter="row-filter-clause" + name="", + cluster_by=[...], + schema="...", + expect_all={"name": "cond"}, # warn + expect_all_or_drop={"name": "cond"}, # drop row + expect_all_or_fail={"name": "cond"}, # fail update + row_filter="...", ) ``` -Parameters: Same as @dp.table() except `private`, plus: +Same parameters as `@dp.table()` except `private`, plus the three `expect_all*` dicts. -- `expect_all` (dict): Data quality expectations (warn on failure, include in target) -- `expect_all_or_drop` (dict): Expectations that drop failing rows from target -- `expect_all_or_fail` (dict): Expectations that fail pipeline on violation - -**@dp.append_flow() / @dlt.append_flow()** -Decorator to define a flow that appends data from a source to an existing target table. Multiple append flows can write to the same target table. 
+## `@dp.append_flow()` — fan multiple sources into one table ```python -@dp.append_flow( - target="", - name="", # optional, defaults to function name - once=, # optional, defaults to False - spark_conf={"": "", "": ""}, # optional - comment="" # optional -) -def my_append_flow(): - # For once=False (streaming): use spark.readStream - return spark.readStream.table("source.data") - # For once=True (batch): use spark.read - return spark.read.table("source.data") +@dp.append_flow(target="", name="", once=False) +def my_flow(): + return spark.readStream.table("source.data") # once=False → streaming + # or spark.read.table("archive.historical") # once=True → batch (one-shot) ``` -Parameters: - -- `target` (str): The name of the target streaming table where data will be appended. Target must exist (created with `dp.create_streaming_table()`). **Required.** -- `name` (str): The name of the flow. If not specified, defaults to the function name. Use distinct names when multiple flows target the same table. -- `once` (bool): Controls whether the flow runs continuously or once: - - **False (default)**: Flow continuously processes new data as it arrives in streaming mode. **Must return a streaming DataFrame using `spark.readStream`**, CAN use `cloudFiles` (Auto Loader). - - **True**: Flow processes data only once during pipeline execution and then stops. **Must return a batch DataFrame using `spark.read`**. Do NOT use `cloudFiles` (Auto Loader) with `once=True` - use regular batch reads like `spark.read.format("")` instead. -- `spark_conf` (dict): A dictionary of Spark configuration key-value pairs to apply specifically to this flow's query execution (e.g., `{"spark.sql.shuffle.partitions": "10"}`). -- `comment` (str): A description of the flow that appears in the pipeline metadata and documentation. - -**Two Ways to Define Streaming Tables:** - -1. 
**@dp.table decorator (MOST COMMON)** - - Returns a streaming DataFrame using `spark.readStream` - - Automatically inferred as a streaming table when returning a streaming DataFrame - - ```python - @dp.table(name="events_stream") - def events_stream(): - return spark.readStream.table("source_catalog.schema.events") - ``` +- `target` (required): name of the target table (created via `dp.create_streaming_table()`). +- `name`: defaults to the function name. Use distinct names when multiple flows target the same table. +- `once=True`: one-shot batch. Use `spark.read`, NOT `cloudFiles` (Auto Loader is streaming-only). +- `spark_conf`: per-flow Spark config (e.g. `{"spark.sql.shuffle.partitions": "10"}`). -2. **dp.create_streaming_table()** - - Creates an empty streaming table target - - Required as target for Auto CDC flows and append flows - - Does NOT return a value (do not assign to a variable) +## Single source vs multi-source - ```python - dp.create_streaming_table( - name="users", - schema="user_id INT, name STRING, updated_at TIMESTAMP" - ) - ``` +- **Single source** → `@dp.table()` with `spark.readStream.*` and the transformation in the function body. Continuous processing is automatic. +- **Multi-source / AUTO CDC target** → `dp.create_streaming_table(...)` (empty target) + one `@dp.append_flow` per source (or `dp.create_auto_cdc_flow` for CDC). -**WHEN TO USE WHICH:** +Don't combine: don't have both an `@dp.table` definition AND a separate `@dp.append_flow` targeting it — the decorator already handles continuous processing, the flow is redundant. -Use **@dp.table with readStream** when: +## Common Patterns -- Reading and transforming streaming data -- Creating streaming tables from sources (Auto Loader, Delta tables, etc.) 
-- This is the standard pattern for most streaming use cases - -Use **dp.create_streaming_table()** when: - -- Creating a target table for `dp.create_auto_cdc_flow()` -- Creating a target table for `@dp.append_flow` from multiple sources -- Need to explicitly define table schema before data flows in - -**Common Patterns:** - -**Pattern 1: Simple streaming transformation** +### Auto Loader + filter ```python @dp.table() def bronze(): - return spark.readStream.format("cloudFiles") \ - .option("cloudFiles.format", "json") \ - .load("/path/to/data") + return (spark.readStream.format("cloudFiles") + .option("cloudFiles.format", "json").load("/path/to/data")) @dp.table() def silver(): return spark.readStream.table("bronze").filter("id IS NOT NULL") ``` -**Pattern 2: Multi-source aggregation** +### Multi-source append ```python dp.create_streaming_table(name="all_events") @@ -166,61 +87,46 @@ dp.create_streaming_table(name="all_events") @dp.append_flow(target="all_events", name="mobile") def mobile(): return spark.readStream.table("mobile.events") - -@dp.append_flow(target="all_events", name="web") -def web(): - return spark.readStream.table("web.events") +# Add @dp.append_flow(target="all_events", name="web") ... for additional sources. 
``` -**Pattern 3: One-time backfill with append flow** +### Backfill + live stream into the same table ```python dp.create_streaming_table(name="transactions") -# Continuous streaming flow for new data @dp.append_flow(target="transactions", name="live_stream") def live_transactions(): return spark.readStream.table("source.transactions") -# One-time backfill flow for historical data (uses spark.read for batch) -@dp.append_flow( - target="transactions", - name="historical_backfill", - once=True, - comment="Backfill historical transactions from archive" -) +@dp.append_flow(target="transactions", name="historical_backfill", once=True) def backfill_transactions(): - return spark.read.table("archive.historical_transactions") + return spark.read.table("archive.historical_transactions") # batch, no cloudFiles ``` -**Pattern 4: Row filters for data security** +### Row filter for data security ```python -# Assumes filter_by_dept is a SQL UDF defined in Unity Catalog that returns BOOLEAN - -# Apply row filter to streaming table @dp.table( name="employees", schema="emp_id INT, emp_name STRING, dept STRING, salary DECIMAL(10,2)", - row_filter="ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept)" + row_filter="ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept)", ) def employees(): return spark.readStream.table("source.employees") ``` -**Pattern 5: Stream-static join (enrich streaming data with dimension table)** +### Stream-static join (enrich with dimension) ```python @dp.table() def enriched_transactions(): transactions = spark.readStream.table("transactions") - customers = spark.read.table("customers") + customers = spark.read.table("customers") # static snapshot at stream start return transactions.join(customers, transactions.customer_id == customers.id) ``` -The dimension table (`customers`) is read as a static snapshot at stream start, while the streaming source (`transactions`) is read incrementally. 
- -**Pattern 6: Reading from upstream ST with updates/deletes (skipChangeCommits)** +### Reading from a streaming table that has updates/deletes ```python @dp.table() @@ -228,15 +134,11 @@ def downstream(): return spark.readStream.option("skipChangeCommits", "true").table("upstream_with_deletes") ``` -Use `skipChangeCommits` when reading from a streaming table that has updates/deletes (e.g., GDPR compliance, Auto CDC targets). Without this flag, change commits cause errors. +Without `skipChangeCommits`, update/delete commits on the upstream (e.g. GDPR purges, Auto CDC targets) cause errors. -**KEY RULES:** +## Key rules -- Streaming tables use `spark.readStream` (streaming reads) -- Materialized views use `spark.read` (batch reads) - see the `materializedView` API guide -- Never use `.writeStream`, `.start()`, or checkpoint options - Databricks manages these automatically -- For streaming flows (`once=False`): Use `spark.readStream` to return a streaming DataFrame -- For one-time flows (`once=True`): Use `spark.read` to return a batch DataFrame -- Generated columns, constraints, and masks require schema definition -- Row filters force full refresh of downstream materialized views -- Use `skipChangeCommits` when reading from STs that have updates/deletes +- Streaming tables use `spark.readStream`; MVs use `spark.read`. +- Never `.writeStream`, `.start()`, or pass checkpoint options — Databricks manages them. +- Generated columns, masks, and PK/FK constraints require an explicit `schema=`. +- Row filters on source tables force full refresh of downstream MVs. 
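The append-flow rules above (distinct flow names per target, `once=True` paired with a batch read, `once=False` paired with a streaming read) can be sketched as a small pre-deploy check. This is a minimal illustration in plain Python, not part of the Databricks API: `FlowSpec` and `check_flows` are hypothetical names, and the `reader` field is an assumed stand-in for whichever of `spark.read` / `spark.readStream` the flow function returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowSpec:
    """Hypothetical description of one append flow (not a Databricks type)."""
    target: str
    name: str
    once: bool
    reader: str  # "readStream" (streaming) or "read" (batch)


def check_flows(flows):
    """Return human-readable rule violations; an empty list means the specs pass."""
    problems = []
    seen = set()
    for flow in flows:
        # Rule 1: flows writing to the same target need distinct names.
        key = (flow.target, flow.name)
        if key in seen:
            problems.append(f"{flow.target}: duplicate flow name '{flow.name}'")
        seen.add(key)
        # Rule 2: once=True pairs with a batch read, once=False with a streaming read.
        expected = "read" if flow.once else "readStream"
        if flow.reader != expected:
            problems.append(
                f"{flow.target}/{flow.name}: once={flow.once} expects spark.{expected}"
            )
    return problems


flows = [
    FlowSpec("transactions", "live_stream", once=False, reader="readStream"),
    FlowSpec("transactions", "historical_backfill", once=True, reader="read"),
    FlowSpec("transactions", "bad_backfill", once=True, reader="readStream"),
]
for problem in check_flows(flows):
    print(problem)
```

Only the third spec is flagged, because it combines `once=True` with a streaming read. The first two mirror the backfill + live stream pattern shown earlier.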
diff --git a/skills/databricks-pipelines/references/streaming-table-sql.md b/skills/databricks-pipelines/references/streaming-table-sql.md index 316b7d8..04712c3 100644 --- a/skills/databricks-pipelines/references/streaming-table-sql.md +++ b/skills/databricks-pipelines/references/streaming-table-sql.md @@ -1,260 +1,117 @@ -Streaming Tables in SQL Declarative Pipelines enable incremental processing of continuously arriving data. +# Streaming Tables (SQL) -**NOTE:** This guide focuses on streaming tables in SQL. For details on materialized views (batch processing), use the API guide for `materializedView` instead. +Streaming tables enable incremental processing of continuously arriving data. For materialized views (batch), see [materialized-view-sql.md](materialized-view-sql.md). -**API Reference:** - -**CREATE STREAMING TABLE** -Creates a streaming table that processes data incrementally using `STREAM()` for streaming reads. For materialized views using batch reads (without `STREAM()`), see the `materializedView` API guide. +## Syntax ```sql -CREATE OR REFRESH [PRIVATE] STREAMING TABLE - table_name - [ table_specification ] - [ table_clauses ] +CREATE OR REFRESH [PRIVATE] STREAMING TABLE table_name + [ ( col_name col_type [NOT NULL] [COMMENT '...'] [column_constraint | MASK clause] + [, ...] + [, CONSTRAINT name EXPECT (cond) [ON VIOLATION DROP ROW | FAIL UPDATE]] + [, table_constraint] ) ] + [ PARTITIONED BY (col, ...) | CLUSTER BY (col, ...) ] -- prefer CLUSTER BY + [ LOCATION path ] + [ COMMENT 'view_comment' ] + [ TBLPROPERTIES (key = value, ...) ] + [ WITH ROW FILTER func_name ON (col, ...) ] [ AS query ] - -table_specification - ( { column_identifier column_type [column_properties] } [, ...] - [ column_constraint ] [, ...] - [ , table_constraint ] [...] ) - - column_properties - { NOT NULL | COMMENT column_comment | column_constraint | MASK clause } [ ... 
] - -table_clauses - { USING DELTA - PARTITIONED BY (col [, ...]) | - CLUSTER BY clause | - LOCATION path | - COMMENT view_comment | - TBLPROPERTIES clause | - WITH { ROW FILTER clause } } [ ... ] -``` - -**Parameters:** - -- `PRIVATE`: Restricts table to pipeline scope; prevents metastore publication -- `table_name`: Unique identifier for the table (fully qualified name including catalog and schema must be unique unless marked PRIVATE) -- `table_specification`: Optional schema definition with column names, types, and properties - - `column_identifier`: Name of the column - - `column_type`: Data type (STRING, BIGINT, DECIMAL, etc.) - - `column_properties`: Column attributes: - - `NOT NULL`: Column cannot contain null values - - `COMMENT column_comment`: Description for the column - - `column_constraint`: Data quality constraints, consult the `expectations` API guide for details. - - `MASK clause`: Column masking syntax `MASK catalog.schema.mask_fn USING COLUMNS (other_column)` (Public Preview) - - `table_constraint`: Informational table-level constraints (Unity Catalog only, **not enforced** by Databricks): - - Look up exact documentation when using - - Note: Constraints are informational metadata for documentation and query optimization hints; data validation must be performed independently -- `table_clauses`: Optional clauses for table configuration: - - `USING DELTA`: Optional format specification (only DELTA supported, can be omitted) - - `PARTITIONED BY (col [, ...])`: Columns for traditional partitioning, mutually exclusive with CLUSTER BY - - `CLUSTER BY clause`: Columns for liquid clustering (optimized query performance, recommended over partitioning) - - `LOCATION path`: Storage path (defaults to pipeline storage location) - - `COMMENT view_comment`: Description for the table - - `TBLPROPERTIES clause`: Custom table properties `(key = value [, ...])` - - `WITH ROW FILTER clause`: Row-level security filtering - - Syntax: `ROW FILTER func_name ON 
(column_name [, ...])` (Public Preview) - - `func_name` must be a SQL UDF returning BOOLEAN (can be defined in Unity Catalog) - - Rows are filtered out when function returns FALSE or NULL - - Accepts table columns or constant literals (STRING, numeric, BOOLEAN, INTERVAL, NULL) - - Filter applies when rows are fetched from the data source - - Runs with pipeline owner's rights during refresh and invoker's rights during queries - - Note: Using row filters on source tables forces full refresh of downstream materialized views - - Note: It is NOT possible to call `CREATE FUNCTION` within a Spark Declarative Pipeline. -- `query`: A Spark SQL query that defines the streaming dataset. Must use `STREAM()` function for streaming semantics. - -**STREAM() Function:** -Provides streaming read semantics for the source table. Required for streaming queries. - -```sql -SELECT * FROM STREAM(source_catalog.schema.source_table); -``` - -**CREATE FLOW with INSERT INTO** -Creates a flow that appends data from a source to an existing target streaming table. Multiple flows can write to the same target table. - -```sql -CREATE FLOW flow_name [COMMENT comment] AS -INSERT INTO [ONCE] target_table BY NAME query ``` -**Parameters:** - -- `flow_name`: Unique identifier for the flow. Use distinct names when multiple flows target the same table. -- `ONCE`: Controls whether the flow runs continuously or once: - - **Omitted (default)**: Flow continuously processes new data as it arrives in streaming mode. **Query must use `STREAM()` for streaming reads**. - - **ONCE**: Flow processes data only once during pipeline execution and then stops. **Query uses non-streaming reads (without `STREAM()`)** for batch processing. Re-executes during pipeline complete refreshes to recreate data. -- `target_table_name`: The name of the target streaming table where data will be appended. Target must exist (created with `CREATE STREAMING TABLE`). **Required.** -- `SELECT ... 
FROM STREAM(source_table)`: The query to read source data - - For continuous flows (no ONCE): Use `STREAM()` to return streaming data - - For one-time flows (with ONCE): Omit `STREAM()` to return batch data - -**Two Ways to Define Streaming Tables:** - -1. **CREATE STREAMING TABLE with AS SELECT (MOST COMMON)** - - Defines schema and query in one statement - - Schema can be inferred from query or explicitly defined - - **This automatically creates a continuous streaming pipeline - no separate flow needed** - - ```sql - CREATE STREAMING TABLE events_stream - AS SELECT * FROM STREAM(source_catalog.schema.events); - ``` - -2. **CREATE STREAMING TABLE without AS SELECT** - - Creates an empty streaming table target - - Required for multi-source append patterns - - Schema definition is optional - - **Requires separate `CREATE FLOW` statements to populate the table** - - ```sql - CREATE STREAMING TABLE users ( - user_id INT, - name STRING, - updated_at TIMESTAMP - ); - ``` - -**CRITICAL: WHEN TO USE WHICH:** +Key clause notes: -Use **CREATE STREAMING TABLE with AS SELECT** when: +- `PRIVATE` — pipeline-scoped, not published to the catalog. Use for internal staging. +- `CLUSTER BY (...)` — Liquid Clustering, mutually exclusive with `PARTITIONED BY`. Always prefer. +- `MASK catalog.schema.mask_fn USING COLUMNS (other_col)` — UC column masking (Public Preview). +- `WITH ROW FILTER func ON (col, ...)` — UC row filter (Public Preview). `func` must be a UC SQL UDF returning BOOLEAN; rows are dropped when it returns FALSE/NULL. Forces full refresh of downstream MVs. Cannot define the UDF inside the pipeline. +- `CONSTRAINT ... EXPECT (...)` — see [expectations-sql.md](expectations-sql.md). +- Table-level constraints (primary key, foreign key) are **informational only** — not enforced. Useful as query-optimizer hints and documentation. -- Reading and transforming streaming data from a single source -- Creating streaming tables from Delta tables, Auto Loader sources, etc. 
-- This is the standard pattern for most streaming use cases -- **DO NOT add a separate `CREATE FLOW` - the AS SELECT clause already handles continuous processing** +## `STREAM(...)` source -Use **CREATE STREAMING TABLE without AS SELECT + CREATE FLOW** when: +Streaming queries require `FROM STREAM(table_name)` for table sources (function form, with parens) or `FROM STREAM read_files(...)` / `STREAM read_kafka(...)` for function sources (no extra parens). A definition whose query has no `STREAM` source fails; a batch read is valid only as the static side of a stream-static join. -- Creating a target table for multiple `INSERT INTO` flows from different sources -- Need to explicitly define table schema before data flows in -- Using `AUTO CDC INTO` for CDC. See 'autoCdc' API guide for details. -- **In this case, you MUST create separate flows - the table definition alone does not process data** +## Single source vs multi-source -**NEVER:** - -- Create both `CREATE STREAMING TABLE ... AS SELECT` AND `CREATE FLOW` for the same source - this is redundant and incorrect -- The AS SELECT clause already provides continuous streaming; adding a flow duplicates the work - -**Common Patterns:** - -**Pattern 1: Simple streaming transformation** +**Single source — `CREATE OR REFRESH STREAMING TABLE ... AS SELECT ...`**: handles continuous processing automatically. No separate flow. Most common case. 
```sql --- Bronze layer: ingest raw data with Auto Loader -CREATE STREAMING TABLE bronze -AS SELECT * FROM STREAM(read_files( - '/path/to/data', - format => 'json' -)); - --- Silver layer: filter and clean data -CREATE STREAMING TABLE silver -AS SELECT * -FROM STREAM(bronze) -WHERE id IS NOT NULL; +CREATE OR REFRESH STREAMING TABLE events_stream +AS SELECT * FROM STREAM(source_catalog.schema.events); ``` -**Pattern 2: Multi-source aggregation with flows** +**Multi-source — empty `CREATE OR REFRESH STREAMING TABLE` + `CREATE FLOW`s**: required to fan multiple sources into one table, or to use `AUTO CDC INTO` (see [auto-cdc-sql.md](auto-cdc-sql.md)). ```sql --- Create target table for multiple sources. Schema is optional. -CREATE STREAMING TABLE all_events ( - event_id STRING, - event_type STRING, - event_timestamp TIMESTAMP, - source STRING +CREATE OR REFRESH STREAMING TABLE all_events ( + event_id STRING, event_type STRING, event_timestamp TIMESTAMP, source STRING ); --- Flow from mobile source -CREATE FLOW mobile_flow -AS INSERT INTO all_events BY NAME -SELECT event_id, event_type, event_timestamp, 'mobile' as source +CREATE FLOW mobile_flow AS INSERT INTO all_events BY NAME +SELECT event_id, event_type, event_timestamp, 'mobile' AS source FROM STREAM(mobile.events); - --- Flow from web source -CREATE FLOW web_flow -AS INSERT INTO all_events BY NAME -SELECT event_id, event_type, event_timestamp, 'web' as source -FROM STREAM(web.events); +-- Add CREATE FLOW web_flow ... etc. for additional sources. ``` -**Pattern 3: Row filters for data security** +**Never** combine `AS SELECT` and `CREATE FLOW` on the same target — the `AS SELECT` already provides continuous processing, the flow is redundant. 
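The multi-source pattern above (one empty target table plus one `CREATE FLOW` per source, each with a distinct flow name) can be sketched as a small generator. This is an illustrative helper under stated assumptions, not part of the skill or of any Databricks library: `render_append_flows` is a hypothetical name, and the column list is copied from the `all_events` example.

```python
def render_append_flows(target, sources):
    """Render one continuous CREATE FLOW statement per source table.

    `sources` maps a flow label (e.g. "mobile") to a source table name.
    Each flow gets a distinct name derived from its label.
    """
    statements = []
    for label, table in sources.items():
        statements.append(
            f"CREATE FLOW {label}_flow AS INSERT INTO {target} BY NAME\n"
            f"SELECT event_id, event_type, event_timestamp, '{label}' AS source\n"
            f"FROM STREAM({table});"
        )
    return statements


flows = render_append_flows(
    "all_events", {"mobile": "mobile.events", "web": "web.events"}
)
print("\n\n".join(flows))
```

Each rendered statement reads with `STREAM(...)` and omits `ONCE`, so every flow is continuous. A one-shot backfill would instead use `INSERT INTO ONCE` with a plain batch query.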
-```sql --- Assumes filter_by_dept is a SQL UDF defined in Unity Catalog that returns BOOLEAN +## `CREATE FLOW` syntax -CREATE STREAMING TABLE employees ( - emp_id INT, - emp_name STRING, - dept STRING, - salary DECIMAL(10,2) -) -WITH ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept) -AS SELECT * FROM STREAM(source.employees); +```sql +CREATE FLOW flow_name [COMMENT '...'] AS +INSERT INTO [ONCE] target_table BY NAME query ``` -**Pattern 4: Partitioning and clustering** +- Default (no `ONCE`): continuous flow. Query must use `STREAM(...)`. +- `ONCE`: one-shot batch flow. Query must NOT use `STREAM(...)`. Re-executes on full refresh. +- One target can have many flows, each with a distinct `flow_name`. + +## Common Patterns + +### Auto Loader + filter (single source) ```sql --- Using partitioning (traditional approach) -CREATE STREAMING TABLE orders_partitioned -PARTITIONED BY (order_date) -AS SELECT * FROM STREAM(source.orders); - --- Using liquid clustering (recommended) -CREATE STREAMING TABLE orders_clustered -CLUSTER BY (order_date, customer_id) -AS SELECT * FROM STREAM(source.orders); +CREATE OR REFRESH STREAMING TABLE bronze +AS SELECT * FROM STREAM read_files('/path/to/data', format => 'json'); + +CREATE OR REFRESH STREAMING TABLE silver +AS SELECT * FROM STREAM(bronze) WHERE id IS NOT NULL; ``` -**Pattern 5: Sensitive data masking** +### Row filter for data security ```sql -CREATE STREAMING TABLE customers ( - customer_id INT, - name STRING, - email STRING, - ssn STRING MASK catalog.schema.ssn_mask USING COLUMNS (customer_id) +CREATE OR REFRESH STREAMING TABLE employees ( + emp_id INT, emp_name STRING, dept STRING, salary DECIMAL(10,2) ) -AS SELECT * FROM STREAM(source.customers); +WITH ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept) +AS SELECT * FROM STREAM(source.employees); ``` -**Pattern 6: Private streaming table (pipeline-internal staging)** +Column masking via `MASK fn USING COLUMNS (other_col)` follows the same shape inside the column 
definition. + +### Private staging table ```sql CREATE OR REFRESH PRIVATE STREAMING TABLE staging_events -AS SELECT * -FROM STREAM(raw_events) -WHERE event_type IS NOT NULL; +AS SELECT * FROM STREAM(raw_events) WHERE event_type IS NOT NULL; ``` -Use `PRIVATE` for internal staging datasets that should not be published to the catalog. Private tables are only accessible within the pipeline. - -**Pattern 7: One-time backfill with flow** +### Backfill + live stream into the same table ```sql -CREATE STREAMING TABLE transactions ( - transaction_id STRING, - customer_id STRING, - amount DECIMAL(10,2), - transaction_date TIMESTAMP +CREATE OR REFRESH STREAMING TABLE transactions ( + transaction_id STRING, customer_id STRING, amount DECIMAL(10,2), transaction_date TIMESTAMP ); --- Continuous streaming flow for new data -CREATE FLOW live_stream -AS INSERT INTO transactions +CREATE FLOW live_stream AS INSERT INTO transactions BY NAME SELECT * FROM STREAM(source.transactions); --- One-time backfill flow for historical data (uses batch read without STREAM) -CREATE FLOW historical_backfill -AS INSERT INTO ONCE transactions -SELECT * FROM archive.historical_transactions; +CREATE FLOW historical_backfill AS INSERT INTO ONCE transactions BY NAME +SELECT * FROM archive.historical_transactions; -- no STREAM = batch ``` -**Pattern 8: Stream-static join (enrich streaming data with dimension table)** +### Stream-static join (enrich with dimension) ```sql CREATE OR REFRESH STREAMING TABLE enriched_transactions @@ -263,26 +120,22 @@ FROM STREAM(transactions) t JOIN customers c ON t.customer_id = c.id; ``` -The dimension table (`customers`) is read as a static snapshot at stream start, while the streaming source (`transactions`) is read incrementally. This is the standard pattern for enriching streaming data with lookup/dimension tables. +`customers` is read as a static snapshot at stream start; `transactions` is read incrementally. 
-**Pattern 9: Reading from upstream ST with updates/deletes (skipChangeCommits)** +### Reading from a streaming table that has updates/deletes ```sql CREATE OR REFRESH STREAMING TABLE downstream -AS SELECT * FROM STREAM read_stream("upstream_with_deletes", skipChangeCommits => true) +AS SELECT * FROM STREAM read_stream("upstream_with_deletes", skipChangeCommits => true); ``` -Use `skipChangeCommits` when reading from a streaming table that has updates/deletes (e.g., GDPR compliance, Auto CDC targets). Without this flag, change commits cause errors. +`skipChangeCommits` ignores update/delete commits on the upstream (e.g. GDPR purges, Auto CDC targets). Without it, change commits cause errors. -**KEY RULES:** +## Key rules -- Streaming tables require `STREAM()` keyword for streaming reads -- Never use batch reads (`SELECT * FROM table` without `STREAM()`) in streaming table definitions -- `ALTER TABLE` commands are not supported - use `CREATE OR REFRESH` or `ALTER STREAMING TABLE` instead -- Generated columns, identity columns, and default columns are not currently supported -- Row filters force full refresh of downstream materialized views -- Only table owners can refresh streaming tables -- Table renaming and ownership changes prohibited -- `CLUSTER BY` is recommended over `PARTITIONED BY` for most use cases -- For batch processing, use materialized views instead (see the `materializedView` API guide) -- Use `skipChangeCommits` when reading from STs that have updates/deletes +- Streaming queries require `STREAM(...)` (or `STREAM read_files(...)` / etc.). A definition with no streaming source fails; a batch read is valid only as the static side of a stream-static join. +- `ALTER TABLE` is not supported — use `CREATE OR REFRESH` to redefine, or `ALTER STREAMING TABLE` for table-level adjustments. +- Generated columns, identity columns, and default columns are not supported. +- Row filters on source tables force full refresh of downstream MVs. 
+- Only the table owner can refresh; table renaming and ownership changes are prohibited. +- `CLUSTER BY` over `PARTITIONED BY` for new tables. diff --git a/skills/databricks-pipelines/references/streaming-table.md b/skills/databricks-pipelines/references/streaming-table.md deleted file mode 100644 index f57baf9..0000000 --- a/skills/databricks-pipelines/references/streaming-table.md +++ /dev/null @@ -1,19 +0,0 @@ -# Streaming Tables in Spark Declarative Pipelines - -Streaming tables enable continuous processing of data streams with exactly-once semantics and automatic checkpointing. - -## Key Concepts - -Streaming tables in Spark Declarative Pipelines: - -- Process data continuously as it arrives -- Provide exactly-once processing guarantees -- Support stateful operations (aggregations, joins, deduplication) -- Automatically manage checkpoints and state - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [streaming-table-python.md](streaming-table-python.md) -- **SQL**: [streaming-table-sql.md](streaming-table-sql.md) diff --git a/skills/databricks-pipelines/references/temporary-view-python.md b/skills/databricks-pipelines/references/temporary-view-python.md index dab90cd..1a7f085 100644 --- a/skills/databricks-pipelines/references/temporary-view-python.md +++ b/skills/databricks-pipelines/references/temporary-view-python.md @@ -1,66 +1,37 @@ -Temporary Views in Spark Declarative Pipelines create temporary logical datasets without persisting data to storage. Use views for intermediate transformations that drive downstream workloads but don't need materialization. +# Temporary Views (Python) -**API Reference:** +Pipeline-scoped logical datasets — not materialized, not published to UC. Used for shared intermediate transformations that drive multiple downstream tables. -**@dp.temporary_view() (preferred) / @dp.view() (alias) / @dlt.view() (deprecated)** -Decorator to define a temporary view. 
+`@dp.temporary_view()` is the current decorator. Legacy `@dlt.view()` (and `@dp.view()` if present in older code) should be migrated — see [SKILL.md Legacy DLT Syntax](../SKILL.md#legacy-dlt-syntax--always-migrate). ```python -@dp.temporary_view( - name="", - comment="" -) +@dp.temporary_view(name="", comment="") # both optional def my_view(): - return spark.read.table("source.data") + return spark.read.table("source.data") # batch — or spark.readStream.table(...) for streaming ``` -Parameters: +Downstream tables reference the view by name via `spark.read.table("my_view")` or `spark.readStream.table("my_view")`. -- `name` (str): View name (defaults to function name) -- `comment` (str): Description for the view - -**Common Patterns:** - -**Pattern 1: Intermediate transformation layer** +## Example ```python -# View for shared filtering logic @dp.temporary_view() def valid_events(): - return spark.read.table("raw.events") \ - .filter("event_type IS NOT NULL") \ - .filter("timestamp IS NOT NULL") + return (spark.read.table("raw.events") + .filter("event_type IS NOT NULL") + .filter("timestamp IS NOT NULL")) -# Multiple tables consume the view @dp.materialized_view() def user_events(): - return spark.read.table("valid_events") \ - .filter("event_type = 'user_action'") - -@dp.materialized_view() -def system_events(): - return spark.read.table("valid_events") \ - .filter("event_type = 'system_event'") + return spark.read.table("valid_events").filter("event_type = 'user_action'") +# Other downstream MVs follow the same shape. 
``` -**Pattern 2: Streaming views** - -```python -# Views work with streaming DataFrames too -@dp.temporary_view() -def streaming_events(): - return spark.readStream.table("bronze.events") \ - .filter("event_id IS NOT NULL") - -@dp.table() -def filtered_stream(): - return spark.readStream.table("streaming_events") \ - .filter("event_type = 'critical'") -``` +Streaming variant: return `spark.readStream.*` from the temp view; downstream `@dp.table()` reads it via `spark.readStream.table(...)`. -**KEY RULES:** +## Key rules -- Views can return either batch (`spark.read`) or streaming (`spark.readStream`) DataFrames -- Views are not materialized - they're computed on demand when referenced -- Reference views using `spark.read.table("view_name")` or `spark.readStream.table("view_name")` -- Views prevent code duplication when multiple downstream tables need the same transformation +- Computed on demand — not materialized. +- Either batch or streaming, depending on the returned DataFrame type. +- Pipeline-scoped — not visible outside the pipeline. +- Cannot apply column masks, row filters, or `cluster_by` (it's not a table). diff --git a/skills/databricks-pipelines/references/temporary-view-sql.md b/skills/databricks-pipelines/references/temporary-view-sql.md index f1d8bb6..9472c2d 100644 --- a/skills/databricks-pipelines/references/temporary-view-sql.md +++ b/skills/databricks-pipelines/references/temporary-view-sql.md @@ -1,82 +1,46 @@ -Temporary Views in Spark Declarative Pipelines create temporary logical datasets without persisting data to storage. Use views for intermediate transformations that drive downstream workloads but don't need materialization. +# Temporary Views (SQL) -**API Reference:** - -**CREATE TEMPORARY VIEW** -SQL statement to define a temporary view. +Pipeline-scoped logical datasets — not materialized, not published to UC. Used for shared intermediate transformations that drive multiple downstream tables. 
```sql CREATE TEMPORARY VIEW view_name - [(col_name [COMMENT col_comment] [, ...])] - [COMMENT view_comment] - [TBLPROPERTIES (key = value [, ...])] -AS query + [ (col_name [COMMENT 'col_comment'], ...) ] + [ COMMENT 'view_comment' ] + [ TBLPROPERTIES (key = 'value', ...) ] +AS query -- batch or streaming ``` -Parameters: - -- `view_name` (identifier): Name of the temporary view -- `col_name` (identifier): Optional column name specifications -- `col_comment` (string): Optional description for individual columns -- `view_comment` (string): Optional description for the view -- `TBLPROPERTIES` (key-value pairs): Optional table properties -- `query` (SELECT statement): Query that defines the view's data - -**Common Patterns:** - -**Pattern 1: Intermediate transformation layer** +## Example ```sql --- View for shared filtering logic +-- Shared filtering logic, consumed by multiple downstream MVs CREATE TEMPORARY VIEW valid_events AS SELECT * FROM raw.events -WHERE event_type IS NOT NULL - AND timestamp IS NOT NULL; - --- Multiple tables consume the view -CREATE MATERIALIZED VIEW user_events -AS SELECT * FROM valid_events -WHERE event_type = 'user_action'; - -CREATE MATERIALIZED VIEW system_events -AS SELECT * FROM valid_events -WHERE event_type = 'system_event'; -``` - -**Pattern 2: Views with streaming sources** - -```sql --- Temporary views work with streaming sources too -CREATE TEMPORARY VIEW streaming_events -AS SELECT * FROM STREAM(bronze.events) -WHERE event_id IS NOT NULL; +WHERE event_type IS NOT NULL AND timestamp IS NOT NULL; --- Downstream streaming table consuming the view -CREATE STREAMING TABLE filtered_stream -AS SELECT * FROM STREAM(streaming_events) -WHERE event_type = 'critical'; +CREATE OR REFRESH MATERIALIZED VIEW user_events +AS SELECT * FROM valid_events WHERE event_type = 'user_action'; +-- Other downstream MVs follow the same shape. 
``` -**KEY RULES:** - -- Views are not materialized - they're computed on demand when referenced -- Views exist only during the pipeline execution lifetime and are private to the pipeline -- Reference views in downstream tables using `FROM view_name` or `FROM STREAM(view_name)` for streaming -- Views prevent code duplication when multiple downstream tables need the same transformation -- Temporary views work with both batch and streaming data sources (using `STREAM()` function) -- Views can share names with catalog objects; within the pipeline, references resolve to the temporary view +Streaming source: `CREATE TEMPORARY VIEW ... AS SELECT ... FROM STREAM(bronze.events) WHERE ...` — downstream STs read via `FROM STREAM(view_name)`. -**IMPORTANT - Using Expectations with Temporary Views:** +## Using Expectations with Temporary Views -`CREATE TEMPORARY VIEW` does not support CONSTRAINT clauses for expectations. If you need to include expectations (data quality constraints) with a temporary view, use `CREATE LIVE VIEW` syntax instead: +`CREATE TEMPORARY VIEW` does NOT support `CONSTRAINT` clauses. For the rare case where you need expectations on a temp view, use `CREATE LIVE VIEW` (older syntax, retained for this purpose): ```sql -CREATE LIVE VIEW view_name( +CREATE LIVE VIEW view_name ( CONSTRAINT constraint_name EXPECT (condition) [ON VIOLATION DROP ROW | FAIL UPDATE] -) -AS query +) AS query ``` -`CREATE LIVE VIEW` is the older syntax for temporary views, retained specifically for this use case. Use `CREATE TEMPORARY VIEW` for views without expectations, and `CREATE LIVE VIEW` when you need to add CONSTRAINT clauses. +See [expectations-sql.md](expectations-sql.md) for the full constraint semantics. Otherwise, prefer attaching the constraint to a downstream streaming table or MV. + +## Key rules -For detailed information on using expectations with temporary views, see the "expectations" API guide. +- Computed on demand — not materialized. 
+- Pipeline-scoped — not published to UC, gone after pipeline run. +- Reference downstream as `FROM view_name` (batch) or `FROM STREAM(view_name)` (streaming). +- Temp view name shadows a same-named catalog object inside the pipeline. +- For UC-published views, use `CREATE VIEW` ([view-sql.md](view-sql.md)). diff --git a/skills/databricks-pipelines/references/temporary-view.md b/skills/databricks-pipelines/references/temporary-view.md deleted file mode 100644 index 0ea0a88..0000000 --- a/skills/databricks-pipelines/references/temporary-view.md +++ /dev/null @@ -1,19 +0,0 @@ -# Temporary Views in Spark Declarative Pipelines - -Temporary views are pipeline-private views that exist only within the context of the pipeline and are not published to Unity Catalog. - -## Key Concepts - -Temporary views in Spark Declarative Pipelines: - -- Are private to the pipeline (not published to Unity Catalog) -- Can be referenced by other tables/views in the same pipeline -- Do not persist after pipeline execution -- Useful for organizing complex transformations - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [temporary-view-python.md](temporary-view-python.md) -- **SQL**: [temporary-view-sql.md](temporary-view-sql.md) diff --git a/skills/databricks-pipelines/references/view-sql.md b/skills/databricks-pipelines/references/view-sql.md index 2d47f36..9405385 100644 --- a/skills/databricks-pipelines/references/view-sql.md +++ b/skills/databricks-pipelines/references/view-sql.md @@ -1,76 +1,38 @@ -Views in Spark Declarative Pipelines create virtual tables published to the Unity Catalog metastore. Unlike temporary views (which are private to the pipeline), views created with CREATE VIEW are accessible outside the pipeline and persist in the catalog. +# Persistent Views (SQL, UC) -**API Reference:** +`CREATE VIEW` publishes a virtual table to Unity Catalog. 
Unlike `CREATE TEMPORARY VIEW` (pipeline-private), persistent views are accessible outside the pipeline and persist in the catalog. The query runs on access — no data is stored. -**CREATE VIEW** -SQL statement to define a persistent view in Unity Catalog. +For pipeline-private views, use `CREATE TEMPORARY VIEW` ([temporary-view-sql.md](temporary-view-sql.md)). For materialized output, use `CREATE OR REFRESH MATERIALIZED VIEW` ([materialized-view-sql.md](materialized-view-sql.md)). + +## Syntax ```sql CREATE VIEW view_name - [COMMENT view_comment] - [TBLPROPERTIES (key = value [, ...])] -AS query + [COMMENT 'view_comment'] + [TBLPROPERTIES (key = 'value', ...)] +AS query -- must be batch (no STREAM) ``` -Parameters: - -- `view_name` (identifier): Unique identifier within the catalog and schema -- `view_comment` (string): Optional description for the view -- `TBLPROPERTIES` (key-value pairs): Optional table properties -- `query` (SELECT statement): Query that defines the view's data (must be batch, not streaming) - -**Common Patterns:** - -**Pattern 1: Filtered view for reusable logic** +## Example ```sql --- View with filtering logic published to catalog CREATE VIEW valid_orders COMMENT 'Orders with valid data for analysis' +TBLPROPERTIES ('quality' = 'silver', 'owner' = 'analytics-team') AS SELECT * FROM raw.orders WHERE order_id IS NOT NULL AND customer_id IS NOT NULL AND order_date IS NOT NULL; - --- Multiple downstream tables can reference this view -CREATE MATERIALIZED VIEW orders_by_region -AS SELECT - region, - COUNT(*) AS order_count, - SUM(amount) AS total_revenue -FROM valid_orders -GROUP BY region; ``` -**Pattern 2: View with custom properties** - -```sql --- View with table properties for metadata -CREATE VIEW customer_summary -COMMENT 'Aggregated customer metrics' -TBLPROPERTIES ( - 'quality' = 'silver', - 'owner' = 'analytics-team', - 'refresh_frequency' = 'daily' -) -AS SELECT - customer_id, - COUNT(DISTINCT order_id) AS total_orders, - SUM(amount) AS 
lifetime_value, - MAX(order_date) AS last_order_date -FROM valid_orders -GROUP BY customer_id; -``` +Downstream MVs can reference `valid_orders` directly. -**KEY RULES:** +## Key rules -- Views are virtual tables - not materialized, computed on demand when referenced -- Views are published to Unity Catalog and accessible outside the pipeline -- Views require Unity Catalog pipelines with default publishing mode -- Does not support explicit column definitions with COMMENT -- Cannot use `STREAM()` function - views must use batch queries only -- Cannot define expectations (CONSTRAINT clauses) on views -- Views require appropriate permissions: SELECT on source tables, CREATE TABLE on target schema -- For pipeline-private views, use `CREATE TEMPORARY VIEW` instead -- For materialized data persistence, use `CREATE MATERIALIZED VIEW` instead +- Not materialized — query runs on access. +- Published to UC; requires UC pipeline with default publishing mode. +- Batch only — no `STREAM(...)`. +- No `CONSTRAINT` clauses (no expectations). +- No explicit column-list-with-COMMENT syntax — comment at the view level only. +- Permissions: `SELECT` on source tables, `CREATE TABLE` on the target schema. diff --git a/skills/databricks-pipelines/references/view.md b/skills/databricks-pipelines/references/view.md deleted file mode 100644 index f028227..0000000 --- a/skills/databricks-pipelines/references/view.md +++ /dev/null @@ -1,20 +0,0 @@ -# Views in Spark Declarative Pipelines - -Views provide a way to define reusable query logic and publish datasets to Unity Catalog for broader consumption. 
- -## Key Concepts - -Views in Spark Declarative Pipelines: - -- Are published to Unity Catalog when the pipeline runs -- Can reference other tables and views in the pipeline -- Support both SQL and Python (with limitations) -- Are refreshed when the pipeline updates - -## Language-Specific Implementations - -For detailed implementation guides: - -- **SQL**: [view-sql.md](view-sql.md) - -**Important**: Python in Spark Declarative Pipelines only supports temporary views (private to the pipeline), not persistent views published to Unity Catalog. For Unity Catalog-published views, use SQL syntax with `CREATE VIEW`. diff --git a/skills/databricks-pipelines/references/write-spark-declarative-pipelines.md b/skills/databricks-pipelines/references/write-spark-declarative-pipelines.md deleted file mode 100644 index 7806c71..0000000 --- a/skills/databricks-pipelines/references/write-spark-declarative-pipelines.md +++ /dev/null @@ -1,8 +0,0 @@ -# Write Spark Declarative Pipelines - -Core syntax and rules for writing Spark Declarative Pipelines datasets. - -## Language-specific guides - -- [Python basics](python-basics.md) - Python decorators, functions, and critical rules -- [SQL basics](sql-basics.md) - SQL statements and critical rules