skills(pipelines): tighten databricks-pipelines for correctness and density #102

Open
QuentinAmbard wants to merge 1 commit into databricks:main from QuentinAmbard:skills/databricks-pipelines-density-and-correctness

@QuentinAmbard

Why this PR exists

The databricks-pipelines skill is the agent-facing reference for building Lakeflow Spark Declarative Pipelines. The initial port from ai-dev-kit got the content into the repo — this PR is the second pass that turns it into a tight, internally consistent tool an LLM can rely on in a fresh session: every claim grounded in the current Databricks docs, every workflow with one clear entry point, every concept stated once.

What this PR improves

One authoritative story per topic. SDP/LDP naming, the legacy-DLT migration map, the start-update polling pattern with error.exceptions[0].message extraction, the dev/iteration canonical create JSON — each now lives in exactly one place and is referenced from everywhere else that needs it. The DAB workflows (A and B) and the no-bundle CLI workflow (C) have dedicated detail files; the SKILL.md entry point gives a one-liner per option and links out. The Common Traps / Common Issues split is now intent-driven: Traps cover design-time decisions, Issues cover concrete error → fix mappings agents will copy-paste.
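
For reviewers who want to see what the error.exceptions[0].message extraction amounts to, here is a minimal Python sketch. It assumes each pipeline event is a plain dict with an optional `error` object holding an `exceptions` list whose entries carry a `message`; the function names and the sample payload are hypothetical and are not taken from the skill files.

```python
from typing import Any, Optional


def first_exception_message(event: dict[str, Any]) -> Optional[str]:
    """Return error.exceptions[0].message for one event, or None if absent."""
    # Tolerate a missing "error" key, an explicit null, or an empty list.
    exceptions = (event.get("error") or {}).get("exceptions") or []
    if not exceptions:
        return None
    return exceptions[0].get("message")


def failure_messages(events: list[dict[str, Any]]) -> list[str]:
    """Collect the first exception message of every event that carries one."""
    messages = []
    for event in events:
        message = first_exception_message(event)
        if message is not None:
            messages.append(message)
    return messages


# Hypothetical sample payload, for illustration only.
sample_events = [
    {"level": "INFO", "message": "Update started"},
    {"level": "ERROR", "error": {"exceptions": [{"message": "Table not found: bronze_orders"}]}},
    {"level": "ERROR", "error": {"exceptions": []}},
]

print(failure_messages(sample_events))
```

Reading the first exception rather than the event's top-level message is the point of the pattern: the top-level text usually only says that the update failed, while the exception body says why.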

Verified against the current Databricks docs. Where official guidance is nuanced — CREATE TEMPORARY VIEW doesn't support CONSTRAINT clauses, so CREATE LIVE VIEW is retained as the only path for expectations on a temp view — that nuance is captured explicitly with a citation. STREAM(table) (function form, with parens) is normalized everywhere for table sources; STREAM read_files(...) (no extra parens) for function sources, matching the docs. @dp.view is correctly marked as legacy per the official temporary_view reference. sequence_by is documented as accepting both string and Column per the API.

Dense and skimmable. Roughly 40% shorter (~5,700 → ~3,400 lines) with zero concept removed. SQL is canonical where SQL/Python differ only mechanically, with a one-line Python equivalent noted inline. Pattern-variation sections that repeated the same SQL with one different clause each are now single examples with the variations listed. Verbose boilerplate (full pyproject.toml, identical second/third examples of the same pattern) leans on agent world-knowledge instead of restating it. Less context to load, less drift between repeated explanations.

Tuned defaults for the loop agents actually run. The canonical pipelines create JSON ships with development: true and the retry overrides set to "0", so a doomed update fails fast (~30s) instead of retrying for 10+ min on serverless. A clearly labelled "drop these for production" instruction sits right next to the snippet so the iteration defaults can't quietly leak into prod pipelines.
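
The dev-to-prod conversion described above can be sketched in Python as follows. This is an illustration under assumptions, not the snippet shipped in the skill: the description does not name the retry overrides, so the two configuration keys below are my assumption about which settings are meant, and the helper names and pipeline name are hypothetical.

```python
import copy
import json
from typing import Any

# ASSUMPTION: these are the retry overrides the description refers to;
# the PR text itself does not name the keys.
RETRY_OVERRIDE_KEYS = (
    "pipelines.numUpdateRetryAttempts",
    "pipelines.maxFlowRetryAttempts",
)


def dev_pipeline_spec(name: str) -> dict[str, Any]:
    """Build an iteration-friendly create payload that fails fast."""
    return {
        "name": name,
        "development": True,
        # Values are strings, matching the "0" in the description.
        "configuration": {key: "0" for key in RETRY_OVERRIDE_KEYS},
    }


def to_production(spec: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with the iteration-only defaults dropped."""
    prod = copy.deepcopy(spec)
    prod.pop("development", None)
    config = prod.get("configuration", {})
    for key in RETRY_OVERRIDE_KEYS:
        config.pop(key, None)
    # Remove the configuration block entirely if nothing else was set in it.
    if not config:
        prod.pop("configuration", None)
    return prod


dev_spec = dev_pipeline_spec("orders_pipeline")  # hypothetical pipeline name
print(json.dumps(dev_spec, indent=2))
print(json.dumps(to_production(dev_spec), indent=2))
```

Working on a deep copy keeps the dev payload intact, so the same starting dict can keep driving the iteration loop while a separate prod payload is derived from it.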

Summary of changes

| Area | What changed |
| --- | --- |
| Workflow narrative | Split into a self-contained A/B/C entry block in SKILL.md, plus dedicated detail files: 1-project-initialization-with-dab.md (Workflows A + B), 2-rapid-iteration-with-cli.md (Workflow C). Running-a-pipeline and refresh-mode guidance now explicitly distinguish bundle-deploy vs CLI-driven paths. |
| Correctness | Normalized SQL syntax against the current Databricks docs: FROM STREAM(table) for table sources, FROM STREAM read_files(...) for function sources, CREATE OR REFRESH on every example. Fixed the broken jq for FAILED-update debugging so it extracts the actual exception body. Resolved contradictions around @dp.view, apply_changes aliases, CREATE LIVE VIEW, sequence_by, and partition_cols. |
| API tables in SKILL.md | Added a Description column so every row tells the reader what the feature is, not just the syntax. Dropped the Python/SQL deprecation columns. Added a single canonical "Legacy DLT Syntax — always migrate" table covering import dlt, decorators, reads, apply_changes, LIVE. prefix, CREATE LIVE TABLE, partition_cols, input_file_name(), and the target= parameter, with the CREATE LIVE VIEW-for-expectations carve-out explicitly noted. |
| Density | Removed near-duplicate examples (same SQL with one different filter value), "Complete Example" sections that re-piled patterns shown individually above, and parallel SQL+Python code blocks for mechanically-translatable patterns. Compressed full pyproject.toml and option-table boilerplate into pointer-style content. ~40% line reduction overall, zero concept lost. |
| File layout | Deleted ten redirect-only "parent stub" files (streaming-table.md, expectations.md, etc.); SKILL.md API tables now link straight to the -python.md / -sql.md siblings, with no broken subdirectory paths. |
| High-value additions from a-d-k | Merged in SDP-specific gotchas that weren't in the original DAS port: CREATE OR REPLACE rejection (standard SQL ≠ SDP), dbfs: prefix required for UC Volume paths, CLUSTER BY column-type rules with the DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED mapping, the error.exceptions[0].message debug-extraction pattern, the upstream-trace protocol for validation failures, and the Gold-layer preserve-dimensions guidance. |
| Dev defaults | Canonical create JSON now ships with development: true + retry overrides for fast-fail iteration, alongside a clearly labelled prod conversion. Mirrored consistently in the SDK alternative and the workflow file. |

Reviewer aid

REVIEW-NOTES-databricks-pipelines.md at the repo root walks through every category of change with file/line citations and links to the Databricks docs sections that motivated each correctness fix.

This pull request and its description were written by Isaac.

…ness and density

Rework of the databricks-pipelines skill: fixes correctness/consistency
issues introduced during the initial port from ai-dev-kit, restructures
the workflow narrative around DAB-vs-CLI iteration, and compresses
~40% of the line count without losing any concept.

See REVIEW-NOTES-databricks-pipelines.md at the repo root for the full
rationale and per-category change log.

Co-authored-by: Isaac