skills(pipelines): tighten databricks-pipelines for correctness and density #102
Open
QuentinAmbard wants to merge 1 commit into
…ness and density

Rework of the databricks-pipelines skill: fixes correctness/consistency issues introduced during the initial port from ai-dev-kit, restructures the workflow narrative around DAB-vs-CLI iteration, and compresses ~40% of the line count without losing any concept. See REVIEW-NOTES-databricks-pipelines.md at the repo root for the full rationale and per-category change log.

Co-authored-by: Isaac
Why this PR exists
The `databricks-pipelines` skill is the agent-facing reference for building Lakeflow Spark Declarative Pipelines. The initial port from `ai-dev-kit` got the content into the repo; this PR is the second pass that turns it into a tight, internally consistent tool an LLM can rely on in a fresh session: every claim grounded in the current Databricks docs, every workflow with one clear entry point, every concept stated once.

What this PR improves

One authoritative story per topic. SDP/LDP naming, the legacy-DLT migration map, the `start-update` polling pattern with `error.exceptions[0].message` extraction, and the dev/iteration canonical create JSON each now live in exactly one place and are referenced from everywhere else that needs them. The DAB workflows (A and B) and the no-bundle CLI workflow (C) have dedicated detail files; the SKILL.md entry point gives a one-liner per option and links out. The Common Traps / Common Issues split is now intent-driven: Traps cover design-time decisions, Issues cover concrete error → fix mappings agents will copy-paste.

Verified against the current Databricks docs. Where official guidance is nuanced (`CREATE TEMPORARY VIEW` doesn't support `CONSTRAINT` clauses, so `CREATE LIVE VIEW` is retained as the only path for expectations on a temp view), that nuance is captured explicitly with a citation. `STREAM(table)` (function form, with parens) is normalized everywhere for table sources; `STREAM read_files(...)` (no extra parens) for function sources, matching the docs. `@dp.view` is correctly marked as legacy per the official `temporary_view` reference. `sequence_by` is documented as accepting both string and `Column` per the API.

Dense and skimmable. Roughly 40% shorter (~5,700 → ~3,400 lines) with zero concept removed. SQL is canonical where SQL/Python differ only mechanically, with a one-line Python equivalent noted inline. Pattern-variation sections that repeated the same SQL with one different clause each are now single examples with the variations listed. Verbose boilerplate (full `pyproject.toml`, identical second/third examples of the same pattern) leans on agent world-knowledge instead of restating it. Less context to load, less drift between repeated explanations.

Tuned defaults for the loop agents actually run. The canonical `pipelines create` JSON ships with `development: true` and the retry overrides set to `"0"`, so a doomed update fails fast (~30s) instead of retrying for 10+ min on serverless. A clearly-labelled "drop these for production" instruction is right next to the snippet so the iteration defaults can't quietly leak into prod pipelines.

Summary of changes

- `1-project-initialization-with-dab.md` (Workflows A + B), `2-rapid-iteration-with-cli.md` (Workflow C). Running-a-pipeline and refresh-mode guidance now explicitly distinguish bundle-deploy vs CLI-driven paths.
- `FROM STREAM(table)` for table sources, `FROM STREAM read_files(...)` for function sources, `CREATE OR REFRESH` on every example. Fixed the broken `jq` for FAILED-update debugging so it extracts the actual exception body. Resolved contradictions around `@dp.view`, `apply_changes` aliases, `CREATE LIVE VIEW`, `sequence_by`, and `partition_cols`.
- `import dlt`, decorators, reads, `apply_changes`, `LIVE.` prefix, `CREATE LIVE TABLE`, `partition_cols`, `input_file_name()`, and the `target=` parameter, with the `CREATE LIVE VIEW`-for-expectations carve-out explicitly noted.
- … `pyproject.toml` and option-table boilerplate into pointer-style content. ~40% line reduction overall, zero concept lost.
- … `streaming-table.md`, `expectations.md`, etc.): SKILL.md API tables now link straight at the `-python.md` / `-sql.md` siblings, with no broken subdirectory paths.
- `CREATE OR REPLACE` rejection (standard SQL ≠ SDP), `dbfs:` prefix required for UC Volume paths, `CLUSTER BY` column-type rules with the `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED` mapping, the `error.exceptions[0].message` debug-extraction pattern, the upstream-trace protocol for validation failures, and the Gold-layer preserve-dimensions guidance.
- `development: true` + retry overrides for fast-fail iteration, alongside a clearly-labelled prod conversion. Mirrored consistently in the SDK alternative and the workflow file.

Reviewer aid

`REVIEW-NOTES-databricks-pipelines.md` at the repo root walks through every category of change with file/line citations and links to the Databricks docs sections that motivated each correctness fix.

This pull request and its description were written by Isaac.
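
For reviewers who want a concrete picture of two patterns this description mentions, the `error.exceptions[0].message` extraction and the "drop these for production" conversion of the iteration defaults, here is a minimal Python sketch. It is not taken from the skill files. The event shape beyond that one path, the sample values, the helper names, and the retry configuration keys are all assumptions for illustration and should be checked against the skill and the Databricks docs.

```python
import copy
from typing import Any, Dict, Iterable, Optional


def first_exception_message(event: Dict[str, Any]) -> Optional[str]:
    """Return event["error"]["exceptions"][0]["message"], or None if the path is absent."""
    error = event.get("error")
    if not isinstance(error, dict):
        return None
    exceptions = error.get("exceptions") or []
    if not exceptions or not isinstance(exceptions[0], dict):
        return None
    return exceptions[0].get("message")


def to_production_spec(dev_spec: Dict[str, Any], retry_keys: Iterable[str]) -> Dict[str, Any]:
    """Copy a dev/iteration create spec, dropping `development` and the retry overrides."""
    prod = copy.deepcopy(dev_spec)  # never mutate the caller's dev spec
    prod.pop("development", None)
    configuration = prod.get("configuration")
    if isinstance(configuration, dict):
        for key in retry_keys:
            configuration.pop(key, None)
    return prod


# Hypothetical sample data: only the error.exceptions[0].message path comes from the PR text.
SAMPLE_EVENT = {
    "level": "ERROR",
    "error": {
        "exceptions": [
            {"message": "Table or view not found: bronze_orders"},
            {"message": "outer wrapper exception"},
        ]
    },
}

# Assumed names for the retry overrides; the PR text does not spell them out.
ASSUMED_RETRY_KEYS = (
    "pipelines.numUpdateRetryAttempts",
    "pipelines.maxFlowRetryAttempts",
)

# Hypothetical dev/iteration spec with fast-fail defaults.
DEV_SPEC = {
    "name": "demo_pipeline",
    "development": True,
    "configuration": {
        "pipelines.numUpdateRetryAttempts": "0",
        "pipelines.maxFlowRetryAttempts": "0",
        "my.custom.setting": "keep",
    },
}

if __name__ == "__main__":
    print(first_exception_message(SAMPLE_EVENT))
    print(to_production_spec(DEV_SPEC, ASSUMED_RETRY_KEYS))
```

Passing the retry keys in as a parameter keeps the conversion honest about what it does not know: whichever override names the canonical JSON actually uses can be supplied without changing the helper.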