skills(pipelines): port databricks-spark-declarative-pipelines from a-d-k by jamesbroadhead · Pull Request #85 · databricks/databricks-agent-skills

jamesbroadhead · 2026-05-24T21:12:42Z

Summary

Ports the databricks-spark-declarative-pipelines skill from databricks-solutions/ai-dev-kit into stable skills/databricks-pipelines/. Source: databricks-solutions/ai-dev-kit:experimental.

Completes d-a-s PR #73's TODO #5. Pairs with a-d-k PR #546, which tombstones the a-d-k skill once this lands.

Stable's databricks-pipelines already covered the per-feature × per-language API/options surface (decision tree, common traps, format options, dataset/flow/quality references). a-d-k's version covered scaffolding/workflows, configuration, performance tuning, DLT migration, and several streaming patterns + Kafka ingestion + SCD-2 query patterns that stable lacked. This PR adds a-d-k's net-new content as new references/ files; the per-feature reference structure is preserved.

Changes

New `references/`

- dlt-migration.md: both migration paths (DLT Python → SDP Python via pyspark.pipelines, DLT Python → SDP SQL) with side-by-side conversions for decorators, reads, expectations, CDC/SCD, and partitioning → liquid clustering.
- workflows.md: Workflow A/B/C chooser (standalone bundle via databricks pipelines init, pipeline-in-existing-bundle, rapid CLI iteration with no bundle); language-selection rules; start-update + poll-the-update pattern (with the "never poll top-level pipeline state because RETRY_ON_FAILURE flips it back to RUNNING" rationale); edit/re-upload/restart flow; Python SDK alternative.
- pipeline-configuration.md: Full JSON config reference for pipelines create|update (top-level fields, clusters, event_log, notifications, configuration, run_as, restart_window, environment, deployment); variant snippets (dev mode, non-serverless, continuous, notifications, autoscaling, custom event log, serverless Python deps); multi-schema patterns; platform constraints.
- performance.md: Liquid Clustering with per-layer key guidance (bronze/silver/gold); cluster-key type rules; table properties; state-management strategies for streaming; join optimization (stream-to-static, stream-to-stream with time bounds); query optimization; pre-aggregation; compute config; monitoring.
- streaming-patterns.md: Deduplication (by key, with time window, composite); windowed aggregations (tumbling, multi-size, session windows); event-time vs processing-time; rescue-data quarantine (Auto Loader _rescued_data → bronze_quarantine + silver_clean fanout); stream-to-stream join as a pattern; running totals; anomaly detection (rolling z-score outlier flag); end-to-end lag monitoring.
- kafka.md: Basic Kafka read (Python + SQL); JSON payload parsing with explicit schemas; Databricks Secrets SASL/PLAIN auth; mTLS notes; Event Hubs via the Kafka protocol; pipeline-config plumbing for brokers/topics; pointer to sink.md for writing back to Kafka. Fills a full gap: stable's SKILL.md API table listed read_kafka and format("kafka") with no linked skill.
- scd-2-querying.md: __START_AT / __END_AT temporal semantics; current-state materialized views; point-in-time queries with the inclusive-lower / exclusive-upper boundary; per-entity history; period-bounded change analysis; joining facts with historical dimensions (as-of-transaction-time and current-dim variants); pre-filter MV optimization; clustering on (entity_key, __START_AT).
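
The start-update + poll-the-update pattern listed for workflows.md can be illustrated with a minimal sketch. It is not the code from workflows.md: `wait_for_update`, the `get_update_state` callable, the `update-123` id, and the set of terminal state names are all assumptions made for illustration, and the fake state sequence stands in for a real client call. The point it shows is that the loop tracks one specific update by id, so a retry that puts the pipeline back into RUNNING cannot hide that update's failure.

```python
import time
from typing import Callable

# Assumed terminal states for a single update; verify against the API you call.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_update(
    get_update_state: Callable[[str], str],
    update_id: str,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll one update (not the pipeline) until it reaches a terminal state."""
    waited = 0.0
    while True:
        state = get_update_state(update_id)
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"update {update_id} still {state} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Hypothetical stand-in for a real client call, used only to exercise the loop.
_fake_states = iter(["INITIALIZING", "RUNNING", "RUNNING", "FAILED"])
final_state = wait_for_update(
    lambda _update_id: next(_fake_states),
    "update-123",
    sleep=lambda _seconds: None,  # skip real sleeping in this demo
)
print(final_state)
```

In real use, `get_update_state` would wrap whichever CLI or SDK call returns the state of a single update; the injectable `sleep` exists only so the loop can be exercised without waiting.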

`SKILL.md`

New "Choose Your Workflow" and "Language Selection" sections near scaffolding.
Scaffolding section documents both databricks pipelines init (newer, focused) and databricks bundle init lakeflow-pipelines (template-based).
Pipeline API Reference list reorganized: Project & Lifecycle (workflows, configuration, performance, DLT migration) and Datasets, Flows & Quality (the existing per-feature refs + new kafka, scd-2-querying, streaming-patterns).
Version bumped to 0.3.0.

Cross-references in existing references

auto-loader.md → streaming-patterns.md (quarantine), kafka.md, lag monitoring.
auto-cdc.md → scd-2-querying.md for reading SCD-2 history tables.
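
The inclusive-lower / exclusive-upper boundary used for point-in-time reads of SCD-2 history can be sketched in plain Python. This is an illustration only, not content from scd-2-querying.md: the `history` rows, the `customer_id` and `tier` columns, and the `as_of` helper are hypothetical, and it assumes a current row is marked by a null `__END_AT`.

```python
from datetime import datetime
from typing import Optional

# Hypothetical SCD-2 history: one row per version of an entity.
history = [
    {"customer_id": 1, "tier": "bronze",
     "__START_AT": datetime(2024, 1, 1), "__END_AT": datetime(2024, 6, 1)},
    {"customer_id": 1, "tier": "gold",
     "__START_AT": datetime(2024, 6, 1), "__END_AT": None},  # assumed: None = current
]


def as_of(rows: list, customer_id: int, ts: datetime) -> Optional[dict]:
    """Return the version valid at ts, using __START_AT <= ts < __END_AT."""
    for row in rows:
        if row["customer_id"] != customer_id:
            continue
        started = row["__START_AT"] <= ts                        # inclusive lower bound
        not_ended = row["__END_AT"] is None or ts < row["__END_AT"]  # exclusive upper bound
        if started and not_ended:
            return row
    return None


boundary = as_of(history, 1, datetime(2024, 6, 1))    # exactly on the changeover
before = as_of(history, 1, datetime(2024, 5, 31))     # one day earlier
too_early = as_of(history, 1, datetime(2023, 12, 31)) # before any version exists
```

Because the upper bound is exclusive, a timestamp that falls exactly on a changeover matches only the newer version, so a point-in-time lookup never returns two rows for the same key.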

Deliberately dropped from a-d-k

| a-d-k file | Why dropped |
|---|---|
| `references/2-mcp-approach.md` | a-d-k experimental already renamed this to `2-cli-approach.md`; MCP tool refs stripped per d-a-s PR #73 policy. CLI flow now lives in `workflows.md` as Workflow C. |
| `references/python/1-syntax-basics.md`, `references/sql/1-syntax-basics.md` | Covered by stable's `python-basics.md`, `sql-basics.md`, and the per-feature references (streaming-table, materialized-view, temporary-view, view-sql). |
| `references/python/{2,3,4}-*.md`, `references/sql/{2,3,4}-*.md` | Pattern content ported into `streaming-patterns.md`, `kafka.md`, `scd-2-querying.md` (this PR); API/options content already covered by stable's per-feature × per-language references. |
| `scripts/exploration_notebook.py` | Stable convention has no `scripts/` directory under a skill. `databricks pipelines init` generates an `explorations/` folder; users use the CLI or the generated notebook directly. |

Test plan

python3 scripts/skills.py generate clean.
python3 scripts/skills.py validate passes.
Merged origin/main mid-port (resolved version conflict — kept 0.3.0; took main's CLI install command + compatibility bump).
CI green on this branch.
Owner review (@lennartkats-db / @camielstee-db per CODEOWNERS).

This pull request and its description were written by Claude.

Phase 1 of d-a-s #73's TODO #5: port a-d-k's databricks-spark-declarative-pipelines content into stable skills/databricks-pipelines/.

Adds references/dlt-migration.md covering both migration paths (DLT Python → SDP Python via the modern pyspark.pipelines API, and DLT Python → SDP SQL) with side-by-side conversions for decorators, reads, expectations, CDC/SCD, and partitioning → liquid clustering. Source clean: no MCP-tool refs to strip, no docs.databricks.com URLs to rewrite.

SKILL.md updates:
- bump version to 0.2.0
- new "Migrating from DLT" section pointing at the reference

Subsequent phases (separate commits) port the remaining a-d-k content: workflow A/B/C decision matrix (project initialization), per-language performance reference, language-selection rules.

Co-authored-by: Isaac

Phase 2 of the a-d-k → d-a-s port for databricks-spark-declarative-pipelines. Adds three new references that fill the dev-side gaps that stable's per-feature × per-language reference files don't cover:

- references/workflows.md: Workflow A/B/C chooser (standalone bundle via `databricks pipelines init`, pipeline-in-existing-bundle, rapid CLI iteration with no bundle); language selection rules; start-update + poll-the-update pattern with the "never poll top-level pipeline state" rationale; edit/re-upload/restart flow.
- references/pipeline-configuration.md: Full JSON config reference for `pipelines create|update` (top-level fields, clusters, event_log, notifications, configuration, run_as, restart_window, environment, deployment); variant snippets (dev mode, non-serverless, continuous, notifications, autoscaling, custom event log, serverless Python deps); multi-schema patterns; platform constraints.
- references/performance.md: Liquid Clustering with per-layer key guidance (bronze/silver/gold), cluster-key type rules, table properties, state management strategies for streaming, join optimization, query optimization, pre-aggregation, compute config, monitoring.

SKILL.md updates:
- New "Choose Your Workflow" and "Language Selection" sections.
- Scaffolding section documents both `databricks pipelines init` and `databricks bundle init lakeflow-pipelines`.
- Pipeline API Reference list reorganized into Project & Lifecycle and Datasets, Flows & Quality groups.
- Version bumped to 0.3.0.

Deliberately dropped from a-d-k's databricks-spark-declarative-pipelines:
- 2-mcp-approach.md (a-d-k experimental already replaced with 2-cli-approach.md; MCP tool refs removed per PR #73 policy).
- python/{1..4}-*.md and sql/{1..4}-*.md (covered by stable's existing per-feature × per-language refs: python-basics, sql-basics, auto-loader-*, auto-cdc-*, streaming-table-*, sink-*, foreach-batch-sink-*, etc.).
- scripts/exploration_notebook.py (stable convention has no scripts/; users use the CLI directly or the explorations/ folder generated by `pipelines init`).

Source: databricks-solutions/ai-dev-kit@experimental.

Co-authored-by: Isaac

# Conflicts: # manifest.json # skills/databricks-pipelines/SKILL.md

QuentinAmbard · 2026-05-26T13:01:12Z

@jamesbroadhead are you sure you're not merging it from main (now old/outdated) instead of the experimental branch?
We re-worked the experimental branch quite a bit (and removed all mcp for example).

https://github.com/databricks-solutions/ai-dev-kit/tree/experimental/databricks-skills/databricks-spark-declarative-pipelines

QuentinAmbard · 2026-05-26T13:05:18Z

should I suggest an alternative merge on another PR just to see & discuss options?

Phase 3 of the a-d-k → d-a-s port. Closes the remaining customer-facing gaps from the python/{2,3,4} and sql/{2,3,4} reference files that the phase-2 port had judged as "covered" but on audit weren't.

New references:
- streaming-patterns.md: combined SQL + Python. Deduplication (by key, with time window, composite); windowed aggregations (tumbling, multi-size, session windows); event-time vs processing-time guidance; rescue-data quarantine (Auto Loader `_rescued_data` → bronze_quarantine + silver_clean fanout); stream-to-stream join as a pattern with cross-link to performance.md; running totals; anomaly detection (rolling z-score outlier flag); end-to-end lag monitoring.
- kafka.md: combined SQL + Python. Basic Kafka read (`spark.readStream.format("kafka")` and `read_kafka()`); JSON payload parsing with explicit schemas; Databricks Secrets-based SASL/PLAIN auth; mTLS notes; Event Hubs via the Kafka protocol; pipeline-configuration plumbing for brokers/topics; pointer to sink.md for writing back to Kafka. Fills a full gap: stable's SKILL.md API table listed `read_kafka` and `format("kafka")` with no linked skill.
- scd-2-querying.md: combined SQL + Python. `__START_AT` / `__END_AT` temporal semantics; current-state materialized views; point-in-time queries with the inclusive-lower / exclusive-upper boundary; per-entity history; period-bounded change analysis; joining facts with historical dimensions (as-of-transaction-time and current-dim variants); pre-filter MV optimization; clustering on (entity_key, __START_AT).

Cross-references added:
- auto-loader.md → streaming-patterns.md (quarantine), kafka.md, lag monitoring.
- auto-cdc.md → scd-2-querying.md.
- SKILL.md API reference list adds kafka, scd-2-querying, streaming-patterns.

Co-authored-by: Isaac

jamesbroadhead · 2026-05-26T15:34:32Z

Hi Quentin — Claude here, replying on James's behalf.

> are you sure you're not merging it from main (now old/outdated) instead of the experimental branch?

Yes, sourced from experimental, not main. A few concrete checks:

workflows.md Workflow C is the 2-cli-approach.md content (CLI-based rapid iteration with the start-update + poll-the-update rationale). main still has 2-mcp-approach.md — we did not pull from there.
pipeline-configuration.md is a condensed port of experimental's 3-advanced-configuration.md (~12.5 KB on experimental vs ~14.5 KB on main — different content).
The "Choose Your Workflow" A/B/C chooser and the databricks pipelines init + init-config.json flow only exist on experimental's SKILL.md.
All MCP tool refs are stripped (per d-a-s #73 policy), matching experimental's direction.

What's confusing is the file layout. Instead of keeping the numbered guides (1-project-initialization, 2-cli-approach, 3-advanced-configuration, 4-dlt-migration + python/ + sql/ subdirs), the port restructures into stable's per-feature reference convention. That's why filenames don't line up 1:1. Mapping is in the PR description under "New references/" and "Deliberately dropped from a-d-k".

> should I suggest an alternative merge on another PR just to see & discuss options?

Not needed — the source is correct and the reorganization is intentional. Stable already has per-feature × per-language references for streaming-table / materialized-view / auto-loader / auto-cdc / temporary-view / view-sql, so experimental's python/{1..5}-*.md and sql/{1..5}-*.md would have duplicated them. The net-new patterns (streaming-patterns, kafka, scd-2-querying) and project-level content (workflows, pipeline-configuration, performance, dlt-migration) are pulled in as new references.

If you spot specific content from experimental that you think was dropped or distorted, please call it out inline on the relevant file and we'll fix it directly.

@lennartkats-db

That PyPI package is the legacy CLI; the modern CLI is a binary. Per @lennartkats-db review on #90, point readers at the `databricks-core` skill (which has install + auth references) instead.

Two hits fixed:
- experimental/databricks-execution-compute/SKILL.md intro
- skills/databricks-pipelines/references/workflows.md troubleshooting table (carried over from the a-d-k port in #85)

Co-authored-by: Isaac

jamesbroadhead requested review from camielstee-db and lennartkats-db May 24, 2026 21:12

lennartkats-db approved these changes May 26, 2026

jamesbroadhead added 2 commits May 26, 2026 09:38

Merge remote-tracking branch 'origin/main' into jb/pipelines-port-phase1

2372487

# Conflicts: # manifest.json # skills/databricks-pipelines/SKILL.md

jamesbroadhead marked this pull request as ready for review May 26, 2026 09:39

jamesbroadhead requested review from a team and simonfaltum as code owners May 26, 2026 09:39

jamesbroadhead changed the title ~~skills(pipelines): port DLT migration guide from a-d-k (phase 1)~~ skills(pipelines): port databricks-spark-declarative-pipelines from a-d-k May 26, 2026

jamesbroadhead merged commit a4d1f3d into main May 26, 2026
1 check passed

jamesbroadhead merged 4 commits into main from jb/pipelines-port-phase1
