skills(pipelines): port databricks-spark-declarative-pipelines from a-d-k #85

Merged
jamesbroadhead merged 4 commits into main from jb/pipelines-port-phase1
May 26, 2026

@jamesbroadhead jamesbroadhead commented May 24, 2026

Summary

Ports the databricks-spark-declarative-pipelines skill from databricks-solutions/ai-dev-kit into stable skills/databricks-pipelines/. Source: databricks-solutions/ai-dev-kit:experimental.

Completes d-a-s PR #73's TODO #5. Pairs with a-d-k PR #546, which tombstones the a-d-k skill once this lands.

Stable's databricks-pipelines already covered the per-feature × per-language API/options surface (decision tree, common traps, format options, dataset/flow/quality references). a-d-k's version covered scaffolding/workflows, configuration, performance tuning, DLT migration, and several streaming patterns + Kafka ingestion + SCD-2 query patterns that stable lacked. This PR adds a-d-k's net-new content as new references/ files; the per-feature reference structure is preserved.

Changes

New references/

  • dlt-migration.md — both migration paths (DLT Python → SDP Python via pyspark.pipelines, DLT Python → SDP SQL) with side-by-side conversions for decorators, reads, expectations, CDC/SCD, and partitioning → liquid clustering.
  • workflows.md — Workflow A/B/C chooser (standalone bundle via databricks pipelines init, pipeline-in-existing-bundle, rapid CLI iteration with no bundle); language-selection rules; start-update + poll-the-update pattern (with the "never poll top-level pipeline state because RETRY_ON_FAILURE flips it back to RUNNING" rationale); edit/re-upload/restart flow; Python SDK alternative.
  • pipeline-configuration.md — Full JSON config reference for pipelines create|update (top-level fields, clusters, event_log, notifications, configuration, run_as, restart_window, environment, deployment); variant snippets (dev mode, non-serverless, continuous, notifications, autoscaling, custom event log, serverless Python deps); multi-schema patterns; platform constraints.
  • performance.md — Liquid Clustering with per-layer key guidance (bronze/silver/gold); cluster-key type rules; table properties; state-management strategies for streaming; join optimization (stream-to-static, stream-to-stream with time bounds); query optimization; pre-aggregation; compute config; monitoring.
  • streaming-patterns.md — Deduplication (by key, with time window, composite); windowed aggregations (tumbling, multi-size, session windows); event-time vs processing-time; rescue-data quarantine (Auto Loader _rescued_data → bronze_quarantine + silver_clean fanout); stream-to-stream join as a pattern; running totals; anomaly detection (rolling z-score outlier flag); end-to-end lag monitoring.
  • kafka.md — Basic Kafka read (Python + SQL); JSON payload parsing with explicit schemas; Databricks Secrets SASL/PLAIN auth; mTLS notes; Event Hubs via the Kafka protocol; pipeline-config plumbing for brokers/topics; pointer to sink.md for writing back to Kafka. Fills a full gap — stable's SKILL.md API table listed read_kafka and format("kafka") with no linked skill.
  • scd-2-querying.md — __START_AT / __END_AT temporal semantics; current-state materialized views; point-in-time queries with the inclusive-lower / exclusive-upper boundary; per-entity history; period-bounded change analysis; joining facts with historical dimensions (as-of-transaction-time and current-dim variants); pre-filter MV optimization; clustering on (entity_key, __START_AT).
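
The inclusive-lower / exclusive-upper boundary mentioned for scd-2-querying.md can be illustrated in plain Python. This is only a sketch of the predicate, not the content of the reference file: the `customer_id` and `tier` columns and the `as_of` helper are hypothetical, and it assumes the current version of a row carries a null `__END_AT`.

```python
from datetime import datetime


def as_of(rows, ts):
    """Return the row versions valid at ts: __START_AT <= ts < __END_AT.

    A version whose __END_AT is None is treated as still open (assumption).
    """
    return [
        r
        for r in rows
        if r["__START_AT"] <= ts and (r["__END_AT"] is None or ts < r["__END_AT"])
    ]


# Hypothetical SCD-2 history for one entity.
history = [
    {"customer_id": 1, "tier": "silver",
     "__START_AT": datetime(2024, 1, 1), "__END_AT": datetime(2024, 6, 1)},
    {"customer_id": 1, "tier": "gold",
     "__START_AT": datetime(2024, 6, 1), "__END_AT": None},
]

# Exactly on the boundary, only the newer version matches.
on_boundary = as_of(history, datetime(2024, 6, 1))
print([r["tier"] for r in on_boundary])
```

Because the upper bound is exclusive, a timestamp equal to a change point matches one version rather than two, which is what keeps point-in-time joins from duplicating fact rows.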

SKILL.md

  • New "Choose Your Workflow" and "Language Selection" sections near scaffolding.
  • Scaffolding section documents both databricks pipelines init (newer, focused) and databricks bundle init lakeflow-pipelines (template-based).
  • Pipeline API Reference list reorganized: Project & Lifecycle (workflows, configuration, performance, DLT migration) and Datasets, Flows & Quality (the existing per-feature refs + new kafka, scd-2-querying, streaming-patterns).
  • Version bumped to 0.3.0.

Cross-references in existing references

  • auto-loader.md → streaming-patterns.md (quarantine), kafka.md, lag monitoring.
  • auto-cdc.md → scd-2-querying.md for reading SCD-2 history tables.
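
The quarantine pattern that auto-loader.md now links to fans rows out by whether Auto Loader populated `_rescued_data`. The routing rule alone can be sketched in plain Python; the `split_rescued` helper and the sample rows are hypothetical, and the real pattern would express this as two pipeline datasets (bronze_quarantine and silver_clean) rather than lists.

```python
def split_rescued(rows):
    """Route rows with a non-null _rescued_data to quarantine, the rest to clean."""
    clean, quarantine = [], []
    for row in rows:
        if row.get("_rescued_data") is not None:
            quarantine.append(row)
        else:
            clean.append(row)
    return clean, quarantine


# Hypothetical ingested rows; the second one had a field that did not fit the schema.
rows = [
    {"id": 1, "amount": 10.0, "_rescued_data": None},
    {"id": 2, "amount": None, "_rescued_data": '{"amount": "ten"}'},
]
silver_clean, bronze_quarantine = split_rescued(rows)
print(len(silver_clean), len(bronze_quarantine))
```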

Deliberately dropped from a-d-k

| a-d-k file | Why dropped |
| --- | --- |
| references/2-mcp-approach.md | a-d-k experimental already renamed this to 2-cli-approach.md; MCP tool refs stripped per d-a-s PR #73 policy. CLI flow now lives in workflows.md as Workflow C. |
| references/python/1-syntax-basics.md, references/sql/1-syntax-basics.md | Covered by stable's python-basics.md, sql-basics.md, and the per-feature references (streaming-table, materialized-view, temporary-view, view-sql). |
| references/python/{2,3,4}-*.md, references/sql/{2,3,4}-*.md | Pattern content ported into streaming-patterns.md, kafka.md, scd-2-querying.md (this PR); API/options content already covered by stable's per-feature × per-language references. |
| scripts/exploration_notebook.py | Stable convention has no scripts/ directory under a skill. databricks pipelines init generates an explorations/ folder; users use the CLI or the generated notebook directly. |

Test plan

  • python3 scripts/skills.py generate clean.
  • python3 scripts/skills.py validate passes.
  • Merged origin/main mid-port (resolved version conflict — kept 0.3.0; took main's CLI install command + compatibility bump).
  • CI green on this branch.
  • Owner review (@lennartkats-db / @camielstee-db per CODEOWNERS).

This pull request and its description were written by Claude.

Phase 1 of d-a-s #73's TODO #5 — port a-d-k's
databricks-spark-declarative-pipelines content into stable
skills/databricks-pipelines/. Adds references/dlt-migration.md
covering both migration paths (DLT Python → SDP Python via the modern
pyspark.pipelines API, and DLT Python → SDP SQL) with side-by-side
conversions for decorators, reads, expectations, CDC/SCD, and
partitioning → liquid clustering.

Source clean — no MCP-tool refs to strip, no docs.databricks.com URLs
to rewrite.

SKILL.md updates:
- bump version to 0.2.0
- new "Migrating from DLT" section pointing at the reference

Subsequent phases (separate commits) port the remaining a-d-k content:
workflow A/B/C decision matrix (project initialization), per-language
performance reference, language-selection rules.

Co-authored-by: Isaac
Phase 2 of the a-d-k → d-a-s port for databricks-spark-declarative-pipelines.

Adds three new references that fill the dev-side gaps that stable's per-feature
× per-language reference files don't cover:

- references/workflows.md — Workflow A/B/C chooser (standalone bundle via
  `databricks pipelines init`, pipeline-in-existing-bundle, rapid CLI iteration
  with no bundle); language selection rules; start-update + poll-the-update
  pattern with the "never poll top-level pipeline state" rationale; edit/
  re-upload/restart flow.
- references/pipeline-configuration.md — Full JSON config reference for
  `pipelines create|update` (top-level fields, clusters, event_log,
  notifications, configuration, run_as, restart_window, environment,
  deployment); variant snippets (dev mode, non-serverless, continuous,
  notifications, autoscaling, custom event log, serverless Python deps);
  multi-schema patterns; platform constraints.
- references/performance.md — Liquid Clustering with per-layer key guidance
  (bronze/silver/gold), cluster-key type rules, table properties, state
  management strategies for streaming, join optimization, query optimization,
  pre-aggregation, compute config, monitoring.
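
The start-update + poll-the-update pattern listed for workflows.md can be sketched as a small polling loop. This is an illustration under assumptions, not the code in the reference: `wait_for_update`, its parameters, and the set of terminal state names are hypothetical, and `get_update_state` stands in for whatever CLI or SDK call returns the state of one specific update. The point is that the loop watches the update itself, not the top-level pipeline state, which the PR description notes can flip back to RUNNING under RETRY_ON_FAILURE.

```python
import time
from typing import Callable

# Update states treated as terminal in this sketch; the names are an assumption.
TERMINAL_STATES = frozenset({"COMPLETED", "FAILED", "CANCELED"})


def wait_for_update(
    get_update_state: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll one specific update until it reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_update_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"update still {state} after {timeout_s}s")
        sleep(interval_s)


# Stand-in for a real "get update" call, so the sketch runs offline.
fake_states = iter(["QUEUED", "INITIALIZING", "RUNNING", "FAILED"])
final_state = wait_for_update(lambda: next(fake_states), sleep=lambda _s: None)
print(final_state)
```

Injecting the state getter and the sleep function keeps the loop independent of any particular client and makes it testable without a workspace.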

SKILL.md updates:
- New "Choose Your Workflow" and "Language Selection" sections.
- Scaffolding section documents both `databricks pipelines init` and
  `databricks bundle init lakeflow-pipelines`.
- Pipeline API Reference list reorganized into Project & Lifecycle and
  Datasets, Flows & Quality groups.
- Version bumped to 0.3.0.

Deliberately dropped from a-d-k's databricks-spark-declarative-pipelines:
- 2-mcp-approach.md (a-d-k experimental already replaced with 2-cli-approach.md
  — MCP tool refs removed per PR #73 policy).
- python/{1..4}-*.md and sql/{1..4}-*.md (covered by stable's existing per-
  feature × per-language refs: python-basics, sql-basics, auto-loader-*,
  auto-cdc-*, streaming-table-*, sink-*, foreach-batch-sink-*, etc.).
- scripts/exploration_notebook.py (stable convention has no scripts/; users
  use the CLI directly or the explorations/ folder generated by `pipelines
  init`).

Source: databricks-solutions/ai-dev-kit@experimental.

Co-authored-by: Isaac
# Conflicts:
#	manifest.json
#	skills/databricks-pipelines/SKILL.md
@jamesbroadhead jamesbroadhead marked this pull request as ready for review May 26, 2026 09:39
@jamesbroadhead jamesbroadhead requested review from a team and simonfaltum as code owners May 26, 2026 09:39
@jamesbroadhead jamesbroadhead changed the title from "skills(pipelines): port DLT migration guide from a-d-k (phase 1)" to "skills(pipelines): port databricks-spark-declarative-pipelines from a-d-k" May 26, 2026
@QuentinAmbard

@jamesbroadhead are you sure you're not merging it from main (now old/outdated) instead of the experimental branch?
We re-worked the experimental branch quite a bit (and removed all mcp for example).

https://github.com/databricks-solutions/ai-dev-kit/tree/experimental/databricks-skills/databricks-spark-declarative-pipelines

@QuentinAmbard

should I suggest an alternative merge on another PR just to see & discuss options?

Phase 3 of the a-d-k → d-a-s port. Closes the remaining customer-facing
gaps from the python/{2,3,4} and sql/{2,3,4} reference files that the
phase-2 port had judged as "covered" but on audit weren't.

New references:

- streaming-patterns.md — combined SQL + Python. Deduplication (by key,
  with time window, composite); windowed aggregations (tumbling,
  multi-size, session windows); event-time vs processing-time guidance;
  rescue-data quarantine (Auto Loader `_rescued_data` → bronze_quarantine
  + silver_clean fanout); stream-to-stream join as a pattern with
  cross-link to performance.md; running totals; anomaly detection
  (rolling z-score outlier flag); end-to-end lag monitoring.

- kafka.md — combined SQL + Python. Basic Kafka read (`spark.readStream.
  format("kafka")` and `read_kafka()`); JSON payload parsing with
  explicit schemas; Databricks Secrets-based SASL/PLAIN auth; mTLS notes;
  Event Hubs via the Kafka protocol; pipeline-configuration plumbing for
  brokers/topics; pointer to sink.md for writing back to Kafka. Fills a
  full gap — stable's SKILL.md API table listed `read_kafka` and
  `format("kafka")` with no linked skill.

- scd-2-querying.md — combined SQL + Python. `__START_AT` / `__END_AT`
  temporal semantics; current-state materialized views; point-in-time
  queries with the inclusive-lower / exclusive-upper boundary; per-
  entity history; period-bounded change analysis; joining facts with
  historical dimensions (as-of-transaction-time and current-dim
  variants); pre-filter MV optimization; clustering on
  (entity_key, __START_AT).
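
The rolling z-score outlier flag named under streaming-patterns.md above can be illustrated with a short, library-free sketch. It is not the implementation in the reference file: the `flag_outliers` helper, the window size, the threshold, and the choice of a trailing window that excludes the current value with a population standard deviation are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev


def flag_outliers(values, window=5, threshold=3.0):
    """Flag each value whose z-score against the previous `window` values exceeds threshold."""
    history = deque(maxlen=window)
    flags = []
    for v in values:
        is_outlier = False
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A flat window has sigma == 0; skip the check to avoid dividing by zero.
            if sigma > 0:
                is_outlier = abs(v - mu) / sigma > threshold
        flags.append(is_outlier)
        history.append(v)
    return flags


# Hypothetical sensor readings with one spike.
readings = [10.0, 11.0, 10.0, 9.0, 10.0, 50.0, 10.0]
flags = flag_outliers(readings)
print(flags)
```

Values seen before the window fills are never flagged here; a streaming version would also need to bound state per key, which this sketch does not attempt.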

Cross-references added:

- auto-loader.md → streaming-patterns.md (quarantine), kafka.md, lag
  monitoring.
- auto-cdc.md → scd-2-querying.md.
- SKILL.md API reference list adds kafka, scd-2-querying, streaming-
  patterns.

Co-authored-by: Isaac
@jamesbroadhead

Hi Quentin — Claude here, replying on James's behalf.

> are you sure you're not merging it from main (now old/outdated) instead of the experimental branch?

Yes, sourced from experimental, not main. A few concrete checks:

  • workflows.md Workflow C is the 2-cli-approach.md content (CLI-based rapid iteration with the start-update + poll-the-update rationale). main still has 2-mcp-approach.md — we did not pull from there.
  • pipeline-configuration.md is a condensed port of experimental's 3-advanced-configuration.md (~12.5 KB on experimental vs ~14.5 KB on main — different content).
  • The "Choose Your Workflow" A/B/C chooser and the databricks pipelines init + init-config.json flow only exist on experimental's SKILL.md.
  • All MCP tool refs are stripped (per d-a-s #73 policy), matching experimental's direction.

What's confusing is the file layout. Instead of keeping the numbered guides (1-project-initialization, 2-cli-approach, 3-advanced-configuration, 4-dlt-migration + python/ + sql/ subdirs), the port restructures into stable's per-feature reference convention. That's why filenames don't line up 1:1. Mapping is in the PR description under "New references/" and "Deliberately dropped from a-d-k".

> should I suggest an alternative merge on another PR just to see & discuss options?

Not needed — the source is correct and the reorganization is intentional. Stable already has per-feature × per-language references for streaming-table / materialized-view / auto-loader / auto-cdc / temporary-view / view-sql, so experimental's python/{1..5}-*.md and sql/{1..5}-*.md would have duplicated them. The net-new patterns (streaming-patterns, kafka, scd-2-querying) and project-level content (workflows, pipeline-configuration, performance, dlt-migration) are pulled in as new references.

If you spot specific content from experimental that you think was dropped or distorted, please call it out inline on the relevant file and we'll fix it directly.

@jamesbroadhead jamesbroadhead merged commit a4d1f3d into main May 26, 2026
1 check passed
jamesbroadhead added a commit that referenced this pull request May 26, 2026
That PyPI package is the legacy CLI; the modern CLI is a binary. Per
@lennartkats-db review on #90, point readers at the `databricks-core`
skill (which has install + auth references) instead.

Two hits fixed:
- experimental/databricks-execution-compute/SKILL.md intro
- skills/databricks-pipelines/references/workflows.md troubleshooting
  table (carried over from the a-d-k port in #85)

Co-authored-by: Isaac