skills(pipelines): port databricks-spark-declarative-pipelines from a-d-k#85
Phase 1 of d-a-s #73's TODO #5: port a-d-k's databricks-spark-declarative-pipelines content into stable skills/databricks-pipelines/.
Adds references/dlt-migration.md covering both migration paths (DLT Python → SDP Python via the modern pyspark.pipelines API, and DLT Python → SDP SQL) with side-by-side conversions for decorators, reads, expectations, CDC/SCD, and partitioning → liquid clustering. Source clean: no MCP-tool refs to strip, no docs.databricks.com URLs to rewrite.
SKILL.md updates:
- bump version to 0.2.0
- new "Migrating from DLT" section pointing at the reference
Subsequent phases (separate commits) port the remaining a-d-k content: workflow A/B/C decision matrix (project initialization), per-language performance reference, language-selection rules.
Co-authored-by: Isaac
Phase 2 of the a-d-k → d-a-s port for databricks-spark-declarative-pipelines. Adds three new references that fill the dev-side gaps that stable's per-feature × per-language reference files don't cover:
- references/workflows.md: Workflow A/B/C chooser (standalone bundle via `databricks pipelines init`, pipeline-in-existing-bundle, rapid CLI iteration with no bundle); language selection rules; start-update + poll-the-update pattern with the "never poll top-level pipeline state" rationale; edit/re-upload/restart flow.
- references/pipeline-configuration.md: Full JSON config reference for `pipelines create|update` (top-level fields, clusters, event_log, notifications, configuration, run_as, restart_window, environment, deployment); variant snippets (dev mode, non-serverless, continuous, notifications, autoscaling, custom event log, serverless Python deps); multi-schema patterns; platform constraints.
- references/performance.md: Liquid Clustering with per-layer key guidance (bronze/silver/gold), cluster-key type rules, table properties, state management strategies for streaming, join optimization, query optimization, pre-aggregation, compute config, monitoring.
SKILL.md updates:
- New "Choose Your Workflow" and "Language Selection" sections.
- Scaffolding section documents both `databricks pipelines init` and `databricks bundle init lakeflow-pipelines`.
- Pipeline API Reference list reorganized into Project & Lifecycle and Datasets, Flows & Quality groups.
- Version bumped to 0.3.0.
Deliberately dropped from a-d-k's databricks-spark-declarative-pipelines:
- 2-mcp-approach.md (a-d-k experimental already replaced with 2-cli-approach.md; MCP tool refs removed per PR #73 policy).
- python/{1..4}-*.md and sql/{1..4}-*.md (covered by stable's existing per-feature × per-language refs: python-basics, sql-basics, auto-loader-*, auto-cdc-*, streaming-table-*, sink-*, foreach-batch-sink-*, etc.).
- scripts/exploration_notebook.py (stable convention has no scripts/; users use the CLI directly or the explorations/ folder generated by `pipelines init`).
Source: databricks-solutions/ai-dev-kit@experimental.
Co-authored-by: Isaac
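The start-update + poll-the-update pattern mentioned above can be sketched roughly as follows. This is an illustration, not the content of workflows.md: `get_update_state` is a hypothetical callable supplied by the caller (for example a thin wrapper around the CLI or SDK call that fetches a single update), and the set of terminal states is an assumption to verify against the Pipelines API documentation. The point it illustrates is that the loop watches one specific update rather than the top-level pipeline state.

```python
import time
from typing import Callable

# Assumed terminal states for a single update; verify against the API docs.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_update(
    get_update_state: Callable[[str, str], str],
    pipeline_id: str,
    update_id: str,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll one specific update until it reaches a terminal state.

    get_update_state is a hypothetical helper that returns the state string
    of the given update, not the state of the pipeline as a whole.
    """
    waited = 0.0
    while True:
        state = get_update_state(pipeline_id, update_id)
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"update {update_id} still {state} after {waited:.0f}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

Injecting `sleep` keeps the loop easy to exercise without real waiting; in normal use the default `time.sleep` applies.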
# Conflicts:
#	manifest.json
#	skills/databricks-pipelines/SKILL.md
@jamesbroadhead are you sure you're not merging it from main (now old/outdated) instead of the experimental branch?
should I suggest an alternative merge on another PR just to see & discuss options?
Phase 3 of the a-d-k → d-a-s port. Closes the remaining customer-facing
gaps from the python/{2,3,4} and sql/{2,3,4} reference files that the
phase-2 port had judged as "covered" but on audit weren't.
New references:
- streaming-patterns.md — combined SQL + Python. Deduplication (by key,
with time window, composite); windowed aggregations (tumbling,
multi-size, session windows); event-time vs processing-time guidance;
rescue-data quarantine (Auto Loader `_rescued_data` → bronze_quarantine
+ silver_clean fanout); stream-to-stream join as a pattern with
cross-link to performance.md; running totals; anomaly detection
(rolling z-score outlier flag); end-to-end lag monitoring.
- kafka.md — combined SQL + Python. Basic Kafka read (`spark.readStream.
format("kafka")` and `read_kafka()`); JSON payload parsing with
explicit schemas; Databricks Secrets-based SASL/PLAIN auth; mTLS notes;
Event Hubs via the Kafka protocol; pipeline-configuration plumbing for
brokers/topics; pointer to sink.md for writing back to Kafka. Fills a
full gap — stable's SKILL.md API table listed `read_kafka` and
`format("kafka")` with no linked skill.
- scd-2-querying.md — combined SQL + Python. `__START_AT` / `__END_AT`
temporal semantics; current-state materialized views; point-in-time
queries with the inclusive-lower / exclusive-upper boundary; per-
entity history; period-bounded change analysis; joining facts with
historical dimensions (as-of-transaction-time and current-dim
variants); pre-filter MV optimization; clustering on
(entity_key, __START_AT).
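As a small illustration of the inclusive-lower / exclusive-upper boundary listed for scd-2-querying.md, the sketch below applies that rule to in-memory rows. It is plain Python rather than the SQL or pipeline code in the reference, and it makes two assumptions: the current row carries `__END_AT` of `None` (standing in for NULL), and `tier` is a hypothetical attribute column.

```python
from datetime import datetime
from typing import Optional


def as_of(rows: list[dict], entity_key: str, ts: datetime) -> Optional[dict]:
    """Return the version of entity_key that was active at ts, if any."""
    for row in rows:
        if row["entity_key"] != entity_key:
            continue
        start, end = row["__START_AT"], row["__END_AT"]
        # Inclusive lower bound, exclusive upper bound.
        # An end of None is assumed to mark the current (open) version.
        if start <= ts and (end is None or ts < end):
            return row
    return None


# Hypothetical history for one entity: silver until 2024-03-01, gold afterwards.
history = [
    {"entity_key": "c1", "tier": "silver",
     "__START_AT": datetime(2024, 1, 1), "__END_AT": datetime(2024, 3, 1)},
    {"entity_key": "c1", "tier": "gold",
     "__START_AT": datetime(2024, 3, 1), "__END_AT": None},
]
```

With these bounds a timestamp that falls exactly on a version change matches only the newer row, so `as_of(history, "c1", datetime(2024, 3, 1))` returns the gold version and never two rows.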
Cross-references added:
- auto-loader.md → streaming-patterns.md (quarantine), kafka.md, lag
monitoring.
- auto-cdc.md → scd-2-querying.md.
- SKILL.md API reference list adds kafka, scd-2-querying, streaming-
patterns.
Co-authored-by: Isaac
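For readers unfamiliar with the "rolling z-score outlier flag" named in the streaming-patterns.md summary, here is a minimal plain-Python sketch of the idea. It is not the code from the reference: the window size, the threshold, and the choice to score each value against the preceding window (excluding the value itself) are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def flag_outliers(values, window=5, threshold=3.0):
    """Yield (value, is_outlier), scoring each value against the prior window."""
    recent = deque(maxlen=window)
    for v in values:
        is_outlier = False
        # Only score once a full window of earlier values is available.
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            # A flat window has sigma == 0; skip flagging instead of dividing by zero.
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                is_outlier = True
        recent.append(v)
        yield v, is_outlier
```

In a streaming pipeline the same computation would typically be expressed with window functions over event time; the generator above only shows the arithmetic.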
Hi Quentin — Claude here, replying on James's behalf.
Yes, sourced from
What's confusing is the file layout. Instead of keeping the numbered guides (
Not needed: the source is correct and the reorganization is intentional. Stable already has per-feature × per-language references for streaming-table / materialized-view / auto-loader / auto-cdc / temporary-view / view-sql, so experimental's
If you spot specific content from experimental that you think was dropped or distorted, please call it out inline on the relevant file and we'll fix it directly.
That PyPI package is the legacy CLI; the modern CLI is a binary. Per @lennartkats-db review on #90, point readers at the `databricks-core` skill (which has install + auth references) instead.
Two hits fixed:
- experimental/databricks-execution-compute/SKILL.md intro
- skills/databricks-pipelines/references/workflows.md troubleshooting table (carried over from the a-d-k port in #85)
Co-authored-by: Isaac
Summary
Ports the `databricks-spark-declarative-pipelines` skill from `databricks-solutions/ai-dev-kit` into stable `skills/databricks-pipelines/`. Source: `databricks-solutions/ai-dev-kit:experimental`.
Completes d-a-s PR #73's TODO #5. Pairs with a-d-k PR #546, which tombstones the a-d-k skill once this lands.
Stable's `databricks-pipelines` already covered the per-feature × per-language API/options surface (decision tree, common traps, format options, dataset/flow/quality references). a-d-k's version covered scaffolding/workflows, configuration, performance tuning, DLT migration, and several streaming patterns + Kafka ingestion + SCD-2 query patterns that stable lacked. This PR adds a-d-k's net-new content as new `references/` files; the per-feature reference structure is preserved.
Changes
New `references/`
- `dlt-migration.md`: both migration paths (DLT Python → SDP Python via `pyspark.pipelines`, DLT Python → SDP SQL) with side-by-side conversions for decorators, reads, expectations, CDC/SCD, and partitioning → liquid clustering.
- `workflows.md`: Workflow A/B/C chooser (standalone bundle via `databricks pipelines init`, pipeline-in-existing-bundle, rapid CLI iteration with no bundle); language-selection rules; start-update + poll-the-update pattern (with the "never poll top-level pipeline state because RETRY_ON_FAILURE flips it back to RUNNING" rationale); edit/re-upload/restart flow; Python SDK alternative.
- `pipeline-configuration.md`: Full JSON config reference for `pipelines create|update` (top-level fields, `clusters`, `event_log`, `notifications`, `configuration`, `run_as`, `restart_window`, `environment`, `deployment`); variant snippets (dev mode, non-serverless, continuous, notifications, autoscaling, custom event log, serverless Python deps); multi-schema patterns; platform constraints.
- `performance.md`: Liquid Clustering with per-layer key guidance (bronze/silver/gold); cluster-key type rules; table properties; state-management strategies for streaming; join optimization (stream-to-static, stream-to-stream with time bounds); query optimization; pre-aggregation; compute config; monitoring.
- `streaming-patterns.md`: Deduplication (by key, with time window, composite); windowed aggregations (tumbling, multi-size, session windows); event-time vs processing-time; rescue-data quarantine (Auto Loader `_rescued_data` → bronze_quarantine + silver_clean fanout); stream-to-stream join as a pattern; running totals; anomaly detection (rolling z-score outlier flag); end-to-end lag monitoring.
- `kafka.md`: Basic Kafka read (Python + SQL); JSON payload parsing with explicit schemas; Databricks Secrets SASL/PLAIN auth; mTLS notes; Event Hubs via the Kafka protocol; pipeline-config plumbing for brokers/topics; pointer to `sink.md` for writing back to Kafka. Fills a full gap: stable's SKILL.md API table listed `read_kafka` and `format("kafka")` with no linked skill.
- `scd-2-querying.md`: `__START_AT` / `__END_AT` temporal semantics; current-state materialized views; point-in-time queries with the inclusive-lower / exclusive-upper boundary; per-entity history; period-bounded change analysis; joining facts with historical dimensions (as-of-transaction-time and current-dim variants); pre-filter MV optimization; clustering on `(entity_key, __START_AT)`.
`SKILL.md`
- Scaffolding section documents both `databricks pipelines init` (newer, focused) and `databricks bundle init lakeflow-pipelines` (template-based).
- Version bumped to `0.3.0`.
Cross-references in existing references
- `auto-loader.md` → `streaming-patterns.md` (quarantine), `kafka.md`, lag monitoring.
- `auto-cdc.md` → `scd-2-querying.md` for reading SCD-2 history tables.
Deliberately dropped from a-d-k
- `references/2-mcp-approach.md`: already replaced in a-d-k experimental with `2-cli-approach.md`; MCP tool refs stripped per d-a-s PR #73 policy. CLI flow now lives in `workflows.md` as Workflow C.
- `references/python/1-syntax-basics.md`, `references/sql/1-syntax-basics.md`: covered by stable's `python-basics.md`, `sql-basics.md`, and the per-feature references (streaming-table, materialized-view, temporary-view, view-sql).
- `references/python/{2,3,4}-*.md`, `references/sql/{2,3,4}-*.md`: customer-facing gaps closed by `streaming-patterns.md`, `kafka.md`, `scd-2-querying.md` (this PR); API/options content already covered by stable's per-feature × per-language references.
- `scripts/exploration_notebook.py`: stable convention has no `scripts/` directory under a skill. `databricks pipelines init` generates an `explorations/` folder; users use the CLI or the generated notebook directly.
Test plan
- `python3 scripts/skills.py generate` clean.
- `python3 scripts/skills.py validate` passes.
- `origin/main` mid-port (resolved version conflict: kept `0.3.0`; took main's CLI install command + compatibility bump).
- @lennartkats-db / @camielstee-db per CODEOWNERS).
This pull request and its description were written by Claude.