Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 10 additions & 5 deletions manifest.json
Original file line number Diff line number Diff line change
Expand Up @@ -55,8 +55,7 @@
"references/deploy-and-run.md",
"references/resource-permissions.md",
"references/sdp-pipelines.md"
],
"base_revision": "e742f36e8ab1"
]
},
"databricks-jobs": {
"version": "0.2.0",
Expand Down Expand Up @@ -104,7 +103,7 @@
]
},
"databricks-pipelines": {
"version": "0.1.1",
"version": "0.3.0",
"description": "Databricks Spark Declarative Pipelines (SDP) for ETL and streaming",
"repo_dir": "skills",
"files": [
Expand All @@ -118,11 +117,13 @@
"references/auto-loader-python.md",
"references/auto-loader-sql.md",
"references/auto-loader.md",
"references/dlt-migration.md",
"references/expectations-python.md",
"references/expectations-sql.md",
"references/expectations.md",
"references/foreach-batch-sink-python.md",
"references/foreach-batch-sink.md",
"references/kafka.md",
"references/materialized-view-python.md",
"references/materialized-view-sql.md",
"references/materialized-view.md",
Expand All @@ -133,10 +134,14 @@
"references/options-parquet.md",
"references/options-text.md",
"references/options-xml.md",
"references/performance.md",
"references/pipeline-configuration.md",
"references/python-basics.md",
"references/scd-2-querying.md",
"references/sink-python.md",
"references/sink.md",
"references/sql-basics.md",
"references/streaming-patterns.md",
"references/streaming-table-python.md",
"references/streaming-table-sql.md",
"references/streaming-table.md",
Expand All @@ -145,9 +150,9 @@
"references/temporary-view.md",
"references/view-sql.md",
"references/view.md",
"references/workflows.md",
"references/write-spark-declarative-pipelines.md"
],
"base_revision": "5c4b4fb0a82a"
]
},
"databricks-serverless-migration": {
"version": "0.1.0",
Expand Down
68 changes: 65 additions & 3 deletions skills/databricks-pipelines/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ name: databricks-pipelines
description: Develop Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables) on Databricks. Use when building batch or streaming data pipelines with Python or SQL. Invoke BEFORE starting implementation.
compatibility: Requires databricks CLI (>= v1.0.0)
metadata:
version: "0.1.1"
version: "0.3.0"
parent: databricks-core
---

Expand Down Expand Up @@ -168,19 +168,69 @@ Some features require reading multiple skills together:

Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables / DLT) is a framework for building batch and streaming data pipelines.

## Migrating from DLT

If you have an existing DLT pipeline (`import dlt`, `@dlt.table`, `dlt.read(...)`, `dlt.apply_changes(...)`) and want to move to SDP, see [references/dlt-migration.md](references/dlt-migration.md). It covers both migration paths — DLT Python → SDP Python (`from pyspark import pipelines as dp`) and DLT Python → SDP SQL — with side-by-side conversions for the table decorators, reads, expectations, CDC/SCD, and partitioning → liquid clustering.

## Choose Your Workflow

Three project shapes exist — pick before scaffolding:

| Situation | Workflow |
|-----------|----------|
| New standalone pipeline project with its own bundle | **A. Standalone bundle** |
| Pipeline added to an existing DAB project | **B. Existing bundle** |
| Quick prototyping, no bundle (yet) | **C. Rapid CLI iteration** |

Default to A for production-bound work and C for exploration. Full details, generated structures, polling patterns, and edit/re-upload flow in [references/workflows.md](references/workflows.md).

## Language Selection (Python vs SQL)

Decide before scaffolding — the choice picks template files (`.py` vs `.sql`) and which reference docs apply. Both can coexist, but pick a primary.

| User signal | Pick |
|-------------|------|
| "Python pipeline", UDF, pandas, ML inference, pyspark | **Python** |
| "SQL pipeline", "SQL files" | **SQL** |
| "Simple pipeline", "create a table", "an aggregation" | **SQL** (simpler) |
| Complex parameterized logic, custom UDFs, ML | **Python** |

If ambiguous, ask. Stick with the chosen language unless the user explicitly switches.

## Scaffolding a New Pipeline Project

Use `databricks bundle init` with a config file to scaffold non-interactively. This creates a project in the `<project_name>/` directory:
The newer `databricks pipelines init` is focused on pipeline projects:

```bash
databricks pipelines init --output-dir . --config-file init-config.json
```

`init-config.json`:

```json
{
"project_name": "my_pipeline",
"initial_catalog": "prod_catalog",
"use_personal_schema": "no",
"initial_language": "sql"
}
```

The template-based `databricks bundle init lakeflow-pipelines` also works:

```bash
databricks bundle init lakeflow-pipelines --config-file <(echo '{"project_name": "my_pipeline", "language": "python", "serverless": "yes"}') --profile <PROFILE> < /dev/null
```

Field constraints:

- `project_name`: letters, numbers, underscores only
- `language`: `python` or `sql`. Ask the user which they prefer:
- `language` / `initial_language`: `python` or `sql` (lowercase)
- SQL: Recommended for straightforward transformations (filters, joins, aggregations)
- Python: Recommended for complex logic (custom UDFs, ML, advanced processing)

See [references/workflows.md](references/workflows.md) for the full generated structure, `databricks.yml` essentials, and per-target catalog/schema patterns.

After scaffolding, create `CLAUDE.md` and `AGENTS.md` in the project directory. These files are essential to provide agents with guidance on how to work with the project. Use this content:

```
Expand Down Expand Up @@ -259,13 +309,25 @@ resources:

Detailed reference guides for each pipeline API. **Read the relevant guide before writing pipeline code.**

### Project & Lifecycle

- [Workflows](references/workflows.md) — Standalone bundle / existing bundle / rapid CLI iteration; language selection; `pipelines init`; start-update + poll-the-update pattern; edit/re-upload/restart flow
- [Pipeline Configuration](references/pipeline-configuration.md) — Full JSON config reference (top-level, clusters, event_log, notifications, configuration, restart_window, environment) + variant snippets (dev mode, non-serverless, continuous, notifications, autoscaling, custom event log, serverless Python deps) + multi-schema patterns + platform constraints
- [Performance Tuning](references/performance.md) — Liquid Clustering by layer (bronze/silver/gold), key-type rules, state-management strategies for streaming, join optimization, pre-aggregation, monitoring
- [Migrating from DLT](references/dlt-migration.md) — Side-by-side conversions (decorators, reads, expectations, CDC/SCD, partitioning → liquid clustering)

### Datasets, Flows & Quality

- [Write Spark Declarative Pipelines](references/write-spark-declarative-pipelines.md) — Core syntax and rules ([Python](references/python-basics.md), [SQL](references/sql-basics.md))
- [Streaming Tables](references/streaming-table.md) — Continuous data stream processing ([Python](references/streaming-table-python.md), [SQL](references/streaming-table-sql.md))
- [Materialized Views](references/materialized-view.md) — Physically stored query results with incremental refresh ([Python](references/materialized-view-python.md), [SQL](references/materialized-view-sql.md))
- [Views](references/view.md) — Reusable query logic published to Unity Catalog ([SQL](references/view-sql.md))
- [Temporary Views](references/temporary-view.md) — Pipeline-private views ([Python](references/temporary-view-python.md), [SQL](references/temporary-view-sql.md))
- [Auto Loader](references/auto-loader.md) — Incrementally ingest files from cloud storage ([Python](references/auto-loader-python.md), [SQL](references/auto-loader-sql.md))
- [Kafka Ingestion](references/kafka.md) — Read from Kafka / Event Hubs with JSON parsing, Secrets-based auth
- [Auto CDC](references/auto-cdc.md) — Process Change Data Capture feeds, SCD Type 1 & 2 ([Python](references/auto-cdc-python.md), [SQL](references/auto-cdc-sql.md))
- [SCD Type 2 Querying](references/scd-2-querying.md) — Current-state views, point-in-time queries, joining facts with historical dimensions
- [Streaming Patterns](references/streaming-patterns.md) — Deduplication, windowed aggregations (tumbling/multi-size/session), late-arriving data, rescue-data quarantine, monitoring lag, anomaly detection
- [Expectations](references/expectations.md) — Define and enforce data quality constraints ([Python](references/expectations-python.md), [SQL](references/expectations-sql.md))
- [Sinks](references/sink.md) — Write to Kafka, Event Hubs, external Delta tables ([Python](references/sink-python.md))
- [ForEachBatch Sinks](references/foreach-batch-sink.md) — Custom streaming sink with per-batch Python logic ([Python](references/foreach-batch-sink-python.md))
4 changes: 4 additions & 0 deletions skills/databricks-pipelines/references/auto-cdc.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,4 +18,8 @@ For detailed implementation guides:
- **Python**: [auto-cdc-python.md](auto-cdc-python.md)
- **SQL**: [auto-cdc-sql.md](auto-cdc-sql.md)

## Reading SCD Type 2 Tables

For querying the history tables produced by SCD Type 2 (`__START_AT` / `__END_AT`), point-in-time queries, change analysis, and joining facts with historical dimensions, see [scd-2-querying.md](scd-2-querying.md).

**Note**: The API is also known as `applyChanges` in some contexts.
6 changes: 6 additions & 0 deletions skills/databricks-pipelines/references/auto-loader.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,12 @@ For detailed implementation guides:
- **Python**: [auto-loader-python.md](auto-loader-python.md)
- **SQL**: [auto-loader-sql.md](auto-loader-sql.md)

## Related Patterns

- [Rescue-data quarantine](streaming-patterns.md#rescue-data-quarantine) — route rows where Auto Loader rescued malformed fields to a side table instead of dropping them.
- [Kafka ingestion](kafka.md) — for message-bus sources (Kafka, Event Hubs).
- [Monitoring lag](streaming-patterns.md#monitoring-lag) — track end-to-end freshness.

## Format-Specific Options

For format-specific configuration options, refer to:
Expand Down
Loading
Loading