diff --git a/experimental/databricks-lakeflow-connect/SKILL.md b/experimental/databricks-lakeflow-connect/SKILL.md new file mode 100644 index 0000000..7463195 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/SKILL.md @@ -0,0 +1,199 @@ +--- +name: databricks-lakeflow-connect +description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing." +--- + +# Lakeflow Connect + +Build managed ingestion pipelines that pull from SaaS apps and databases into Unity Catalog Delta tables, governed end-to-end and powered by serverless Lakeflow Spark Declarative Pipelines. + +**Status:** mixed catalog as of May 2026 — 9 GA connectors, plus a Public Preview / Beta / Private Preview pipeline that ships new sources monthly. + +**Documentation:** +- [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect) +- [Connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) +- [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect) + +--- + +## What Is Lakeflow Connect? + +Managed connectors for ingesting data from SaaS applications and databases. The resulting ingestion pipeline is governed by Unity Catalog and powered by serverless compute and Lakeflow Spark Declarative Pipelines. + +Three frames to keep in mind: + +- **Simple and low-maintenance** — no client code to write, no message bus to operate; connector + UC Connection + a serverless pipeline. +- **Unified with the lakehouse** — credentials stored in UC, output is governed Delta, runs on Jobs and SDP like any other workload. +- **Efficient incremental processing** — change tracking / CDC / schema evolution / retries are built in. + +There are four architecture patterns: + +1. **SaaS pull** — connector reads from an external SaaS via OAuth or API key, lands in a streaming Delta table. +2. **Database CDC via gateway** — an ingestion gateway runs in the customer's network, stages change events to a UC Volume, a serverless ingestion pipeline applies them as CDC into Delta. +3. **Query-based** — for sources without native CDC (Oracle / Teradata / SQL Server / PG / MySQL query-based, Snowflake / Redshift / Synapse / BigQuery via Foreign Catalog), the connector issues periodic queries instead of subscribing to a change feed. +4. **Community connectors** — template-based, out of scope for this skill. + +--- + +## Connector catalog + +Lakeflow Connect ships connectors at multiple release stages. **GA** and **Public Preview** connectors are production-supported; **Beta** and **Private Preview** are early-access and not production-supported. + +### GA connectors + +Full coverage in this skill. + +| Source | Type | Auth | Reference | +|--------|------|------|-----------| +| Salesforce (Sales / Service / etc.) | SaaS pull | OAuth U2M | [1-saas-connectors.md](references/1-saas-connectors.md) | +| Workday Reports (RaaS) | SaaS pull | OAuth refresh token / basic | [1-saas-connectors.md](references/1-saas-connectors.md) | +| ServiceNow | SaaS pull | OAuth U2M / basic | [1-saas-connectors.md](references/1-saas-connectors.md) | +| Google Analytics 4 | SaaS pull (via BigQuery) | Service-account JSON | [1-saas-connectors.md](references/1-saas-connectors.md) | +| HubSpot | SaaS pull | OAuth | [1-saas-connectors.md](references/1-saas-connectors.md) | +| Confluence | SaaS pull | OAuth | [1-saas-connectors.md](references/1-saas-connectors.md) | +| SQL Server (cloud) | Database CDC | DB user + change tracking / CDC | [2-database-connectors.md](references/2-database-connectors.md) | +| SQL Server (on-prem) | Database CDC | DB user + ExpressRoute / Direct Connect | [2-database-connectors.md](references/2-database-connectors.md) | +| Zerobus Ingest | Push (gRPC) | Service principal | See [databricks-zerobus-ingest](../databricks-zerobus-ingest/SKILL.md) | + +### Public Preview connectors + +Production-supported. Configuration may evolve before GA. Deep coverage is being added incrementally; until then, see the [public connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for current setup steps. + +| Source | Type | Auth | GA target | +|--------|------|------|-----------| +| NetSuite | SaaS pull | OAuth | May 31, 2026 | +| Dynamics 365 | SaaS pull | OAuth | May 31, 2026 | +| PostgreSQL CDC | Database CDC | DB user + gateway | Jun 30, 2026 (tentative); ungated PuPr May 29 | +| MySQL CDC | Database CDC | DB user + gateway | Jul 15, 2026 (tentative); ungated PuPr May 29 | +| Oracle / Teradata / SQL Server / PG / MySQL (query-based) | Database query | DB user | Jun 30, 2026 | +| Snowflake / Redshift / Synapse / BigQuery (Foreign Catalog) | Database query | Foreign Catalog | Jun 30, 2026 | +| SFTP | File pull | Key / password | Jun 30, 2026 | + +### Beta and Private Preview + +Early-access connectors are not production-supported. The list changes month to month; check the [public connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for current availability. + +For the Lakeflow-Connect-vs-Auto-Loader-vs-Federation-vs-Delta-Sharing decision, see [4-ingestion-decision-tree.md](references/4-ingestion-decision-tree.md). + +--- + +## Required Tools + +- **Databricks CLI v1.0.0+** for `databricks pipelines create` and `databricks connections create`. Verify with `databricks --version`. +- **Databricks SDK for Python** (`databricks-sdk>=0.85.0`) if you prefer SDK over CLI. +- **Databricks Asset Bundles** if authoring as IaC (recommended for any pipeline that ships to a customer environment). + +No extra connector-specific SDK is needed. Lakeflow Connect reuses the pipelines API surface — pipelines are created with an `ingestion_definition` block instead of a `libraries` block, but the API and CLI are otherwise the same. + +--- + +## Prerequisites + +Confirm before creating any pipeline: + +1. **A Unity Catalog target** — catalog and schema must exist; the service principal or user creating the pipeline needs `USE CATALOG`, `USE SCHEMA`, `CREATE TABLE`, and `MODIFY` on the target schema. +2. **A UC `CONNECTION` object** with credentials for the source. SaaS OAuth U2M connections must be created via the UI (Catalog Explorer); API-key and basic-auth connections can be created via CLI / DAB. +3. **For database connectors**: network reachability between the gateway (classic compute, customer VPC) and the source database. On-prem requires ExpressRoute (Azure) or Direct Connect (AWS). +4. **For file connectors**: OAuth scope grants on the SaaS file repo (SharePoint / Google Drive). + +--- + +## Minimal Example — Salesforce ingestion pipeline + +The canonical authoring path is JSON to `databricks pipelines create --json`. (There is no SQL `CREATE TABLE … FROM CONNECTION` syntax for Lakeflow Connect — that syntax exists only for Lakehouse Federation, which is a different product.) + +```bash +databricks pipelines create --json '{ + "name": "salesforce_to_uc", + "ingestion_definition": { + "connection_name": "my_salesforce_oauth_connection", + "objects": [ + {"table": {"source_schema": "salesforce", "source_table": "Account", + "destination_catalog": "main", "destination_schema": "salesforce_raw"}}, + {"table": {"source_schema": "salesforce", "source_table": "Opportunity", + "destination_catalog": "main", "destination_schema": "salesforce_raw"}} + ] + }, + "channel": "PREVIEW" +}' +``` + +For a DAB-authored version (the production path), see [1-saas-connectors.md](references/1-saas-connectors.md). + +--- + +## Detailed guides + +| Topic | File | When to read | +|-------|------|--------------| +| SaaS connectors (Salesforce, Workday Reports, ServiceNow, GA4, HubSpot, Confluence) | [1-saas-connectors.md](references/1-saas-connectors.md) | Unified SaaS pattern, per-connector deltas, OAuth flows, DAB stubs | +| Database connectors (SQL Server cloud + on-prem) | [2-database-connectors.md](references/2-database-connectors.md) | Gateway pattern, change tracking vs CDC, network setup | +| Ingestion decision tree | [4-ingestion-decision-tree.md](references/4-ingestion-decision-tree.md) | Lakeflow Connect vs Auto Loader vs Lakehouse Federation vs Delta Sharing | +| Troubleshooting and monitoring | [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md) | Event log queries, common errors, escalation pointers | + +--- + +## Workflow + +For each new ingestion pipeline: + +1. **Pick the connector category** — SaaS / database / file / push — and read the matching reference file. +2. **Verify prerequisites** — UC target, source credentials, network path (for databases), region availability. +3. **Create the UC `CONNECTION`** — UI for OAuth U2M, CLI / DAB for everything else. +4. **Author the pipeline** — `databricks pipelines create --json` for one-offs, DAB YAML for anything shipping to a customer. +5. **Trigger the first run** and watch the event log; see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md) for the SQL. +6. **Schedule** via Jobs (`pipeline_task`) or `continuous: false` on the pipeline itself. Lakeflow Connect supports triggered only as of May 2026. + +--- + +## Important + +- **Triggered only, no continuous mode** — pipelines run on a schedule or on-demand, never continuously. Check the connector reference for the latest status. +- **Compute-only billing** — Lakeflow Connect is billed in DBUs (no per-row fee). Database connectors also incur classic-compute gateway DBUs in addition to the serverless ingestion pipeline DBUs. See the [pricing page](https://www.databricks.com/product/pricing/lakeflow-connect) for current rates. +- **Salesforce auth is OAuth U2M only** — no machine-to-machine, no basic auth. Connection creation requires a UI walk-through. +- **Database staging retention is 30 days** by default in the UC Volume between the gateway and the ingestion pipeline. +- **Limits per pipeline** — most SaaS connectors cap at 250 tables per pipeline. Split across multiple pipelines if needed. + +--- + +## Key Concepts + +- **UC `CONNECTION` is the credential anchor** — every Lakeflow Connect pipeline points at a UC connection. The connection owns the auth; the pipeline references it by name. +- **Serverless ingestion pipeline + (optional) classic gateway** — SaaS connectors are pure serverless. Database connectors split into a customer-network gateway (classic) and a serverless ingestion pipeline (Delta-bound). +- **CDC and schema evolution are built in** — for sources that support change tracking or CDC, the connector applies changes incrementally and evolves the target schema. Data-type changes typically require a full snapshot reload. +- **Streaming Delta output** — destination tables are governed Delta tables with `applyAsChangesFrom` semantics for CDC sources. Compatible with downstream materialized views and Spark streaming. +- **OAuth U2M is UI-only** — DAB / CLI cannot bootstrap OAuth U2M connections. Plan for a one-time human step. + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| **Pipeline fails with `APPLY_CHANGES_FROM_SNAPSHOT_ERROR.DUPLICATE_KEY_VIOLATION`** | Primary key collision in the source snapshot. Inspect the source for duplicate rows on the declared PK column. | +| **Watermark not advancing on a SaaS source** | Cursor field misconfigured. Check the connector reference for the supported cursor column per source object. | +| **Column added in source but missing from target** | Schema evolution may need to be explicitly re-enabled per connector. Check connector docs. | +| **Gateway requires an instance type unavailable in your region** | Apply a cluster policy override on the gateway pipeline; see [2-database-connectors.md](references/2-database-connectors.md). | +| **`channel: PREVIEW` warning at pipeline create** | Expected for new connectors. Switch to `channel: CURRENT` once the connector is GA in your region. | +| **`databricks pipelines create` succeeds but no data flows** | Confirm UC connection is in `READY` state and the destination schema exists. Check the event log for any `pre-flight` failures. | +| **Ingestion run shows GB ingested >> source row size** | Expected for CDC sources — change log columns + schema metadata add overhead. | + +For a deeper troubleshooting reference, see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md). + +--- + +## Related Skills + +- **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)** — the SDP runtime that Lakeflow Connect pipelines run on. For Auto Loader and downstream pipeline patterns. +- **[databricks-zerobus-ingest](../databricks-zerobus-ingest/SKILL.md)** — push-based gRPC ingestion. Sibling to Lakeflow Connect's pull-based connectors. +- **[databricks-dabs](../../skills/databricks-dabs/SKILL.md)** — author Lakeflow Connect pipelines as IaC. +- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** — managing catalogs, schemas, and the UC `CONNECTION` objects that LFC credentials live in. +- **[databricks-jobs](../../skills/databricks-jobs/SKILL.md)** — schedule ingestion pipelines with `pipeline_task`. + +--- + +## Resources + +- [Lakeflow Connect public docs hub](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect) +- [Connector reference (per-connector setup)](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) +- [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect) diff --git a/experimental/databricks-lakeflow-connect/agents/openai.yaml b/experimental/databricks-lakeflow-connect/agents/openai.yaml new file mode 100644 index 0000000..b27696c --- /dev/null +++ b/experimental/databricks-lakeflow-connect/agents/openai.yaml @@ -0,0 +1,7 @@ +interface: + display_name: "Databricks Lakeflow Connect" + short_description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect." + icon_small: "./assets/databricks.svg" + icon_large: "./assets/databricks.png" + brand_color: "#FF3621" + default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect." diff --git a/experimental/databricks-lakeflow-connect/assets/databricks.png b/experimental/databricks-lakeflow-connect/assets/databricks.png new file mode 100644 index 0000000..263fe98 Binary files /dev/null and b/experimental/databricks-lakeflow-connect/assets/databricks.png differ diff --git a/experimental/databricks-lakeflow-connect/assets/databricks.svg b/experimental/databricks-lakeflow-connect/assets/databricks.svg new file mode 100644 index 0000000..9d19110 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/assets/databricks.svg @@ -0,0 +1,3 @@ + + + \ No newline at end of file diff --git a/experimental/databricks-lakeflow-connect/references/1-saas-connectors.md b/experimental/databricks-lakeflow-connect/references/1-saas-connectors.md new file mode 100644 index 0000000..304a5d1 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/references/1-saas-connectors.md @@ -0,0 +1,136 @@ +# SaaS connectors + +The six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence) all share the same authoring pattern. This reference covers the unified flow once, then captures per-connector deltas. + +--- + +## The unified SaaS pattern + +Three steps for every SaaS connector: + +1. **Create a UC `CONNECTION`** that owns the source credentials. + - **OAuth U2M** connections (Salesforce, ServiceNow, HubSpot, Confluence) must be created in Catalog Explorer — the OAuth handshake requires a browser. CLI and DAB cannot bootstrap U2M. + - **API-key / basic / refresh-token** connections (Workday Reports, GA4 via service account, ServiceNow basic) can be created with `databricks connections create` or a DAB resource. +2. **Create the ingestion pipeline** with `databricks pipelines create --json` (or DAB). The pipeline carries the `ingestion_definition` block that names the connection and lists the source objects to land. +3. **Schedule the pipeline**. Lakeflow Connect supports triggered runs only — schedule with a Jobs `pipeline_task` or with the pipeline's own `continuous: false` cron block. + +A minimal pipeline JSON: + +```json +{ + "name": "salesforce_to_uc", + "ingestion_definition": { + "connection_name": "my_salesforce_oauth_connection", + "objects": [ + {"table": {"source_schema": "salesforce", "source_table": "Account", + "destination_catalog": "main", "destination_schema": "salesforce_raw"}} + ] + }, + "channel": "PREVIEW" +} +``` + +Keys to know: + +- `ingestion_definition.connection_name` — the UC connection name (not URL, not ID). +- `objects[].table` — one entry per source table. Use `objects[].schema` to ingest a whole source schema in one block. +- `channel: PREVIEW` is required for connectors not yet fully GA in your region. Switch to `CURRENT` once available. + +--- + +## Salesforce + +- **Auth**: OAuth U2M only. No machine-to-machine, no basic auth, no API key. The connection must be created in Catalog Explorer with a browser-based login. +- **Limit**: 250 tables per pipeline. Split larger workloads into multiple pipelines partitioned by object family. +- **Formula fields**: ingested as full snapshots only — incremental CDC is not available for computed columns. Plan for higher DBU usage on objects with many formula fields. +- **Data-type changes**: source data-type changes are not auto-handled. A reload from snapshot is required when the source column type changes. +- **Sandboxes**: a separate UC connection per sandbox vs production org. Don't reuse connections across orgs. + +--- + +## Workday Reports (RaaS) + +The Workday connector is **Report-as-a-Service** — it ingests Workday custom reports, not raw HCM tables. Workday HCM is a separate (Beta) connector. + +- **Auth**: OAuth refresh token (recommended for production) or HTTP basic. The refresh token must be minted in Workday and stored in the UC connection. +- **Source objects**: each "table" is a Workday custom report. Configure the report in Workday first, then reference it by name in the pipeline. +- **Limits**: same 250-table-per-pipeline cap; per-report row limits inherit from the Workday report itself. +- **No auto data-type evolution**: report schema changes require a pipeline edit + reload. + +--- + +## ServiceNow + +- **Auth**: OAuth U2M (recommended) or HTTP basic. OAuth requires a registered ServiceNow OAuth application; basic auth requires a service account with read access to the target tables. +- **Source objects**: ServiceNow table names (e.g., `incident`, `change_request`). Reference fields (sys_id -> related record) are kept as `sys_id` strings — joins happen downstream. +- **Limits**: 250 tables per pipeline. Long-running ServiceNow instances with custom tables may need multiple pipelines. +- **Pagination**: handled by the connector; no client-side configuration needed. + +--- + +## Google Analytics 4 + +GA4 ingestion goes **via BigQuery** — Lakeflow Connect reads from the GA4 BigQuery export, not from the GA4 API directly. The customer must enable BigQuery export in their GA4 property before the connector can run. + +- **Auth**: GCP service-account JSON key. The service account needs `BigQuery Data Viewer` on the GA4 export dataset. +- **Prereq**: GA4 -> BigQuery export must be enabled (Admin -> BigQuery Links). Daily export is the typical setup; streaming export is supported. +- **Source objects**: the `events_*` tables in the GA4 export dataset. The connector handles the daily-shard pattern transparently. +- **Latency**: bounded by the GA4 -> BigQuery export cadence (typically next-day for daily export). + +--- + +## HubSpot + +- **Auth**: OAuth U2M. +- **Source objects**: HubSpot CRM objects (Contacts, Companies, Deals, Tickets, etc.) plus engagements. Check the connector reference for the current object list. +- **Status caveat**: status may differ by region — check the [connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) to confirm GA in your region before relying on production SLAs. + +--- + +## Confluence + +- **Auth**: OAuth U2M. +- **Source objects**: spaces, pages, comments. Markup is preserved in the page body column. +- **Status caveat**: same as HubSpot — confirm regional availability in the connector reference. + +--- + +## DAB pattern for SaaS connectors + +The production authoring path is a Databricks Asset Bundle resource. A minimal pipeline resource: + +```yaml +resources: + pipelines: + salesforce_ingestion: + name: salesforce_to_uc + channel: PREVIEW + ingestion_definition: + connection_name: my_salesforce_oauth_connection + objects: + - table: + source_schema: salesforce + source_table: Account + destination_catalog: ${var.catalog} + destination_schema: salesforce_raw + - table: + source_schema: salesforce + source_table: Opportunity + destination_catalog: ${var.catalog} + destination_schema: salesforce_raw +``` + +Schedule it via a Jobs resource with a `pipeline_task` pointing at this pipeline. See [databricks-dabs](../../../skills/databricks-dabs/SKILL.md) for bundle structure, target overrides, and the recommended layout for multi-pipeline bundles. + +--- + +## Common SaaS gotchas + +| Symptom | Likely cause | Fix | +|---|---|---| +| Watermark not advancing on an object | Cursor field misconfigured for that source object | Check the per-connector cursor-column docs; some objects need an explicit cursor override. | +| Duplicate-key error after a snapshot reload | Source has duplicate PKs (Salesforce composite keys, ServiceNow merged records) | Inspect the source for the duplicates; the connector won't auto-resolve. | +| New source column missing from the target | Schema evolution disabled or not yet propagated | Re-enable schema evolution on the destination table and trigger a snapshot run. | +| OAuth connection stuck in `PENDING` | U2M authorization not completed in Catalog Explorer | Re-open the connection in Catalog Explorer and complete the browser flow. | +| `channel: PREVIEW` warning at create time | Expected for connectors not yet GA in your region | Switch to `CURRENT` once the connector is GA where the pipeline runs. | +| Pipeline succeeds but no rows land | Destination schema missing, or the connection account lacks read on the source object | Check the event log; pre-flight errors are surfaced there. | diff --git a/experimental/databricks-lakeflow-connect/references/2-database-connectors.md b/experimental/databricks-lakeflow-connect/references/2-database-connectors.md new file mode 100644 index 0000000..91bde98 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/references/2-database-connectors.md @@ -0,0 +1,145 @@ +# Database connectors + +SQL Server (cloud and on-prem) is the GA database connector. Postgres CDC, MySQL CDC, query-based variants, and Foreign Catalog connectors are Public Preview — production-supported but covered briefly here; see the [connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for their setup until deep coverage lands in a follow-up. + +--- + +## The gateway pattern + +Database connectors are **not** pure serverless. They split into two pipelines: + +``` + ┌───────────────────────┐ + │ Customer database │ + │ (SQL Server, etc.) │ + └──────────┬────────────┘ + │ CDC / change tracking + ▼ + ┌───────────────────────┐ + │ Ingestion gateway │ classic compute, + │ (one pipeline) │ runs in customer VPC + └──────────┬────────────┘ + │ change events + ▼ + ┌───────────────────────┐ + │ UC Volume staging │ 30-day retention by default + └──────────┬────────────┘ + │ + ▼ + ┌───────────────────────┐ + │ Ingestion pipeline │ serverless, + │ (one pipeline) │ applies CDC into Delta + └──────────┬────────────┘ + ▼ + ┌───────────────────────┐ + │ Delta tables in UC │ + └───────────────────────┘ +``` + +Why each piece: + +- **Gateway** runs in the customer's network so the source database is never exposed to Databricks-managed compute. It reads the CDC / change-tracking stream and writes change events into a UC Volume. +- **Staging Volume** decouples the two pipelines: the gateway can run on its own cadence, and the ingestion pipeline can re-process from the Volume without re-reading the source. +- **Ingestion pipeline** is the serverless half — it applies the staged events to Delta with CDC semantics and handles schema evolution. + +Trade-offs: + +- Two pipelines, two pieces of state. Both must be healthy. +- Gateway is **classic compute** — billed separately from the serverless ingestion DBUs. See the [pricing page](https://www.databricks.com/product/pricing/lakeflow-connect) for current rates. +- Staging Volume retention is 30 days. Reprocessing further back requires a snapshot reload. + +--- + +## SQL Server: change tracking vs CDC + +SQL Server offers two source mechanisms; pick one per database. + +**Change Tracking (CT)** — lightweight. The source tracks "which rows changed since version X" but not the actual change history. The gateway re-reads changed rows from the base table. + +- Lower overhead on the source. +- Adequate when downstream only needs the latest state per PK. +- Cannot reconstruct historical change order. + +**Change Data Capture (CDC)** — full change log. The source writes inserts/updates/deletes into change tables that the gateway reads directly. + +- Higher overhead on the source (separate change tables, log reader job). +- Required when downstream needs per-event history (audit, SCD2 from raw deltas, etc.). + +Most pipelines start with CT and switch to CDC only when audit or SCD2 demands it. + +--- + +## SQL Server cloud setup + +Prerequisites: + +1. **SQL Server 2012+** (cloud-managed: Azure SQL DB, Azure SQL MI, RDS for SQL Server). +2. **A dedicated database user** with `db_owner` on the source database, or the minimum grants for CT/CDC (see the connector reference). +3. **CT or CDC enabled** on the source tables (`ALTER DATABASE ... SET CHANGE_TRACKING = ON` for CT; `sys.sp_cdc_enable_table` for CDC). +4. **Network reachability** — the gateway compute must reach the source database. For cloud SQL Server this is usually VPC peering or PrivateLink. + +A DAB stub with both pipelines: + +```yaml +resources: + pipelines: + sqlserver_gateway: + name: sqlserver_gateway + channel: PREVIEW + gateway_definition: + connection_name: my_sqlserver_connection + gateway_storage_catalog: ${var.catalog} + gateway_storage_schema: ingestion_staging + gateway_storage_name: sqlserver_gateway_storage + + sqlserver_ingestion: + name: sqlserver_to_uc + channel: PREVIEW + ingestion_definition: + ingestion_gateway_id: ${resources.pipelines.sqlserver_gateway.id} + objects: + - table: + source_catalog: sales_db + source_schema: dbo + source_table: orders + destination_catalog: ${var.catalog} + destination_schema: sqlserver_raw +``` + +The SDK Python equivalent uses `w.pipelines.create` twice — once with `gateway_definition`, once with `ingestion_definition` referencing the gateway's pipeline ID. + +--- + +## SQL Server on-prem + +Same setup as cloud, plus private networking from the gateway to the on-prem source: + +- **Azure**: ExpressRoute or VPN gateway between the customer VNet and the on-prem network. +- **AWS**: Direct Connect or Site-to-Site VPN between the customer VPC and the on-prem network. +- **GCP**: Cloud Interconnect or Cloud VPN. + +The gateway compute itself runs on Databricks-managed VPC infrastructure inside the customer's workspace, so the private link only needs to extend that far. + +--- + +## Database-specific gotchas + +| Symptom | Likely cause | Fix | +|---|---|---| +| Gateway requires an instance type unavailable in the region | Default gateway cluster shape not stocked in the target region | Apply a cluster policy override on the gateway pipeline to pin a regionally-available instance type. | +| Snapshot-only mode silently disabled | Snapshot-only is not supported for CDC sources | Use CT instead, or accept incremental mode. | +| Pipeline state diverges from source after 30+ days | Staging Volume retention expired | Resnapshot the affected tables. Increase the Volume retention if reprocessing further back is a recurring need. | +| "Continuous mode not supported" error at create | Lakeflow Connect is triggered-only as of May 2026 | Use `continuous: false` plus a Jobs schedule. | +| Gateway pipeline succeeds but ingestion pipeline shows no new data | Staging path mismatch between the two pipelines | Confirm `gateway_storage_*` on the gateway matches the staging path the ingestion pipeline reads from. | + +--- + +## Public Preview database connectors (brief) + +The following are production-supported but ship more pattern variance than SQL Server. Use the [connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for current setup steps: + +- **Postgres CDC, MySQL CDC** — same gateway pattern as SQL Server; logical decoding (Postgres) or binlog (MySQL) replaces CT/CDC. +- **Oracle / Teradata / SQL Server / Postgres / MySQL query-based** — no gateway; the connector issues periodic queries instead of reading a change feed. Trade-off: simpler, but higher source load and no per-event history. +- **Snowflake / Redshift / Synapse / BigQuery (Foreign Catalog)** — Lakeflow Connect creates the foreign catalog and materializes the queried subset to Delta. Most useful for warehouse-to-lakehouse migration scenarios. + +Deep coverage for these connectors will land as they stabilize. diff --git a/experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md b/experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md new file mode 100644 index 0000000..a4c9071 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md @@ -0,0 +1,128 @@ +# Ingestion decision tree + +Databricks ships several first-party ingestion approaches and the right pick depends on **where the data lives** and **whether you need a copy in your lakehouse**. This reference is the map for choosing between them. + +The four approaches: + +- **Lakeflow Connect** — managed pull for SaaS apps and databases. Fastest path when a connector for your source exists. +- **Auto Loader** — code-yours pull for files on cloud object storage. Full control, file sources only. +- **Lakehouse Federation** — query-in-place; the data stays in the source. +- **Delta Sharing** — the inbound side of someone else's lakehouse; you accept a share rather than build a pipeline. + +For event-driven push (the source pushes to you instead of you pulling) the relevant approach is **Zerobus Ingest**, covered separately in the [databricks-zerobus-ingest](../../databricks-zerobus-ingest/SKILL.md) skill. + +--- + +## Decision table + +Pick the row that matches your source type and constraint. + +| Where does the data live? | Need a copy? | Approach | Read more | +|---|---|---|---| +| SaaS app with a Lakeflow Connect connector (Salesforce, Workday, ServiceNow, GA4, HubSpot, Confluence, etc.) | Yes | Lakeflow Connect | [SKILL.md](../SKILL.md), [1-saas-connectors.md](1-saas-connectors.md) | +| Operational database (SQL Server, PostgreSQL, MySQL) with a Lakeflow Connect connector | Yes, with CDC | Lakeflow Connect | [2-database-connectors.md](2-database-connectors.md) | +| Operational database, low query volume, source can absorb the load | No copy needed | Lakehouse Federation | [docs](https://docs.databricks.com/aws/en/query-federation/) | +| Cloud object storage (S3, ADLS, GCS) with files | Yes | Auto Loader | [databricks-pipelines](../../../skills/databricks-pipelines/SKILL.md) | +| SaaS file repo (SharePoint, Google Drive, SFTP) | Yes | Lakeflow Connect | [public connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) | +| Application or device pushing events at you | Yes (push, not pull) | Zerobus Ingest | [databricks-zerobus-ingest](../../databricks-zerobus-ingest/SKILL.md) | +| Another lakehouse / partner data product offering a Delta share | Yes (accept, not build) | Delta Sharing | [docs](https://docs.databricks.com/aws/en/delta-sharing/) | +| None of the above | — | Hand-rolled Structured Streaming or `read_files` from object storage | [databricks-spark-structured-streaming](../../databricks-spark-structured-streaming/SKILL.md) | + +--- + +## Lakeflow Connect vs Auto Loader + +Both pull data into Delta tables, but they cover different source types. + +**Lakeflow Connect wins when:** +- The source is a SaaS application or a database (not files on object storage). +- The source has its own auth (OAuth, API key, DB user). +- You want CDC, schema evolution, and retries handled by the platform. +- You prefer declarative configuration over code. + +**Auto Loader wins when:** +- The source is files on cloud object storage (S3, ADLS, GCS). +- You need custom file format parsing or inline transforms. +- You want full control over checkpointing, schema hints, and trigger cadence. + +**Common confusion**: SFTP and SharePoint look like file sources but go through Lakeflow Connect, not Auto Loader. Auto Loader is for **cloud object storage** specifically. + +--- + +## Lakeflow Connect vs Lakehouse Federation + +Both let you work with data that lives outside your lakehouse, but the difference is whether the data gets copied. + +**Lakeflow Connect wins when:** +- You need a governed Delta copy in your lakehouse for performance, ML training, or downstream pipelines. +- Query volume against the source data is high. +- The source is performance-sensitive (you don't want to add query load to your production OLTP). +- You need point-in-time history (CDC into a Delta table with `applyAsChangesFrom`). + +**Lakehouse Federation wins when:** +- Data should stay in the source for governance or residency reasons. +- Query patterns are sparse (a few analysts, occasional ad-hoc queries). +- The source can comfortably absorb additional query load. +- You don't need history beyond what the source already retains. + +**Common confusion**: both use a Unity Catalog `CONNECTION` object. The difference is what you do with it — Lakeflow Connect creates an ingestion pipeline that materializes to Delta; Federation creates a foreign catalog that queries through to the source. + +--- + +## Lakeflow Connect vs Delta Sharing + +Delta Sharing is not really a build decision; it's the receiving end of someone else's pipeline. + +**Lakeflow Connect**: you build the ingestion pipeline. You own the connector configuration, the schedule, and the destination tables. Source can be anything LFC supports. + +**Delta Sharing**: a data provider (another lakehouse, a partner product) offers you a share. You accept it via a Delta Sharing client and the data appears in your catalog as a shared table. You don't operate the pipeline. + +Use Delta Sharing when a data partner offers it — there's nothing to build. Use Lakeflow Connect when you need to pull from a system the partner doesn't share to. + +--- + +## Lakeflow Connect vs Zerobus Ingest + +The push-vs-pull distinction. + +**Lakeflow Connect** is **pull-based**: the ingestion pipeline reaches out to the source on a schedule. + +**Zerobus Ingest** is **push-based**: an application or device pushes records into a Delta table via gRPC. There is no source system to pull from — the producer drives the cadence. + +Use Lakeflow Connect when the source is a system you query. Use Zerobus when the source is an application you control (or a device emitting events) that wants to write directly. + +--- + +## Cost considerations + +All four approaches are billed in DBUs (compute time), with no per-row or per-connector fee. + +- **Lakeflow Connect**: serverless ingestion pipeline DBUs; database connectors also incur classic-compute gateway DBUs. +- **Auto Loader**: serverless or classic compute DBUs depending on where the pipeline runs. +- **Lakehouse Federation**: SQL warehouse DBUs for the queries that read through the foreign catalog. Plus any costs the source charges. +- **Delta Sharing**: typically free for the recipient (the provider may charge separately outside Databricks). +- **Zerobus Ingest**: per-GB ingested, billed under the Lakeflow Jobs Serverless SKU. + +See the [Databricks pricing page](https://www.databricks.com/product/pricing) and the per-product pricing pages linked from there. + +--- + +## When Lakeflow Connect doesn't fit yet + +A few situations where you'll reach for one of the alternatives: + +- **The connector for your source isn't in the catalog.** Check the [connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) — if your source isn't listed, use Auto Loader (if it's files), a hand-rolled Structured Streaming job, or wait for the connector to ship. +- **You need continuous ingestion.** Lakeflow Connect runs triggered only as of May 2026. For sub-minute latency on file sources, use Auto Loader with `Trigger.AvailableNow` on a short interval, or Structured Streaming directly. +- **You need to push instead of pull.** That's Zerobus. +- **You want zero copy.** That's Lakehouse Federation. + +--- + +## Resources + +- [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect) +- [Connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) +- [Auto Loader docs](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/) +- [Lakehouse Federation docs](https://docs.databricks.com/aws/en/query-federation/) +- [Delta Sharing docs](https://docs.databricks.com/aws/en/delta-sharing/) +- [Pricing](https://www.databricks.com/product/pricing) diff --git a/experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md b/experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md new file mode 100644 index 0000000..97796c4 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md @@ -0,0 +1,52 @@ +# Troubleshooting and monitoring + +This reference covers what to check when an ingestion pipeline misbehaves: where the logs live, the common error shapes, and the escalation path. + +--- + +## Where to look first + +Every Lakeflow Connect pipeline emits a structured event log. For SaaS pipelines that's the only artifact; for database pipelines you'll also want to inspect the gateway pipeline's events. + +The event log is a Delta table on the pipeline. Query it through SQL: + +```sql +SELECT timestamp, level, message, error +FROM event_log("") +WHERE level IN ('ERROR', 'WARN') + AND timestamp > current_timestamp() - INTERVAL 1 DAY +ORDER BY timestamp DESC +LIMIT 50; +``` + +For event-log table conventions (filtering by `event_type`, joining with metrics, etc.), see [databricks-pipelines](../../../skills/databricks-pipelines/SKILL.md). + +**Database pipelines have two event logs** — one for the gateway, one for the ingestion pipeline. A symptom on the ingestion side often has its root cause in the gateway side. When debugging database connectors, query both. + +--- + +## Common errors and resolutions + +| Error / symptom | Likely cause | Resolution | +|---|---|---| +| `APPLY_CHANGES_FROM_SNAPSHOT_ERROR.DUPLICATE_KEY_VIOLATION` | Source snapshot contains duplicate values on the declared primary key. | Inspect the source for duplicate PKs (often a merged record or composite-key surprise). The connector won't auto-resolve — fix at the source or change the PK declaration. | +| `validate_only` update appears in pipeline run history | Expected. A dry-run validation run is logged alongside actual runs. | Filter `event_log` on `details:flow_progress.status != 'VALIDATING'` if the dry-runs are noisy. | +| SCD2 row count doesn't match raw source count | Expected. SCD2 multiplies rows per change (one row per version), so SCD2 row count >> source row count is normal. | Compare on PK count with `current = true` instead of total row count. | +| `NULL` values appear after switching SCD1 -> SCD2 | Expected. Pre-switch history is reconstructed as a single open version with `NULL`s for unknown deltas. | Re-snapshot the table if a clean SCD2 history is required from a specific point. | +| `GB ingested` >> source row size in the metrics | Expected for CDC sources. Change log columns, schema metadata, and per-batch overhead inflate ingested bytes. | Use source row count, not GB ingested, as the workload sizing signal. | +| Gateway pipeline fails: instance type unavailable in region | Default gateway cluster shape isn't stocked in the target region. | Apply a cluster policy override on the gateway pipeline to pin a regionally-available instance type. | +| Pipeline runs but the destination table never updates | UC `CONNECTION` not in `READY` state, OR destination schema missing. | `DESCRIBE CONNECTION ` — state must be `READY`. Verify the destination schema exists and the pipeline's service principal has `CREATE TABLE` + `MODIFY`. | +| OAuth U2M connection refreshes fail after weeks of working | Refresh token expired or revoked at the SaaS source. | Re-open the connection in Catalog Explorer and re-authorize. Plan for periodic re-auth if the SaaS source enforces a refresh-token lifetime. | +| `channel: PREVIEW` warning at pipeline create | Expected for connectors not yet GA in your region. | Switch to `CURRENT` once the connector is GA where the pipeline runs. | + +--- + +## Escalation pointers + +When the event log doesn't explain a failure: + +1. **Public docs hub** — [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect) covers concepts and links to per-connector pages. +2. **Connector reference** — [per-connector setup](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) is the canonical source for current auth, limits, and supported objects per source. +3. **Workspace support** — file a support case from Help -> Contact Support inside the workspace; attach the pipeline ID and a relevant `event_log` extract. + +For monitoring patterns beyond event-log queries (dashboards, alerting on pipeline state, SLAs), see [databricks-pipelines](../../../skills/databricks-pipelines/SKILL.md). diff --git a/manifest.json b/manifest.json index e6925dc..c3a482e 100644 --- a/manifest.json +++ b/manifest.json @@ -213,6 +213,21 @@ "repo_dir": "skills", "version": "0.1.0" }, + "databricks-lakeflow-connect": { + "description": "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing.", + "files": [ + "SKILL.md", + "agents/openai.yaml", + "assets/databricks.png", + "assets/databricks.svg", + "references/1-saas-connectors.md", + "references/2-database-connectors.md", + "references/4-ingestion-decision-tree.md", + "references/5-troubleshooting-and-monitoring.md" + ], + "repo_dir": "experimental", + "version": "0.0.1" + }, "databricks-metric-views": { "description": "Unity Catalog metric views: define, create, query, and manage governed business metrics in YAML. Use when building standardized KPIs, revenue metrics, order analytics, or any reusable business metrics that need consistent definitions across teams and tools.", "files": [