Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
199 changes: 199 additions & 0 deletions experimental/databricks-lakeflow-connect/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,199 @@
---
name: databricks-lakeflow-connect
description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing."
---
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this, related to what is changing in PR #105

parent: databricks-core
compatibility: Requires databricks CLI (>= v0.294.0)
metadata:
version: "0.1.0"


# Lakeflow Connect

Build managed ingestion pipelines that pull from SaaS apps and databases into Unity Catalog Delta tables, governed end-to-end and powered by serverless Lakeflow Spark Declarative Pipelines.

**Status:** mixed catalog as of May 2026 — 9 GA connectors, plus a Public Preview / Beta / Private Preview pipeline that ships new sources monthly.

**Documentation:**
- [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect)
- [Connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors)
- [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect)
Comment on lines +12 to +15
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove. These are covered in references at the bottom of this file.


---

## What Is Lakeflow Connect?

Managed connectors for ingesting data from SaaS applications and databases. The resulting ingestion pipeline is governed by Unity Catalog and powered by serverless compute and Lakeflow Spark Declarative Pipelines.

Three frames to keep in mind:

- **Simple and low-maintenance** — no client code to write, no message bus to operate; connector + UC Connection + a serverless pipeline.
- **Unified with the lakehouse** — credentials stored in UC, output is governed Delta, runs on Jobs and SDP like any other workload.
- **Efficient incremental processing** — change tracking / CDC / schema evolution / retries are built in.

There are four architecture patterns:

1. **SaaS pull** — connector reads from an external SaaS via OAuth or API key, lands in a streaming Delta table.
2. **Database CDC via gateway** — an ingestion gateway runs in the customer's network, stages change events to a UC Volume, a serverless ingestion pipeline applies them as CDC into Delta.
3. **Query-based** — for sources without native CDC (Oracle / Teradata / SQL Server / PG / MySQL query-based, Snowflake / Redshift / Synapse / BigQuery via Foreign Catalog), the connector issues periodic queries instead of subscribing to a change feed.
4. **Community connectors** — template-based, out of scope for this skill.

---

## Connector catalog

Lakeflow Connect ships connectors at multiple release stages. **GA** and **Public Preview** connectors are production-supported; **Beta** and **Private Preview** are early-access and not production-supported.

### GA connectors

Full coverage in this skill.

| Source | Type | Auth | Reference |
|--------|------|------|-----------|
| Salesforce (Sales / Service / etc.) | SaaS pull | OAuth U2M | [1-saas-connectors.md](references/1-saas-connectors.md) |
| Workday Reports (RaaS) | SaaS pull | OAuth refresh token / basic | [1-saas-connectors.md](references/1-saas-connectors.md) |
| ServiceNow | SaaS pull | OAuth U2M / basic | [1-saas-connectors.md](references/1-saas-connectors.md) |
| Google Analytics 4 | SaaS pull (via BigQuery) | Service-account JSON | [1-saas-connectors.md](references/1-saas-connectors.md) |
| HubSpot | SaaS pull | OAuth | [1-saas-connectors.md](references/1-saas-connectors.md) |
| Confluence | SaaS pull | OAuth | [1-saas-connectors.md](references/1-saas-connectors.md) |
| SQL Server (cloud) | Database CDC | DB user + change tracking / CDC | [2-database-connectors.md](references/2-database-connectors.md) |
| SQL Server (on-prem) | Database CDC | DB user + ExpressRoute / Direct Connect | [2-database-connectors.md](references/2-database-connectors.md) |
| Zerobus Ingest | Push (gRPC) | Service principal | See [databricks-zerobus-ingest](../databricks-zerobus-ingest/SKILL.md) |
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zerobus doesn't seem to fit as a connector since its a separate skill.


### Public Preview connectors

Production-supported. Configuration may evolve before GA. Deep coverage is being added incrementally; until then, see the [public connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for current setup steps.

| Source | Type | Auth | GA target |
|--------|------|------|-----------|
| NetSuite | SaaS pull | OAuth | May 31, 2026 |
| Dynamics 365 | SaaS pull | OAuth | May 31, 2026 |
| PostgreSQL CDC | Database CDC | DB user + gateway | Jun 30, 2026 (tentative); ungated PuPr May 29 |
| MySQL CDC | Database CDC | DB user + gateway | Jul 15, 2026 (tentative); ungated PuPr May 29 |
| Oracle / Teradata / SQL Server / PG / MySQL (query-based) | Database query | DB user | Jun 30, 2026 |
| Snowflake / Redshift / Synapse / BigQuery (Foreign Catalog) | Database query | Foreign Catalog | Jun 30, 2026 |
| SFTP | File pull | Key / password | Jun 30, 2026 |
Comment on lines +62 to +70
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets remove GA target, we have typically avoided specific dates in skills so they don't cause confusion or require constant maintenance.


### Beta and Private Preview

Early-access connectors are not production-supported. The list changes month to month; check the [public connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for current availability.

For the Lakeflow-Connect-vs-Auto-Loader-vs-Federation-vs-Delta-Sharing decision, see [4-ingestion-decision-tree.md](references/4-ingestion-decision-tree.md).

---

## Required Tools

- **Databricks CLI v1.0.0+** for `databricks pipelines create` and `databricks connections create`. Verify with `databricks --version`.
- **Databricks SDK for Python** (`databricks-sdk>=0.85.0`) if you prefer SDK over CLI.
- **Databricks Asset Bundles** if authoring as IaC (recommended for any pipeline that ships to a customer environment).

No extra connector-specific SDK is needed. Lakeflow Connect reuses the pipelines API surface — pipelines are created with an `ingestion_definition` block instead of a `libraries` block, but the API and CLI are otherwise the same.

---

## Prerequisites

Confirm before creating any pipeline:

1. **A Unity Catalog target** — catalog and schema must exist; the service principal or user creating the pipeline needs `USE CATALOG`, `USE SCHEMA`, `CREATE TABLE`, and `MODIFY` on the target schema.
2. **A UC `CONNECTION` object** with credentials for the source. SaaS OAuth U2M connections must be created via the UI (Catalog Explorer); API-key and basic-auth connections can be created via CLI / DAB.
3. **For database connectors**: network reachability between the gateway (classic compute, customer VPC) and the source database. On-prem requires ExpressRoute (Azure) or Direct Connect (AWS).
4. **For file connectors**: OAuth scope grants on the SaaS file repo (SharePoint / Google Drive).

---

## Minimal Example — Salesforce ingestion pipeline

The canonical authoring path is JSON to `databricks pipelines create --json`. (There is no SQL `CREATE TABLE … FROM CONNECTION` syntax for Lakeflow Connect — that syntax exists only for Lakehouse Federation, which is a different product.)

```bash
databricks pipelines create --json '{
"name": "salesforce_to_uc",
"ingestion_definition": {
"connection_name": "my_salesforce_oauth_connection",
"objects": [
{"table": {"source_schema": "salesforce", "source_table": "Account",
"destination_catalog": "main", "destination_schema": "salesforce_raw"}},
{"table": {"source_schema": "salesforce", "source_table": "Opportunity",
"destination_catalog": "main", "destination_schema": "salesforce_raw"}}
]
},
"channel": "PREVIEW"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is PREVIEW correct, not CURRENT?

}'
```

For a DAB-authored version (the production path), see [1-saas-connectors.md](references/1-saas-connectors.md).

---

## Detailed guides

| Topic | File | When to read |
|-------|------|--------------|
| SaaS connectors (Salesforce, Workday Reports, ServiceNow, GA4, HubSpot, Confluence) | [1-saas-connectors.md](references/1-saas-connectors.md) | Unified SaaS pattern, per-connector deltas, OAuth flows, DAB stubs |
| Database connectors (SQL Server cloud + on-prem) | [2-database-connectors.md](references/2-database-connectors.md) | Gateway pattern, change tracking vs CDC, network setup |
| Ingestion decision tree | [4-ingestion-decision-tree.md](references/4-ingestion-decision-tree.md) | Lakeflow Connect vs Auto Loader vs Lakehouse Federation vs Delta Sharing |
| Troubleshooting and monitoring | [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md) | Event log queries, common errors, escalation pointers |

---

## Workflow

For each new ingestion pipeline:

1. **Pick the connector category** — SaaS / database / file / push — and read the matching reference file.
2. **Verify prerequisites** — UC target, source credentials, network path (for databases), region availability.
3. **Create the UC `CONNECTION`** — UI for OAuth U2M, CLI / DAB for everything else.
4. **Author the pipeline** — `databricks pipelines create --json` for one-offs, DAB YAML for anything shipping to a customer.
5. **Trigger the first run** and watch the event log; see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md) for the SQL.
6. **Schedule** via Jobs (`pipeline_task`) or `continuous: false` on the pipeline itself. Lakeflow Connect supports triggered only as of May 2026.

---

## Important

- **Triggered only, no continuous mode** — pipelines run on a schedule or on-demand, never continuously. Check the connector reference for the latest status.
- **Compute-only billing** — Lakeflow Connect is billed in DBUs (no per-row fee). Database connectors also incur classic-compute gateway DBUs in addition to the serverless ingestion pipeline DBUs. See the [pricing page](https://www.databricks.com/product/pricing/lakeflow-connect) for current rates.
- **Salesforce auth is OAuth U2M only** — no machine-to-machine, no basic auth. Connection creation requires a UI walk-through.
- **Database staging retention is 30 days** by default in the UC Volume between the gateway and the ingestion pipeline.
- **Limits per pipeline** — most SaaS connectors cap at 250 tables per pipeline. Split across multiple pipelines if needed.

---

## Key Concepts

- **UC `CONNECTION` is the credential anchor** — every Lakeflow Connect pipeline points at a UC connection. The connection owns the auth; the pipeline references it by name.
- **Serverless ingestion pipeline + (optional) classic gateway** — SaaS connectors are pure serverless. Database connectors split into a customer-network gateway (classic) and a serverless ingestion pipeline (Delta-bound).
- **CDC and schema evolution are built in** — for sources that support change tracking or CDC, the connector applies changes incrementally and evolves the target schema. Data-type changes typically require a full snapshot reload.
- **Streaming Delta output** — destination tables are governed Delta tables with `applyAsChangesFrom` semantics for CDC sources. Compatible with downstream materialized views and Spark streaming.
- **OAuth U2M is UI-only** — DAB / CLI cannot bootstrap OAuth U2M connections. Plan for a one-time human step.

---

## Common Issues

| Issue | Solution |
|-------|----------|
| **Pipeline fails with `APPLY_CHANGES_FROM_SNAPSHOT_ERROR.DUPLICATE_KEY_VIOLATION`** | Primary key collision in the source snapshot. Inspect the source for duplicate rows on the declared PK column. |
| **Watermark not advancing on a SaaS source** | Cursor field misconfigured. Check the connector reference for the supported cursor column per source object. |
| **Column added in source but missing from target** | Schema evolution may need to be explicitly re-enabled per connector. Check connector docs. |
| **Gateway requires an instance type unavailable in your region** | Apply a cluster policy override on the gateway pipeline; see [2-database-connectors.md](references/2-database-connectors.md). |
| **`channel: PREVIEW` warning at pipeline create** | Expected for new connectors. Switch to `channel: CURRENT` once the connector is GA in your region. |
| **`databricks pipelines create` succeeds but no data flows** | Confirm UC connection is in `READY` state and the destination schema exists. Check the event log for any `pre-flight` failures. |
| **Ingestion run shows GB ingested >> source row size** | Expected for CDC sources — change log columns + schema metadata add overhead. |

For a deeper troubleshooting reference, see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md).
Comment on lines +171 to +181
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we consolidate this to just point to troubleshooting which should already have this info?


---

## Related Skills

- **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)** — the SDP runtime that Lakeflow Connect pipelines run on. For Auto Loader and downstream pipeline patterns.
- **[databricks-zerobus-ingest](../databricks-zerobus-ingest/SKILL.md)** — push-based gRPC ingestion. Sibling to Lakeflow Connect's pull-based connectors.
- **[databricks-dabs](../../skills/databricks-dabs/SKILL.md)** — author Lakeflow Connect pipelines as IaC.
- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** — managing catalogs, schemas, and the UC `CONNECTION` objects that LFC credentials live in.
- **[databricks-jobs](../../skills/databricks-jobs/SKILL.md)** — schedule ingestion pipelines with `pipeline_task`.

---

## Resources

- [Lakeflow Connect public docs hub](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect)
- [Connector reference (per-connector setup)](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors)
- [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect)
7 changes: 7 additions & 0 deletions experimental/databricks-lakeflow-connect/agents/openai.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
interface:
display_name: "Databricks Lakeflow Connect"
short_description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect."
icon_small: "./assets/databricks.svg"
icon_large: "./assets/databricks.png"
brand_color: "#FF3621"
default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect."
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect."
default_prompt: "Use $databricks-lakeflow-connect to build managed ingestion pipelines into databricks using Lakeflow Connect."

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading