From 801c399567218785df8a5ff2006e189714afc050 Mon Sep 17 00:00:00 2001
From: Jose Alfonso
Date: Wed, 27 May 2026 17:14:33 +0200
Subject: [PATCH 1/4] experimental: add databricks-lakeflow-connect skill (GA scope, draft)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Initial scope-first commit for a draft PR. GA-first deep coverage, PuPr listed but deferred to follow-up commits.

This commit includes:
- SKILL.md (routing + 3-tier catalog: GA / PuPr / Beta-PrPr + workflow + key concepts + common issues)
- references/4-ingestion-decision-tree.md (LFC vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches)
- agents/openai.yaml + assets/ via scripts/skills.py generate
- manifest.json updated

To follow in subsequent commits:
- references/1-saas-connectors.md (Salesforce, Workday Reports, ServiceNow, GA4, HubSpot, Confluence — all GA)
- references/2-database-connectors.md (SQL Server cloud + on-prem + gateway pattern intro — GA)
- references/5-troubleshooting-and-monitoring.md (GA-focused)

Public Preview connectors (NetSuite, Dynamics 365, PG/MySQL CDC, query-based databases, Foreign Catalog query-based, SFTP) are production-supported and listed in SKILL.md; deep coverage will be added incrementally as PuPr connectors stabilize. SharePoint/Google Drive (Beta currently, GA Jun 1 target) and other Beta/PrPr connectors are not first-class in this skill. references/3-file-and-streaming-connectors.md will be created when SFTP + SharePoint/Drive get deep coverage (post-v1).

Signed-off-by: Jose Alfonso
---
 .../databricks-lakeflow-connect/SKILL.md      | 199 ++++++++++++++++++
 .../agents/openai.yaml                        |   7 +
 .../assets/databricks.png                     | Bin 0 -> 15366 bytes
 .../assets/databricks.svg                     |   3 +
 .../references/4-ingestion-decision-tree.md   | 128 +++++++++++
 manifest.json                                 |  12 ++
 6 files changed, 349 insertions(+)
 create mode 100644 experimental/databricks-lakeflow-connect/SKILL.md
 create mode 100644 experimental/databricks-lakeflow-connect/agents/openai.yaml
 create mode 100644 experimental/databricks-lakeflow-connect/assets/databricks.png
 create mode 100644 experimental/databricks-lakeflow-connect/assets/databricks.svg
 create mode 100644 experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md

diff --git a/experimental/databricks-lakeflow-connect/SKILL.md b/experimental/databricks-lakeflow-connect/SKILL.md
new file mode 100644
index 0000000..7463195
--- /dev/null
+++ b/experimental/databricks-lakeflow-connect/SKILL.md
@@ -0,0 +1,199 @@
+---
+name: databricks-lakeflow-connect
+description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing."
+---
+
+# Lakeflow Connect
+
+Build managed ingestion pipelines that pull from SaaS apps and databases into Unity Catalog Delta tables, governed end-to-end and powered by serverless Lakeflow Spark Declarative Pipelines.
+
+**Status:** mixed catalog as of May 2026 — 9 GA connectors, plus a Public Preview / Beta / Private Preview pipeline that ships new sources monthly.
+
+**Documentation:**
+- [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect)
+- [Connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors)
+- [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect)
+
+---
+
+## What Is Lakeflow Connect?
+
+Managed connectors for ingesting data from SaaS applications and databases. The resulting ingestion pipeline is governed by Unity Catalog and powered by serverless compute and Lakeflow Spark Declarative Pipelines.
+
+Three frames to keep in mind:
+
+- **Simple and low-maintenance** — no client code to write, no message bus to operate; connector + UC Connection + a serverless pipeline.
+- **Unified with the lakehouse** — credentials stored in UC, output is governed Delta, runs on Jobs and SDP like any other workload.
+- **Efficient incremental processing** — change tracking / CDC / schema evolution / retries are built in.
+
+There are four architecture patterns:
+
+1. **SaaS pull** — connector reads from an external SaaS via OAuth or API key, lands in a streaming Delta table.
+2. **Database CDC via gateway** — an ingestion gateway runs in the customer's network, stages change events to a UC Volume, a serverless ingestion pipeline applies them as CDC into Delta.
+3. **Query-based** — for sources without native CDC (Oracle / Teradata / SQL Server / PG / MySQL query-based, Snowflake / Redshift / Synapse / BigQuery via Foreign Catalog), the connector issues periodic queries instead of subscribing to a change feed.
+4. **Community connectors** — template-based, out of scope for this skill.
+
+---
+
+## Connector catalog
+
+Lakeflow Connect ships connectors at multiple release stages. **GA** and **Public Preview** connectors are production-supported; **Beta** and **Private Preview** are early-access and not production-supported.
+
+### GA connectors
+
+Full coverage in this skill.
+
+| Source | Type | Auth | Reference |
+|--------|------|------|-----------|
+| Salesforce (Sales / Service / etc.) | SaaS pull | OAuth U2M | [1-saas-connectors.md](references/1-saas-connectors.md) |
+| Workday Reports (RaaS) | SaaS pull | OAuth refresh token / basic | [1-saas-connectors.md](references/1-saas-connectors.md) |
+| ServiceNow | SaaS pull | OAuth U2M / basic | [1-saas-connectors.md](references/1-saas-connectors.md) |
+| Google Analytics 4 | SaaS pull (via BigQuery) | Service-account JSON | [1-saas-connectors.md](references/1-saas-connectors.md) |
+| HubSpot | SaaS pull | OAuth | [1-saas-connectors.md](references/1-saas-connectors.md) |
+| Confluence | SaaS pull | OAuth | [1-saas-connectors.md](references/1-saas-connectors.md) |
+| SQL Server (cloud) | Database CDC | DB user + change tracking / CDC | [2-database-connectors.md](references/2-database-connectors.md) |
+| SQL Server (on-prem) | Database CDC | DB user + ExpressRoute / Direct Connect | [2-database-connectors.md](references/2-database-connectors.md) |
+| Zerobus Ingest | Push (gRPC) | Service principal | See [databricks-zerobus-ingest](../databricks-zerobus-ingest/SKILL.md) |
+
+### Public Preview connectors
+
+Production-supported. Configuration may evolve before GA. Deep coverage is being added incrementally; until then, see the [public connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for current setup steps.
+
+| Source | Type | Auth | GA target |
+|--------|------|------|-----------|
+| NetSuite | SaaS pull | OAuth | May 31, 2026 |
+| Dynamics 365 | SaaS pull | OAuth | May 31, 2026 |
+| PostgreSQL CDC | Database CDC | DB user + gateway | Jun 30, 2026 (tentative); ungated PuPr May 29 |
+| MySQL CDC | Database CDC | DB user + gateway | Jul 15, 2026 (tentative); ungated PuPr May 29 |
+| Oracle / Teradata / SQL Server / PG / MySQL (query-based) | Database query | DB user | Jun 30, 2026 |
+| Snowflake / Redshift / Synapse / BigQuery (Foreign Catalog) | Database query | Foreign Catalog | Jun 30, 2026 |
+| SFTP | File pull | Key / password | Jun 30, 2026 |
+
+### Beta and Private Preview
+
+Early-access connectors are not production-supported. The list changes month to month; check the [public connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for current availability.
+
+For the Lakeflow-Connect-vs-Auto-Loader-vs-Federation-vs-Delta-Sharing decision, see [4-ingestion-decision-tree.md](references/4-ingestion-decision-tree.md).
+
+---
+
+## Required Tools
+
+- **Databricks CLI v1.0.0+** for `databricks pipelines create` and `databricks connections create`. Verify with `databricks --version`.
+- **Databricks SDK for Python** (`databricks-sdk>=0.85.0`) if you prefer SDK over CLI.
+- **Databricks Asset Bundles** if authoring as IaC (recommended for any pipeline that ships to a customer environment).
+
+No extra connector-specific SDK is needed. Lakeflow Connect reuses the pipelines API surface — pipelines are created with an `ingestion_definition` block instead of a `libraries` block, but the API and CLI are otherwise the same.
+
+---
+
+## Prerequisites
+
+Confirm before creating any pipeline:
+
+1. **A Unity Catalog target** — catalog and schema must exist; the service principal or user creating the pipeline needs `USE CATALOG`, `USE SCHEMA`, `CREATE TABLE`, and `MODIFY` on the target schema.
+2. **A UC `CONNECTION` object** with credentials for the source. SaaS OAuth U2M connections must be created via the UI (Catalog Explorer); API-key and basic-auth connections can be created via CLI / DAB.
+3. **For database connectors**: network reachability between the gateway (classic compute, customer VPC) and the source database. On-prem requires ExpressRoute (Azure) or Direct Connect (AWS).
+4. **For file connectors**: OAuth scope grants on the SaaS file repo (SharePoint / Google Drive).
+
+---
+
+## Minimal Example — Salesforce ingestion pipeline
+
+The canonical authoring path is JSON to `databricks pipelines create --json`. (There is no SQL `CREATE TABLE … FROM CONNECTION` syntax for Lakeflow Connect — that syntax exists only for Lakehouse Federation, which is a different product.)
+
+```bash
+databricks pipelines create --json '{
+  "name": "salesforce_to_uc",
+  "ingestion_definition": {
+    "connection_name": "my_salesforce_oauth_connection",
+    "objects": [
+      {"table": {"source_schema": "salesforce", "source_table": "Account",
+                 "destination_catalog": "main", "destination_schema": "salesforce_raw"}},
+      {"table": {"source_schema": "salesforce", "source_table": "Opportunity",
+                 "destination_catalog": "main", "destination_schema": "salesforce_raw"}}
+    ]
+  },
+  "channel": "PREVIEW"
+}'
+```
+
+For a DAB-authored version (the production path), see [1-saas-connectors.md](references/1-saas-connectors.md).
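
When the object list grows past a couple of tables, generating the payload may be less error-prone than editing the JSON by hand. The following is a minimal Python sketch that only rebuilds the payload shape used in the Salesforce example; the helper name `build_ingestion_spec` is hypothetical and not part of any Databricks API, and the connection, catalog, and schema names are the same placeholders as in the example.

```python
import json


def build_ingestion_spec(name, connection_name, source_schema, tables,
                         destination_catalog, destination_schema,
                         channel="PREVIEW"):
    """Return a dict shaped like the `pipelines create --json` payload.

    Hypothetical helper: it only assembles the keys shown in the
    Salesforce example and does not call any Databricks API.
    """
    objects = [
        {"table": {
            "source_schema": source_schema,
            "source_table": table,
            "destination_catalog": destination_catalog,
            "destination_schema": destination_schema,
        }}
        for table in tables
    ]
    return {
        "name": name,
        "ingestion_definition": {
            "connection_name": connection_name,
            "objects": objects,
        },
        "channel": channel,
    }


# Placeholder names taken from the Salesforce example.
spec = build_ingestion_spec(
    name="salesforce_to_uc",
    connection_name="my_salesforce_oauth_connection",
    source_schema="salesforce",
    tables=["Account", "Opportunity"],
    destination_catalog="main",
    destination_schema="salesforce_raw",
)
payload = json.dumps(spec, indent=2)
print(payload)
```

The printed JSON has the same structure as the inline example, so it can be supplied as the `--json` argument once the placeholder names are replaced with real ones.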
+
+---
+
+## Detailed guides
+
+| Topic | File | When to read |
+|-------|------|--------------|
+| SaaS connectors (Salesforce, Workday Reports, ServiceNow, GA4, HubSpot, Confluence) | [1-saas-connectors.md](references/1-saas-connectors.md) | Unified SaaS pattern, per-connector deltas, OAuth flows, DAB stubs |
+| Database connectors (SQL Server cloud + on-prem) | [2-database-connectors.md](references/2-database-connectors.md) | Gateway pattern, change tracking vs CDC, network setup |
+| Ingestion decision tree | [4-ingestion-decision-tree.md](references/4-ingestion-decision-tree.md) | Lakeflow Connect vs Auto Loader vs Lakehouse Federation vs Delta Sharing |
+| Troubleshooting and monitoring | [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md) | Event log queries, common errors, escalation pointers |
+
+---
+
+## Workflow
+
+For each new ingestion pipeline:
+
+1. **Pick the connector category** — SaaS / database / file / push — and read the matching reference file.
+2. **Verify prerequisites** — UC target, source credentials, network path (for databases), region availability.
+3. **Create the UC `CONNECTION`** — UI for OAuth U2M, CLI / DAB for everything else.
+4. **Author the pipeline** — `databricks pipelines create --json` for one-offs, DAB YAML for anything shipping to a customer.
+5. **Trigger the first run** and watch the event log; see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md) for the SQL.
+6. **Schedule** via Jobs (`pipeline_task`) or `continuous: false` on the pipeline itself. Lakeflow Connect supports triggered only as of May 2026.
+
+---
+
+## Important
+
+- **Triggered only, no continuous mode** — pipelines run on a schedule or on-demand, never continuously. Check the connector reference for the latest status.
+- **Compute-only billing** — Lakeflow Connect is billed in DBUs (no per-row fee). Database connectors also incur classic-compute gateway DBUs in addition to the serverless ingestion pipeline DBUs. See the [pricing page](https://www.databricks.com/product/pricing/lakeflow-connect) for current rates.
+- **Salesforce auth is OAuth U2M only** — no machine-to-machine, no basic auth. Connection creation requires a UI walk-through.
+- **Database staging retention is 30 days** by default in the UC Volume between the gateway and the ingestion pipeline.
+- **Limits per pipeline** — most SaaS connectors cap at 250 tables per pipeline. Split across multiple pipelines if needed.
+
+---
+
+## Key Concepts
+
+- **UC `CONNECTION` is the credential anchor** — every Lakeflow Connect pipeline points at a UC connection. The connection owns the auth; the pipeline references it by name.
+- **Serverless ingestion pipeline + (optional) classic gateway** — SaaS connectors are pure serverless. Database connectors split into a customer-network gateway (classic) and a serverless ingestion pipeline (Delta-bound).
+- **CDC and schema evolution are built in** — for sources that support change tracking or CDC, the connector applies changes incrementally and evolves the target schema. Data-type changes typically require a full snapshot reload.
+- **Streaming Delta output** — destination tables are governed Delta tables with `applyAsChangesFrom` semantics for CDC sources. Compatible with downstream materialized views and Spark streaming.
+- **OAuth U2M is UI-only** — DAB / CLI cannot bootstrap OAuth U2M connections. Plan for a one-time human step.
+
+---
+
+## Common Issues
+
+| Issue | Solution |
+|-------|----------|
+| **Pipeline fails with `APPLY_CHANGES_FROM_SNAPSHOT_ERROR.DUPLICATE_KEY_VIOLATION`** | Primary key collision in the source snapshot. Inspect the source for duplicate rows on the declared PK column. |
+| **Watermark not advancing on a SaaS source** | Cursor field misconfigured. Check the connector reference for the supported cursor column per source object. |
+| **Column added in source but missing from target** | Schema evolution may need to be explicitly re-enabled per connector. Check connector docs. |
+| **Gateway requires an instance type unavailable in your region** | Apply a cluster policy override on the gateway pipeline; see [2-database-connectors.md](references/2-database-connectors.md). |
+| **`channel: PREVIEW` warning at pipeline create** | Expected for new connectors. Switch to `channel: CURRENT` once the connector is GA in your region. |
+| **`databricks pipelines create` succeeds but no data flows** | Confirm UC connection is in `READY` state and the destination schema exists. Check the event log for any `pre-flight` failures. |
+| **Ingestion run shows GB ingested >> source row size** | Expected for CDC sources — change log columns + schema metadata add overhead. |
+
+For a deeper troubleshooting reference, see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md).
+
+---
+
+## Related Skills
+
+- **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)** — the SDP runtime that Lakeflow Connect pipelines run on. For Auto Loader and downstream pipeline patterns.
+- **[databricks-zerobus-ingest](../databricks-zerobus-ingest/SKILL.md)** — push-based gRPC ingestion. Sibling to Lakeflow Connect's pull-based connectors.
+- **[databricks-dabs](../../skills/databricks-dabs/SKILL.md)** — author Lakeflow Connect pipelines as IaC.
+- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** — managing catalogs, schemas, and the UC `CONNECTION` objects that LFC credentials live in.
+- **[databricks-jobs](../../skills/databricks-jobs/SKILL.md)** — schedule ingestion pipelines with `pipeline_task`.
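
The per-pipeline table cap noted under Important implies that a long object list has to be divided before any pipeline is created. A minimal Python sketch of that split follows, assuming the cap is 250 (the skill says "most SaaS connectors", so the real limit should be confirmed per connector). The helper `split_tables`, the generated table names, and the pipeline naming scheme are all hypothetical.

```python
def split_tables(tables, max_per_pipeline=250):
    """Split a list of source tables into chunks of at most max_per_pipeline.

    Hypothetical helper; the default of 250 mirrors the cap quoted for
    most SaaS connectors and may differ for a given connector.
    """
    if max_per_pipeline < 1:
        raise ValueError("max_per_pipeline must be at least 1")
    return [
        tables[i:i + max_per_pipeline]
        for i in range(0, len(tables), max_per_pipeline)
    ]


# Made-up table names, for illustration only.
all_tables = [f"Object_{n}" for n in range(600)]
chunks = split_tables(all_tables)

# One pipeline name per chunk; the naming scheme is an assumption.
pipeline_names = [
    f"salesforce_to_uc_{idx:02d}" for idx, _ in enumerate(chunks, start=1)
]
for name, chunk in zip(pipeline_names, chunks):
    print(name, len(chunk))
```

Each chunk can then feed its own `ingestion_definition.objects` list, giving one pipeline per chunk.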
+
+---
+
+## Resources
+
+- [Lakeflow Connect public docs hub](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect)
+- [Connector reference (per-connector setup)](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors)
+- [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect)
diff --git a/experimental/databricks-lakeflow-connect/agents/openai.yaml b/experimental/databricks-lakeflow-connect/agents/openai.yaml
new file mode 100644
index 0000000..b27696c
--- /dev/null
+++ b/experimental/databricks-lakeflow-connect/agents/openai.yaml
@@ -0,0 +1,7 @@
+interface:
+  display_name: "Databricks Lakeflow Connect"
+  short_description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect."
+  icon_small: "./assets/databricks.svg"
+  icon_large: "./assets/databricks.png"
+  brand_color: "#FF3621"
+  default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect."
diff --git a/experimental/databricks-lakeflow-connect/assets/databricks.png b/experimental/databricks-lakeflow-connect/assets/databricks.png
new file mode 100644
index 0000000000000000000000000000000000000000..263fe98b84e8ff3516edc93e7c99230fb8fb3113
GIT binary patch
literal 15366

diff --git a/experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md b/experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md
new file mode 100644
index 0000000..a4c9071
--- /dev/null
+++ b/experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md
@@ -0,0 +1,128 @@
+# Ingestion decision tree
+
+Databricks ships several first-party ingestion approaches and the right pick depends on **where the data lives** and **whether you need a copy in your lakehouse**. This reference is the map for choosing between them.
+
+The four approaches:
+
+- **Lakeflow Connect** — managed pull for SaaS apps and databases. Fastest path when a connector for your source exists.
+- **Auto Loader** — code-yours pull for files on cloud object storage. Full control, file sources only.
+- **Lakehouse Federation** — query-in-place; the data stays in the source.
+- **Delta Sharing** — the inbound side of someone else's lakehouse; you accept a share rather than build a pipeline.
+
+For event-driven push (the source pushes to you instead of you pulling) the relevant approach is **Zerobus Ingest**, covered separately in the [databricks-zerobus-ingest](../../databricks-zerobus-ingest/SKILL.md) skill.
+
+---
+
+## Decision table
+
+Pick the row that matches your source type and constraint.
+
+| Where does the data live? | Need a copy? | Approach | Read more |
+|---|---|---|---|
+| SaaS app with a Lakeflow Connect connector (Salesforce, Workday, ServiceNow, GA4, HubSpot, Confluence, etc.) | Yes | Lakeflow Connect | [SKILL.md](../SKILL.md), [1-saas-connectors.md](1-saas-connectors.md) |
+| Operational database (SQL Server, PostgreSQL, MySQL) with a Lakeflow Connect connector | Yes, with CDC | Lakeflow Connect | [2-database-connectors.md](2-database-connectors.md) |
+| Operational database, low query volume, source can absorb the load | No copy needed | Lakehouse Federation | [docs](https://docs.databricks.com/aws/en/query-federation/) |
+| Cloud object storage (S3, ADLS, GCS) with files | Yes | Auto Loader | [databricks-pipelines](../../../skills/databricks-pipelines/SKILL.md) |
+| SaaS file repo (SharePoint, Google Drive, SFTP) | Yes | Lakeflow Connect | [public connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) |
+| Application or device pushing events at you | Yes (push, not pull) | Zerobus Ingest | [databricks-zerobus-ingest](../../databricks-zerobus-ingest/SKILL.md) |
+| Another lakehouse / partner data product offering a Delta share | Yes (accept, not build) | Delta Sharing | [docs](https://docs.databricks.com/aws/en/delta-sharing/) |
+| None of the above | — | Hand-rolled Structured Streaming or `read_files` from object storage | [databricks-spark-structured-streaming](../../databricks-spark-structured-streaming/SKILL.md) |
+
+---
+
+## Lakeflow Connect vs Auto Loader
+
+Both pull data into Delta tables, but they cover different source types.
+
+**Lakeflow Connect wins when:**
+- The source is a SaaS application or a database (not files on object storage).
+- The source has its own auth (OAuth, API key, DB user).
+- You want CDC, schema evolution, and retries handled by the platform.
+- You prefer declarative configuration over code.
+
+**Auto Loader wins when:**
+- The source is files on cloud object storage (S3, ADLS, GCS).
+- You need custom file format parsing or inline transforms.
+- You want full control over checkpointing, schema hints, and trigger cadence.
+
+**Common confusion**: SFTP and SharePoint look like file sources but go through Lakeflow Connect, not Auto Loader. Auto Loader is for **cloud object storage** specifically.
+
+---
+
+## Lakeflow Connect vs Lakehouse Federation
+
+Both let you work with data that lives outside your lakehouse, but the difference is whether the data gets copied.
+
+**Lakeflow Connect wins when:**
+- You need a governed Delta copy in your lakehouse for performance, ML training, or downstream pipelines.
+- Query volume against the source data is high.
+- The source is performance-sensitive (you don't want to add query load to your production OLTP).
+- You need point-in-time history (CDC into a Delta table with `applyAsChangesFrom`).
+
+**Lakehouse Federation wins when:**
+- Data should stay in the source for governance or residency reasons.
+- Query patterns are sparse (a few analysts, occasional ad-hoc queries).
+- The source can comfortably absorb additional query load.
+- You don't need history beyond what the source already retains.
+
+**Common confusion**: both use a Unity Catalog `CONNECTION` object. The difference is what you do with it — Lakeflow Connect creates an ingestion pipeline that materializes to Delta; Federation creates a foreign catalog that queries through to the source.
+
+---
+
+## Lakeflow Connect vs Delta Sharing
+
+Delta Sharing is not really a build decision; it's the receiving end of someone else's pipeline.
+
+**Lakeflow Connect**: you build the ingestion pipeline. You own the connector configuration, the schedule, and the destination tables. Source can be anything LFC supports.
+
+**Delta Sharing**: a data provider (another lakehouse, a partner product) offers you a share. You accept it via a Delta Sharing client and the data appears in your catalog as a shared table. You don't operate the pipeline.
+ +Use Delta Sharing when a data partner offers it; there's nothing to build. Use Lakeflow Connect when you need to pull from a system the partner doesn't offer a share for. + +--- + +## Lakeflow Connect vs Zerobus Ingest + +The push-vs-pull distinction. + +**Lakeflow Connect** is **pull-based**: the ingestion pipeline reaches out to the source on a schedule. + +**Zerobus Ingest** is **push-based**: an application or device pushes records into a Delta table via gRPC. There is no source system to pull from; the producer drives the cadence. + +Use Lakeflow Connect when the source is a system you query. Use Zerobus when the source is an application you control (or a device emitting events) that wants to write directly. + +--- + +## Cost considerations + +Lakeflow Connect, Auto Loader, and Lakehouse Federation are billed in DBUs (compute time), with no per-row or per-connector fee. Delta Sharing and Zerobus Ingest are priced differently, as noted below. + +- **Lakeflow Connect**: serverless ingestion pipeline DBUs; database connectors also incur classic-compute gateway DBUs. +- **Auto Loader**: serverless or classic compute DBUs depending on where the pipeline runs. +- **Lakehouse Federation**: SQL warehouse DBUs for the queries that read through the foreign catalog, plus any costs the source charges. +- **Delta Sharing**: typically free for the recipient (the provider may charge separately outside Databricks). +- **Zerobus Ingest**: per-GB ingested, billed under the Lakeflow Jobs Serverless SKU. + +See the [Databricks pricing page](https://www.databricks.com/product/pricing) and the per-product pricing pages linked from there. + +--- + +## When Lakeflow Connect doesn't fit yet + +A few situations where you'll reach for one of the alternatives: + +- **The connector for your source isn't in the catalog.** Check the [connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors): if your source isn't listed, use Auto Loader (if it's files), a hand-rolled Structured Streaming job, or wait for the connector to ship. 
+- **You need continuous ingestion.** Lakeflow Connect supports triggered runs only as of May 2026. For sub-minute latency on file sources, run Auto Loader as an always-on stream with a short `processingTime` trigger (`Trigger.AvailableNow` drains the backlog and then stops, so it suits scheduled runs instead), or use Structured Streaming directly. +- **You need to push instead of pull.** That's Zerobus. +- **You want zero copy.** That's Lakehouse Federation. + +--- + +## Resources + +- [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect) +- [Connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) +- [Auto Loader docs](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/) +- [Lakehouse Federation docs](https://docs.databricks.com/aws/en/query-federation/) +- [Delta Sharing docs](https://docs.databricks.com/aws/en/delta-sharing/) +- [Pricing](https://www.databricks.com/product/pricing) diff --git a/manifest.json b/manifest.json index e6925dc..80485c7 100644 --- a/manifest.json +++ b/manifest.json @@ -213,6 +213,18 @@ "repo_dir": "skills", "version": "0.1.0" }, + "databricks-lakeflow-connect": { + "description": "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. 
Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing.", + "files": [ + "SKILL.md", + "agents/openai.yaml", + "assets/databricks.png", + "assets/databricks.svg", + "references/4-ingestion-decision-tree.md" + ], + "repo_dir": "experimental", + "version": "0.0.1" + }, "databricks-metric-views": { "description": "Unity Catalog metric views: define, create, query, and manage governed business metrics in YAML. Use when building standardized KPIs, revenue metrics, order analytics, or any reusable business metrics that need consistent definitions across teams and tools.", "files": [ From 1b907c53dbfa2384a576c52cad997502d7f041a7 Mon Sep 17 00:00:00 2001 From: Jose Alfonso Date: Wed, 27 May 2026 22:29:08 +0200 Subject: [PATCH 2/4] experimental(lakeflow-connect): add 1-saas-connectors reference Deep coverage for the six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas. Signed-off-by: Jose Alfonso --- .../references/1-saas-connectors.md | 136 ++++++++++++++++++ manifest.json | 1 + 2 files changed, 137 insertions(+) create mode 100644 experimental/databricks-lakeflow-connect/references/1-saas-connectors.md diff --git a/experimental/databricks-lakeflow-connect/references/1-saas-connectors.md b/experimental/databricks-lakeflow-connect/references/1-saas-connectors.md new file mode 100644 index 0000000..304a5d1 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/references/1-saas-connectors.md @@ -0,0 +1,136 @@ +# SaaS connectors + +The six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence) all share the same authoring pattern. 
This reference covers the unified flow once, then captures per-connector deltas. + +--- + +## The unified SaaS pattern + +Three steps for every SaaS connector: + +1. **Create a UC `CONNECTION`** that owns the source credentials. + - **OAuth U2M** connections (Salesforce, ServiceNow, HubSpot, Confluence) must be created in Catalog Explorer — the OAuth handshake requires a browser. CLI and DAB cannot bootstrap U2M. + - **API-key / basic / refresh-token** connections (Workday Reports, GA4 via service account, ServiceNow basic) can be created with `databricks connections create` or a DAB resource. +2. **Create the ingestion pipeline** with `databricks pipelines create --json` (or DAB). The pipeline carries the `ingestion_definition` block that names the connection and lists the source objects to land. +3. **Schedule the pipeline**. Lakeflow Connect supports triggered runs only — schedule with a Jobs `pipeline_task` or with the pipeline's own `continuous: false` cron block. + +A minimal pipeline JSON: + +```json +{ + "name": "salesforce_to_uc", + "ingestion_definition": { + "connection_name": "my_salesforce_oauth_connection", + "objects": [ + {"table": {"source_schema": "salesforce", "source_table": "Account", + "destination_catalog": "main", "destination_schema": "salesforce_raw"}} + ] + }, + "channel": "PREVIEW" +} +``` + +Keys to know: + +- `ingestion_definition.connection_name` — the UC connection name (not URL, not ID). +- `objects[].table` — one entry per source table. Use `objects[].schema` to ingest a whole source schema in one block. +- `channel: PREVIEW` is required for connectors not yet fully GA in your region. Switch to `CURRENT` once available. + +--- + +## Salesforce + +- **Auth**: OAuth U2M only. No machine-to-machine, no basic auth, no API key. The connection must be created in Catalog Explorer with a browser-based login. +- **Limit**: 250 tables per pipeline. Split larger workloads into multiple pipelines partitioned by object family. 
+- **Formula fields**: ingested as full snapshots only — incremental CDC is not available for computed columns. Plan for higher DBU usage on objects with many formula fields. +- **Data-type changes**: source data-type changes are not auto-handled. A reload from snapshot is required when the source column type changes. +- **Sandboxes**: a separate UC connection per sandbox vs production org. Don't reuse connections across orgs. + +--- + +## Workday Reports (RaaS) + +The Workday connector is **Report-as-a-Service** — it ingests Workday custom reports, not raw HCM tables. Workday HCM is a separate (Beta) connector. + +- **Auth**: OAuth refresh token (recommended for production) or HTTP basic. The refresh token must be minted in Workday and stored in the UC connection. +- **Source objects**: each "table" is a Workday custom report. Configure the report in Workday first, then reference it by name in the pipeline. +- **Limits**: same 250-table-per-pipeline cap; per-report row limits inherit from the Workday report itself. +- **No auto data-type evolution**: report schema changes require a pipeline edit + reload. + +--- + +## ServiceNow + +- **Auth**: OAuth U2M (recommended) or HTTP basic. OAuth requires a registered ServiceNow OAuth application; basic auth requires a service account with read access to the target tables. +- **Source objects**: ServiceNow table names (e.g., `incident`, `change_request`). Reference fields (sys_id -> related record) are kept as `sys_id` strings — joins happen downstream. +- **Limits**: 250 tables per pipeline. Long-running ServiceNow instances with custom tables may need multiple pipelines. +- **Pagination**: handled by the connector; no client-side configuration needed. + +--- + +## Google Analytics 4 + +GA4 ingestion goes **via BigQuery** — Lakeflow Connect reads from the GA4 BigQuery export, not from the GA4 API directly. The customer must enable BigQuery export in their GA4 property before the connector can run. 
+ +- **Auth**: GCP service-account JSON key. The service account needs `BigQuery Data Viewer` on the GA4 export dataset. +- **Prereq**: GA4 -> BigQuery export must be enabled (Admin -> BigQuery Links). Daily export is the typical setup; streaming export is supported. +- **Source objects**: the `events_*` tables in the GA4 export dataset. The connector handles the daily-shard pattern transparently. +- **Latency**: bounded by the GA4 -> BigQuery export cadence (typically next-day for daily export). + +--- + +## HubSpot + +- **Auth**: OAuth U2M. +- **Source objects**: HubSpot CRM objects (Contacts, Companies, Deals, Tickets, etc.) plus engagements. Check the connector reference for the current object list. +- **Status caveat**: status may differ by region — check the [connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) to confirm GA in your region before relying on production SLAs. + +--- + +## Confluence + +- **Auth**: OAuth U2M. +- **Source objects**: spaces, pages, comments. Markup is preserved in the page body column. +- **Status caveat**: same as HubSpot — confirm regional availability in the connector reference. + +--- + +## DAB pattern for SaaS connectors + +The production authoring path is a Databricks Asset Bundle resource. A minimal pipeline resource: + +```yaml +resources: + pipelines: + salesforce_ingestion: + name: salesforce_to_uc + channel: PREVIEW + ingestion_definition: + connection_name: my_salesforce_oauth_connection + objects: + - table: + source_schema: salesforce + source_table: Account + destination_catalog: ${var.catalog} + destination_schema: salesforce_raw + - table: + source_schema: salesforce + source_table: Opportunity + destination_catalog: ${var.catalog} + destination_schema: salesforce_raw +``` + +Schedule it via a Jobs resource with a `pipeline_task` pointing at this pipeline. 
See [databricks-dabs](../../../skills/databricks-dabs/SKILL.md) for bundle structure, target overrides, and the recommended layout for multi-pipeline bundles. + +--- + +## Common SaaS gotchas + +| Symptom | Likely cause | Fix | +|---|---|---| +| Watermark not advancing on an object | Cursor field misconfigured for that source object | Check the per-connector cursor-column docs; some objects need an explicit cursor override. | +| Duplicate-key error after a snapshot reload | Source has duplicate PKs (Salesforce composite keys, ServiceNow merged records) | Inspect the source for the duplicates; the connector won't auto-resolve. | +| New source column missing from the target | Schema evolution disabled or not yet propagated | Re-enable schema evolution on the destination table and trigger a snapshot run. | +| OAuth connection stuck in `PENDING` | U2M authorization not completed in Catalog Explorer | Re-open the connection in Catalog Explorer and complete the browser flow. | +| `channel: PREVIEW` warning at create time | Expected for connectors not yet GA in your region | Switch to `CURRENT` once the connector is GA where the pipeline runs. | +| Pipeline succeeds but no rows land | Destination schema missing, or the connection account lacks read on the source object | Check the event log; pre-flight errors are surfaced there. 
| diff --git a/manifest.json b/manifest.json index 80485c7..8e3b244 100644 --- a/manifest.json +++ b/manifest.json @@ -220,6 +220,7 @@ "agents/openai.yaml", "assets/databricks.png", "assets/databricks.svg", + "references/1-saas-connectors.md", "references/4-ingestion-decision-tree.md" ], "repo_dir": "experimental", From 81d43e5bceecb0cc5560381273409e7a0fd77857 Mon Sep 17 00:00:00 2001 From: Jose Alfonso Date: Wed, 27 May 2026 22:30:50 +0200 Subject: [PATCH 3/4] experimental(lakeflow-connect): add 2-database-connectors reference Deep coverage for SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, and gateway-specific gotchas. Brief pointer to Public Preview database connectors (Postgres/MySQL CDC, query-based, Foreign Catalog) pending deep coverage as they stabilize. Signed-off-by: Jose Alfonso --- .../references/2-database-connectors.md | 145 ++++++++++++++++++ manifest.json | 1 + 2 files changed, 146 insertions(+) create mode 100644 experimental/databricks-lakeflow-connect/references/2-database-connectors.md diff --git a/experimental/databricks-lakeflow-connect/references/2-database-connectors.md b/experimental/databricks-lakeflow-connect/references/2-database-connectors.md new file mode 100644 index 0000000..91bde98 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/references/2-database-connectors.md @@ -0,0 +1,145 @@ +# Database connectors + +SQL Server (cloud and on-prem) is the GA database connector. Postgres CDC, MySQL CDC, query-based variants, and Foreign Catalog connectors are Public Preview — production-supported but covered briefly here; see the [connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for their setup until deep coverage lands in a follow-up. + +--- + +## The gateway pattern + +Database connectors are **not** pure serverless. 
They split into two pipelines: + +``` + ┌───────────────────────┐ + │ Customer database │ + │ (SQL Server, etc.) │ + └──────────┬────────────┘ + │ CDC / change tracking + ▼ + ┌───────────────────────┐ + │ Ingestion gateway │ classic compute, + │ (one pipeline) │ runs in customer VPC + └──────────┬────────────┘ + │ change events + ▼ + ┌───────────────────────┐ + │ UC Volume staging │ 30-day retention by default + └──────────┬────────────┘ + │ + ▼ + ┌───────────────────────┐ + │ Ingestion pipeline │ serverless, + │ (one pipeline) │ applies CDC into Delta + └──────────┬────────────┘ + ▼ + ┌───────────────────────┐ + │ Delta tables in UC │ + └───────────────────────┘ +``` + +Why each piece: + +- **Gateway** runs in the customer's network so the source database is never exposed to Databricks-managed compute. It reads the CDC / change-tracking stream and writes change events into a UC Volume. +- **Staging Volume** decouples the two pipelines: the gateway can run on its own cadence, and the ingestion pipeline can re-process from the Volume without re-reading the source. +- **Ingestion pipeline** is the serverless half — it applies the staged events to Delta with CDC semantics and handles schema evolution. + +Trade-offs: + +- Two pipelines, two pieces of state. Both must be healthy. +- Gateway is **classic compute** — billed separately from the serverless ingestion DBUs. See the [pricing page](https://www.databricks.com/product/pricing/lakeflow-connect) for current rates. +- Staging Volume retention is 30 days. Reprocessing further back requires a snapshot reload. + +--- + +## SQL Server: change tracking vs CDC + +SQL Server offers two source mechanisms; pick one per database. + +**Change Tracking (CT)** — lightweight. The source tracks "which rows changed since version X" but not the actual change history. The gateway re-reads changed rows from the base table. + +- Lower overhead on the source. +- Adequate when downstream only needs the latest state per PK. 
+- Cannot reconstruct historical change order. + +**Change Data Capture (CDC)** — full change log. The source writes inserts/updates/deletes into change tables that the gateway reads directly. + +- Higher overhead on the source (separate change tables, log reader job). +- Required when downstream needs per-event history (audit, SCD2 from raw deltas, etc.). + +Most pipelines start with CT and switch to CDC only when audit or SCD2 demands it. + +--- + +## SQL Server cloud setup + +Prerequisites: + +1. **SQL Server 2012+** (cloud-managed: Azure SQL DB, Azure SQL MI, RDS for SQL Server). +2. **A dedicated database user** with `db_owner` on the source database, or the minimum grants for CT/CDC (see the connector reference). +3. **CT or CDC enabled** on the source tables (`ALTER DATABASE ... SET CHANGE_TRACKING = ON` for CT; `sys.sp_cdc_enable_table` for CDC). +4. **Network reachability** — the gateway compute must reach the source database. For cloud SQL Server this is usually VPC peering or PrivateLink. + +A DAB stub with both pipelines: + +```yaml +resources: + pipelines: + sqlserver_gateway: + name: sqlserver_gateway + channel: PREVIEW + gateway_definition: + connection_name: my_sqlserver_connection + gateway_storage_catalog: ${var.catalog} + gateway_storage_schema: ingestion_staging + gateway_storage_name: sqlserver_gateway_storage + + sqlserver_ingestion: + name: sqlserver_to_uc + channel: PREVIEW + ingestion_definition: + ingestion_gateway_id: ${resources.pipelines.sqlserver_gateway.id} + objects: + - table: + source_catalog: sales_db + source_schema: dbo + source_table: orders + destination_catalog: ${var.catalog} + destination_schema: sqlserver_raw +``` + +The SDK Python equivalent uses `w.pipelines.create` twice — once with `gateway_definition`, once with `ingestion_definition` referencing the gateway's pipeline ID. 
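The create-twice order described above can be sketched in plain Python. This is a minimal sketch, not the SDK call itself: it only builds the two request bodies as dicts whose keys mirror the DAB stub, on the assumption that the pipelines API accepts the same field names. The helper names `build_gateway_payload` and `build_ingestion_payload`, the `main` catalog, and the `<gateway-pipeline-id>` placeholder are hypothetical; in practice the ID comes from the response to the first create call.

```python
import json


def build_gateway_payload(connection_name, catalog, staging_schema, storage_name):
    """Body for the gateway pipeline, which must be created first."""
    return {
        "name": "sqlserver_gateway",
        "channel": "PREVIEW",
        "gateway_definition": {
            "connection_name": connection_name,
            "gateway_storage_catalog": catalog,
            "gateway_storage_schema": staging_schema,
            "gateway_storage_name": storage_name,
        },
    }


def build_ingestion_payload(gateway_pipeline_id, catalog, destination_schema, tables):
    """Body for the ingestion pipeline; it points at the gateway by pipeline ID."""
    return {
        "name": "sqlserver_to_uc",
        "channel": "PREVIEW",
        "ingestion_definition": {
            "ingestion_gateway_id": gateway_pipeline_id,
            "objects": [
                {
                    "table": {
                        "source_catalog": src_catalog,
                        "source_schema": src_schema,
                        "source_table": src_table,
                        "destination_catalog": catalog,
                        "destination_schema": destination_schema,
                    }
                }
                for (src_catalog, src_schema, src_table) in tables
            ],
        },
    }


# Step 1: gateway body (values taken from the DAB stub; "main" is an assumed catalog).
gateway = build_gateway_payload(
    "my_sqlserver_connection", "main", "ingestion_staging", "sqlserver_gateway_storage"
)

# Step 2: ingestion body. The ID below is a hypothetical placeholder for the
# pipeline ID returned when the gateway pipeline is created.
ingestion = build_ingestion_payload(
    "<gateway-pipeline-id>", "main", "sqlserver_raw", [("sales_db", "dbo", "orders")]
)

print(json.dumps(gateway, indent=2))
print(json.dumps(ingestion, indent=2))
```

Keeping the two bodies in separate builders makes the dependency explicit: the ingestion body cannot be completed until the gateway pipeline exists and its ID is known.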
+ +--- + +## SQL Server on-prem + +Same setup as cloud, plus private networking from the gateway to the on-prem source: + +- **Azure**: ExpressRoute or VPN gateway between the customer VNet and the on-prem network. +- **AWS**: Direct Connect or Site-to-Site VPN between the customer VPC and the on-prem network. +- **GCP**: Cloud Interconnect or Cloud VPN. + +The gateway compute itself runs on Databricks-managed VPC infrastructure inside the customer's workspace, so the private link only needs to extend that far. + +--- + +## Database-specific gotchas + +| Symptom | Likely cause | Fix | +|---|---|---| +| Gateway requires an instance type unavailable in the region | Default gateway cluster shape not stocked in the target region | Apply a cluster policy override on the gateway pipeline to pin a regionally-available instance type. | +| Snapshot-only mode silently disabled | Snapshot-only is not supported for CDC sources | Use CT instead, or accept incremental mode. | +| Pipeline state diverges from source after 30+ days | Staging Volume retention expired | Resnapshot the affected tables. Increase the Volume retention if reprocessing further back is a recurring need. | +| "Continuous mode not supported" error at create | Lakeflow Connect is triggered-only as of May 2026 | Use `continuous: false` plus a Jobs schedule. | +| Gateway pipeline succeeds but ingestion pipeline shows no new data | Staging path mismatch between the two pipelines | Confirm `gateway_storage_*` on the gateway matches the staging path the ingestion pipeline reads from. | + +--- + +## Public Preview database connectors (brief) + +The following are production-supported but ship more pattern variance than SQL Server. Use the [connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) for current setup steps: + +- **Postgres CDC, MySQL CDC** — same gateway pattern as SQL Server; logical decoding (Postgres) or binlog (MySQL) replaces CT/CDC. 
+- **Oracle / Teradata / SQL Server / Postgres / MySQL query-based** — no gateway; the connector issues periodic queries instead of reading a change feed. Trade-off: simpler, but higher source load and no per-event history. +- **Snowflake / Redshift / Synapse / BigQuery (Foreign Catalog)** — Lakeflow Connect creates the foreign catalog and materializes the queried subset to Delta. Most useful for warehouse-to-lakehouse migration scenarios. + +Deep coverage for these connectors will land as they stabilize. diff --git a/manifest.json b/manifest.json index 8e3b244..8896444 100644 --- a/manifest.json +++ b/manifest.json @@ -221,6 +221,7 @@ "assets/databricks.png", "assets/databricks.svg", "references/1-saas-connectors.md", + "references/2-database-connectors.md", "references/4-ingestion-decision-tree.md" ], "repo_dir": "experimental", From 7355c515f8d98ff2d6795a8ed75e4b48f5ebc5f5 Mon Sep 17 00:00:00 2001 From: Jose Alfonso Date: Wed, 27 May 2026 22:32:19 +0200 Subject: [PATCH 4/4] experimental(lakeflow-connect): add 5-troubleshooting reference Event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers (public docs hub, connector reference, workspace support). 
Signed-off-by: Jose Alfonso --- .../5-troubleshooting-and-monitoring.md | 52 +++++++++++++++++++ manifest.json | 3 +- 2 files changed, 54 insertions(+), 1 deletion(-) create mode 100644 experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md diff --git a/experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md b/experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md new file mode 100644 index 0000000..97796c4 --- /dev/null +++ b/experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md @@ -0,0 +1,52 @@ +# Troubleshooting and monitoring + +This reference covers what to check when an ingestion pipeline misbehaves: where the logs live, the common error shapes, and the escalation path. + +--- + +## Where to look first + +Every Lakeflow Connect pipeline emits a structured event log. For SaaS pipelines that's the only artifact; for database pipelines you'll also want to inspect the gateway pipeline's events. + +The event log is a Delta table on the pipeline. Query it through SQL, substituting the pipeline ID for the placeholder: + +```sql +SELECT timestamp, level, message, error +FROM event_log("<pipeline-id>") +WHERE level IN ('ERROR', 'WARN') + AND timestamp > current_timestamp() - INTERVAL 1 DAY +ORDER BY timestamp DESC +LIMIT 50; +``` + +For event-log table conventions (filtering by `event_type`, joining with metrics, etc.), see [databricks-pipelines](../../../skills/databricks-pipelines/SKILL.md). + +**Database pipelines have two event logs**: one for the gateway, one for the ingestion pipeline. A symptom on the ingestion side often has its root cause in the gateway side. When debugging database connectors, query both. + +--- + +## Common errors and resolutions + +| Error / symptom | Likely cause | Resolution | +|---|---|---| +| `APPLY_CHANGES_FROM_SNAPSHOT_ERROR.DUPLICATE_KEY_VIOLATION` | Source snapshot contains duplicate values on the declared primary key. 
| Inspect the source for duplicate PKs (often a merged record or composite-key surprise). The connector won't auto-resolve; fix at the source or change the PK declaration. | +| `validate_only` update appears in pipeline run history | Expected. A dry-run validation run is logged alongside actual runs. | Filter `event_log` on `details:flow_progress.status != 'VALIDATING'` if the dry-runs are noisy. | +| SCD2 row count doesn't match raw source count | Expected. SCD2 multiplies rows per change (one row per version), so SCD2 row count >> source row count is normal. | Compare on PK count with `current = true` instead of total row count. | +| `NULL` values appear after switching SCD1 -> SCD2 | Expected. Pre-switch history is reconstructed as a single open version with `NULL`s for unknown deltas. | Re-snapshot the table if a clean SCD2 history is required from a specific point. | +| `GB ingested` >> source row size in the metrics | Expected for CDC sources. Change log columns, schema metadata, and per-batch overhead inflate ingested bytes. | Use source row count, not GB ingested, as the workload sizing signal. | +| Gateway pipeline fails: instance type unavailable in region | Default gateway cluster shape isn't stocked in the target region. | Apply a cluster policy override on the gateway pipeline to pin a regionally-available instance type. | +| Pipeline runs but the destination table never updates | UC `CONNECTION` not in `READY` state, OR destination schema missing. | Run `DESCRIBE CONNECTION <connection-name>`; the state must be `READY`. Verify the destination schema exists and the pipeline's service principal has `CREATE TABLE` + `MODIFY`. | +| OAuth U2M connection refreshes fail after weeks of working | Refresh token expired or revoked at the SaaS source. | Re-open the connection in Catalog Explorer and re-authorize. Plan for periodic re-auth if the SaaS source enforces a refresh-token lifetime. 
| +| `channel: PREVIEW` warning at pipeline create | Expected for connectors not yet GA in your region. | Switch to `CURRENT` once the connector is GA where the pipeline runs. | + +--- + +## Escalation pointers + +When the event log doesn't explain a failure: + +1. **Public docs hub** — [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect) covers concepts and links to per-connector pages. +2. **Connector reference** — [per-connector setup](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) is the canonical source for current auth, limits, and supported objects per source. +3. **Workspace support** — file a support case from Help -> Contact Support inside the workspace; attach the pipeline ID and a relevant `event_log` extract. + +For monitoring patterns beyond event-log queries (dashboards, alerting on pipeline state, SLAs), see [databricks-pipelines](../../../skills/databricks-pipelines/SKILL.md). diff --git a/manifest.json b/manifest.json index 8896444..c3a482e 100644 --- a/manifest.json +++ b/manifest.json @@ -222,7 +222,8 @@ "assets/databricks.svg", "references/1-saas-connectors.md", "references/2-database-connectors.md", - "references/4-ingestion-decision-tree.md" + "references/4-ingestion-decision-tree.md", + "references/5-troubleshooting-and-monitoring.md" ], "repo_dir": "experimental", "version": "0.0.1"