experimental: add databricks-lakeflow-connect skill #103
Initial scope-first commit for a draft PR. GA-first deep coverage, PuPr listed but deferred to follow-up commits.

This commit includes:
- SKILL.md (routing + 3-tier catalog: GA / PuPr / Beta-PrPr + workflow + key concepts + common issues)
- references/4-ingestion-decision-tree.md (LFC vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches)
- agents/openai.yaml + assets/ via scripts/skills.py generate
- manifest.json updated

To follow in subsequent commits:
- references/1-saas-connectors.md (Salesforce, Workday Reports, ServiceNow, GA4, HubSpot, Confluence; all GA)
- references/2-database-connectors.md (SQL Server cloud + on-prem + gateway pattern intro; GA)
- references/5-troubleshooting-and-monitoring.md (GA-focused)

Public Preview connectors (NetSuite, Dynamics 365, PG/MySQL CDC, query-based databases, Foreign Catalog query-based, SFTP) are production-supported and listed in SKILL.md; deep coverage will be added incrementally as PuPr connectors stabilize. SharePoint/Google Drive (Beta currently, GA Jun 1 target) and other Beta/PrPr connectors are not first-class in this skill. references/3-file-and-streaming-connectors.md will be created when SFTP + SharePoint/Drive get deep coverage (post-v1).

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Talked with @jralfonsog and he will add in the GA connector references he has been working on as part of this PR before we review and finalize.
Deep coverage for the six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas. Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Deep coverage for SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, and gateway-specific gotchas. Brief pointer to Public Preview database connectors (Postgres/MySQL CDC, query-based, Foreign Catalog) pending deep coverage as they stabilize. Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers (public docs hub, connector reference, workspace support). Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Skillforge full evaluation (L1 - L5): composite 0.76, PASS
After all four content commits landed, ran the full Skillforge pyramid.
Pyramid summary
L3 audit trajectory across commits
L5 classification (95 checks across 8 test cases)
Per-case L5 response score
The two lowest-scoring cases (
L1+L2+L3 quick-eval baseline (run separately before L4/L5): composite 0.85 (L1=0.72, L2=1.00, L3=0.84).
Got some things to change and others to review, such as whether the GA connector examples should really use PREVIEW rather than CURRENT. Overall looking good.
Also, any links to other skills should match what stable skills are using, which is just the name without a link.
SUGGESTION: Align with the dominant convention. In Related Skills and inline mentions:
REPLACE: **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)**
WITH: **databricks-pipelines**
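If it helps to apply this across all reference files at once, the replacement could be sketched as a small script. This is only an illustration: the function name `unlink_skill_mentions` and the regex are assumptions, and it only handles the bold-link form shown above (a bold link whose target ends in `SKILL.md`).

```python
import re

# Matches a bold markdown link to another skill's SKILL.md, e.g.
# **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)**
# (hypothetical pattern; adjust if links are written differently)
SKILL_LINK = re.compile(r"\*\*\[([A-Za-z0-9_-]+)\]\([^)]*SKILL\.md\)\*\*")


def unlink_skill_mentions(markdown: str) -> str:
    """Replace bold skill links with the bold skill name only."""
    return SKILL_LINK.sub(r"**\1**", markdown)


if __name__ == "__main__":
    sample = (
        "See **[databricks-pipelines]"
        "(../../skills/databricks-pipelines/SKILL.md)** for details."
    )
    print(unlink_skill_mentions(sample))
```

Plain (non-bold) links and links to files other than `SKILL.md` are left untouched, so any of those would still need a manual pass.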
| icon_small: "./assets/databricks.svg" | ||
| icon_large: "./assets/databricks.png" | ||
| brand_color: "#FF3621" | ||
| default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect." |
| default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect." | |
| default_prompt: "Use $databricks-lakeflow-connect to build managed ingestion pipelines into databricks using Lakeflow Connect." |
| | Source | Type | Auth | GA target | | ||
| |--------|------|------|-----------| | ||
| | NetSuite | SaaS pull | OAuth | May 31, 2026 | | ||
| | Dynamics 365 | SaaS pull | OAuth | May 31, 2026 | | ||
| | PostgreSQL CDC | Database CDC | DB user + gateway | Jun 30, 2026 (tentative); ungated PuPr May 29 | | ||
| | MySQL CDC | Database CDC | DB user + gateway | Jul 15, 2026 (tentative); ungated PuPr May 29 | | ||
| | Oracle / Teradata / SQL Server / PG / MySQL (query-based) | Database query | DB user | Jun 30, 2026 | | ||
| | Snowflake / Redshift / Synapse / BigQuery (Foreign Catalog) | Database query | Foreign Catalog | Jun 30, 2026 | | ||
| | SFTP | File pull | Key / password | Jun 30, 2026 | |
Let's remove the GA target column; we have typically avoided specific dates in skills so they don't cause confusion or require constant maintenance.
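For reference, dropping one column from a pipe-delimited markdown table can be sketched as below. The helper name `drop_table_column` is hypothetical, and the sketch assumes no cell contains a literal `|`; editing the table by hand is equally fine for a table this small.

```python
def drop_table_column(table: str, header: str) -> str:
    """Remove the column named `header` from a pipe-delimited markdown table."""
    # Split every row (header, separator, data) into its cells.
    rows = [line.strip().strip("|").split("|") for line in table.strip().splitlines()]
    names = [cell.strip() for cell in rows[0]]
    idx = names.index(header)  # raises ValueError if the column is absent
    # Keep every cell except the one at the dropped column's position.
    kept = [[cell for i, cell in enumerate(row) if i != idx] for row in rows]
    return "\n".join("|" + "|".join(row) + "|" for row in kept)


if __name__ == "__main__":
    table = (
        "| Source | Type | Auth | GA target |\n"
        "|--------|------|------|-----------|\n"
        "| SFTP | File pull | Key / password | Jun 30, 2026 |"
    )
    print(drop_table_column(table, "GA target"))
```

The separator row is treated like any other row, so its matching segment is removed along with the header and data cells.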
| | Issue | Solution | | ||
| |-------|----------| | ||
| | **Pipeline fails with `APPLY_CHANGES_FROM_SNAPSHOT_ERROR.DUPLICATE_KEY_VIOLATION`** | Primary key collision in the source snapshot. Inspect the source for duplicate rows on the declared PK column. | | ||
| | **Watermark not advancing on a SaaS source** | Cursor field misconfigured. Check the connector reference for the supported cursor column per source object. | | ||
| | **Column added in source but missing from target** | Schema evolution may need to be explicitly re-enabled per connector. Check connector docs. | | ||
| | **Gateway requires an instance type unavailable in your region** | Apply a cluster policy override on the gateway pipeline; see [2-database-connectors.md](references/2-database-connectors.md). | | ||
| | **`channel: PREVIEW` warning at pipeline create** | Expected for new connectors. Switch to `channel: CURRENT` once the connector is GA in your region. | | ||
| | **`databricks pipelines create` succeeds but no data flows** | Confirm UC connection is in `READY` state and the destination schema exists. Check the event log for any `pre-flight` failures. | | ||
| | **Ingestion run shows GB ingested >> source row size** | Expected for CDC sources — change log columns + schema metadata add overhead. | | ||
| For a deeper troubleshooting reference, see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md). |
Can we consolidate this to just point to the troubleshooting reference, which should already have this info?
| **Documentation:** | ||
| - [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect) | ||
| - [Connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors) | ||
| - [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect) |
Remove. These are covered in references at the bottom of this file.
| --- | ||
| name: databricks-lakeflow-connect | ||
| description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing." | ||
| --- |
Add this, related to what is changing in PR #105
parent: databricks-core
compatibility: Requires databricks CLI (>= v0.294.0)
metadata:
version: "0.1.0"
| "destination_catalog": "main", "destination_schema": "salesforce_raw"}} | ||
| ] | ||
| }, | ||
| "channel": "PREVIEW" |
Is PREVIEW correct, not CURRENT?
| "destination_catalog": "main", "destination_schema": "salesforce_raw"}} | ||
| ] | ||
| }, | ||
| "channel": "PREVIEW" |
Is PREVIEW correct, not CURRENT?
| pipelines: | ||
| salesforce_ingestion: | ||
| name: salesforce_to_uc | ||
| channel: PREVIEW |
Is PREVIEW correct, not CURRENT?
| pipelines: | ||
| sqlserver_gateway: | ||
| name: sqlserver_gateway | ||
| channel: PREVIEW |
Is PREVIEW correct, not CURRENT?
| sqlserver_ingestion: | ||
| name: sqlserver_to_uc | ||
| channel: PREVIEW |
Is PREVIEW correct, not CURRENT?
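Since the same question applies to every example, a quick way to list each `channel` setting before deciding is sketched below. The function name `find_channels` and the regex are assumptions for illustration; the pattern is meant to cover both the YAML (`channel: PREVIEW`) and JSON (`"channel": "PREVIEW"`) spellings quoted above, and it does not say which value is correct for GA connectors.

```python
import re

# Matches both YAML (`channel: PREVIEW`) and JSON (`"channel": "PREVIEW"`) forms.
CHANNEL = re.compile(r'"?channel"?\s*:\s*"?(PREVIEW|CURRENT)"?')


def find_channels(text: str) -> list[tuple[int, str]]:
    """Return (line_number, channel) for every channel setting found in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = CHANNEL.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


if __name__ == "__main__":
    dab_stub = (
        "pipelines:\n"
        "  salesforce_ingestion:\n"
        "    name: salesforce_to_uc\n"
        "    channel: PREVIEW\n"
    )
    for number, channel in find_channels(dab_stub):
        print(f"line {number}: channel={channel}")
```

Running it over each reference file's contents would give one list of locations to update if the answer turns out to be CURRENT.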
@jralfonsog can you attach the report.html that's generated as well?
Summary
New experimental skill `databricks-lakeflow-connect` for managed ingestion pipelines. GA-first deep coverage; PuPr connectors are listed in `SKILL.md` as production-supported with deep coverage planned as they stabilize. No `databricks-pipelines` overlap: Lakeflow Connect pipelines reuse the pipelines API surface via `ingestion_definition`, and this skill cross-links to `skills/databricks-pipelines/` from the decision tree and Related Skills.

Changes
- `experimental/databricks-lakeflow-connect/SKILL.md` (~200 lines): routing + 3-tier catalog (GA / PuPr / Beta-PrPr) + workflow + key concepts + common issues.
- `experimental/databricks-lakeflow-connect/references/1-saas-connectors.md` (~135 lines): six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas.
- `experimental/databricks-lakeflow-connect/references/2-database-connectors.md` (~145 lines): SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, gateway-specific gotchas, brief pointer to PuPr database connectors.
- `experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md` (~130 lines): Lakeflow Connect vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches. Cross-links to the Auto Loader work in databricks-solutions/ai-dev-kit#539.
- `experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md` (~50 lines): event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers.
- `experimental/databricks-lakeflow-connect/agents/openai.yaml` + `assets/databricks.{svg,png}`: auto-generated via `scripts/skills.py generate`.
- `manifest.json`: updated by `scripts/skills.py generate` to register the new skill and its references.

SharePoint / Google Drive (Beta as of May 2026; GA target Jun 1) are not first-class in v1; they appear in the Beta/PrPr note in `SKILL.md`. `databricks-zerobus-ingest` is pointed to from the catalog and decision tree (push-vs-pull dichotomy), not re-covered.

To follow
- `references/3-file-and-streaming-connectors.md`: created when SFTP + SharePoint/Drive get deep coverage

Cross-repo
- `#ai-dev-kit-team` Slack on 2026-05-27; maintainers signed off on Databricks Agent Skills `experimental/` as the destination.

Test plan
- `python3 scripts/skills.py generate` clean.
- `python3 scripts/skills.py validate` passes (Everything is up to date.).
- `skills/databricks-pipelines/`, `skills/databricks-dabs/`, `skills/databricks-jobs/`, `experimental/databricks-zerobus-ingest/`, `experimental/databricks-unity-catalog/`).
- `stf audit` L3 trajectory across commits: 8.2 → 8.3 → 8.5 → 8.7 (all dimensions PASS at final).
- `stf generate -n 8 --difficulty mixed`, hand-curated tool-agnostic. See PR comment for L5 classification + per-case breakdown.
- `stf audit` per-dimension (L3, after all references)
- `3-file-and-streaming-connectors.md` and PuPr deep coverage)

This pull request was AI-assisted by Isaac.