experimental: add databricks-lakeflow-connect skill by jralfonsog · Pull Request #103 · databricks/databricks-agent-skills

jralfonsog · 2026-05-27T15:54:31Z

Summary

New experimental skill databricks-lakeflow-connect for managed ingestion pipelines. GA-first deep coverage; PuPr connectors are listed in SKILL.md as production-supported with deep coverage planned as they stabilize. No databricks-pipelines overlap — Lakeflow Connect pipelines reuse the pipelines API surface via ingestion_definition, and this skill cross-links to skills/databricks-pipelines/ from the decision tree and Related Skills.
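For readers unfamiliar with the shape being described, here is a minimal sketch of a pipeline spec that carries an `ingestion_definition` block. The `destination_catalog`, `destination_schema`, and `channel` values come from the diff excerpt quoted in this PR. The pipeline name, connection name, `source_schema`, and `source_table` values are hypothetical placeholders, and `build_ingestion_spec` is an illustrative helper, not part of the skill. Field names should be checked against the connector reference before use.

```python
import json


def build_ingestion_spec(name, connection_name, tables, channel="PREVIEW"):
    """Return a pipeline spec dict with an ingestion_definition block.

    Illustrative helper only: every value is caller-supplied, and the
    field layout is an assumption based on the excerpt quoted in this PR.
    """
    objects = [
        {
            "table": {
                "source_schema": t["source_schema"],
                "source_table": t["source_table"],
                "destination_catalog": t["destination_catalog"],
                "destination_schema": t["destination_schema"],
            }
        }
        for t in tables
    ]
    return {
        "name": name,
        "ingestion_definition": {
            "connection_name": connection_name,
            "objects": objects,
        },
        "channel": channel,
    }


spec = build_ingestion_spec(
    name="salesforce_ingest",                    # hypothetical pipeline name
    connection_name="my_salesforce_connection",  # hypothetical UC connection
    tables=[
        {
            "source_schema": "objects",          # hypothetical
            "source_table": "Account",           # hypothetical
            "destination_catalog": "main",
            "destination_schema": "salesforce_raw",
        }
    ],
)
print(json.dumps(spec, indent=2))
```

Keeping `channel` as a parameter makes it easy to switch between `PREVIEW` and `CURRENT` without touching the rest of the spec.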

Changes

experimental/databricks-lakeflow-connect/SKILL.md (~200 lines) — routing + 3-tier catalog (GA / PuPr / Beta-PrPr) + workflow + key concepts + common issues.
experimental/databricks-lakeflow-connect/references/1-saas-connectors.md (~135 lines) — six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas.
experimental/databricks-lakeflow-connect/references/2-database-connectors.md (~145 lines) — SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, gateway-specific gotchas, brief pointer to PuPr database connectors.
experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md (~130 lines) — Lakeflow Connect vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches. Cross-links to the Auto Loader work in databricks-solutions/ai-dev-kit#539.
experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md (~50 lines) — event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers.
experimental/databricks-lakeflow-connect/agents/openai.yaml + assets/databricks.{svg,png} — auto-generated via scripts/skills.py generate.
manifest.json — updated by scripts/skills.py generate to register the new skill and its references.

SharePoint / Google Drive (Beta as of May 2026; GA target Jun 1) are not first-class in v1 — they appear in the Beta/PrPr note in SKILL.md. databricks-zerobus-ingest is pointed to from the catalog and decision tree (push-vs-pull dichotomy), not re-covered.

To follow

Commit	Content
(v2)	PuPr deep coverage (NetSuite, Dynamics 365, PG/MySQL CDC, SFTP, query-based databases, Foreign Catalog query) as connectors stabilize
(v2)	`references/3-file-and-streaming-connectors.md` — created when SFTP + SharePoint/Drive get deep coverage

Cross-repo

Tracking issue: databricks-solutions/ai-dev-kit#499.
Companion ai-dev-kit PR: databricks-solutions/ai-dev-kit#539 ships the Auto Loader reference in the SDP skill that this skill's decision tree cross-links to.
Scope-checked in #ai-dev-kit-team Slack on 2026-05-27; maintainers signed off on Databricks Agent Skills experimental/ as the destination.

Test plan

python3 scripts/skills.py generate runs clean.
python3 scripts/skills.py validate passes (Everything is up to date.).
All cross-skill links resolve against the DAS layout (skills/databricks-pipelines/, skills/databricks-dabs/, skills/databricks-jobs/, experimental/databricks-zerobus-ingest/, experimental/databricks-unity-catalog/).
stf audit L3 trajectory across commits: 8.2 → 8.3 → 8.5 → 8.7 (all dimensions PASS at final).
Full Skillforge pyramid L1 - L5: composite 0.76, PASS. Per-level: L1=0.72 (36 checks), L2=1.00 (3 checks), L3=0.83 (15 checks), L4=0.68 (40 checks), L5=0.58 (95 checks). Ground truth = 8 cases generated with stf generate -n 8 --difficulty mixed, then hand-curated to be tool-agnostic. See PR comment for L5 classification + per-case breakdown.
CI green.

`stf audit` per-dimension (L3, after all references)

Dimension	Score	Status
tool_accuracy	10.0	PASS
examples_valid	10.0	PASS
no_conflicts	9.0	PASS
llm_navigable	9.0	PASS
scoped_clearly	10.0	PASS
security	9.0	PASS
actionable_instructions	8.0	PASS
error_handling	8.0	PASS
no_hallucination_triggers	7.0	PASS
self_contained	7.0	PASS (climbed from 6.0 baseline as references landed; remaining headroom is the deferred `3-file-and-streaming-connectors.md` and PuPr deep coverage)
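As a quick sanity check, the ten per-dimension scores above are consistent with the final overall audit score of 8.7 if the overall is an unweighted mean. The PR does not state how `stf audit` aggregates, so the averaging rule here is an assumption.

```python
from statistics import mean

# Per-dimension L3 audit scores copied from the table above.
dimension_scores = {
    "tool_accuracy": 10.0,
    "examples_valid": 10.0,
    "no_conflicts": 9.0,
    "llm_navigable": 9.0,
    "scoped_clearly": 10.0,
    "security": 9.0,
    "actionable_instructions": 8.0,
    "error_handling": 8.0,
    "no_hallucination_triggers": 7.0,
    "self_contained": 7.0,
}

# Assumption: the overall score is the plain (unweighted) mean.
overall = round(mean(dimension_scores.values()), 1)
print(overall)  # 8.7
```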

This pull request was AI-assisted by Isaac.

Initial scope-first commit for a draft PR. GA-first deep coverage, PuPr listed but deferred to follow-up commits.

This commit includes:
- SKILL.md (routing + 3-tier catalog: GA / PuPr / Beta-PrPr + workflow + key concepts + common issues)
- references/4-ingestion-decision-tree.md (LFC vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches)
- agents/openai.yaml + assets/ via scripts/skills.py generate
- manifest.json updated

To follow in subsequent commits:
- references/1-saas-connectors.md (Salesforce, Workday Reports, ServiceNow, GA4, HubSpot, Confluence; all GA)
- references/2-database-connectors.md (SQL Server cloud + on-prem + gateway pattern intro; GA)
- references/5-troubleshooting-and-monitoring.md (GA-focused)

Public Preview connectors (NetSuite, Dynamics 365, PG/MySQL CDC, query-based databases, Foreign Catalog query-based, SFTP) are production-supported and listed in SKILL.md; deep coverage will be added incrementally as PuPr connectors stabilize. SharePoint/Google Drive (Beta currently, GA Jun 1 target) and other Beta/PrPr connectors are not first-class in this skill. references/3-file-and-streaming-connectors.md will be created when SFTP + SharePoint/Drive get deep coverage (post-v1).

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>

dustinvannoy-db · 2026-05-27T16:49:17Z

Talked with @jralfonsog and he will add in the GA connector references he has been working on as part of this PR before we review and finalize.

Deep coverage for the six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas. Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>

Deep coverage for SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, and gateway-specific gotchas. Brief pointer to Public Preview database connectors (Postgres/MySQL CDC, query-based, Foreign Catalog) pending deep coverage as they stabilize. Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>

Event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers (public docs hub, connector reference, workspace support). Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>

jralfonsog · 2026-05-28T00:15:57Z

Skillforge full evaluation (L1 - L5) — composite 0.76, PASS

After all four content commits landed, ran the full Skillforge pyramid via the ~/voodoo/skillforge/SKILL.md orchestrator pattern. Ground truth was bootstrapped with stf generate -n 8 --difficulty mixed, then hand-curated to be tool-agnostic.

Pyramid summary

Level	Score	Checks	Status
L1 (unit / built-in)	0.72	36	PASS
L2 (integration / connectivity)	1.00	3	PASS
L3 (static / LLM judge)	0.83	15	PASS
L4 (thinking)	0.68	40	PASS
L5 (output WITH vs WITHOUT)	0.58	95	PASS
Composite	0.76		PASS
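The composite figure is consistent with an unweighted mean of the five level scores, and the same rule reproduces the 0.85 quick-eval baseline reported further down in this comment. Skillforge's actual aggregation is not stated here, so the sketch below treats the plain mean as an assumption.

```python
from statistics import mean

# Level scores copied from the pyramid summary.
full_pyramid = {"L1": 0.72, "L2": 1.00, "L3": 0.83, "L4": 0.68, "L5": 0.58}

# Quick-eval baseline (L1-L3 only), as reported in this comment.
quick_eval = {"L1": 0.72, "L2": 1.00, "L3": 0.84}

# Assumption: composite = unweighted mean of the level scores.
composite = round(mean(full_pyramid.values()), 2)
baseline = round(mean(quick_eval.values()), 2)

print(composite)  # 0.76
print(baseline)   # 0.85
```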

L3 audit trajectory across commits

After commit	Overall	self_contained	actionable	scoped_clearly	security
1 (SKILL.md + decision tree)	8.2	6.0	7.0	9.0	8.0
2 (+ SaaS connectors ref)	8.3	7.0	7.0	9.0	8.0
3 (+ database connectors ref)	8.5	7.0	7.0	9.0	9.0
4 (+ troubleshooting ref)	8.7	7.0	8.0	10.0	9.0

L5 classification (95 checks across 8 test cases)

Classification	Count	%
POSITIVE (skill taught the agent something useful)	34	36%
NEUTRAL (agent already knew it; skill not needed here)	33	35%
NEEDS_SKILL (both WITH and WITHOUT missed; coverage gap)	22	23%
REGRESSION (skill made the agent worse)	4	4%
UNTAGGED	2	2%

Per-case L5 response score

Case ID	Difficulty	Score	Notes
`saas_oauth_u2m_d9e3`	intermediate	0.83	OAuth U2M cannot be automated in DAB
`dab_authoring_h6c5`	hard	0.72	DAB conversion for Salesforce pipeline
`federation_vs_connect_e5f7`	intermediate	0.68	Snowflake — Federation vs Connect
`too_many_tables_f2a4`	intermediate	0.65	400-table partition workaround
`salesforce_basic_a3f1`	easy	0.63	Salesforce pipeline create
`continuous_mode_error_c4d8`	easy	0.63	Triggered-only constraint
`sqlserver_cdc_gateway_b7c2`	hard	0.20	Agent run cut by `--timeout 300` mid-tool-use
`no_data_flowing_g8b1`	intermediate	0.02	Agent run cut by `--timeout 300` mid-tool-use

The two lowest-scoring cases (sqlserver_cdc_gateway_b7c2, no_data_flowing_g8b1) were cut off by the 5-minute agent timeout while still in active tool_use. The 4 REGRESSIONs and 13 of the 22 NEEDS_SKILL are concentrated in those two truncated runs. Re-running them with a longer timeout is the obvious next step if reviewers want to clear the marginal pass.

L1+L2+L3 quick-eval baseline (run separately before L4/L5): composite 0.85 (L1=0.72, L2=1.00, L3=0.84).

dustinvannoy-db

Got some things to change and others to review, such as whether GA connector examples should really use PREVIEW rather than CURRENT. Overall looking good.

Also, any links to other skills should match what stable skills are using, which is just the name without a link.

SUGGESTION: Align with the dominant convention. In Related Skills and inline mentions:
REPLACE: **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)**
WITH: **databricks-pipelines**
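If it helps to apply this across the skill files in one pass, the replacement can be sketched as a regular-expression rewrite. This is an illustrative helper (`unlink_skill_names` is a hypothetical name, not an existing repo script), and it assumes every cross-skill link is bold and points at a `SKILL.md` path.

```python
import re

# Matches bold Markdown links whose target ends in SKILL.md, e.g.
# **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)**
SKILL_LINK = re.compile(r"\*\*\[([^\]]+)\]\([^)]*SKILL\.md\)\*\*")


def unlink_skill_names(text: str) -> str:
    """Replace bold skill links with the bold skill name only."""
    return SKILL_LINK.sub(r"**\1**", text)


before = (
    "See **[databricks-pipelines]"
    "(../../skills/databricks-pipelines/SKILL.md)** for pipeline basics."
)
print(unlink_skill_names(before))
# See **databricks-pipelines** for pipeline basics.
```

Links to other targets, such as the `references/*.md` files, are left untouched because the pattern only matches paths ending in `SKILL.md`.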

dustinvannoy-db · 2026-05-28T00:51:50Z

+  icon_small: "./assets/databricks.svg"
+  icon_large: "./assets/databricks.png"
+  brand_color: "#FF3621"
+  default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect."


Suggested change

-  default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect."
+  default_prompt: "Use $databricks-lakeflow-connect to build managed ingestion pipelines into databricks using Lakeflow Connect."

dustinvannoy-db · 2026-05-28T04:17:16Z

+| Source | Type | Auth | GA target |
+|--------|------|------|-----------|
+| NetSuite | SaaS pull | OAuth | May 31, 2026 |
+| Dynamics 365 | SaaS pull | OAuth | May 31, 2026 |
+| PostgreSQL CDC | Database CDC | DB user + gateway | Jun 30, 2026 (tentative); ungated PuPr May 29 |
+| MySQL CDC | Database CDC | DB user + gateway | Jul 15, 2026 (tentative); ungated PuPr May 29 |
+| Oracle / Teradata / SQL Server / PG / MySQL (query-based) | Database query | DB user | Jun 30, 2026 |
+| Snowflake / Redshift / Synapse / BigQuery (Foreign Catalog) | Database query | Foreign Catalog | Jun 30, 2026 |
+| SFTP | File pull | Key / password | Jun 30, 2026 |


Let's remove the GA target column. We have typically avoided specific dates in skills so they don't cause confusion or require constant maintenance.
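Dropping the column by hand is simple for seven rows, but a small helper avoids misaligned cells. The sketch below is a hypothetical utility (`drop_column` is not an existing repo script). It assumes the table lines are raw Markdown without the leading `+` diff marker and that no cell contains a literal `|`.

```python
def drop_column(table_lines, header):
    """Remove the column named `header` from a Markdown pipe table."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_lines
    ]
    idx = rows[0].index(header)  # raises ValueError if the header is absent
    kept = [[c for i, c in enumerate(row) if i != idx] for row in rows]
    return ["| " + " | ".join(row) + " |" for row in kept]


table = [
    "| Source | Type | Auth | GA target |",
    "|--------|------|------|-----------|",
    "| NetSuite | SaaS pull | OAuth | May 31, 2026 |",
    "| SFTP | File pull | Key / password | Jun 30, 2026 |",
]

for line in drop_column(table, "GA target"):
    print(line)
```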

dustinvannoy-db · 2026-05-28T04:24:43Z

+| Issue | Solution |
+|-------|----------|
+| **Pipeline fails with `APPLY_CHANGES_FROM_SNAPSHOT_ERROR.DUPLICATE_KEY_VIOLATION`** | Primary key collision in the source snapshot. Inspect the source for duplicate rows on the declared PK column. |
+| **Watermark not advancing on a SaaS source** | Cursor field misconfigured. Check the connector reference for the supported cursor column per source object. |
+| **Column added in source but missing from target** | Schema evolution may need to be explicitly re-enabled per connector. Check connector docs. |
+| **Gateway requires an instance type unavailable in your region** | Apply a cluster policy override on the gateway pipeline; see [2-database-connectors.md](references/2-database-connectors.md). |
+| **`channel: PREVIEW` warning at pipeline create** | Expected for new connectors. Switch to `channel: CURRENT` once the connector is GA in your region. |
+| **`databricks pipelines create` succeeds but no data flows** | Confirm UC connection is in `READY` state and the destination schema exists. Check the event log for any `pre-flight` failures. |
+| **Ingestion run shows GB ingested >> source row size** | Expected for CDC sources — change log columns + schema metadata add overhead. |
+
+For a deeper troubleshooting reference, see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md).


Can we consolidate this to just point to the troubleshooting reference, which should already have this info?

dustinvannoy-db · 2026-05-28T04:28:07Z

+**Documentation:**
+- [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect)
+- [Connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors)
+- [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect)


Remove. These are covered in references at the bottom of this file.

dustinvannoy-db · 2026-05-28T04:30:30Z

+---
+name: databricks-lakeflow-connect
+description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing."
+---


Add this, related to what is changing in PR #105

parent: databricks-core
compatibility: Requires databricks CLI (>= v0.294.0)
metadata:
  version: "0.1.0"

dustinvannoy-db · 2026-05-28T04:37:57Z

+                 "destination_catalog": "main", "destination_schema": "salesforce_raw"}}
+    ]
+  },
+  "channel": "PREVIEW"