experimental: add databricks-lakeflow-connect skill #103

Draft
jralfonsog wants to merge 4 commits into databricks:main from jralfonsog:experimental/lakeflow-connect
Conversation

@jralfonsog jralfonsog commented May 27, 2026

Summary

New experimental skill databricks-lakeflow-connect for managed ingestion pipelines. GA-first deep coverage; PuPr connectors are listed in SKILL.md as production-supported with deep coverage planned as they stabilize. No databricks-pipelines overlap — Lakeflow Connect pipelines reuse the pipelines API surface via ingestion_definition, and this skill cross-links to skills/databricks-pipelines/ from the decision tree and Related Skills.
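
For reviewers less familiar with the shape, here is a minimal sketch of what "reuse the pipelines API surface via ingestion_definition" could look like as a payload. The values `main`, `salesforce_raw`, `salesforce_to_uc`, and `PREVIEW` come from the examples under review; the field names `connection_name`, `objects`, `table`, `source_schema`, and `source_table`, the connection name `salesforce_conn`, and the helper `build_ingestion_pipeline_spec` are assumptions for illustration and should be checked against the connector reference.

```python
import json


def build_ingestion_pipeline_spec(name, connection_name, tables, catalog, schema,
                                  channel="PREVIEW"):
    """Assemble a pipeline payload whose source is an ingestion_definition.

    Hypothetical helper: field names other than ingestion_definition,
    destination_catalog, destination_schema, and channel are assumptions.
    """
    objects = [
        {
            "table": {
                "source_schema": source_schema,    # assumed field name
                "source_table": source_table,      # assumed field name
                "destination_catalog": catalog,
                "destination_schema": schema,
            }
        }
        for source_schema, source_table in tables
    ]
    return {
        "name": name,
        "ingestion_definition": {
            "connection_name": connection_name,    # assumed: name of the UC connection
            "objects": objects,
        },
        # Whether GA connectors should use PREVIEW or CURRENT is still open in review.
        "channel": channel,
    }


spec = build_ingestion_pipeline_spec(
    name="salesforce_to_uc",
    connection_name="salesforce_conn",             # hypothetical connection name
    tables=[("objects", "Account")],               # hypothetical source object
    catalog="main",
    schema="salesforce_raw",
)
print(json.dumps(spec, indent=2))
```

The point of the sketch is only that the pipeline itself is an ordinary pipelines resource; the ingestion-specific part is confined to the `ingestion_definition` block.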

Changes

  • experimental/databricks-lakeflow-connect/SKILL.md (~200 lines) — routing + 3-tier catalog (GA / PuPr / Beta-PrPr) + workflow + key concepts + common issues.
  • experimental/databricks-lakeflow-connect/references/1-saas-connectors.md (~135 lines) — six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas.
  • experimental/databricks-lakeflow-connect/references/2-database-connectors.md (~145 lines) — SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, gateway-specific gotchas, brief pointer to PuPr database connectors.
  • experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md (~130 lines) — Lakeflow Connect vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches. Cross-links to the Auto Loader work in databricks-solutions/ai-dev-kit#539.
  • experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md (~50 lines) — event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers.
  • experimental/databricks-lakeflow-connect/agents/openai.yaml + assets/databricks.{svg,png} — auto-generated via scripts/skills.py generate.
  • manifest.json — updated by scripts/skills.py generate to register the new skill and its references.

SharePoint / Google Drive (Beta as of May 2026; GA target Jun 1) are not first-class in v1 — they appear in the Beta/PrPr note in SKILL.md. databricks-zerobus-ingest is pointed to from the catalog and decision tree (push-vs-pull dichotomy), not re-covered.

To follow

| Commit | Content |
|--------|---------|
| (v2) | PuPr deep coverage (NetSuite, Dynamics 365, PG/MySQL CDC, SFTP, query-based databases, Foreign Catalog query) as connectors stabilize |
| (v2) | references/3-file-and-streaming-connectors.md, created when SFTP + SharePoint/Drive get deep coverage |

Cross-repo

  • Tracking issue: databricks-solutions/ai-dev-kit#499.
  • Companion ai-dev-kit PR: databricks-solutions/ai-dev-kit#539 ships the Auto Loader reference in the SDP skill that this skill's decision tree cross-links to.
  • Scope-checked in #ai-dev-kit-team Slack on 2026-05-27; maintainers signed off on Databricks Agent Skills experimental/ as the destination.

Test plan

  • python3 scripts/skills.py generate runs clean.
  • python3 scripts/skills.py validate passes (Everything is up to date.).
  • All cross-skill links resolve against the DAS layout (skills/databricks-pipelines/, skills/databricks-dabs/, skills/databricks-jobs/, experimental/databricks-zerobus-ingest/, experimental/databricks-unity-catalog/).
  • stf audit L3 trajectory across commits: 8.2 → 8.3 → 8.5 → 8.7 (all dimensions PASS at final).
  • Full Skillforge pyramid L1 - L5 — composite 0.76, PASS. Per-level: L1=0.72 (36 checks), L2=1.00 (3 checks), L3=0.83 (15 checks), L4=0.68 (40 checks), L5=0.58 (95 checks). Ground truth = 8 cases generated with stf generate -n 8 --difficulty mixed, hand-curated tool-agnostic. See PR comment for L5 classification + per-case breakdown.
  • CI green.

stf audit per-dimension (L3, after all references)

| Dimension | Score | Status |
|-----------|-------|--------|
| tool_accuracy | 10.0 | PASS |
| examples_valid | 10.0 | PASS |
| no_conflicts | 9.0 | PASS |
| llm_navigable | 9.0 | PASS |
| scoped_clearly | 10.0 | PASS |
| security | 9.0 | PASS |
| actionable_instructions | 8.0 | PASS |
| error_handling | 8.0 | PASS |
| no_hallucination_triggers | 7.0 | PASS |
| self_contained | 7.0 | PASS |

Note on self_contained: climbed from the 6.0 baseline as references landed; remaining headroom is the deferred 3-file-and-streaming-connectors.md and PuPr deep coverage.

This pull request was AI-assisted by Isaac.

Initial scope-first commit for a draft PR. GA-first deep coverage,
PuPr listed but deferred to follow-up commits.

This commit includes:
- SKILL.md (routing + 3-tier catalog: GA / PuPr / Beta-PrPr + workflow
  + key concepts + common issues)
- references/4-ingestion-decision-tree.md (LFC vs Auto Loader vs
  Lakehouse Federation vs Delta Sharing vs Zerobus + cost
  considerations + escape hatches)
- agents/openai.yaml + assets/ via scripts/skills.py generate
- manifest.json updated

To follow in subsequent commits:
- references/1-saas-connectors.md (Salesforce, Workday Reports,
  ServiceNow, GA4, HubSpot, Confluence — all GA)
- references/2-database-connectors.md (SQL Server cloud + on-prem +
  gateway pattern intro — GA)
- references/5-troubleshooting-and-monitoring.md (GA-focused)

Public Preview connectors (NetSuite, Dynamics 365, PG/MySQL CDC,
query-based databases, Foreign Catalog query-based, SFTP) are
production-supported and listed in SKILL.md; deep coverage will be
added incrementally as PuPr connectors stabilize. SharePoint/Google
Drive (Beta currently, GA Jun 1 target) and other Beta/PrPr connectors
are not first-class in this skill.

references/3-file-and-streaming-connectors.md will be created when SFTP
+ SharePoint/Drive get deep coverage (post-v1).

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>

@dustinvannoy-db
Collaborator

Talked with @jralfonsog and he will add in the GA connector references he has been working on as part of this PR before we review and finalize.

Deep coverage for the six GA SaaS connectors (Salesforce, Workday Reports,
ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection
+ pipeline + schedule pattern, per-connector auth and limits, DAB stub, and
common gotchas.

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>

Deep coverage for SQL Server (cloud and on-prem): the gateway pattern,
change tracking vs CDC, DAB stub with both gateway and ingestion
pipelines, on-prem private networking, and gateway-specific gotchas.
Brief pointer to Public Preview database connectors (Postgres/MySQL CDC,
query-based, Foreign Catalog) pending deep coverage as they stabilize.

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>

Event log queries (SaaS and database pipelines), nine common
error / expected-behavior rows with resolutions, and escalation
pointers (public docs hub, connector reference, workspace support).

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>

@jralfonsog
Author

Skillforge full evaluation (L1 - L5) — composite 0.76, PASS

After all four content commits landed, ran the full Skillforge pyramid via the ~/voodoo/skillforge/SKILL.md orchestrator pattern. Ground truth was bootstrapped with stf generate -n 8 --difficulty mixed, then hand-curated tool-agnostic.

Pyramid summary

| Level | Score | Checks | Status |
|-------|-------|--------|--------|
| L1 (unit / built-in) | 0.72 | 36 | PASS |
| L2 (integration / connectivity) | 1.00 | 3 | PASS |
| L3 (static / LLM judge) | 0.83 | 15 | PASS |
| L4 (thinking) | 0.68 | 40 | PASS |
| L5 (output WITH vs WITHOUT) | 0.58 | 95 | PASS |
| Composite | 0.76 | | PASS |
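
As a sanity check on the composite, the reported 0.76 is consistent with an unweighted mean of the five per-level scores. That Skillforge actually computes the composite this way is an assumption; the sketch below only shows that the numbers line up.

```python
from statistics import mean

# Per-level scores as reported in the pyramid summary.
level_scores = {"L1": 0.72, "L2": 1.00, "L3": 0.83, "L4": 0.68, "L5": 0.58}

# Assumption: the composite is an unweighted mean of the levels.
composite = mean(list(level_scores.values()))
print(f"composite = {composite:.2f}")
```

If the tool weights levels by check count instead, the result would differ, so treat this as a plausibility check and not as documentation of the scoring formula.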

L3 audit trajectory across commits

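| After commit | Overall | self_contained | actionable | scoped_clearly | security |
|--------------|---------|----------------|------------|----------------|----------|
| 1 (SKILL.md + decision tree) | 8.2 | 6.0 | 7.0 | 9.0 | 8.0 |
| 2 (+ SaaS connectors ref) | 8.3 | 7.0 | 7.0 | 9.0 | 8.0 |
| 3 (+ database connectors ref) | 8.5 | 7.0 | 7.0 | 9.0 | 9.0 |
| 4 (+ troubleshooting ref) | 8.7 | 7.0 | 8.0 | 10.0 | 9.0 |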

L5 classification (95 checks across 8 test cases)

| Classification | Count | % |
|----------------|-------|---|
| POSITIVE (skill taught the agent something useful) | 34 | 36% |
| NEUTRAL (agent already knew it; skill not needed here) | 33 | 35% |
| NEEDS_SKILL (both WITH and WITHOUT missed; coverage gap) | 22 | 23% |
| REGRESSION (skill made the agent worse) | 4 | 4% |
| UNTAGGED | 2 | 2% |
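
The percentages above can be reproduced from the raw counts. A short sketch, assuming each share is the count divided by the 95 total checks and rounded to the nearest whole percent:

```python
# Counts per L5 classification, as reported above.
counts = {
    "POSITIVE": 34,
    "NEUTRAL": 33,
    "NEEDS_SKILL": 22,
    "REGRESSION": 4,
    "UNTAGGED": 2,
}

total = sum(counts.values())

# Share of all checks per classification, rounded to whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:<12} {counts[label]:>3} {pct:>3}%")
```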

Per-case L5 response score

| Case ID | Difficulty | Score | Notes |
|---------|------------|-------|-------|
| saas_oauth_u2m_d9e3 | intermediate | 0.83 | OAuth U2M cannot be automated in DAB |
| dab_authoring_h6c5 | hard | 0.72 | DAB conversion for Salesforce pipeline |
| federation_vs_connect_e5f7 | intermediate | 0.68 | Snowflake: Federation vs Connect |
| too_many_tables_f2a4 | intermediate | 0.65 | 400-table partition workaround |
| salesforce_basic_a3f1 | easy | 0.63 | Salesforce pipeline create |
| continuous_mode_error_c4d8 | easy | 0.63 | Triggered-only constraint |
| sqlserver_cdc_gateway_b7c2 | hard | 0.20 | Agent run cut by --timeout 300 mid-tool-use |
| no_data_flowing_g8b1 | intermediate | 0.02 | Agent run cut by --timeout 300 mid-tool-use |

The two lowest-scoring cases (sqlserver_cdc_gateway_b7c2, no_data_flowing_g8b1) were cut off by the 5-minute agent timeout while still in active tool_use. The 4 REGRESSIONs and 13 of the 22 NEEDS_SKILL are concentrated in those two truncated runs. Re-running them with a longer timeout is the obvious next step if reviewers want to clear the marginal pass.

L1+L2+L3 quick-eval baseline (run separately before L4/L5): composite 0.85 (L1=0.72, L2=1.00, L3=0.84).

Collaborator

@dustinvannoy-db dustinvannoy-db left a comment

There are some things to change and others to review, such as whether the GA connector examples should really use PREVIEW rather than CURRENT. Overall this is looking good.

Also, any links to other skills should match what stable skills are using, which is just the name without a link.

SUGGESTION: Align with the dominant convention. In Related Skills and inline mentions:
REPLACE: **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)**
WITH: **databricks-pipelines**
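
If it helps to apply this across the skill in one pass, the replacement can be sketched as a regex rewrite. The pattern and the helper name `unlink_skill_names` are illustrative assumptions; the sketch only handles bold links whose target ends in `SKILL.md`, as in the example above.

```python
import re

# Matches **[skill-name](any/path/SKILL.md)** and captures the skill name.
LINKED_SKILL = re.compile(r"\*\*\[([a-z0-9-]+)\]\([^)]*SKILL\.md\)\*\*")


def unlink_skill_names(markdown: str) -> str:
    """Replace bold skill links with the bold skill name only."""
    return LINKED_SKILL.sub(r"**\1**", markdown)


before = "See **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)** for SDP."
after = unlink_skill_names(before)
print(after)
```

Links to reference files such as `references/2-database-connectors.md` are left alone, since the pattern requires a `SKILL.md` target.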

icon_small: "./assets/databricks.svg"
icon_large: "./assets/databricks.png"
brand_color: "#FF3621"
default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect."
Collaborator

Suggested change
- default_prompt: "Use $databricks-lakeflow-connect for build managed ingestion pipelines into databricks using lakeflow connect."
+ default_prompt: "Use $databricks-lakeflow-connect to build managed ingestion pipelines into databricks using Lakeflow Connect."

Comment on lines +62 to +70
| Source | Type | Auth | GA target |
|--------|------|------|-----------|
| NetSuite | SaaS pull | OAuth | May 31, 2026 |
| Dynamics 365 | SaaS pull | OAuth | May 31, 2026 |
| PostgreSQL CDC | Database CDC | DB user + gateway | Jun 30, 2026 (tentative); ungated PuPr May 29 |
| MySQL CDC | Database CDC | DB user + gateway | Jul 15, 2026 (tentative); ungated PuPr May 29 |
| Oracle / Teradata / SQL Server / PG / MySQL (query-based) | Database query | DB user | Jun 30, 2026 |
| Snowflake / Redshift / Synapse / BigQuery (Foreign Catalog) | Database query | Foreign Catalog | Jun 30, 2026 |
| SFTP | File pull | Key / password | Jun 30, 2026 |
Collaborator

Let's remove the GA target column; we have typically avoided specific dates in skills so they don't cause confusion or require constant maintenance.
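
Dropping the column by hand is easy enough for seven rows, but for reference, a small sketch of doing it programmatically. The helper `drop_column` is hypothetical and assumes a simple pipe table where no cell contains a literal `|`.

```python
def drop_column(table_lines, header):
    """Return the markdown table rows with the named column removed."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_lines
    ]
    idx = rows[0].index(header)  # position of the column to drop
    kept = [row[:idx] + row[idx + 1:] for row in rows]
    return ["| " + " | ".join(row) + " |" for row in kept]


table = [
    "| Source | Type | Auth | GA target |",
    "|--------|------|------|-----------|",
    "| NetSuite | SaaS pull | OAuth | May 31, 2026 |",
    "| SFTP | File pull | Key / password | Jun 30, 2026 |",
]
trimmed = drop_column(table, "GA target")
print("\n".join(trimmed))
```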

Comment on lines +171 to +181
| Issue | Solution |
|-------|----------|
| **Pipeline fails with `APPLY_CHANGES_FROM_SNAPSHOT_ERROR.DUPLICATE_KEY_VIOLATION`** | Primary key collision in the source snapshot. Inspect the source for duplicate rows on the declared PK column. |
| **Watermark not advancing on a SaaS source** | Cursor field misconfigured. Check the connector reference for the supported cursor column per source object. |
| **Column added in source but missing from target** | Schema evolution may need to be explicitly re-enabled per connector. Check connector docs. |
| **Gateway requires an instance type unavailable in your region** | Apply a cluster policy override on the gateway pipeline; see [2-database-connectors.md](references/2-database-connectors.md). |
| **`channel: PREVIEW` warning at pipeline create** | Expected for new connectors. Switch to `channel: CURRENT` once the connector is GA in your region. |
| **`databricks pipelines create` succeeds but no data flows** | Confirm UC connection is in `READY` state and the destination schema exists. Check the event log for any `pre-flight` failures. |
| **Ingestion run shows GB ingested >> source row size** | Expected for CDC sources — change log columns + schema metadata add overhead. |

For a deeper troubleshooting reference, see [5-troubleshooting-and-monitoring.md](references/5-troubleshooting-and-monitoring.md).
Collaborator

Can we consolidate this to just point to the troubleshooting reference, which should already have this info?

Comment on lines +12 to +15
**Documentation:**
- [Lakeflow Connect overview](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect)
- [Connector reference](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/connectors)
- [Pricing](https://www.databricks.com/product/pricing/lakeflow-connect)
Collaborator

Remove. These are covered in references at the bottom of this file.

---
name: databricks-lakeflow-connect
description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing."
---
Collaborator

Add this, related to what is changing in PR #105

parent: databricks-core
compatibility: Requires databricks CLI (>= v0.294.0)
metadata:
  version: "0.1.0"
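
For anyone who wants to check the `>= v0.294.0` requirement locally, a minimal sketch of the comparison. The helpers `parse_version` and `meets_minimum` are hypothetical, and the sketch assumes plain `MAJOR.MINOR.PATCH` strings with an optional leading `v` (no pre-release suffixes).

```python
def parse_version(text):
    """Turn 'v0.294.0' or '0.294.0' into a tuple of ints: (0, 294, 0)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(installed, minimum="v0.294.0"):
    """True when the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("v0.294.0"))
print(meets_minimum("0.293.9"))
print(meets_minimum("0.1000.0"))
```

Comparing integer tuples avoids the trap of comparing version strings lexically, where `"0.1000.0"` would sort before `"0.294.0"`.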

"destination_catalog": "main", "destination_schema": "salesforce_raw"}}
]
},
"channel": "PREVIEW"
Collaborator

Is PREVIEW correct, not CURRENT?

"destination_catalog": "main", "destination_schema": "salesforce_raw"}}
]
},
"channel": "PREVIEW"
Collaborator

Is PREVIEW correct, not CURRENT?

pipelines:
  salesforce_ingestion:
    name: salesforce_to_uc
    channel: PREVIEW
Collaborator

Is PREVIEW correct, not CURRENT?

pipelines:
  sqlserver_gateway:
    name: sqlserver_gateway
    channel: PREVIEW
Collaborator

Is PREVIEW correct, not CURRENT?


  sqlserver_ingestion:
    name: sqlserver_to_uc
    channel: PREVIEW
Collaborator

Is PREVIEW correct, not CURRENT?
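
Since the same question applies to every example, a quick way to list all the places to revisit is a line scan over the skill files. This is a sketch only: the regex and the helper `find_preview_channels` are assumptions, and it covers both the YAML form (`channel: PREVIEW`) and the JSON form (`"channel": "PREVIEW"`) shown in the hunks above.

```python
import re

# Matches channel: PREVIEW in YAML and "channel": "PREVIEW" in JSON.
CHANNEL_PREVIEW = re.compile(r'"?channel"?\s*:\s*"?PREVIEW"?')


def find_preview_channels(text):
    """Return 1-based line numbers where the channel is set to PREVIEW."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if CHANNEL_PREVIEW.search(line)
    ]


sample = """pipelines:
  salesforce_ingestion:
    name: salesforce_to_uc
    channel: PREVIEW
  other_pipeline:
    channel: CURRENT
"""
print(find_preview_channels(sample))
```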

@auschoi96

@jralfonsog can you attach the report.html that's generated as well?
