skills: split into databricks-model-serving (ops) + databricks-ml-training (experimental) by QuentinAmbard · Pull Request #110 · databricks/databricks-agent-skills

QuentinAmbard · 2026-05-28T08:36:51Z

Stacks on top of #84. Merge #84 first; this PR includes those commits at the base and adds the new skill on top.

Why this PR exists

PR #84 lands the model-serving content (endpoint create, query, update, traffic config, AI Gateway, Foundation Model API discovery) into databricks-model-serving. That's the right shape for a serving-ops skill, and it's what reviewers should expect a skill called "model serving" to contain.

The remaining a-d-k content — training a model with MLflow autolog, registering it to Unity Catalog, promoting versions via @prod aliases, custom PyFunc authoring, hand-rolled ResponsesAgent code — is a different lifecycle. It runs before an endpoint exists, often in a notebook submitted as a serverless job, and an agent asked to "train an XGBoost model and deploy it" needs both concerns surfaced cleanly rather than blended into one skill description.

This PR lands the dev-side content as a separate databricks-ml-training experimental skill, and weaves a few small but high-leverage serving-side fixes from the original a-d-k content into databricks-model-serving where they belong.

What this PR improves

A focused dev-side skill. New experimental/databricks-ml-training/ owns the training → register → consume narrative: MLflow autolog with Optuna for hyperparameter tuning, mlflow.set_registry_uri("databricks-uc") + experiment-parent-folder pre-creation, alias-based promotion (@prod / @challenger), batch scoring via mlflow.pyfunc.spark_udf, custom PyFunc with the file-based "Models from Code" pattern, hand-rolled ResponsesAgent with LangGraph + UC Function + Vector Search tools, and the databricks jobs submit --no-wait train-and-deploy pattern.
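The alias-based promotion step can be sketched roughly as below. This is a minimal illustration, not the skill's own code: the `promote` and `alias_uri` helpers and the model name are hypothetical, and the client is passed in so the sketch runs without a workspace. With a real registry you would pass `mlflow.MlflowClient()` after calling `mlflow.set_registry_uri("databricks-uc")`.

```python
def alias_uri(model_name: str, alias: str) -> str:
    """Build a models:/ URI that resolves through a registry alias."""
    return f"models:/{model_name}@{alias}"


def promote(client, model_name: str, version: str, alias: str = "prod") -> str:
    """Point `alias` at `version` and return the URI consumers should load.

    `client` is assumed to expose MlflowClient.set_registered_model_alias
    (name, alias, version).
    """
    client.set_registered_model_alias(model_name, alias, version)
    return alias_uri(model_name, alias)


# Assumed usage against Unity Catalog (not executed here):
#   import mlflow
#   mlflow.set_registry_uri("databricks-uc")
#   uri = promote(mlflow.MlflowClient(), "main.ml.churn_model", "3")
#   model = mlflow.pyfunc.load_model(uri)
```

Consumers that load `models:/<name>@prod` pick up the new version once the alias is repointed, without having to change the URI they use.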

Frontmatter triggers that actually triage. Each skill's description lists what it IS for and what it explicitly is NOT for, with cross-pointers (databricks-ml-training says "use databricks-model-serving for endpoint ops"; databricks-model-serving says "use databricks-ml-training for training and PyFunc authoring"). When the user says "train a model and deploy it," the orchestrator pulls both skills exactly once each.

Cross-skill links that resolve. Every databricks-model-serving → databricks-ml-training link and every reverse link uses the right relative path for the stable (skills/) ↔ experimental (experimental/) layout. No broken anchors, no stale paths to the old training-and-serving.md filename anywhere.

Five small but high-leverage gaps closed in databricks-model-serving. The original a-d-k port left a few non-obvious serving behaviors implicit. Each fix is woven into the existing section that already covers the topic — no new mega-sections, no duplication of MLflow boilerplate an LLM already knows from training data. The result: serving-side behavior an agent would otherwise have to guess at is now explicit and signposted.

Summary of changes

| Area | What changed |
| --- | --- |
| New experimental skill | `experimental/databricks-ml-training/` with SKILL.md + agents/ + assets/ + references/{custom-pyfunc.md, genai-agents.md}. Owns the full dev-side narrative (autolog + Optuna, UC registration, alias promotion, batch scoring, custom PyFunc, custom ResponsesAgent, train-and-deploy serverless job pattern). |
| Frontmatter scoping | Both skills' descriptions list scope + explicit NOT-for callouts pointing at their sibling. Triggers triage cleanly when the user mentions both training and deployment. |
| Cross-links | All cross-skill paths fixed for the stable ↔ experimental layout. Foundation Model API discovery moved into `databricks-model-serving/SKILL.md` inline (was previously linked into the relocated training file). |
| Serving gaps closed | Five small fixes in `databricks-model-serving/SKILL.md`: MLflow Deployments client gotchas (`tags=` top-level, `served_model_name` derivation), zero-downtime version-swap pattern (alias-repoint AND `update_endpoint`), two-state-field readiness rationale (`state.ready` lies during version-swap), classical-ML `dataframe_records` query example, Serving-UI "Owned by me" SP-filter troubleshooting row. Each merged into the existing section that already covered the topic. |
| DAS-only content preserved | PR #84's idempotency check before agent deploy, AppKit integration section, off-platform AI SDK v6 streaming, endpoint-structure ASCII diagram, OpenAPI schema section: all kept. |
| Versioning | `databricks-model-serving` bumped to 0.4.0 (description retightened, gaps closed). New `databricks-ml-training` at 0.1.0 under `experimental/`. Manifest regenerated; 27 skills total. |
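The two-state-field readiness point in the "Serving gaps closed" row can be illustrated with a small polling sketch. It assumes the endpoint state is a dict with `ready` and `config_update` keys whose settled values are `"READY"` and `"NOT_UPDATING"`; the function names and the `get_state` callable are hypothetical stand-ins for whatever fetches the endpoint.

```python
import time
from typing import Callable


def endpoint_is_settled(state: dict) -> bool:
    """True only when the endpoint is ready AND no config update is pending.

    Checking `ready` alone is not enough: during a version swap the old
    version keeps serving, so `ready` can stay "READY" while the new
    config is still rolling out.
    """
    return (
        state.get("ready") == "READY"
        and state.get("config_update") == "NOT_UPDATING"
    )


def wait_until_settled(
    get_state: Callable[[], dict],
    timeout_s: float = 1200.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `get_state` until both fields settle; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if endpoint_is_settled(get_state()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The sketch treats a failed update like any other unsettled state and simply times out; a caller that needs to fail fast would inspect `config_update` for a failure value as well.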

Reviewer aid

The split is on the natural seam — anything that runs before mlflow.deployments.get_deploy_client(...).create_endpoint(...) is dev-side and lives in databricks-ml-training, anything from create_endpoint onward is ops-side and lives in databricks-model-serving. The Python create_endpoint(...) / update_endpoint(...) call itself is canonically a serving operation and is documented there with the two non-obvious gotchas.
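For orientation, the call that marks the seam might look like the sketch below. It assumes the `served_entities` config shape (`entity_name`, `entity_version`, `workload_size`, `scale_to_zero_enabled`); the helper names, endpoint name, and model name are hypothetical, and the two gotchas the skill documents are deliberately not reproduced here. The client is injected so the sketch runs without a workspace; in practice it would come from `mlflow.deployments.get_deploy_client("databricks")`.

```python
def served_entity(
    model_name: str,
    version: str,
    workload_size: str = "Small",
    scale_to_zero: bool = True,
) -> dict:
    """Describe one registered model version to serve."""
    return {
        "entity_name": model_name,
        "entity_version": str(version),
        "workload_size": workload_size,
        "scale_to_zero_enabled": scale_to_zero,
    }


def create_serving_endpoint(client, endpoint_name: str, model_name: str, version: str):
    """Everything before this call is dev-side; this call is the ops-side start."""
    config = {"served_entities": [served_entity(model_name, version)]}
    return client.create_endpoint(name=endpoint_name, config=config)


# Assumed usage (not executed here):
#   from mlflow.deployments import get_deploy_client
#   create_serving_endpoint(get_deploy_client("databricks"),
#                           "churn-endpoint", "main.ml.churn_model", "3")
```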

Validation: python3 scripts/skills.py validate passes; zero broken links across both touched skills.

This pull request and its description were written by Isaac.

Phase 1 of databricks#73's TODO #1b. Adds references/fm-api-endpoints.md with the curated Foundation Model API endpoint table (chat/instruct + embedding models) from databricks-solutions/ai-dev-kit's model-serving skill, plus common defaults and query examples (CLI + SDK).

Stripped: the cloud/language prefix on the docs link, and the leftover MCP-tool references in the source. The endpoint table itself is static catalog data with no MCP coupling.

SKILL.md updates:
- bump version to 0.2.0
- point Endpoint Types table at the new reference
- point the Foundation Model discovery bullet at the new reference

Subsequent phases (separate PRs / commits) port the remaining dev-side content: classical-ml autolog patterns, Custom PyFunc signatures, ResponsesAgent with the create_text_output_item gotcha, UCFunctionToolkit + VectorSearchRetrieverTool resource passthrough.

Co-authored-by: Isaac

Aligns the verbatim a-d-k port with the live docs.databricks.com supported-models page (validated via WebFetch on 2026-05-26):

ADDED (missing from a-d-k snapshot):
- databricks-claude-opus-4-7 (now most capable Claude)
- databricks-gpt-5-5-pro, 5-5
- databricks-gpt-5-4, 5-4-mini, 5-4-nano
- databricks-gpt-5-3-codex, 5-2-codex
- databricks-gemini-3-1-flash-lite, 3-5-flash
- databricks-qwen35-122b-a10b (Preview)

REMOVED (retired, no longer in docs):
- databricks-claude-3-7-sonnet
- databricks-meta-llama-3-1-405b-instruct

UPDATED notes:
- claude-opus-4-6 no longer "Most capable"
- gpt-5-2 no longer "Latest"
- gpt-5-1-codex-{max,mini} + gpt-5-2-codex marked retiring 2026-07-16
- gemini-3-pro marked retired 2026-03-26 with redirect through 2026-06-07
- Several Gemini / Codex endpoints annotated with cross-geo requirement
- qwen3-next-80b annotated as Preview

OPENING PARAGRAPH:
- "available in every workspace" -> "available in supported Model Serving regions"; calls out cross-geo requirement for several endpoints

NOT TOUCHED (out of scope: not docs-validatable from supported-models page):
- served_entities[].entity_name guidance (line 3 second half)
- SKILL.md "system.ai.* catalog" claim on the pay-per-token row

These remain as in the a-d-k snapshot and should be revisited if/when docs cover them directly.

Test plan: `scripts/skills.py validate` -> "Everything is up to date"; `scripts/skills.py generate` -> only refreshes manifest.json timestamps.

Co-authored-by: Isaac

…ot static catalog

Quentin pointed out (PR databricks#84) that the prior two commits actually ported from `main:databricks-skills/databricks-model-serving/`, not `experimental:databricks-skills/databricks-ml-training-serving/` as the PR description claimed. The two skills take opposite approaches:
- `main` ships a static catalog table of FM API endpoint names.
- `experimental` deliberately rejects that ("a static skill list goes stale fast; always list at runtime instead of hard-coding names") and ships a `databricks serving-endpoints list | jq ...` one-liner plus runtime-resolved defaults (highest-numbered Claude Sonnet for agents, highest-numbered `-codex-max` for code).

Re-port to match the experimental philosophy:
- `references/fm-api-endpoints.md`: replace the static catalog with the runtime-list snippet (filtered by `databricks-` name prefix AND `system.ai.*` served entity, to exclude non-FM endpoints sharing the prefix), runtime-resolved family defaults, and CLI + SDK query examples that use a placeholder endpoint name rather than a hard-coded model.
- `SKILL.md`: update the Endpoint Types row + the Foundation-Model discovery bullet to reframe the reference as "discover at runtime" rather than "curated table".

Version stays at 0.2.0 (frontmatter unchanged → manifest unchanged).

The 2026-05-26 catalog refresh in the previous commit is dropped here: the experimental skill's point is that no static table is the right shape, so curating one against docs.databricks.com isn't useful for the stable skill either.

Co-authored-by: Isaac
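The runtime filter and family-default rule described in this commit can be sketched in Python over the JSON that the list command returns. This is an illustration under assumptions, not the skill's `jq` one-liner: it assumes each endpoint is a dict with a `name` and a `config.served_entities[].entity_name`, the helper names are hypothetical, and "highest-numbered" is approximated by comparing the digit groups in the endpoint name.

```python
import re
from typing import Optional


def is_fm_endpoint(ep: dict) -> bool:
    """Keep endpoints with the databricks- prefix AND a system.ai.* entity."""
    if not ep.get("name", "").startswith("databricks-"):
        return False
    entities = (ep.get("config") or {}).get("served_entities") or []
    return any(
        str(e.get("entity_name", "")).startswith("system.ai.") for e in entities
    )


def version_key(name: str) -> tuple:
    """Naive ordering key: every digit group in the name, as integers."""
    return tuple(int(n) for n in re.findall(r"\d+", name))


def pick_default(endpoints: list, family: str) -> Optional[str]:
    """Highest-numbered FM endpoint whose name contains `family`, else None."""
    names = [
        ep["name"]
        for ep in endpoints
        if is_fm_endpoint(ep) and family in ep["name"]
    ]
    return max(names, key=version_key, default=None)
```

The digit-group ordering is a simplification; names that embed parameter counts or other numbers would need a stricter parser.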

…ental port

Previous commit (c148500) restated the experimental section in my own words and added a "Querying" section + provisioned-throughput aside + docs-link gloss that aren't in the upstream skill. The PR's stated goal is to port from experimental: do an actual port, not a paraphrase.

`references/fm-api-endpoints.md` now mirrors the `## Foundation Model API endpoints` section of `experimental:databricks-ml-training-serving/SKILL.md` verbatim (heading promoted from `##` to `#` since this is a standalone file): intro paragraph + the `databricks serving-endpoints list | jq ...` one-liner + the family-based default-picking rule. Nothing else.

Also trim the SKILL.md discovery bullet back toward its original shape: link to the reference file for the runtime-list snippet, then the same `system.ai` / `serving-endpoints list` / `get-open-api` alternatives that were already there.

Co-authored-by: Isaac

…ntal

Expands the port from the FM-endpoints-only scope to cover every section of `experimental:databricks-ml-training-serving/`. Mirrors the experimental skill's 3-file structure 1:1 into stable's `references/` directory; the standalone fm-api-endpoints.md added in earlier commits goes away (its content lives inline in training-and-serving.md exactly as it does in experimental's SKILL.md).

Added (all verbatim ports, mechanical adjustments only):

references/training-and-serving.md
Ports experimental SKILL.md content. Mechanical changes only: frontmatter stripped (destination is a reference file, not a SKILL.md); `1-custom-pyfunc.md` → `custom-pyfunc.md`, `2-genai-agents.md` → `genai-agents.md` (filename renames); `../<skill>/SKILL.md` → `../../<skill>/SKILL.md` (one more level of nesting since this file is in references/ rather than at the skill root).
Content covers: canonical train/register/serve flow, `mlflow.{sklearn,xgboost,…}.autolog()` patterns, UC alias-based promotion, batch scoring via `spark_udf`, real-time endpoint create + zero-downtime version swap, `state.ready` vs `state.config_update` poll-both gotcha, `jobs submit --no-wait` serverless deploy pattern, Foundation Model API endpoints runtime-list, and the full gotchas trap-table.

references/custom-pyfunc.md
Ports experimental 1-custom-pyfunc.md verbatim. Mechanical change: `[SKILL.md]` → `[training-and-serving.md]` where the original cross-referenced its parent SKILL.md.
Content: file-based PyFunc ("Models from Code"), `infer_signature`, `code_paths`, pre-deploy validation via `mlflow.models.predict(env_manager="uv")`.

references/genai-agents.md
Ports experimental 2-genai-agents.md verbatim. Mechanical changes: cross-skill paths bumped one level deeper; `[SKILL.md]` → `[training-and-serving.md]`.
Content covers: `ResponsesAgent` interface, LangGraph agent with `UCFunctionToolkit` + `VectorSearchRetrieverTool`, the `create_text_output_item` raw-dict-silently-fails gotcha, the `resources=[...]` passthrough-auth list (DatabricksServingEndpoint, DatabricksFunction, DatabricksVectorSearchIndex, DatabricksLakebase), async deploy via `agents.deploy()` from a serverless job, query via CLI and OpenAI-compatible client.

Removed:

references/fm-api-endpoints.md
Standalone file from earlier commits; its content lives inline in training-and-serving.md exactly as it does in experimental's SKILL.md, so the deliberate split is no longer needed.

Stable SKILL.md updates (minimal, ops-focus preserved):
- FM-endpoint link targets updated from `references/fm-api-endpoints.md` to `references/training-and-serving.md#foundation-model-api-endpoints` in the Endpoint Types table row and the FM-discovery bullet.
- New `### Develop & deploy new models` subsection under "What's Next" with a 3-row table pointing at the new dev-side references, framed as "this skill is ops-focused; for the dev-side flow, see below".

Manifest regenerated.

Co-authored-by: Isaac

- The mechanical `../` → `../../` rewrite in the verbatim port assumed every peer skill is stable, but 4 of them live in `experimental/`. `../../<skill>/SKILL.md` resolved to `skills/<skill>/SKILL.md` which does not exist for `databricks-agent-bricks`, `databricks-mlflow-evaluation`, `databricks-vector-search`, `databricks-unity-catalog`. Repointed to `../../../experimental/<skill>/SKILL.md`. `databricks-jobs` link unchanged (it's stable).
- SKILL.md frontmatter `description` only described the ops surface, so agents wouldn't route dev-side asks (train, register, PyFunc, ResponsesAgent) to this skill. Broadened to cover both ops and the new dev surface.
- Version bumped 0.2.0 → 0.3.0 + manifest regenerated.

Co-authored-by: Isaac
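The path arithmetic behind the first fix can be checked with a few lines of `posixpath`. The sketch assumes the repository root holds `skills/` and `experimental/` side by side and that the linking file sits in `skills/databricks-model-serving/references/`; the `resolve_link` helper and the specific source file are illustrative only.

```python
import posixpath


def resolve_link(source_file: str, link: str) -> str:
    """Resolve a relative markdown link against the file that contains it."""
    base = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(base, link))


# Hypothetical source file inside the stable skill's references/ directory.
SOURCE = "skills/databricks-model-serving/references/genai-agents.md"

# Two levels up only reaches skills/, so an experimental peer is missed.
broken = resolve_link(SOURCE, "../../databricks-vector-search/SKILL.md")

# Three levels up reaches the repo root, then descends into experimental/.
fixed = resolve_link(SOURCE, "../../../experimental/databricks-vector-search/SKILL.md")

print(broken)  # skills/databricks-vector-search/SKILL.md
print(fixed)   # experimental/databricks-vector-search/SKILL.md
```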

…-phase1 # Conflicts: # manifest.json

@simonfaltum

Per @simonfaltum review: before resubmitting a deploy serverless job, agents should check whether a run is already in flight (active job runs filtered on run_name) or whether the target endpoint already exists in the right state. Avoids wasting ~15 min of serverless and racing for the same endpoint name. Co-authored-by: Isaac
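The pre-submit check described here might be sketched as a pure decision function. The input shapes are assumptions: `active_runs` is a list of dicts carrying `run_id` and `run_name`, and `endpoint` is either `None` or a dict with a `state` holding `ready` and `config_update`. The function name is hypothetical, and "right state" is simplified to ready with no pending update.

```python
from typing import Optional


def should_submit_deploy(
    run_name: str,
    active_runs: list,
    endpoint: Optional[dict],
) -> tuple:
    """Return (submit?, reason) for a deploy job keyed by `run_name`."""
    # 1. An active run with the same name means a deploy is already in flight.
    for run in active_runs:
        if run.get("run_name") == run_name:
            return False, f"run {run.get('run_id')} is already in flight"

    # 2. An endpoint that is ready with no pending update needs no redeploy.
    if endpoint is not None:
        state = endpoint.get("state", {})
        if (
            state.get("ready") == "READY"
            and state.get("config_update") == "NOT_UPDATING"
        ):
            return False, "endpoint already exists and is ready"

    return True, "no active run and no ready endpoint"
```

A fuller check would also compare the served model version against the one being deployed, which this sketch leaves out.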

…icks-ml-training

Splits the post-port databricks-model-serving skill into two skills with clean responsibility boundaries: databricks-model-serving keeps the endpoint lifecycle / ops surface, and a new experimental databricks-ml-training owns the dev-side training, MLflow tracking, UC registration, custom PyFunc, and hand-rolled ResponsesAgent content.

Also closes five small gaps in databricks-model-serving where non-obvious serving behavior from the original a-d-k port had fallen through the cracks (Python deployments client gotchas, zero-downtime version swap, two-field readiness rationale, classical-ML query shape, Serving-UI SP filter).

Co-authored-by: Isaac
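As an illustration of the classical-ML query shape mentioned above, the sketch below wraps feature rows in a `dataframe_records` payload and sends them through a deployments client's `predict`. The helper name, endpoint name, and feature columns are hypothetical, and reading the result from a `predictions` key is an assumption about the response. The client is injected; in practice it would come from `mlflow.deployments.get_deploy_client("databricks")`.

```python
def score_rows(client, endpoint: str, rows: list) -> list:
    """Score a list of feature dicts (one dict per row) against an endpoint."""
    # dataframe_records: each record is one row, keyed by column name.
    payload = {"dataframe_records": rows}
    response = client.predict(endpoint=endpoint, inputs=payload)
    return response["predictions"]


# Assumed usage (not executed here):
#   from mlflow.deployments import get_deploy_client
#   preds = score_rows(
#       get_deploy_client("databricks"),
#       "churn-endpoint",
#       [{"tenure_months": 12, "plan": "basic"}],
#   )
```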

jamesbroadhead and others added 9 commits May 26, 2026 09:47

Merge remote-tracking branch 'origin/main' into jb/model-serving-port…

15d7b4c

…-phase1 # Conflicts: # manifest.json

QuentinAmbard requested review from a team, dustinvannoy-db, lennartkats-db and simonfaltum as code owners May 28, 2026 08:36

QuentinAmbard mentioned this pull request May 28, 2026

skills(model-serving): merge dev-side training/agent flows from a-d-k experimental #84

Open

6 tasks

QuentinAmbard wants to merge 9 commits into databricks:main from QuentinAmbard:skills/databricks-ml-training-split
