skills: split into databricks-model-serving (ops) + databricks-ml-training (experimental) #110
Open
QuentinAmbard wants to merge 9 commits into
Phase 1 of databricks#73's TODO #1b. Adds references/fm-api-endpoints.md with the curated Foundation Model API endpoint table (chat/instruct + embedding models) from databricks-solutions/ai-dev-kit's model-serving skill, plus common defaults and query examples (CLI + SDK).
Stripped: the cloud/language prefix on the docs link, and the leftover MCP-tool references in the source. The endpoint table itself is static catalog data, with no MCP coupling.
SKILL.md updates:
- bump version to 0.2.0
- point Endpoint Types table at the new reference
- point the Foundation Model discovery bullet at the new reference
Subsequent phases (separate PRs / commits) port the remaining dev-side content: classical-ml autolog patterns, Custom PyFunc signatures, ResponsesAgent with the create_text_output_item gotcha, UCFunctionToolkit + VectorSearchRetrieverTool resource passthrough.
Co-authored-by: Isaac
Aligns the verbatim a-d-k port with the live docs.databricks.com
supported-models page (validated via WebFetch on 2026-05-26):
ADDED (missing from a-d-k snapshot):
- databricks-claude-opus-4-7 (now most capable Claude)
- databricks-gpt-5-5-pro, 5-5
- databricks-gpt-5-4, 5-4-mini, 5-4-nano
- databricks-gpt-5-3-codex, 5-2-codex
- databricks-gemini-3-1-flash-lite, 3-5-flash
- databricks-qwen35-122b-a10b (Preview)
REMOVED (retired, no longer in docs):
- databricks-claude-3-7-sonnet
- databricks-meta-llama-3-1-405b-instruct
UPDATED notes:
- claude-opus-4-6 no longer "Most capable"
- gpt-5-2 no longer "Latest"
- gpt-5-1-codex-{max,mini} + gpt-5-2-codex marked retiring 2026-07-16
- gemini-3-pro marked retired 2026-03-26 with redirect through 2026-06-07
- Several Gemini / Codex endpoints annotated with cross-geo requirement
- qwen3-next-80b annotated as Preview
OPENING PARAGRAPH:
- "available in every workspace" -> "available in supported Model Serving
regions"; calls out cross-geo requirement for several endpoints
NOT TOUCHED (out of scope: not docs-validatable from supported-models page):
- served_entities[].entity_name guidance (line 3 second half)
- SKILL.md "system.ai.* catalog" claim on the pay-per-token row
These remain as in the a-d-k snapshot and should be revisited if/when
docs cover them directly.
Test plan: `scripts/skills.py validate` -> "Everything is up to date";
`scripts/skills.py generate` -> only refreshes manifest.json timestamps.
Co-authored-by: Isaac
…ot static catalog
Quentin pointed out (PR databricks#84) that the prior two commits actually ported from `main:databricks-skills/databricks-model-serving/`, not `experimental:databricks-skills/databricks-ml-training-serving/` as the PR description claimed. The two skills take opposite approaches:
- `main` ships a static catalog table of FM API endpoint names.
- `experimental` deliberately rejects that ("a static skill list goes stale fast", so "always list at runtime instead of hard-coding names") and ships a `databricks serving-endpoints list | jq ...` one-liner plus runtime-resolved defaults (highest-numbered Claude Sonnet for agents, highest-numbered `-codex-max` for code).
Re-port to match the experimental philosophy:
- `references/fm-api-endpoints.md`: replace the static catalog with the runtime-list snippet (filtered by `databricks-` name prefix AND `system.ai.*` served entity, to exclude non-FM endpoints sharing the prefix), runtime-resolved family defaults, and CLI + SDK query examples that use a placeholder endpoint name rather than a hard-coded model.
- `SKILL.md`: update the Endpoint Types row + the Foundation-Model discovery bullet to reframe the reference as "discover at runtime" rather than "curated table".
Version stays at 0.2.0 (frontmatter unchanged → manifest unchanged).
The 2026-05-26 catalog refresh in the previous commit is dropped here: the experimental skill's point is that no static table is the right shape, so curating one against docs.databricks.com isn't useful for the stable skill either.
Co-authored-by: Isaac
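The runtime-list filter and the "highest-numbered" default rule described in this commit can be sketched in Python. This is only an illustration of the idea, not the skill's actual `jq` one-liner: it assumes each listed endpoint is a dict with a `name` and a `config.served_entities[].entity_name` field, and the endpoint and entity names in the sample are hypothetical.

```python
import re


def fm_endpoint_names(endpoints):
    """Keep endpoints that match BOTH filters: a `databricks-` name prefix
    and at least one served entity under `system.ai.` (assumed field layout)."""
    names = []
    for ep in endpoints:
        entities = (ep.get("config") or {}).get("served_entities") or []
        has_system_ai = any(
            (e.get("entity_name") or "").startswith("system.ai.") for e in entities
        )
        if ep.get("name", "").startswith("databricks-") and has_system_ai:
            names.append(ep["name"])
    return names


def version_key(name):
    """Numeric parts of the name as a tuple, e.g. '...-4-5' -> (4, 5)."""
    return tuple(int(part) for part in re.findall(r"\d+", name))


def pick_default(names, family):
    """Highest-numbered endpoint whose name contains `family`, or None."""
    candidates = [n for n in names if family in n]
    return max(candidates, key=version_key) if candidates else None


if __name__ == "__main__":
    # Hypothetical listing; real names come from the workspace at runtime.
    sample = [
        {"name": "databricks-claude-sonnet-4",
         "config": {"served_entities": [{"entity_name": "system.ai.claude-sonnet-4"}]}},
        {"name": "databricks-claude-sonnet-4-5",
         "config": {"served_entities": [{"entity_name": "system.ai.claude-sonnet-4-5"}]}},
        {"name": "databricks-my-custom-model",
         "config": {"served_entities": [{"entity_name": "main.ml.my_model"}]}},
    ]
    print(pick_default(fm_endpoint_names(sample), "claude-sonnet"))
```

Comparing numeric parts as a tuple avoids the string-sort trap where `-10` would order before `-9`.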
…ental port
Previous commit (c148500) restated the experimental section in my own words and added a "Querying" section + provisioned-throughput aside + docs-link gloss that aren't in the upstream skill. The PR's stated goal is to port from experimental, so do an actual port, not a paraphrase.
`references/fm-api-endpoints.md` now mirrors the `## Foundation Model API endpoints` section of `experimental:databricks-ml-training-serving/SKILL.md` verbatim (heading promoted from `##` to `#` since this is a standalone file): intro paragraph + the `databricks serving-endpoints list | jq ...` one-liner + the family-based default-picking rule. Nothing else.
Also trim the SKILL.md discovery bullet back toward its original shape: link to the reference file for the runtime-list snippet, then the same `system.ai` / `serving-endpoints list` / `get-open-api` alternatives that were already there.
Co-authored-by: Isaac
…ntal
Expands the port from the FM-endpoints-only scope to cover every
section of `experimental:databricks-ml-training-serving/`. Mirrors
the experimental skill's 3-file structure 1:1 into stable's
`references/` directory; the standalone fm-api-endpoints.md added in
earlier commits goes away (its content lives inline in
training-and-serving.md exactly as it does in experimental's SKILL.md).
Added (all verbatim ports, mechanical adjustments only):
references/training-and-serving.md
Ports experimental SKILL.md content. Mechanical changes only:
frontmatter stripped (destination is a reference file, not a
SKILL.md); `1-custom-pyfunc.md` → `custom-pyfunc.md`,
`2-genai-agents.md` → `genai-agents.md` (filename renames);
`../<skill>/SKILL.md` → `../../<skill>/SKILL.md` (one more level
of nesting since this file is in references/ rather than at the
skill root). Content covers: canonical train/register/serve flow,
`mlflow.{sklearn,xgboost,…}.autolog()` patterns, UC alias-based
promotion, batch scoring via `spark_udf`, real-time endpoint
create + zero-downtime version swap, `state.ready` vs
`state.config_update` poll-both gotcha, `jobs submit --no-wait`
serverless deploy pattern, Foundation Model API endpoints
runtime-list, and the full gotchas trap-table.
references/custom-pyfunc.md
Ports experimental 1-custom-pyfunc.md verbatim.
Mechanical change: `[SKILL.md]` → `[training-and-serving.md]`
where the original cross-referenced its parent SKILL.md.
Content: file-based PyFunc ("Models from Code"),
`infer_signature`, `code_paths`, pre-deploy validation via
`mlflow.models.predict(env_manager="uv")`.
references/genai-agents.md
Ports experimental 2-genai-agents.md verbatim.
Mechanical changes: cross-skill paths bumped one level deeper;
`[SKILL.md]` → `[training-and-serving.md]`. Content covers:
`ResponsesAgent` interface, LangGraph agent with
`UCFunctionToolkit` + `VectorSearchRetrieverTool`, the
`create_text_output_item` raw-dict-silently-fails gotcha, the
`resources=[...]` passthrough-auth list (DatabricksServingEndpoint,
DatabricksFunction, DatabricksVectorSearchIndex, DatabricksLakebase),
async deploy via `agents.deploy()` from a serverless job, query
via CLI and OpenAI-compatible client.
Removed:
references/fm-api-endpoints.md
Standalone file from earlier commits; its content lives inline
in training-and-serving.md exactly as it does in experimental's
SKILL.md, so the deliberate split is no longer needed.
Stable SKILL.md updates (minimal, ops-focus preserved):
- FM-endpoint link targets updated from `references/fm-api-endpoints.md`
to `references/training-and-serving.md#foundation-model-api-endpoints`
in the Endpoint Types table row and the FM-discovery bullet.
- New `### Develop & deploy new models` subsection under "What's Next"
with a 3-row table pointing at the new dev-side references, framed
as "this skill is ops-focused; for the dev-side flow, see below".
Manifest regenerated.
Co-authored-by: Isaac
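The `state.ready` vs `state.config_update` poll-both gotcha named in this commit can be illustrated with a small polling helper. This is a sketch under assumptions: the state is taken to be a dict with `ready` and `config_update` keys, the string values (`READY`, `NOT_UPDATING`, `UPDATE_FAILED`) are assumed from the serving REST API, and `get_state` is a hypothetical callable that the caller wires to the CLI or SDK.

```python
import time


def endpoint_is_settled(state):
    """True only when the endpoint serves traffic AND no config update is
    still pending. Checking `ready` alone can report an endpoint as ready
    while a version swap is in progress."""
    return (
        state.get("ready") == "READY"
        and state.get("config_update") == "NOT_UPDATING"
    )


def wait_until_settled(get_state, timeout_s=1200, interval_s=15, sleep=time.sleep):
    """Poll `get_state()` (hypothetical fetcher) until both fields settle."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state.get("config_update") == "UPDATE_FAILED":
            raise RuntimeError("endpoint config update failed")
        if endpoint_is_settled(state):
            return state
        sleep(interval_s)
    raise TimeoutError("endpoint did not settle before the timeout")
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.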
- The mechanical `../` → `../../` rewrite in the verbatim port assumed every peer skill is stable, but 4 of them live in `experimental/`. `../../<skill>/SKILL.md` resolved to `skills/<skill>/SKILL.md` which does not exist for `databricks-agent-bricks`, `databricks-mlflow-evaluation`, `databricks-vector-search`, `databricks-unity-catalog`. Repointed to `../../../experimental/<skill>/SKILL.md`. `databricks-jobs` link unchanged (it's stable).
- SKILL.md frontmatter `description` only described the ops surface, so agents wouldn't route dev-side asks (train, register, PyFunc, ResponsesAgent) to this skill. Broadened to cover both ops and the new dev surface.
- Version bumped 0.2.0 → 0.3.0 + manifest regenerated.
Co-authored-by: Isaac
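The broken-link failure described above is easy to reproduce with plain path arithmetic. The sketch below is not the repository's `scripts/skills.py`; it assumes a reference file living at `skills/<skill>/references/<file>.md` and uses a hypothetical set of existing paths to show why `../../` lands in `skills/` while `../../../experimental/` reaches the experimental tree.

```python
import posixpath


def resolve_link(source_file, link):
    """Resolve a relative markdown link against the file that contains it,
    dropping any #anchor."""
    target = link.split("#", 1)[0]
    return posixpath.normpath(posixpath.join(posixpath.dirname(source_file), target))


def broken_links(source_file, links, existing_files):
    """Return the links whose resolved path is not in `existing_files`."""
    return [
        link for link in links
        if resolve_link(source_file, link) not in existing_files
    ]


if __name__ == "__main__":
    source = "skills/databricks-model-serving/references/genai-agents.md"
    existing = {  # hypothetical repository contents
        "skills/databricks-jobs/SKILL.md",
        "experimental/databricks-vector-search/SKILL.md",
    }
    links = [
        "../../databricks-jobs/SKILL.md",
        "../../databricks-vector-search/SKILL.md",
        "../../../experimental/databricks-vector-search/SKILL.md",
    ]
    print(broken_links(source, links, existing))
```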
…-phase1
# Conflicts:
#	manifest.json
Per @simonfaltum review: before resubmitting a deploy serverless job, agents should check whether a run is already in flight (active job runs filtered on run_name) or whether the target endpoint already exists in the right state. Avoids wasting ~15 min of serverless and racing for the same endpoint name. Co-authored-by: Isaac
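A minimal sketch of the in-flight check, assuming the run listing has already been fetched as a list of dicts with `run_name` and `state.life_cycle_state` fields (the set of states treated as active is an assumption based on the Jobs API). It covers only the active-run half of the guard; the endpoint-state half would need a separate lookup.

```python
# Life-cycle states treated as "still in flight" (assumed from the Jobs API).
ACTIVE_STATES = {"QUEUED", "PENDING", "RUNNING", "TERMINATING"}


def find_active_run(runs, run_name):
    """Return the first run with this run_name that is still in flight, else None."""
    for run in runs:
        state = (run.get("state") or {}).get("life_cycle_state")
        if run.get("run_name") == run_name and state in ACTIVE_STATES:
            return run
    return None


def should_submit(runs, run_name):
    """Submit a new deploy job only when no matching run is already active."""
    return find_active_run(runs, run_name) is None


if __name__ == "__main__":
    runs = [  # hypothetical listing
        {"run_id": 1, "run_name": "deploy-churn-model",
         "state": {"life_cycle_state": "TERMINATED"}},
        {"run_id": 2, "run_name": "deploy-churn-model",
         "state": {"life_cycle_state": "RUNNING"}},
    ]
    print(should_submit(runs, "deploy-churn-model"))
```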
…icks-ml-training
Splits the post-port databricks-model-serving skill into two skills with clean responsibility boundaries: databricks-model-serving keeps the endpoint lifecycle / ops surface, and a new experimental databricks-ml-training owns the dev-side training, MLflow tracking, UC registration, custom PyFunc, and hand-rolled ResponsesAgent content.
Also closes five small gaps in databricks-model-serving where non-obvious serving behavior from the original a-d-k port had fallen through the cracks (Python deployments client gotchas, zero-downtime version swap, two-field readiness rationale, classical-ML query shape, Serving-UI SP filter).
Co-authored-by: Isaac
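As an illustration of the classical-ML query shape mentioned in this commit, the sketch below builds a `dataframe_records` request body: a list of row dicts, one per record. The column names are hypothetical, the same-columns check is an added safeguard rather than a documented requirement, and sending the payload to an endpoint is left out.

```python
import json


def build_dataframe_records_payload(rows):
    """Serialize rows (one dict per record) into a `dataframe_records` body.
    Raises ValueError when rows are empty or do not share the same columns."""
    if not rows:
        raise ValueError("at least one row is required")
    columns = set(rows[0])
    for index, row in enumerate(rows):
        if set(row) != columns:
            raise ValueError(f"row {index} has different columns than row 0")
    return json.dumps({"dataframe_records": rows})


if __name__ == "__main__":
    # Hypothetical feature columns for a tabular model.
    print(build_dataframe_records_payload([
        {"age": 41, "balance": 1200.5},
        {"age": 29, "balance": 310.0},
    ]))
```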
Why this PR exists
PR #84 lands the model-serving content (endpoint create, query, update, traffic config, AI Gateway, Foundation Model API discovery) into `databricks-model-serving`. That's the right shape for a serving-ops skill, and it's what reviewers should expect a skill called "model serving" to contain.
The remaining a-d-k content (training a model with MLflow autolog, registering it to Unity Catalog, promoting versions via `@prod` aliases, custom PyFunc authoring, hand-rolled `ResponsesAgent` code) is a different lifecycle. It runs before an endpoint exists, often in a notebook submitted as a serverless job, and an agent asked to "train an XGBoost model and deploy it" needs both concerns surfaced cleanly rather than blended into one skill description.
This PR lands the dev-side content as a separate `databricks-ml-training` experimental skill, and weaves a few small but high-leverage serving-side fixes from the original a-d-k content into `databricks-model-serving` where they belong.
What this PR improves
A focused dev-side skill. New `experimental/databricks-ml-training/` owns the training → register → consume narrative: MLflow autolog with Optuna for hyperparameter tuning, `mlflow.set_registry_uri("databricks-uc")` + experiment-parent-folder pre-creation, alias-based promotion (`@prod` / `@challenger`), batch scoring via `mlflow.pyfunc.spark_udf`, custom PyFunc with the file-based "Models from Code" pattern, hand-rolled `ResponsesAgent` with LangGraph + UC Function + Vector Search tools, and the `databricks jobs submit --no-wait` train-and-deploy pattern.
Frontmatter triggers that actually triage. Each skill's description lists what it IS for and what it explicitly is NOT for, with cross-pointers (`databricks-ml-training` says "use databricks-model-serving for endpoint ops"; `databricks-model-serving` says "use databricks-ml-training for training and PyFunc authoring"). When the user says "train a model and deploy it," the orchestrator pulls both skills exactly once each.
Cross-skill links that resolve. Every `databricks-model-serving` → `databricks-ml-training` link and every reverse link uses the right relative path for the stable `skills/` ↔ experimental `experimental/` layout. No broken anchors, no stale paths to the old `training-and-serving.md` filename anywhere.
Five small but high-leverage gaps closed in `databricks-model-serving`. The original a-d-k port left a few non-obvious serving behaviors implicit. Each fix is woven into the existing section that already covers the topic: no new mega-sections, no duplication of MLflow boilerplate an LLM already knows from training data. The result: serving-side behavior an agent would otherwise have to guess at is now explicit and signposted.
Summary of changes
- `experimental/databricks-ml-training/` with SKILL.md + agents/ + assets/ + references/{custom-pyfunc.md, genai-agents.md}. Owns the full dev-side narrative (autolog + Optuna, UC registration, alias promotion, batch scoring, custom PyFunc, custom ResponsesAgent, train-and-deploy serverless job pattern).
- `databricks-model-serving/SKILL.md` inline (was previously linked into the relocated training file).
- `databricks-model-serving/SKILL.md`: MLflow Deployments client gotchas (`tags=` top-level, `served_model_name` derivation), zero-downtime version-swap pattern (alias-repoint AND `update_endpoint`), two-state-field readiness rationale (`state.ready` lies during version-swap), classical-ML `dataframe_records` query example, Serving-UI "Owned by me" SP-filter troubleshooting row. Each merged into the existing section that already covered the topic.
- `databricks-model-serving` bumped to 0.4.0 (description retightened, gaps closed). New `databricks-ml-training` at 0.1.0 under `experimental/`. Manifest regenerated; 27 skills total.
Reviewer aid
The split is on the natural seam: anything that runs before `mlflow.deployments.get_deploy_client(...).create_endpoint(...)` is dev-side and lives in `databricks-ml-training`; anything from `create_endpoint` onward is ops-side and lives in `databricks-model-serving`. The Python `create_endpoint(...)` / `update_endpoint(...)` call itself is canonically a serving operation and is documented there with the two non-obvious gotchas.
Validation: `python3 scripts/skills.py validate` passes; zero broken links across both touched skills.
This pull request and its description were written by Isaac.