skills(model-serving): merge dev-side training/agent flows from a-d-k experimental #84
jamesbroadhead wants to merge 11 commits
Phase 1 of #73's TODO #1b. Adds references/fm-api-endpoints.md with the curated Foundation Model API endpoint table (chat/instruct + embedding models) from databricks-solutions/ai-dev-kit's model-serving skill, plus common defaults and query examples (CLI + SDK).

Stripped: the cloud/language prefix on the docs link, and the leftover MCP-tool references in the source. The endpoint table itself is static catalog data, with no MCP coupling.

SKILL.md updates:
- bump version to 0.2.0
- point Endpoint Types table at the new reference
- point the Foundation Model discovery bullet at the new reference

Subsequent phases (separate PRs / commits) port the remaining dev-side content: classical-ml autolog patterns, Custom PyFunc signatures, ResponsesAgent with the create_text_output_item gotcha, UCFunctionToolkit + VectorSearchRetrieverTool resource passthrough.

Co-authored-by: Isaac
Aligns the verbatim a-d-k port with the live docs.databricks.com
supported-models page (validated via WebFetch on 2026-05-26):
ADDED (missing from a-d-k snapshot):
- databricks-claude-opus-4-7 (now most capable Claude)
- databricks-gpt-5-5-pro, 5-5
- databricks-gpt-5-4, 5-4-mini, 5-4-nano
- databricks-gpt-5-3-codex, 5-2-codex
- databricks-gemini-3-1-flash-lite, 3-5-flash
- databricks-qwen35-122b-a10b (Preview)
REMOVED (retired, no longer in docs):
- databricks-claude-3-7-sonnet
- databricks-meta-llama-3-1-405b-instruct
UPDATED notes:
- claude-opus-4-6 no longer "Most capable"
- gpt-5-2 no longer "Latest"
- gpt-5-1-codex-{max,mini} + gpt-5-2-codex marked retiring 2026-07-16
- gemini-3-pro marked retired 2026-03-26 with redirect through 2026-06-07
- Several Gemini / Codex endpoints annotated with cross-geo requirement
- qwen3-next-80b annotated as Preview
OPENING PARAGRAPH:
- "available in every workspace" -> "available in supported Model Serving
regions"; calls out cross-geo requirement for several endpoints
NOT TOUCHED (out of scope: not docs-validatable from supported-models page):
- served_entities[].entity_name guidance (line 3 second half)
- SKILL.md "system.ai.* catalog" claim on the pay-per-token row
These remain as in the a-d-k snapshot and should be revisited if/when
docs cover them directly.
Test plan: `scripts/skills.py validate` -> "Everything is up to date";
`scripts/skills.py generate` -> only refreshes manifest.json timestamps.
Co-authored-by: Isaac
Force-pushed from c9015d8 to d400eff.
@jamesbroadhead I suspect this is also coming from main from the content I see? The experimental skill is https://github.com/databricks-solutions/ai-dev-kit/blob/experimental/databricks-skills/databricks-ml-training-serving/SKILL.md
Hi @QuentinAmbard, Claude here, working with James. You're right, and I owe you (and the PR description) a correction. I checked both branches:

Content-wise it's a clean fingerprint match for […]

The bigger issue is philosophical: the experimental […]

It ships […]

I'll rework this PR to actually align with the experimental skill: replace the static catalog reference file with the runtime-list snippet + the runtime-resolved defaults, and fix the PR description. The doc-validated catalog work isn't wasted; it just shouldn't be how the stable skill steers callers.

On "Should I also open another merge PR for suggestion?": happy to coordinate. If you mean a PR upstream into a-d-k's […]
…ot static catalog

Quentin pointed out (PR #84) that the prior two commits actually ported from `main:databricks-skills/databricks-model-serving/`, not `experimental:databricks-skills/databricks-ml-training-serving/` as the PR description claimed. The two skills take opposite approaches:
- `main` ships a static catalog table of FM API endpoint names.
- `experimental` deliberately rejects that ("a static skill list goes stale fast — always list at runtime instead of hard-coding names") and ships a `databricks serving-endpoints list | jq ...` one-liner plus runtime-resolved defaults (highest-numbered Claude Sonnet for agents, highest-numbered `-codex-max` for code).

Re-port to match the experimental philosophy:
- `references/fm-api-endpoints.md`: replace the static catalog with the runtime-list snippet (filtered by `databricks-` name prefix AND `system.ai.*` served entity, to exclude non-FM endpoints sharing the prefix), runtime-resolved family defaults, and CLI + SDK query examples that use a placeholder endpoint name rather than a hard-coded model.
- `SKILL.md`: update the Endpoint Types row + the Foundation-Model discovery bullet to reframe the reference as "discover at runtime" rather than "curated table".

Version stays at 0.2.0 (frontmatter unchanged → manifest unchanged).

The 2026-05-26 catalog refresh in the previous commit is dropped here: the experimental skill's point is that no static table is the right shape, so curating one against docs.databricks.com isn't useful for the stable skill either.

Co-authored-by: Isaac
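The two-part filter and the family-based default described in this commit can be sketched in Python. This is only an illustration, not the skill's snippet (which is a `jq` one-liner whose body is not shown here). It assumes the endpoint list has already been parsed from JSON into dicts shaped like `{"name": ..., "config": {"served_entities": [{"entity_name": ...}]}}`. The helper names `is_fm_endpoint`, `version_key` and `pick_default`, and every endpoint and entity name in the sample, are hypothetical.

```python
import re

# Hypothetical sample of parsed `serving-endpoints list` output. The dict
# shape and every name below are illustrative assumptions, not real data.
SAMPLE_ENDPOINTS = [
    {"name": "databricks-claude-sonnet-4",
     "config": {"served_entities": [{"entity_name": "system.ai.claude-sonnet-4"}]}},
    {"name": "databricks-claude-sonnet-4-5",
     "config": {"served_entities": [{"entity_name": "system.ai.claude-sonnet-4-5"}]}},
    # Shares the name prefix but serves a workspace model, so it is excluded.
    {"name": "databricks-team-chatbot",
     "config": {"served_entities": [{"entity_name": "main.ml.chatbot"}]}},
    {"name": "fraud-scorer",
     "config": {"served_entities": [{"entity_name": "main.ml.fraud"}]}},
]


def is_fm_endpoint(endpoint: dict) -> bool:
    """True when the name starts with databricks- AND an entity is under system.ai."""
    if not endpoint.get("name", "").startswith("databricks-"):
        return False
    entities = (endpoint.get("config") or {}).get("served_entities") or []
    return any((e.get("entity_name") or "").startswith("system.ai.") for e in entities)


def version_key(name: str) -> tuple:
    """Numeric parts of a name as a tuple, used to rank versions within a family."""
    return tuple(int(part) for part in re.findall(r"\d+", name))


def pick_default(endpoints: list, family: str):
    """Highest-numbered FM endpoint whose name contains `family`, else None."""
    names = [e["name"] for e in endpoints if is_fm_endpoint(e) and family in e["name"]]
    return max(names, key=version_key, default=None)


print(pick_default(SAMPLE_ENDPOINTS, "claude-sonnet"))
```

Ranking by every digit run in the name is a simplification: it works for names that end in a plain version suffix but would need refinement for names that embed parameter counts.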
…ental port

Previous commit (c148500) restated the experimental section in my own words and added a "Querying" section + provisioned-throughput aside + docs-link gloss that aren't in the upstream skill. The PR's stated goal is to port from experimental, so do an actual port, not a paraphrase.

`references/fm-api-endpoints.md` now mirrors the `## Foundation Model API endpoints` section of `experimental:databricks-ml-training-serving/SKILL.md` verbatim (heading promoted from `##` to `#` since this is a standalone file): intro paragraph + the `databricks serving-endpoints list | jq ...` one-liner + the family-based default-picking rule. Nothing else.

Also trim the SKILL.md discovery bullet back toward its original shape: link to the reference file for the runtime-list snippet, then the same `system.ai` / `serving-endpoints list` / `get-open-api` alternatives that were already there.

Co-authored-by: Isaac
…ntal
Expands the port from the FM-endpoints-only scope to cover every
section of `experimental:databricks-ml-training-serving/`. Mirrors
the experimental skill's 3-file structure 1:1 into stable's
`references/` directory; the standalone fm-api-endpoints.md added in
earlier commits goes away (its content lives inline in
training-and-serving.md exactly as it does in experimental's SKILL.md).
Added (all verbatim ports, mechanical adjustments only):
references/training-and-serving.md
Ports experimental SKILL.md content. Mechanical changes only:
frontmatter stripped (destination is a reference file, not a
SKILL.md); `1-custom-pyfunc.md` → `custom-pyfunc.md`,
`2-genai-agents.md` → `genai-agents.md` (filename renames);
`../<skill>/SKILL.md` → `../../<skill>/SKILL.md` (one more level
of nesting since this file is in references/ rather than at the
skill root). Content covers: canonical train/register/serve flow,
`mlflow.{sklearn,xgboost,…}.autolog()` patterns, UC alias-based
promotion, batch scoring via `spark_udf`, real-time endpoint
create + zero-downtime version swap, `state.ready` vs
`state.config_update` poll-both gotcha, `jobs submit --no-wait`
serverless deploy pattern, Foundation Model API endpoints
runtime-list, and the full gotchas trap-table.
references/custom-pyfunc.md
Ports experimental 1-custom-pyfunc.md verbatim.
Mechanical change: `[SKILL.md]` → `[training-and-serving.md]`
where the original cross-referenced its parent SKILL.md.
Content: file-based PyFunc ("Models from Code"),
`infer_signature`, `code_paths`, pre-deploy validation via
`mlflow.models.predict(env_manager="uv")`.
references/genai-agents.md
Ports experimental 2-genai-agents.md verbatim.
Mechanical changes: cross-skill paths bumped one level deeper;
`[SKILL.md]` → `[training-and-serving.md]`. Content covers:
`ResponsesAgent` interface, LangGraph agent with
`UCFunctionToolkit` + `VectorSearchRetrieverTool`, the
`create_text_output_item` raw-dict-silently-fails gotcha, the
`resources=[...]` passthrough-auth list (DatabricksServingEndpoint,
DatabricksFunction, DatabricksVectorSearchIndex, DatabricksLakebase),
async deploy via `agents.deploy()` from a serverless job, query
via CLI and OpenAI-compatible client.
Removed:
references/fm-api-endpoints.md
Standalone file from earlier commits; its content lives inline
in training-and-serving.md exactly as it does in experimental's
SKILL.md, so the deliberate split is no longer needed.
Stable SKILL.md updates (minimal, ops-focus preserved):
- FM-endpoint link targets updated from `references/fm-api-endpoints.md`
to `references/training-and-serving.md#foundation-model-api-endpoints`
in the Endpoint Types table row and the FM-discovery bullet.
- New `### Develop & deploy new models` subsection under "What's Next"
with a 3-row table pointing at the new dev-side references, framed
as "this skill is ops-focused; for the dev-side flow, see below".
Manifest regenerated.
Co-authored-by: Isaac
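The `state.ready` vs `state.config_update` poll-both gotcha named in the commit above can be illustrated with a small polling helper. This is a minimal sketch under stated assumptions: the endpoint state is taken to be a dict with string fields `ready` and `config_update`, using the values `READY`, `NOT_UPDATING`, `IN_PROGRESS`, `UPDATE_FAILED` and `UPDATE_CANCELED`. The names `classify_state` and `wait_until_settled` are hypothetical, and the state fetcher is injected so the sketch runs without a workspace.

```python
import time
from typing import Callable


def classify_state(state: dict) -> str:
    """Classify an endpoint state dict as 'settled', 'failed' or 'pending'."""
    ready = state.get("ready")
    update = state.get("config_update")
    if update in ("UPDATE_FAILED", "UPDATE_CANCELED"):
        return "failed"
    # Both fields must agree: an endpoint can report READY (old version still
    # serving) while a new config is still rolling out.
    if ready == "READY" and update == "NOT_UPDATING":
        return "settled"
    return "pending"


def wait_until_settled(
    fetch_state: Callable[[], dict],
    timeout_s: float = 1200.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the endpoint settles; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = classify_state(fetch_state())
        if status == "settled":
            return True
        if status == "failed":
            raise RuntimeError("endpoint config update failed or was canceled")
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Checking only `ready` would report success in the middle of a version swap, which is the trap the gotcha warns about.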
- The mechanical `../` → `../../` rewrite in the verbatim port assumed every peer skill is stable, but 4 of them live in `experimental/`. `../../<skill>/SKILL.md` resolved to `skills/<skill>/SKILL.md` which does not exist for `databricks-agent-bricks`, `databricks-mlflow-evaluation`, `databricks-vector-search`, `databricks-unity-catalog`. Repointed to `../../../experimental/<skill>/SKILL.md`. `databricks-jobs` link unchanged (it's stable).
- SKILL.md frontmatter `description` only described the ops surface, so agents wouldn't route dev-side asks (train, register, PyFunc, ResponsesAgent) to this skill. Broadened to cover both ops and the new dev surface.
- Version bumped 0.2.0 → 0.3.0 + manifest regenerated.

Co-authored-by: Isaac
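The link breakage fixed in this commit comes down to relative-path resolution, which can be checked mechanically. The sketch below assumes a repository layout with `skills/<skill>/references/` for stable skills and a sibling `experimental/<skill>/` tree, consistent with the resolved paths quoted in the commit. The `EXISTING` set and the helpers `resolve_link` and `is_broken` are hypothetical stand-ins for a real file-system check.

```python
import posixpath

# Assumed location of the ported reference file inside the repository.
SOURCE = "skills/databricks-model-serving/references/training-and-serving.md"

# Hypothetical set of files that exist; a real check would query the file system.
EXISTING = {
    "skills/databricks-jobs/SKILL.md",
    "experimental/databricks-vector-search/SKILL.md",
}


def resolve_link(source_file: str, link: str) -> str:
    """Resolve a relative markdown link against the directory of source_file."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(source_file), link))


def is_broken(source_file: str, link: str) -> bool:
    """True when the resolved link target is not among the known files."""
    return resolve_link(source_file, link) not in EXISTING


for link in (
    "../../databricks-jobs/SKILL.md",
    "../../databricks-vector-search/SKILL.md",
    "../../../experimental/databricks-vector-search/SKILL.md",
):
    print(link, "->", resolve_link(SOURCE, link), "broken" if is_broken(SOURCE, link) else "ok")
```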
QuentinAmbard
left a comment
nice let's merge this one, I'll send a followup PR on top!
## Deploy (async job, ~15 min)

`databricks.agents.deploy()` blocks for ~15 minutes — don't run it inline from the CLI. Submit as a serverless job so the chat session doesn't hold the connection.
This is great - should we add something about how agents can check if there has already been submitted a serverless job for the deploy?
(Claude here.)
Good call — added in 8c8a1b3. Two cheap checks just before the submit:
- `databricks jobs list-runs --active-only` filtered on `run_name == "deploy_<model>"` to catch an already-in-flight deploy.
- `databricks serving-endpoints get <endpoint_name>` to skip the redeploy if the endpoint already exists on the right version.

If either hits, the recipe now says to follow the existing run with `jobs get-run` instead of submitting a new one.
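The two pre-submit checks amount to a small decision function. The sketch below is an illustration only, not the recipe's actual text: it assumes the active runs and the endpoint have already been fetched and parsed into dicts, with `run_name` / `run_id` on each run and `config.served_entities[].entity_version` on the endpoint. The function name `deploy_decision` and its return strings are hypothetical.

```python
from typing import Optional


def deploy_decision(
    active_runs: list,
    endpoint: Optional[dict],
    model: str,
    target_version: str,
) -> str:
    """Return 'follow-run:<id>', 'skip' or 'submit' for a pending deploy."""
    run_name = f"deploy_{model}"
    # Check 1: a deploy for this model is already in flight, so follow it.
    for run in active_runs:
        if run.get("run_name") == run_name:
            return f"follow-run:{run.get('run_id')}"
    # Check 2: the endpoint exists and already serves the target version.
    if endpoint is not None:
        entities = (endpoint.get("config") or {}).get("served_entities") or []
        if any(str(e.get("entity_version")) == str(target_version) for e in entities):
            return "skip"
    return "submit"
```

Passing `endpoint=None` stands in for the case where the endpoint lookup finds nothing, which leaves the in-flight run check as the only guard.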
…-phase1

# Conflicts:
#	manifest.json
Per @simonfaltum review: before resubmitting a deploy serverless job, agents should check whether a run is already in flight (active job runs filtered on run_name) or whether the target endpoint already exists in the right state. Avoids wasting ~15 min of serverless and racing for the same endpoint name.

Co-authored-by: Isaac
…apx

Related Skills entry `databricks-app-apx` was the FastAPI+React stack referenced from ai-dev-kit's `databricks-apps-python` skill. It has been removed upstream (a-d-k is deprecated; the apx-on-CLI flow merged into the stable `databricks-apps` skill via #84/#73). The "Related Skills" bullet is the last dangling reference inside this repo.

This PR was prepared by Claude.
Stacked a follow-up on this in #110, which adds a separate […]

#110 includes the commits from this PR at its base; please merge this one first, then #110 will rebase cleanly onto the new main.
- Drop ../../../experimental/... cross-skill links that 404 when installed
(skills install flat under ~/.claude/skills/, not under stable/ vs
experimental/). Use plain skill-name references instead.
- Replace ai-dev-kit-specific tag examples ("aidevkit_project") with a
neutral "project": "demo" so a d-a-s skill doesn't bleed a-d-k convention.
- Tighten SKILL.md description from ~870 chars to ~290 chars, matching the
convention being established in PR #107.
Co-authored-by: Isaac
Summary

Merges the dev-side surface of `experimental:databricks-ml-training-serving/` into stable's `databricks-model-serving`. Closes #73's TODO #1b.

The two skills had near-zero content overlap: stable was ops-focused (manage existing endpoints via CLI); experimental was dev-focused (train, register, log a PyFunc or `ResponsesAgent`, deploy). Combining them avoids forcing users to invoke two skills for what is functionally one workflow.

Shape:
- `references/` files carry the dev-side flow verbatim from experimental.
- `description` broadened so agent routing fires on dev-side asks too (train, register, PyFunc, ResponsesAgent); see the `description` field for the full trigger phrase list. NOT for: no-code agents (use `databricks-agent-bricks`); MLflow scorers (use `databricks-mlflow-evaluation`).

Changes

| File | Ported from | Content |
|---|---|---|
| `references/training-and-serving.md` | `databricks-ml-training-serving/SKILL.md` | `mlflow.{sklearn,xgboost,…}.autolog()` patterns, UC alias-based promotion (`@prod` / `@challenger`), batch scoring via `spark_udf`, real-time endpoint create + zero-downtime version swap, `state.ready` vs `state.config_update` poll-both gotcha, `jobs submit --no-wait` serverless deploy pattern, Foundation Model API endpoints runtime-list (replaces the earlier static catalog draft per @QuentinAmbard's review), and the gotchas trap-table. |
| `references/custom-pyfunc.md` | `databricks-ml-training-serving/1-custom-pyfunc.md` | File-based PyFunc (`python_model="model.py"`), `infer_signature`, `code_paths`, pre-deploy validation via `mlflow.models.predict(env_manager="uv")`. |
| `references/genai-agents.md` | `databricks-ml-training-serving/2-genai-agents.md` | `ResponsesAgent` interface, LangGraph agent with `UCFunctionToolkit` + `VectorSearchRetrieverTool`, the `create_text_output_item` raw-dict-silently-fails gotcha, the `resources=[...]` passthrough-auth list, async deploy via `agents.deploy()` from a serverless job, query via CLI and OpenAI-compatible client. |

All 3 ports are verbatim, with only mechanical adjustments:
- Filename renames: `1-custom-pyfunc.md` → `custom-pyfunc.md`, `2-genai-agents.md` → `genai-agents.md`.
- Cross-skill link paths adjusted for the `references/` location. Stable peers (`databricks-jobs`) use `../../`; experimental-only peers (`databricks-agent-bricks`, `databricks-mlflow-evaluation`, `databricks-vector-search`, `databricks-unity-catalog`) use `../../../experimental/`.

SKILL.md updates (kept tight; ops focus preserved):
- FM-endpoint links now target `references/training-and-serving.md#foundation-model-api-endpoints`.
- New `### Develop & deploy new models` subsection under "What's Next" with a 3-row table linking the new references.
- `description` expanded to cover the dev surface (see above).

Manifest: regenerated via `python3 scripts/skills.py generate`.

Reviewer history
Earlier commits on this branch made two mistakes that have since been corrected:
- Ported from `main:databricks-model-serving/` rather than `experimental:databricks-ml-training-serving/`; caught by @QuentinAmbard, reworked.
- Shipped a static catalog table where experimental uses `databricks serving-endpoints list | jq ...` plus runtime-resolved defaults. Replaced with the runtime-list snippet.

Coverage vs. #73 TODO #1b

| TODO item | Covered in |
|---|---|
| Classical-ml autolog patterns | `training-and-serving.md` § Train and register |
| Custom PyFunc signatures | `custom-pyfunc.md` (whole file) |
| `ResponsesAgent` + `create_text_output_item` gotcha | `genai-agents.md` § CRITICAL: output items must use helper methods |
| `UCFunctionToolkit` + `VectorSearchRetrieverTool` resource passthrough | `genai-agents.md` § Log + register + § Resources that need passthrough auth |
| Foundation Model API endpoints | `training-and-serving.md` § Foundation Model API endpoints |

Plus content the original TODO didn't enumerate: batch scoring via `spark_udf`, real-time endpoint create + version swap, the `state.ready` vs `state.config_update` poll-both gotcha, serverless `jobs submit --no-wait` deploy pattern, the consolidated gotchas trap-table.

Known follow-ups (out of scope)
- `references/training-and-serving.md` has an anchor link `#one-time-runs-jobs-submit--async-pattern-for-notebooks` into `databricks-jobs/SKILL.md`. The section exists in a-d-k's `databricks-jobs` but not yet in d-a-s `databricks-jobs/SKILL.md`. Link falls back to the file top.
- `references/off-platform-streaming.md` (pre-existing from #76) is in the manifest but unwired from SKILL.md. Untouched by this PR.
- Once `databricks-vector-search` is promoted to stable, the `../../../experimental/databricks-vector-search/SKILL.md` link in `training-and-serving.md` should be flipped to the stable path. #87's link-sweep should handle it.

Test plan
- `python3 scripts/skills.py generate` clean.
- `python3 scripts/skills.py validate` passes (`Everything is up to date.`).
- Review from `@databricks/eng-apps-devex` (per CODEOWNERS).

This pull request and its description were written by Claude.