
skills: split into databricks-model-serving (ops) + databricks-ml-training (experimental)#110

Open
QuentinAmbard wants to merge 9 commits into databricks:main from QuentinAmbard:skills/databricks-ml-training-split


QuentinAmbard commented May 28, 2026

Stacks on top of #84. Merge #84 first; this PR includes those commits at the base and adds the new skill on top.

Why this PR exists

PR #84 lands the model-serving content (endpoint create, query, update, traffic config, AI Gateway, Foundation Model API discovery) into databricks-model-serving. That's the right shape for a serving-ops skill, and it's what reviewers should expect a skill called "model serving" to contain.

The remaining a-d-k content — training a model with MLflow autolog, registering it to Unity Catalog, promoting versions via @prod aliases, custom PyFunc authoring, hand-rolled ResponsesAgent code — is a different lifecycle. It runs before an endpoint exists, often in a notebook submitted as a serverless job, and an agent asked to "train an XGBoost model and deploy it" needs both concerns surfaced cleanly rather than blended into one skill description.

This PR lands the dev-side content as a separate databricks-ml-training experimental skill, and weaves a few small but high-leverage serving-side fixes from the original a-d-k content into databricks-model-serving where they belong.

What this PR improves

A focused dev-side skill. New experimental/databricks-ml-training/ owns the training → register → consume narrative: MLflow autolog with Optuna for hyperparameter tuning, mlflow.set_registry_uri("databricks-uc") + experiment-parent-folder pre-creation, alias-based promotion (@prod / @challenger), batch scoring via mlflow.pyfunc.spark_udf, custom PyFunc with the file-based "Models from Code" pattern, hand-rolled ResponsesAgent with LangGraph + UC Function + Vector Search tools, and the databricks jobs submit --no-wait train-and-deploy pattern.

Frontmatter triggers that actually triage. Each skill's description lists what it IS for and what it explicitly is NOT for, with cross-pointers (databricks-ml-training says "use databricks-model-serving for endpoint ops"; databricks-model-serving says "use databricks-ml-training for training and PyFunc authoring"). When the user says "train a model and deploy it," the orchestrator pulls both skills exactly once each.

Cross-skill links that resolve. Every databricks-model-serving → databricks-ml-training link and every reverse link uses the right relative path for the stable skills/ ↔ experimental experimental/ layout. No broken anchors, no stale paths to the old training-and-serving.md filename anywhere.

Five small but high-leverage gaps closed in databricks-model-serving. The original a-d-k port left a few non-obvious serving behaviors implicit. Each fix is woven into the existing section that already covers the topic — no new mega-sections, no duplication of MLflow boilerplate an LLM already knows from training data. The result: serving-side behavior an agent would otherwise have to guess at is now explicit and signposted.

Summary of changes

| Area | What changed |
| --- | --- |
| New experimental skill | experimental/databricks-ml-training/ with SKILL.md + agents/ + assets/ + references/{custom-pyfunc.md, genai-agents.md}. Owns the full dev-side narrative (autolog + Optuna, UC registration, alias promotion, batch scoring, custom PyFunc, custom ResponsesAgent, train-and-deploy serverless job pattern). |
| Frontmatter scoping | Both skills' descriptions list scope + explicit NOT-for callouts pointing at their sibling. Triggers triage cleanly when the user mentions both training and deployment. |
| Cross-links | All cross-skill paths fixed for the stable ↔ experimental layout. Foundation Model API discovery moved into databricks-model-serving/SKILL.md inline (was previously linked into the relocated training file). |
| Serving gaps closed | Five small fixes in databricks-model-serving/SKILL.md: MLflow Deployments client gotchas (tags= top-level, served_model_name derivation), zero-downtime version-swap pattern (alias-repoint AND update_endpoint), two-state-field readiness rationale (state.ready lies during version-swap), classical-ML dataframe_records query example, Serving-UI "Owned by me" SP-filter troubleshooting row. Each merged into the existing section that already covered the topic. |
| DAS-only content preserved | All kept from PR #84: idempotency check before agent deploy, AppKit integration section, off-platform AI SDK v6 streaming, endpoint-structure ASCII diagram, OpenAPI schema section. |
| Versioning | databricks-model-serving bumped to 0.4.0 (description retightened, gaps closed). New databricks-ml-training at 0.1.0 under experimental/. Manifest regenerated; 27 skills total. |
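
For reviewers unfamiliar with the two-state-field readiness point in the table above, here is a minimal sketch of the check it implies. The string values `"READY"` and `"NOT_UPDATING"` and the plain-dict shape of `state` are assumptions about the serving-endpoint response, not something this PR's text spells out; the function name is hypothetical.

```python
def endpoint_is_settled(state):
    """Return True only when both state fields agree the endpoint is done.

    Assumption: `state` is a dict with `ready` and `config_update` keys,
    and the settled values are "READY" and "NOT_UPDATING".
    """
    ready = state.get("ready") == "READY"
    not_updating = state.get("config_update") == "NOT_UPDATING"
    return ready and not_updating


# During a version swap the old version can still report ready
# while the new config is rolling out.
mid_swap = {"ready": "READY", "config_update": "IN_PROGRESS"}
settled = {"ready": "READY", "config_update": "NOT_UPDATING"}

print(endpoint_is_settled(mid_swap))
print(endpoint_is_settled(settled))
```

Checking `ready` alone would treat the `mid_swap` state as finished, which is the failure mode the table row describes.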

Reviewer aid

The split is on the natural seam: anything that runs before mlflow.deployments.get_deploy_client(...).create_endpoint(...) is dev-side and lives in databricks-ml-training; anything from create_endpoint onward is ops-side and lives in databricks-model-serving. The Python create_endpoint(...) / update_endpoint(...) call itself is canonically a serving operation and is documented there with the two non-obvious gotchas.

Validation: python3 scripts/skills.py validate passes; zero broken links across both touched skills.

This pull request and its description were written by Isaac.

jamesbroadhead and others added 9 commits May 26, 2026 09:47
Phase 1 of databricks#73's TODO #1b. Adds references/fm-api-endpoints.md with the
curated Foundation Model API endpoint table (chat/instruct + embedding
models) from databricks-solutions/ai-dev-kit's model-serving skill,
plus common defaults and query examples (CLI + SDK).

Stripped: the cloud/language prefix on the docs link, and the leftover
MCP-tool references in the source. The endpoint table itself is static
catalog data — no MCP coupling.

SKILL.md updates:
- bump version to 0.2.0
- point Endpoint Types table at the new reference
- point the Foundation Model discovery bullet at the new reference

Subsequent phases (separate PRs / commits) port the remaining dev-side
content: classical-ml autolog patterns, Custom PyFunc signatures,
ResponsesAgent with the create_text_output_item gotcha, UCFunctionToolkit
+ VectorSearchRetrieverTool resource passthrough.

Co-authored-by: Isaac
Aligns the verbatim a-d-k port with the live docs.databricks.com
supported-models page (validated via WebFetch on 2026-05-26):

ADDED (missing from a-d-k snapshot):
- databricks-claude-opus-4-7 (now most capable Claude)
- databricks-gpt-5-5-pro, 5-5
- databricks-gpt-5-4, 5-4-mini, 5-4-nano
- databricks-gpt-5-3-codex, 5-2-codex
- databricks-gemini-3-1-flash-lite, 3-5-flash
- databricks-qwen35-122b-a10b (Preview)

REMOVED (retired, no longer in docs):
- databricks-claude-3-7-sonnet
- databricks-meta-llama-3-1-405b-instruct

UPDATED notes:
- claude-opus-4-6 no longer "Most capable"
- gpt-5-2 no longer "Latest"
- gpt-5-1-codex-{max,mini} + gpt-5-2-codex marked retiring 2026-07-16
- gemini-3-pro marked retired 2026-03-26 with redirect through 2026-06-07
- Several Gemini / Codex endpoints annotated with cross-geo requirement
- qwen3-next-80b annotated as Preview

OPENING PARAGRAPH:
- "available in every workspace" -> "available in supported Model Serving
  regions"; calls out cross-geo requirement for several endpoints

NOT TOUCHED (out of scope: not docs-validatable from supported-models page):
- served_entities[].entity_name guidance (line 3 second half)
- SKILL.md "system.ai.* catalog" claim on the pay-per-token row
These remain as in the a-d-k snapshot and should be revisited if/when
docs cover them directly.

Test plan: `scripts/skills.py validate` -> "Everything is up to date";
`scripts/skills.py generate` -> only refreshes manifest.json timestamps.

Co-authored-by: Isaac
…ot static catalog

Quentin pointed out (PR databricks#84) that the prior two commits actually ported
from `main:databricks-skills/databricks-model-serving/`, not
`experimental:databricks-skills/databricks-ml-training-serving/` as the
PR description claimed. The two skills take opposite approaches:

  - `main` ships a static catalog table of FM API endpoint names.
  - `experimental` deliberately rejects that ("a static skill list goes
    stale fast — always list at runtime instead of hard-coding names")
    and ships a `databricks serving-endpoints list | jq ...` one-liner
    plus runtime-resolved defaults (highest-numbered Claude Sonnet for
    agents, highest-numbered `-codex-max` for code).

Re-port to match the experimental philosophy:

  - `references/fm-api-endpoints.md`: replace the static catalog with the
    runtime-list snippet (filtered by `databricks-` name prefix AND
    `system.ai.*` served entity, to exclude non-FM endpoints sharing the
    prefix), runtime-resolved family defaults, and CLI + SDK query
    examples that use a placeholder endpoint name rather than a hard-coded
    model.

  - `SKILL.md`: update the Endpoint Types row + the Foundation-Model
    discovery bullet to reframe the reference as "discover at runtime"
    rather than "curated table". Version stays at 0.2.0 (frontmatter
    unchanged → manifest unchanged).
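
The runtime-resolved default described above (filter by `databricks-` prefix and `system.ai.*` served entity, then take the highest-numbered family member) could look roughly like the sketch below. It assumes the list output has been parsed into dicts with `name` and `config.served_entities[].entity_name`; the sample endpoint and entity names are illustrative only, and the helper names are hypothetical. It handles families whose version is a trailing suffix, so a family with a suffix after the version (such as `-codex-max`) would need a different pattern.

```python
import re


def is_fm_endpoint(endpoint):
    """Keep endpoints with the `databricks-` prefix AND a `system.ai.*` entity."""
    name_ok = endpoint.get("name", "").startswith("databricks-")
    entities = (endpoint.get("config") or {}).get("served_entities") or []
    entity_ok = any(
        (e.get("entity_name") or "").startswith("system.ai.") for e in entities
    )
    return name_ok and entity_ok


def version_key(name, family):
    """Parse a trailing version like `-4-5` into (4, 5); None if no match."""
    match = re.search(re.escape(family) + r"-(\d+(?:-\d+)*)$", name)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("-"))


def pick_default(endpoints, family):
    """Return the highest-versioned endpoint name in `family`, or None."""
    candidates = []
    for ep in endpoints:
        if not is_fm_endpoint(ep):
            continue
        key = version_key(ep["name"], family)
        if key is not None:
            candidates.append((key, ep["name"]))
    return max(candidates)[1] if candidates else None


# Illustrative sample data; real names come from the runtime list.
sample = [
    {"name": "databricks-claude-sonnet-4",
     "config": {"served_entities": [{"entity_name": "system.ai.example-a"}]}},
    {"name": "databricks-claude-sonnet-4-5",
     "config": {"served_entities": [{"entity_name": "system.ai.example-b"}]}},
    {"name": "databricks-claude-sonnet-9",  # shares the prefix, not an FM endpoint
     "config": {"served_entities": [{"entity_name": "main.ml.custom_model"}]}},
]

print(pick_default(sample, "claude-sonnet"))
```

The third sample entry shows why the served-entity filter matters: without it, a custom endpoint that happens to share the prefix would win the version comparison.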

The 2026-05-26 catalog refresh in the previous commit is dropped here:
the experimental skill's point is that no static table is the right
shape, so curating one against docs.databricks.com isn't useful for the
stable skill either.

Co-authored-by: Isaac
…ental port

Previous commit (c148500) restated the experimental section in my own
words and added a "Querying" section + provisioned-throughput aside +
docs-link gloss that aren't in the upstream skill. The PR's stated goal
is to port from experimental — do an actual port, not a paraphrase.

`references/fm-api-endpoints.md` now mirrors the
`## Foundation Model API endpoints` section of
`experimental:databricks-ml-training-serving/SKILL.md` verbatim
(heading promoted from `##` to `#` since this is a standalone file):
intro paragraph + the `databricks serving-endpoints list | jq ...`
one-liner + the family-based default-picking rule. Nothing else.

Also trim the SKILL.md discovery bullet back toward its original
shape — link to the reference file for the runtime-list snippet, then
the same `system.ai` / `serving-endpoints list` / `get-open-api`
alternatives that were already there.

Co-authored-by: Isaac
…ntal

Expands the port from the FM-endpoints-only scope to cover every
section of `experimental:databricks-ml-training-serving/`. Mirrors
the experimental skill's 3-file structure 1:1 into stable's
`references/` directory; the standalone fm-api-endpoints.md added in
earlier commits goes away (its content lives inline in
training-and-serving.md exactly as it does in experimental's SKILL.md).

Added (all verbatim ports, mechanical adjustments only):

  references/training-and-serving.md
    Ports experimental SKILL.md content. Mechanical changes only:
    frontmatter stripped (destination is a reference file, not a
    SKILL.md); `1-custom-pyfunc.md` → `custom-pyfunc.md`,
    `2-genai-agents.md` → `genai-agents.md` (filename renames);
    `../<skill>/SKILL.md` → `../../<skill>/SKILL.md` (one more level
    of nesting since this file is in references/ rather than at the
    skill root). Content covers: canonical train/register/serve flow,
    `mlflow.{sklearn,xgboost,…}.autolog()` patterns, UC alias-based
    promotion, batch scoring via `spark_udf`, real-time endpoint
    create + zero-downtime version swap, `state.ready` vs
    `state.config_update` poll-both gotcha, `jobs submit --no-wait`
    serverless deploy pattern, Foundation Model API endpoints
    runtime-list, and the full gotchas trap-table.

  references/custom-pyfunc.md
    Ports experimental 1-custom-pyfunc.md verbatim.
    Mechanical change: `[SKILL.md]` → `[training-and-serving.md]`
    where the original cross-referenced its parent SKILL.md.
    Content: file-based PyFunc ("Models from Code"),
    `infer_signature`, `code_paths`, pre-deploy validation via
    `mlflow.models.predict(env_manager="uv")`.

  references/genai-agents.md
    Ports experimental 2-genai-agents.md verbatim.
    Mechanical changes: cross-skill paths bumped one level deeper;
    `[SKILL.md]` → `[training-and-serving.md]`. Content covers:
    `ResponsesAgent` interface, LangGraph agent with
    `UCFunctionToolkit` + `VectorSearchRetrieverTool`, the
    `create_text_output_item` raw-dict-silently-fails gotcha, the
    `resources=[...]` passthrough-auth list (DatabricksServingEndpoint,
    DatabricksFunction, DatabricksVectorSearchIndex, DatabricksLakebase),
    async deploy via `agents.deploy()` from a serverless job, query
    via CLI and OpenAI-compatible client.

Removed:

  references/fm-api-endpoints.md
    Standalone file from earlier commits; its content lives inline
    in training-and-serving.md exactly as it does in experimental's
    SKILL.md, so the deliberate split is no longer needed.

Stable SKILL.md updates (minimal, ops-focus preserved):

  - FM-endpoint link targets updated from `references/fm-api-endpoints.md`
    to `references/training-and-serving.md#foundation-model-api-endpoints`
    in the Endpoint Types table row and the FM-discovery bullet.
  - New `### Develop & deploy new models` subsection under "What's Next"
    with a 3-row table pointing at the new dev-side references, framed
    as "this skill is ops-focused; for the dev-side flow, see below".

Manifest regenerated.

Co-authored-by: Isaac
- The mechanical `../` → `../../` rewrite in the verbatim port assumed
  every peer skill is stable, but 4 of them live in `experimental/`.
  `../../<skill>/SKILL.md` resolved to `skills/<skill>/SKILL.md` which
  does not exist for `databricks-agent-bricks`, `databricks-mlflow-evaluation`,
  `databricks-vector-search`, `databricks-unity-catalog`. Repointed to
  `../../../experimental/<skill>/SKILL.md`. `databricks-jobs` link unchanged
  (it's stable).

- SKILL.md frontmatter `description` only described the ops surface, so
  agents wouldn't route dev-side asks (train, register, PyFunc, ResponsesAgent)
  to this skill. Broadened to cover both ops and the new dev surface.

- Version bumped 0.2.0 → 0.3.0 + manifest regenerated.
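
To make the path fix in the first bullet concrete, the sketch below resolves both link forms from a reference file. The source file location `skills/databricks-model-serving/references/genai-agents.md` is an assumption inferred from the paths quoted above, and `resolve_link` is a hypothetical helper, not part of `scripts/skills.py`.

```python
import posixpath


def resolve_link(source_file, link):
    """Resolve a relative markdown link against the file that contains it."""
    base_dir = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(base_dir, link))


# Assumed location of a ported reference file inside the stable skill.
source = "skills/databricks-model-serving/references/genai-agents.md"

# The mechanical rewrite lands in the stable tree, where this skill is absent.
broken = resolve_link(source, "../../databricks-vector-search/SKILL.md")

# The repointed link climbs one more level and enters experimental/.
fixed = resolve_link(
    source, "../../../experimental/databricks-vector-search/SKILL.md"
)

print(broken)
print(fixed)
```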

Co-authored-by: Isaac
Per @simonfaltum review: before resubmitting a deploy serverless job,
agents should check whether a run is already in flight (active job
runs filtered on run_name) or whether the target endpoint already
exists in the right state. Avoids wasting ~15 min of serverless and
racing for the same endpoint name.
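
A minimal sketch of that pre-submit check, assuming the active runs and the endpoint have already been fetched and parsed into dicts. The field names (`run_name`, `state.ready`, `state.config_update`), the state values, and the function name are assumptions for illustration, not the skill's actual code.

```python
def should_submit_deploy(active_runs, run_name, endpoint):
    """Decide whether a new deploy job is needed.

    active_runs: list of dicts for currently active job runs (assumed shape).
    endpoint: dict for the target endpoint, or None if it does not exist.
    Returns (submit, reason).
    """
    # A run with the same name is already in flight: do not race it.
    if any(run.get("run_name") == run_name for run in active_runs):
        return False, "run already in flight"

    # The endpoint already exists and has settled: nothing to deploy.
    if endpoint is not None:
        state = endpoint.get("state") or {}
        if (state.get("ready") == "READY"
                and state.get("config_update") == "NOT_UPDATING"):
            return False, "endpoint already serving"

    return True, "no matching run or settled endpoint"


# Hypothetical run and endpoint data.
runs = [{"run_name": "deploy-churn-agent"}]
print(should_submit_deploy(runs, "deploy-churn-agent", None))
print(should_submit_deploy([], "deploy-churn-agent", None))
```

Whether an existing endpoint counts as being "in the right state" may also depend on which model version it serves; that comparison is left out here.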

Co-authored-by: Isaac
…icks-ml-training

Splits the post-port databricks-model-serving skill into two skills with
clean responsibility boundaries: databricks-model-serving keeps the
endpoint lifecycle / ops surface, and a new experimental
databricks-ml-training owns the dev-side training, MLflow tracking, UC
registration, custom PyFunc, and hand-rolled ResponsesAgent content.

Also closes five small gaps in databricks-model-serving where
non-obvious serving behavior from the original a-d-k port had fallen
through the cracks (Python deployments client gotchas, zero-downtime
version swap, two-field readiness rationale, classical-ML query shape,
Serving-UI SP filter).
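
As an illustration of the classical-ML query shape mentioned above, the sketch below builds a `dataframe_records` request body: a list of row dicts, one per record. The feature names are hypothetical and the helper is not taken from the skill itself.

```python
import json


def build_dataframe_records_payload(rows):
    """Wrap row dicts in the `dataframe_records` envelope for a tabular model."""
    if not rows:
        raise ValueError("at least one row is required")
    columns = set(rows[0])
    for i, row in enumerate(rows):
        # Every record should carry the same feature columns.
        if set(row) != columns:
            raise ValueError(f"row {i} has different columns than row 0")
    return json.dumps({"dataframe_records": rows})


# Hypothetical feature names, for illustration only.
payload = build_dataframe_records_payload([
    {"age": 42, "plan": "pro"},
    {"age": 31, "plan": "free"},
])
print(payload)
```

The resulting string is what would be sent as the JSON body when querying the endpoint.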

Co-authored-by: Isaac