
skills(model-serving): merge dev-side training/agent flows from a-d-k experimental #84

Open
jamesbroadhead wants to merge 11 commits into main from jb/model-serving-port-phase1

@jamesbroadhead jamesbroadhead commented May 24, 2026

Summary

Merges the dev-side surface of experimental:databricks-ml-training-serving/ into stable's databricks-model-serving. Closes #73's TODO #1b.

The two skills had near-zero content overlap — stable was ops-focused (manage existing endpoints via CLI); experimental was dev-focused (train, register, log a PyFunc or ResponsesAgent, deploy). Combining them avoids forcing users to invoke two skills for what is functionally one workflow.

Shape:

  • SKILL.md remains the ops entry-point (create / query / update / scale endpoints, App integration).
  • 3 new references/ files carry the dev-side flow verbatim from experimental.
  • Frontmatter description broadened so agent routing fires on dev-side asks too (train, register, PyFunc, ResponsesAgent) — see description field for the full trigger phrase list. NOT for: no-code agents (use databricks-agent-bricks); MLflow scorers (use databricks-mlflow-evaluation).

Changes

| Stable file | Source on experimental | Covers |
| --- | --- | --- |
| references/training-and-serving.md | databricks-ml-training-serving/SKILL.md | Canonical train/register/serve flow, mlflow.{sklearn,xgboost,…}.autolog() patterns, UC alias-based promotion (@prod/@challenger), batch scoring via spark_udf, real-time endpoint create + zero-downtime version swap, state.ready vs state.config_update poll-both gotcha, jobs submit --no-wait serverless deploy pattern, Foundation Model API endpoints runtime-list (replaces the earlier static catalog draft per @QuentinAmbard's review), and the gotchas trap-table. |
| references/custom-pyfunc.md | databricks-ml-training-serving/1-custom-pyfunc.md | File-based PyFunc ("Models from Code" via python_model="model.py"), infer_signature, code_paths, pre-deploy validation via mlflow.models.predict(env_manager="uv"). |
| references/genai-agents.md | databricks-ml-training-serving/2-genai-agents.md | ResponsesAgent interface, LangGraph agent with UCFunctionToolkit + VectorSearchRetrieverTool, the create_text_output_item raw-dict-silently-fails gotcha, the resources=[...] passthrough-auth list, async deploy via agents.deploy() from a serverless job, query via CLI and OpenAI-compatible client. |

All 3 ports are verbatim — only mechanical adjustments:

  1. Strip the SKILL.md frontmatter on the SKILL→reference promotion.
  2. Sibling-file renames: 1-custom-pyfunc.md → custom-pyfunc.md, 2-genai-agents.md → genai-agents.md.
  3. Cross-skill paths bumped one level for the deeper references/ location. Stable peers (databricks-jobs) use ../../; experimental-only peers (databricks-agent-bricks, databricks-mlflow-evaluation, databricks-vector-search, databricks-unity-catalog) use ../../../experimental/.
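
The path bump in step 3 can be sketched as a small rewrite pass. This is an illustrative sketch only, not the script used for this PR: the function name and regex are hypothetical, and it assumes every cross-skill link has the form `../<skill>/SKILL.md` before the move.

```python
import re

# Peers that this PR describes as experimental-only.
EXPERIMENTAL_PEERS = {
    "databricks-agent-bricks",
    "databricks-mlflow-evaluation",
    "databricks-vector-search",
    "databricks-unity-catalog",
}

# Matches ../<skill>/SKILL.md, but not the tail of an already-bumped ../../ path.
_LINK = re.compile(r"(?<![\w./])\.\./(?P<skill>[A-Za-z0-9_-]+)/SKILL\.md")


def bump_cross_skill_links(markdown: str) -> str:
    """Rewrite skill-root links so they resolve from inside references/."""

    def _rewrite(match: re.Match) -> str:
        skill = match.group("skill")
        if skill in EXPERIMENTAL_PEERS:
            return f"../../../experimental/{skill}/SKILL.md"
        return f"../../{skill}/SKILL.md"

    return _LINK.sub(_rewrite, markdown)
```

The lookbehind keeps the pass idempotent, so running it twice over the same file does not bump a link a second time.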

SKILL.md updates (kept tight — ops focus preserved):

  • FM-endpoint discovery now points to references/training-and-serving.md#foundation-model-api-endpoints.
  • New ### Develop & deploy new models subsection under "What's Next" with a 3-row table linking the new references.
  • Frontmatter description expanded to cover the dev surface (see above).
  • Version bumped to 0.3.0.

Manifest: regenerated via python3 scripts/skills.py generate.

Reviewer history

Earlier commits on this branch made two mistakes that have since been corrected:

  1. Initial draft sourced FM endpoint content from main:databricks-model-serving/ rather than experimental:databricks-ml-training-serving/ — caught by @QuentinAmbard, reworked.
  2. The first rework still shipped a static FM endpoint catalog. The experimental skill deliberately rejects static catalogs in favour of databricks serving-endpoints list | jq ... plus runtime-resolved defaults. Replaced with the runtime-list snippet.
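
For readers who prefer Python over `jq`, the runtime-list idea might look like the sketch below. It applies the two filters named in a commit message later in this thread (`databricks-` name prefix and a `system.ai.*` served entity). The JSON field names (`name`, `config.served_entities[].entity_name`) are assumptions about the shape of the `databricks serving-endpoints list` output, and the function names are hypothetical.

```python
from typing import Any


def is_foundation_model_endpoint(endpoint: dict[str, Any]) -> bool:
    """True if the endpoint looks like a pay-per-token Foundation Model endpoint."""
    if not str(endpoint.get("name", "")).startswith("databricks-"):
        return False
    # Assumed shape: config.served_entities is a list of dicts with entity_name.
    entities = (endpoint.get("config") or {}).get("served_entities") or []
    return any(
        str(entity.get("entity_name", "")).startswith("system.ai.")
        for entity in entities
    )


def foundation_model_endpoint_names(endpoints: list[dict[str, Any]]) -> list[str]:
    """Return the sorted names of endpoints that pass both filters."""
    return sorted(e["name"] for e in endpoints if is_foundation_model_endpoint(e))
```

The second filter is what excludes custom endpoints that merely share the `databricks-` prefix.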

Coverage vs. #73 TODO #1b

| TODO #1b item | Landed in this PR |
| --- | --- |
| classical-ml autolog patterns | training-and-serving.md § Train and register |
| Custom PyFunc signatures | custom-pyfunc.md (whole file) |
| ResponsesAgent + create_text_output_item gotcha | genai-agents.md § CRITICAL: output items must use helper methods |
| UCFunctionToolkit + VectorSearchRetrieverTool resource passthrough | genai-agents.md § Log + register + § Resources that need passthrough auth |
| Foundation Model API endpoint table → runtime-list | training-and-serving.md § Foundation Model API endpoints |

Plus content the original TODO didn't enumerate: batch scoring via spark_udf, real-time endpoint create + version swap, the state.ready vs state.config_update poll-both gotcha, serverless jobs submit --no-wait deploy pattern, the consolidated gotchas trap-table.
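
As a rough illustration of the poll-both gotcha: a version swap can leave `state.ready` reporting the old, still-serving version while `state.config_update` is in progress, so checking `ready` alone may report success too early. The sketch below is not the reference file's code. The enum strings (`READY`, `NOT_UPDATING`, `UPDATE_FAILED`, `UPDATE_CANCELED`) are assumptions about the serving-endpoints API, and `fetch_state` is a hypothetical callable that returns the endpoint's `state` object as a dict.

```python
import time
from typing import Callable

FAILED_UPDATES = {"UPDATE_FAILED", "UPDATE_CANCELED"}  # assumed enum values


def is_settled(state: dict) -> bool:
    """Both fields must agree before the new config can be trusted."""
    return state.get("ready") == "READY" and state.get("config_update") == "NOT_UPDATING"


def wait_until_settled(
    fetch_state: Callable[[], dict],
    timeout_s: float = 1200,
    interval_s: float = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until both fields settle; return False if the timeout elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state.get("config_update") in FAILED_UPDATES:
            raise RuntimeError(f"config update did not complete: {state}")
        if is_settled(state):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```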

Known follow-ups (out of scope)

  • references/training-and-serving.md has an anchor link #one-time-runs-jobs-submit--async-pattern-for-notebooks into databricks-jobs/SKILL.md. The section exists in a-d-k's databricks-jobs but not yet in d-a-s databricks-jobs/SKILL.md. Link falls back to the file top.
  • references/off-platform-streaming.md (pre-existing from #76) is in the manifest but unwired from SKILL.md. Untouched by this PR.
  • If #87 promotes databricks-vector-search to stable, the ../../../experimental/databricks-vector-search/SKILL.md link in training-and-serving.md should be flipped to the stable path; #87's link-sweep should handle it.

Test plan

  • python3 scripts/skills.py generate clean.
  • python3 scripts/skills.py validate passes (Everything is up to date.).
  • All 3 new reference files diff cleanly against upstream (only the mechanical adjustments above).
  • All 6 cross-skill links in the new reference files resolve to existing files in this repo.
  • CI green on this branch.
  • Owner review (@databricks/eng-apps-devex per CODEOWNERS).
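
The link-resolution item above could be checked with something like the following sketch. It is an assumption about how such a check might be written, not the check that was actually run. The helper name is hypothetical, and it ignores `#fragment` anchors, which matches the "falls back to the file top" behaviour noted under known follow-ups.

```python
import re
from pathlib import Path

_MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
_URL_SCHEME = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def broken_relative_links(markdown_file: Path) -> list[str]:
    """Return relative link targets whose file part does not exist on disk."""
    text = markdown_file.read_text(encoding="utf-8")
    broken = []
    for target in _MD_LINK.findall(text):
        # Skip absolute URLs and same-file anchors.
        if _URL_SCHEME.match(target) or target.startswith("#"):
            continue
        path_part = target.split("#", 1)[0]
        if not (markdown_file.parent / path_part).exists():
            broken.append(target)
    return broken
```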

This pull request and its description were written by Claude.

Phase 1 of #73's TODO #1b. Adds references/fm-api-endpoints.md with the
curated Foundation Model API endpoint table (chat/instruct + embedding
models) from databricks-solutions/ai-dev-kit's model-serving skill,
plus common defaults and query examples (CLI + SDK).

Stripped: the cloud/language prefix on the docs link, and the leftover
MCP-tool references in the source. The endpoint table itself is static
catalog data — no MCP coupling.

SKILL.md updates:
- bump version to 0.2.0
- point Endpoint Types table at the new reference
- point the Foundation Model discovery bullet at the new reference

Subsequent phases (separate PRs / commits) port the remaining dev-side
content: classical-ml autolog patterns, Custom PyFunc signatures,
ResponsesAgent with the create_text_output_item gotcha, UCFunctionToolkit
+ VectorSearchRetrieverTool resource passthrough.

Co-authored-by: Isaac
Aligns the verbatim a-d-k port with the live docs.databricks.com
supported-models page (validated via WebFetch on 2026-05-26):

ADDED (missing from a-d-k snapshot):
- databricks-claude-opus-4-7 (now most capable Claude)
- databricks-gpt-5-5-pro, 5-5
- databricks-gpt-5-4, 5-4-mini, 5-4-nano
- databricks-gpt-5-3-codex, 5-2-codex
- databricks-gemini-3-1-flash-lite, 3-5-flash
- databricks-qwen35-122b-a10b (Preview)

REMOVED (retired, no longer in docs):
- databricks-claude-3-7-sonnet
- databricks-meta-llama-3-1-405b-instruct

UPDATED notes:
- claude-opus-4-6 no longer "Most capable"
- gpt-5-2 no longer "Latest"
- gpt-5-1-codex-{max,mini} + gpt-5-2-codex marked retiring 2026-07-16
- gemini-3-pro marked retired 2026-03-26 with redirect through 2026-06-07
- Several Gemini / Codex endpoints annotated with cross-geo requirement
- qwen3-next-80b annotated as Preview

OPENING PARAGRAPH:
- "available in every workspace" -> "available in supported Model Serving
  regions"; calls out cross-geo requirement for several endpoints

NOT TOUCHED (out of scope: not docs-validatable from supported-models page):
- served_entities[].entity_name guidance (line 3 second half)
- SKILL.md "system.ai.* catalog" claim on the pay-per-token row
These remain as in the a-d-k snapshot and should be revisited if/when
docs cover them directly.

Test plan: `scripts/skills.py validate` -> "Everything is up to date";
`scripts/skills.py generate` -> only refreshes manifest.json timestamps.

Co-authored-by: Isaac
@jamesbroadhead jamesbroadhead force-pushed the jb/model-serving-port-phase1 branch from c9015d8 to d400eff Compare May 26, 2026 09:47
@QuentinAmbard

@jamesbroadhead I suspect this is also coming from main from the content I see? The experimental skill is https://github.com/databricks-solutions/ai-dev-kit/blob/experimental/databricks-skills/databricks-ml-training-serving/SKILL.md
Should I also open another merge PR for suggestion?

@jamesbroadhead
Contributor Author

Hi @QuentinAmbard — Claude here, working with James.

You're right, and I owe you (and the PR description) a correction. I checked both branches:

  • experimental/databricks-skills/databricks-model-serving/ does not exist. The skill was renamed/restructured to databricks-ml-training-serving/ on experimental (3 files: SKILL.md + 1-custom-pyfunc.md + 2-genai-agents.md).
  • main/databricks-skills/databricks-model-serving/ — does exist, with 9 reference files including the static FM endpoint catalog this PR ports.

Content-wise it's a clean fingerprint match for main: the chat/instruct + embedding tables (rows like databricks-claude-opus-4-6 | Anthropic | Most capable, 1M context, the "available in every workspace" prose, the table headings) are verbatim from main:databricks-model-serving/SKILL.md and don't appear anywhere on the experimental side. So the PR description's claim of porting from experimental/databricks-skills/databricks-model-serving is wrong — that path was never on experimental.

The bigger issue is philosophical: the experimental databricks-ml-training-serving/SKILL.md deliberately rejects a static catalog:

"Pay-per-token, pre-provisioned in every workspace. New models land regularly and a static skill list goes stale fast — always list at runtime instead of hard-coding names."

It ships databricks serving-endpoints list | jq ... plus runtime-resolved defaults ("highest-numbered Claude Sonnet", "highest-numbered -codex-max") instead of a hard-coded table. So even with the 2026-05-26 doc-validated catalog refresh layered on, what this PR adds is the opposite of the current experimental guidance.
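
A runtime-resolved default such as "highest-numbered Claude Sonnet" could be picked along these lines. This is a sketch under assumptions, not the experimental skill's actual rule: it treats the hyphen-separated integer tokens in an endpoint name as its version, the function name is hypothetical, and the names in the usage note are illustrative samples only.

```python
from typing import Iterable, Optional


def pick_family_default(names: Iterable[str], family_tokens: Iterable[str]) -> Optional[str]:
    """Return the highest-versioned endpoint name containing all family tokens."""
    wanted = set(family_tokens)

    def version(name: str) -> tuple[int, ...]:
        # "databricks-gpt-5-1-codex-max" -> (5, 1)
        return tuple(int(tok) for tok in name.split("-") if tok.isdecimal())

    candidates = [n for n in names if wanted <= set(n.split("-"))]
    return max(candidates, key=version, default=None)
```

For example, given `["databricks-claude-sonnet-4", "databricks-claude-sonnet-4-5"]` and the tokens `["claude", "sonnet"]`, the sketch picks the `4-5` name because `(4, 5)` sorts above `(4,)`.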

I'll rework this PR to actually align with the experimental skill — replace the static catalog reference file with the runtime-list snippet + the runtime-resolved defaults, and fix the PR description. The doc-validated catalog work isn't wasted; it just shouldn't be how the stable skill steers callers.

On "Should I also open another merge PR for suggestion?" — happy to coordinate. If you mean a PR upstream into a-d-k's experimental to evolve databricks-ml-training-serving further, please go ahead; we'll re-sync into databricks-agent-skills after. If you meant something else, let me know what you had in mind.

…ot static catalog

Quentin pointed out (PR #84) that the prior two commits actually ported
from `main:databricks-skills/databricks-model-serving/`, not
`experimental:databricks-skills/databricks-ml-training-serving/` as the
PR description claimed. The two skills take opposite approaches:

  - `main` ships a static catalog table of FM API endpoint names.
  - `experimental` deliberately rejects that ("a static skill list goes
    stale fast — always list at runtime instead of hard-coding names")
    and ships a `databricks serving-endpoints list | jq ...` one-liner
    plus runtime-resolved defaults (highest-numbered Claude Sonnet for
    agents, highest-numbered `-codex-max` for code).

Re-port to match the experimental philosophy:

  - `references/fm-api-endpoints.md`: replace the static catalog with the
    runtime-list snippet (filtered by `databricks-` name prefix AND
    `system.ai.*` served entity, to exclude non-FM endpoints sharing the
    prefix), runtime-resolved family defaults, and CLI + SDK query
    examples that use a placeholder endpoint name rather than a hard-coded
    model.

  - `SKILL.md`: update the Endpoint Types row + the Foundation-Model
    discovery bullet to reframe the reference as "discover at runtime"
    rather than "curated table". Version stays at 0.2.0 (frontmatter
    unchanged → manifest unchanged).

The 2026-05-26 catalog refresh in the previous commit is dropped here:
the experimental skill's point is that no static table is the right
shape, so curating one against docs.databricks.com isn't useful for the
stable skill either.

Co-authored-by: Isaac
…ental port

Previous commit (c148500) restated the experimental section in my own
words and added a "Querying" section + provisioned-throughput aside +
docs-link gloss that aren't in the upstream skill. The PR's stated goal
is to port from experimental — do an actual port, not a paraphrase.

`references/fm-api-endpoints.md` now mirrors the
`## Foundation Model API endpoints` section of
`experimental:databricks-ml-training-serving/SKILL.md` verbatim
(heading promoted from `##` to `#` since this is a standalone file):
intro paragraph + the `databricks serving-endpoints list | jq ...`
one-liner + the family-based default-picking rule. Nothing else.

Also trim the SKILL.md discovery bullet back toward its original
shape — link to the reference file for the runtime-list snippet, then
the same `system.ai` / `serving-endpoints list` / `get-open-api`
alternatives that were already there.

Co-authored-by: Isaac
@jamesbroadhead jamesbroadhead changed the title skills(model-serving): port dev-side content from a-d-k (phase 1: FM API endpoints) skills(model-serving): port FM API endpoints section from a-d-k experimental May 26, 2026
…ntal

Expands the port from the FM-endpoints-only scope to cover every
section of `experimental:databricks-ml-training-serving/`. Mirrors
the experimental skill's 3-file structure 1:1 into stable's
`references/` directory; the standalone fm-api-endpoints.md added in
earlier commits goes away (its content lives inline in
training-and-serving.md exactly as it does in experimental's SKILL.md).

Added (all verbatim ports, mechanical adjustments only):

  references/training-and-serving.md
    Ports experimental SKILL.md content. Mechanical changes only:
    frontmatter stripped (destination is a reference file, not a
    SKILL.md); `1-custom-pyfunc.md` → `custom-pyfunc.md`,
    `2-genai-agents.md` → `genai-agents.md` (filename renames);
    `../<skill>/SKILL.md` → `../../<skill>/SKILL.md` (one more level
    of nesting since this file is in references/ rather than at the
    skill root). Content covers: canonical train/register/serve flow,
    `mlflow.{sklearn,xgboost,…}.autolog()` patterns, UC alias-based
    promotion, batch scoring via `spark_udf`, real-time endpoint
    create + zero-downtime version swap, `state.ready` vs
    `state.config_update` poll-both gotcha, `jobs submit --no-wait`
    serverless deploy pattern, Foundation Model API endpoints
    runtime-list, and the full gotchas trap-table.

  references/custom-pyfunc.md
    Ports experimental 1-custom-pyfunc.md verbatim.
    Mechanical change: `[SKILL.md]` → `[training-and-serving.md]`
    where the original cross-referenced its parent SKILL.md.
    Content: file-based PyFunc ("Models from Code"),
    `infer_signature`, `code_paths`, pre-deploy validation via
    `mlflow.models.predict(env_manager="uv")`.

  references/genai-agents.md
    Ports experimental 2-genai-agents.md verbatim.
    Mechanical changes: cross-skill paths bumped one level deeper;
    `[SKILL.md]` → `[training-and-serving.md]`. Content covers:
    `ResponsesAgent` interface, LangGraph agent with
    `UCFunctionToolkit` + `VectorSearchRetrieverTool`, the
    `create_text_output_item` raw-dict-silently-fails gotcha, the
    `resources=[...]` passthrough-auth list (DatabricksServingEndpoint,
    DatabricksFunction, DatabricksVectorSearchIndex, DatabricksLakebase),
    async deploy via `agents.deploy()` from a serverless job, query
    via CLI and OpenAI-compatible client.

Removed:

  references/fm-api-endpoints.md
    Standalone file from earlier commits; its content lives inline
    in training-and-serving.md exactly as it does in experimental's
    SKILL.md, so the deliberate split is no longer needed.

Stable SKILL.md updates (minimal, ops-focus preserved):

  - FM-endpoint link targets updated from `references/fm-api-endpoints.md`
    to `references/training-and-serving.md#foundation-model-api-endpoints`
    in the Endpoint Types table row and the FM-discovery bullet.
  - New `### Develop & deploy new models` subsection under "What's Next"
    with a 3-row table pointing at the new dev-side references, framed
    as "this skill is ops-focused; for the dev-side flow, see below".

Manifest regenerated.

Co-authored-by: Isaac
@jamesbroadhead jamesbroadhead changed the title skills(model-serving): port FM API endpoints section from a-d-k experimental skills(model-serving): port dev-side content from a-d-k experimental (TODO #1b) May 26, 2026
- The mechanical `../` → `../../` rewrite in the verbatim port assumed
  every peer skill is stable, but 4 of them live in `experimental/`.
  `../../<skill>/SKILL.md` resolved to `skills/<skill>/SKILL.md` which
  does not exist for `databricks-agent-bricks`, `databricks-mlflow-evaluation`,
  `databricks-vector-search`, `databricks-unity-catalog`. Repointed to
  `../../../experimental/<skill>/SKILL.md`. `databricks-jobs` link unchanged
  (it's stable).

- SKILL.md frontmatter `description` only described the ops surface, so
  agents wouldn't route dev-side asks (train, register, PyFunc, ResponsesAgent)
  to this skill. Broadened to cover both ops and the new dev surface.

- Version bumped 0.2.0 → 0.3.0 + manifest regenerated.

Co-authored-by: Isaac
@jamesbroadhead jamesbroadhead changed the title skills(model-serving): port dev-side content from a-d-k experimental (TODO #1b) skills(model-serving): merge dev-side training/agent flows from a-d-k experimental May 27, 2026

@QuentinAmbard QuentinAmbard left a comment

nice let's merge this one, I'll send a followup PR on top!


## Deploy (async job, ~15 min)

`databricks.agents.deploy()` blocks for ~15 minutes — don't run it inline from the CLI. Submit as a serverless job so the chat session doesn't hold the connection.
Member

This is great - should we add something about how agents can check whether a serverless job has already been submitted for the deploy?

Contributor Author

(Claude here.)

Good call — added in 8c8a1b3. Two cheap checks just before the submit:

  1. databricks jobs list-runs --active-only filtered on run_name == "deploy_<model>" to catch an already-in-flight deploy.
  2. databricks serving-endpoints get <endpoint_name> to skip the redeploy if the endpoint already exists on the right version.

If either hits, the recipe now says to follow the existing run with jobs get-run instead of submitting a new one.
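
The decision logic behind those two checks might be sketched as below. It is illustrative only and not the text added in 8c8a1b3. It assumes the CLI output has already been parsed into Python objects, that active runs carry `run_name` and `run_id`, and that the endpoint JSON exposes `config.served_entities[].entity_version`. The function name and return labels are hypothetical.

```python
from typing import Any, Optional


def decide_deploy_action(
    active_runs: list[dict[str, Any]],
    endpoint: Optional[dict[str, Any]],
    model_name: str,
    target_version: str,
) -> tuple[str, Optional[Any]]:
    """Return ("follow", run_id), ("skip", None), or ("submit", None)."""
    # Check 1: a deploy for this model is already in flight.
    run_name = f"deploy_{model_name}"
    for run in active_runs:
        if run.get("run_name") == run_name:
            return ("follow", run.get("run_id"))

    # Check 2: the endpoint already serves the target version.
    if endpoint is not None:
        entities = (endpoint.get("config") or {}).get("served_entities") or []
        if any(str(e.get("entity_version")) == str(target_version) for e in entities):
            return ("skip", None)

    return ("submit", None)
```

Passing `endpoint=None` stands in for the case where `serving-endpoints get` reports that the endpoint does not exist yet.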

@jamesbroadhead jamesbroadhead enabled auto-merge (squash) May 27, 2026 11:12
Per @simonfaltum review: before resubmitting a deploy serverless job,
agents should check whether a run is already in flight (active job
runs filtered on run_name) or whether the target endpoint already
exists in the right state. Avoids wasting ~15 min of serverless and
racing for the same endpoint name.

Co-authored-by: Isaac
jamesbroadhead added a commit that referenced this pull request May 27, 2026
…apx Related Skills entry

`databricks-app-apx` was the FastAPI+React stack referenced from
ai-dev-kit's `databricks-apps-python` skill. It has been removed
upstream (a-d-k is deprecated; the apx-on-CLI flow merged into the
stable `databricks-apps` skill via #84/#73). The "Related Skills"
bullet is the last dangling reference inside this repo.

This PR was prepared by Claude.
@QuentinAmbard

QuentinAmbard commented May 28, 2026

Stacked a follow-up on this in #110 — adds a separate experimental/databricks-ml-training skill for the dev-side content (MLflow autolog, UC registration, custom PyFunc, hand-rolled ResponsesAgent, batch scoring via spark_udf) that complements the model-serving content landed here. Also closes five small gaps where non-obvious serving behavior fell through during the original a-d-k port (MLflow Deployments client tags= top-level + served_model_name derivation, zero-downtime version-swap pattern, two-state-field readiness rationale, classical-ML dataframe_records query shape, Serving-UI "Owned by me" SP filter).

#110 includes the commits from this PR at its base — please merge this one first, then #110 will rebase cleanly onto the new main.

- Drop ../../../experimental/... cross-skill links that 404 when installed
  (skills install flat under ~/.claude/skills/, not under stable/ vs
  experimental/). Use plain skill-name references instead.
- Replace ai-dev-kit-specific tag examples ("aidevkit_project") with a
  neutral "project": "demo" so a d-a-s skill doesn't bleed a-d-k convention.
- Tighten SKILL.md description from ~870 chars to ~290 chars, matching the
  convention being established in PR #107.

Co-authored-by: Isaac

3 participants