Bump Databricks/MLflow SDKs, switch to Claude, fix UC catalog wiring by kwulffert23 · Pull Request #14 · databricks-solutions/agentic-customer-support

kwulffert23 · 2026-04-27T21:43:47Z

Branch: `sdk-bump-and-fixes`

This branch bumps the Databricks/MLflow SDKs to current versions, switches the agent's LLMs from dbdemos-openai-gpt4 to Claude (Opus 4.7 / Sonnet 4.6), and fixes several pre-existing latent bugs surfaced by deploying end-to-end on a dev Azure workspace.

Motivation

The existing pin, mlflow[databricks]==3.9.0rc0, was a release candidate; several APIs the project uses (mlflow.genai.scorers, make_judge, mlflow.genai.evaluate) only stabilised in MLflow 3.10/3.11.
databricks-sdk==0.73.0 was ~30 minor versions behind the current 0.105.0.
Other Databricks packages (databricks-agents, databricks-vectorsearch, databricks-openai, databricks-mcp) were all materially behind their current releases.
Switching to Claude as the agent LLM exposed coupling to OpenAI behaviour in the codebase that needed unblocking.

SDK upgrade

mlflow[databricks]      3.9.0rc0 -> >=3.11,<4    (resolved 3.11.1 — off the RC)
databricks-sdk          0.73.0   -> >=0.105,<1   (resolved 0.105.0)
databricks-agents       1.9.3    -> >=1.9.4,<2   (resolved 1.9.4 — current PyPI max)
databricks-vectorsearch >=0.62   -> >=0.66,<1    (resolved 0.67)
databricks-openai       0.6.1    -> >=0.15,<0.16 (resolved 0.15.0)
databricks-mcp          0.4.0    -> >=0.9,<0.10  (resolved 0.9.0)

Notable transitive cleanup: langchain / langgraph removed from the lock — newer databricks-agents no longer pulls them.

Files: pyproject.toml, requirements.txt, uv.lock.

Source-of-truth report from the sdk-reviewer subagent: SDK_REVIEW.md.

Agent LLM migration to Claude

| Agent | Endpoint |
| --- | --- |
| supervisor | `databricks-claude-opus-4-7` |
| account, billing, product, tech_support | `databricks-claude-sonnet-4-6` |

Files: configs/agents/{supervisor,account,billing,product,tech_support}.yaml.

Three code-level adjustments were needed for Claude compatibility:

Removed the temperature parameter from all five agent configs. Databricks Foundation Model API rejects temperature for the Claude 4.x family.
Generalised the Claude-specific tool-call branch in telco_support_agent/agents/utils/message_formatting.py from if llm_endpoint == "databricks-claude-3-7-sonnet" to if "claude" in llm_endpoint.lower(). Claude requires content: None alongside tool_calls rather than the literal string "tool call".
Updated billing/account/product system prompts to make the auto-injection of customer explicit. Tool specs strip the customer parameter from what the LLM sees (per tool_injection.py:23-72); the runtime auto-injects it. Claude follows prompt instructions more literally than GPT-4 and was asking the user for the customer ID; the new prompts explicitly say "the customer ID is auto-injected; never ask the user for it."
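
The second adjustment can be illustrated with a minimal sketch. The function name `format_tool_call_message` and the message shape are assumptions for illustration, not the actual contents of message_formatting.py; only the endpoint check and the two `content` values come from the description above.

```python
from typing import Any


def format_tool_call_message(
    llm_endpoint: str, tool_calls: list[dict[str, Any]]
) -> dict[str, Any]:
    """Build an assistant message carrying tool calls (hypothetical helper)."""
    # Match any Claude endpoint instead of one hardcoded endpoint name.
    is_claude = "claude" in llm_endpoint.lower()
    return {
        "role": "assistant",
        # Claude expects content to be None next to tool_calls;
        # the non-Claude path keeps the literal string "tool call".
        "content": None if is_claude else "tool call",
        "tool_calls": tool_calls,
    }
```

A substring check keeps future Claude endpoint names working without another code change, at the cost of also matching any unrelated endpoint whose name happens to contain "claude".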

UC function registration bug

tools/registry.py:_register_domain_functions previously triggered registration via importlib.import_module(...) and relied on module-level side-effects to register the SQL UDFs. But each tools/<domain>/functions.py had its register_*(uc_config) calls inside if __name__ == "__main__":, so importing was a no-op and 0/8 UC functions ever got created. Symptom at runtime: Routine or Model 'telco_customer_support_dev.agent.get_billing_info' does not exist.

Fix: each domain module now exposes a top-level register_all(uc_config) function. _register_domain_functions calls module.register_all(uc_config) after import. The if __name__ == "__main__": block is preserved for direct execution.

Files: telco_support_agent/tools/{billing,account,product}/functions.py, telco_support_agent/tools/registry.py.
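
A self-contained sketch of why importing alone registered nothing, and how an explicit `register_all` call fixes it. The fake module source, the module name `fake_billing_functions`, and the dict-shaped `uc_config` are stand-ins invented for this example; the real modules and UCConfig class differ.

```python
import importlib
import sys
import types

# Stand-in for a tools/<domain>/functions.py file (hypothetical contents).
MODULE_SOURCE = '''
CALLS = []

def register_all(uc_config):
    CALLS.append(uc_config["agent_catalog"])

if __name__ == "__main__":
    # Runs only under direct execution, so importing skips it.
    register_all({"agent_catalog": "direct_run"})
'''


def install_fake_module(name: str) -> types.ModuleType:
    """Execute MODULE_SOURCE as if it were imported under `name`."""
    module = types.ModuleType(name)  # __name__ is `name`, not "__main__"
    exec(MODULE_SOURCE, module.__dict__)
    sys.modules[name] = module
    return module


def register_domain_functions(module_names: list[str], uc_config: dict) -> None:
    """Import each domain module, then register explicitly."""
    for name in module_names:
        module = importlib.import_module(name)  # import alone registers nothing
        module.register_all(uc_config)


fake = install_fake_module("fake_billing_functions")
calls_after_import = list(fake.CALLS)  # still empty: the guard did not fire
register_domain_functions(
    ["fake_billing_functions"], {"agent_catalog": "telco_customer_support_dev"}
)
```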

Data catalog plumbing

UCConfig.data_catalog defaulted to telco_customer_support_prod (per a "data always comes from prod" assumption in the schema), and the three notebook config classes only set agent_catalog from the widget — not data_catalog. The widget value was telco_customer_support_dev for this workspace, but data_catalog silently fell back to prod. Two failure modes resulted:

CREATE OR REPLACE FUNCTION referenced telco_customer_support_prod.gold.customers (missing) → registration silently failed.
The deployed agent's uc_config.yaml artifact had data_catalog=prod → RestException: TABLE_DOES_NOT_EXIST: Table 'telco_customer_support_prod.gold.knowledge_base_index'.

Two fixes:

telco_support_agent/config/notebooks.py: all three to_uc_config() methods (RunEvalsConfig, LogRegisterConfig, DeployAgentConfig) now also pass data_catalog=self.uc_catalog.
telco_support_agent/ops/logging.py: _get_supervisor_resources(...) now accepts an optional data_catalog parameter (falling back to uc_catalog when not provided), instead of hardcoding telco_customer_support_prod. The call site in log_agent passes uc_config.data_catalog.

For multi-environment setups that genuinely want shared prod data, callers can still pass an explicit data_catalog.
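
The fallback and its fix can be sketched with plain dataclasses. `NotebookConfig` and `to_uc_config_before` are simplified, hypothetical stand-ins for the three notebook config classes and their previous behaviour; the real classes carry more fields and may not be dataclasses.

```python
from dataclasses import dataclass


@dataclass
class UCConfig:
    agent_catalog: str
    # Default mirrors the "data always comes from prod" assumption.
    data_catalog: str = "telco_customer_support_prod"


@dataclass
class NotebookConfig:
    """Stand-in for RunEvalsConfig / LogRegisterConfig / DeployAgentConfig."""

    uc_catalog: str  # value of the notebook widget

    def to_uc_config_before(self) -> UCConfig:
        # Old behaviour: data_catalog is never set, so it falls back to prod.
        return UCConfig(agent_catalog=self.uc_catalog)

    def to_uc_config(self) -> UCConfig:
        # New behaviour: data_catalog follows the widget value.
        return UCConfig(
            agent_catalog=self.uc_catalog, data_catalog=self.uc_catalog
        )
```

With `uc_catalog="telco_customer_support_dev"`, the old method yields a config whose `data_catalog` is still the prod default, while the new one keeps both catalogs on dev.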

Deploy idempotency

databricks.agents.deploy() raises BadRequest: Cannot create 2+ served entities with the same name when the same (model, version, endpoint) tuple is already deployed, so the notebook job failed on every rerun after a successful deploy.

Fix: telco_support_agent/ops/deployment.py:deploy_agent now lists existing deployments via agents.get_deployments(model_name=...) and skips the agents.deploy() call when a matching (endpoint_name, model_version) is found. It reuses the existing deployment and still runs the rest of the function (waiting for ready, permissions, instructions). New versions still go through the normal deploy path.
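
The matching step might look roughly like the sketch below. `Deployment` is a hypothetical stand-in with only the two attributes named above, and `find_existing_deployment` is an illustrative helper, not the function in deployment.py; comparing versions as strings is a defensive assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Deployment:
    """Hypothetical stand-in with only the fields used for matching."""

    endpoint_name: str
    model_version: str


def find_existing_deployment(
    deployments: Iterable[Deployment], endpoint_name: str, model_version: str
) -> Optional[Deployment]:
    """Return the deployment matching (endpoint_name, model_version), if any."""
    for deployment in deployments:
        same_endpoint = deployment.endpoint_name == endpoint_name
        # Compare as strings in case one side holds an int version.
        same_version = str(deployment.model_version) == str(model_version)
        if same_endpoint and same_version:
            return deployment
    return None
```

In deploy_agent, the list would come from agents.get_deployments(model_name=...), and agents.deploy() would only be called when this lookup returns None.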

Bundle / app routing

databricks.yml: dev/staging/prod targets all routed to the Azure workspace (adb-7405608427441525.5.azuredatabricks.net).
telco_support_agent/ui/app_dev.yaml: DATABRICKS_HOST updated to the Azure workspace URL; MLFLOW_EXPERIMENT_ID updated to 2366216092548657 (the experiment for /Shared/telco_support_agent/dev/dev_telco_support_agent on Azure).

The UI app staging/prod yamls were left untouched (still on e2-demo-west) since this branch only validates dev. Apply the same edits when bringing those environments online.

App service principal `CAN_QUERY` on the serving endpoint

Databricks Apps run as a workspace-managed service principal. With no explicit grant, the app's calls to /serving-endpoints/dev-telco-customer-support-agent/invocations returned HTTP 403. The fix was applied out-of-band:

databricks serving-endpoints update-permissions <endpoint_id> \
  --json '{"access_control_list":[{"service_principal_name":"<app_sp_client_id>","permission_level":"CAN_QUERY"}]}' \
  --profile fe-vm-azure

This is workspace state — not in git. If anyone redeploys the endpoint from scratch (agents.delete_deployment then agents.deploy) the grant resets. A follow-up could bake this into _set_permissions in telco_support_agent/ops/deployment.py so it's reproducible from the bundle.

Note: a resources: block in app_dev.yaml was tried first and reverted — that yaml controls the runtime container's command/env, not app-level resource permissions; the latter is a separate Apps API.
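
As a possible starting point for that follow-up, the ACL body from the CLI call can be built in code. This is only a sketch: `can_query_acl` is a hypothetical helper, and how `_set_permissions` would submit the payload is left open.

```python
import json


def can_query_acl(app_sp_client_id: str) -> dict:
    """Build the access-control body that grants CAN_QUERY to one service principal."""
    return {
        "access_control_list": [
            {
                "service_principal_name": app_sp_client_id,
                "permission_level": "CAN_QUERY",
            }
        ]
    }


# Compact separators reproduce the JSON string passed to --json above.
payload = json.dumps(can_query_acl("<app_sp_client_id>"), separators=(",", ":"))
```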

`dbdemos_tracker` API change

The UI middleware was logging Tracking error: ... on every request. The installed dbdemos_tracker API moved demo_name from Tracker.track_app_view() to the Tracker(...) constructor. Updated signatures:

Tracker(org_id, email=None, demo_name=None, demo_catalog_id=None)
Tracker.track_app_view(self, user_email, app_path)

telco_support_agent/ui/backend/app/main.py updated accordingly.
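
The call-site change can be sketched against a stand-in class. The `Tracker` below only mirrors the two signatures quoted above and records calls in memory; it is not the real dbdemos_tracker implementation, and the org ID, demo name, email, and path are placeholder values.

```python
class Tracker:
    """Stand-in mirroring the quoted signatures; not the real class."""

    def __init__(self, org_id, email=None, demo_name=None, demo_catalog_id=None):
        self.org_id = org_id
        self.email = email
        self.demo_name = demo_name
        self.demo_catalog_id = demo_catalog_id
        self.views = []

    def track_app_view(self, user_email, app_path):
        # demo_name now comes from the instance, not from the call.
        self.views.append((self.demo_name, user_email, app_path))


# Old call shape: demo_name passed per view, which now raises TypeError.
# tracker.track_app_view(user_email, app_path, demo_name="...")

# New call shape: demo_name is set once on the constructor.
tracker = Tracker("org-123", demo_name="telco-support-agent")
tracker.track_app_view("user@example.com", "/chat")
```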

Verification

Bundle deploy: databricks bundle deploy -t dev --profile fe-vm-azure completes cleanly.
Job run: [dev kyra_wulffert] telco_log_register_deploy_agent_dev — UC functions register (8/8), model logs to UC, endpoint deploys.
Endpoint query (raw API to bypass CLI schema filtering for custom_inputs):
```
databricks api post /serving-endpoints/dev-telco-customer-support-agent/invocations \
  --json @/tmp/req.json --profile fe-vm-azure
```
with body {"input":[{"role":"user","content":"What was the customer's data in May?"}],"custom_inputs":{"customer":"CUS-10001"}} returns real billing data via Claude → tool call → UC function → Delta read.
UI app: cd telco_support_agent/ui && ./deploy.sh dev fe-vm-azure builds the React frontend and deploys. App reachable at https://telco-support-agent-dev-7405608427441525.5.azure.databricksapps.com/.
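
The request body used in the endpoint query above can be generated rather than hand-written. A minimal sketch follows; `build_request` and `write_request` are hypothetical helpers, and the temporary directory stands in for /tmp.

```python
import json
import tempfile
from pathlib import Path


def build_request(question: str, customer_id: str) -> dict:
    """Build an invocation body with the customer passed via custom_inputs."""
    return {
        "input": [{"role": "user", "content": question}],
        "custom_inputs": {"customer": customer_id},
    }


def write_request(body: dict, directory: str) -> Path:
    """Write the body to req.json so it can be passed as --json @<path>."""
    path = Path(directory) / "req.json"
    path.write_text(json.dumps(body))
    return path


with tempfile.TemporaryDirectory() as tmp:
    req_path = write_request(
        build_request("What was the customer's data in May?", "CUS-10001"), tmp
    )
    round_tripped = json.loads(req_path.read_text())
```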

Tooling notes encountered

databricks CLI 0.291.0 hit an expired Terraform PGP signing key; upgrading to 0.298.0 via brew upgrade databricks resolved it. The upgrade was needed before bundles could be deployed cloud-side.
databricks serving-endpoints query strips fields not in its OpenAI-shaped schema — custom_inputs was silently dropped. Use databricks api post /serving-endpoints/<name>/invocations for full-fidelity requests.
The UI app's frontend is built/staged by telco_support_agent/ui/deploy.sh (it runs npm run build and copies frontend/dist → static/). The bundle deploy alone uploads raw source and starts the app in API-only mode (404 at /).

Follow-ups left for later

telco_support_agent/data/generators/knowledge_base.py:346 still references databricks-claude-3-7-sonnet (used at synthetic-data-gen time, not at agent runtime — left alone here).
telco_support_agent/ui/app_{staging,prod}.yaml not updated for Azure (the user's stash holds those changes).
Until the dbdemos_tracker fix described above is deployed, the dev UI app may emit a non-fatal warning, Tracker.track_app_view() got an unexpected keyword argument 'demo_name'. This is frontend demo telemetry and doesn't affect functionality.

Commit log on this branch

(Most recent first.)

Move demo_name kwarg to dbdemos Tracker constructor
Grant the dev UI app CAN_QUERY on the serving endpoint
Add BRANCH_NOTES.md summarising the SDK migration + fixes
Point dev UI app at the Azure MLflow experiment
Tell sub-agents the customer ID is auto-injected
Make deploy_agent idempotent
Route dev UI app to Azure workspace
Bump Databricks/MLflow SDKs, switch to Claude, fix UC catalog wiring

(Plus a follow-up commit revising this doc and removing a no-op resources: block from app_dev.yaml.)

This pull request and its description were written by Isaac.

SDK upgrade
- mlflow[databricks] 3.9.0rc0 -> >=3.11,<4 (off the RC)
- databricks-sdk 0.73 -> >=0.105,<1
- databricks-agents 1.9.3 -> >=1.9.4,<2
- databricks-vectorsearch >=0.62 -> >=0.66,<1
- databricks-openai 0.6.1 -> >=0.15,<0.16
- databricks-mcp 0.4.0 -> >=0.9,<0.10

Agent LLM
- supervisor on databricks-claude-opus-4-7
- account/billing/product/tech_support on databricks-claude-sonnet-4-6
- drop `temperature` (Claude 4.x on Databricks rejects it)
- generalize tool-call formatting in message_formatting.py to match any Claude endpoint instead of a hardcoded sonnet-3-7 string

UC function registration
- expose register_all(uc_config) in tools/{billing,account,product}/functions.py; registration was previously gated behind `if __name__ == "__main__":` so importlib.import_module(...) was a no-op and 0/8 functions ever got created
- registry now calls module.register_all(uc_config) after import

Data catalog flow
- LogRegisterConfig/RunEvalsConfig/DeployAgentConfig.to_uc_config() now also set data_catalog from the widget's uc_catalog (was defaulting to a hardcoded 'telco_customer_support_prod' that doesn't exist in this workspace)
- _get_supervisor_resources() now takes data_catalog and stops hardcoding the prod catalog for vector search index resources

Other
- SDK_REVIEW.md: migration report from the sdk-reviewer subagent
- databricks.yml: route dev/staging/prod targets to the Azure workspace

Co-authored-by: Isaac

DATABRICKS_HOST was pointing at e2-demo-west; the deployed agent endpoint lives on adb-7405608427441525.5.azuredatabricks.net, so the UI app needs to call there. Co-authored-by: Isaac

Skip agents.deploy() when the (model, version, endpoint) tuple is already deployed. Avoids the name-collision error from agents.deploy() on reruns of the bundle job after a successful deploy. Co-authored-by: Isaac

Claude follows prompt instructions more literally than the previous GPT-4 endpoint, so it asked the user for the customer ID even though the runtime auto-injects it into every tool call (per ToolParameterInjector.prepare_tool_spec_for_llm). Replace the vague "Customer ID is always present" line with an explicit directive: the customer ID won't appear in the tool's parameter list, never ask the user for it, call the tool immediately. Co-authored-by: Isaac

Resolved /Shared/telco_support_agent/dev/dev_telco_support_agent on adb-7405608427441525.5 to experiment_id 2366216092548657. The previous ID belonged to e2-demo-west and would silently drop UI traces. Co-authored-by: Isaac

PR-description-style summary of everything in sdk-bump-and-fixes: SDK bumps, Claude switch with the three compatibility tweaks, UC function registration bug, data_catalog plumbing, deploy idempotency, bundle/app routing, verification, follow-ups. Co-authored-by: Isaac

Databricks Apps run as their own service principal; without an explicit serving_endpoint resource declaration, calls to /serving-endpoints/<name>/invocations return 403. Adding a CAN_QUERY resource binding for dev-telco-customer-support-agent so the app can stream chat responses. Co-authored-by: Isaac

The current dbdemos_tracker API takes demo_name on the Tracker constructor, not on track_app_view (signature is Tracker(org_id, email=None, demo_name=None, demo_catalog_id=None) and track_app_view(self, user_email, app_path)). Without this the middleware logged on every request — first that demo_name was an unexpected kwarg, then that demo_name was required. Co-authored-by: Isaac

- Remove the resources: block from app_dev.yaml that I tried as a permission grant; that yaml is the runtime container config and doesn't affect app-level permissions. The actual fix was the serving-endpoints update-permissions call documented in BRANCH_NOTES.md.
- Update BRANCH_NOTES.md to cover the CAN_QUERY grant on the agent endpoint, the dbdemos_tracker constructor API change, and refresh the commit log section.

Co-authored-by: Isaac

kwulffert23 · 2026-04-27T21:47:21Z

@jeannefukumaru could you take a look when you have a chance? I couldn't add you as a formal reviewer due to permissions.

kwulffert23 added 9 commits April 27, 2026 20:34

Route dev UI app to Azure workspace

208461f

Make deploy_agent idempotent

2cc4299

Point dev UI app at the Azure MLflow experiment

66759aa

kwulffert23 marked this pull request as ready for review April 27, 2026 21:45

kwulffert23 marked this pull request as draft April 28, 2026 05:31

kwulffert23 wants to merge 9 commits into databricks-solutions:main from kwulffert23:sdk-bump-and-fixes