Bump Databricks/MLflow SDKs, switch to Claude, fix UC catalog wiring#14

Draft
kwulffert23 wants to merge 9 commits into
databricks-solutions:mainfrom
kwulffert23:sdk-bump-and-fixes

@kwulffert23

Branch: sdk-bump-and-fixes

This branch bumps the Databricks/MLflow SDKs to current versions, switches the agent's LLMs from dbdemos-openai-gpt4 to Claude (Opus 4.7 / Sonnet 4.6), and fixes several pre-existing latent bugs surfaced by deploying end-to-end on a dev Azure workspace.

Motivation

  • Existing mlflow[databricks]==3.9.0rc0 was pinned to a release candidate; several APIs the project uses (mlflow.genai.scorers, make_judge, mlflow.genai.evaluate) only stabilised in MLflow 3.10/3.11.
  • databricks-sdk==0.73.0 was ~30 minor versions behind the current 0.105.0.
  • The other Databricks packages (databricks-agents, databricks-vectorsearch, databricks-openai, databricks-mcp) were all materially behind their current releases.
  • Switching to Claude as the agent LLM exposed coupling to OpenAI behaviour in the codebase that needed unblocking.

SDK upgrade

mlflow[databricks]      3.9.0rc0 -> >=3.11,<4    (resolved 3.11.1 — off the RC)
databricks-sdk          0.73.0   -> >=0.105,<1   (resolved 0.105.0)
databricks-agents       1.9.3    -> >=1.9.4,<2   (resolved 1.9.4 — current PyPI max)
databricks-vectorsearch >=0.62   -> >=0.66,<1    (resolved 0.67)
databricks-openai       0.6.1    -> >=0.15,<0.16 (resolved 0.15.0)
databricks-mcp          0.4.0    -> >=0.9,<0.10  (resolved 0.9.0)

Notable transitive cleanup: langchain / langgraph removed from the lock — newer databricks-agents no longer pulls them.

Files: pyproject.toml, requirements.txt, uv.lock.

Source-of-truth report from the sdk-reviewer subagent: SDK_REVIEW.md.

Agent LLM migration to Claude

Agent                                    Endpoint
supervisor                               databricks-claude-opus-4-7
account, billing, product, tech_support  databricks-claude-sonnet-4-6

Files: configs/agents/{supervisor,account,billing,product,tech_support}.yaml.

Three code-level adjustments were needed for Claude compatibility:

  1. Removed the temperature parameter from all five agent configs. Databricks Foundation Model API rejects temperature for the Claude 4.x family.
  2. Generalised the Claude-specific tool-call branch in telco_support_agent/agents/utils/message_formatting.py from if llm_endpoint == "databricks-claude-3-7-sonnet" to if "claude" in llm_endpoint.lower(). Claude requires content: None alongside tool_calls rather than the literal string "tool call".
  3. Updated billing/account/product system prompts to make the auto-injection of customer explicit. Tool specs strip the customer parameter from what the LLM sees (per tool_injection.py:23-72); the runtime auto-injects it. Claude follows prompt instructions more literally than GPT-4 and was asking the user for the customer ID; the new prompts explicitly say "the customer ID is auto-injected; never ask the user for it."
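Adjustments 2 and 3 can be illustrated with a minimal sketch. The function names, the tool spec shape, and the month parameter below are hypothetical and are not taken from message_formatting.py or tool_injection.py; only the "claude" substring check, the content: None behaviour, and the hidden customer parameter come from the notes above.

```python
from copy import deepcopy
from typing import Any


def strip_injected_params(tool_spec: dict[str, Any], injected: set[str]) -> dict[str, Any]:
    """Return a copy of an OpenAI-style tool spec without runtime-injected parameters."""
    spec = deepcopy(tool_spec)  # never mutate the registered spec
    params = spec["function"]["parameters"]
    params["properties"] = {
        name: schema
        for name, schema in params.get("properties", {}).items()
        if name not in injected
    }
    params["required"] = [n for n in params.get("required", []) if n not in injected]
    return spec


def assistant_tool_call_message(llm_endpoint: str, tool_calls: list[dict[str, Any]]) -> dict[str, Any]:
    """Build the assistant message that carries tool calls."""
    # Any Claude endpoint gets content=None next to tool_calls;
    # other endpoints keep the literal placeholder string.
    content = None if "claude" in llm_endpoint.lower() else "tool call"
    return {"role": "assistant", "content": content, "tool_calls": tool_calls}


# Hypothetical spec: "customer" is injected at runtime, "month" is illustrative only.
billing_spec = {
    "type": "function",
    "function": {
        "name": "get_billing_info",
        "parameters": {
            "type": "object",
            "properties": {"customer": {"type": "string"}, "month": {"type": "string"}},
            "required": ["customer", "month"],
        },
    },
}
visible_spec = strip_injected_params(billing_spec, {"customer"})
```

Because the LLM only ever sees visible_spec, the system prompt has to state that the customer ID is supplied by the runtime; otherwise a literal-minded model has no way to know the value exists.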

UC function registration bug

tools/registry.py:_register_domain_functions previously triggered registration via importlib.import_module(...) and relied on module-level side-effects to register the SQL UDFs. But each tools/<domain>/functions.py had its register_*(uc_config) calls inside if __name__ == "__main__":, so importing was a no-op and 0/8 UC functions ever got created. Symptom at runtime: Routine or Model 'telco_customer_support_dev.agent.get_billing_info' does not exist.

Fix: each domain module now exposes a top-level register_all(uc_config) function. _register_domain_functions calls module.register_all(uc_config) after import. The if __name__ == "__main__": block is preserved for direct execution.
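The pattern can be sketched as follows. This is a stand-alone illustration, not the code in tools/registry.py: the module name demo_tools_billing_functions, the dict-shaped uc_config, and the created list are stand-ins so the example runs without a workspace.

```python
import importlib
import sys
import types

# Stand-in for a tools/<domain>/functions.py module (hypothetical contents).
billing = types.ModuleType("demo_tools_billing_functions")
billing.created = []  # records what would have been created in Unity Catalog


def _register_get_billing_info(uc_config):
    # The real code would issue CREATE OR REPLACE FUNCTION here.
    billing.created.append(f"{uc_config['agent_catalog']}.agent.get_billing_info")


def register_all(uc_config):
    """Top-level entry point, callable after a plain import."""
    _register_get_billing_info(uc_config)


billing.register_all = register_all
sys.modules[billing.__name__] = billing  # make the stand-in importable


def register_domain_functions(module_name, uc_config):
    # Importing alone registers nothing; the explicit call does the work.
    module = importlib.import_module(module_name)
    module.register_all(uc_config)
    return module


module = register_domain_functions(
    "demo_tools_billing_functions", {"agent_catalog": "telco_customer_support_dev"}
)
```

Calling an explicit function keeps registration independent of how the module is loaded, which is what the import-time side-effect approach silently got wrong.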

Files: telco_support_agent/tools/{billing,account,product}/functions.py, telco_support_agent/tools/registry.py.

Data catalog plumbing

UCConfig.data_catalog defaulted to telco_customer_support_prod (per a "data always comes from prod" assumption in the schema), and the three notebook config classes only set agent_catalog from the widget — not data_catalog. The widget value was telco_customer_support_dev for this workspace, but data_catalog silently fell back to prod. Two failure modes resulted:

  1. CREATE OR REPLACE FUNCTION referenced telco_customer_support_prod.gold.customers (missing) → registration silently failed.
  2. The deployed agent's uc_config.yaml artifact had data_catalog=prod → RestException: TABLE_DOES_NOT_EXIST: Table 'telco_customer_support_prod.gold.knowledge_base_index'.

Two fixes:

  • telco_support_agent/config/notebooks.py: all three to_uc_config() methods (RunEvalsConfig, LogRegisterConfig, DeployAgentConfig) now also pass data_catalog=self.uc_catalog.
  • telco_support_agent/ops/logging.py: _get_supervisor_resources(...) now accepts an optional data_catalog parameter (falling back to uc_catalog when not provided), instead of hardcoding telco_customer_support_prod. The call site in log_agent passes uc_config.data_catalog.

For multi-environment setups that genuinely want shared prod data, callers can still pass an explicit data_catalog.
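A minimal sketch of the resulting flow, assuming simplified stand-ins for the config classes: NotebookConfig and knowledge_base_index are hypothetical names, and the real UCConfig has more fields than shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UCConfig:
    agent_catalog: str
    # Default described above; it only applies when no data_catalog is passed.
    data_catalog: str = "telco_customer_support_prod"


@dataclass
class NotebookConfig:
    """Stand-in for RunEvalsConfig / LogRegisterConfig / DeployAgentConfig."""

    uc_catalog: str  # value of the notebook widget

    def to_uc_config(self) -> UCConfig:
        # Pass the widget catalog for both fields so data no longer falls back to prod.
        return UCConfig(agent_catalog=self.uc_catalog, data_catalog=self.uc_catalog)


def knowledge_base_index(uc_catalog: str, data_catalog: Optional[str] = None) -> str:
    """Resolve the vector search index name, falling back to uc_catalog."""
    catalog = data_catalog or uc_catalog
    return f"{catalog}.gold.knowledge_base_index"


dev_config = NotebookConfig(uc_catalog="telco_customer_support_dev").to_uc_config()
```

An explicit data_catalog still wins over the fallback, which preserves the shared-prod-data option mentioned above.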

Deploy idempotency

databricks.agents.deploy() raises BadRequest: Cannot create 2+ served entities with the same name when the same (model, version, endpoint) is already deployed. As a result, the notebook job failed on every rerun after a successful deploy.

Fix: telco_support_agent/ops/deployment.py:deploy_agent now lists existing deployments via agents.get_deployments(model_name=...) and short-circuits when a matching (endpoint_name, model_version) is found — returning the existing deployment and continuing through the rest of the function (waiting for ready, permissions, instructions). New versions still go through the normal deploy path.
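The short-circuit can be sketched as below. To keep the example runnable without a workspace, the listing and deploy calls are injected as plain callables; in the real function they would be agents.get_deployments(model_name=...) and agents.deploy(...). The Deployment dataclass and deploy_once name are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Deployment:
    """Stand-in carrying only the two attributes the check compares."""

    endpoint_name: str
    model_version: str


def deploy_once(
    model_name: str,
    model_version: str,
    endpoint_name: str,
    list_deployments: Callable[[str], Iterable[Deployment]],
    deploy: Callable[[str, str, str], Deployment],
) -> Tuple[Deployment, bool]:
    """Return (deployment, created); created is False when an existing one is reused."""
    for existing in list_deployments(model_name):
        same_endpoint = existing.endpoint_name == endpoint_name
        # Compare as strings so 3 and "3" count as the same version.
        same_version = str(existing.model_version) == str(model_version)
        if same_endpoint and same_version:
            return existing, False
    return deploy(model_name, model_version, endpoint_name), True


# Usage with in-memory fakes.
already = [Deployment("dev-telco-customer-support-agent", "3")]
calls = []


def fake_deploy(name, version, endpoint):
    calls.append((name, version, endpoint))
    return Deployment(endpoint, version)


reused, created_existing = deploy_once(
    "model", "3", "dev-telco-customer-support-agent", lambda _: already, fake_deploy
)
fresh, created_new = deploy_once(
    "model", "4", "dev-telco-customer-support-agent", lambda _: already, fake_deploy
)
```

Returning the deployment in both cases lets the caller run the same post-deploy steps (readiness wait, permissions, instructions) whether or not a deploy actually happened.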

Bundle / app routing

  • databricks.yml: dev/staging/prod targets all routed to the Azure workspace (adb-7405608427441525.5.azuredatabricks.net).
  • telco_support_agent/ui/app_dev.yaml: DATABRICKS_HOST updated to the Azure workspace URL; MLFLOW_EXPERIMENT_ID updated to 2366216092548657 (the experiment for /Shared/telco_support_agent/dev/dev_telco_support_agent on Azure).

The UI app staging/prod yamls were left untouched (still on e2-demo-west) since this branch only validates dev. Apply the same edits when bringing those environments online.

App service principal CAN_QUERY on the serving endpoint

Databricks Apps run as a workspace-managed service principal. With no explicit grant, the app's calls to /serving-endpoints/dev-telco-customer-support-agent/invocations returned HTTP 403. The fix was applied out-of-band:

databricks serving-endpoints update-permissions <endpoint_id> \
  --json '{"access_control_list":[{"service_principal_name":"<app_sp_client_id>","permission_level":"CAN_QUERY"}]}' \
  --profile fe-vm-azure

This is workspace state — not in git. If anyone redeploys the endpoint from scratch (agents.delete_deployment then agents.deploy) the grant resets. A follow-up could bake this into _set_permissions in telco_support_agent/ops/deployment.py so it's reproducible from the bundle.

Note: a resources: block in app_dev.yaml was tried first and reverted — that yaml controls the runtime container's command/env, not app-level resource permissions; the latter is a separate Apps API.
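If the grant is scripted later, the JSON body passed to update-permissions could be generated rather than hand-written. The helper below is a hypothetical sketch that only builds the same payload shape as the command above; it does not call the Databricks API, and how _set_permissions would send it is left open.

```python
import json


def can_query_acl(app_sp_client_id: str) -> str:
    """Build the access-control body granting CAN_QUERY to one service principal."""
    payload = {
        "access_control_list": [
            {
                "service_principal_name": app_sp_client_id,
                "permission_level": "CAN_QUERY",
            }
        ]
    }
    return json.dumps(payload)


# "<app_sp_client_id>" is the same placeholder used in the command above.
body = can_query_acl("<app_sp_client_id>")
```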

dbdemos_tracker API change

The UI middleware was logging Tracking error: ... on every request. The installed dbdemos_tracker API moved demo_name from Tracker.track_app_view() to the Tracker(...) constructor. Updated signatures:

  • Tracker(org_id, email=None, demo_name=None, demo_catalog_id=None)
  • Tracker.track_app_view(self, user_email, app_path)

telco_support_agent/ui/backend/app/main.py updated accordingly.
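The shape of the call-site change, sketched against a stand-in class that mirrors only the two signatures listed above. This is not the real dbdemos_tracker implementation; the events list and the example argument values are illustrative.

```python
class Tracker:
    """Stand-in with the same signatures as listed above; behaviour is illustrative."""

    def __init__(self, org_id, email=None, demo_name=None, demo_catalog_id=None):
        self.org_id = org_id
        self.email = email
        self.demo_name = demo_name
        self.demo_catalog_id = demo_catalog_id
        self.events = []

    def track_app_view(self, user_email, app_path):
        self.events.append((self.demo_name, user_email, app_path))


# New style: demo_name is supplied once, at construction time.
tracker = Tracker("org-1", demo_name="telco-support-agent")
tracker.track_app_view("user@example.com", "/chat")

# Old style: passing demo_name per call no longer matches the signature.
try:
    tracker.track_app_view("user@example.com", "/chat", demo_name="telco-support-agent")
    old_style_rejected = False
except TypeError:
    old_style_rejected = True
```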

Verification

  • Bundle deploy: databricks bundle deploy -t dev --profile fe-vm-azure completes cleanly.
  • Job run: [dev kyra_wulffert] telco_log_register_deploy_agent_dev — UC functions register (8/8), model logs to UC, endpoint deploys.
  • Endpoint query (raw API to bypass CLI schema filtering for custom_inputs):
    databricks api post /serving-endpoints/dev-telco-customer-support-agent/invocations \
      --json @/tmp/req.json --profile fe-vm-azure
    
    with body {"input":[{"role":"user","content":"What was the customer's data in May?"}],"custom_inputs":{"customer":"CUS-10001"}} returns real billing data via Claude → tool call → UC function → Delta read.
  • UI app: cd telco_support_agent/ui && ./deploy.sh dev fe-vm-azure builds the React frontend and deploys. App reachable at https://telco-support-agent-dev-7405608427441525.5.azure.databricksapps.com/.
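The request body used for the endpoint query above can be built programmatically. A small sketch follows; build_invocation_request is a hypothetical helper, and only the input and custom_inputs structure comes from the body shown above.

```python
import json


def build_invocation_request(question: str, customer_id: str) -> dict:
    """Build an invocations body that carries the customer ID in custom_inputs."""
    return {
        "input": [{"role": "user", "content": question}],
        "custom_inputs": {"customer": customer_id},
    }


request_body = build_invocation_request("What was the customer's data in May?", "CUS-10001")
# Serialised form, suitable for writing to a file such as /tmp/req.json.
request_json = json.dumps(request_body)
```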

Tooling notes encountered

  • databricks CLI 0.291.0 hit an expired Terraform PGP signing key; upgrading to 0.298.0 via brew upgrade databricks resolved it. The upgrade is required to deploy bundles cloud-side.
  • databricks serving-endpoints query strips fields not in its OpenAI-shaped schema — custom_inputs was silently dropped. Use databricks api post /serving-endpoints/<name>/invocations for full-fidelity requests.
  • The UI app's frontend is built/staged by telco_support_agent/ui/deploy.sh (it runs npm run build and copies frontend/dist → static/). The bundle deploy alone uploads raw source and starts the app in API-only mode (404 at /).

Follow-ups left for later

  • telco_support_agent/data/generators/knowledge_base.py:346 still references databricks-claude-3-7-sonnet (used at synthetic-data-gen time, not at agent runtime — left alone here).
  • telco_support_agent/ui/app_{staging,prod}.yaml not updated for Azure (the user's stash holds those changes).
  • The dev UI app may emit a non-fatal warning Tracker.track_app_view() got an unexpected keyword argument 'demo_name' — frontend demo telemetry, doesn't affect functionality.

Commit log on this branch

(Most recent first.)

Move demo_name kwarg to dbdemos Tracker constructor
Grant the dev UI app CAN_QUERY on the serving endpoint
Add BRANCH_NOTES.md summarising the SDK migration + fixes
Point dev UI app at the Azure MLflow experiment
Tell sub-agents the customer ID is auto-injected
Make deploy_agent idempotent
Route dev UI app to Azure workspace
Bump Databricks/MLflow SDKs, switch to Claude, fix UC catalog wiring

(Plus a follow-up commit revising this doc and removing a no-op resources: block from app_dev.yaml.)


This pull request and its description were written by Isaac.

SDK upgrade
- mlflow[databricks] 3.9.0rc0 -> >=3.11,<4 (off the RC)
- databricks-sdk 0.73 -> >=0.105,<1
- databricks-agents 1.9.3 -> >=1.9.4,<2
- databricks-vectorsearch >=0.62 -> >=0.66,<1
- databricks-openai 0.6.1 -> >=0.15,<0.16
- databricks-mcp 0.4.0 -> >=0.9,<0.10

Agent LLM
- supervisor on databricks-claude-opus-4-7
- account/billing/product/tech_support on databricks-claude-sonnet-4-6
- drop `temperature` (Claude 4.x on Databricks rejects it)
- generalize tool-call formatting in message_formatting.py to match any
  Claude endpoint instead of a hardcoded sonnet-3-7 string

UC function registration
- expose register_all(uc_config) in tools/{billing,account,product}/functions.py;
  registration was previously gated behind `if __name__ == "__main__":` so
  importlib.import_module(...) was a no-op and 0/8 functions ever got created
- registry now calls module.register_all(uc_config) after import

Data catalog flow
- LogRegisterConfig/RunEvalsConfig/DeployAgentConfig.to_uc_config() now also
  set data_catalog from the widget's uc_catalog (was defaulting to a
  hardcoded 'telco_customer_support_prod' that doesn't exist in this workspace)
- _get_supervisor_resources() now takes data_catalog and stops hardcoding the
  prod catalog for vector search index resources

Other
- SDK_REVIEW.md: migration report from the sdk-reviewer subagent
- databricks.yml: route dev/staging/prod targets to the Azure workspace

Co-authored-by: Isaac
DATABRICKS_HOST was pointing at e2-demo-west; the deployed agent
endpoint lives on adb-7405608427441525.5.azuredatabricks.net, so
the UI app needs to call there.

Co-authored-by: Isaac
Skip agents.deploy() when the (model, version, endpoint) tuple is
already deployed. Avoids the name-collision error from agents.deploy()
on reruns of the bundle job after a successful deploy.

Co-authored-by: Isaac
Claude follows prompt instructions more literally than the previous
GPT-4 endpoint, so it asked the user for the customer ID even though
the runtime auto-injects it into every tool call (per
ToolParameterInjector.prepare_tool_spec_for_llm). Replace the vague
"Customer ID is always present" line with an explicit directive: the
customer ID won't appear in the tool's parameter list, never ask the
user for it, call the tool immediately.

Co-authored-by: Isaac
Resolved /Shared/telco_support_agent/dev/dev_telco_support_agent on
adb-7405608427441525.5 to experiment_id 2366216092548657. The previous
ID belonged to e2-demo-west and would silently drop UI traces.

Co-authored-by: Isaac
PR-description-style summary of everything in sdk-bump-and-fixes:
SDK bumps, Claude switch with the three compatibility tweaks,
UC function registration bug, data_catalog plumbing, deploy
idempotency, bundle/app routing, verification, follow-ups.

Co-authored-by: Isaac
Databricks Apps run as their own service principal; without an
explicit serving_endpoint resource declaration, calls to
/serving-endpoints/<name>/invocations return 403. Adding a
CAN_QUERY resource binding for dev-telco-customer-support-agent
so the app can stream chat responses.

Co-authored-by: Isaac
The current dbdemos_tracker API takes demo_name on the Tracker
constructor, not on track_app_view (signature is
Tracker(org_id, email=None, demo_name=None, demo_catalog_id=None)
and track_app_view(self, user_email, app_path)). Without this the
middleware logged on every request — first that demo_name was an
unexpected kwarg, then that demo_name was required.

Co-authored-by: Isaac
- Remove the resources: block from app_dev.yaml that I tried as a
  permission grant — that yaml is the runtime container config and
  doesn't affect app-level permissions, the actual fix was the
  serving-endpoints update-permissions call documented in
  BRANCH_NOTES.md.
- Update BRANCH_NOTES.md to cover the CAN_QUERY grant on the agent
  endpoint, the dbdemos_tracker constructor API change, and refresh
  the commit log section.

Co-authored-by: Isaac
@kwulffert23 kwulffert23 marked this pull request as ready for review April 27, 2026 21:45
@kwulffert23

kwulffert23 commented Apr 27, 2026

@jeannefukumaru could you take a look when you have a chance? I couldn't add you as a formal reviewer due to permissions.

@kwulffert23 kwulffert23 marked this pull request as draft April 28, 2026 05:31