Bump Databricks/MLflow SDKs, switch to Claude, fix UC catalog wiring #14
Draft
kwulffert23 wants to merge 9 commits into
SDK upgrade
- mlflow[databricks] 3.9.0rc0 -> >=3.11,<4 (off the RC)
- databricks-sdk 0.73 -> >=0.105,<1
- databricks-agents 1.9.3 -> >=1.9.4,<2
- databricks-vectorsearch >=0.62 -> >=0.66,<1
- databricks-openai 0.6.1 -> >=0.15,<0.16
- databricks-mcp 0.4.0 -> >=0.9,<0.10
Agent LLM
- supervisor on databricks-claude-opus-4-7
- account/billing/product/tech_support on databricks-claude-sonnet-4-6
- drop `temperature` (Claude 4.x on Databricks rejects it)
- generalize tool-call formatting in message_formatting.py to match any
Claude endpoint instead of a hardcoded sonnet-3-7 string
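The generalized endpoint check can be sketched as follows. This is a minimal illustration, not the contents of message_formatting.py: the helper names `is_claude_endpoint` and `format_tool_call_message` are hypothetical, and the message shape is an assumed OpenAI-style chat dict. The `content: None` versus `"tool call"` behaviour follows the branch notes further down.

```python
from typing import Any


def is_claude_endpoint(llm_endpoint: str) -> bool:
    """Match any Claude endpoint by substring instead of one hardcoded name."""
    return "claude" in llm_endpoint.lower()


def format_tool_call_message(
    llm_endpoint: str, tool_calls: list[dict[str, Any]]
) -> dict[str, Any]:
    """Build an assistant message that carries tool calls (hypothetical helper)."""
    # Claude endpoints expect content=None next to tool_calls; other endpoints
    # keep the placeholder string described in the branch notes.
    content = None if is_claude_endpoint(llm_endpoint) else "tool call"
    return {"role": "assistant", "content": content, "tool_calls": tool_calls}


claude_msg = format_tool_call_message("databricks-claude-sonnet-4-6", [])
other_msg = format_tool_call_message("dbdemos-openai-gpt4", [])
```

A substring match keeps the check working when endpoint names change between Claude versions, at the cost of also matching any future endpoint whose name merely contains "claude".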
UC function registration
- expose register_all(uc_config) in tools/{billing,account,product}/functions.py;
registration was previously gated behind `if __name__ == "__main__":` so
importlib.import_module(...) was a no-op and 0/8 functions ever got created
- registry now calls module.register_all(uc_config) after import
Data catalog flow
- LogRegisterConfig/RunEvalsConfig/DeployAgentConfig.to_uc_config() now also
set data_catalog from the widget's uc_catalog (was defaulting to a
hardcoded 'telco_customer_support_prod' that doesn't exist in this workspace)
- _get_supervisor_resources() now takes data_catalog and stops hardcoding the
prod catalog for vector search index resources
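The silent fallback and its fix can be sketched with simplified dataclasses. These are stand-ins, not the project's classes: the real `UCConfig` and notebook configs have more fields, and `resolve_data_catalog` is a hypothetical name for the fallback described for `_get_supervisor_resources()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UCConfig:
    """Simplified stand-in for the project's UC config."""

    agent_catalog: str
    # The hardcoded default that caused the silent fallback to prod.
    data_catalog: str = "telco_customer_support_prod"


@dataclass
class DeployAgentConfig:
    """Simplified stand-in for one of the notebook config classes."""

    uc_catalog: str  # value of the notebook widget

    def to_uc_config(self) -> UCConfig:
        # Pass the widget catalog for both fields so data_catalog no longer
        # falls back to the default.
        return UCConfig(agent_catalog=self.uc_catalog, data_catalog=self.uc_catalog)


def resolve_data_catalog(uc_catalog: str, data_catalog: Optional[str] = None) -> str:
    """Use data_catalog when given, otherwise fall back to uc_catalog."""
    return data_catalog or uc_catalog


before_fix = UCConfig(agent_catalog="telco_customer_support_dev")
after_fix = DeployAgentConfig(uc_catalog="telco_customer_support_dev").to_uc_config()
```

Here `before_fix.data_catalog` still points at the prod catalog, which is the state that produced the missing-table errors, while `after_fix.data_catalog` follows the widget value.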
Other
- SDK_REVIEW.md: migration report from the sdk-reviewer subagent
- databricks.yml: route dev/staging/prod targets to the Azure workspace
Co-authored-by: Isaac
DATABRICKS_HOST was pointing at e2-demo-west; the deployed agent endpoint lives on adb-7405608427441525.5.azuredatabricks.net, so the UI app needs to call there. Co-authored-by: Isaac
Skip agents.deploy() when the (model, version, endpoint) tuple is already deployed. Avoids the name-collision error from agents.deploy() on reruns of the bundle job after a successful deploy. Co-authored-by: Isaac
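The skip logic can be sketched without touching a workspace by injecting the list and deploy calls. Everything here is an assumption for illustration: the `Deployment` dataclass, the `deploy_if_needed` name, and the injected callables stand in for whatever `deploy_agent` actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Deployment:
    """Minimal stand-in for a deployment record (fields are assumptions)."""

    endpoint_name: str
    model_version: str


def deploy_if_needed(
    endpoint_name: str,
    model_version: str,
    list_deployments: Callable[[], Iterable[Deployment]],
    deploy: Callable[[], Deployment],
) -> Deployment:
    """Return the existing deployment when one matches, otherwise deploy."""
    wanted = (endpoint_name, str(model_version))
    for existing in list_deployments():
        if (existing.endpoint_name, str(existing.model_version)) == wanted:
            return existing  # same version already served: skip the deploy call
    return deploy()  # new version: normal deploy path


# Fake workspace state: version 3 is already deployed on the endpoint.
deployed = [Deployment("dev-telco-customer-support-agent", "3")]
deploy_calls: list[str] = []


def fake_deploy() -> Deployment:
    deploy_calls.append("deploy")
    return Deployment("dev-telco-customer-support-agent", "4")


rerun = deploy_if_needed(
    "dev-telco-customer-support-agent", "3", lambda: deployed, fake_deploy
)
new_version = deploy_if_needed(
    "dev-telco-customer-support-agent", "4", lambda: deployed, fake_deploy
)
```

Comparing versions as strings avoids a mismatch when one side reports the version as an integer and the other as text.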
Claude follows prompt instructions more literally than the previous GPT-4 endpoint, so it asked the user for the customer ID even though the runtime auto-injects it into every tool call (per ToolParameterInjector.prepare_tool_spec_for_llm). Replace the vague "Customer ID is always present" line with an explicit directive: the customer ID won't appear in the tool's parameter list, never ask the user for it, call the tool immediately. Co-authored-by: Isaac
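The injection pattern behind this prompt change can be sketched as two small helpers: one hides the injected parameter from the spec the LLM sees, the other adds it back at call time. This is not the code of `ToolParameterInjector`; the helper names, the OpenAI-style spec layout, and the `billing_month` parameter are assumptions made for illustration.

```python
import copy
from typing import Any, Callable

INJECTED_PARAMS = ("customer",)  # supplied by the runtime, never by the LLM


def hide_injected_params(tool_spec: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the spec with injected parameters removed."""
    spec = copy.deepcopy(tool_spec)
    params = spec["function"]["parameters"]
    for name in INJECTED_PARAMS:
        params["properties"].pop(name, None)
        if name in params.get("required", []):
            params["required"].remove(name)
    return spec


def call_with_injection(
    tool: Callable[..., Any],
    llm_args: dict[str, Any],
    custom_inputs: dict[str, Any],
) -> Any:
    """Merge runtime-injected values into the LLM's arguments, then call."""
    injected = {k: custom_inputs[k] for k in INJECTED_PARAMS if k in custom_inputs}
    return tool(**{**llm_args, **injected})


full_spec = {
    "type": "function",
    "function": {
        "name": "get_billing_info",
        "parameters": {
            "type": "object",
            "properties": {
                "customer": {"type": "string"},
                "billing_month": {"type": "string"},
            },
            "required": ["customer", "billing_month"],
        },
    },
}
llm_spec = hide_injected_params(full_spec)
result = call_with_injection(
    lambda customer, billing_month: (customer, billing_month),
    {"billing_month": "2025-05"},
    {"customer": "CUS-10001"},
)
```

Because the LLM never sees `customer` in the spec, a prompt that only says the ID is "always present" gives a literal-minded model no way to know it should not ask for it, which is why the directive was made explicit.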
Resolved /Shared/telco_support_agent/dev/dev_telco_support_agent on adb-7405608427441525.5 to experiment_id 2366216092548657. The previous ID belonged to e2-demo-west and would silently drop UI traces. Co-authored-by: Isaac
PR-description-style summary of everything in sdk-bump-and-fixes: SDK bumps, Claude switch with the three compatibility tweaks, UC function registration bug, data_catalog plumbing, deploy idempotency, bundle/app routing, verification, follow-ups. Co-authored-by: Isaac
Databricks Apps run as their own service principal; without an explicit serving_endpoint resource declaration, calls to /serving-endpoints/<name>/invocations return 403. Adding a CAN_QUERY resource binding for dev-telco-customer-support-agent so the app can stream chat responses. Co-authored-by: Isaac
The current dbdemos_tracker API takes demo_name on the Tracker constructor, not on track_app_view (signature is Tracker(org_id, email=None, demo_name=None, demo_catalog_id=None) and track_app_view(self, user_email, app_path)). Without this the middleware logged on every request — first that demo_name was an unexpected kwarg, then that demo_name was required. Co-authored-by: Isaac
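The call-shape change can be sketched against a local stand-in that copies the two signatures quoted above. The class below is not the real dbdemos_tracker implementation and sends nothing anywhere; the org ID, email, and demo name are placeholder values.

```python
from typing import Optional


class Tracker:
    """Local stand-in mirroring the quoted signatures (not the real library)."""

    def __init__(
        self,
        org_id: str,
        email: Optional[str] = None,
        demo_name: Optional[str] = None,
        demo_catalog_id: Optional[str] = None,
    ) -> None:
        self.org_id = org_id
        self.email = email
        self.demo_name = demo_name
        self.demo_catalog_id = demo_catalog_id
        self.views: list = []  # recorded (user_email, app_path, demo_name) tuples

    def track_app_view(self, user_email: str, app_path: str) -> None:
        self.views.append((user_email, app_path, self.demo_name))


# demo_name is set once on the constructor...
tracker = Tracker("1234567890", demo_name="telco-support-agent")
# ...and each request passes only the user and the path.
tracker.track_app_view("user@example.com", "/chat")

# The old call shape fails against this signature with a TypeError.
try:
    tracker.track_app_view(
        "user@example.com", "/chat", demo_name="telco-support-agent"
    )
    old_call_error = None
except TypeError as exc:
    old_call_error = str(exc)
```

The `TypeError` captured in `old_call_error` corresponds to the first of the two middleware log messages described above.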
- Remove the resources: block from app_dev.yaml that I tried as a permission grant. That yaml is the runtime container config and doesn't affect app-level permissions; the actual fix was the serving-endpoints update-permissions call documented in BRANCH_NOTES.md.
- Update BRANCH_NOTES.md to cover the CAN_QUERY grant on the agent endpoint, the dbdemos_tracker constructor API change, and refresh the commit log section.
Co-authored-by: Isaac
Author
@jeannefukumaru could you take a look when you have a chance? I couldn't add you as a formal reviewer due to permissions.
Branch: `sdk-bump-and-fixes`

This branch bumps the Databricks/MLflow SDKs to current versions, switches the agent's LLMs from `dbdemos-openai-gpt4` to Claude (Opus 4.7 / Sonnet 4.6), and fixes several pre-existing latent bugs surfaced by deploying end-to-end on a dev Azure workspace.

Motivation
- `mlflow[databricks]==3.9.0rc0` was pinned to a release candidate; several APIs the project uses (`mlflow.genai.scorers`, `make_judge`, `mlflow.genai.evaluate`) only stabilised in MLflow 3.10/3.11.
- `databricks-sdk==0.73.0` was ~30 minor versions behind the current 0.105.0.
- `databricks-agents`, `databricks-vectorsearch`, `databricks-openai`, and `databricks-mcp` were all materially behind current.

SDK upgrade
Notable transitive cleanup: `langchain`/`langgraph` removed from the lock; newer `databricks-agents` no longer pulls them.

Files: `pyproject.toml`, `requirements.txt`, `uv.lock`.

Source-of-truth report from the `sdk-reviewer` subagent: `SDK_REVIEW.md`.

Agent LLM migration to Claude
- supervisor: `databricks-claude-opus-4-7`
- account, billing, product, tech_support: `databricks-claude-sonnet-4-6`

Files: `configs/agents/{supervisor,account,billing,product,tech_support}.yaml`.

Three code-level adjustments were needed for Claude compatibility:
- Dropped the `temperature` parameter from all five agent configs. The Databricks Foundation Model API rejects `temperature` for the Claude 4.x family.
- Generalized the endpoint check in `telco_support_agent/agents/utils/message_formatting.py` from `if llm_endpoint == "databricks-claude-3-7-sonnet"` to `if "claude" in llm_endpoint.lower()`. Claude requires `content: None` alongside `tool_calls` rather than the literal string `"tool call"`.
- Made the prompts' handling of `customer` explicit. Tool specs strip the `customer` parameter from what the LLM sees (per `tool_injection.py:23-72`); the runtime auto-injects it. Claude follows prompt instructions more literally than GPT-4 and was asking the user for the customer ID; the new prompts explicitly say "the customer ID is auto-injected; never ask the user for it."

UC function registration bug
`tools/registry.py:_register_domain_functions` previously triggered registration via `importlib.import_module(...)` and relied on module-level side effects to register the SQL UDFs. But each `tools/<domain>/functions.py` had its `register_*(uc_config)` calls inside `if __name__ == "__main__":`, so importing was a no-op and 0/8 UC functions ever got created. Symptom at runtime: `Routine or Model 'telco_customer_support_dev.agent.get_billing_info' does not exist.`

Fix: each domain module now exposes a top-level `register_all(uc_config)` function. `_register_domain_functions` calls `module.register_all(uc_config)` after import. The `if __name__ == "__main__":` block is preserved for direct execution.

Files: `telco_support_agent/tools/{billing,account,product}/functions.py`, `telco_support_agent/tools/registry.py`.

Data catalog plumbing
`UCConfig.data_catalog` defaulted to `telco_customer_support_prod` (per a "data always comes from prod" assumption in the schema), and the three notebook config classes only set `agent_catalog` from the widget, not `data_catalog`. The widget value was `telco_customer_support_dev` for this workspace, but `data_catalog` silently fell back to prod. Two failure modes resulted:
- `CREATE OR REPLACE FUNCTION` referenced `telco_customer_support_prod.gold.customers` (missing) → registration silently failed.
- The `uc_config.yaml` artifact had `data_catalog=prod` → `RestException: TABLE_DOES_NOT_EXIST: Table 'telco_customer_support_prod.gold.knowledge_base_index'`.

Two fixes:
- `telco_support_agent/config/notebooks.py`: all three `to_uc_config()` methods (`RunEvalsConfig`, `LogRegisterConfig`, `DeployAgentConfig`) now also pass `data_catalog=self.uc_catalog`.
- `telco_support_agent/ops/logging.py`: `_get_supervisor_resources(...)` now accepts an optional `data_catalog` parameter (falling back to `uc_catalog` when not provided) instead of hardcoding `telco_customer_support_prod`. The call site in `log_agent` passes `uc_config.data_catalog`.

For multi-environment setups that genuinely want shared prod data, callers can still pass an explicit `data_catalog`.

Deploy idempotency
`databricks.agents.deploy()` raises `BadRequest: Cannot create 2+ served entities with the same name` when the same `(model, version, endpoint)` is already deployed. The notebook job would fail on every rerun after a successful deploy.

Fix: `telco_support_agent/ops/deployment.py:deploy_agent` now lists existing deployments via `agents.get_deployments(model_name=...)` and short-circuits when a matching `(endpoint_name, model_version)` is found, returning the existing deployment and continuing through the rest of the function (waiting for ready, permissions, instructions). New versions still go through the normal deploy path.

Bundle / app routing
- `databricks.yml`: dev/staging/prod targets all routed to the Azure workspace (`adb-7405608427441525.5.azuredatabricks.net`).
- `telco_support_agent/ui/app_dev.yaml`: `DATABRICKS_HOST` updated to the Azure workspace URL; `MLFLOW_EXPERIMENT_ID` updated to `2366216092548657` (the experiment for `/Shared/telco_support_agent/dev/dev_telco_support_agent` on Azure).

The UI app `staging`/`prod` yamls were left untouched (still on `e2-demo-west`) since this branch only validates dev. Apply the same edits when bringing those environments online.

App service principal `CAN_QUERY` on the serving endpoint
Databricks Apps run as a workspace-managed service principal. With no explicit grant, the app's calls to `/serving-endpoints/dev-telco-customer-support-agent/invocations` returned `HTTP 403`. The fix was applied out-of-band with a `serving-endpoints update-permissions` call.

This is workspace state, not in git. If anyone redeploys the endpoint from scratch (`agents.delete_deployment` then `agents.deploy`), the grant resets. A follow-up could bake this into `_set_permissions` in `telco_support_agent/ops/deployment.py` so it's reproducible from the bundle.

`dbdemos_tracker` API change
The UI middleware was logging `Tracking error: ...` on every request. The installed `dbdemos_tracker` API moved `demo_name` from `Tracker.track_app_view()` to the `Tracker(...)` constructor. Updated signatures:
- `Tracker(org_id, email=None, demo_name=None, demo_catalog_id=None)`
- `Tracker.track_app_view(self, user_email, app_path)`

`telco_support_agent/ui/backend/app/main.py` updated accordingly.

Verification
- `databricks bundle deploy -t dev --profile fe-vm-azure` clean.
- `[dev kyra_wulffert] telco_log_register_deploy_agent_dev`: UC functions register (8/8), model logs to UC, endpoint deploys.
- Request with `custom_inputs`: `{"input":[{"role":"user","content":"What was the customer's data in May?"}],"custom_inputs":{"customer":"CUS-10001"}}` returns real billing data via Claude → tool call → UC function → Delta read.
- `cd telco_support_agent/ui && ./deploy.sh dev fe-vm-azure` builds the React frontend and deploys. App reachable at `https://telco-support-agent-dev-7405608427441525.5.azure.databricksapps.com/`.

Tooling notes encountered
- `databricks` CLI 0.291.0 hit an expired Terraform PGP signing key; upgrading to 0.298.0 via `brew upgrade databricks` resolved it. Required to deploy bundles cloud-side.
- `databricks serving-endpoints query` strips fields not in its OpenAI-shaped schema, so `custom_inputs` was silently dropped. Use `databricks api post /serving-endpoints/<name>/invocations` for full-fidelity requests.
- Deploy the UI with `telco_support_agent/ui/deploy.sh` (it runs `npm run build` and copies `frontend/dist` → `static/`). The bundle deploy alone uploads raw source and starts the app in API-only mode (404 at `/`).

Follow-ups left for later
- `telco_support_agent/data/generators/knowledge_base.py:346` still references `databricks-claude-3-7-sonnet` (used at synthetic-data-gen time, not at agent runtime; left alone here).
- `telco_support_agent/ui/app_{staging,prod}.yaml` not updated for Azure (the user's stash holds those changes).
- `Tracker.track_app_view() got an unexpected keyword argument 'demo_name'`: frontend demo telemetry, doesn't affect functionality.

Commit log on this branch
(Most recent first.)
(Plus a follow-up commit revising this doc and removing a no-op `resources:` block from `app_dev.yaml`.)

This pull request and its description were written by Isaac.