34 changes: 19 additions & 15 deletions experimental/databricks-execution-compute/SKILL.md
@@ -13,9 +13,7 @@ description: >-

# Databricks Execution & Compute

Run code on Databricks. Three execution modes—choose based on workload.

> **Path convention:** `<SKILL_ROOT>` in examples below = the directory containing this SKILL.md. Resolve it to the absolute path in your install (e.g. `~/.claude/skills/databricks-execution-compute`). Commands like `python <SKILL_ROOT>/scripts/compute.py ...` work from any cwd.
Run code on Databricks. Three execution modes—choose based on workload. All examples below use the Databricks CLI; see the `databricks-core` skill for install and authentication.

## Execution Mode Decision Matrix

@@ -95,10 +93,6 @@ Pure CLI flow: upload a local file as a workspace notebook, fire a one-time run

Always use `dbutils.notebook.exit(<string>)` in the notebook — `print()` is not captured by `get-run-output`. For JSON results: `dbutils.notebook.exit(json.dumps({...}))` then parse `.notebook_output.result` client-side.
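A minimal sketch of the client-side parsing step, assuming the JSON printed by `databricks jobs get-run-output` has already been captured as a string. The helper name `parse_notebook_result` and the sample payload are hypothetical; only the `.notebook_output.result` path comes from the text above.

```python
import json


def parse_notebook_result(run_output_json: str) -> dict:
    """Decode the JSON string the notebook passed to dbutils.notebook.exit()."""
    run_output = json.loads(run_output_json)
    # .notebook_output.result holds the exit string. It can be empty or missing,
    # for example when the parent run_id was passed instead of the task run_id.
    result = run_output.get("notebook_output", {}).get("result")
    if not result:
        raise ValueError("notebook_output.result is empty; check the task run_id")
    return json.loads(result)


# Hypothetical captured output, shaped like get-run-output for a notebook task.
sample = json.dumps({"notebook_output": {"result": json.dumps({"accuracy": 0.95})}})
print(parse_notebook_result(sample))
```

Raising on an empty result, rather than returning `None`, keeps a wrong run id from being mistaken for a notebook that returned nothing.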

**Convenience wrapper.** `scripts/compute.py execute-code` does upload + submit + wait + cleanup in one command and returns a single tidy JSON:

`python <SKILL_ROOT>/scripts/compute.py execute-code --file /local/path/to/train.py --compute-type serverless --timeout 1500 --environments '[{"environment_key":"ml_env","spec":{"client":"4","dependencies":["scikit-learn==1.5.2","mlflow==2.22.0"]}}]' | jq '{success, state, output, error, run_id, run_page_url, execution_duration_ms}'`

### Interactive Cluster → [reference](references/3-interactive-cluster.md)

**Avoid by default — prefer Serverless Job.** Only use an interactive cluster when:
@@ -108,14 +102,24 @@ Always use `dbutils.notebook.exit(<string>)` in the notebook — `print()` is no

Interactive clusters are **slow to start (3-8 min)** and cost money while running. Don't start one implicitly.

## CLI Commands

| Command | Purpose |
|---------|---------|
| `python <SKILL_ROOT>/scripts/compute.py execute-code` | Run code on serverless or an existing cluster |
| `python <SKILL_ROOT>/scripts/compute.py list-compute` | List clusters, node types, Spark versions |
| `python <SKILL_ROOT>/scripts/compute.py manage-cluster` | Create/start/terminate/delete clusters (see [3-interactive-cluster.md](references/3-interactive-cluster.md)) |
| `databricks warehouses create/list` | Manage SQL warehouses |
## CLI Command Map

All compute lifecycle and code-execution actions go through the Databricks CLI. Headline commands:

| Action | Command |
|--------|---------|
| Upload local file as workspace notebook | `databricks workspace import <WORKSPACE_PATH> --file <LOCAL> --format SOURCE --language PYTHON --overwrite` |
| Run serverless code (submit + wait) | `databricks jobs submit --json @submit.json` (see Serverless Job section above; add `--no-wait` for async) |
| Get run state / wait | `databricks jobs get-run <RUN_ID>` (poll `.state.life_cycle_state`) |
| Fetch run output | `databricks jobs get-run-output <TASK_RUN_ID>` |
| List clusters | `databricks clusters list --output json` |
| Get cluster details | `databricks clusters get <CLUSTER_ID>` |
| Start / restart / terminate cluster | `databricks clusters start/restart/delete <CLUSTER_ID>` |
| Permanently delete cluster | `databricks clusters permanent-delete <CLUSTER_ID>` |
| Create cluster | `databricks clusters create --json '{...}'` (see [3-interactive-cluster.md](references/3-interactive-cluster.md)) |
| List node types / Spark versions | `databricks clusters list-node-types` / `databricks clusters spark-versions` |
| Execute code on a running cluster | `databricks api post /api/1.2/contexts/create` + `databricks api post /api/1.2/commands/execute` (see [3-interactive-cluster.md](references/3-interactive-cluster.md)) |
| SQL warehouses | `databricks warehouses create/list/get/start/stop/edit/delete` (see SQL Warehouses below) |
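When polling `databricks jobs get-run` yourself, the loop needs to know when to stop and which run id to hand to `get-run-output`. A rough Python sketch of those two checks follows. The helper names and the sample payload are hypothetical, and the set of terminal states is an assumption based on the Jobs API life-cycle states.

```python
# Life-cycle states after which a run does not change again (assumed from the Jobs API).
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def is_finished(run: dict) -> bool:
    """True once .state.life_cycle_state reaches a terminal value."""
    return run.get("state", {}).get("life_cycle_state") in TERMINAL_STATES


def first_task_run_id(run: dict) -> int:
    """get-run-output expects the task run id (.tasks[0].run_id), not the parent run_id."""
    return run["tasks"][0]["run_id"]


# Hypothetical get-run payload, trimmed to the fields used here.
run = {
    "run_id": 100,
    "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
    "tasks": [{"run_id": 101}],
}
if is_finished(run):
    print(first_task_run_id(run))
```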

### SQL Warehouses

@@ -2,8 +2,6 @@

**Use when:** Running intensive Python code remotely (ML training, heavy processing) that doesn't need Spark, or when code shouldn't depend on the local machine staying connected.

> `<SKILL_ROOT>` in examples = the directory containing the parent SKILL.md — substitute the absolute install path (e.g. `~/.claude/skills/databricks-execution-compute`).

## When to Choose Serverless Job

- ML model training (runs independently of local machine)
@@ -85,22 +83,6 @@ dbutils.notebook.exit(json.dumps({"accuracy": 0.95, "model_path": "/Volumes/..."

Max output size is 5 MB. Larger results should be written to a Volume/object store and referenced by path.

## Convenience wrapper

`scripts/compute.py execute-code` does upload + submit + wait + cleanup in one command and returns a single JSON with `success`, `state`, `output` (the `dbutils.notebook.exit` payload), `error`, `run_id`, `run_page_url`, `execution_duration_ms`.

Minimal:

`python <SKILL_ROOT>/scripts/compute.py execute-code --file train.py --compute-type serverless`

With dependencies:

`python <SKILL_ROOT>/scripts/compute.py execute-code --file /path/to/train.py --compute-type serverless --timeout 1500 --environments '[{"environment_key":"ml_env","spec":{"client":"4","dependencies":["scikit-learn==1.5.2","mlflow==2.22.0","xgboost==2.1.3"]}}]'`

Long dependency list from a file:

`python <SKILL_ROOT>/scripts/compute.py execute-code --file /path/to/train.py --compute-type serverless --environments @env.json`

## Common Issues

| Issue | Solution |
@@ -109,7 +91,7 @@ Long dependency list from a file:
| `ModuleNotFoundError` | Add the package to the environments spec with `"client": "4"` |
| Dependencies listed but not installed | `"client": "1"` silently drops `dependencies`; use `"client": "4"` |
| `get-run-output` returns empty `notebook_output` | You passed the parent run_id, not `.tasks[0].run_id` |
| Job times out | Default 1800 s on the script wrapper; raise `--timeout` or use `jobs submit --no-wait` + your own polling |
| Job times out | Use `databricks jobs submit --no-wait` and poll `get-run` yourself, or set `tasks[].timeout_seconds` in the submit JSON to extend the per-task limit |
Collaborator

`databricks api get … --json '{…}'` for status polling is fishy. In references/3-interactive-cluster.md §3, the polling loop sends a JSON body on an HTTP GET to /api/1.2/commands/status.
That endpoint takes clusterId/contextId/commandId as query-string parameters, and HTTP GET bodies are usually dropped. The replacement should be:
databricks api get "/api/1.2/commands/status?clusterId=$CID&contextId=$CTX&commandId=$CMD"

Contributor Author

(Claude here.)

Confirmed your read and fixed in 7ebcdb0.

Looked at alternatives before patching: the modern Databricks CLI doesn't ship a command-execution subcommand (just checked v0.299.2 — no group covers the 1.2 endpoints), and the modern SDKs route the same way under the hood. So databricks api get with query-string params is the only correct surface for this legacy API. Done as you suggested:

databricks api get "/api/1.2/commands/status?clusterId=${CID}&contextId=${CTX}&commandId=${CMD}"

Added a one-liner above the snippet noting why the GET body has to go in the URL.


## When NOT to Use

@@ -2,8 +2,6 @@

**Use when:** You have an existing running cluster and need to preserve state across multiple tool calls, or need Scala/R support.

> `<SKILL_ROOT>` in examples = the directory containing the parent SKILL.md — substitute the absolute install path (e.g. `~/.claude/skills/databricks-execution-compute`).

## When to Choose Interactive Cluster

- Multiple sequential commands where variables must persist
@@ -23,7 +21,7 @@
**Starting a cluster takes 3-8 minutes and costs money.** Always check first:

```bash
python <SKILL_ROOT>/scripts/compute.py list-compute --resource clusters
databricks clusters list --cluster-sources UI,API --output json | jq '.[] | select(.state == "RUNNING") | {cluster_id, cluster_name, state, cluster_source}'
```

If no cluster is running, ask the user:
@@ -32,135 +30,161 @@ If no cluster is running, ask the user:
> 2. Use serverless (instant, no setup)
> Which do you prefer?"

## Basic Usage
Filter to user-created clusters (exclude job clusters, which dominate the list on busy workspaces):

```bash
databricks clusters list --cluster-sources UI,API --output json \
| jq '.[] | select(.state == "RUNNING")'
```

## Code Execution Flow (1.2 commands API)

The Databricks CLI doesn't ship a single "run code on a cluster" subcommand. Use the `1.2 commands` API directly via `databricks api`:

1. **Create an execution context** (one per language per cluster; reuse across commands for state).
2. **Submit the command** — returns a `commandId`.
3. **Poll status** until `status == "Finished"` (or `Error` / `Cancelled`).
4. **(Optional) Destroy the context** when done. Contexts also expire when the cluster terminates.
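The four steps above can be sketched end to end in Python. This is an illustration under stated assumptions, not part of the skill: `api_post` and `api_get` are hypothetical callables that would wrap `databricks api post` / `databricks api get` and return the parsed JSON response, and the stand-ins at the bottom exist only so the sketch runs without a workspace.

```python
import time
from typing import Callable
from urllib.parse import urlencode

ApiPost = Callable[[str, dict], dict]  # hypothetical wrapper: (path, json_body) -> response
ApiGet = Callable[[str], dict]         # hypothetical wrapper: (path_with_query) -> response


def run_on_cluster(api_post: ApiPost, api_get: ApiGet, cluster_id: str,
                   code: str, language: str = "python",
                   poll_seconds: float = 2.0) -> dict:
    """One-shot run: create a context, execute, poll, then destroy the context."""
    # 1. Create an execution context.
    ctx = api_post("/api/1.2/contexts/create",
                   {"language": language, "clusterId": cluster_id})["id"]
    try:
        # 2. Submit the command; the response carries the command id.
        cmd = api_post("/api/1.2/commands/execute",
                       {"language": language, "clusterId": cluster_id,
                        "contextId": ctx, "command": code})["id"]
        # 3. Poll status; the parameters travel in the query string, not a body.
        query = urlencode({"clusterId": cluster_id, "contextId": ctx, "commandId": cmd})
        while True:
            status = api_get(f"/api/1.2/commands/status?{query}")
            if status.get("status") in {"Finished", "Error", "Cancelled"}:
                return status
            time.sleep(poll_seconds)
    finally:
        # 4. Destroy the context (skip this step to keep state for follow-up commands).
        api_post("/api/1.2/contexts/destroy",
                 {"clusterId": cluster_id, "contextId": ctx})


# Stand-in callables that record each request path and return canned responses.
calls = []


def fake_post(path: str, body: dict) -> dict:
    calls.append(path)
    return {"id": "ctx_1" if path.endswith("contexts/create") else "cmd_1"}


def fake_get(path: str) -> dict:
    calls.append(path)
    return {"status": "Finished", "results": {"resultType": "text", "data": "ok"}}


result = run_on_cluster(fake_post, fake_get, "1234-567890-abcdef",
                        "print('ok')", poll_seconds=0)
print(result["status"])
```

Passing the API callables in as arguments keeps the control flow testable offline. Because this version always destroys the context, it suits one-off commands; a session that relies on persisted variables would keep the `contextId` and skip step 4.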

### 1. Create a context

### First Command: Creates Context
```bash
CTX=$(databricks api post /api/1.2/contexts/create --json '{
"language": "python",
"clusterId": "1234-567890-abcdef"
}' | jq -r '.id')
echo "$CTX" # e.g. ctx_abc123
```

Languages: `python`, `scala`, `sql`, `r`. You need one context per language; running `sql` requires a separate context from `python` on the same cluster.

### 2. Submit a command

```bash
python <SKILL_ROOT>/scripts/compute.py execute-code \
--code "import pandas as pd; df = pd.DataFrame({'a': [1, 2, 3]}); print(df)" \
--compute-type cluster \
--cluster-id "1234-567890-abcdef"
CMD=$(databricks api post /api/1.2/commands/execute --json '{
"language": "python",
"clusterId": "1234-567890-abcdef",
"contextId": "'"$CTX"'",
"command": "import pandas as pd; df = pd.DataFrame({\"a\": [1, 2, 3]}); print(df)"
}' | jq -r '.id')
echo "$CMD"
```

Response includes `context_id` for reuse:
```json
{
"success": true,
"output": " a\n0 1\n1 2\n2 3",
"context_id": "ctx_abc123",
"cluster_id": "1234-567890-abcdef"
}
### 3. Poll status and fetch results

The `/api/1.2/commands/status` endpoint takes its parameters in the query string; a JSON body on a GET request is typically dropped before the endpoint sees it.

```bash
CID="1234-567890-abcdef"
while :; do
STATUS=$(databricks api get "/api/1.2/commands/status?clusterId=${CID}&contextId=${CTX}&commandId=${CMD}")
STATE=$(echo "$STATUS" | jq -r '.status')
[ "$STATE" = "Finished" ] && break
[ "$STATE" = "Error" ] && break
[ "$STATE" = "Cancelled" ] && break
sleep 2
done
echo "$STATUS" | jq '{status, results: .results}'
```

### Follow-up Commands: Reuse Context
`.results.resultType` indicates output type:
- `text` — `.results.data` is the captured stdout string.
- `error` — `.results.summary` has the error preamble; `.results.cause` has the traceback.
- `table` — `.results.schema` + `.results.data` (rows).
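As an illustration of branching on `.results.resultType`, here is a small Python sketch that turns a status response into printable text. The `summarize_results` name is hypothetical, and the shape assumed for `schema` (a list of objects with a `name` key) is an assumption rather than something the text above specifies.

```python
def summarize_results(status: dict) -> str:
    """Render a commands/status response as plain text, by resultType."""
    results = status.get("results") or {}
    kind = results.get("resultType")
    if kind == "text":
        # Captured stdout.
        return results.get("data", "")
    if kind == "error":
        # Error preamble followed by the traceback.
        return f"{results.get('summary', '')}\n{results.get('cause', '')}".strip()
    if kind == "table":
        # Assumed shape: schema is a list of {"name": ...}; data is a list of rows.
        columns = [col.get("name", "") for col in results.get("schema", [])]
        rows = results.get("data", [])
        lines = ["\t".join(columns)]
        lines += ["\t".join(str(value) for value in row) for row in rows]
        return "\n".join(lines)
    return f"unhandled resultType: {kind!r}"


# Hypothetical table response, trimmed to the fields used here.
table_status = {
    "status": "Finished",
    "results": {
        "resultType": "table",
        "schema": [{"name": "a"}, {"name": "b"}],
        "data": [[1, "x"], [2, "y"]],
    },
}
print(summarize_results(table_status))
```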

### 4. Follow-up commands reuse the context

State (variables, imports, `%pip install`-ed packages) persists across commands sharing the same `contextId`:

```bash
# Variables from first command still available
python <SKILL_ROOT>/scripts/compute.py execute-code \
--code "print(df.shape)" \
--compute-type cluster \
--cluster-id "1234-567890-abcdef" \
--context-id "ctx_abc123"
CMD2=$(databricks api post /api/1.2/commands/execute --json '{
"language": "python",
"clusterId": "1234-567890-abcdef",
"contextId": "'"$CTX"'",
"command": "print(df.shape)"
}' | jq -r '.id')
# poll as above
```

### Auto-Select Best Running Cluster
### 5. (Optional) Destroy the context

Contexts auto-expire when the cluster terminates. Destroy explicitly when you're done with a session:

```bash
# Get best running cluster
python <SKILL_ROOT>/scripts/compute.py list-compute --auto-select
# Returns: {"cluster_id": "1234-567890-abcdef"}

# Then execute on it
python <SKILL_ROOT>/scripts/compute.py execute-code \
--code "spark.range(100).show()" \
--compute-type cluster \
--cluster-id "1234-567890-abcdef"
databricks api post /api/1.2/contexts/destroy --json '{
"clusterId": "1234-567890-abcdef",
"contextId": "'"$CTX"'"
}'
```

## Language Support

The `language` field on context-create + command-execute controls the runtime:

```bash
# Scala
python <SKILL_ROOT>/scripts/compute.py execute-code --code 'println("Hello")' --compute-type cluster --language scala --cluster-id ...
databricks api post /api/1.2/contexts/create --json '{"language":"scala","clusterId":"..."}'

# SQL
python <SKILL_ROOT>/scripts/compute.py execute-code --code "SELECT * FROM table LIMIT 10" --compute-type cluster --language sql --cluster-id ...
databricks api post /api/1.2/contexts/create --json '{"language":"sql","clusterId":"..."}'

# R
python <SKILL_ROOT>/scripts/compute.py execute-code --code 'print("Hello")' --compute-type cluster --language r --cluster-id ...
databricks api post /api/1.2/contexts/create --json '{"language":"r","clusterId":"..."}'
```

Each language needs its own context on the same cluster.

## Installing Libraries

Install pip packages directly in the execution context:

```bash
python <SKILL_ROOT>/scripts/compute.py execute-code \
--code "%pip install faker" \
--compute-type cluster \
--cluster-id "..." \
--context-id "..."
```

If needed, restart Python to pick up new packages:
```bash
python <SKILL_ROOT>/scripts/compute.py execute-code \
--code "dbutils.library.restartPython()" \
--compute-type cluster \
--cluster-id "..." \
--context-id "..."
databricks api post /api/1.2/commands/execute --json '{
"language":"python","clusterId":"...","contextId":"...",
"command":"%pip install faker"
}'
```

## Context Lifecycle
If needed, restart Python in the same context to pick up new packages:

**Keep alive (default):** Context persists until cluster terminates.

**Destroy when done:**
```bash
python <SKILL_ROOT>/scripts/compute.py execute-code \
--code "print('Done!')" \
--compute-type cluster \
--cluster-id "..." \
--destroy-context
databricks api post /api/1.2/commands/execute --json '{
"language":"python","clusterId":"...","contextId":"...",
"command":"dbutils.library.restartPython()"
}'
```

## Managing Clusters

Two equivalent paths: the standalone script (convenience wrapper) or the raw `databricks` CLI (more fields exposed). Prefer the script for the common operations listed here.
All cluster lifecycle goes through `databricks clusters`:

```bash
# List all clusters
python <SKILL_ROOT>/scripts/compute.py list-compute --resource clusters
# List all clusters (full output)
databricks clusters list --cluster-sources UI,API --output json

# Get specific cluster status
python <SKILL_ROOT>/scripts/compute.py list-compute --cluster-id "1234-567890-abcdef"
# Get one cluster's state
databricks clusters get <CLUSTER_ID> | jq '{state, cluster_id, cluster_name}'

# Start a cluster (WITH USER APPROVAL ONLY - costs money, 3-8min startup)
python <SKILL_ROOT>/scripts/compute.py manage-cluster --action start --cluster-id "1234-567890-abcdef"
# Start a cluster (WITH USER APPROVAL ONLY: costs money, 3-8 min startup)
databricks clusters start <CLUSTER_ID>

# Terminate a cluster (reversible)
python <SKILL_ROOT>/scripts/compute.py manage-cluster --action terminate --cluster-id "1234-567890-abcdef"
# Terminate (reversible — cluster definition kept, state lost)
databricks clusters delete <CLUSTER_ID>

# Create a new cluster
python <SKILL_ROOT>/scripts/compute.py manage-cluster --action create --name "my-cluster" --num-workers 2
```

### Filter running interactive clusters only (raw CLI)
# Permanent delete (irreversible)
databricks clusters permanent-delete <CLUSTER_ID>

Useful before asking the user which cluster to reuse. `--cluster-sources UI,API` excludes job clusters (which would otherwise dominate the list on busy workspaces):
# Restart
databricks clusters restart <CLUSTER_ID>

```bash
databricks clusters list --cluster-sources UI,API --output json \
| jq '.[] | select(.state == "RUNNING")'
# Resize
databricks clusters resize <CLUSTER_ID> --num-workers 4
```

### Create with a full spec (raw CLI)

The script's `manage-cluster --action create` is fine for quick defaults; for full control (DBR version, instance type, tags) use the raw CLI:
### Create with a full spec

```bash
# SPARK_VERSION is positional; custom_tags recommended for resource tracking
# SPARK_VERSION is positional. custom_tags recommended for resource tracking.
databricks clusters create 15.4.x-scala2.12 --json '{
"cluster_name": "my-cluster",
"node_type_id": "i3.xlarge",
@@ -170,13 +194,21 @@ databricks clusters create 15.4.x-scala2.12 --json '{
}'
```

Discover node types and DBR versions:

```bash
databricks clusters list-node-types | jq '.node_types[] | {node_type_id, memory_mb, num_cores}'
databricks clusters spark-versions | jq '.versions[] | {key, name}'
```

## Common Issues

| Issue | Solution |
|-------|----------|
| "No running cluster" | Ask user to start or use serverless |
| Context not found | Context expired; create new one |
| Library not found | `%pip install <library>` then restart Python if needed |
| `Context not found` | Context expired (cluster restarted, or destroyed); create a new one |
| Library not found mid-session | `%pip install <library>`, then `dbutils.library.restartPython()` if needed |
| Command stuck in `Running` | Send `databricks api post /api/1.2/commands/cancel --json '{"clusterId":"...","contextId":"...","commandId":"..."}'` |

## When NOT to Use
