
feat(benchmarks): implement Kaggle client (push/run functionality) #960

Merged
dolaameng merged 10 commits into main from dolaameng/benchmarks-cli-push-run
Apr 10, 2026
Conversation


@dolaameng dolaameng commented Apr 8, 2026

Benchmarks CLI Reference (push & run)

The benchmarks CLI manages benchmark tasks — registering evaluation code, scheduling runs against models, monitoring progress, and downloading results.

Aliases: kaggle benchmarks or kaggle b

All task subcommands are under kaggle benchmarks tasks (alias: kaggle b t).


Commands

push — Register a task

Upload a Python source file as a benchmark task definition. The file is expected to be a .py file with percent delimiters (e.g., # %%). The CLI converts it to an .ipynb file before uploading. If the task already exists, it creates a new version.

kaggle b t push <task> -f <file>
| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| file | `-f, --file` | Yes | Path to the Python source file defining the task |

Behavior:

  1. Validates the file exists and has a .py extension.
  2. Reads the source file and parses it with Python's ast module to extract task names from @task decorators (supports both @task and @kbench.task styles, as well as @task(name="...") with explicit names).
    • When an explicit name= keyword is provided, that name is used.
    • When no explicit name is provided, the function name is title-cased with underscores replaced by spaces (e.g. my_test_task → "My Test Task").
  3. Validates that the file contains at least one @task decorator. If none are found, raises ValueError and stops.
  4. Validates that the given task name matches one of the task names extracted from the file.
  5. Converts the .py file content to .ipynb format (Jupyter Notebook) using jupytext (assuming percent format), and adds a Python 3 kernelspec to the notebook metadata.
  6. Checks the server for an existing task with the same slug:
    • If the task exists and its creation_state is QUEUED or RUNNING (i.e. a previous version is still being built), the push is rejected with ValueError.
    • If the task exists and is in COMPLETED or ERRORED state, the push proceeds (creates a new version).
    • If the task does not exist (404), the push proceeds (creates a new task).
  7. Sends the notebook content (JSON string) to create_benchmark_task.
  8. If the server returns an error message in the response, raises ValueError with the error details.
  9. On success, prints the task slug and its URL.
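
The decorator scan in steps 2–4 can be sketched with Python's ast module. This is a minimal illustration of the described behavior, not the CLI's actual implementation; the helper name extract_task_names is hypothetical:

```python
import ast

def extract_task_names(source: str) -> list[str]:
    """Collect task names from @task / @kbench.task decorators (a sketch)."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for dec in node.decorator_list:
            # Unwrap @task(...) calls so we can inspect keyword arguments.
            call = dec if isinstance(dec, ast.Call) else None
            target = call.func if call else dec
            # Match a bare `task` name or attribute access like `kbench.task`.
            dec_name = (target.id if isinstance(target, ast.Name)
                        else target.attr if isinstance(target, ast.Attribute)
                        else None)
            if dec_name != "task":
                continue
            explicit = None
            if call:
                for kw in call.keywords:
                    if kw.arg == "name" and isinstance(kw.value, ast.Constant):
                        explicit = kw.value.value
            # Fall back to the title-cased function name (step 2's rule).
            names.append(explicit or node.name.replace("_", " ").title())
    return names
```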

Errors:

  • ValueError: File <path> does not exist — file path is invalid.
  • ValueError: File <path> must be a .py file — file is not a Python file.
  • ValueError: No @task decorators found in file <path>. The file must define at least one task. — the file does not contain any @task-decorated functions.
  • ValueError: Task '<name>' not found in file <path>. Found tasks: ... — the task name doesn't match any @task-decorated function in the file.
  • ValueError: Task '<name>' is currently being created (pending). Cannot push now. — a previous version of this task is still being processed by the server.
  • ValueError: Failed to push task: <error> — the server returned an error message in the response.
  • HTTPError — server-side error (e.g. authentication failure, permission denied).

Example:

kaggle b t push math-eval -f tasks/math_eval.py

run — Schedule task runs

Schedule benchmark task execution against one or more models.

kaggle b t run <task> [-m <model> ...] [--wait]
| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| model | `-m, --model` | No | Model slug(s) to run against. Accepts multiple space-separated values |
| wait | `--wait` | No | Wait for runs to complete. Can specify a timeout in seconds (0 or omitted = indefinite) |
| poll_interval | `--poll-interval` | No | Seconds between status polls when using `--wait` (default: 10) |

Behavior:

  1. Task readiness check: Before scheduling, verifies that the task exists and its creation_state is COMPLETED. If the task is not ready:

    • For ERRORED tasks, the error message includes the task info for debugging.
    • For other non-completed states (e.g. QUEUED, RUNNING), raises ValueError indicating the task is not ready to run.
  2. Model selection: If no -m is provided, fetches the list of available benchmark models via list_benchmark_models and prompts the user interactively:

    No model specified. 5 model(s) available:
      1. gemini-pro (Gemini Pro)
      2. gemma-2b (Gemma 2B)
    Enter model numbers (comma-separated), 'all':
    
    • Enter comma-separated numbers (e.g. 1,3) to select specific models.
    • Enter all to run against every available model.
    • When there are more than 20 models, the list is paginated. Use n for next page and p for previous page.
    • Invalid input (non-numeric, out-of-range index) raises ValueError.
    • If no benchmark models exist on the server, raises ValueError: No benchmark models available. Cannot schedule runs.
  3. Scheduling: Calls batch_schedule_benchmark_task_runs with the task slug and selected model slugs. Output:

    Submitted run(s) for task 'math-eval'.
      gemini-pro: Scheduled
      gemma-2b: Scheduled
      gemini-flash: Skipped (<reason>)
    
  4. Waiting (--wait): After scheduling, if --wait is specified, polls list_benchmark_task_runs at a fixed interval (default 10 seconds, configurable via --poll-interval) until all runs reach a terminal state (COMPLETED or ERRORED) or the timeout is reached. Output while waiting:

    Waiting for run(s) to complete...
      2 run(s) still in progress...
      1 run(s) still in progress...
    All runs completed:
      gemini-pro: COMPLETED
      gemma-2b: ERRORED
    
    • If a timeout (in seconds) is specified and reached, it stops waiting and prints: Timed out waiting for runs after <timeout> seconds.
    • If 0 or no value is specified for --wait, it waits indefinitely.
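
The --wait loop in step 4 amounts to polling until every run reaches a terminal state. A minimal sketch: fetch_runs stands in for a call to list_benchmark_task_runs, and the helper name wait_for_runs is hypothetical:

```python
import time

TERMINAL_STATES = {"COMPLETED", "ERRORED"}

def wait_for_runs(fetch_runs, timeout=0, poll_interval=10, sleep=time.sleep):
    """Poll until all runs are terminal. `fetch_runs` returns
    {model_slug: state}. Returns (final_runs, timed_out)."""
    deadline = time.monotonic() + timeout if timeout else None
    while True:
        runs = fetch_runs()
        pending = [m for m, s in runs.items() if s not in TERMINAL_STATES]
        if not pending:
            return runs, False  # all runs reached a terminal state
        if deadline is not None and time.monotonic() >= deadline:
            return runs, True  # timeout reached while runs were in progress
        print(f"  {len(pending)} run(s) still in progress...")
        sleep(poll_interval)
```

Injecting the sleep function keeps the loop testable without real delays.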

Errors:

  • ValueError: Task '<name>' is not ready to run (status: <state>). Only completed tasks can be run. — the task has not finished building (or errored during build).
  • ValueError: No benchmark models available. Cannot schedule runs. — no models exist on the server and none were specified via -m.
  • ValueError: Invalid selection: <input> — the user entered non-numeric or out-of-range input during interactive model selection.
  • HTTPError — server-side error (task not found, authentication failure, etc.).
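
The interactive selection in step 2 (and the Invalid selection error above) comes down to parsing either 'all' or comma-separated 1-based indices. A sketch, with parse_model_selection as a hypothetical helper name:

```python
def parse_model_selection(raw: str, models: list[str]) -> list[str]:
    """Resolve prompt input to model slugs: 'all' or e.g. '1,3'."""
    if raw.strip().lower() == "all":
        return list(models)
    chosen = []
    for token in raw.split(","):
        token = token.strip()
        # Reject non-numeric and out-of-range entries, per step 2.
        if not token.isdigit() or not (1 <= int(token) <= len(models)):
            raise ValueError(f"Invalid selection: {raw}")
        chosen.append(models[int(token) - 1])
    return chosen
```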

Examples:

# Run against specific models
kaggle b t run math-eval -m gemini-pro gemma-2b

# Run and wait for completion
kaggle b t run math-eval -m gemini-pro --wait

# Wait with a custom poll interval (30 seconds)
kaggle b t run math-eval -m gemini-pro --wait --poll-interval 30

# Wait with a timeout (60 seconds)
kaggle b t run math-eval -m gemini-pro --wait 60

# Interactive model selection (prompts user)
kaggle b t run math-eval

End-to-end test

https://paste.googleplex.com/6483737513689088

@dolaameng dolaameng marked this pull request as draft April 8, 2026 22:25
@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-push-run branch from fdd2f5a to 2f5ef49 Compare April 9, 2026 03:24
@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-push-run branch from 2f5ef49 to bdbef87 Compare April 9, 2026 17:07

@andrewmwang andrewmwang left a comment


Nice!

@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-push-run branch from 6cfc3e1 to 2504099 Compare April 9, 2026 21:06
@dolaameng dolaameng marked this pull request as ready for review April 9, 2026 22:21
@dolaameng dolaameng requested review from jeward414, nl917 and rosbo April 9, 2026 22:24
@dolaameng dolaameng requested review from jmasukawa and lucyhe April 10, 2026 18:28

@rosbo rosbo left a comment


Great work. Just a few small comments.

command_models_update = "Update a model"

# Benchmarks commands
command_benchmarks_tasks_push = "Register a task from a Python source file"

It supports creating or updating an existing task. Not sure if register is the best verb here... You could use "create or update a task ...".

Contributor Author


Good point. Done.

if task not in task_names:
raise ValueError(f"Task '{task}' not found in file {file}. Found tasks: {', '.join(task_names)}")

def benchmarks_tasks_push_cli(self, task, file):


Why are you passing "task" as a parameter to the CLI method instead of reading it from the Python file?

Contributor Author


Yes, it's because users can currently define multiple tasks in a single file. We need the user to decide which one to create.

@@ -0,0 +1,18 @@
"""Shared test configuration for kaggle CLI tests.


Does this get imported automatically by pytest?

Contributor Author


Yes. I think pytest auto-imports it for tests in the same directory. This is actually to avoid the import error from _introspect_token. The old username and password work just fine.

@dolaameng dolaameng requested a review from rosbo April 10, 2026 21:40
@dolaameng dolaameng merged commit f74b05b into main Apr 10, 2026
5 checks passed
@dolaameng dolaameng deleted the dolaameng/benchmarks-cli-push-run branch April 10, 2026 21:58
