
feat(benchmarks): implement Kaggle client (push/run functionality) #960

Merged
dolaameng merged 10 commits into main from dolaameng/benchmarks-cli-push-run
Apr 10, 2026
Conversation


@dolaameng dolaameng commented Apr 8, 2026

Benchmarks CLI Reference (push & run)

The benchmarks CLI manages benchmark tasks — registering evaluation code, scheduling runs against models, monitoring progress, and downloading results.

Aliases: kaggle benchmarks or kaggle b

All task subcommands are under kaggle benchmarks tasks (alias: kaggle b t).


Commands

push — Register a task

Upload a Python source file as a benchmark task definition. The file is expected to be a .py file with percent delimiters (e.g., # %%). The CLI converts it to an .ipynb file before uploading. If the task already exists, it creates a new version.

kaggle b t push <task> -f <file>
| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| file | `-f, --file` | Yes | Path to the Python source file defining the task |

Behavior:

  1. Validates the file exists and has a .py extension.
  2. Reads the source file and parses it with Python's ast module to extract task names from @task decorators (supports both @task and @kbench.task styles, as well as @task(name="...") with explicit names).
    • When an explicit name= keyword is provided, that name is used.
    • When no explicit name is provided, the function name is title-cased with underscores replaced by spaces (e.g. my_test_task → "My Test Task").
  3. Validates that the file contains at least one @task decorator. If none are found, raises ValueError and stops.
  4. Validates that the given task name matches one of the task names extracted from the file.
  5. Converts the .py file content to .ipynb format (Jupyter Notebook) using jupytext (assuming percent format), and adds a Python 3 kernelspec to the notebook metadata.
  6. Checks the server for an existing task with the same slug:
    • If the task exists and its creation_state is QUEUED or RUNNING (i.e. a previous version is still being built), the push is rejected with ValueError.
    • If the task exists and is in COMPLETED or ERRORED state, the push proceeds (creates a new version).
    • If the task does not exist (404), the push proceeds (creates a new task).
  7. Sends the notebook content (JSON string) to create_benchmark_task.
  8. If the server returns an error message in the response, raises ValueError with the error details.
  9. On success, prints the task slug and its URL.
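
The decorator scan in steps 2–4 can be sketched with Python's ast module. This is a minimal illustration of the described behavior, not the CLI's actual implementation; the helper name extract_task_names is hypothetical:

```python
import ast

def extract_task_names(source: str) -> list[str]:
    """Collect task names from @task / @kbench.task decorators (a sketch)."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for dec in node.decorator_list:
            # Unwrap @task(...) calls so we can inspect keyword arguments.
            call = dec if isinstance(dec, ast.Call) else None
            target = call.func if call else dec
            # Match a bare `task` name or attribute access like `kbench.task`.
            dec_name = (target.id if isinstance(target, ast.Name)
                        else target.attr if isinstance(target, ast.Attribute)
                        else None)
            if dec_name != "task":
                continue
            explicit = None
            if call:
                for kw in call.keywords:
                    if kw.arg == "name" and isinstance(kw.value, ast.Constant):
                        explicit = kw.value.value
            # Fall back to the title-cased function name (step 2's rule).
            names.append(explicit or node.name.replace("_", " ").title())
    return names
```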

Errors:

  • ValueError: File <path> does not exist — file path is invalid.
  • ValueError: File <path> must be a .py file — file is not a Python file.
  • ValueError: No @task decorators found in file <path>. The file must define at least one task. — the file does not contain any @task-decorated functions.
  • ValueError: Task '<name>' not found in file <path>. Found tasks: ... — the task name doesn't match any @task-decorated function in the file.
  • ValueError: Task '<name>' is currently being created (pending). Cannot push now. — a previous version of this task is still being processed by the server.
  • ValueError: Failed to push task: <error> — the server returned an error message in the response.
  • HTTPError — server-side error (e.g. authentication failure, permission denied).

Example:

kaggle b t push math-eval -f tasks/math_eval.py

run — Schedule task runs

Schedule benchmark task execution against one or more models.

kaggle b t run <task> [-m <model> ...] [--wait]
| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| model | `-m, --model` | No | Model slug(s) to run against. Accepts multiple space-separated values |
| wait | `--wait` | No | Wait for runs to complete. Can specify a timeout in seconds (0 or omitted = indefinite) |
| poll_interval | `--poll-interval` | No | Seconds between status polls when using `--wait` (default: 10) |

Behavior:

  1. Task readiness check: Before scheduling, verifies that the task exists and its creation_state is COMPLETED. If the task is not ready:

    • For ERRORED tasks, the error message includes the task info for debugging.
    • For other non-completed states (e.g. QUEUED, RUNNING), raises ValueError indicating the task is not ready to run.
  2. Model selection: If no -m is provided, fetches the list of available benchmark models via list_benchmark_models and prompts the user interactively:

    No model specified. 5 model(s) available:
      1. gemini-pro (Gemini Pro)
      2. gemma-2b (Gemma 2B)
    Enter model numbers (comma-separated), 'all':
    
    • Enter comma-separated numbers (e.g. 1,3) to select specific models.
    • Enter all to run against every available model.
    • When there are more than 20 models, the list is paginated. Use n for next page and p for previous page.
    • Invalid input (non-numeric, out-of-range index) raises ValueError.
    • If no benchmark models exist on the server, raises ValueError: No benchmark models available. Cannot schedule runs.
  3. Scheduling: Calls batch_schedule_benchmark_task_runs with the task slug and selected model slugs. Output:

    Submitted run(s) for task 'math-eval'.
      gemini-pro: Scheduled
      gemma-2b: Scheduled
      gemini-flash: Skipped (<reason>)
    
  4. Waiting (--wait): After scheduling, if --wait is specified, polls list_benchmark_task_runs at a fixed interval (default 10 seconds, configurable via --poll-interval) until all runs reach a terminal state (COMPLETED or ERRORED) or the timeout is reached. Output while waiting:

    Waiting for run(s) to complete...
      2 run(s) still in progress...
      1 run(s) still in progress...
    All runs completed:
      gemini-pro: COMPLETED
      gemma-2b: ERRORED
    
    • If a timeout (in seconds) is specified and reached, it stops waiting and prints: Timed out waiting for runs after <timeout> seconds.
    • If 0 or no value is specified for --wait, it waits indefinitely.
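
The --wait loop in step 4 amounts to polling until every run reaches a terminal state. A minimal sketch: fetch_runs stands in for a call to list_benchmark_task_runs, and the helper name wait_for_runs is hypothetical:

```python
import time

TERMINAL_STATES = {"COMPLETED", "ERRORED"}

def wait_for_runs(fetch_runs, timeout=0, poll_interval=10, sleep=time.sleep):
    """Poll until all runs are terminal. `fetch_runs` returns
    {model_slug: state}. Returns (final_runs, timed_out)."""
    deadline = time.monotonic() + timeout if timeout else None
    while True:
        runs = fetch_runs()
        pending = [m for m, s in runs.items() if s not in TERMINAL_STATES]
        if not pending:
            return runs, False  # all runs reached a terminal state
        if deadline is not None and time.monotonic() >= deadline:
            return runs, True  # timeout reached while runs were in progress
        print(f"  {len(pending)} run(s) still in progress...")
        sleep(poll_interval)
```

Injecting the sleep function keeps the loop testable without real delays.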

Errors:

  • ValueError: Task '<name>' is not ready to run (status: <state>). Only completed tasks can be run. — the task has not finished building (or errored during build).
  • ValueError: No benchmark models available. Cannot schedule runs. — no models exist on the server and none were specified via -m.
  • ValueError: Invalid selection: <input> — the user entered non-numeric or out-of-range input during interactive model selection.
  • HTTPError — server-side error (task not found, authentication failure, etc.).
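
The interactive selection in step 2 (and the Invalid selection error above) comes down to parsing either 'all' or comma-separated 1-based indices. A sketch, with parse_model_selection as a hypothetical helper name:

```python
def parse_model_selection(raw: str, models: list[str]) -> list[str]:
    """Resolve prompt input to model slugs: 'all' or e.g. '1,3'."""
    if raw.strip().lower() == "all":
        return list(models)
    chosen = []
    for token in raw.split(","):
        token = token.strip()
        # Reject non-numeric and out-of-range entries, per step 2.
        if not token.isdigit() or not (1 <= int(token) <= len(models)):
            raise ValueError(f"Invalid selection: {raw}")
        chosen.append(models[int(token) - 1])
    return chosen
```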

Examples:

# Run against specific models
kaggle b t run math-eval -m gemini-pro gemma-2b

# Run and wait for completion
kaggle b t run math-eval -m gemini-pro --wait

# Wait with a custom poll interval (30 seconds)
kaggle b t run math-eval -m gemini-pro --wait --poll-interval 30

# Wait with a timeout (60 seconds)
kaggle b t run math-eval -m gemini-pro --wait 60

# Interactive model selection (prompts user)
kaggle b t run math-eval

End-to-end test

https://paste.googleplex.com/6483737513689088

@dolaameng dolaameng marked this pull request as draft April 8, 2026 22:25
@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-push-run branch from fdd2f5a to 2f5ef49 Compare April 9, 2026 03:24
@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-push-run branch from 2f5ef49 to bdbef87 Compare April 9, 2026 17:07

@andrewmwang andrewmwang left a comment


Nice!

@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-push-run branch from 6cfc3e1 to 2504099 Compare April 9, 2026 21:06
@dolaameng dolaameng marked this pull request as ready for review April 9, 2026 22:21
@dolaameng dolaameng requested review from jeward414, nl917 and rosbo April 9, 2026 22:24
@dolaameng dolaameng requested review from jmasukawa and lucyhe April 10, 2026 18:28

@rosbo rosbo left a comment


Great work. Just a few small comments.

command_models_update = "Update a model"

# Benchmarks commands
command_benchmarks_tasks_push = "Register a task from a Python source file"

It supports creating or updating an existing task. Not sure if register is the best verb here... You could use "create or update a task ...".

Contributor Author


Good point. Done.

if task not in task_names:
raise ValueError(f"Task '{task}' not found in file {file}. Found tasks: {', '.join(task_names)}")

def benchmarks_tasks_push_cli(self, task, file):


Why are you passing "task" as a parameter to the CLI method instead of reading it from the Python file?

Contributor Author


Yes, it's because users can currently define multiple tasks in a single file. We need the user to decide which one to create.

@@ -0,0 +1,18 @@
"""Shared test configuration for kaggle CLI tests.


Does this get imported automatically by pytest?

Contributor Author


Yes. I think pytest auto-imports it for tests in the same directory. This is actually to avoid the import error from _introspect_token. The old username and password work just fine.

@dolaameng dolaameng requested a review from rosbo April 10, 2026 21:40
@dolaameng dolaameng merged commit f74b05b into main Apr 10, 2026
5 checks passed
@dolaameng dolaameng deleted the dolaameng/benchmarks-cli-push-run branch April 10, 2026 21:58
