feat(benchmarks): implement Kaggle client (push/run functionality) #960
Conversation
rosbo left a comment
Great work. Just a few small comments.
src/kaggle/cli.py (Outdated)

```python
command_models_update = "Update a model"

# Benchmarks commands
command_benchmarks_tasks_push = "Register a task from a Python source file"
```
It supports creating or updating an existing task. Not sure "register" is the best verb here... You could use "create or update a task ...".
Good point. Done.
```python
if task not in task_names:
    raise ValueError(f"Task '{task}' not found in file {file}. Found tasks: {', '.join(task_names)}")


def benchmarks_tasks_push_cli(self, task, file):
```
Why are you passing "task" as a parameter to the CLI method instead of reading it from the Python file?
Yes, it's because users can currently define multiple tasks in a single file. We need the user to decide which one to create.
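Since a single file may define several tasks, the CLI has to enumerate them before the user can pick one. A minimal sketch of such a scan using only the standard-library `ast` module; the helper name and exact matching rules here are illustrative, not the client's actual internals:

```python
import ast

def extract_task_names(source: str) -> list[str]:
    """Collect display names of functions decorated with @task.

    Supports bare @task, attribute forms like @kbench.task, and
    @task(name="...") with an explicit name; otherwise the function
    name is title-cased (my_test_task -> "My Test Task").
    """
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for dec in node.decorator_list:
            # For @task(...) the decorator node is a Call; unwrap it.
            target = dec.func if isinstance(dec, ast.Call) else dec
            dec_name = (target.attr if isinstance(target, ast.Attribute)
                        else getattr(target, "id", None))
            if dec_name != "task":
                continue
            explicit = None
            if isinstance(dec, ast.Call):
                for kw in dec.keywords:
                    if kw.arg == "name" and isinstance(kw.value, ast.Constant):
                        explicit = kw.value.value
            names.append(explicit or node.name.replace("_", " ").title())
    return names

SOURCE = '''
@task
def my_test_task():
    pass

@kbench.task(name="Custom Name")
def other():
    pass
'''
print(extract_task_names(SOURCE))  # ['My Test Task', 'Custom Name']
```

Because only the AST is inspected, the file never needs to be imported, so undefined names like `kbench` in the example are harmless at scan time.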
```
@@ -0,0 +1,18 @@
"""Shared test configuration for kaggle CLI tests.
```
Does this get imported automatically by pytest?
Yes. I think pytest auto-imports it for the tests in the same directory. This is actually to avoid the import error from `_introspect_token`. The old username and password work just fine.
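For reference, a minimal sketch of such a `conftest.py`: pytest discovers and imports it automatically for tests in the same directory and below, so nothing has to import it explicitly. The environment-variable names follow the Kaggle client's usual convention; the placeholder values are assumptions:

```python
# conftest.py: pytest imports this file automatically for all tests
# in this directory (and subdirectories); no explicit import needed.
import os

# Hypothetical placeholder credentials so that constructing the API
# client in tests does not fail while introspecting a real token.
os.environ.setdefault("KAGGLE_USERNAME", "test-user")
os.environ.setdefault("KAGGLE_KEY", "test-key")
```

Using `setdefault` keeps any real credentials already present in the environment intact.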
## Benchmarks CLI Reference (push & run)

The benchmarks CLI manages benchmark tasks — registering evaluation code, scheduling runs against models, monitoring progress, and downloading results.

Aliases: `kaggle benchmarks` or `kaggle b`. All task subcommands are under `kaggle benchmarks tasks` (alias: `kaggle b t`).

### Commands
#### `push` — Register a task

Upload a Python source file as a benchmark task definition. The file is expected to be a `.py` file with percent delimiters (e.g., `# %%`). The CLI converts it to an `.ipynb` file before uploading. If the task already exists, it creates a new version.

| Argument | Flag | Description |
| --- | --- | --- |
| `task` | (positional) | Task slug (e.g. `math-eval`) |
| `file` | `-f`, `--file` | Path to the Python source file |

Behavior:
- Validates that the file exists and has a `.py` extension.
- Uses the `ast` module to extract task names from `@task` decorators (supports both `@task` and `@kbench.task` styles, as well as `@task(name="...")` with explicit names).
- If a `name=` keyword is provided, that name is used; otherwise the function name is title-cased (e.g. `my_test_task` → `"My Test Task"`).
- The file must contain at least one `@task` decorator. If none are found, raises `ValueError` and stops.
- Converts the `.py` file content to `.ipynb` (Jupyter Notebook) format using `jupytext` (assuming percent format), and adds a Python 3 kernelspec to the notebook metadata.
- If the task's `creation_state` is `QUEUED` or `RUNNING` (i.e. a previous version is still being built), the push is rejected with `ValueError`.
- If the task is in a `COMPLETED` or `ERRORED` state, the push proceeds (creates a new version).
- Uploads via `create_benchmark_task`.
- If the server reports a failure, raises `ValueError` with the error details.

Errors:
- `ValueError: File <path> does not exist` — file path is invalid.
- `ValueError: File <path> must be a .py file` — file is not a Python file.
- `ValueError: No @task decorators found in file <path>. The file must define at least one task.` — the file does not contain any `@task`-decorated functions.
- `ValueError: Task '<name>' not found in file <path>. Found tasks: ...` — the task name doesn't match any `@task`-decorated function in the file.
- `ValueError: Task '<name>' is currently being created (pending). Cannot push now.` — a previous version of this task is still being processed by the server.
- `ValueError: Failed to push task: <error>` — the server returned an error message in the response.
- `HTTPError` — server-side error (e.g. authentication failure, permission denied).
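The percent-format conversion step can be pictured with the simplified stand-in below. The real CLI delegates this to `jupytext`; `py_to_notebook` is a hypothetical helper that only illustrates how `# %%` markers map to notebook cells and where the Python 3 kernelspec ends up:

```python
def py_to_notebook(source: str) -> dict:
    """Split percent-delimited source (# %% markers) into notebook cells."""
    cells, current = [], []
    for line in source.splitlines():
        if line.strip().startswith("# %%"):
            if current:
                cells.append(current)
            current = []
        else:
            current.append(line)
    if current:
        cells.append(current)
    return {
        "nbformat": 4,
        "nbformat_minor": 5,
        # The CLI also pins a Python 3 kernelspec in the metadata.
        "metadata": {"kernelspec": {"name": "python3",
                                    "display_name": "Python 3",
                                    "language": "python"}},
        "cells": [
            {"cell_type": "code", "execution_count": None, "metadata": {},
             "outputs": [], "source": "\n".join(c).strip()}
            for c in cells
        ],
    }

nb = py_to_notebook("# %%\nx = 1\n# %%\nprint(x)\n")
print(len(nb["cells"]))  # 2
```

Unlike this sketch, `jupytext` also handles markdown cells, cell metadata, and other source formats.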
#### `run` — Schedule task runs

Schedule benchmark task execution against one or more models.

| Argument | Flag | Description |
| --- | --- | --- |
| `task` | (positional) | Task slug (e.g. `math-eval`) |
| `model` | `-m`, `--model` | Model(s) to run against; prompts interactively when omitted |
| `wait` | `--wait` | Wait for runs to finish, with an optional timeout in seconds |
| `poll_interval` | `--poll-interval` | Polling interval in seconds; requires `--wait` (default: 10) |

Behavior:
- Task readiness check: before scheduling, verifies that the task exists and that its `creation_state` is `COMPLETED`. If the task is not ready:
  - For `ERRORED` tasks, the error message includes the task info for debugging.
  - For pending states (`QUEUED`, `RUNNING`), raises `ValueError` indicating the task is not ready to run.
-mis provided, fetches the list of available benchmark models vialist_benchmark_modelsand prompts the user interactively:1,3) to select specific models.allto run against every available model.nfor next page andpfor previous page.ValueError.ValueError: No benchmark models available. Cannot schedule runs.Scheduling: Calls
batch_schedule_benchmark_task_runswith the task slug and selected model slugs. Output:Waiting (
--wait): After scheduling, if--waitis specified, pollslist_benchmark_task_runsat a fixed interval (default 10 seconds, configurable via--poll-interval) until all runs reach a terminal state (COMPLETEDorERRORED) or the timeout is reached. Output while waiting:Timed out waiting for runs after <timeout> seconds.0or no value is specified for--wait, it waits indefinitely.Errors:
- `ValueError: Task '<name>' is not ready to run (status: <state>). Only completed tasks can be run.` — the task has not finished building (or errored during build).
- `ValueError: No benchmark models available. Cannot schedule runs.` — no models exist on the server and none were specified via `-m`.
- `ValueError: Invalid selection: <input>` — the user entered non-numeric or out-of-range input during interactive model selection.
- `HTTPError` — server-side error (task not found, authentication failure, etc.).
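The two mechanics above, interactive model selection and the `--wait` polling loop, can be sketched as follows. `parse_selection` and `wait_for_runs` are hypothetical helpers, not the client's actual functions, and pagination is omitted for brevity:

```python
import time

TERMINAL_STATES = {"COMPLETED", "ERRORED"}

def parse_selection(user_input, models):
    """Resolve 'all' or comma-separated 1-based indices to model slugs."""
    text = user_input.strip().lower()
    if text == "all":
        return list(models)
    try:
        indices = [int(part) for part in text.split(",")]
    except ValueError:
        raise ValueError(f"Invalid selection: {user_input}") from None
    if any(i < 1 or i > len(models) for i in indices):
        raise ValueError(f"Invalid selection: {user_input}")
    return [models[i - 1] for i in indices]

def wait_for_runs(list_runs, poll_interval=10, timeout=None, sleep=time.sleep):
    """Poll list_runs() until every run is terminal or the timeout elapses.

    list_runs returns [(run_id, state), ...]; timeout=None waits forever.
    """
    start = time.monotonic()
    while True:
        runs = list_runs()
        if runs and all(state in TERMINAL_STATES for _, state in runs):
            return runs
        if timeout is not None and time.monotonic() - start >= timeout:
            raise TimeoutError(f"Timed out waiting for runs after {timeout} seconds.")
        sleep(poll_interval)

# Select models 1 and 3, then simulate two polls before completion.
chosen = parse_selection("1,3", ["model-a", "model-b", "model-c"])
print(chosen)  # ['model-a', 'model-c']
snapshots = iter([[("run-1", "RUNNING")], [("run-1", "COMPLETED")]])
done = wait_for_runs(lambda: next(snapshots), poll_interval=0, sleep=lambda _: None)
print(done)  # [('run-1', 'COMPLETED')]
```

Injecting `list_runs` and `sleep` as parameters keeps the loop testable without touching the network or the real clock, which is the same reason the simulated snapshots above work.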
End-to-end test: https://paste.googleplex.com/6483737513689088