Commit f74b05b
feat(benchmarks): implement Kaggle client (push/run functionality) (#960)
# Benchmarks CLI Reference (push & run)
The benchmarks CLI manages benchmark tasks — registering evaluation
code, scheduling runs against models, monitoring progress, and
downloading results.
**Aliases**: `kaggle benchmarks` or `kaggle b`
All task subcommands are under `kaggle benchmarks tasks`
(alias: `kaggle b t`).
---
## Commands
### `push` — Register a task
Upload a Python source file as a benchmark task definition. The file is
expected to be a `.py` file with percent delimiters (e.g., `# %%`). The
CLI converts it to an `.ipynb` file before uploading. If the task
already exists, it creates a new version.
```
kaggle b t push <task> -f <file>
```
| Parameter | Flag | Required | Description |
|-----------|------|----------|-------------|
| `task` | positional | **Yes** | Task name (e.g. `math-eval`) |
| `file` | `-f`, `--file` | **Yes** | Path to the Python source file defining the task |
**Behavior**:
1. Validates the file exists and has a `.py` extension.
2. Reads the source file and parses it with Python's `ast` module to
extract task names from `@task` decorators (supports both `@task` and
`@kbench.task` styles, as well as `@task(name="...")` with explicit
names).
   - When an explicit `name=` keyword is provided, that name is used.
   - When no explicit name is provided, the function name is title-cased
     with underscores replaced by spaces
     (e.g. `my_test_task` → `"My Test Task"`).
3. Validates that the file contains at least one `@task` decorator. If
none are found, raises `ValueError` and stops.
4. Validates that the given task name matches one of the task names
extracted from the file.
5. Converts the `.py` file content to `.ipynb` format (Jupyter Notebook)
using `jupytext` (assuming percent format), and adds a Python 3
kernelspec to the notebook metadata.
6. Checks the server for an existing task with the same slug:
   - If the task exists and its `creation_state` is `QUEUED` or `RUNNING`
     (i.e. a previous version is still being built), the push is
     **rejected** with `ValueError`.
   - If the task exists and is in `COMPLETED` or `ERRORED` state, the push
     proceeds (creates a new version).
   - If the task does not exist (404), the push proceeds (creates a new
     task).
7. Sends the notebook content (JSON string) to `create_benchmark_task`.
8. If the server returns an error message in the response, raises
`ValueError` with the error details.
9. On success, prints the task slug and its URL.
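Steps 2 to 4 can be sketched with the standard-library `ast` module. This is a minimal illustration under stated assumptions, not the code from this commit: the helper names `_is_task_decorator` and `extract_task_names` are hypothetical, and treating any `<prefix>.task` attribute as a task decorator is an assumption.

```python
import ast

# Hypothetical sample in percent format, used only to exercise the sketch.
SAMPLE = '''
# %%
@task
def my_test_task(llm):
    pass

# %%
@kbench.task(name="math-eval")
def anything(llm):
    pass

def helper():
    pass
'''


def _is_task_decorator(node: ast.expr) -> bool:
    """True for a bare `task` name or any `<prefix>.task` attribute."""
    if isinstance(node, ast.Name):
        return node.id == "task"
    if isinstance(node, ast.Attribute):
        return node.attr == "task"
    return False


def extract_task_names(source: str) -> list:
    """Collect task names from @task-decorated functions in `source`."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for dec in node.decorator_list:
            # `@task(...)` is a Call whose `func` is the decorator itself.
            target = dec.func if isinstance(dec, ast.Call) else dec
            if not _is_task_decorator(target):
                continue
            explicit = None
            if isinstance(dec, ast.Call):
                for kw in dec.keywords:
                    is_str = isinstance(kw.value, ast.Constant) and isinstance(
                        kw.value.value, str
                    )
                    if kw.arg == "name" and is_str:
                        explicit = kw.value.value
            # Fall back to the title-cased function name.
            names.append(explicit or node.name.replace("_", " ").title())
    return names


print(extract_task_names(SAMPLE))  # ['My Test Task', 'math-eval']
```

Because the file is only parsed and never executed, the decorators do not need to be importable at push time.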
**Errors**:
- `ValueError: File <path> does not exist` — file path is invalid.
- `ValueError: File <path> must be a .py file` — file is not a Python
file.
- `ValueError: No @task decorators found in file <path>. The file must
define at least one task.` — the file does not contain any
`@task`-decorated functions.
- `ValueError: Task '<name>' not found in file <path>. Found tasks: ...`
— the task name doesn't match any `@task`-decorated function in the
file.
- `ValueError: Task '<name>' is currently being created (pending).
Cannot push now.` — a previous version of this task is still being
processed by the server.
- `ValueError: Failed to push task: <error>` — the server returned an
error message in the response.
- `HTTPError` — server-side error (e.g. authentication failure,
permission denied).
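The state gate behind the "currently being created" error can be sketched as a small pure function. The name `check_pushable`, the use of `None` to stand for a 404 response, and the returned labels are assumptions made for illustration; only the state names and the error message come from the description above.

```python
from typing import Optional

# States in which a previous version is still being built.
PENDING_STATES = {"QUEUED", "RUNNING"}


def check_pushable(task_name: str, creation_state: Optional[str]) -> str:
    """Decide what a push would do for a task in the given state.

    `creation_state` is None when the server reports that the task
    does not exist (assumed mapping of the 404 case).
    """
    if creation_state is None:
        return "create"  # no existing task: push creates a new one
    if creation_state in PENDING_STATES:
        raise ValueError(
            f"Task '{task_name}' is currently being created (pending). "
            "Cannot push now."
        )
    # COMPLETED or ERRORED: push creates a new version.
    # Assumption: any other non-pending state is treated the same way.
    return "new_version"
```

Calling `check_pushable("math-eval", "RUNNING")` raises `ValueError`, while `"COMPLETED"` and `"ERRORED"` both allow the push to continue.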
**Example**:
```bash
kaggle b t push math-eval -f tasks/math_eval.py
```
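To make the conversion step concrete, the sketch below shows roughly what turning a percent-format file into notebook JSON involves. It is a stdlib-only stand-in, not `jupytext` (which the CLI uses): it treats every `# %%` cell as a code cell, ignores markdown cells, and the function name `percent_to_ipynb` and the exact kernelspec values are assumptions.

```python
import json
import re


def percent_to_ipynb(source: str) -> str:
    """Convert percent-format Python source to a notebook JSON string."""
    # A cell starts at each line beginning with "# %%"; drop the marker line.
    chunks = re.split(r"(?m)^# %%.*\n?", source)
    cells = [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": chunk.strip("\n"),
        }
        for chunk in chunks
        if chunk.strip()
    ]
    notebook = {
        "cells": cells,
        # Assumed values for a Python 3 kernelspec.
        "metadata": {
            "kernelspec": {
                "display_name": "Python 3",
                "language": "python",
                "name": "python3",
            }
        },
        "nbformat": 4,
        "nbformat_minor": 4,
    }
    return json.dumps(notebook)


example = "# %%\nimport math\n\n# %%\nprint(math.pi)\n"
print(len(json.loads(percent_to_ipynb(example))["cells"]))  # 2
```

The resulting JSON string corresponds to the notebook content that the push step sends to the server.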
### `run` — Schedule task runs
Schedule benchmark task execution against one or more models.
```
kaggle b t run <task> [-m <model> ...] [--wait [<timeout>]] [--poll-interval <seconds>]
```
| Parameter | Flag | Required | Description |
|-----------|------|----------|-------------|
| `task` | positional | **Yes** | Task name (e.g. `math-eval`) |
| `model` | `-m`, `--model` | No | Model slug(s) to run against. Accepts multiple space-separated values |
| `wait` | `--wait` | No | Wait for runs to complete. Can specify a timeout in seconds (0 or omit = indefinite) |
| `poll_interval` | `--poll-interval` | No | Seconds between status polls when using `--wait` (default: 10) |
**Behavior**:
1. **Task readiness check**: Before scheduling, verifies that the task
exists and its `creation_state` is `COMPLETED`. If the task is not
ready:
   - For `ERRORED` tasks, the error message includes the task info for
     debugging.
   - For other non-completed states (e.g. `QUEUED`, `RUNNING`), raises
     `ValueError` indicating the task is not ready to run.
2. **Model selection**: If no `-m` is provided, fetches the list of
available benchmark models via `list_benchmark_models` and prompts the
user interactively:
```
No model specified. 5 model(s) available:
1. gemini-pro (Gemini Pro)
2. gemma-2b (Gemma 2B)
Enter model numbers (comma-separated), 'all':
```
- Enter comma-separated numbers (e.g. `1,3`) to select specific models.
- Enter `all` to run against every available model.
- When there are more than 20 models, the list is paginated. Use `n` for
next page and `p` for previous page.
- Invalid input (non-numeric, out-of-range index) raises `ValueError`.
- If no benchmark models exist on the server, raises `ValueError: No
benchmark models available. Cannot schedule runs.`
3. **Scheduling**: Calls `batch_schedule_benchmark_task_runs` with the
task slug and selected model slugs. Output:
```
Submitted run(s) for task 'math-eval'.
gemini-pro: Scheduled
gemma-2b: Scheduled
gemini-flash: Skipped (<reason>)
```
4. **Waiting** (`--wait`): After scheduling, if `--wait` is specified,
polls `list_benchmark_task_runs` at a fixed interval (default **10
seconds**, configurable via `--poll-interval`) until all runs reach a
terminal state (`COMPLETED` or `ERRORED`) or the timeout is reached.
Output while waiting:
```
Waiting for run(s) to complete...
2 run(s) still in progress...
1 run(s) still in progress...
All runs completed:
gemini-pro: COMPLETED
gemma-2b: ERRORED
```
- If a timeout (in seconds) is specified and reached, it stops waiting
and prints: `Timed out waiting for runs after <timeout> seconds.`
- If `0` or no value is specified for `--wait`, it waits indefinitely.
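The waiting behavior can be sketched as a polling loop. This is an illustration under assumptions rather than the CLI's code: `wait_for_runs` and its `fetch_states` callback (returning a mapping of model slug to run state, in place of a real `list_benchmark_task_runs` call) are hypothetical, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = {"COMPLETED", "ERRORED"}


def wait_for_runs(
    fetch_states: Callable[[], Dict[str, str]],
    timeout: float = 0,
    poll_interval: float = 10,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every run is terminal; return False on timeout.

    A `timeout` of 0 means wait indefinitely.
    """
    start = clock()
    print("Waiting for run(s) to complete...")
    while True:
        states = fetch_states()
        pending = [m for m, s in states.items() if s not in TERMINAL_STATES]
        if not pending:
            print("All runs completed:")
            for model, state in states.items():
                print(f"  {model}: {state}")
            return True
        if timeout and clock() - start >= timeout:
            print(f"Timed out waiting for runs after {timeout} seconds.")
            return False
        print(f"  {len(pending)} run(s) still in progress...")
        sleep(poll_interval)
```

In this sketch the timeout is checked once per poll, so the loop can overshoot the requested timeout by up to one poll interval.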
**Errors**:
- `ValueError: Task '<name>' is not ready to run (status: <state>). Only
completed tasks can be run.` — the task has not finished building (or
errored during build).
- `ValueError: No benchmark models available. Cannot schedule runs.` —
no models exist on the server and none were specified via `-m`.
- `ValueError: Invalid selection: <input>` — the user entered
non-numeric or out-of-range input during interactive model selection.
- `HTTPError` — server-side error (task not found, authentication
failure, etc.).
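The interactive selection rules and the `Invalid selection` error can be sketched as a parser over the user's input. The function name `parse_model_selection` is hypothetical, and accepting `all` case-insensitively and dropping duplicate indices are assumptions; pagination (`n` and `p`) is left out.

```python
from typing import List


def parse_model_selection(text: str, model_slugs: List[str]) -> List[str]:
    """Map input such as "1,3" or "all" to a list of model slugs."""
    choice = text.strip()
    if choice.lower() == "all":
        return list(model_slugs)
    selected: List[str] = []
    for part in choice.split(","):
        part = part.strip()
        # Reject empty and non-numeric entries.
        if not part.isdecimal():
            raise ValueError(f"Invalid selection: {text}")
        index = int(part)
        # The menu is 1-based; reject anything outside it.
        if not 1 <= index <= len(model_slugs):
            raise ValueError(f"Invalid selection: {text}")
        slug = model_slugs[index - 1]
        if slug not in selected:
            selected.append(slug)
    return selected


models = ["gemini-pro", "gemma-2b", "gemini-flash"]
print(parse_model_selection("1,3", models))  # ['gemini-pro', 'gemini-flash']
```

Keeping the parsing separate from the prompt makes the invalid-input cases easy to test without simulating a terminal.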
**Examples**:
```bash
# Run against specific models
kaggle b t run math-eval -m gemini-pro gemma-2b
# Run and wait for completion
kaggle b t run math-eval -m gemini-pro --wait
# Wait with a custom poll interval (30 seconds)
kaggle b t run math-eval -m gemini-pro --wait --poll-interval 30
# Wait with a timeout (60 seconds)
kaggle b t run math-eval -m gemini-pro --wait 60
# Interactive model selection (prompts user)
kaggle b t run math-eval
```
---
### End to end test
https://paste.googleplex.com/64837375136890881

parent a08af17 commit f74b05b
7 files changed
Lines changed: 814 additions & 5 deletions