feat: Automatic path resolution in InputSchema/OutputSchema #555
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,101 @@ | ||
| # Path Reference Example | ||
|
|
||
| A Tesseract that copies files and directories from `input_path` to `output_path`. | ||
| It demonstrates how to use `Path` in Tesseract schemas and how to compose custom | ||
| Pydantic validators on top of the built-in path-handling behaviour. | ||
|
|
||
| ## What `Path` does in a schema | ||
|
|
||
| When you annotate a field with `Path`, the schema generation layer automatically | ||
| injects path-handling validators at runtime. | ||
|
|
||
| **Input `Path` fields** — caller sends a relative string, `apply` receives an absolute `Path`: | ||
|
|
||
| ``` | ||
| caller sends → "sample_8.json" | ||
| built-in resolves → Path("/tesseract/input_data/sample_8.json") (checked: exists) | ||
| apply sees → Path("/tesseract/input_data/sample_8.json") | ||
| ``` | ||
|
|
||
| - Rejects any path that would escape `input_path` (path traversal protection). | ||
| - Raises `FileNotFoundError` if the resolved path does not exist. | ||
| - Accepts both files **and** directories (use `InputFileReference` for files only). | ||
|
Contributor
Are we not planning to deprecate |
||
|
|
||
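The input-side behaviour listed above (resolve against `input_path`, reject traversal, require existence) can be sketched with the standard library alone. This is a minimal illustration, not the `tesseract_core` implementation: the function name `resolve_input_path` and the temporary directory standing in for `input_path` are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def resolve_input_path(base: Path, relative: str) -> Path:
    """Hypothetical helper: resolve `relative` against `base` and validate it."""
    base = base.resolve()
    candidate = (base / relative).resolve()
    # Path traversal protection: the resolved path must stay inside `base`.
    if not candidate.is_relative_to(base):
        raise ValueError(f"{relative!r} escapes {base}")
    # Files and directories are both accepted, but the path must exist.
    if not candidate.exists():
        raise FileNotFoundError(f"{candidate} does not exist")
    return candidate


# Demo against a throwaway directory standing in for input_path.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp)
    (input_dir / "sample_8.json").write_text("{}")

    resolved = resolve_input_path(input_dir, "sample_8.json")
    resolved_is_absolute = resolved.is_absolute()

    try:
        resolve_input_path(input_dir, "../outside.json")
        traversal_rejected = False
    except ValueError:
        traversal_rejected = True
```

Resolving both the base and the candidate before comparing them means symlinks and `..` segments are normalised first, so the containment check works on the real location rather than on the string the caller sent.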
| **Output `Path` fields** — `apply` returns an absolute `Path`, caller receives a relative string: | ||
|
|
||
| ``` | ||
| apply returns → Path("/tesseract/output_data/sample_8.copy") | ||
| built-in strips → Path("sample_8.copy") (checked: exists) | ||
| caller receives → "sample_8.copy" | ||
| ``` | ||
|
|
||
| - Raises `ValueError` if the path does not exist inside `output_path`. | ||
| - Accepts both files **and** directories (use `OutputFileReference` for files only). | ||
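The output direction described above is the mirror image: check that the returned path exists inside `output_path`, then strip the prefix. The sketch below assumes a hypothetical helper name, `strip_output_path`, and uses a temporary directory in place of `output_path`; it is not taken from `tesseract_core`.

```python
import tempfile
from pathlib import Path


def strip_output_path(base: Path, absolute: Path) -> str:
    """Hypothetical helper: validate `absolute` and return it relative to `base`."""
    base = base.resolve()
    candidate = absolute.resolve()
    # The path must exist and must live inside the output directory.
    if not candidate.is_relative_to(base) or not candidate.exists():
        raise ValueError(f"{absolute} does not exist inside {base}")
    # The caller receives a plain relative string.
    return candidate.relative_to(base).as_posix()


# Demo against a throwaway directory standing in for output_path.
with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp)
    (output_dir / "sample_8.copy").write_text("{}")

    relative = strip_output_path(output_dir, output_dir / "sample_8.copy")

    try:
        strip_output_path(output_dir, output_dir / "missing.copy")
        missing_rejected = False
    except ValueError:
        missing_rejected = True
```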
|
|
||
| ## Composing user-defined validators | ||
|
|
||
| `AfterValidator`s placed on a `Path`-annotated field are preserved. For both input | |
| and output fields, the user validator receives an already-resolved **absolute** `Path`: | |
|
|
||
| ```python | ||
| def has_bin_sidecar(path: Path) -> Path: | |
|     """Check that any binref JSON has its .bin sidecar present.""" | |
|     if path.is_file(): | |
|         name = bin_reference(path) | |
|         if name is not None: | |
|             bin = path.parent / name | |
|             assert bin.exists(), f"Expected .bin file for json {path} not found at {bin}" | |
|     elif not path.is_dir(): | |
|         # directories pass through unchanged; anything else is an error | |
|         raise ValueError(f"{path} does not exist.") | |
|     return path | |
|
|
||
| class InputSchema(BaseModel): | |
|     paths: list[Annotated[Path, AfterValidator(has_bin_sidecar)]] | |
| ``` | ||
|
|
||
| The built-in path validators run at different points depending on direction: | ||
|
|
||
| **Input fields** — built-in validator runs **first**, user validators run after: | ||
|
|
||
| ``` | ||
| "sample_8.json" | ||
| → built-in → Path("/tesseract/input_data/sample_8.json") (resolved + existence check) | ||
| → has_bin_sidecar → Path("/tesseract/input_data/sample_8.json") (checks .bin sidecar present) | ||
| → apply receives → Path("/tesseract/input_data/sample_8.json") | ||
| ``` | ||
|
|
||
| **Output fields** — user validators run **first**, built-in validator runs after: | ||
|
|
||
| ``` | ||
| apply returns → Path("/tesseract/output_data/sample_8.copy") | ||
| → has_bin_sidecar → Path("/tesseract/output_data/sample_8.copy") (checks .bin sidecar was copied) | ||
| → built-in → Path("sample_8.copy") (existence check + prefix stripped) | ||
| → caller receives → "sample_8.copy" | ||
| ``` | ||
|
|
||
| This example uses output validators to confirm that `apply` copied the sidecar | ||
| `.bin` file alongside each JSON file. | ||
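The ordering described in this section can be mimicked without Pydantic by composing plain functions. The sketch below is illustrative only: `builtin_resolve`, `builtin_strip`, `user_validator` and `run_chain` are hypothetical stand-ins, the base directories are taken from the diagrams above, and filesystem checks are left out so that only the order of calls is shown.

```python
from pathlib import PurePosixPath
from typing import Callable

INPUT_BASE = PurePosixPath("/tesseract/input_data")
OUTPUT_BASE = PurePosixPath("/tesseract/output_data")

# Records whether the user validator saw an absolute path on each call.
seen_absolute: list[bool] = []


def builtin_resolve(value: PurePosixPath) -> PurePosixPath:
    """Stand-in for the built-in input validator (no existence check here)."""
    return INPUT_BASE / value


def builtin_strip(value: PurePosixPath) -> PurePosixPath:
    """Stand-in for the built-in output validator (no existence check here)."""
    return value.relative_to(OUTPUT_BASE)


def user_validator(value: PurePosixPath) -> PurePosixPath:
    """Stand-in for a user-defined validator such as has_bin_sidecar."""
    seen_absolute.append(value.is_absolute())
    return value


def run_chain(
    value: PurePosixPath,
    validators: list[Callable[[PurePosixPath], PurePosixPath]],
) -> PurePosixPath:
    """Apply validators left to right, feeding each result into the next."""
    for validate in validators:
        value = validate(value)
    return value


# Input: built-in first, user validator second.
input_result = run_chain(PurePosixPath("sample_8.json"), [builtin_resolve, user_validator])

# Output: user validator first, built-in second.
output_result = run_chain(OUTPUT_BASE / "sample_8.copy", [user_validator, builtin_strip])
```

In both chains the user validator is positioned on the absolute side of the built-in step, which is why the same `has_bin_sidecar` function can be reused for inputs and outputs.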
|
|
||
| ## Test data | ||
|
|
||
| The test dataset (`test_cases/testdata/`) contains: | ||
|
|
||
| | File | Array encoding | | ||
| | ------------------------------------------------------------------ | ----------------------------------------------- | | ||
| | `sample_0.json`, `sample_3.json`, `sample_6.json`, `sample_9.json` | `json` (inline) | | ||
| | `sample_1.json`, `sample_4.json`, `sample_7.json` | `base64` (inline) | | ||
| | `sample_2.json`, `sample_5.json`, `sample_8.json` | `binref` (references the shared `.bin` sidecar) | | ||
| | `sample_dir/` | directory containing `data.json` | | ||
|
|
||
| `generate_data.py` re-creates this dataset using a fixed RNG seed. | ||
|
|
||
| ## Running | ||
|
|
||
| ```bash | ||
| # local (no Docker) | ||
| uv run python test_tesseract.py | ||
|
|
||
| # build Docker image first, then re-run | ||
| uv run tesseract build . | ||
| uv run python test_tesseract.py | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| # Copyright 2025 Pasteur Labs. All Rights Reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| import json | ||
| import shutil | ||
| from pathlib import Path | ||
| from typing import Annotated | ||
|
|
||
| from pydantic import AfterValidator, BaseModel | ||
|
|
||
| from tesseract_core.runtime.config import get_config | ||
|
|
||
|
|
||
| def bin_reference(path: Path) -> str | None: | |
|     """Return the name of the .bin file if the json at 'path' references one, else None.""" | |
|     with open(path) as f: | |
|         contents = json.load(f) | |
|     if contents["data"]["encoding"] == "binref": | |
|         # keep only the part of the buffer string before the first ':' | |
|         return contents["data"]["buffer"].split(":")[0] | |
|     return None | |
|
|
||
|
|
||
| def has_bin_sidecar(path: Path) -> Path: | |
|     """Pydantic validator to check for .bin file next to any json file that references one.""" | |
|     if path.is_file(): | |
|         name = bin_reference(path) | |
|         if name is not None: | |
|             bin = path.parent / name | |
|             assert bin.exists(), ( | |
|                 f"Expected .bin file for json {path} not found at {bin}" | |
|             ) | |
|     elif path.is_dir(): | |
|         # directories are accepted as-is | |
|         return path | |
|     else: | |
|         raise ValueError(f"{path} does not exist.") | |
|     return path | |
|
|
||
|
|
||
| class InputSchema(BaseModel): | |
|     paths: list[Annotated[Path, AfterValidator(has_bin_sidecar)]] | |
|
|
||
|
|
||
| class OutputSchema(BaseModel): | |
|     paths: list[Annotated[Path, AfterValidator(has_bin_sidecar)]] | |
|
|
||
|
|
||
| def apply(inputs: InputSchema) -> OutputSchema: | |
|     output_path = Path(get_config().output_path) | |
|     result = [] | |
|
|
||
|     for src in inputs.paths: | |
|         if src.is_dir(): | |
|             # copy any folder that is given | |
|             dest = output_path / src.name | |
|             shutil.copytree(src, dest) | |
|         else: | |
|             # copy any file that is given, and if it references a .bin file, copy that too | |
|             dest = output_path / src.with_suffix(".copy").name | |
|             shutil.copy(src, dest) | |
|             bin = bin_reference(src) | |
|             if bin is not None: | |
|                 shutil.copy(src.parent / bin, dest.parent / bin) | |
|         result.append(dest) | |
|     return OutputSchema(paths=result) | |
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| name: pathreference | ||
| version: "1.0.0" | ||
| description: | | |
|   Tesseract that copies input files and directories to the output directory. | |
|   Demonstrates `Path` fields in input and output schemas, which accept both | |
|   files and directories. | |
|
|
||
| build_config: | |
|   package_data: [] | |
|   custom_build_steps: [] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| { "value": "world" } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| import shutil | ||
| from pathlib import Path | ||
|
|
||
| from rich import print | ||
|
|
||
| from tesseract_core import Tesseract | ||
|
|
||
|
|
||
| def clean(): | |
|     # delete before copy | |
|     if output_path.exists(): | |
|         shutil.rmtree(output_path) | |
|     output_path.mkdir() | |
|
|
||
|
|
||
| input_path = Path("./test_cases/testdata") | ||
| output_path = Path("./output") | ||
|
|
||
| # mix of files and a directory, all relative to input_path | |
| paths = [ | |
|     "sample_0.json", | |
|     "sample_8.json",  # contains .bin reference | |
|     "sample_dir", | |
| ] | |
|
|
||
| clean() | ||
| with Tesseract.from_tesseract_api( | |
|     "tesseract_api.py", input_path=input_path, output_path=output_path, stream_logs=True | |
| ) as tess: | |
|     result = tess.apply({"paths": paths}) | |
|     print(result) | |
|     out_paths = [(output_path / p) for p in result["paths"]] | |
|     assert len(out_paths) == len(paths) | |
|     assert all(p.exists() for p in out_paths) | |
|     assert len(list(output_path.glob("*.bin"))) == 1 | |
|
|
||
|
|
||
| clean() | ||
| with Tesseract.from_image( | |
|     "pathreference", input_path=input_path, output_path=output_path, stream_logs=True | |
| ) as tess: | |
|     result = tess.apply({"paths": paths}) | |
|     print(result) | |
|     out_paths = [(output_path / p) for p in result["paths"]] | |
|     assert len(out_paths) == len(paths) | |
|     assert all(p.exists() for p in out_paths) | |
|     assert len(list(output_path.glob("*.bin"))) == 1 |
A bit too much AI smell; try to give it a human pass if you can. Maybe we should add a section in the docs too.