16 changes: 0 additions & 16 deletions examples/filereference/inputs.json

This file was deleted.

34 changes: 0 additions & 34 deletions examples/filereference/tesseract_api.py

This file was deleted.

8 changes: 0 additions & 8 deletions examples/filereference/tesseract_config.yaml

This file was deleted.

40 changes: 0 additions & 40 deletions examples/filereference/test_tesseract.py

This file was deleted.

File renamed without changes.
101 changes: 101 additions & 0 deletions examples/pathreference/README.md
Review comment (Contributor): A bit too much AI smell; try to give it a human pass if you can. Maybe we should also add a section to the docs.

@@ -0,0 +1,101 @@
# Path Reference Example

A Tesseract that copies files and directories from `input_path` to `output_path`.
It demonstrates how to use `Path` in Tesseract schemas and how to compose custom
Pydantic validators on top of the built-in path-handling behaviour.

## What `Path` does in a schema

When you annotate a field with `Path`, the schema generation layer automatically
injects path-handling validators at runtime.

**Input `Path` fields** — caller sends a relative string, `apply` receives an absolute `Path`:

```
caller sends → "sample_8.json"
built-in resolves → Path("/tesseract/input_data/sample_8.json") (checked: exists)
apply sees → Path("/tesseract/input_data/sample_8.json")
```

- Rejects any path that would escape `input_path` (path traversal protection).
- Raises `FileNotFoundError` if the resolved path does not exist.
- Accepts both files **and** directories (use `InputFileReference` for files only).

Review comment (Contributor, on the line above): Are we not planning to deprecate `InputFileReference`? Is there a reason to keep recommending it? If we're not deprecating it, should we keep the filereference example?


**Output `Path` fields** — `apply` returns an absolute `Path`, caller receives a relative string:

```
apply returns → Path("/tesseract/output_data/sample_8.copy")
built-in strips → Path("sample_8.copy") (checked: exists)
caller receives → "sample_8.copy"
```

- Raises `ValueError` if the path does not exist inside `output_path`.
- Accepts both files **and** directories (use `OutputFileReference` for files only).
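
The input and output behaviour described above can be approximated with plain `pathlib`. This is only a sketch, not the runtime's actual implementation: `resolve_input` and `strip_output` are hypothetical names, and the real validators may differ in error types and edge cases.

```python
from pathlib import Path


def resolve_input(relative: str, input_path: Path) -> Path:
    """Hypothetical stand-in for the built-in input validator."""
    base = input_path.resolve()
    resolved = (base / relative).resolve()
    # Reject anything that escapes the input directory, e.g. "../secret".
    if not resolved.is_relative_to(base):
        raise ValueError(f"{relative} escapes {base}")
    if not resolved.exists():
        raise FileNotFoundError(resolved)
    return resolved


def strip_output(absolute: Path, output_path: Path) -> str:
    """Hypothetical stand-in for the built-in output validator."""
    base = output_path.resolve()
    resolved = absolute.resolve()
    # The returned path must exist and live inside the output directory.
    if not resolved.is_relative_to(base) or not resolved.exists():
        raise ValueError(f"{absolute} does not exist inside {base}")
    return resolved.relative_to(base).as_posix()
```

Under these assumptions, `resolve_input("sample_8.json", input_path)` returns an absolute path below `input_path`, and `strip_output` turns an absolute path below `output_path` back into the relative string the caller receives.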

## Composing user-defined validators

`AfterValidator`s placed on a `Path`-annotated field are preserved. For both
input and output fields, the user validator receives an already-resolved
**absolute** `Path`:

```python
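from pathlib import Path
from typing import Annotated

from pydantic import AfterValidator, BaseModel


def has_bin_sidecar(path: Path) -> Path:
    """Check that any binref JSON has its .bin sidecar present."""
    if path.is_file():
        # bin_reference() is defined in tesseract_api.py
        name = bin_reference(path)
        if name is not None:
            bin_path = path.parent / name
            assert bin_path.exists(), f"Expected .bin file for json {path} not found at {bin_path}"
    elif not path.is_dir():
        # directories are accepted as-is; anything else is an error
        raise ValueError(f"{path} does not exist.")
    return path


class InputSchema(BaseModel):
    paths: list[Annotated[Path, AfterValidator(has_bin_sidecar)]]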
```

The built-in path validators run at different points depending on direction:

**Input fields** — built-in validator runs **first**, user validators run after:

```
"sample_8.json"
→ built-in → Path("/tesseract/input_data/sample_8.json") (resolved + existence check)
→ has_bin_sidecar → Path("/tesseract/input_data/sample_8.json") (checks .bin sidecar present)
→ apply receives → Path("/tesseract/input_data/sample_8.json")
```

**Output fields** — user validators run **first**, built-in validator runs after:

```
apply returns → Path("/tesseract/output_data/sample_8.copy")
→ has_bin_sidecar → Path("/tesseract/output_data/sample_8.copy") (checks .bin sidecar was copied)
→ built-in → Path("sample_8.copy") (existence check + prefix stripped)
→ caller receives → "sample_8.copy"
```

This example uses output validators to confirm that `apply` copied the sidecar
`.bin` file alongside each JSON file.
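
The ordering described in this section can be pictured as a simple left-to-right pipeline. The sketch below uses only the standard library; `run_pipeline`, `builtin_input`, `builtin_output`, and `user_check` are hypothetical stand-ins rather than Tesseract or Pydantic APIs, and the paths are pure path objects that never touch the filesystem.

```python
from pathlib import PurePosixPath
from typing import Any, Callable

INPUT_ROOT = PurePosixPath("/tesseract/input_data")
OUTPUT_ROOT = PurePosixPath("/tesseract/output_data")


def builtin_input(value: str) -> PurePosixPath:
    # Stand-in for the built-in input step: relative string -> absolute path.
    return INPUT_ROOT / value


def builtin_output(value: PurePosixPath) -> PurePosixPath:
    # Stand-in for the built-in output step: strip the output prefix.
    return value.relative_to(OUTPUT_ROOT)


def user_check(value: Any) -> PurePosixPath:
    # Stand-in for a user validator: it expects an absolute path in both directions.
    path = PurePosixPath(value)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {value!r}")
    return path


def run_pipeline(value: Any, steps: list[Callable[[Any], Any]]) -> Any:
    # Apply each step in order, feeding the result of one into the next.
    for step in steps:
        value = step(value)
    return value


# Input: built-in first, then the user validator.
received = run_pipeline("sample_8.json", [builtin_input, user_check])

# Output: user validator first, then the built-in.
returned = run_pipeline(OUTPUT_ROOT / "sample_8.copy", [user_check, builtin_output])
```

Swapping the order on the input side would hand `user_check` the raw relative string, which is why the direction-dependent ordering matters.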

## Test data

The test dataset (`test_cases/testdata/`) contains:

| File | Array encoding |
| ------------------------------------------------------------------ | ----------------------------------------------- |
| `sample_0.json`, `sample_3.json`, `sample_6.json`, `sample_9.json` | `json` (inline) |
| `sample_1.json`, `sample_4.json`, `sample_7.json` | `base64` (inline) |
| `sample_2.json`, `sample_5.json`, `sample_8.json` | `binref` (references the shared `.bin` sidecar) |
| `sample_dir/` | directory containing `data.json` |

`generate_data.py` re-creates this dataset using a fixed RNG seed.

## Running

```bash
# local (no Docker)
uv run python test_tesseract.py

# build Docker image first, then re-run
uv run tesseract build .
uv run python test_tesseract.py
```
64 changes: 64 additions & 0 deletions examples/pathreference/tesseract_api.py
@@ -0,0 +1,64 @@
# Copyright 2025 Pasteur Labs. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0

import json
import shutil
from pathlib import Path
from typing import Annotated

from pydantic import AfterValidator, BaseModel

from tesseract_core.runtime.config import get_config


def bin_reference(path: Path) -> str | None:
    """Return the name of the .bin file if the json at 'path' references one, else None."""
    with open(path) as f:
        contents = json.load(f)
    if contents["data"]["encoding"] == "binref":
        return contents["data"]["buffer"].split(":")[0]
    return None


def has_bin_sidecar(path: Path) -> Path:
    """Pydantic validator to check for .bin file next to any json file that references one."""
    if path.is_file():
        name = bin_reference(path)
        if name is not None:
            bin_path = path.parent / name
            assert bin_path.exists(), (
                f"Expected .bin file for json {path} not found at {bin_path}"
            )
    elif path.is_dir():
        return path
    else:
        raise ValueError(f"{path} does not exist.")
    return path


class InputSchema(BaseModel):
    paths: list[Annotated[Path, AfterValidator(has_bin_sidecar)]]


class OutputSchema(BaseModel):
    paths: list[Annotated[Path, AfterValidator(has_bin_sidecar)]]


def apply(inputs: InputSchema) -> OutputSchema:
    output_path = Path(get_config().output_path)
    result = []

    for src in inputs.paths:
        if src.is_dir():
            # copy any folder that is given
            dest = output_path / src.name
            shutil.copytree(src, dest)
        else:
            # copy any file that is given, and if it references a .bin file, copy that too
            dest = output_path / src.with_suffix(".copy").name
            shutil.copy(src, dest)
            bin_name = bin_reference(src)
            if bin_name is not None:
                shutil.copy(src.parent / bin_name, dest.parent / bin_name)
        result.append(dest)
    return OutputSchema(paths=result)
10 changes: 10 additions & 0 deletions examples/pathreference/tesseract_config.yaml
@@ -0,0 +1,10 @@
name: pathreference
version: "1.0.0"
description: |
  Tesseract that copies input files and directories to the output directory.
  Demonstrates InputPathReference and OutputPathReference, which accept both
  files and directories.

build_config:
  package_data: []
  custom_build_steps: []
@@ -1,7 +1,7 @@
 {
   "endpoint": "apply",
   "expected_outputs": {
-    "data": [
+    "paths": [
       "sample_0.copy",
       "sample_1.copy",
       "sample_2.copy",
@@ -11,7 +11,8 @@
       "sample_6.copy",
       "sample_7.copy",
       "sample_8.copy",
-      "sample_9.copy"
+      "sample_9.copy",
+      "sample_dir"
     ]
   },
   "expected_exception": null,
@@ -23,7 +24,7 @@
   },
   "payload": {
     "inputs": {
-      "data": [
+      "paths": [
         "sample_0.json",
         "sample_1.json",
         "sample_2.json",
@@ -33,7 +34,8 @@
         "sample_6.json",
         "sample_7.json",
         "sample_8.json",
-        "sample_9.json"
+        "sample_9.json",
+        "sample_dir"
       ]
     }
   }
@@ -0,0 +1 @@
{ "value": "world" }
47 changes: 47 additions & 0 deletions examples/pathreference/test_tesseract.py
@@ -0,0 +1,47 @@
import shutil
from pathlib import Path

from rich import print

from tesseract_core import Tesseract


def clean():
    # delete before copy
    if output_path.exists():
        shutil.rmtree(output_path)
    output_path.mkdir()


input_path = Path("./test_cases/testdata")
output_path = Path("./output")

# mix of files and a directory, all relative to input_path
paths = [
    "sample_0.json",
    "sample_8.json",  # contains .bin reference
    "sample_dir",
]

clean()
with Tesseract.from_tesseract_api(
    "tesseract_api.py", input_path=input_path, output_path=output_path, stream_logs=True
) as tess:
    result = tess.apply({"paths": paths})
    print(result)
    out_paths = [(output_path / p) for p in result["paths"]]
    assert len(out_paths) == len(paths)
    assert all(p.exists() for p in out_paths)
    assert len(list(output_path.glob("*.bin"))) == 1


clean()
with Tesseract.from_image(
    "pathreference", input_path=input_path, output_path=output_path, stream_logs=True
) as tess:
    result = tess.apply({"paths": paths})
    print(result)
    out_paths = [(output_path / p) for p in result["paths"]]
    assert len(out_paths) == len(paths)
    assert all(p.exists() for p in out_paths)
    assert len(list(output_path.glob("*.bin"))) == 1