
feat: Automatic path resolution in InputSchema/OutputSchema#555

Open
nmheim wants to merge 17 commits into main from nh/path-references

Conversation

@nmheim
Contributor

@nmheim nmheim commented Apr 6, 2026

Description of changes

Any schema field annotated with Path now automatically gets path validators injected at runtime — no need to use InputFileReference / OutputFileReference explicitly. Also fixes those types to correctly enforce file-only paths.

See examples/pathreference/README.md for details and usage examples.

Testing done

  • Existing test suite passes.
  • End-to-end test config updated from filereference to pathreference.

@PasteurBot
Contributor

PasteurBot commented Apr 6, 2026

CLA signatures confirmed

All contributors have signed the Contributor License Agreement.
Posted by the CLA Assistant Lite bot.

@codecov

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 80.39216% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.96%. Comparing base (eb4438f) to head (5de4b40).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tesseract_core/runtime/experimental.py | 41.66% | 7 Missing ⚠️ |
| tesseract_core/runtime/schema_generation.py | 92.30% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #555       +/-   ##
===========================================
+ Coverage   66.77%   76.96%   +10.19%     
===========================================
  Files          32       32               
  Lines        4409     4458       +49     
  Branches      730      739        +9     
===========================================
+ Hits         2944     3431      +487     
+ Misses       1226      726      -500     
- Partials      239      301       +62     


@nmheim nmheim force-pushed the nh/path-references branch from 9b11ec5 to 8cde6f2 Compare April 6, 2026 16:27
…utputPathReference

Generalizes file references to accept any existing filesystem path (file or
directory). Removes the is_file() constraint from input validation; existence
check is preserved. Output validator is unchanged.
@PasteurBot
Contributor

PasteurBot commented Apr 6, 2026

Benchmark Results

Benchmarks use a no-op Tesseract to measure pure framework overhead.

🚀 1 faster, ⚠️ 0 slower, ✅ 35 unchanged

Notable changes

| Benchmark | Baseline | Current | Change | Status |
|---|---|---|---|---|
| roundtrip/binref_100,000 | 0.957ms | 0.718ms | -24.9% | 🚀 faster |

Full results

| Benchmark | Baseline | Current | Change | Status |
|---|---|---|---|---|
| api/apply_1,000 | 0.590ms | 0.600ms | +1.6% | |
| api/apply_100,000 | 0.591ms | 0.591ms | -0.0% | |
| api/apply_10,000,000 | 0.589ms | 0.587ms | -0.2% | |
| cli/apply_1,000 | 1665.798ms | 1648.952ms | -1.0% | |
| cli/apply_100,000 | 1666.173ms | 1645.927ms | -1.2% | |
| cli/apply_10,000,000 | 1699.304ms | 1711.606ms | +0.7% | |
| decoding/base64_1,000 | 0.038ms | 0.037ms | -1.5% | |
| decoding/base64_100,000 | 0.892ms | 0.892ms | -0.1% | |
| decoding/base64_10,000,000 | 99.713ms | 98.451ms | -1.3% | |
| decoding/binref_1,000 | 0.195ms | 0.199ms | +1.9% | |
| decoding/binref_100,000 | 0.234ms | 0.239ms | +2.0% | |
| decoding/binref_10,000,000 | 10.535ms | 10.295ms | -2.3% | |
| decoding/json_1,000 | 0.108ms | 0.107ms | -1.3% | |
| decoding/json_100,000 | 8.948ms | 9.095ms | +1.6% | |
| decoding/json_10,000,000 | 1070.545ms | 1079.217ms | +0.8% | |
| encoding/base64_1,000 | 0.041ms | 0.041ms | -0.7% | |
| encoding/base64_100,000 | 0.145ms | 0.147ms | +1.3% | |
| encoding/base64_10,000,000 | 26.219ms | 24.710ms | -5.8% | |
| encoding/binref_1,000 | 0.302ms | 0.301ms | -0.5% | |
| encoding/binref_100,000 | 0.478ms | 0.478ms | +0.1% | |
| encoding/binref_10,000,000 | 18.463ms | 18.213ms | -1.4% | |
| encoding/json_1,000 | 0.151ms | 0.153ms | +1.1% | |
| encoding/json_100,000 | 12.943ms | 13.091ms | +1.1% | |
| encoding/json_10,000,000 | 1406.473ms | 1406.180ms | -0.0% | |
| http/apply_1,000 | 3.092ms | 3.132ms | +1.3% | |
| http/apply_100,000 | 8.736ms | 8.688ms | -0.5% | |
| http/apply_10,000,000 | 760.457ms | 763.853ms | +0.4% | |
| roundtrip/base64_1,000 | 0.090ms | 0.089ms | -1.2% | |
| roundtrip/base64_100,000 | 1.049ms | 1.049ms | +0.0% | |
| roundtrip/base64_10,000,000 | 126.255ms | 123.247ms | -2.4% | |
| roundtrip/binref_1,000 | 0.513ms | 0.522ms | +1.8% | |
| roundtrip/binref_100,000 | 0.957ms | 0.718ms | -24.9% | 🚀 faster |
| roundtrip/binref_10,000,000 | 29.245ms | 29.694ms | +1.5% | |
| roundtrip/json_1,000 | 0.272ms | 0.272ms | +0.1% | |
| roundtrip/json_100,000 | 19.878ms | 19.580ms | -1.5% | |
| roundtrip/json_10,000,000 | 2476.387ms | 2463.231ms | -0.5% | |

  • Runner: Linux 6.17.0-1008-azure x86_64

@nmheim nmheim force-pushed the nh/path-references branch from 8cde6f2 to 1b804e4 Compare April 6, 2026 16:42
@jpbrodrick89
Contributor

Should we not keep InputFileReference for backwards compatibility?

@nmheim nmheim force-pushed the nh/path-references branch from d3737ac to a11ef26 Compare April 7, 2026 11:50
@nmheim nmheim changed the title Turn FileReferences into PathReferences feat: Add *DirectoryReferences and enforce is_file() in OutputFileReference Apr 7, 2026
Contributor

@jpbrodrick89 jpbrodrick89 left a comment

Cool, I like this. I'd approve, but as it's still a draft I'll wait until you mark it ready before giving the full green light. Thanks!

@nmheim nmheim force-pushed the nh/path-references branch 2 times, most recently from 48111c6 to 5d05532 Compare April 7, 2026 16:29
@nmheim nmheim force-pushed the nh/path-references branch from 5d05532 to e63438c Compare April 7, 2026 16:32
@nmheim
Contributor Author

nmheim commented Apr 7, 2026

> Cool, I like this. I'd approve, but as it's still a draft I'll wait until you mark it ready before giving the full green light. Thanks!

Yes, this was still quite drafty. After a chat with @xalelax I've now settled on *PathReferences while keeping *FileReferences around. IMO we could remove the latter, though, given that they were in the experimental module.

@nmheim nmheim changed the title feat: Add *DirectoryReferences and enforce is_file() in OutputFileReference feat: Automatic path resolution in InputSchema/OutputSchema Apr 8, 2026
@nmheim nmheim marked this pull request as ready for review April 8, 2026 18:44
@nmheim nmheim requested a review from jpbrodrick89 April 8, 2026 18:44
@nmheim
Contributor Author

nmheim commented Apr 8, 2026

The end2end test is currently failing because OutputSchema is being called on the expected outputs in test_apply.json. Before I refactor that, let's agree on whether we want this or not :)

Contributor

@linusseelinger linusseelinger left a comment

Cool stuff! 😎
Initially I was a bit hesitant about the "black magic" of injecting annotators, since it's not obvious to the user (principle of least surprise). However, paths have a unique and important role, so it should be fine to do something special here, especially if we use the opportunity to introduce a generous amount of instructive validation error messages that guide users in case they mess things up!


InputFileReference = Annotated[Path, AfterValidator(_resolve_input_path)]
OutputFileReference = Annotated[Path, AfterValidator(_strip_output_path)]
def _strip_output_file(path: Path) -> Path:
Contributor

Can we simplify this? E.g. merge _strip_output_file with _strip_output_path and _resolve_input_file with _resolve_input_path?
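
One way the merge suggested here could look, as a minimal sketch: a single helper takes a `require_file` flag, and the file-only variant becomes a `functools.partial` of it. The names `strip_output` and `strip_output_file`, the explicit `output_path` argument, and the error wording are assumptions for illustration; the PR code reads the path from `get_config()`.

```python
from functools import partial
from pathlib import Path


def strip_output(path: Path, output_path: Path, require_file: bool = False) -> Path:
    """Return `path` relative to `output_path`, optionally enforcing that it is a file."""
    # Hypothetical merged helper; not the PR's actual implementation.
    relative = path.relative_to(output_path) if path.is_relative_to(output_path) else path
    full_path = output_path / relative
    if not full_path.exists():
        raise ValueError(f"Output path {full_path} does not exist.")
    if require_file and not full_path.is_file():
        raise ValueError(f"Output path {full_path} is not a file.")
    return relative


# The file-only variant is the same function with the flag fixed.
strip_output_file = partial(strip_output, require_file=True)
```

The same pattern would apply to the input side, with one resolver and a flag for the `is_file()` check.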

raise ValueError(
    f"Invalid input file reference: {path}. "
    f"Expected path to be relative to {input_path}, but got {tess_path}. "
    "File references have to be relative to --input-path."
Contributor

This error message is really good! (i.e. instructive - as a Tess noob I understand what happened and what I need to do, even if I don't have immediate access to values and files involved)
Could we have similarly instructive and actionable messages for all other potential validation issues? E.g. making clear to the user that an output path needs to be in the output dir etc.? We should elegantly catch all kinds of nonsense that might happen here :D


- Rejects any path that would escape `input_path` (path traversal protection).
- Raises `FileNotFoundError` if the resolved path does not exist.
- Accepts both files **and** directories (use `InputFileReference` for files only).
Contributor

Are we not planning to deprecate InputFileReference? Is there a reason to continue recommending it? If we're not deprecating it, should we keep the filereference example?
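
For reference, the behaviour listed in the README excerpt above (traversal protection, existence check, files and directories both accepted) could be sketched roughly as follows. `resolve_input_path` and its explicit `input_path` parameter are hypothetical, as are the error messages; the PR's helper reads the base directory from the runtime config.

```python
from pathlib import Path


def resolve_input_path(path: Path, input_path: Path) -> Path:
    """Resolve `path` against `input_path`, refusing anything that escapes it."""
    base = input_path.resolve()
    resolved = (base / path).resolve()
    # Path traversal protection: "../secret" or a foreign absolute path lands outside `base`.
    if not resolved.is_relative_to(base):
        raise ValueError(
            f"Invalid input path reference: {path}. "
            f"Expected a path relative to {base}, but got {resolved}."
        )
    # Files and directories are both fine; only existence is checked.
    if not resolved.exists():
        raise FileNotFoundError(f"Input path {resolved} does not exist.")
    return resolved
```

A file-only variant would add a single `resolved.is_file()` check after the existence test.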

    return tess_path


def _resolve_input_file(path: Path) -> Path:
Contributor

Should we add our deprecation warning to this function and also to _strip_output_file? Not sure where else one can raise a deprecation warning for a type. That said, it's experimental, so maybe we just don't raise a deprecation warning at all and add an issue to remind us to remove it at the next major release.
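
If a warning is wanted, one place to emit it is inside the validator function itself, since that runs whenever the deprecated type is actually used for validation. A minimal sketch, with `resolve_input_file_deprecated` as a hypothetical stand-in and the message text invented for illustration:

```python
import warnings
from pathlib import Path


def resolve_input_file_deprecated(path: Path) -> Path:
    """Hypothetical validator body for a deprecated file-reference type."""
    warnings.warn(
        "InputFileReference is deprecated; annotate the field with Path instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    # The real validator would resolve and check the path here.
    return path
```

Keep in mind that Python's default warning filters hide `DeprecationWarning` unless it is attributed to code in `__main__`, so users may only see this under a test runner or with `-W default`.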

    return handler(_core_schema)


def _resolve_input_path(path: Path) -> Path:
Contributor

Should we just import these from schema_generation now?

@nmheim
Contributor Author

nmheim commented Apr 9, 2026

> Initially I was a bit hesitant about the "black magic" of injecting annotators, since it's not obvious to the user (principle of least surprise).

I've been wondering about the same. We could have a special ResolvedPath (or something with a different name) on which we trigger this machinery, so that not every Path is silently modified?
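
A rough sketch of the opt-in idea: a marker object in the `Annotated` metadata flags the fields that should get the special treatment, and plain `Path` fields are left alone. `ResolvedPath`, `_ResolveMarker`, and `wants_resolution` are hypothetical names, not part of the PR.

```python
from pathlib import Path
from typing import Annotated, Any, get_args, get_origin


class _ResolveMarker:
    """Hypothetical marker: fields carrying it opt in to path resolution."""


ResolvedPath = Annotated[Path, _ResolveMarker()]


def wants_resolution(annotation: Any) -> bool:
    """True only for annotations that explicitly carry the marker."""
    if get_origin(annotation) is not Annotated:
        return False
    _, *metadata = get_args(annotation)
    return any(isinstance(m, _ResolveMarker) for m in metadata)
```

With this, the schema machinery would inject validators only where `wants_resolution` returns true, at the cost of users having to know about the extra type.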

DICT_INDEX_SENTINEL = object()


def _resolve_input_path(path: Path) -> Path:
Contributor

Actually, schema_generation.py is pretty long already, shall we just move all the path validation helper functions to a new file path_validation.py?

Comment on lines +85 to +87
    while _is_annotated(ttype):
        ttype = ttype.__origin__
    return ttype
Contributor

@jpbrodrick89 jpbrodrick89 Apr 9, 2026

Is this necessary? Is it possible to have nested annotations? Why does apply_function_to_model_tree not need to handle this (inner_type = treeobj.__origin__)? Is it better to have _core_type as a general helper function if this is possible?
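
On the nesting question: Python's `typing` module flattens directly nested `Annotated` types when they are constructed, so the `__origin__` of such an alias is not itself `Annotated`. A short check, where `core_type` is a hypothetical stand-in for the PR's `_core_type`:

```python
from pathlib import Path
from typing import Annotated, Any, get_origin


def core_type(tp: Any) -> Any:
    """Strip Annotated metadata and return the underlying type."""
    while get_origin(tp) is Annotated:
        tp = tp.__origin__
    return tp


nested = Annotated[Annotated[Path, "inner"], "outer"]

# Nested Annotated is flattened at construction: one layer, metadata in order.
print(nested == Annotated[Path, "inner", "outer"])  # True
print(nested.__metadata__)  # ('inner', 'outer')
print(core_type(nested) is Path)  # True
```

For a type built this way the loop body runs at most once, so a single `__origin__` lookup would give the same result; the loop is harmless but may not be needed.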

The optional ``is_leaf`` predicate, if provided, is checked first: when it returns
True for a node, ``func`` is called on that node immediately without further recursion.
This allows callers to treat compound types (e.g. ``Annotated[Path, ...]``) as atomic
leaves.
Contributor

The need for this confused me a bit; maybe edit the last sentence to something like: "This allows func to act directly on compound types (e.g. Annotated[Path, ...]) and their annotations (such as AfterValidators)."
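
To illustrate what `is_leaf` buys, here is a simplified sketch of a traversal over type annotations. `map_type_tree` is a hypothetical stand-in for `apply_function_to_model_tree` that only handles `list`, `dict` and `Annotated`, and plain strings stand in for validator objects.

```python
from pathlib import Path
from typing import Annotated, Any, Callable, Optional, get_args, get_origin


def map_type_tree(
    tp: Any,
    func: Callable[[Any], Any],
    is_leaf: Optional[Callable[[Any], bool]] = None,
) -> Any:
    """Apply `func` to every leaf of a (simplified) type annotation tree."""
    # is_leaf is checked first: a matching node goes to func whole, no recursion.
    if is_leaf is not None and is_leaf(tp):
        return func(tp)
    origin = get_origin(tp)
    if origin is Annotated:
        inner, *metadata = get_args(tp)
        return Annotated[(map_type_tree(inner, func, is_leaf), *metadata)]
    if origin in (list, dict):
        return origin[tuple(map_type_tree(a, func, is_leaf) for a in get_args(tp))]
    return func(tp)


def tag_paths(tp: Any) -> Any:
    """Append a marker to Path leaves, whether bare or already annotated."""
    if tp is Path or (get_origin(tp) is Annotated and tp.__origin__ is Path):
        return Annotated[tp, "resolve"]
    return tp


def is_annotated(tp: Any) -> bool:
    return get_origin(tp) is Annotated


schema = dict[str, list[Annotated[Path, "user"]]]
inner_first = map_type_tree(schema, tag_paths)
outer_last = map_type_tree(schema, tag_paths, is_leaf=is_annotated)
```

Without `is_leaf`, `tag_paths` only ever sees the bare `Path` and its marker ends up before the user's metadata. With `is_leaf=is_annotated`, it receives `Annotated[Path, "user"]` as a whole and can decide where its own metadata goes relative to the user's.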

Comment on lines +93 to +101
    if x is Path:
        # Wrap with _resolve_input_path as the INNERMOST validator so that
        # it runs before all user validators (if any)
        return Annotated[Path, AfterValidator(_resolve_input_path)]
    return x


def _inject_output_path_validator(x: Any, _: Any) -> Any:
    if x is not Path and not _is_annotated_path(x):
Contributor

Your two if statements work in opposite ways. I think I'd prefer changing the second one to `if x is Path or _is_annotated_path(x)` (or just `_core_type(x) is Path`?).

Comment on lines +63 to +80
def _strip_output_path(path: Path) -> Path:
    from tesseract_core.runtime.config import get_config

    output_path = get_config().output_path
    if path.is_relative_to(output_path):
        return path.relative_to(output_path)
    else:
        return path


def _strip_output_exists(path: Path) -> Path:
    from tesseract_core.runtime.config import get_config

    stripped = _strip_output_path(path)
    full_path = Path(get_config().output_path) / stripped
    if not full_path.exists():
        raise ValueError(f"Output path {full_path} does not exist.")
    return stripped
Contributor

Maybe just combine into one function like resolve_input_path, something like:

Suggested change
- def _strip_output_path(path: Path) -> Path:
-     from tesseract_core.runtime.config import get_config
-     output_path = get_config().output_path
-     if path.is_relative_to(output_path):
-         return path.relative_to(output_path)
-     else:
-         return path
- def _strip_output_exists(path: Path) -> Path:
-     from tesseract_core.runtime.config import get_config
-     stripped = _strip_output_path(path)
-     full_path = Path(get_config().output_path) / stripped
-     if not full_path.exists():
-         raise ValueError(f"Output path {full_path} does not exist.")
-     return stripped
+ def _strip_output_path(path: Path) -> Path:
+     from tesseract_core.runtime.config import get_config
+     output_path = get_config().output_path
+     if path.is_relative_to(output_path):
+         if not path.exists():
+             raise ValueError(f"Output path {path} does not exist inside Tesseract")
+         return path.relative_to(output_path)
+     else:
+         full_path = output_path / path
+         if not full_path.exists():
+             if path.exists():
+                 raise ValueError(f"Output path {path} is not in {output_path}. "
+                     f"All output data must be copied to `--output-path` ({output_path})."
+                 )
+             else:
+                 raise ValueError(f"Output path {path} is not in {output_path} or Tesseract root")
+         return path

Contributor

A bit too much AI smell; try to give it a human pass if you can. Maybe we should add a section in the docs too.

@jpbrodrick89
Contributor

Nice work! I'm pretty ambivalent about whether the black magic on Path is fine or whether we should create our own ResolvedPath. WDYT @xalelax?

4 participants