feat: Automatic path resolution in InputSchema/OutputSchema#555
feat: Automatic path resolution in InputSchema/OutputSchema#555
Conversation
CLA signatures confirmedAll contributors have signed the Contributor License Agreement. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #555 +/- ##
===========================================
+ Coverage 66.77% 76.96% +10.19%
===========================================
Files 32 32
Lines 4409 4458 +49
Branches 730 739 +9
===========================================
+ Hits 2944 3431 +487
+ Misses 1226 726 -500
- Partials 239 301 +62 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9b11ec5 to
8cde6f2
Compare
…utputPathReference Generalizes file references to accept any existing filesystem path (file or directory). Removes the is_file() constraint from input validation; existence check is preserved. Output validator is unchanged.
Benchmark ResultsBenchmarks use a no-op Tesseract to measure pure framework overhead. 🚀 1 faster, Notable changes
Full results
|
8cde6f2 to
1b804e4
Compare
|
Should we not keep InputFileReference for backwards compatibility? |
d3737ac to
a11ef26
Compare
*DirectoryReferences and enforce is_file() in OutputFileReference
jpbrodrick89
left a comment
There was a problem hiding this comment.
Cool, I like this, I'd approve but as it's still draft I will wait until you mark it ready until I give the full green light. Thanks!
48111c6 to
5d05532
Compare
5d05532 to
e63438c
Compare
Yes, this was still quite drafty. After a chat with @xalelax I now settled on |
*DirectoryReferences and enforce is_file() in OutputFileReference|
The end2end test that is currently failing is due to |
linusseelinger
left a comment
There was a problem hiding this comment.
Cool stuff! 😎
Initially I was a bit hesitant about the "black magic" of injecting annotators, since it's not obvious to the user (principle of least surprise). However, paths have a very unique and important role, so should be fine to do something special here; especially if we use the opportunity to introduce a generous amount of instructive validation error messages, guiding users in case they mess things up!
|
|
||
| InputFileReference = Annotated[Path, AfterValidator(_resolve_input_path)] | ||
| OutputFileReference = Annotated[Path, AfterValidator(_strip_output_path)] | ||
| def _strip_output_file(path: Path) -> Path: |
There was a problem hiding this comment.
Can we simplify this? E.g. merge _strip_output_file with _strip_output_path and _resolve_input_file with _resolve_input_path?
| raise ValueError( | ||
| f"Invalid input file reference: {path}. " | ||
| f"Expected path to be relative to {input_path}, but got {tess_path}. " | ||
| "File references have to be relative to --input-path." |
There was a problem hiding this comment.
This error message is really good! (i.e. instructive - as a Tess noob I understand what happened and what I need to do, even if I don't have immediate access to values and files involved)
Could we have similarly instructive and actionable messages for all other potential validation issues? E.g. making clear to the user that an output path needs to be in the output dir etc.? We should elegantly catch all kinds of nonsense that might happen here :D
|
|
||
| - Rejects any path that would escape `input_path` (path traversal protection). | ||
| - Raises `FileNotFoundError` if the resolved path does not exist. | ||
| - Accepts both files **and** directories (use `InputFileReference` for files only). |
There was a problem hiding this comment.
Are we not planning to deprecate InputFileReference? Is there a reason to continue recommending it. If we're not deprecating it should we keep the filereference example?
| return tess_path | ||
|
|
||
|
|
||
| def _resolve_input_file(path: Path) -> Path: |
There was a problem hiding this comment.
should we add our deprecation warning to this function and also _strip_output_file? Not sure where else one can raise a deprecation warning for a type. That said, it's experimental so maybe we just don't raise a deprecation warning at all and add an issue to remind us to remove it at next major release.
| return handler(_core_schema) | ||
|
|
||
|
|
||
| def _resolve_input_path(path: Path) -> Path: |
There was a problem hiding this comment.
should we just import these from schema_generation now?
I've been wondering about the same. we could have a special |
| DICT_INDEX_SENTINEL = object() | ||
|
|
||
|
|
||
| def _resolve_input_path(path: Path) -> Path: |
There was a problem hiding this comment.
Actually, schema_generation.py is pretty long already, shall we just move all the path validation helper functions to a new file path_validation.py?
| while _is_annotated(ttype): | ||
| ttype = ttype.__origin__ | ||
| return ttype |
There was a problem hiding this comment.
Is this necessary? Is is possible to have nest annotations? Why does apply_function_to_model_tree not need to handle this (inner_type = treeobj.__origin__)? Is it better to have _core_type as a general helper function if this is possible/?
| The optional ``is_leaf`` predicate, if provided, is checked first: when it returns | ||
| True for a node, ``func`` is called on that node immediately without further recursion. | ||
| This allows callers to treat compound types (e.g. ``Annotated[Path, ...]``) as atomic | ||
| leaves. |
There was a problem hiding this comment.
The need for this confused me a bit maybe edit the last setence to something: "This allow func to act directly on compound types (e.g. Annotated[Path, ...]) and their annotations (such as AfterValidators)."
| if x is Path: | ||
| # Wrap with _resolve_input_path as the INNERMOST validator so that | ||
| # it runs before all user validators (if any) | ||
| return Annotated[Path, AfterValidator(_resolve_input_path)] | ||
| return x | ||
|
|
||
|
|
||
| def _inject_output_path_validator(x: Any, _: Any) -> Any: | ||
| if x is not Path and not _is_annotated_path(x): |
There was a problem hiding this comment.
your if statements work the opposite way I think I'd prefer editing the second if statement to if is Path or is_annotated_path (or just _core_type(x) is Path?)
| def _strip_output_path(path: Path) -> Path: | ||
| from tesseract_core.runtime.config import get_config | ||
|
|
||
| output_path = get_config().output_path | ||
| if path.is_relative_to(output_path): | ||
| return path.relative_to(output_path) | ||
| else: | ||
| return path | ||
|
|
||
|
|
||
| def _strip_output_exists(path: Path) -> Path: | ||
| from tesseract_core.runtime.config import get_config | ||
|
|
||
| stripped = _strip_output_path(path) | ||
| full_path = Path(get_config().output_path) / stripped | ||
| if not full_path.exists(): | ||
| raise ValueError(f"Output path {full_path} does not exist.") | ||
| return stripped |
There was a problem hiding this comment.
Maybe just combine into one function like resolve_input_path, something like:
| def _strip_output_path(path: Path) -> Path: | |
| from tesseract_core.runtime.config import get_config | |
| output_path = get_config().output_path | |
| if path.is_relative_to(output_path): | |
| return path.relative_to(output_path) | |
| else: | |
| return path | |
| def _strip_output_exists(path: Path) -> Path: | |
| from tesseract_core.runtime.config import get_config | |
| stripped = _strip_output_path(path) | |
| full_path = Path(get_config().output_path) / stripped | |
| if not full_path.exists(): | |
| raise ValueError(f"Output path {full_path} does not exist.") | |
| return stripped | |
| def _strip_output_path(path: Path) -> Path: | |
| from tesseract_core.runtime.config import get_config | |
| output_path = get_config().output_path | |
| if path.is_relative_to(output_path): | |
| if not path.exists(): | |
| raise ValueError(f"Output path {path} does not exist inside Tesseract") | |
| return path.relative_to(output_path) | |
| else: | |
| full_path = output_path / path | |
| if not full_path.exists(): | |
| if path.exists(): | |
| raise ValueError(f"Output path {path} is not in {output_path}. " | |
| f"All output data must be copied to `--output-path` ({output_path})." | |
| ) | |
| else: | |
| raise ValueError(f"Output path {path} is not in {output_path} or Tesseract root") | |
| return path |
There was a problem hiding this comment.
A bit too much AI smell, try give it a human pass if you can and maybe we should add a section in the docs too.
|
Nice work! I'm pretty ambivalent about whether black magic on |
Description of changes
Any schema field annotated with
Pathnow automatically gets path validators injected at runtime — no need to useInputFileReference/OutputFileReferenceexplicitly. Also fixes those types to correctly enforce file-only paths.See
examples/pathreference/README.mdfor details and usage examples.Testing done
filereference→pathreference.