
feature: FileByteType #162

Open
maxbreuker wants to merge 5 commits into main from feature/Format-Type-Bytes

maxbreuker commented Feb 18, 2026

Summary

This PR introduces validation modes for magic-bytes checks to support strict vs. lazy ("relaxed") validation rules for certain formats (currently: PDF). The main motivation is that common viewers still treat many real-world PDFs as valid even when the %%EOF marker is not at the very end of the file, as long as it appears close to the end.

Motivation

Our current PDF validation requires %%EOF to be at the physical end of the file, which rejects PDFs that are tolerated by common PDF readers. A more compatible rule is: %%EOF must occur somewhere within the last 1024 bytes of the file. This PR adds a mode-aware validation path to support that without duplicating the PDF file type implementation.

What changed

  • Added FileByteType validation modes:

    • Strict (intended for strict validation)
    • Lazy (intended for relaxed / viewer-compatible validation)
  • Extended public APIs to accept an optional validation mode:

    • IFormFileTypeProvider.FindValidatedTypeAsync(..., FileByteType validationType = Strict)
    • IValidator.IsValidAsync(..., FileByteType validationType = Strict)
  • Updated FileByteFilter so magic-byte checks can be registered either:

    • globally (applies to all modes), or
    • mode-specific (applies only to Strict or Lazy)
      by using the existing fluent methods with an optional FileByteType parameter.

PDF behavior

  • Strict keeps the previous behavior: the file must end with one of the supported %%EOF variants.
  • Lazy allows %%EOF to appear anywhere within the last 1024 bytes of the file (TailContains(1024, "%%EOF")).
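The difference between the two PDF rules can be sketched as follows. This is only an illustration under stated assumptions: `TailWindow` and its two helpers are hypothetical names, not the library's actual `TailContains` or `EndsWithAnyOf` implementations, and the sketch assumes the marker must lie entirely inside the tail window.

```csharp
using System;

// Hypothetical helper type for illustration; not part of MagicBytesValidator.
public static class TailWindow
{
    // Lazy-style rule: true when `marker` occurs anywhere within the
    // last `windowSize` bytes of `data` (assumed semantics).
    public static bool TailContains(ReadOnlySpan<byte> data, int windowSize, ReadOnlySpan<byte> marker)
    {
        if (windowSize <= 0 || marker.IsEmpty)
        {
            return false;
        }

        // Clamp the window start so files shorter than the window are scanned in full.
        int start = Math.Max(0, data.Length - windowSize);
        return data.Slice(start).IndexOf(marker) >= 0;
    }

    // Strict-style rule: `marker` must be the very last bytes of `data`.
    public static bool EndsWith(ReadOnlySpan<byte> data, ReadOnlySpan<byte> marker)
        => data.EndsWith(marker);
}
```

With this sketch, a buffer that ends in `%%EOF` followed by a few padding bytes fails `EndsWith` but passes `TailContains` with a 1024-byte window, which is the case Lazy mode is meant to accept.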

Tests

  • Updated Validator unit tests for the new validationType parameter.
  • Added PDF unit tests to cover strict vs. relaxed (Lazy) behavior, including trailing bytes and the “last 1024 bytes” window.

Notes / Backwards compatibility

  • Existing formats remain unchanged unless they opt into mode-specific checks.
  • If callers do not pass validationType, the default is used (intended: Strict).

- Add FileByteType enum to represent validation strictness modes (default: Strict)
- Extend IFormFileTypeProvider.FindValidatedTypeAsync with optional `validationType` parameter
- Extend IValidator.IsValidAsync with optional `validationType` parameter
- Update FormFileTypeProvider to pass `validationType` through to the validator
- Update Validator to prefer FileByteFilter.Matches(byte[], FileByteType) when available, with fallback to IFileType.Matches(byte[])
- Keep existing behavior for all callers that do not pass `validationType` (Strict remains the default)
- Keep strict PDF validation by requiring %%EOF at the physical end of the file (existing EndsWithAnyOf variants)
- Add relaxed PDF validation mode using TailContains(1024, "%%EOF") to accept PDFs with trailing bytes after EOF
- Route strict vs. relaxed behavior through FileByteType (Strict vs. Lazy) within a single Pdf.cs implementation
- Improve compatibility with PDFs tolerated by common viewers while preserving strict validation as the default
private readonly List<ByteCheck> _neededByteChecks = [];
private readonly List<ByteCheck[]> _oneOfEachByteChecks = [];
private readonly List<byte?[]> _anywhereByteChecks = [];
private readonly List<TailContainsCheck> _tailContainsChecks = [];
Review comment (Contributor, Author):

The multiple list fields are intentional. I store checks by check kind (fixed offset / one-of / anywhere / tail-window) and by validation mode.
• The “base” lists are global and always evaluated. This keeps existing formats backwards compatible, because they continue to register checks without specifying a mode.
• The Strict lists are evaluated only when Matches(..., Strict) is requested.
• The Lazy lists are evaluated only when Matches(..., Lazy) is requested.

This avoids mutating filter instances at runtime and allows a single format (e.g. PDF) to define strict and relaxed rules side-by-side without duplicating file type classes.

I’m happy to adjust this if there’s a cleaner approach (e.g. grouping the lists into a small CheckSet per mode or using a dictionary-based structure) - suggestions welcome.
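For discussion, the dictionary-based alternative could look roughly like the sketch below. It is deliberately simplified and rests on assumptions: `ModeAwareFilter`, `Add`, and the use of plain predicates are hypothetical stand-ins, whereas the real FileByteFilter keeps separate lists per check kind. Only the `FileByteType` names (Strict, Lazy) and the evaluation order (global checks first, then the checks for the requested mode) come from this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum FileByteType { Strict, Lazy }

// Hypothetical sketch; each check is reduced to a predicate over the file bytes.
public sealed class ModeAwareFilter
{
    // Global checks: evaluated for every mode.
    private readonly List<Func<byte[], bool>> _baseChecks = new List<Func<byte[], bool>>();

    // Mode-specific checks: evaluated only when that mode is requested.
    private readonly Dictionary<FileByteType, List<Func<byte[], bool>>> _modeChecks =
        new Dictionary<FileByteType, List<Func<byte[], bool>>>
        {
            [FileByteType.Strict] = new List<Func<byte[], bool>>(),
            [FileByteType.Lazy] = new List<Func<byte[], bool>>(),
        };

    // Omitting `mode` registers a global check, so existing registrations keep working.
    public ModeAwareFilter Add(Func<byte[], bool> check, FileByteType? mode = null)
    {
        var target = mode is null ? _baseChecks : _modeChecks[mode.Value];
        target.Add(check);
        return this;
    }

    // All global checks must pass, plus all checks registered for the requested mode.
    public bool Matches(byte[] data, FileByteType mode = FileByteType.Strict)
        => _baseChecks.All(check => check(data))
           && _modeChecks[mode].All(check => check(data));
}
```

The filter is configured once and never mutated at match time, which is the same property the separate list fields provide. Whether the extra indirection is worth it depends on how many modes are expected in the future.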

- Update tests for new IsValidAsync `validationType` parameter (default: Strict)
- Add tests covering Strict mode EOF-at-end requirement
- Add tests covering Lazy mode TailContains(1024) behavior (trailing bytes + EOF window)
- Ensure PDFs with invalid header/EOF markers are still rejected
- Update README to document FileByteType validation modes (Strict vs Lazy) and new optional parameters
- Add usage examples for selecting validationType in IFormFileTypeProvider and Validator
- Document mode-specific magic byte configuration on FileByteFilter (optional FileByteType parameter)
- Bump MagicBytesValidator package version in csproj to reflect the new API/documentation
maxbreuker force-pushed the feature/Format-Type-Bytes branch from 19b6810 to 02834a7 on February 18, 2026 14:37