feat: augmentation robustness pass for image and video benchmarks #112

Open: dylanuys wants to merge 9 commits into main from dev

dylanuys (Contributor) commented Jun 19, 2026

feat: augmentation robustness pass for image and video benchmarks

Summary

Adds a second evaluation pass to `gasbench run` that re-evaluates the same samples under a fixed suite of real-world image/video degradations, producing a dedicated `aug_sn34_score`, a per-sample probability-degradation signal, and an `augmentation_robustness` ratio alongside the existing base score.

The motivation is that deepfake detectors deployed in the wild never see pristine generated content — every image or video passes through at least one lossy encode/decode cycle before a human or system can flag it. A model that achieves a high base sn34_score but collapses under JPEG recompression or platform re-encoding is not actually useful; this PR makes that failure mode visible and scoreable.


Background and motivation

Several detection robustness studies establish that codec and JPEG compression are by far the dominant real-world degradation sources for forensic detectors:

FaceForensics++ (Rössler et al., ICCV 2019) introduced the canonical evaluation protocol: train and test at two H.264 compression levels — CRF 23 ("c23", light, YouTube-tier) and CRF 40 ("c40", heavy, WhatsApp/Messenger-tier). Nearly every subsequent deepfake detection paper reports results on these two tracks because platform re-encoding is the single most common reason detector performance degrades between the lab and production.

DeeperForensics-1.0 (Jiang et al., CVPR 2020) extended this with a seven-distortion robustness benchmark and showed that blur and noise degrade detector performance far less than codec/JPEG artifacts, which motivated our decision to drop Gaussian blur from the suite.

Social media pipeline studies (Verdoliva, 2020; Gragnaniello et al., 2021) showed that Instagram, WhatsApp, and Twitter each apply chroma-subsampled JPEG at platform-specific quality settings (roughly q55–85) and that this single operation destroys the high-frequency DCT-domain fingerprints most detectors rely on.


What changes

New augmentation functions (processing/transforms.py)

apply_robustness_augmentations (image)

A fixed pipeline grounded in the social media compression literature. All parameters are pinned constants — not CLI args — to ensure aug_sn34_score values are comparable across runs:

| Step | Operation | Rationale |
|---|---|---|
| 1 | Bilinear downscale 0.5× then upscale | Thumbnail/CDN resize pipeline; destroys sub-pixel forensic traces |
| 2 | JPEG roundtrip at quality 55 | First platform upload (WhatsApp ≈ 55, Telegram ≈ 55–65) |
| 3 | WebP roundtrip at quality 75 | CDN re-host and browser delivery (Google, Twitter/X serve WebP directly to browsers; our browser extension receives WebP from CDNs) |
| 4 | JPEG roundtrip at quality 80 | Re-share / re-host recompression (Instagram ≈ 78–85, Twitter ≈ 85) |

Chroma subsampling is pinned to 4:2:0 throughout (matching platform behavior).

Gaussian blur was removed because it models optical/focus degradation at capture time, not anything in a social media distribution pipeline, and because DeeperForensics empirically showed it has lower impact on detector performance than compression artifacts.

apply_video_robustness_augmentations (video)

Implements the FaceForensics++ evaluation protocol: a full H.264 encode/decode roundtrip at CRF 23 (YouTube-tier). Backends are tried in this order, each one used only if the previous is unavailable or fails:

  1. ffmpeg (primary) — real -crf flag, forces even W/H for libx264 yuv420p compatibility
  2. cv2 avc1 (fallback) — maps CRF [18, 51] → cv2 quality [100, 0]
  3. Per-frame JPEG (last resort) — preserves chroma subsampling artifacts if no video codec available

| CRF | Platform analogue |
|---|---|
| 23 | YouTube, high-quality streaming (FF++ c23) |
| 40 | WhatsApp, Messenger heavy compression (FF++ c40) |
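
For reviewers checking the cv2 fallback numbers, here is a minimal sketch of the linear CRF-to-quality mapping described in the backend list. The function name `crf_to_cv2_quality` is hypothetical and the rounding and clamping behavior are assumptions; only the two ranges, CRF [18, 51] and quality [100, 0], come from this PR. As the later commit notes, cv2 may ignore the quality property on many builds, so this value is a best-effort hint rather than a true CRF.

```python
def crf_to_cv2_quality(crf: float, crf_min: int = 18, crf_max: int = 51) -> int:
    """Linearly map CRF in [crf_min, crf_max] onto a quality value in [100, 0].

    Lower CRF means higher quality, so the scale is inverted.
    Out-of-range CRF values are clamped (an assumption, not confirmed by the PR).
    """
    clamped = min(max(crf, crf_min), crf_max)
    return round(100 * (crf_max - clamped) / (crf_max - crf_min))


if __name__ == "__main__":
    for crf in (18, 23, 40, 51):
        print(crf, "->", crf_to_cv2_quality(crf))
```

Under these assumptions CRF 23 lands at quality 85 and CRF 40 at quality 33, which matches the values quoted in the first commit message.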

Suite params (CRF, FPS, scale factor) are pinned constants in code for benchmark integrity.
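
The even-dimension requirement of the ffmpeg path can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `even_dims` and `ffmpeg_crf_args` are hypothetical names, the pinned constant is shown as a module-level value, and the real implementation may pass frames to ffmpeg differently (for example over a pipe rather than via file paths). The ffmpeg options used (`-c:v libx264`, `-crf`, `-pix_fmt yuv420p`) are standard.

```python
from typing import List, Tuple

ROBUSTNESS_CRF = 23  # pinned suite constant (name is hypothetical)


def even_dims(width: int, height: int) -> Tuple[int, int]:
    """Drop at most one trailing column/row so both dimensions are even.

    libx264 with yuv420p rejects odd sizes. A degenerate axis (< 2 px)
    is left untouched so a fallback path can handle it.
    """
    def trim(n: int) -> int:
        return n - (n % 2) if n >= 2 else n

    return trim(width), trim(height)


def ffmpeg_crf_args(src: str, dst: str, crf: int = ROBUSTNESS_CRF) -> List[str]:
    """Build an ffmpeg argument list for an H.264 encode at the given CRF."""
    return [
        "ffmpeg", "-y",          # overwrite the output without prompting
        "-i", src,
        "-c:v", "libx264",
        "-crf", str(crf),
        "-pix_fmt", "yuv420p",   # 4:2:0 chroma, requires even width/height
        dst,
    ]


if __name__ == "__main__":
    print(even_dims(97, 131))
    print(" ".join(ffmpeg_crf_args("in.mp4", "out.mp4")))
```

Keeping the CRF as a constant rather than a parameter threaded from the CLI is what makes scores comparable across runs; the `crf` argument above exists only to make the sketch testable.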

Second evaluation pass (image_bench.py, video_bench.py)

When --n-aug-per-dataset N is set, a second pass iterates the same N samples per dataset (same seed → same files → same sample_ids) through the robustness transform before inference. Aug-pass rows are written to the same tracker and parquet with aug_pass=True.

The two passes are fully decoupled — base metrics, per-dataset accuracy tables, and the existing sn34_score are computed on base-pass rows only. Aug rows are filtered out before any existing aggregation path.

Deterministic aug seed: the aug transform seed defaults to 42 when --seed is not provided, ensuring reproducible results without requiring an explicit seed. DatasetIterator seed is unchanged so sample selection and the sample_id join remain consistent.
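
A minimal sketch of the two behaviors described above: the seed default and the base/aug decoupling. The helper names and the list-of-dicts row shape are assumptions made for illustration; only the default of 42 and the `aug_pass` column name come from this PR.

```python
from typing import Dict, List, Optional

DEFAULT_AUG_SEED = 42  # used only when --seed is not provided


def resolve_aug_seed(seed: Optional[int]) -> int:
    """Return the transform seed for the aug pass.

    `is None` is checked explicitly so that an explicit --seed 0 is honored.
    """
    return DEFAULT_AUG_SEED if seed is None else seed


def split_passes(rows: List[Dict]) -> Dict[str, List[Dict]]:
    """Separate base-pass rows from aug-pass rows.

    Rows without an `aug_pass` key are treated as base rows (an assumption),
    which keeps older result files compatible with base-only aggregation.
    """
    base = [r for r in rows if not r.get("aug_pass", False)]
    aug = [r for r in rows if r.get("aug_pass", False)]
    return {"base": base, "aug": aug}


if __name__ == "__main__":
    rows = [
        {"sample_id": "a", "correct": True},
        {"sample_id": "a", "correct": False, "aug_pass": True},
    ]
    passes = split_passes(rows)
    print(resolve_aug_seed(None), len(passes["base"]), len(passes["aug"]))
```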

Metrics (recording.py)

When aug rows are present, compute_metrics_from_df computes:

| Field | Meaning |
|---|---|
| aug_sn34_score | sn34 computed on augmented samples (uses same provenance weights as base) |
| augmentation_robustness | aug_sn34 / base_sn34; a ratio close to 1.0 = robust |
| aug_paired_samples | Samples matched across base and aug passes via sample_id |
| aug_mean_prob_degradation | Mean drop in P(correct class) per paired sample |
| aug_p95_prob_degradation | 95th-percentile degradation (captures tail behaviour) |

Per-sample degradation is computed by joining on sample_id (SHA-256 hash of file-identity fields), so the same file evaluated clean and augmented is always paired correctly.
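
To make the paired metrics concrete, here is a sketch that works on plain dictionaries mapping sample_id to P(correct class). It is not the `compute_metrics_from_df` implementation: the function names are hypothetical, the percentile uses linear interpolation between closest ranks (the PR does not state which method is used), and returning `None` for an empty join or a zero base score is an assumption.

```python
import math
from typing import Dict, List, Optional


def percentile(values: List[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def paired_degradation(base_p: Dict[str, float], aug_p: Dict[str, float]) -> Dict:
    """Join base and aug passes on sample_id and summarize the probability drop.

    Positive degradation means the augmented sample was scored worse.
    """
    shared = sorted(base_p.keys() & aug_p.keys())  # inner join on sample_id
    if not shared:
        return {
            "aug_paired_samples": 0,
            "aug_mean_prob_degradation": None,
            "aug_p95_prob_degradation": None,
        }
    drops = [base_p[s] - aug_p[s] for s in shared]
    return {
        "aug_paired_samples": len(shared),
        "aug_mean_prob_degradation": sum(drops) / len(drops),
        "aug_p95_prob_degradation": percentile(drops, 95),
    }


def augmentation_robustness(aug_sn34: float, base_sn34: float) -> Optional[float]:
    """aug_sn34 / base_sn34, or None when the base score is zero."""
    return aug_sn34 / base_sn34 if base_sn34 else None


if __name__ == "__main__":
    base = {"a": 0.9, "b": 0.8, "c": 0.7}
    aug = {"a": 0.6, "b": 0.8, "d": 0.5}
    print(paired_degradation(base, aug))
    print(augmentation_robustness(0.6, 0.8))
```

Samples that appear in only one pass ("c" and "d" in the usage example) are excluded from the degradation statistics, so `aug_paired_samples` can be smaller than either pass.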

CLI flags

--n-aug-per-dataset N    Number of samples per dataset for the robustness pass (default: 0, disabled)
--aug-weight W           Weight of aug_sn34_score in any blended downstream scoring (default: 0.2)
--aug-cache-dir PATH     Cache pre-augmented arrays to disk; subsequent runs load from cache
                         instead of re-augmenting (recommended for repeated bmcore runs)

Suite-spec parameters (JPEG quality, scale factor, WebP quality, CRF) are intentionally not exposed as CLI flags — they are pinned constants in code to ensure aug_sn34_score values are comparable across runs.

Example — image benchmark with aug robustness pass, 100 samples per dataset, results cached:

gasbench run --image-model ./my_model/ --n-aug-per-dataset 100 --aug-cache-dir /cache/aug

Aug array cache (--aug-cache-dir)

First run: augmented arrays written to {aug_cache_dir}/{sample_id[:2]}/{sample_id}_v1.npy (2-char prefix sharding for filesystem scalability). Subsequent runs load from disk — no ffmpeg, no PIL, no recompute. Atomic write via temp file + os.replace, safe under concurrent worker threads. Version suffix (img_v1 / vid_v1) auto-invalidates on suite changes.
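
The cache layout and atomic write can be sketched with the standard library alone. Everything named here is illustrative: `make_sample_id` stands in for the repository's own hashing helper (its actual input fields are not listed in this PR), the field separator is an assumption, and raw bytes are written instead of a serialized NumPy array to keep the example dependency-free.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def make_sample_id(*identity_fields: str) -> str:
    """SHA-256 hex digest over file-identity fields (field choice is hypothetical)."""
    joined = "\x1f".join(identity_fields)  # unit separator avoids ambiguous joins
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def cache_path(cache_dir: str, sample_id: str, version: str = "v1") -> Path:
    """Shard by the first two hex characters to keep directories small."""
    return Path(cache_dir) / sample_id[:2] / f"{sample_id}_{version}.npy"


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Write to a temp file in the target directory, then os.replace into place.

    Readers see either the old file or the complete new one, never a partial
    write, because os.replace is atomic within a single filesystem.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if the write or the rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        sid = make_sample_id("dataset-a", "clip_001.mp4")
        target = cache_path(cache_dir, sid)
        atomic_write_bytes(target, b"augmented-array-bytes")
        print(target.relative_to(cache_dir))
```

Creating the temp file in the same directory as the target matters: `os.replace` is only atomic when source and destination are on the same filesystem.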


What was explicitly excluded

  • Gaussian blur — models optical degradation at capture, not distribution. DeeperForensics shows it is among the least impactful distortions on detector performance.
  • Additive noise — same reasoning; not present in platform pipelines.
  • Adversarial perturbations — out of scope; these are not real-world transforms.
  • Color jitter / random flips — change the forensic content, not the distribution channel artifacts.

References

  1. Rössler, A. et al. FaceForensics++: Learning to Detect Manipulated Facial Images. ICCV 2019.
  2. Jiang, L. et al. DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection. CVPR 2020.
  3. Verdoliva, L. Media Forensics and DeepFakes: an Overview. IEEE J-STSP 2020.
  4. Gragnaniello, D. et al. Are GAN generated images easy to detect? A critical analysis of the state-of-the-art. ICME 2021.

🤖 Generated with Claude Code

dylanuys and others added 3 commits June 18, 2026 13:33
Adds a second evaluation pass per dataset that re-evaluates the same
samples under real-world distribution transforms, producing a separate
aug_sn34_score and per-sample degradation stats alongside the base score.

Image suite (apply_robustness_augmentations):
- Downscale 0.5x + upscale (thumbnail/CDN pipeline simulation)
- JPEG roundtrip at q55 (heavy platform upload, e.g. WhatsApp)
- Second JPEG roundtrip at q80 (re-share recompression)
- Gaussian blur removed — not an internet-pipeline transform

Video suite (apply_video_robustness_augmentations):
- H.264 encode/decode roundtrip via cv2 avc1 codec
- CRF maps to cv2 quality scale (CRF 23 → q85, CRF 40 → q33)
- Mirrors FaceForensics++ c23/c40 evaluation protocol
- Falls back to per-frame JPEG if codec unavailable

Metrics added to results when n_aug_per_dataset > 0:
- aug_sn34_score: sn34 computed on augmented samples
- augmentation_robustness: aug_sn34 / base_sn34 ratio
- aug_paired_samples / aug_mean_prob_degradation / aug_p95_prob_degradation
  (per-sample degradation via sample_id join across base/aug passes)

New CLI flags: --n-aug-per-dataset N, --aug-weight W, --robustness-crf CRF

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: add augmentation robustness pass to image and video benchmarks

Extends the augmentation robustness pass with real-world distribution-channel
transforms missing from the initial suite:

- WebP roundtrip (compress_image_webp_pil) wired into the image chain between
  the two JPEG passes, modeling CDN/platform re-hosting (Facebook, Google).
  WebP's VP8 intra coding leaves a different artifact family than JPEG DCT, so
  detectors that survive repeated JPEG can still fail it. Opt out with
  webp_quality=None.
- JPEG chroma subsampling pinned to 4:2:0 (subsampling=2) so the pass actually
  applies the high-frequency DCT-fingerprint destruction its motivation cites,
  instead of letting Pillow vary subsampling by quality/version.
- Faithful FaceForensics++ CRF reproduction via the ffmpeg CLI
  (_h264_roundtrip_ffmpeg, real -crf). Encode order is now ffmpeg -> cv2 avc1
  -> per-frame JPEG, with the path used recorded in params["method"]. cv2's
  VIDEOWRITER_PROP_QUALITY does not map to CRF and is ignored on many builds,
  so the previous path did not reliably produce the requested CRF.
- Video resolution ladder via scale_factor (default 1.0 = unchanged FF++
  behavior; <1.0 downscales frames before encode to model platform transcodes).

Verified end-to-end with numpy/opencv/Pillow/ffmpeg: WebP alters pixels, the
full image chain is deterministic per seed, webp opt-out works, and the video
path engages the ffmpeg -crf encoder with correct output shape.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kenobijon and others added 6 commits June 19, 2026 14:37
feat: WebP re-host, faithful ffmpeg CRF, and video resolution ladder for robustness aug
libx264 with yuv420p rejects odd width/height, so _h264_roundtrip_ffmpeg
exited non-zero on any frame with an odd dimension and the pass silently fell
back to cv2/per-frame JPEG — while params["method"] could still imply a CRF
roundtrip that never happened. The resolution ladder's max(2, round(...)) did
not guarantee evenness either.

Trim a trailing row/column to even dimensions once before encoding, covering
both the ffmpeg -crf and cv2 avc1 paths (both are H.264/4:2:0). At most one
pixel per axis is dropped, and the frame is resized to target afterward, so
the effect on the benchmark is nil. Guarded so a degenerate <2px axis is left
untouched for the fallback to handle.

Verified: 97x131 input (both dims odd), odd input with scale_factor=0.5, and
an even-dim regression case all now engage method="ffmpeg_crf".

Reported by Cursor bot review on #113.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix: force even dimensions before H.264 encode (odd sizes broke ffmpeg CRF)
robustness_jpeg_quality, robustness_scale_factor, robustness_crf were
user-configurable CLI args, but benchmark scores must be comparable
across runs — suite params should be pinned constants, not flags.

Removed from: cli.py, run_benchmark, execute_benchmark, run_image_benchmark,
run_video_benchmark, BenchmarkRunConfig, PrefetchPipeline,
VideoPrefetchPipeline. Kept: --n-aug-per-dataset and --aug-weight (scope
and scoring controls are legitimate user choices).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Seed: aug pass transform seed defaults to 42 when --seed is not
   provided, ensuring augmentation results are reproducible even without
   an explicit seed. DatasetIterator seed is unchanged so sample
   selection (and sample_id join) remains consistent.

2. Cache: --aug-cache-dir <path> writes post-augmentation arrays to
   disk on first run ({sample_id[:2]}/{sample_id}_v1.npy, sharded for
   filesystem scalability). Subsequent runs load from cache instead of
   re-running transforms. Atomic write via temp-file + os.replace.
   Cache key uses build_sample_id() (same hash as the parquet join key)
   and embeds a version suffix (img_v1 / vid_v1) for auto-invalidation
   on suite changes. Opt-in only — no change when flag is omitted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: aug CLI cleanup, deterministic seed, and optional array cache