feat: augmentation robustness pass for image and video benchmarks by dylanuys · Pull Request #112 · BitMind-AI/gasbench

dylanuys · 2026-06-19T18:23:27Z

feat: augmentation robustness pass for image and video benchmarks

Summary

Adds a second evaluation pass to `gasbench run` that re-evaluates the same samples under a fixed suite of real-world image/video degradations, producing a dedicated `aug_sn34_score`, a per-sample probability-degradation signal, and an `augmentation_robustness` ratio alongside the existing base score.

The motivation is that deepfake detectors deployed in the wild never see pristine generated content — every image or video passes through at least one lossy encode/decode cycle before a human or system can flag it. A model that achieves a high base sn34_score but collapses under JPEG recompression or platform re-encoding is not actually useful; this PR makes that failure mode visible and scoreable.

Background and motivation

Several detection robustness studies establish that codec and JPEG compression are by far the dominant real-world degradation sources for forensic detectors:

FaceForensics++ (Rössler et al., ICCV 2019) introduced the canonical evaluation protocol: train and test at two H.264 compression levels — CRF 23 ("c23", light, YouTube-tier) and CRF 40 ("c40", heavy, WhatsApp/Messenger-tier). Nearly every subsequent deepfake detection paper reports results on these two tracks because platform re-encoding is the single most common reason detector performance degrades between the lab and production.

DeeperForensics-1.0 (Jiang et al., CVPR 2020) extended this with a seven-distortion robustness benchmark and showed that blur and noise degrade detector performance far less than codec/JPEG artifacts, which motivated our decision to drop Gaussian blur from the suite.

Social media pipeline studies (Verdoliva, 2020; Gragnaniello et al., 2021) showed that Instagram, WhatsApp, and Twitter each apply chroma-subsampled JPEG at platform-specific quality settings (roughly q55–85) and that this single operation destroys the high-frequency DCT-domain fingerprints most detectors rely on.

What changes

New augmentation functions (`processing/transforms.py`)

`apply_robustness_augmentations` (image)

A fixed pipeline grounded in the social media compression literature. All parameters are pinned constants — not CLI args — to ensure aug_sn34_score values are comparable across runs:

| Step | Operation | Rationale |
|---|---|---|
| 1 | Bilinear downscale 0.5× then upscale | Thumbnail/CDN resize pipeline; destroys sub-pixel forensic traces |
| 2 | JPEG roundtrip at quality 55 | First platform upload (WhatsApp ≈ 55, Telegram ≈ 55–65) |
| 3 | WebP roundtrip at quality 75 | CDN re-host and browser delivery (Google, Twitter/X serve WebP directly to browsers; our browser extension receives WebP from CDNs) |
| 4 | JPEG roundtrip at quality 80 | Re-share / re-host recompression (Instagram ≈ 78–85, Twitter ≈ 85) |

Chroma subsampling is pinned to 4:2:0 throughout (matching platform behavior).
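The four steps above can be sketched with Pillow as follows. This is a minimal illustration under stated assumptions, not the code in `processing/transforms.py`: the function name `robustness_chain_sketch`, the helper `_roundtrip`, and the constant names are hypothetical, and the real function may operate on arrays rather than `PIL.Image` objects. The quality values, scale factor, and 4:2:0 subsampling are taken from the table.

```python
import io

from PIL import Image

# Pinned suite constants, copied from the table above.
SCALE_FACTOR = 0.5
JPEG_QUALITY_UPLOAD = 55
WEBP_QUALITY = 75
JPEG_QUALITY_RESHARE = 80
CHROMA_420 = 2  # Pillow's JPEG `subsampling` code for 4:2:0


def _roundtrip(img, fmt, **save_kwargs):
    """Encode the image to an in-memory buffer, then decode it again."""
    buf = io.BytesIO()
    img.save(buf, format=fmt, **save_kwargs)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def robustness_chain_sketch(img):
    """Hypothetical stand-in for the image robustness chain."""
    img = img.convert("RGB")
    width, height = img.size
    small = (max(1, round(width * SCALE_FACTOR)),
             max(1, round(height * SCALE_FACTOR)))
    # Step 1: bilinear downscale, then upscale back to the original size.
    img = img.resize(small, Image.BILINEAR).resize((width, height), Image.BILINEAR)
    # Step 2: first platform upload (JPEG q55, 4:2:0).
    img = _roundtrip(img, "JPEG", quality=JPEG_QUALITY_UPLOAD, subsampling=CHROMA_420)
    # Step 3: CDN re-host (WebP q75).
    img = _roundtrip(img, "WEBP", quality=WEBP_QUALITY)
    # Step 4: re-share recompression (JPEG q80, 4:2:0).
    img = _roundtrip(img, "JPEG", quality=JPEG_QUALITY_RESHARE, subsampling=CHROMA_420)
    return img
```

The output keeps the input resolution, so a model sees the same tensor shape in both passes; only the pixel content is degraded.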

Gaussian blur was removed because it models optical/focus degradation at capture time, not anything in a social media distribution pipeline, and because DeeperForensics empirically showed it has lower impact on detector performance than compression artifacts.

`apply_video_robustness_augmentations` (video)

Implements the FaceForensics++ evaluation protocol — full H.264 encode/decode roundtrip at CRF 23 (YouTube-tier). Cascade of backends:

1. ffmpeg (primary): real -crf flag; forces even W/H for libx264 yuv420p compatibility
2. cv2 avc1 (fallback): maps CRF [18, 51] → cv2 quality [100, 0]
3. Per-frame JPEG (last resort): preserves chroma subsampling artifacts if no video codec is available
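The arithmetic behind the first two backends can be sketched as below. All three function names are hypothetical. The linear CRF mapping is inferred from the stated ranges (it reproduces the CRF 23 → q85 and CRF 40 → q33 figures quoted in the commit log), and the ffmpeg argument list only illustrates standard `libx264` options; how frames actually reach ffmpeg in this PR is not stated.

```python
def crf_to_cv2_quality(crf):
    """Linearly map CRF [18, 51] onto cv2 quality [100, 0], clamping the input."""
    crf = min(51, max(18, crf))
    return round(100 * (51 - crf) / (51 - 18))


def even_size(width, height):
    """Trim at most one pixel per axis so libx264 with yuv420p accepts the frame."""
    if width < 2 or height < 2:
        # Degenerate axis: leave untouched and let a fallback path handle it.
        return width, height
    return width - (width % 2), height - (height % 2)


def ffmpeg_crf_args(src, dst, crf=23):
    """Build an ffmpeg argument list for an H.264 re-encode with a real -crf."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", "libx264",
        "-crf", str(crf),
        "-pix_fmt", "yuv420p",
        dst,
    ]


print(crf_to_cv2_quality(23), crf_to_cv2_quality(40))
print(even_size(97, 131))
```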

| CRF | Platform analogue |
|---|---|
| 23 | YouTube, high-quality streaming (FF++ c23) |
| 40 | WhatsApp, Messenger heavy compression (FF++ c40) |

Suite params (CRF, FPS, scale factor) are pinned constants in code for benchmark integrity.

Second evaluation pass (`image_bench.py`, `video_bench.py`)

When --n-aug-per-dataset N is set, a second pass iterates the same N samples per dataset (same seed → same files → same sample_ids) through the robustness transform before inference. Aug-pass rows are written to the same tracker and parquet with aug_pass=True.

The two passes are fully decoupled — base metrics, per-dataset accuracy tables, and the existing sn34_score are computed on base-pass rows only. Aug rows are filtered out before any existing aggregation path.
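The decoupling described above amounts to partitioning rows on the `aug_pass` flag before any aggregation. A minimal sketch, assuming rows are plain dictionaries; the `correct` column and the function names are hypothetical and stand in for whatever the tracker actually records:

```python
def split_passes(rows):
    """Partition tracker rows into base-pass and aug-pass lists."""
    base = [r for r in rows if not r.get("aug_pass", False)]
    aug = [r for r in rows if r.get("aug_pass", False)]
    return base, aug


def accuracy(rows):
    """Fraction of rows marked correct; NaN when there are no rows."""
    if not rows:
        return float("nan")
    return sum(1 for r in rows if r["correct"]) / len(rows)


rows = [
    {"sample_id": "a", "aug_pass": False, "correct": True},
    {"sample_id": "b", "aug_pass": False, "correct": True},
    {"sample_id": "a", "aug_pass": True, "correct": True},
    {"sample_id": "b", "aug_pass": True, "correct": False},
]

base_rows, aug_rows = split_passes(rows)
base_accuracy = accuracy(base_rows)  # computed on base-pass rows only
aug_accuracy = accuracy(aug_rows)    # reported separately
```

Rows without an `aug_pass` key are treated as base-pass rows here, which keeps older result files readable; whether the real code does the same is an assumption.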

Deterministic aug seed: the aug transform seed defaults to 42 when --seed is not provided, ensuring reproducible results without requiring an explicit seed. DatasetIterator seed is unchanged so sample selection and the sample_id join remain consistent.
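The seed fallback can be sketched as follows. The names `DEFAULT_AUG_SEED`, `resolve_aug_seed`, and `draw` are hypothetical, and `random.Random` is used only to demonstrate reproducibility; which random source the transform actually consumes is not stated in this PR.

```python
import random

DEFAULT_AUG_SEED = 42


def resolve_aug_seed(cli_seed):
    """Use the CLI seed when given; otherwise fall back to the pinned default."""
    return DEFAULT_AUG_SEED if cli_seed is None else cli_seed


def draw(seed, n=3):
    """Draw n values from a generator that is private to the aug transform."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Two runs without --seed produce identical draws.
first = draw(resolve_aug_seed(None))
second = draw(resolve_aug_seed(None))
```

Comparing against `None` rather than relying on truthiness keeps an explicit `--seed 0` from being silently replaced by the default.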

Metrics (`recording.py`)

When aug rows are present, compute_metrics_from_df computes:

| Field | Meaning |
|---|---|
| `aug_sn34_score` | sn34 computed on augmented samples (uses same provenance weights as base) |
| `augmentation_robustness` | `aug_sn34 / base_sn34`; a ratio close to 1.0 indicates robustness |
| `aug_paired_samples` | Samples matched across base and aug passes via `sample_id` |
| `aug_mean_prob_degradation` | Mean drop in `P(correct class)` per paired sample |
| `aug_p95_prob_degradation` | 95th-percentile degradation (captures tail behaviour) |

Per-sample degradation is computed by joining on sample_id (SHA-256 hash of file-identity fields), so the same file evaluated clean and augmented is always paired correctly.
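The paired metrics can be illustrated with a small pure-Python sketch. It assumes each pass has already been reduced to a mapping from `sample_id` to `P(correct class)`; the function names are hypothetical, and the nearest-rank percentile is an assumption (the actual `compute_metrics_from_df` may interpolate, as pandas and NumPy do by default).

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (no interpolation)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def paired_degradation(base, aug):
    """Per-sample drop in P(correct class) for sample_ids present in both passes."""
    shared = sorted(base.keys() & aug.keys())
    return [base[s] - aug[s] for s in shared]


def robustness_summary(base_sn34, aug_sn34, base, aug):
    """Assemble the aug fields from the table above."""
    drops = paired_degradation(base, aug)
    nan = float("nan")
    return {
        "aug_sn34_score": aug_sn34,
        "augmentation_robustness": aug_sn34 / base_sn34 if base_sn34 else nan,
        "aug_paired_samples": len(drops),
        "aug_mean_prob_degradation": sum(drops) / len(drops) if drops else nan,
        "aug_p95_prob_degradation": nearest_rank_percentile(drops, 95) if drops else nan,
    }


base_probs = {"a": 0.9, "b": 0.8, "c": 0.7}
aug_probs = {"a": 0.6, "b": 0.8, "d": 0.5}
summary = robustness_summary(0.8, 0.4, base_probs, aug_probs)
```

Only `a` and `b` appear in both passes, so `c` and `d` are excluded from the degradation statistics rather than being treated as zero.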

CLI flags

--n-aug-per-dataset N    Number of samples per dataset for the robustness pass (default: 0, disabled)
--aug-weight W           Weight of aug_sn34_score in any blended downstream scoring (default: 0.2)
--aug-cache-dir PATH     Cache pre-augmented arrays to disk; subsequent runs load from cache
                         instead of re-augmenting (recommended for repeated bmcore runs)

Suite-spec parameters (JPEG quality, scale factor, WebP quality, CRF) are intentionally not exposed as CLI flags — they are pinned constants in code to ensure aug_sn34_score values are comparable across runs.
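For illustration, the three flags and their documented defaults could be declared with `argparse` as below. This is a hypothetical parser, not the contents of `cli.py`; the `prog` name and help strings are made up, and the suite parameters are deliberately absent.

```python
import argparse


def build_aug_parser():
    """Declare only the scope and scoring controls; suite params stay pinned in code."""
    parser = argparse.ArgumentParser(prog="gasbench-aug-sketch")
    parser.add_argument(
        "--n-aug-per-dataset", type=int, default=0, metavar="N",
        help="samples per dataset for the robustness pass (0 disables it)",
    )
    parser.add_argument(
        "--aug-weight", type=float, default=0.2, metavar="W",
        help="weight of aug_sn34_score in blended downstream scoring",
    )
    parser.add_argument(
        "--aug-cache-dir", default=None, metavar="PATH",
        help="directory for cached pre-augmented arrays",
    )
    return parser


defaults = build_aug_parser().parse_args([])
enabled = build_aug_parser().parse_args(
    ["--n-aug-per-dataset", "100", "--aug-cache-dir", "/cache/aug"]
)
```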

Example — image benchmark with aug robustness pass, 100 samples per dataset, results cached:

gasbench run --image-model ./my_model/ --n-aug-per-dataset 100 --aug-cache-dir /cache/aug

Aug array cache (`--aug-cache-dir`)

First run: augmented arrays written to {aug_cache_dir}/{sample_id[:2]}/{sample_id}_v1.npy (2-char prefix sharding for filesystem scalability). Subsequent runs load from disk — no ffmpeg, no PIL, no recompute. Atomic write via temp file + os.replace, safe under concurrent worker threads. Version suffix (img_v1 / vid_v1) auto-invalidates on suite changes.
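The sharded layout and atomic write can be sketched with the standard library alone. The function names are hypothetical, and the payload is raw bytes standing in for a serialized array; the real cache presumably writes `.npy` data via NumPy.

```python
import os
import tempfile

SUITE_VERSION = "v1"  # bump when the suite changes to invalidate old entries


def cache_path(cache_dir, sample_id, version=SUITE_VERSION):
    """Shard by the first two characters of the sample_id."""
    return os.path.join(cache_dir, sample_id[:2], f"{sample_id}_{version}.npy")


def atomic_write(path, payload):
    """Write to a temp file in the target directory, then rename into place."""
    directory = os.path.dirname(path)
    os.makedirs(directory, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        # os.replace is atomic when source and destination share a filesystem,
        # so readers never observe a partially written file.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

Creating the temp file inside the destination directory is what keeps the rename on one filesystem; a temp file under `/tmp` could cross a mount boundary and lose atomicity.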

What was explicitly excluded

- Gaussian blur: models optical degradation at capture, not distribution. DeeperForensics shows it is among the least impactful distortions on detector performance.
- Additive noise: same reasoning; not present in platform pipelines.
- Adversarial perturbations: out of scope; these are not real-world transforms.
- Color jitter / random flips: change the forensic content, not the distribution channel artifacts.

References

Rössler, A. et al. FaceForensics++: Learning to Detect Manipulated Facial Images. ICCV 2019.
Jiang, L. et al. DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection. CVPR 2020.
Verdoliva, L. Media Forensics and DeepFakes: an Overview. IEEE J-STSP 2020.
Gragnaniello, D. et al. Are GAN generated images easy to detect? A critical analysis of the state-of-the-art. ICME 2021.

🤖 Generated with Claude Code

Adds a second evaluation pass per dataset that re-evaluates the same samples under real-world distribution transforms, producing a separate aug_sn34_score and per-sample degradation stats alongside the base score.

Image suite (apply_robustness_augmentations):
- Downscale 0.5x + upscale (thumbnail/CDN pipeline simulation)
- JPEG roundtrip at q55 (heavy platform upload, e.g. WhatsApp)
- Second JPEG roundtrip at q80 (re-share recompression)
- Gaussian blur removed: not an internet-pipeline transform

Video suite (apply_video_robustness_augmentations):
- H.264 encode/decode roundtrip via cv2 avc1 codec
- CRF maps to cv2 quality scale (CRF 23 → q85, CRF 40 → q33)
- Mirrors FaceForensics++ c23/c40 evaluation protocol
- Falls back to per-frame JPEG if codec unavailable

Metrics added to results when n_aug_per_dataset > 0:
- aug_sn34_score: sn34 computed on augmented samples
- augmentation_robustness: aug_sn34 / base_sn34 ratio
- aug_paired_samples / aug_mean_prob_degradation / aug_p95_prob_degradation (per-sample degradation via sample_id join across base/aug passes)

New CLI flags: --n-aug-per-dataset N, --aug-weight W, --robustness-crf CRF

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: add augmentation robustness pass to image and video benchmarks

…der to robustness aug

Extends the augmentation robustness pass with real-world distribution-channel transforms missing from the initial suite:

- WebP roundtrip (compress_image_webp_pil) wired into the image chain between the two JPEG passes, modeling CDN/platform re-hosting (Facebook, Google). WebP's VP8 intra coding leaves a different artifact family than JPEG DCT, so detectors that survive repeated JPEG can still fail it. Opt out with webp_quality=None.
- JPEG chroma subsampling pinned to 4:2:0 (subsampling=2) so the pass actually applies the high-frequency DCT-fingerprint destruction its motivation cites, instead of letting Pillow vary subsampling by quality/version.
- Faithful FaceForensics++ CRF reproduction via the ffmpeg CLI (_h264_roundtrip_ffmpeg, real -crf). Encode order is now ffmpeg -> cv2 avc1 -> per-frame JPEG, with the path used recorded in params["method"]. cv2's VIDEOWRITER_PROP_QUALITY does not map to CRF and is ignored on many builds, so the previous path did not reliably produce the requested CRF.
- Video resolution ladder via scale_factor (default 1.0 = unchanged FF++ behavior; <1.0 downscales frames before encode to model platform transcodes).

Verified end-to-end with numpy/opencv/Pillow/ffmpeg: WebP alters pixels, the full image chain is deterministic per seed, webp opt-out works, and the video path engages the ffmpeg -crf encoder with correct output shape.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat: WebP re-host, faithful ffmpeg CRF, and video resolution ladder for robustness aug

libx264 with yuv420p rejects odd width/height, so _h264_roundtrip_ffmpeg exited non-zero on any frame with an odd dimension and the pass silently fell back to cv2/per-frame JPEG, while params["method"] could still imply a CRF roundtrip that never happened. The resolution ladder's max(2, round(...)) did not guarantee evenness either.

Trim a trailing row/column to even dimensions once before encoding, covering both the ffmpeg -crf and cv2 avc1 paths (both are H.264/4:2:0). At most one pixel per axis is dropped, and the frame is resized to target afterward, so the effect on the benchmark is nil. Guarded so a degenerate <2px axis is left untouched for the fallback to handle.

Verified: 97x131 input (both dims odd), odd input with scale_factor=0.5, and an even-dim regression case all now engage method="ffmpeg_crf".

Reported by Cursor bot review on #113.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fix: force even dimensions before H.264 encode (odd sizes broke ffmpeg CRF)

robustness_jpeg_quality, robustness_scale_factor, robustness_crf were user-configurable CLI args, but benchmark scores must be comparable across runs; suite params should be pinned constants, not flags.

Removed from: cli.py, run_benchmark, execute_benchmark, run_image_benchmark, run_video_benchmark, BenchmarkRunConfig, PrefetchPipeline, VideoPrefetchPipeline.

Kept: --n-aug-per-dataset and --aug-weight (scope and scoring controls are legitimate user choices).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1. Seed: aug pass transform seed defaults to 42 when --seed is not provided, ensuring augmentation results are reproducible even without an explicit seed. DatasetIterator seed is unchanged so sample selection (and sample_id join) remains consistent.
2. Cache: --aug-cache-dir <path> writes post-augmentation arrays to disk on first run ({sample_id[:2]}/{sample_id}_v1.npy, sharded for filesystem scalability). Subsequent runs load from cache instead of re-running transforms. Atomic write via temp-file + os.replace. Cache key uses build_sample_id() (same hash as the parquet join key) and embeds a version suffix (img_v1 / vid_v1) for auto-invalidation on suite changes. Opt-in only: no change when flag is omitted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: aug CLI cleanup, deterministic seed, and optional array cache

dylanuys and others added 3 commits June 18, 2026 13:33

Merge pull request #111 from BitMind-AI/feat/aug-robustness

b52ae04

feat: add augmentation robustness pass to image and video benchmarks

kenobijon mentioned this pull request Jun 19, 2026

feat: WebP re-host, faithful ffmpeg CRF, and video resolution ladder for robustness aug #113

Merged

kenobijon and others added 6 commits June 19, 2026 14:37

Merge pull request #113 from BitMind-AI/feat/aug-webp-ffmpeg-resolution

453f447

feat: WebP re-host, faithful ffmpeg CRF, and video resolution ladder for robustness aug

Merge pull request #114 from BitMind-AI/fix/ffmpeg-odd-dimensions

5c50afb

fix: force even dimensions before H.264 encode (odd sizes broke ffmpeg CRF)

Merge pull request #115 from BitMind-AI/fix/aug-cli-cleanup

681ff76

fix: aug CLI cleanup, deterministic seed, and optional array cache

dylanuys wants to merge 9 commits into main from dev
