Adds a second evaluation pass per dataset that re-evaluates the same samples under real-world distribution transforms, producing a separate aug_sn34_score and per-sample degradation stats alongside the base score.

Image suite (apply_robustness_augmentations):
- Downscale 0.5x + upscale (thumbnail/CDN pipeline simulation)
- JPEG roundtrip at q55 (heavy platform upload, e.g. WhatsApp)
- Second JPEG roundtrip at q80 (re-share recompression)
- Gaussian blur removed: not an internet-pipeline transform

Video suite (apply_video_robustness_augmentations):
- H.264 encode/decode roundtrip via cv2 avc1 codec
- CRF maps to cv2 quality scale (CRF 23 → q85, CRF 40 → q33)
- Mirrors FaceForensics++ c23/c40 evaluation protocol
- Falls back to per-frame JPEG if codec unavailable

Metrics added to results when n_aug_per_dataset > 0:
- aug_sn34_score: sn34 computed on augmented samples
- augmentation_robustness: aug_sn34 / base_sn34 ratio
- aug_paired_samples / aug_mean_prob_degradation / aug_p95_prob_degradation (per-sample degradation via sample_id join across base/aug passes)

New CLI flags: --n-aug-per-dataset N, --aug-weight W, --robustness-crf CRF

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: add augmentation robustness pass to image and video benchmarks
Extends the augmentation robustness pass with real-world distribution-channel transforms missing from the initial suite:
- WebP roundtrip (compress_image_webp_pil) wired into the image chain between the two JPEG passes, modeling CDN/platform re-hosting (Facebook, Google). WebP's VP8 intra coding leaves a different artifact family than JPEG DCT, so detectors that survive repeated JPEG can still fail it. Opt out with webp_quality=None.
- JPEG chroma subsampling pinned to 4:2:0 (subsampling=2) so the pass actually applies the high-frequency DCT-fingerprint destruction its motivation cites, instead of letting Pillow vary subsampling by quality/version.
- Faithful FaceForensics++ CRF reproduction via the ffmpeg CLI (_h264_roundtrip_ffmpeg, real -crf). Encode order is now ffmpeg -> cv2 avc1 -> per-frame JPEG, with the path used recorded in params["method"]. cv2's VIDEOWRITER_PROP_QUALITY does not map to CRF and is ignored on many builds, so the previous path did not reliably produce the requested CRF.
- Video resolution ladder via scale_factor (default 1.0 = unchanged FF++ behavior; <1.0 downscales frames before encode to model platform transcodes).

Verified end-to-end with numpy/opencv/Pillow/ffmpeg: WebP alters pixels, the full image chain is deterministic per seed, webp opt-out works, and the video path engages the ffmpeg -crf encoder with correct output shape.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat: WebP re-host, faithful ffmpeg CRF, and video resolution ladder for robustness aug
libx264 with yuv420p rejects odd width/height, so _h264_roundtrip_ffmpeg exited non-zero on any frame with an odd dimension and the pass silently fell back to cv2/per-frame JPEG, while params["method"] could still imply a CRF roundtrip that never happened. The resolution ladder's max(2, round(...)) did not guarantee evenness either.

Trim a trailing row/column to even dimensions once before encoding, covering both the ffmpeg -crf and cv2 avc1 paths (both are H.264/4:2:0). At most one pixel per axis is dropped, and the frame is resized to target afterward, so the effect on the benchmark is nil. Guarded so a degenerate <2px axis is left untouched for the fallback to handle.

Verified: 97x131 input (both dims odd), odd input with scale_factor=0.5, and an even-dim regression case all now engage method="ffmpeg_crf".

Reported by Cursor bot review on #113.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix: force even dimensions before H.264 encode (odd sizes broke ffmpeg CRF)
robustness_jpeg_quality, robustness_scale_factor, robustness_crf were user-configurable CLI args, but benchmark scores must be comparable across runs: suite params should be pinned constants, not flags.

Removed from: cli.py, run_benchmark, execute_benchmark, run_image_benchmark, run_video_benchmark, BenchmarkRunConfig, PrefetchPipeline, VideoPrefetchPipeline.

Kept: --n-aug-per-dataset and --aug-weight (scope and scoring controls are legitimate user choices).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Seed: aug pass transform seed defaults to 42 when --seed is not
provided, ensuring augmentation results are reproducible even without
an explicit seed. DatasetIterator seed is unchanged so sample
selection (and sample_id join) remains consistent.
2. Cache: --aug-cache-dir <path> writes post-augmentation arrays to
disk on first run ({sample_id[:2]}/{sample_id}_v1.npy, sharded for
filesystem scalability). Subsequent runs load from cache instead of
re-running transforms. Atomic write via temp-file + os.replace.
Cache key uses build_sample_id() (same hash as the parquet join key)
and embeds a version suffix (img_v1 / vid_v1) for auto-invalidation
on suite changes. Opt-in only — no change when flag is omitted.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: aug CLI cleanup, deterministic seed, and optional array cache
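The cache layout and atomic write described in the commit above can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the PR's code: `cache_path` and `atomic_write_bytes` are hypothetical names, the payload is treated as already-serialized bytes (the PR stores numpy arrays), and `sample_id` is assumed to be the hex string produced by `build_sample_id()`.

```python
import os
import tempfile
from pathlib import Path

CACHE_VERSION = "v1"  # version suffix; bumping it makes old entries unreachable


def cache_path(cache_dir, sample_id):
    """Shard by the first two characters of sample_id: {id[:2]}/{id}_v1.npy."""
    return Path(cache_dir) / sample_id[:2] / f"{sample_id}_{CACHE_VERSION}.npy"


def atomic_write_bytes(path, payload):
    """Write to a temp file in the target directory, then os.replace into place.

    Readers see either the old file or the complete new one, never a
    partially written file.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temp file next to its destination is the design choice that matters here: a rename across filesystems is not atomic, so a temp file in the system temp directory would not give the same guarantee.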
feat: augmentation robustness pass for image and video benchmarks
Summary
Adds a second evaluation pass to `gasbench run` that re-evaluates the same samples under a fixed suite of real-world image/video degradations, producing a dedicated `aug_sn34_score`, a per-sample probability-degradation signal, and an `augmentation_robustness` ratio alongside the existing base score.
The motivation is that deepfake detectors deployed in the wild never see pristine generated content — every image or video passes through at least one lossy encode/decode cycle before a human or system can flag it. A model that achieves a high base sn34_score but collapses under JPEG recompression or platform re-encoding is not actually useful; this PR makes that failure mode visible and scoreable.
Background and motivation
Several detection robustness studies establish that codec and JPEG compression are by far the dominant real-world degradation sources for forensic detectors:
FaceForensics++ (Rössler et al., ICCV 2019) introduced the canonical evaluation protocol: train and test at two H.264 compression levels — CRF 23 ("c23", light, YouTube-tier) and CRF 40 ("c40", heavy, WhatsApp/Messenger-tier). Nearly every subsequent deepfake detection paper reports results on these two tracks because platform re-encoding is the single most common reason detector performance degrades between the lab and production.
DeeperForensics-1.0 (Jiang et al., CVPR 2020) extended this with a seven-distortion robustness benchmark and showed that blur and noise degrade detector performance far less than codec/JPEG artifacts, which motivates our decision to drop Gaussian blur from the suite.
Social media pipeline studies (Verdoliva et al., 2020; Gragnaniello et al., 2021) showed that Instagram, WhatsApp, and Twitter each apply chroma-subsampled JPEG at platform-specific quality settings (roughly q55–85) and that this single operation destroys the high-frequency DCT-domain fingerprints most detectors rely on.
What changes
New augmentation functions (`processing/transforms.py`)
`apply_robustness_augmentations(image)`
A fixed pipeline grounded in the social media compression literature. All parameters are pinned constants, not CLI args, to ensure `aug_sn34_score` values are comparable across runs.
Chroma subsampling is pinned to 4:2:0 throughout (matching platform behavior).
Gaussian blur was removed because it models optical/focus degradation at capture time, not anything in a social media distribution pipeline, and because DeeperForensics empirically showed it has lower impact on detector performance than compression artifacts.
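For orientation, the image chain can be sketched with Pillow along these lines. This is an illustrative sketch, not the code in `processing/transforms.py`: the function names are hypothetical, the step order and the 0.5x / q55 / q80 values are taken from the commit messages, and the WebP quality and the bicubic resampling filter are placeholders assumed here because the PR text does not state them.

```python
import io

from PIL import Image


def _roundtrip(img, fmt, **save_kwargs):
    """Encode to an in-memory buffer in the given format and decode it again."""
    buf = io.BytesIO()
    img.save(buf, format=fmt, **save_kwargs)
    buf.seek(0)
    out = Image.open(buf)
    out.load()  # force decoding before the buffer goes out of scope
    return out.convert("RGB")


def robustness_chain_sketch(img, scale=0.5, jpeg_q1=55, webp_quality=80, jpeg_q2=80):
    """Downscale/upscale, JPEG, optional WebP, JPEG again.

    webp_quality=80 and the bicubic filter are assumptions made for this sketch.
    """
    img = img.convert("RGB")
    w, h = img.size
    small = img.resize((max(1, round(w * scale)), max(1, round(h * scale))), Image.BICUBIC)
    img = small.resize((w, h), Image.BICUBIC)
    # subsampling=2 asks Pillow's JPEG encoder for 4:2:0 chroma subsampling.
    img = _roundtrip(img, "JPEG", quality=jpeg_q1, subsampling=2)
    if webp_quality is not None:  # None skips the WebP re-host step
        img = _roundtrip(img, "WEBP", quality=webp_quality)
    return _roundtrip(img, "JPEG", quality=jpeg_q2, subsampling=2)
```

The output keeps the input resolution, so it can be fed to the same inference path as the clean sample.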
`apply_video_robustness_augmentations(video)`
Implements the FaceForensics++ evaluation protocol: a full H.264 encode/decode roundtrip at CRF 23 (YouTube-tier). Cascade of backends:
- ffmpeg CLI: real `-crf` flag, forces even W/H for libx264 yuv420p compatibility
- cv2 avc1
- per-frame JPEG
Suite params (CRF, FPS, scale factor) are pinned constants in code for benchmark integrity.
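The even-dimension rule and the CRF encode can be illustrated as below. This is a sketch under stated assumptions rather than `_h264_roundtrip_ffmpeg` itself: `even_dims` and `build_ffmpeg_crf_cmd` are hypothetical helpers, the command is only constructed and never executed, and the trim is expressed as an ffmpeg `crop` filter for brevity, whereas the PR trims the frames before encoding.

```python
def even_dims(width, height):
    """Drop at most one trailing pixel per axis so both dimensions are even.

    Axes smaller than 2 px are returned unchanged and left to the fallback path.
    """
    def _even(n):
        return n - (n % 2) if n >= 2 else n

    return _even(width), _even(height)


def build_ffmpeg_crf_cmd(src, dst, width, height, crf=23):
    """Build an ffmpeg argument list for a libx264 encode at a real CRF value."""
    w, h = even_dims(width, height)
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"crop={w}:{h}:0:0",  # keep the top-left w x h region
        "-c:v", "libx264",
        "-crf", str(crf),
        "-pix_fmt", "yuv420p",       # 4:2:0, which is why odd sizes are rejected
        dst,
    ]
```

For the 97x131 case mentioned in the fix commit, `even_dims` yields 96x130, which libx264 with yuv420p accepts.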
Second evaluation pass (`image_bench.py`, `video_bench.py`)
When `--n-aug-per-dataset N` is set, a second pass iterates the same `N` samples per dataset (same seed → same files → same `sample_id`s) through the robustness transform before inference. Aug-pass rows are written to the same tracker and parquet with `aug_pass=True`.
The two passes are fully decoupled: base metrics, per-dataset accuracy tables, and the existing sn34_score are computed on base-pass rows only. Aug rows are filtered out before any existing aggregation path.
Deterministic aug seed: the aug transform seed defaults to `42` when `--seed` is not provided, ensuring reproducible results without requiring an explicit seed. DatasetIterator seed is unchanged so sample selection and the `sample_id` join remain consistent.
Metrics (`recording.py`)
When aug rows are present, `compute_metrics_from_df` computes:
- `aug_sn34_score`: sn34 computed on augmented samples
- `augmentation_robustness`: `aug_sn34 / base_sn34`; ratio close to 1.0 = robust
- `aug_paired_samples`: samples paired across the base and aug passes by `sample_id`
- `aug_mean_prob_degradation`: mean degradation in `P(correct class)` per paired sample
- `aug_p95_prob_degradation`: 95th percentile of the same per-sample degradation
Per-sample degradation is computed by joining on `sample_id` (SHA-256 hash of file-identity fields), so the same file evaluated clean and augmented is always paired correctly.
CLI flags
Suite-spec parameters (JPEG quality, scale factor, WebP quality, CRF) are intentionally not exposed as CLI flags; they are pinned constants in code to ensure `aug_sn34_score` values are comparable across runs.
Example (image benchmark with aug robustness pass, 100 samples per dataset, results cached):
Aug array cache (`--aug-cache-dir`)
First run: augmented arrays written to `{aug_cache_dir}/{sample_id[:2]}/{sample_id}_v1.npy` (2-char prefix sharding for filesystem scalability). Subsequent runs load from disk: no ffmpeg, no PIL, no recompute. Atomic write via temp file + `os.replace`, safe under concurrent worker threads. Version suffix (`img_v1`/`vid_v1`) auto-invalidates on suite changes.
What was explicitly excluded
References
🤖 Generated with Claude Code