A testbed for evaluating the privacy and robustness of Privacy-Preserving Federated Learning (PPFL). FLTest gives software-defined control and visibility into FL testing: run the same experiment across multiple FL frameworks, inject attacks and defenses as composable hooks, and apply differential and metamorphic tests plus a pitfall checker — all from a single YAML config.
FLTest is the NSF PDaSP (Track 3) FLTEST project — A Testbed for Enhancing Privacy and Robustness of Federated Learning Systems — supported by the U.S. National Science Foundation (Award #2452817-19).
Principal Investigators: Ali Anwar (University of Minnesota) · Muhammad Ali Gulzar (Virginia Tech) · Fatima Anwar (University of Massachusetts Amherst)
A survey of 50 FL robustness papers found wildly inconsistent setups (MNIST-only, IID-only, naive attacks, no personalized metrics), which inflates privacy/robustness claims. FLTest makes a rigorous setup the default and checks for the common pitfalls.
- One abstraction, many backends. Every FL framework implements a single run_simulation() adapter. Built in: a dependency-light reference PyTorch FedAvg oracle, Flower, and NVFlare (optional extra).
- Everything is a hook. Attacks, defenses, and metric listeners are hook plugins that share one HookContext, so a plugin written once runs across every backend and multiple plugins compose on a single run.
- Attacks: label_flip, sign_flip, gaussian, backdoor (with attack-success-rate), dlg (gradient-inversion privacy attack).
- Defenses (PPFL): gradient_noise (DP-style clip+noise), norm_clip, and robust aggregation krum / trimmed_mean / median.
- Differential testing: the same config across frameworks must agree within tolerance (cross-framework parity), or the same spec run twice must be identical (determinism).
- Metamorphic testing: clients_scale (N→2N), rounds_monotonic, attack_strength, dp_noise relations.
- Pitfall checker + recommender: flags the six FL-evaluation pitfalls from the project and emits copy-pasteable counter-experiments.
- Config fuzzer: any list-valued knob (e.g. dataset: [mnist, cifar10]) is expanded into a grid of runs.
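To make the hook idea concrete, here is a minimal, self-contained sketch of how an attack hook and a defense hook might compose on one run, followed by mean versus coordinate-wise median aggregation. Every name in it (`ToyHookContext`, `sign_flip_attack`, `norm_clip_defense`, `toy_round`) is hypothetical and chosen for illustration; it is not FLTest's actual `HookContext` or plugin API, and model updates are plain Python lists rather than tensors.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Callable, Iterable, List, Tuple


@dataclass
class ToyHookContext:
    """Hypothetical stand-in for a shared hook context (not FLTest's HookContext)."""
    client_id: int
    update: List[float]                            # this client's model update
    log: List[str] = field(default_factory=list)   # shared event log


Hook = Callable[[ToyHookContext], None]


def sign_flip_attack(target_clients: Iterable[int]) -> Hook:
    """Attack hook: negate the update of the targeted clients."""
    targets = set(target_clients)

    def hook(ctx: ToyHookContext) -> None:
        if ctx.client_id in targets:
            ctx.update = [-u for u in ctx.update]
            ctx.log.append(f"sign_flip on client {ctx.client_id}")
    return hook


def norm_clip_defense(max_norm: float) -> Hook:
    """Defense hook: rescale any update whose L2 norm exceeds max_norm."""
    def hook(ctx: ToyHookContext) -> None:
        norm = sum(u * u for u in ctx.update) ** 0.5
        if norm > max_norm:
            ctx.update = [u * max_norm / norm for u in ctx.update]
    return hook


def toy_round(honest_update: List[float], num_clients: int,
              hooks: List[Hook]) -> Tuple[List[List[float]], List[str]]:
    """Run every hook, in order, on each client's update after 'training'."""
    log: List[str] = []
    updates = []
    for cid in range(num_clients):
        ctx = ToyHookContext(cid, list(honest_update), log)
        for hook in hooks:          # all hooks see and mutate the same context
            hook(ctx)
        updates.append(ctx.update)
    return updates, log


def mean_agg(updates: List[List[float]]) -> List[float]:
    return [sum(col) / len(col) for col in zip(*updates)]


def median_agg(updates: List[List[float]]) -> List[float]:
    return [median(col) for col in zip(*updates)]


updates, log = toy_round([3.0, 4.0], num_clients=5,
                         hooks=[sign_flip_attack([0]), norm_clip_defense(1.0)])
```

In this toy run, one of five clients flips its update. The plain mean is pulled toward the attacker, while the coordinate-wise median stays at the honest clipped value, which is the intuition behind pairing an attack hook with a robust aggregator in a single config.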
conda env create -f environment.yml # creates env "fltest" (Python 3.11)
conda activate fltest
pip install -e ".[dev]" # core (reference + Flower) + test tooling
pip install -e ".[nvflare]" # optional NVFlare backend (needs Python <=3.11)

CPU is the default and is deterministic; device: mps (Apple Silicon) or device: cuda are selectable for speed (with the usual GPU non-determinism caveat).
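The difference between a deterministic CPU run and a GPU run with small numeric drift maps onto two kinds of comparison: exact equality and agreement within a tolerance. The sketch below illustrates both on a toy, seed-driven stand-in for an FL run. The function names (`toy_run`, `is_deterministic`, `within_tolerance`) are assumptions made for this example and are not part of FLTest.

```python
import random
from typing import List


def toy_run(seed: int, num_rounds: int = 5) -> List[float]:
    """Stand-in for an FL run: a per-round 'accuracy' trace driven only by the seed."""
    rng = random.Random(seed)   # private RNG, so no global state leaks between runs
    acc, trace = 0.1, []
    for _ in range(num_rounds):
        acc = min(1.0, acc + rng.uniform(0.0, 0.2))
        trace.append(acc)
    return trace


def is_deterministic(a: List[float], b: List[float]) -> bool:
    """Determinism check: two runs of the same spec must match exactly."""
    return a == b


def within_tolerance(a: List[float], b: List[float], tol: float) -> bool:
    """Looser check for runs where small numeric drift is expected."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


first, second = toy_run(seed=0), toy_run(seed=0)
same = is_deterministic(first, second)
```

Exact equality is only a reasonable expectation when every source of randomness is seeded and the backend computes reproducibly; otherwise a tolerance-based comparison is the safer default.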
fltest list # available frameworks/attacks/defenses/metrics
fltest run examples/configs/differential.yaml
fltest diff examples/configs/differential_3way.yaml # cross-framework parity
fltest metamorphic examples/configs/metamorphic.yaml
fltest pitfalls examples/configs/pitfalls_demo.yaml
fltest run examples/configs/attack_label_flip.yaml
fltest run examples/configs/dlg.yaml # privacy attack
fltest run examples/configs/defense_robust.yaml # backdoor vs median agg

Loadable hook files (slide-style), no config edits:
export FLTEST_HOOKS=examples/hooks/atk_dlg,examples/hooks/def_gradient_noise
fltest run examples/configs/dlg.yaml

name: my_eval
dataset: [mnist, cifar10] # a list => fuzzed into a grid
data_distribution: [iid, dirichlet]
model_name: LeNet
num_clients: 10
num_rounds: 10
attacks: [{name: backdoor, params: {infection_rate: 0.3}, target_clients: [0,1]}]
defenses: [{name: median}]
metrics: [accuracy, loss, per_client]
runs: # one per framework => cross-framework differential
- {framework: reference}
- {framework: flwr}
- {framework: nvflare}
testing:
  differential: {mode: cross_framework, metric: accuracy, tolerance: 0.05}
  metamorphic:
  - {relation: clients_scale, values: [10, 20], tolerance: 0.05}

pytest tests/ -q

See docs/ARCHITECTURE.md for the design and examples/configs/ for runnable configs.
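The example config lists two values each for dataset and data_distribution, so it fans out into a grid of runs. A minimal sketch of such an expansion, as a Cartesian product over list-valued keys, is below. The `expand_grid` helper and the `STRUCTURAL_KEYS` skip set are assumptions made for illustration: the source does not say how FLTest tells fuzzable knobs apart from keys that are lists by nature (attacks, defenses, metrics, runs).

```python
from itertools import product
from typing import Any, Dict, List

# Assumption: these keys hold lists by nature and are not fuzzed.
STRUCTURAL_KEYS = frozenset({"attacks", "defenses", "metrics", "runs"})


def expand_grid(config: Dict[str, Any],
                skip: frozenset = STRUCTURAL_KEYS) -> List[Dict[str, Any]]:
    """Expand every list-valued knob (outside `skip`) into one run per combination."""
    fuzzed = {k: v for k, v in config.items()
              if isinstance(v, list) and k not in skip}
    keys = list(fuzzed)
    grid = []
    # product() over zero iterables yields one empty combination,
    # so a config with no list-valued knobs expands to a single run.
    for combo in product(*(fuzzed[k] for k in keys)):
        run = dict(config)            # shallow copy of the base config
        run.update(zip(keys, combo))  # pin each fuzzed knob to one value
        grid.append(run)
    return grid


config = {
    "name": "my_eval",
    "dataset": ["mnist", "cifar10"],
    "data_distribution": ["iid", "dirichlet"],
    "num_clients": 10,
    "metrics": ["accuracy", "loss", "per_client"],
}
grid = expand_grid(config)
```

Here the two fuzzed knobs produce 2 × 2 = 4 concrete runs, each with a single dataset and a single data distribution, while metrics is passed through unchanged.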
- NVFlare runs each client in its own simulator process, so client-side hooks (attacks/defenses at before/after_client_train) do not apply to it; it is used for cross-framework differential parity of the vanilla FedAvg path. The reference and Flower backends support the full hook surface.
- DLG source: gradient (default) demonstrates raw-gradient invertibility. The source: shared_update mode (reconstruct from the uploaded update) is faithful only under single-step (FedSGD) training.
- A Dockerfile (CPU, Linux) is provided as a deliverable; the verified path is the conda env above.
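The single-step caveat for the shared_update mode can be seen from a short derivation. It assumes plain SGD without momentum or weight decay and a learning rate $\eta$ known to the attacker; the symbols ($w$, $\Delta$, $\mathcal{L}$, $B_k$, $K$) are introduced here for illustration only.

```latex
% One local step (FedSGD): the uploaded update is a scaled gradient.
w' = w - \eta \, \nabla \mathcal{L}(w; B)
\quad\Longrightarrow\quad
\Delta = w' - w = -\eta \, \nabla \mathcal{L}(w; B)
\quad\Longrightarrow\quad
\nabla \mathcal{L}(w; B) = -\frac{\Delta}{\eta}

% K > 1 local steps: the update is a sum of gradients at different iterates.
w_{k+1} = w_k - \eta \, \nabla \mathcal{L}(w_k; B_k), \quad w_0 = w
\quad\Longrightarrow\quad
\Delta = w_K - w_0 = -\eta \sum_{k=0}^{K-1} \nabla \mathcal{L}(w_k; B_k)
```

With one step, $-\Delta/\eta$ is exactly the gradient a gradient-inversion attack expects. With several steps it is a sum of gradients evaluated at different weights and batches, so treating it as a single gradient at $w_0$ is only an approximation.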
This material is based upon work supported by the U.S. National Science Foundation under the Privacy-preserving Data Sharing in Practice (PDaSP) program, Track 3 — Usable Tools and Testbeds for Confidential Data Sharing, Award #2452817-19. The PDaSP program is supported by the NSF together with its co-sponsors (U.S. Department of Transportation, Intel, NIST, and Broadcom). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or its co-sponsors. Program information: https://pdasp.net/projects/.

