This document explains why Parley looks the way it does, by walking through the systems it positions itself against. It is grounded in a literature survey (VLA policies, speech LLMs, robot benchmarks, ASR robustness, evaluation toolkit architecture). Every link below is a real system the design was checked against.
No single piece of software today serves as a spoken-instruction VLA benchmark. The mainstream robot benchmarks condition policies on ground-truth text strings; none of them natively exposes a spoken-instruction channel or audio perturbations:
- LIBERO — Spatial/Object/Goal/Long splits via BDDL predicates. Now the de facto VLA leaderboard.
- CALVIN — five-subtask chains across env splits A/B/C/D.
- RLBench, Meta-World, ManiSkill2/3, robomimic, SimplerEnv, VLABench, VIMA-Bench.
Meanwhile the speech-side benchmarks (AIR-Bench, Dynamic-SUPERB, SLURP) measure transcript / intent quality but never reach for an end-to-end robot success metric.
Parley is the missing piece: an evaluation harness that runs the
full chain — audio → ASR → grounding → VLA policy → env → success
— and attributes task-success degradation to speech-side
perturbations. The toolkit is small and synthetic on purpose: the
self-contained env + codec ASR keep CI honest while the protocol layer
lets adapters for real systems (Whisper / Qwen-Audio / OpenVLA / Octo /
π0 / LIBERO / ManiSkill) plug in cleanly.
The 2024–2026 VLA landscape converges on a pretrained VLM coupled to an "action expert" that emits continuous action chunks via flow-matching or diffusion:
| System | Action representation | Reference |
|---|---|---|
| RT-1, RT-2 | 256-bin discrete tokens | RT-1, RT-2 |
| Octo | T5 + diffusion head | octo-models/octo |
| OpenVLA | Llama-2 + DINOv2/SigLIP, autoregressive | openvla/openvla |
| OpenVLA-OFT | parallel decoding, L1 regression, ~26× throughput | openvla-oft |
| π₀ / π₀-FAST / π₀.5 | PaliGemma + 300M expert, 50-step chunks, flow-matching | Physical-Intelligence/openpi |
| GR00T N1 | Eagle-2 System 2 + DiT System 1, K=16 | arXiv:2503.14734 |
| SmolVLA | ~450M, asynchronous predict/execute | HF blog |
| RDT-1B | diffusion transformer | RoboticsDiffusionTransformer |
The axes that vary across systems are chunk length K, action
representation (discrete / FAST / continuous / denoised), camera
count, and proprioception. Parley's VLAPolicy Protocol is shaped
to abstract over all of them: a policy sees an Observation (frame +
transcript + grounding), returns an Action with an explicit
action_space tag, and the env decodes it. Chunking is left to the
policy, which can cache its output across act() calls; the engine
doesn't impose a single-step assumption.
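
The policy contract described above can be sketched roughly as follows. This is a minimal illustration, not Parley's actual code: the field names on `Observation` and `Action`, the `ChunkedPolicy` class, and its `_predict_chunk` helper are all assumptions made for the example. It shows how a policy might cache a K-step chunk across `act()` calls while the caller still steps one action at a time.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass(frozen=True)
class Observation:
    # Hypothetical field names; the text only says "frame + transcript + grounding".
    frame: Sequence[float]
    transcript: str
    grounding: dict


@dataclass(frozen=True)
class Action:
    values: tuple
    action_space: str  # explicit tag so the env knows how to decode the values


class VLAPolicy(Protocol):
    def act(self, obs: Observation) -> Action: ...


class ChunkedPolicy:
    """Toy policy that predicts chunk_len actions at once and replays them."""

    def __init__(self, chunk_len: int = 4) -> None:
        self.chunk_len = chunk_len
        self._queue: List[Action] = []
        self.inference_calls = 0  # counts how often the "model" actually runs

    def _predict_chunk(self, obs: Observation) -> List[Action]:
        # Stand-in for a real model forward pass (hypothetical).
        self.inference_calls += 1
        return [
            Action(values=(float(i),), action_space="continuous")
            for i in range(self.chunk_len)
        ]

    def act(self, obs: Observation) -> Action:
        # Refill the cache only when the previous chunk is exhausted.
        if not self._queue:
            self._queue = self._predict_chunk(obs)
        return self._queue.pop(0)


policy: VLAPolicy = ChunkedPolicy(chunk_len=4)
obs = Observation(frame=(0.0,), transcript="pick up the red cube", grounding={})
actions = [policy.act(obs) for _ in range(8)]
```

Eight `act()` calls trigger only two chunk predictions here, which is the behaviour a single-step engine loop needs in order to stay agnostic to K.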
Three tiers of frontend exist today; Parley's SpeechFrontend Protocol
covers all of them with the same (Audio) → Transcript shape:
- Pure ASR: Whisper large-v3 / Whisper turbo, Distil-Whisper, SeamlessM4T v2.
- Audio-LLMs that follow spoken instructions: Qwen2-Audio, SALMONN, SpeechGPT.
- Full-duplex S2S: Moshi (Mimi 12.5 Hz codec, ~200 ms latency).
Spoken instruction-following is benchmarked on AIR-Bench, Dynamic-SUPERB Phase-2, and SLU sets like SLURP.
The bundled CodecSpeechFrontend is not meant to compete with these
— it's a deterministic round-trippable substitute that makes WER
measurable in CI without GPUs or model downloads. The on-disk synth
audio is the codec encoding of the instruction text; the codec frontend
decodes it back. Clean audio round-trips perfectly; perturbed audio
produces realistic word substitutions and insertions. Real Whisper is
provided as an opt-in extra (pip install 'parley-bench[whisper]') for
cases where measuring Whisper itself is the point.
The literature agrees on roughly these axes; Parley implements one representative metric per axis (with room to add more):
- WER, CER on normalized text. Standard practice computes them via jiwer with the Whisper normalizer; Parley uses a direct DP edit-distance to keep zero deps but exposes sub/ins/del counts in the same shape.
- Keyword recall for "content words" (verbs, colors, shapes, directions) — closer to what downstream grounding cares about than raw WER.
- Slot F1 + exact match over `(verb, target, destination)`. Mirrors SLURP-style SLU scoring.
- Success rate (LIBERO/CALVIN-style binary predicate, averaged).
- ActionMSE, DTW distance against an oracle reference action sequence when one exists. Useful for offline comparisons à la robomimic.
- Latency p50/p95/p99, RTF. RTF is total latency ÷ audio duration, matching the convention in ASR papers and toolkits.
- Δ vs clean per perturbation, mean and max degradation across the perturbation panel.
- Sensitivity index ΔTask/ΔWER — interpretable as: a 1-point increase in WER costs N points of task success.
- Worst-group report — the lowest cell of a target metric across the grouping axis (currently perturbation; the shape supports future accent / L1 strata).
- Percentile bootstrap CI for each cell (lm-eval-harness style).
- Paired-bootstrap p-value for pipeline-vs-pipeline significance.
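
As an illustration of the dependency-free WER computation mentioned in the first bullet, here is a minimal sketch of a dynamic-programming edit distance that also tracks substitution, insertion, and deletion counts. The names `EditCounts` and `edit_counts` are hypothetical, and normalization is reduced to lowercasing and whitespace splitting; it does not reproduce the Whisper normalizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditCounts:
    substitutions: int
    insertions: int
    deletions: int
    ref_len: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.insertions + self.deletions
        if self.ref_len == 0:
            # Convention chosen for this sketch: empty reference scores 0 or 1.
            return float(errors > 0)
        return errors / self.ref_len


def edit_counts(reference: str, hypothesis: str) -> EditCounts:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = (total, subs, ins, dels) for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)  # drop every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)  # every hypothesis word is an insertion
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match costs nothing
                continue
            t, s, n, d = dp[i - 1][j - 1]
            sub = (t + 1, s + 1, n, d)
            t, s, n, d = dp[i][j - 1]
            ins = (t + 1, s, n + 1, d)
            t, s, n, d = dp[i - 1][j]
            dele = (t + 1, s, n, d + 1)
            # Tuples compare on total cost first, so min() keeps an optimal path.
            dp[i][j] = min(sub, ins, dele)
    _, subs, ins_count, dels = dp[-1][-1]
    return EditCounts(subs, ins_count, dels, len(ref))


repeated = edit_counts("pick up the red cube", "pick up the the red cube")
misheard = edit_counts("put the red cube left", "put the bread cube")
```

In this sketch the repeated "the" counts as one insertion over five reference words, and the second pair counts one substitution plus one deletion.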
The perturbation panel draws on CHiME / MUSAN / DEMAND noise-robustness practice and on the spoken-language-variation literature. Parley ships:
| Perturbation | Why it matters |
|---|---|
| `additive_noise` (SNR-calibrated) | canonical robustness ladder; CHiME-style |
| `gain`, `clip` | cheap mic / quiet-talker proxies |
| `mu_law` | G.711 telephony codec round-trip |
| `band_limit` | ITU-T G.712 narrowband passband |
| `spectral_decimate` | poor-man's lossy-codec degradation |
| `reverb` | smears adjacent codec symbols, the failure mode rooms cause |
| `packet_loss` | VoIP-style dropped chunks |
| `time_stretch`, `pitch_shift` | speaking-rate / pitch variance |

| Perturbation | Why it matters |
|---|---|
| `disfluency` | self-repetitions ("the the red cube") |
| `filler` | "uhm", "uh" insertions; speech-natural and rarely benched |
| `accent_subst` | configurable lexical-style substitution |
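
To make two of the audio rows above concrete, the following sketch shows one way an SNR-calibrated additive-noise perturbation and a μ-law round-trip could be written with the standard library only. The function names mirror the table entries, but the signatures and internals are assumptions rather than Parley's implementation, and the μ-law step uses the continuous companding formula with uniform 8-bit quantization, which approximates G.711 without being bit-exact.

```python
import math
import random
from typing import List, Sequence


def power(x: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def additive_noise(signal: Sequence[float], snr_db: float, seed: int = 0) -> List[float]:
    """Add white Gaussian noise scaled so the realized SNR equals snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def mu_law(signal: Sequence[float], mu: int = 255, levels: int = 256) -> List[float]:
    """Compress, quantize to `levels` steps, and expand (approximate G.711)."""
    log_mu = math.log1p(mu)
    out = []
    for x in signal:
        x = max(-1.0, min(1.0, x))  # companding assumes samples in [-1, 1]
        y = math.copysign(math.log1p(mu * abs(x)) / log_mu, x)
        q = round((y + 1) / 2 * (levels - 1))  # uniform quantization
        y_hat = 2 * q / (levels - 1) - 1
        out.append(math.copysign(math.expm1(abs(y_hat) * log_mu) / mu, y_hat))
    return out


# A 100 ms, 440 Hz tone at 16 kHz as a stand-in for synthesized instruction audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noisy = additive_noise(tone, snr_db=10.0, seed=0)
residual = [n - s for n, s in zip(noisy, tone)]
realized_snr_db = 10 * math.log10(power(tone) / power(residual))
companded = mu_law(tone)
```

Scaling the noise against its own measured power, rather than its nominal variance, is what makes the ladder calibrated: `realized_snr_db` matches the requested value up to floating-point error.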
What's deliberately not implemented (yet):
- Real RIRs (e.g. OpenSLR SLR28) — would add a download dep and aren't necessary for the synthetic env.
- Real noise corpora (MUSAN, DEMAND). Again, an extra.
- Per-speaker / accent strata (e.g. Common Voice, L2-ARCTIC, EdAcc). The `worst_group_report` surface accommodates these once a real dataset is plugged in.
- Scenario / Adapter / Metric / RunSpec from HELM — separation between task definition, prompting strategy, and scoring.
- Decorator + name-string registry following lm-evaluation-harness and Inspect. Plugins register themselves; configs reference them by name.
- Versioned result schema: MTEB does this with `.v2` task suffixes; Parley does it with `REPORT_SCHEMA_VERSION` on the JSON dump.
- Per-instance JSONL/JSON persistence, then aggregate. Mirrors HELM's helm-server and Inspect's eval-log viewer.
- Bootstrap percentile CI following lm-eval's recipe: resample with replacement, deterministic per `(metric, seed)`.
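
Two of the patterns in this list, the name-string registry and the seeded percentile bootstrap, can be sketched together as below. Everything here is illustrative: `METRICS`, `register_metric`, and `bootstrap_ci` are hypothetical names, and the sketch is a generic percentile bootstrap rather than a copy of any toolkit's recipe. The random stream is derived from a CRC of `metric:seed` because Python's built-in `hash()` on strings is salted per process and would not be reproducible.

```python
import random
import zlib
from typing import Callable, Dict, Sequence, Tuple

# Name -> metric function; configs would refer to metrics by these names.
METRICS: Dict[str, Callable[[Sequence[float]], float]] = {}


def register_metric(name: str):
    """Decorator that registers a metric function under a string name."""
    def decorator(fn):
        if name in METRICS:
            raise ValueError(f"metric {name!r} already registered")
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("success_rate")
def success_rate(outcomes: Sequence[float]) -> float:
    return sum(outcomes) / len(outcomes)


def bootstrap_ci(
    metric: str,
    values: Sequence[float],
    seed: int = 0,
    n_resamples: int = 1000,
    alpha: float = 0.05,
) -> Tuple[float, float]:
    """Percentile bootstrap CI, deterministic for a given (metric, seed)."""
    fn = METRICS[metric]
    rng = random.Random(zlib.crc32(f"{metric}:{seed}".encode()))
    stats = sorted(
        fn(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * (n_resamples - 1))]
    hi = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


episodes = [1.0] * 70 + [0.0] * 30  # 70 successes out of 100 episodes
ci = bootstrap_ci("success_rate", episodes, seed=0)
```

Calling `bootstrap_ci` again with the same metric name and seed returns the identical interval, while changing either one selects a different resampling stream.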
Worth being honest about:
- The synthetic env is not a physics simulator. It's a controllable testbed. Numbers it produces are calibrated against itself, not against a real robot.
- The codec ASR is robust to surprisingly low SNRs (WER does not spike until roughly −20 dB) because its log-spaced tone vocabulary has high bin SNR even under heavy noise. Real ASR breaks earlier; the Whisper adapter measures that earlier breakage.
- Action chunking, off-policy IL evaluation, and real RIR convolution are deliberately out of scope for v0. The protocol surface accommodates them; the implementations don't yet exist.