Official repository for the ERR@HRI 3.0 Challenge at ICMI 2026 (5–9 October, Napoli, Italy). Participants will find everything needed to get started: dataset access, evaluation scripts, and the baseline implementation.
Challenge website: https://sites.google.com/view/errhri30/
Contact: Maria Teresa Parreira (mb2554@cornell.edu)
ERR@HRI 3.0 addresses the problem of multimodal error detection in human-robot interaction by providing two complementary datasets that span the temporal spectrum of error management — from anticipatory responses before failures occur to reactive responses during observed errors.
Building on ERR@HRI 2024 (ICMI'24) and ERR@HRI 2.0 (ACM MM'25), this third edition takes a key step forward: rather than pre-extracted features, participants receive raw, non-anonymized video data collected in naturalistic, crowdsourced settings. This enables end-to-end learning approaches and exposes models to the variability characteristic of real-world deployments (diverse lighting, camera angles, participant positioning, and environmental contexts).
- Advance multimodal error detection through end-to-end learning from raw visual data
- Promote development of robust models that handle naturalistic data variability
- Explore the relationship between anticipatory and reactive error responses
- Establish benchmarks for generalizable error detection across contexts and temporal stages
- Foster community development of open, reproducible error detection methods
ERR@HRI 3.0 provides two complementary datasets, both collected via crowdsourcing (Prolific) as raw audiovisual webcam recordings (.mp4). Neither dataset is anonymized.
| Characteristic | BAD Dataset | Bad Idea Dataset |
|---|---|---|
| Participants | 45 | 29 |
| Total recordings | 1,645 videos | 865 videos |
| Total duration | 25,527 s | 1,851 s |
| Frame rate | 30 fps | 30 fps |
| Temporal focus | During failure | Before failure |
| Error type | Observed failures | Anticipated outcomes |
Webcam recordings of 45 participants' spontaneous facial reactions while watching 46 stimulus videos of robot and human failure scenarios plus controls (average clip: ~15.5 s at 30 fps). The stimuli videos are also provided (46 .mp4 files).
Labels: Binary — 1 (Failure: human or robot failure scenario) vs. 0 (Control: no failure; typically 5 control videos per participant, as detailed in the per-participant table below). Labels are encoded in the video filename as QID<stimulus_id>_<label>.mp4.
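For example, the stimulus ID and label can be recovered directly from a filename. A minimal sketch (the exact character set of <stimulus_id> is an assumption; here it is taken to be anything without an underscore):

```python
import re
from pathlib import Path

# BAD dataset videos are named QID<stimulus_id>_<label>.mp4, e.g. QID12_1.mp4.
FILENAME_RE = re.compile(r"QID(?P<stimulus_id>[^_]+)_(?P<label>[01])\.mp4")

def parse_bad_filename(path: str) -> tuple[str, int]:
    """Return (stimulus_id, label) parsed from a BAD dataset video filename."""
    match = FILENAME_RE.fullmatch(Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected filename: {path}")
    return match["stimulus_id"], int(match["label"])
```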
| Split | Participants | Total videos | Label 0 (Control) | Label 1 (Failure) | Total Duration |
|---|---|---|---|---|---|
| trainval | 36 | 1,319 | 173 (13.1%) | 1,146 (86.9%) | 20,567 s |
| test | 9 | 326 | 43 (13.2%) | 283 (86.8%) | 4,960 s |
| Total | 45 | 1,645 | 216 (13.1%) | 1,429 (86.9%) | 25,527 s |
Detailed per-video statistics (participant, split, video name, label, and duration in seconds) are provided in baddataset_stats.csv. Video names follow the convention QID<stimulus_id>_<label>.mp4, where <stimulus_id> identifies the stimulus scenario and <label> is the binary label (0 = Control, 1 = Failure).

Summary per participant:
| Participant | Split | Videos | Label 0 (Control) | Label 1 (Failure) | Duration (s) |
|---|---|---|---|---|---|
| 1001 | trainval | 31 | 5 | 26 | 519.3 |
| 1115 | trainval | 37 | 5 | 32 | 595.9 |
| 1290 | trainval | 43 | 5 | 38 | 597.3 |
| 1402 | trainval | 38 | 5 | 33 | 605.4 |
| 1447 | trainval | 45 | 5 | 40 | 641.2 |
| 1674 | trainval | 32 | 5 | 27 | 520.9 |
| 2144 | trainval | 31 | 5 | 26 | 508.0 |
| 2206 | trainval | 36 | 5 | 31 | 600.0 |
| 2485 | trainval | 30 | 5 | 25 | 531.2 |
| 2568 | trainval | 37 | 5 | 32 | 595.1 |
| 2615 | trainval | 44 | 5 | 39 | 619.8 |
| 2733 | trainval | 37 | 5 | 32 | 605.3 |
| 2767 | trainval | 20 | 5 | 15 | 344.1 |
| 3367 | trainval | 27 | 5 | 22 | 414.8 |
| 3537 | trainval | 26 | 5 | 21 | 476.9 |
| 4195 | trainval | 39 | 4 | 35 | 630.8 |
| 4297 | trainval | 41 | 5 | 36 | 625.1 |
| 4456 | trainval | 45 | 5 | 40 | 589.6 |
| 4667 | trainval | 37 | 5 | 32 | 590.6 |
| 4806 | trainval | 22 | 3 | 19 | 365.6 |
| 4807 | trainval | 44 | 5 | 39 | 616.5 |
| 4928 | trainval | 30 | 5 | 25 | 517.1 |
| 5141 | trainval | 34 | 3 | 31 | 528.8 |
| 5230 | trainval | 38 | 5 | 33 | 611.8 |
| 5409 | trainval | 37 | 5 | 32 | 575.4 |
| 5729 | trainval | 45 | 5 | 40 | 621.5 |
| 6247 | trainval | 36 | 3 | 33 | 630.5 |
| 6682 | trainval | 37 | 5 | 32 | 622.3 |
| 6778 | trainval | 45 | 5 | 40 | 630.4 |
| 7446 | trainval | 41 | 5 | 36 | 616.6 |
| 7675 | trainval | 45 | 5 | 40 | 639.9 |
| 7694 | trainval | 37 | 5 | 32 | 590.6 |
| 7749 | trainval | 39 | 5 | 34 | 609.1 |
| 9009 | trainval | 38 | 5 | 33 | 621.2 |
| 9483 | trainval | 33 | 5 | 28 | 504.3 |
| 9922 | trainval | 42 | 5 | 37 | 654.0 |
| 1179 | test | 30 | 5 | 25 | 491.0 |
| 1769 | test | 40 | 5 | 35 | 630.6 |
| 2505 | test | 34 | 5 | 29 | 544.8 |
| 3272 | test | 45 | 5 | 40 | 627.8 |
| 4302 | test | 45 | 5 | 40 | 578.3 |
| 4891 | test | 37 | 5 | 32 | 590.0 |
| 6095 | test | 37 | 5 | 32 | 574.8 |
| 7209 | test | 21 | 3 | 18 | 315.6 |
| 8564 | test | 37 | 5 | 32 | 607.4 |
Webcam recordings of 29 participants' anticipatory reactions while predicting whether 30 action scenarios would end well or poorly — captured before outcomes were revealed (average clip: ~1.95 s). The stimuli videos are also provided (30 .mp4 files).
Labels: Binary — the participant's predicted outcome: 0 (good outcome expected, i.e., "Well") vs. 1 (bad outcome expected, i.e., "Poorly"). Labels reflect the participant's prediction, not the actual outcome.
| Split | Participants | Total videos | Label 0 (Well) | Label 1 (Poorly) |
|---|---|---|---|---|
| trainval | 23 | 685 | 360 (52.6%) | 325 (47.4%) |
| test | 6 | 180 | 103 (57.2%) | 77 (42.8%) |
| Total | 29 | 865 | 463 (53.5%) | 402 (46.5%) |
Note: participants 1483 and 9055 have fewer than 30 videos (28 and 27, respectively; the missing videos include q_20_main for 1483 and q_12_main for 9055). All other participants have 30 videos each.
Detailed per-video statistics (participant, split, video name, label, and frame count) are provided in badidea_dataset_stats.csv. Frame names follow the convention q_ID_main_LABEL_30fps_frameNNNN.png, where q_ID_main is the video identifier, LABEL is the binary label (0 or 1), and NNNN is the zero-padded frame index.
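Because the Bad Idea data ships as individual frames, a typical first step is grouping frames back into per-video sequences. A minimal sketch based on the naming convention above (the numeric ID inside q_ID_main is an assumption drawn from examples such as q_20_main):

```python
import re
from collections import defaultdict
from pathlib import Path

# Frames are named q_ID_main_LABEL_30fps_frameNNNN.png (see above).
FRAME_RE = re.compile(r"(?P<video>q_\d+_main)_(?P<label>[01])_30fps_frame(?P<idx>\d+)\.png")

def group_frames(frame_dir: str) -> dict[str, list[Path]]:
    """Group frame files by video identifier, ordered by frame index."""
    groups: defaultdict[str, list[tuple[int, Path]]] = defaultdict(list)
    for path in Path(frame_dir).glob("*.png"):
        match = FRAME_RE.fullmatch(path.name)
        if match:
            groups[match["video"]].append((int(match["idx"]), path))
    return {vid: [p for _, p in sorted(pairs)] for vid, pairs in groups.items()}
```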
Summary per participant:
| Participant | Split | Videos | Label 0 | Label 1 |
|---|---|---|---|---|
| 1048 | trainval | 30 | 19 | 11 |
| 1251 | trainval | 30 | 18 | 12 |
| 1483 | trainval | 28 | 15 | 13 |
| 1676 | trainval | 30 | 21 | 9 |
| 2103 | trainval | 30 | 11 | 19 |
| 2698 | trainval | 30 | 16 | 14 |
| 2946 | trainval | 30 | 14 | 16 |
| 3157 | trainval | 30 | 12 | 18 |
| 3203 | trainval | 30 | 15 | 15 |
| 3339 | trainval | 30 | 17 | 13 |
| 3882 | trainval | 30 | 15 | 15 |
| 5099 | trainval | 30 | 16 | 14 |
| 5124 | trainval | 30 | 19 | 11 |
| 5233 | trainval | 30 | 12 | 18 |
| 5310 | trainval | 30 | 12 | 18 |
| 7136 | trainval | 30 | 17 | 13 |
| 7797 | trainval | 30 | 16 | 14 |
| 8184 | trainval | 30 | 21 | 9 |
| 8758 | trainval | 30 | 12 | 18 |
| 9055 | trainval | 27 | 13 | 14 |
| 9385 | trainval | 30 | 16 | 14 |
| 9777 | trainval | 30 | 16 | 14 |
| 9941 | trainval | 30 | 17 | 13 |
| 2313 | test | 30 | 17 | 13 |
| 5009 | test | 30 | 16 | 14 |
| 6488 | test | 30 | 17 | 13 |
| 7782 | test | 30 | 24 | 6 |
| 8436 | test | 30 | 15 | 15 |
| 8786 | test | 30 | 14 | 16 |
- Track 1 (BAD dataset): Binary classification of whether the participant is observing a failure vs. a control scenario.
- Track 2 (Bad Idea dataset): Binary classification of participants' predicted outcome (well/poorly) from their anticipatory facial behavior.
- Cross-dataset generalization: Participants are encouraged to explore transfer learning across datasets (e.g., train on BAD, test on Bad Idea). This track is evaluated separately and carries its own award.
Models are evaluated on multiple metrics:
- Offline: F1-score (primary), Accuracy, AUC
- Windowed predictions: Fixed-length sliding windows; a video is correctly classified if any window predicts the correct label
- Earliest Detection Time: For correctly classified videos, percentage of video elapsed before first correct prediction (lower is better)
- False Negative Rate per video: For error/bad-outcome videos, the number of windows that fail to predict the positive label (illustrated in the sketch below)
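The official evaluation scripts (eval.py, released per the timeline below) are authoritative; this sketch only illustrates how the windowed metrics fit together for a single video, assuming one binary prediction per fixed-length window:

```python
import numpy as np

def windowed_metrics(window_preds: np.ndarray, true_label: int) -> dict:
    """Illustrative per-video metrics from per-window binary predictions.

    window_preds: one 0/1 prediction per fixed-length window, in temporal order.
    true_label:   the video-level ground-truth label.
    """
    hits = np.flatnonzero(window_preds == true_label)
    correct = hits.size > 0  # a video counts as correct if any window is correct
    # Earliest Detection Time: fraction of the video elapsed before the first
    # correct prediction (lower is better); only defined for correct videos.
    edt = hits[0] / window_preds.size if correct else None
    # Per-video false negatives: windows that miss the positive label,
    # counted only for failure / bad-outcome videos (true_label == 1).
    false_negatives = int(np.sum(window_preds == 0)) if true_label == 1 else None
    return {"correct": correct, "edt": edt, "false_negatives": false_negatives}
```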
Teams are ranked separately per track based on F1-score. Winners are selected in four categories: Track 1, Track 2, Best Overall Performance, and Best Cross-Dataset Generalization.
Evaluation scripts will be released in this repository by May 1, 2026.
| Date | Milestone |
|---|---|
| March 1, 2026 | Challenge announcement and call for participation |
| March 15, 2026 | Registration opens; training and validation data released |
| May 1, 2026 | Baseline models and evaluation scripts released |
| June 1, 2026 | Test data released (without labels) |
| June 8, 2026 | Submission deadline for predictions |
| June 15, 2026 | Paper submission deadline |
| July 8, 2026 | Notification of acceptance |
| July 23, 2026 | Camera-ready papers due (Hard Deadline) |
| October 5, 2026 | Challenge workshop at ICMI 2026, Napoli, Italy |
The datasets are provided as raw .mp4 video files. Before training with the baseline, you need to extract frames. A utility script is provided for the BAD dataset:
```bash
python utils/extract_frames.py                   # both trainval and test splits
python utils/extract_frames.py --split trainval  # a single split
python utils/extract_frames.py --no_npy          # PNG frames only
python utils/extract_frames.py --no_frames       # NPY arrays only (frames must already exist)
```

This script extracts frames at 5 fps, saves PNG frames and/or NPY arrays, and writes a label_data.csv to each output directory. The Bad Idea dataset is already provided as pre-extracted frames.
After extraction, your data directory should look like:
```text
<dataset_dir>/
├── trainval/                 # raw videos (per participant)
├── trainval_frames/          # extracted PNG frames
│   ├── label_data.csv        # one row per frame: participant_id, q_id, label
│   └── <participant_id>/
│       └── q_<id>_main_<label>_5fps_frame<NNNN>.png
├── trainval_npy/             # pre-processed NPY arrays (same structure as frames)
│   ├── label_data.csv
│   └── <participant_id>/
│       └── q_<id>_main_<label>_5fps_frame<NNNN>.npy
└── test/
    └── ...
```
Note on label_data.csv: this file contains one row per frame (not per video). All frames from the same video share the same label; this is by design, since the baseline dataset classes index individual frames during training.
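For instance, frame-level outputs can be aggregated back into video-level predictions by grouping on the identifiers in label_data.csv. A minimal sketch using a majority vote (the zero-filled pred column is a placeholder for your model's frame-level outputs):

```python
import numpy as np
import pandas as pd

# One row per frame: participant_id, q_id, label (see the layout above).
frames = pd.read_csv("trainval_frames/label_data.csv")

# Placeholder predictions; substitute your model's frame-level outputs.
frames["pred"] = np.zeros(len(frames), dtype=int)

# Majority vote over each video's frames -> one prediction per video.
videos = frames.groupby(["participant_id", "q_id"]).agg(
    label=("label", "first"),  # all frames of a video share the same label
    pred=("pred", "mean"),
)
videos["pred"] = (videos["pred"] >= 0.5).astype(int)
print("video-level accuracy:", (videos["pred"] == videos["label"]).mean())
```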
The datasets contain non-anonymized visual data. To receive access, participants must:
- Register via the challenge website: https://sites.google.com/view/errhri30/
- Read and sign the Data Use Agreement (DUA)
- Receive the dataset download link
Data usage terms include: no redistribution; use of the datasets only for this challenge; no use that defames the recorded participants; and proper data security measures.
Each team may submit up to 3 times on the test set. Participating teams must submit a short paper describing their approach (ICMI 2026 template). Code release is strongly encouraged. All accepted papers will be published in the ACM ICMI 2026 proceedings.
A baseline implementation is provided in the badnet/ directory. See BASELINE.md for full documentation, quick-start instructions, and results.
Baseline results to be added by May 1, 2026.
```text
├── badnet/                       # Baseline implementation
│   ├── badnet_pytorch.py         # Core models and dataset classes
│   ├── train_badnet.py           # Training script
│   ├── get_metrics.py            # Evaluation metrics
│   ├── create_image_splits.py    # Data splitting utilities
│   ├── resize_dataset.py         # Dataset preprocessing (frames → NPY)
│   └── badnet.sub                # SLURM submission script
├── utils/
│   └── extract_frames.py         # Frame extraction from raw .mp4 videos (BAD dataset)
├── eval.py                       # Official evaluation script
├── baddataset_stats.csv          # Per-video stats for BAD dataset (participant, split, video name, label, duration)
├── badidea_dataset_stats.csv     # Per-video stats for Bad Idea dataset (participant, split, video name, label, frame count)
├── BASELINE.md                   # Baseline documentation
├── EVALUATION.md                 # Evaluation protocol and submission format
├── challenge_proposal.pdf        # Challenge proposal (Parreira et al., ICMI'26)
└── LICENSE
```
- Maria Teresa Parreira — Cornell University, USA
- Micol Spitale — Politecnico di Milano, Italy
- Maia Stiber — Microsoft Research, USA
- Shiye Cao — Johns Hopkins University, USA
- Amama Mahmood — Johns Hopkins University, USA
- Chien-Ming Huang — Johns Hopkins University, USA
- Hatice Gunes — University of Cambridge, UK
- Wendy Ju — Cornell Tech, USA
If you use the datasets or code from this challenge, please cite:
```bibtex
@inproceedings{parreira2026errhri,
  title     = {ERR@HRI 3.0 Challenge: Multimodal Detection of Errors and Anticipation in Human-Robot Interactions},
  author    = {Parreira, Maria Teresa and Spitale, Micol and Stiber, Maia and Cao, Shiye and Mahmood, Amama and Huang, Chien-Ming and Gunes, Hatice and Ju, Wendy},
  booktitle = {Proceedings of the 28th ACM International Conference on Multimodal Interaction (ICMI '26)},
  year      = {2026},
  publisher = {ACM}
}
```

This project is licensed under the MIT License — see the LICENSE file for details.
For questions about the baseline code, please use the GitHub issue tracker. For questions about registration, data access, or challenge rules, contact the organizers via the challenge website.