A compact ML pipeline that trains a neural network to tag boosted W/Z→qq̄ jets and compares it to a cut-based baseline, then applies the model to ATLAS Open Data.
Learn more about the data source: https://opendata.atlas.cern/
This project provides an end-to-end workflow to:

- Load MC and data from CSVs (`pythia.csv`, `jets.csv`)
- Build features from event objects (leading jet/lepton kinematics and angular variables)
- Train a small MLP classifier (PyTorch)
- Evaluate training/validation curves, ROC and purity
- Choose a working point and apply it to real data

Outputs (plots, CSVs, and model weights) are written under `out/<revision>/`.
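The "small MLP classifier" can be sketched in PyTorch as follows. This is an illustrative architecture only (layer sizes and the class name `JetTagger` are assumptions, not the script's actual code):

```python
import torch
import torch.nn as nn

class JetTagger(nn.Module):
    """Small binary classifier: event feature vector -> signal score in (0, 1)."""

    def __init__(self, n_features: int = 8, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # sigmoid maps logits to a score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze the trailing dimension so scores have shape (batch,)
        return self.net(x).squeeze(-1)
```

A network this size trains in seconds on CPU, which is why GPU support is optional for this exercise.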
```
NeuralNetworkTaggerLHC/
├── data-exercise-template-v7.py   # Main script: data loading, MLP training, evaluation, plots
├── jets.csv                       # ATLAS Open Data (flattened events)
├── pythia.csv                     # MC in the same format (+ truth label for jets)
├── README.md                      # This file
├── setup.sh                       # Quick setup script (Linux/macOS)
├── setup.bat                      # Quick setup script (Windows)
├── requirements.txt               # Python dependencies
├── flake.nix                      # Nix development environment (optional)
├── flake.lock                     # Nix lockfile
├── aux/                           # Optional: generators and skimming utilities
│   ├── main213.cc
│   └── skimevents.py
└── report/
    └── report.md                  # Notes/report template (optional)
```
- Lightweight event model (`Particle`, `Event`) and CSV loader
- Cut-based selection (baseline) with configurable kinematic cuts
- MLP classifier in PyTorch with train/val split and scheduler
- Metrics and plots:
  - Training vs validation loss and accuracy
  - Score distribution and chosen threshold t*
  - ROC curve (manual AUC)
  - Purity vs threshold and efficiencies
- Application to real data (`jets.csv`) with before/after selection mass histograms
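A "manual AUC" of the kind listed above can be computed by sweeping thresholds over the scores and integrating the ROC curve with the trapezoidal rule. The sketch below is an illustration of that idea (the function name and exact implementation are assumptions, not taken from the script):

```python
import numpy as np

def manual_roc_auc(scores, labels):
    """ROC points and AUC by threshold sweep (trapezoidal integration).

    scores: classifier outputs; labels: 1 = signal, 0 = background.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Sweep every distinct score value as a threshold, highest first
    thresholds = np.sort(np.unique(scores))[::-1]
    tpr = [(scores[labels == 1] >= t).mean() for t in thresholds]
    fpr = [(scores[labels == 0] >= t).mean() for t in thresholds]
    # Anchor the curve at (0, 0) and (1, 1) before integrating
    fpr = np.concatenate([[0.0], fpr, [1.0]])
    tpr = np.concatenate([[0.0], tpr, [1.0]])
    # Trapezoidal rule: sum of segment widths times mean segment heights
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```

A perfectly separating classifier yields AUC = 1.0; random scores give about 0.5.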
Python packages used by the main script:
- numpy
- matplotlib
- scikit-learn
- torch (PyTorch)
An optional Nix flake is provided for fully reproducible dev shells.
Linux/macOS:

```bash
# Run the setup script (uses Python venv and pip)
bash ./setup.sh
```

Windows:

```bat
:: Run the setup script (creates venv and installs requirements)
setup.bat
```

After setup, activate the environment:

- Bash/zsh: `source .venv/bin/activate`
- Fish: `source .venv/bin/activate.fish`
- Windows (cmd): `.venv\Scripts\activate`
```bash
python -m venv .venv
# Bash/zsh: source .venv/bin/activate
# Fish: source .venv/bin/activate.fish
pip install -r requirements.txt
```

If PyTorch fails to install from PyPI on your platform, consult https://pytorch.org for the appropriate install command (CPU/GPU, CUDA version), or use the Nix or conda options below.
```bash
conda create -n nn-tagger python=3.12
conda activate nn-tagger
pip install -r requirements.txt
```

The repo includes a Nix flake that provides a development shell with all core Python deps:
```bash
nix develop
```

- Ensure your environment is activated (venv, conda, or nix shell).
- Run the main script:

```bash
python data-exercise-template-v7.py
```

Key configurable settings are defined near the top of the script:
- Input files: `MC_FILE`, `DATA_FILE`
- Baseline cuts: `min_pt_j`, `min_pt_l`, `min_dphi`, `eta_j_max`
- Training: `epochs`, `batch_size`, optimizer/scheduler
- Output revision tag: `revision` (controls the `out/<rev>/` directory)
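The cut-based baseline simply keeps events whose leading jet and lepton pass these thresholds. A hedged sketch of such a selection (the threshold values and the helper `passes_baseline` are illustrative assumptions, not the script's actual defaults):

```python
# Illustrative defaults only; the real values live at the top of the script.
min_pt_j, min_pt_l, min_dphi, eta_j_max = 250.0, 25.0, 2.0, 2.0

def passes_baseline(jet_pt, jet_eta, lep_pt, dphi_jl):
    """Apply the kinematic cuts to the leading jet/lepton of one event.

    jet_pt, lep_pt in GeV; dphi_jl is the jet-lepton azimuthal separation.
    """
    return (jet_pt > min_pt_j          # hard, boosted jet
            and abs(jet_eta) < eta_j_max  # central jet
            and lep_pt > min_pt_l      # identifiable lepton
            and abs(dphi_jl) > min_dphi)  # back-to-back topology
```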
Artifacts are saved to `out/<revision>/`, including:

- `<title>_model.pt` — trained model weights
- `<title>.png` — training curves (loss/accuracy/LR)
- `<title>_scores.png` — test score distribution with t*
- `<title>_roc.png` — ROC plot (manual AUC)
- `<title>_purity.png` — purity vs threshold with working point
- `<title>_jetmass.png` — data mass histogram before/after cuts
- CSVs: training history, ROC points, purity curve (if writing succeeds)
Below are example figures produced by `data-exercise-template-v7.py` (see also `report/report.md`). Paths assume the default `revision='17'`:
If these images are not present yet, run the script to generate them; they will be written under out/<revision>/.
Each row contains a single reconstructed particle. Events are grouped by `event_id` and split into large-R jets (`pid = ±90`) and leptons (others). For each event, the script builds a feature vector from the leading jet and leading lepton.
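That grouping step can be sketched with the standard library alone. This is a minimal illustration under assumed column names (`event_id`, `pid`, `pt`, `eta`, `phi`, `m`); the script's actual loader and `Particle`/`Event` classes may differ:

```python
import csv
from collections import defaultdict

def load_events(path):
    """Group CSV rows into events, split into jets (pid = ±90) and leptons."""
    events = defaultdict(lambda: {"jets": [], "leptons": []})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            particle = {k: float(row[k]) for k in ("pt", "eta", "phi", "m")}
            kind = "jets" if abs(int(float(row["pid"]))) == 90 else "leptons"
            events[row["event_id"]][kind].append(particle)
    # Sort by descending pT so index 0 is the leading object of each kind
    for ev in events.values():
        for kind in ("jets", "leptons"):
            ev[kind].sort(key=lambda p: p["pt"], reverse=True)
    return dict(events)
```

The leading jet and leading lepton of each event then feed the feature vector described above.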
- Reproducibility: seeds for `random`, `numpy`, and `torch` are set to 42.
- GPU is optional; the model is tiny and trains fast on CPU. For GPU, install the appropriate CUDA-enabled PyTorch wheel.
- If you change the feature set, adjust the network input dimension accordingly.
- Linux: if `ImportError: libstdc++.so.6` appears when importing NumPy, install your system's C++ runtime (e.g., `sudo apt install libstdc++6`). Using the provided Nix flake also avoids these issues.
- Fish shell: activate with `source .venv/bin/activate.fish`.
- Force a CPU-only torch wheel on Linux: run `TORCH_CPU=1 bash setup.sh`.
This project is licensed under the terms in the LICENSE file.
Built for educational purposes using ATLAS Open Data. Thanks to the open data community for resources and inspiration.
Part of Applied Computational Physics and Machine Learning coursework




