Skip to content

PyTorch MLP for boosted W/Z→qq̄ jet tagging on ATLAS Open Data; compares to cuts and outputs ROC/purity plots.

License

Notifications You must be signed in to change notification settings

FreaxMATE/NeuralNetworkTaggerLHC

Repository files navigation

Neural Network Tagger for Hadronic V→qq̄ Jets

A compact ML pipeline that trains a neural network to tag boosted W/Z→qq̄ jets and compares it to a cut-based baseline, then applies the model to ATLAS Open Data.

Learn more about the data source: https://opendata.atlas.cern/

🌟 Overview

This project provides an end-to-end workflow to:

  • Load MC and data from CSVs (pythia.csv, jets.csv)
  • Build features from event objects (leading jet/lepton kinematics and angular variables)
  • Train a small MLP classifier (PyTorch)
  • Evaluate training/validation curves, ROC and purity
  • Choose a working point and apply to real data

Outputs (plots, CSVs, and model weights) are written under out/<revision>/.

📁 Project Structure

NeuralNetworkTaggerLHC/
├── data-exercise-template-v7.py   # Main script: data loading, MLP training, evaluation, plots
├── jets.csv                       # ATLAS Open Data (flattened events)
├── pythia.csv                     # MC in the same format (+ truth label for jets)
├── README.md                      # This file
├── setup.sh                       # Quick setup script (Linux/macOS)
├── setup.bat                      # Quick setup script (Windows)
├── requirements.txt               # Python dependencies
├── flake.nix                      # Nix development environment (optional)
├── flake.lock                     # Nix lockfile
├── aux/                           # Optional: generators and skimming utilities
│   ├── main213.cc
│   └── skimevents.py
└── report/
    └── report.md                  # Notes/report template (optional)

🚀 Features

  • Lightweight event model (Particle, Event) and CSV loader
  • Cut-based selection (baseline) with configurable kinematic cuts
  • MLP classifier in PyTorch with train/val split and scheduler
  • Metrics and plots:
    • Training vs validation loss and accuracy
    • Score distribution and chosen threshold t*
    • ROC curve (manual AUC)
    • Purity vs threshold and efficiencies
  • Application to real data (jets.csv) with before/after selection mass histograms

🛠️ Dependencies

Python packages used by the main script:

  • numpy
  • matplotlib
  • scikit-learn
  • torch (PyTorch)

An optional Nix flake is provided for fully reproducible dev shells.

💻 Setup & Installation

Option 1: Quick Setup (Recommended)

Linux/macOS:

# Run the setup script (uses Python venv and pip)
bash ./setup.sh

Windows:

:: Run the setup script (creates venv and installs requirements)
setup.bat

After setup, activate the environment:

  • Bash/zsh: source .venv/bin/activate
  • Fish: source .venv/bin/activate.fish
  • Windows (cmd): .venv\Scripts\activate

Option 2: Manual Setup (pip)

python -m venv .venv
# Bash/zsh:  source .venv/bin/activate
# Fish:      source .venv/bin/activate.fish
pip install -r requirements.txt

If PyTorch fails to install from PyPI on your platform, consult https://pytorch.org for the appropriate install command (CPU/GPU, CUDA version), or use the Nix or conda options below.

Option 3: Conda (optional)

conda create -n nn-tagger python=3.12
conda activate nn-tagger
pip install -r requirements.txt

Option 4: Nix (Advanced/Optional)

The repo includes a Nix flake that provides a development shell with all core Python deps:

nix develop

🎯 Usage

  1. Ensure your environment is activated (venv, conda, or nix shell).
  2. Run the main script:
python data-exercise-template-v7.py

Key configurable settings are defined near the top of the script:

  • Input files: MC_FILE, DATA_FILE
  • Baseline cuts: min_pt_j, min_pt_l, min_dphi, eta_j_max
  • Training: epochs, batch_size, optimizer/scheduler
  • Output revision tag: revision (controls out/<rev>/ directory)

📊 Output

Artifacts are saved to out/<revision>/, including:

  • <title>_model.pt — trained model weights
  • <title>.png — training curves (loss/accuracy/LR)
  • <title>_scores.png — test score distribution with t*
  • <title>_roc.png — ROC plot (manual AUC)
  • <title>_purity.png — purity vs threshold with working point
  • <title>_jetmass.png — data mass histogram before/after cuts
  • CSVs: training history, ROC points, purity curve (if writing succeeds)

📈 Results (sample outputs)

Below are example figures produced by data-exercise-template-v7.py (see also report/report.md). Paths assume the default revision='17':

Training: loss, accuracy, learning rate

ROC curve on MC test split

Purity vs threshold with working point t*

Classifier scores on MC test split

Data: jet mass before/after cuts

If these images are not present yet, run the script to generate them; they will be written under out/<revision>/.

🧪 Data format (CSV)

Each row contains a single reconstructed particle. Events are grouped by event_id and split into large-R jets (pid=±90) and leptons (others). For each event, the script builds a feature vector from the leading jet and leading lepton.

⚙️ Tips & Notes

  • Reproducibility: seeds for random, numpy, and torch are set to 42.
  • GPU is optional; the model is tiny and trains fast on CPU. For GPU, install the appropriate CUDA-enabled PyTorch wheel.
  • If you change the feature set, adjust the network input dimension accordingly.

Troubleshooting

  • Linux: if ImportError: libstdc++.so.6 appears when importing NumPy, install your system's C++ runtime (e.g., sudo apt install libstdc++6). Using the provided Nix flake also avoids these issues.
  • Fish shell: activate with source .venv/bin/activate.fish.
  • Force CPU-only torch wheel on Linux: run TORCH_CPU=1 bash setup.sh.

📄 License

This project is licensed under the terms in the LICENSE file.

🤝 Acknowledgments

Built for educational purposes using ATLAS Open Data. Thanks to the open data community for resources and inspiration.


Part of Applied Computational Physics and Machine Learning coursework

About

PyTorch MLP for boosted W/Z→qq̄ jet tagging on ATLAS Open Data; compares to cuts and outputs ROC/purity plots.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published