Three consecutive runner-up finishes. 84 points. 89 points. A squad built over years under Mikel Arteta. The numbers say Arsenal are close — but close isn't the title.
Unfinished Business is a data-driven research project that combines football analytics and machine learning to study Arsenal's Premier League performance from the 2022–23 season through the ongoing 2025–26 campaign.
| # | Goal |
|---|---|
| 1 | Build a structured multi-season match dataset |
| 2 | Perform statistical and ML-based performance analysis |
| 3 | Identify key factors influencing wins/losses |
| 4 | Compare multiple ML models for prediction |
| 5 | Extend to player-level tactical insights |
The dataset contains structured match-level data across four seasons, one CSV file per season:
- arsenal_22-23.csv
- arsenal_23-24.csv
- arsenal_24-25.csv
- arsenal_25-26.csv
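A minimal sketch of how the four season files could be combined into one frame with pandas. The `load_seasons` helper and the `data_dir` argument are hypothetical names used for illustration, and the directory that holds the files is an assumption; only the file names come from the list above.

```python
from pathlib import Path

import pandas as pd

# File names as listed above; where they live on disk is an assumption.
SEASON_FILES = [
    "arsenal_22-23.csv",
    "arsenal_23-24.csv",
    "arsenal_24-25.csv",
    "arsenal_25-26.csv",
]


def load_seasons(data_dir=".", files=SEASON_FILES):
    """Read every season file found in data_dir and stack them into one frame."""
    frames = []
    for name in files:
        path = Path(data_dir) / name
        if path.exists():  # skip any season file that is not present
            frames.append(pd.read_csv(path))
    if not frames:
        raise FileNotFoundError(f"no season files found in {data_dir!r}")
    # ignore_index gives the combined frame a clean 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Calling `load_seasons("path/to/csvs")` returns a single DataFrame with one row per match, ready for the preprocessing steps.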
| Feature | Description |
|---|---|
| season | Season |
| gw | Gameweek |
| opponent | Opponent team |
| venue | Home/Away |
| goals_for | Goals scored |
| goals_against | Goals conceded |
| result | W/D/L |
| points | Match points |
| opp_table_position | Opponent rank |
| opp_strength_bucket | Strength category |
- Missing values handled
- Created features:
  - win (target variable)
  - goal_diff
  - is_home
  - Rolling form (form_last5)
  - Opponent encoding
  - Strength encoding
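The created features above could be derived roughly as follows. This is a sketch under stated assumptions, not the project's actual code: `add_features`, `opponent_code`, and `strength_code` are hypothetical names, `venue` is assumed to hold values such as "Home"/"Away", and `form_last5` is assumed to mean points earned in the previous five matches of the same season (excluding the current match).

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the engineered columns added."""
    out = df.sort_values(["season", "gw"]).reset_index(drop=True)

    # Binary target: 1 for a win, 0 for a draw or loss
    out["win"] = (out["result"] == "W").astype(int)

    # goal_diff is computed from the final score, so it describes the match
    # after the fact rather than being known before kick-off
    out["goal_diff"] = out["goals_for"] - out["goals_against"]

    # Assumption: venue is spelled "Home"/"Away" (or "H"/"A")
    out["is_home"] = (
        out["venue"].str.strip().str.lower().isin(["home", "h"]).astype(int)
    )

    # Assumption: rolling form = points from the previous five matches.
    # shift(1) keeps the current match out of its own form value.
    out["form_last5"] = (
        out.groupby("season")["points"]
        .transform(lambda s: s.shift(1).rolling(5, min_periods=1).sum())
        .fillna(0)
    )

    # Simple integer encodings of the categorical columns
    out["opponent_code"] = out["opponent"].astype("category").cat.codes
    out["strength_code"] = out["opp_strength_bucket"].astype("category").cat.codes
    return out
```

Shifting before the rolling sum is a design choice: without it, each match's own points would leak into the form value used to predict that match.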
The final feature set is used to train and compare five classification models:
- Logistic Regression
- Decision Tree
- Random Forest
- KNN
- SVM
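One way to compare these five models is cross-validated accuracy with scikit-learn. The sketch below is illustrative only: `compare_models` is a hypothetical helper, the hyperparameters are arbitrary defaults rather than the project's settings, and the demo data is synthetic, so the printed scores say nothing about Arsenal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, cv=5, seed=0):
    """Return mean cross-validated accuracy for each of the five models."""
    models = {
        # Scale-sensitive models are wrapped in a pipeline with a scaler
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic stand-in data (NOT real match data), only to show the call
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
noise = rng.normal(scale=0.5, size=120)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

scores = compare_models(X_demo, y_demo)
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:<20} {acc:.3f}")
```

Because matches are ordered in time, a chronological split (for example scikit-learn's `TimeSeriesSplit`) may be a better fit than shuffled folds on the real data.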
Graphs are stored in outputs/
- Recent form (last 5 matches, form_last5) is the strongest predictor of a win
- Home advantage significantly impacts results
- Performance drops against stronger opponents
- Matches cluster into distinct performance patterns
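Findings such as the home-advantage and opponent-strength effects can be checked with a simple grouped win rate. This is a minimal sketch: `win_rate_by` is a hypothetical helper, and the four rows below are made up purely to show the call, not taken from the dataset.

```python
import pandas as pd


def win_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of matches won within each value of the given column."""
    return df["result"].eq("W").groupby(df[column]).mean()


# Made-up rows for illustration only (not real results)
toy = pd.DataFrame(
    {
        "venue": ["Home", "Home", "Away", "Away"],
        "opp_strength_bucket": ["top", "mid", "top", "mid"],
        "result": ["W", "L", "W", "D"],
    }
)

print(win_rate_by(toy, "venue"))
print(win_rate_by(toy, "opp_strength_bucket"))
```

On the real dataset, the same call with `"venue"` or `"opp_strength_bucket"` gives the per-group win rates behind the second and third findings.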
Install the dependencies with `pip install -r requirements.txt`, then run the analysis with `python main.py`.
This project will be extended with a player-level dataset, enabling:
- Player impact modeling
- Best XI combinations
- Tactical pattern extraction
- Advanced prediction models
Anant Jain



