This project is an end-to-end data engineering and analytics pipeline built on the FIFA 21 dataset (18,979 players). The primary objective was to transform a highly "dirty" dataset, characterized by inconsistent units, special characters, and structural noise, into a high-integrity analytical asset.
Beyond cleaning, the project explores the strategic economics of football, identifying trends in player value, development windows, and organizational brand power.
Raw data is rarely ready for a business report. This project tackled several complex cleaning hurdles using Python and Pandas:
- The Challenge: `Height` and `Weight` were stored as objects with mixed units (e.g., `6'0"`, `175cm`, `185lbs`, `70kg`).
- The Solution: I engineered custom parsing functions to standardize all entries into centimeters (cm) and kilograms (kg).
- The Challenge: `Value`, `Wage`, and `Release Clause` contained currency symbols (€) and text suffixes (M, K).
- The Solution: I developed logic to strip the symbols and multiply each value by its magnitude (K = thousand, M = million), converting the columns into numeric floats.
- Star Ratings: Stripped the "★" symbol from `Weak_Foot`, `Skill_Moves`, and `International_Reputation`.
- Structural Noise: Removed hidden newline characters (`\n`) from the `Club` column.
- Integrity Checks: Handled the `Hits` column by parsing "K" values and filling null entries with `0`.
- Joined Date: Converted the `Joined` column into datetime objects.
- Tenure Metric: Created a new feature, `Years_at_Club`, to analyze player loyalty.
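
The parsing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names (`height_to_cm`, `weight_to_kg`, `money_to_float`, `years_at_club`), the `Jul 1, 2004` style date format, and the reference date passed as `as_of` are all assumptions. The helpers are plain Python so that each one handles a single cell value.

```python
import re
from datetime import date, datetime

CM_PER_INCH = 2.54
KG_PER_LB = 0.45359237
MAGNITUDES = {"K": 1_000, "M": 1_000_000}


def height_to_cm(raw: str) -> float:
    """Convert heights such as 6'0" or 175cm to centimeters."""
    text = raw.strip().lower()
    if text.endswith("cm"):
        return float(text[:-2])
    match = re.fullmatch(r"(\d+)'\s*(\d+)\"?", text)
    if match is None:
        raise ValueError(f"unrecognized height: {raw!r}")
    feet, inches = int(match.group(1)), int(match.group(2))
    return round((feet * 12 + inches) * CM_PER_INCH, 1)


def weight_to_kg(raw: str) -> float:
    """Convert weights such as 185lbs or 70kg to kilograms."""
    text = raw.strip().lower()
    if text.endswith("kg"):
        return float(text[:-2])
    if text.endswith("lbs"):
        return round(float(text[:-3]) * KG_PER_LB, 1)
    raise ValueError(f"unrecognized weight: {raw!r}")


def money_to_float(raw: str) -> float:
    """Strip the euro sign and expand a trailing K/M suffix."""
    text = raw.strip().replace("€", "")
    multiplier = MAGNITUDES.get(text[-1:].upper(), 1)
    if multiplier != 1:
        text = text[:-1]
    return float(text) * multiplier


def years_at_club(joined: str, as_of: date) -> int:
    """Whole years between the join date and an assumed reference date."""
    joined_on = datetime.strptime(joined, "%b %d, %Y").date()  # assumed format
    years = as_of.year - joined_on.year
    # Subtract one year if the anniversary has not yet been reached.
    if (as_of.month, as_of.day) < (joined_on.month, joined_on.day):
        years -= 1
    return years
```

With pandas, helpers like these would typically be applied column-wise, for example `df["Height"].map(height_to_cm)`.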
Analysis identified that wages follow an exponential growth curve. While player compensation remains relatively flat for ratings 50–80, "Elite" players (85+) command a massive market premium.
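
One way to test an "exponential" claim like this is to fit a straight line to the logarithm of wages: if wage is roughly `a * exp(b * rating)`, then `log(wage)` is linear in rating. The sketch below does this with ordinary least squares on synthetic, illustrative numbers (not values from the dataset). `fit_log_linear` is a hypothetical helper name, and the approach assumes all wages are positive.

```python
import math


def fit_log_linear(ratings, wages):
    """Least-squares fit of log(wage) = intercept + slope * rating."""
    logs = [math.log(w) for w in wages]
    n = len(ratings)
    mean_x = sum(ratings) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ratings)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ratings, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic wages generated from an exact exponential, for illustration only.
ratings = list(range(50, 95, 5))
wages = [500 * math.exp(0.12 * (r - 50)) for r in ratings]

slope, intercept = fit_log_linear(ratings, wages)
# Fractional wage increase associated with one extra rating point.
growth_per_point = math.exp(slope) - 1
```

On real data, a roughly constant `growth_per_point` across the rating range would support the exponential reading, while a slope that steepens sharply above 85 would point to an even stronger elite premium.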

A hexbin jointplot revealed a strong correlation between stature and heading accuracy. Notably, the data isolated a "Goalkeeper Cluster"—tall players with near-zero heading accuracy.

By visualizing the gap between a player's current rating and their potential, I identified the Return on Investment (ROI) Zone between ages 16 and 23.
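
The gap described here is simply potential minus current rating, averaged per age. A small sketch, assuming fields named `Age`, `Overall_Rating`, and `Potential` (the `Age` and `Potential` names are assumptions) and using made-up rows purely for illustration:

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows; real values would come from the cleaned dataset.
players = [
    {"Age": 17, "Overall_Rating": 62, "Potential": 84},
    {"Age": 17, "Overall_Rating": 60, "Potential": 80},
    {"Age": 22, "Overall_Rating": 74, "Potential": 82},
    {"Age": 29, "Overall_Rating": 81, "Potential": 81},
]


def mean_growth_gap_by_age(rows):
    """Average (Potential - Overall_Rating) for each age, sorted by age."""
    gaps = defaultdict(list)
    for row in rows:
        gaps[row["Age"]].append(row["Potential"] - row["Overall_Rating"])
    return {age: fmean(values) for age, values in sorted(gaps.items())}


gap_by_age = mean_growth_gap_by_age(players)
```

In pandas the same idea is a derived gap column followed by a `groupby` on age and a mean.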

The data reveals a "Selection Bias" in loyalty. Long-tenure players (10+ years) consistently hold higher average ratings, which suggests that mostly high performers remain at a club long-term.

Boxplot analysis revealed that left-footed players have a higher median Overall_Rating than right-footed players, suggesting a higher "quality floor" for these specialists.

The analysis also identified the "Heavyweights" of football: FC Bayern München, PSG, and Real Madrid lead the world in "Star Density," showcasing recruitment strategies focused on global prestige.

Clubs set release clauses significantly higher than market value—often by 20% to 50%—to act as a financial deterrent against rival acquisitions.
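
The premium can be expressed as the percentage by which the release clause exceeds market value. A minimal sketch, where `release_clause_premium_pct` is a hypothetical helper and the figures are illustrative rather than taken from the dataset:

```python
def release_clause_premium_pct(market_value: float, release_clause: float) -> float:
    """Percentage by which the release clause exceeds market value."""
    if market_value <= 0:
        raise ValueError("market value must be positive")
    return (release_clause - market_value) / market_value * 100


# Illustrative figures only: a EUR 10M valuation with a EUR 14M clause.
premium = release_clause_premium_pct(10_000_000, 14_000_000)
```

A clause of €14M on a €10M valuation works out to a premium of about 40%, inside the 20% to 50% band described above.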

- Language: Python 3.x
- Data Wrangling: Pandas, NumPy
- Visualization: Seaborn, Matplotlib
- Environment: Jupyter Notebook
This project demonstrates that the most valuable insights are hidden behind the messiest columns. By moving beyond basic cleaning and into strategic feature engineering, I was able to quantify the financial and developmental patterns that govern professional football.
Author: Saviour Amegayie
Date: February 2026
License: MIT