Sugarcane is one of the world's most economically significant crops, accounting for approximately 80% of global sugar production and serving as a primary feedstock for sugar, biofuel, molasses, and ethanol. It is cultivated across tropical and subtropical regions spanning multiple continents.
This project performs a focused Exploratory Data Analysis (EDA) on a dataset of global sugarcane production statistics — examining how production volume, cultivated acreage, and yield efficiency vary across countries and continents.
- Dataset
- Project Structure
- Data Cleaning Steps
- Analysis Sections
- Key Findings
- Tech Stack
- How to Run
- Author
| Attribute | Details |
|---|---|
| Source | Sugarcane Dataset — Kaggle (komalharshita) |
| File | List of Countries by Sugarcane Production.csv |
| Rows | One row per country |
| Target | No target — this is an unsupervised EDA |
| # | Column | Type | Description |
|---|---|---|---|
| 1 | Country | Categorical | Name of the country |
| 2 | Continent | Categorical | Continent the country belongs to |
| 3 | Production (Tons) | Numerical | Total sugarcane produced (metric tons) |
| 4 | Production per Person (Kg) | Numerical | Per-capita sugarcane production (kg) |
| 5 | Acreage (Hectare) | Numerical | Area under sugarcane cultivation (hectares) |
| 6 | Yield (Kg / Hectare) | Numerical | Productivity per unit area (kg per hectare) |
sugarcane-eda/
│
├── sugarcane_eda.ipynb # Main EDA notebook (portfolio version)
├── List of Countries by Sugarcane Production.csv # Raw dataset
└── README.md # This file
The raw dataset requires significant cleaning before analysis is possible:
- String formatting: Numerical columns use European formatting (`.` as thousands separator, `,` as decimal mark). The thousands separators are stripped and the decimal commas replaced with points so that Python can parse the values.
- Column renaming: Spaces and special characters are replaced to allow clean pandas attribute access.
- Missing value handling: Rows containing nulls are identified and removed, and the index is reset.
- Dropping redundant columns: The `index` and `Unnamed: 0` columns are removed after the reset.
- Type conversion: All numerical columns are cast from `object` (string) to `float`.
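The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the sample rows are invented, and the snake_case names in `RENAMES` and the `clean` helper are assumptions made for this example.

```python
import pandas as pd

# Hypothetical rows that mimic the raw file's formatting; the values are made up.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Country": ["Country A", "Country B", "Country C"],
    "Continent": ["Asia", "Africa", "Asia"],
    "Production (Tons)": ["1.234.567", "89.012", None],
    "Production per Person (Kg)": ["3.668,531", "260,721", "10,5"],
    "Acreage (Hectare)": ["10.226", "4.950", "100"],
    "Yield (Kg / Hectare)": ["75.167,5", "70.393,5", "60.000"],
})

# Assumed target names; the notebook may use different ones.
RENAMES = {
    "Production (Tons)": "Production_tons",
    "Production per Person (Kg)": "Production_per_person_kg",
    "Acreage (Hectare)": "Acreage_hectare",
    "Yield (Kg / Hectare)": "Yield_kg_per_hectare",
}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. String formatting: drop "." thousands separators, then turn the
    #    "," decimal mark into ".". regex=False treats "." as a literal dot.
    for col in RENAMES:
        df[col] = (
            df[col]
            .str.replace(".", "", regex=False)
            .str.replace(",", ".", regex=False)
        )

    # 2. Column renaming: names without spaces or brackets.
    df = df.rename(columns=RENAMES)

    # 3. Missing values: drop rows with nulls. reset_index() keeps the old
    #    row labels in a new "index" column.
    df = df.dropna().reset_index()

    # 4. Redundant columns left over from the export and the reset.
    df = df.drop(columns=["index", "Unnamed: 0"])

    # 5. Type conversion: string columns to float.
    numeric_cols = list(RENAMES.values())
    df[numeric_cols] = df[numeric_cols].astype(float)
    return df


cleaned = clean(raw)
print(cleaned.dtypes)
```

Passing `regex=False` matters in step 1: if the pattern were interpreted as a regular expression, `.` would match every character instead of only the literal separator.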
| Section | Description |
|---|---|
| 6.1 | Continent-wise country count — bar chart of producing countries per continent |
| 6.2 | Distribution plots — histogram + KDE for all four numerical features |
| 6.3 | Boxplots — spread, IQR, and outlier detection per feature |
| 6.4 | Summary statistics — mean, std, quartiles, min/max |
- Asia and South America account for the largest number of sugarcane-producing countries, reflecting the crop's preference for tropical and subtropical climates.
- Production (Tons) and Acreage (Hectare) are both strongly right-skewed: a small number of countries (notably Brazil and India) produce the vast majority of the world's sugarcane.
- Yield (Kg/Hectare) shows less extreme skew than raw production, suggesting that some smaller-acreage countries achieve competitive yield efficiency through advanced agricultural practices.
- Production per Person (Kg) varies widely: countries with small populations but large cultivation areas show very high per-capita values, even if total absolute production is modest.
- The boxplots confirm that outliers dominate the production and acreage distributions. Log-transformation of these columns is recommended before any downstream modelling.
- Apply log-transformation to Production, Acreage, and Yield before further analysis
- Perform bivariate analysis: Acreage vs Production, Yield vs Continent
- Identify the top 10 producers and visualise them on a world map (geopandas / plotly)
- Investigate whether higher acreage always leads to proportionally higher production
- Add a bivariate heatmap to surface correlations between all numerical features
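Several of the follow-up steps listed above need only pandas and NumPy. The sketch below is one possible starting point, not part of the existing notebook: the six rows are invented, the column names are assumed, and the world-map step is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-cleaned values; not taken from the dataset.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Production_tons": [9e8, 3e8, 2e7, 5e6, 1e6, 2e5],
    "Acreage_hectare": [1e7, 4e6, 3e5, 8e4, 2e4, 5e3],
})

# Log-transformation: log1p(x) = log(1 + x) stays defined at zero.
raw_skew = df["Production_tons"].skew()
df["log_production"] = np.log1p(df["Production_tons"])
df["log_acreage"] = np.log1p(df["Acreage_hectare"])
log_skew = df["log_production"].skew()
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# Top producers (use n=10 on the real data; 3 is enough for six rows).
top = df.nlargest(3, "Production_tons")[["Country", "Production_tons"]]

# Does production scale proportionally with acreage? A constant ratio
# would mean yes; a ratio that varies by country means it does not.
df["tons_per_hectare"] = df["Production_tons"] / df["Acreage_hectare"]

# Pairwise Pearson correlations, the input for a correlation heatmap.
corr = df[["Production_tons", "Acreage_hectare", "tons_per_hectare"]].corr()
print(corr.round(3))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would render the correlation heatmap mentioned in the last item.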
| Library | Purpose |
|---|---|
| `pandas` | Data loading, cleaning, and manipulation |
| `matplotlib` | Base plotting framework |
| `seaborn` | Statistical visualisations |
| `IPython.display` | Custom HTML/CSS rendering in notebook |
- Open the notebook on Kaggle.
- Fork the notebook and run all cells directly in the Kaggle environment — the dataset path is pre-configured.
# Clone the repository
git clone https://github.com/komalharshita/sugarcane-eda.git
cd sugarcane-eda
# Install dependencies
pip install pandas matplotlib seaborn notebook
# Launch Jupyter
jupyter notebook sugarcane_eda_portfolio.ipynb
Note: Update the dataset path in the loading cell:
df = pd.read_csv('List of Countries by Sugarcane Production.csv')
- Sugarcane Dataset — Kaggle (komalharshita)
- Seaborn Documentation
- Matplotlib Documentation
- Food and Agriculture Organization of the United Nations (FAO) — FAOSTAT crop production data.
Komal Harshita, Computer Science Student
If you found this project useful, consider giving it a ⭐ on GitHub and an upvote on Kaggle!