💚 Global Sugarcane Production — EDA

Exploring production patterns, acreage, and yield across countries and continents

Overview

Sugarcane is one of the world's most economically significant crops, accounting for approximately 80% of global sugar production and serving as a feedstock for molasses and for biofuels such as ethanol. It is cultivated in tropical and subtropical regions on several continents.

This project performs a focused Exploratory Data Analysis (EDA) on a dataset of global sugarcane production statistics — examining how production volume, cultivated acreage, and yield efficiency vary across countries and continents.

Dataset

Attribute	Details
Source	Sugarcane Dataset — Kaggle (komalharshita)
File	`List of Countries by Sugarcane Production.csv`
Rows	One row per country
Target	No target — this is an unsupervised EDA

Feature Summary

#	Column	Type	Description
1	Country	Categorical	Name of the country
2	Continent	Categorical	Continent the country belongs to
3	Production (Tons)	Numerical	Total sugarcane produced (metric tons)
4	Production per Person (Kg)	Numerical	Per-capita sugarcane production (kg)
5	Acreage (Hectare)	Numerical	Area under sugarcane cultivation (hectares)
6	Yield (Kg / Hectare)	Numerical	Productivity per unit area (kg per hectare)
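
The production, acreage, and yield columns are linked by their units: yield in kg per hectare should equal production converted to kg, divided by acreage. The following is a minimal sketch of that arithmetic. The function name `expected_yield_kg_per_ha` and the sample figures are hypothetical, and whether the dataset's published yield column matches this calculation exactly is an assumption to test, not a given.

```python
KG_PER_METRIC_TON = 1_000


def expected_yield_kg_per_ha(production_tons: float, acreage_ha: float) -> float:
    """Derive yield (kg/ha) from production (metric tons) and acreage (ha)."""
    if acreage_ha <= 0:
        raise ValueError("acreage_ha must be positive")
    return production_tons * KG_PER_METRIC_TON / acreage_ha


# Hypothetical figures, not taken from the dataset:
# 500,000 t grown on 10,000 ha.
print(expected_yield_kg_per_ha(500_000, 10_000))  # 50000.0
```

Comparing this derived value against the Yield (Kg / Hectare) column is a cheap way to spot rows where the number parsing went wrong.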

Project Structure

sugarcane-eda/
│
├── eda-sugarcane.ipynb                             # Main EDA notebook (portfolio version)
├── List of Countries by Sugarcane Production.csv   # Raw dataset
└── README.md                                       # This file

Data Cleaning Steps

The raw dataset requires significant cleaning before analysis is possible:

String formatting: Numerical columns use European number formatting (`.` as the thousands separator, `,` as the decimal mark). The thousands separators are stripped and the decimal commas are replaced with `.` so that Python can parse the values.
Column renaming: Spaces and special characters in column names are replaced to allow clean pandas attribute access.
Missing value handling: Rows containing nulls are identified and removed, then the index is reset.
Dropping redundant columns: The `index` and `Unnamed: 0` columns are removed after the reset.
Type conversion: All numerical columns are cast from object (string) to float.
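
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the inline CSV rows are invented, the snake_case column names in `RENAMES` are hypothetical, and `dtype=str` is used here only so that the small sample is read as text in the same way the full file is.

```python
import io

import pandas as pd

# Tiny stand-in for the raw CSV. The rows are invented for illustration only.
RAW_CSV = '''\
,Country,Continent,Production (Tons),Production per Person (Kg),Acreage (Hectare),Yield (Kg / Hectare)
0,Country A,Asia,1.234.567,"12,5",20.000,"61.728,4"
1,Country B,Africa,98.765,"3,25",,"45.000,0"
2,Country C,Europe,5.000,"0,5",100,"50.000,0"
'''

# Hypothetical snake_case names for the four numerical columns.
RENAMES = {
    "Production (Tons)": "production_tons",
    "Production per Person (Kg)": "production_per_person_kg",
    "Acreage (Hectare)": "acreage_ha",
    "Yield (Kg / Hectare)": "yield_kg_per_ha",
}

# Read everything as text so pandas does not misread "20.000" as 20.0.
df = pd.read_csv(io.StringIO(RAW_CSV), dtype=str)

# String formatting: drop "." thousands separators, then turn "," into ".".
for col in RENAMES:
    df[col] = (
        df[col]
        .str.replace(".", "", regex=False)
        .str.replace(",", ".", regex=False)
    )

# Column renaming.
df = df.rename(columns=RENAMES)

# Missing value handling: drop incomplete rows, then reset the index.
# reset_index() keeps the old index as a new column named "index".
df = df.dropna().reset_index()

# Dropping redundant columns left over from the CSV and the reset.
df = df.drop(columns=["index", "Unnamed: 0"])

# Type conversion: text to float.
num_cols = list(RENAMES.values())
df[num_cols] = df[num_cols].astype(float)

print(df.dtypes)
```

The order of the two `str.replace` calls matters: swapping the decimal comma first would turn it into a `.` that the thousands-separator step then deletes.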

Analysis Sections

Section	Description
6.1	Continent-wise country count — bar chart of producing countries per continent
6.2	Distribution plots — histogram + KDE for all four numerical features
6.3	Boxplots — spread, IQR, and outlier detection per feature
6.4	Summary statistics — mean, std, quartiles, min/max

Key Findings

Asia and South America account for the largest number of sugarcane-producing countries, reflecting the crop's preference for tropical and subtropical climates.
Production (Tons) and Acreage (Hectare) are both strongly right-skewed — a small number of countries produce the vast majority of the world's sugarcane (notably Brazil and India).
Yield (Kg/Hectare) shows less extreme skew than raw production, suggesting that some smaller-acreage countries achieve competitive yield efficiency through advanced agricultural practices.
Production per Person (Kg) varies widely — countries with small populations but large cultivation areas show very high per-capita values, even if total absolute production is modest.
The boxplots confirm that outliers dominate the production and acreage distributions. Log-transformation of these columns is recommended before any downstream modelling.
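
The effect of a log-transformation on skew and on boxplot outliers can be illustrated with a small sketch. The values below are synthetic (powers of ten chosen to be strongly right-skewed), not dataset figures, and `iqr_outlier_mask` is a hypothetical helper that applies the usual 1.5 x IQR whisker rule used by default in matplotlib and seaborn boxplots.

```python
import numpy as np
import pandas as pd

# Synthetic, deliberately right-skewed values. Not real production data.
production = pd.Series([1e3, 1e4, 1e5, 1e6, 1e7, 1e8], name="production_tons")

# log10 is fine here because every value is positive;
# a column containing zeros would need something like np.log1p instead.
log_production = np.log10(production)


def iqr_outlier_mask(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Flag points beyond the boxplot whiskers (Q1/Q3 -/+ whisker * IQR)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - whisker * iqr) | (values > q3 + whisker * iqr)


print(f"skew before: {production.skew():.2f}")
print(f"skew after:  {log_production.skew():.2f}")
print("outliers before:", int(iqr_outlier_mask(production).sum()))
print("outliers after: ", int(iqr_outlier_mask(log_production).sum()))
```

On this synthetic series the largest value is flagged as an outlier on the raw scale and none are flagged after the transform, which is the behaviour the recommendation relies on.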

Recommended Next Steps

Apply log-transformation to Production, Acreage, and Yield before further analysis
Perform bivariate analysis: Acreage vs Production, Yield vs Continent
Identify the top 10 producers and visualise them on a world map (geopandas / plotly)
Investigate whether higher acreage always leads to proportionally higher production
Add a correlation heatmap to surface pairwise relationships between all numerical features
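
A few of these steps can be prototyped with pandas alone. The sketch below uses invented rows and hypothetical snake_case column names; it ranks the top producers, builds the Pearson correlation matrix that a heatmap would display, and uses the spread of yield as a rough check on whether production scales proportionally with acreage.

```python
import pandas as pd

# Invented rows standing in for the cleaned dataset.
df = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D", "E"],
        "production_tons": [900_000.0, 400_000.0, 50_000.0, 20_000.0, 5_000.0],
        "acreage_ha": [12_000.0, 8_000.0, 1_000.0, 250.0, 100.0],
    }
)
df["yield_kg_per_ha"] = df["production_tons"] * 1_000 / df["acreage_ha"]
NUM_COLS = ["production_tons", "acreage_ha", "yield_kg_per_ha"]

# Top producers (use 10 instead of 3 on the full dataset).
top = df.nlargest(3, "production_tons")[["Country", "production_tons"]]
print(top)

# Pairwise Pearson correlations between the numerical features.
# The matrix can be drawn with seaborn, e.g. sns.heatmap(corr, annot=True).
corr = df[NUM_COLS].corr()
print(corr.round(2))

# If production were strictly proportional to acreage, yield would be the
# same everywhere, so the coefficient of variation of yield would be 0.
yield_cv = df["yield_kg_per_ha"].std() / df["yield_kg_per_ha"].mean()
print(f"yield coefficient of variation: {yield_cv:.2f}")
```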

Tech Stack

Library	Purpose
`pandas`	Data loading, cleaning, and manipulation
`matplotlib`	Base plotting framework
`seaborn`	Statistical visualisations
`IPython.display`	Custom HTML/CSS rendering in notebook

How to Run

Option 1 — Kaggle (recommended)

Open the notebook on Kaggle.
Fork the notebook and run all cells directly in the Kaggle environment — the dataset path is pre-configured.

Option 2 — Local

# Clone the repository
git clone https://github.com/komalharshita/sugarcane-eda.git
cd sugarcane-eda

# Install dependencies
pip install pandas matplotlib seaborn notebook

# Launch Jupyter
jupyter notebook eda-sugarcane.ipynb

Note: Update the dataset path in the loading cell:
df = pd.read_csv('List of Countries by Sugarcane Production.csv')

References

Sugarcane Dataset — Kaggle (komalharshita)
Seaborn Documentation
Matplotlib Documentation
Food and Agriculture Organization of the United Nations (FAO) — FAOSTAT crop production data.

Author

Komal Harshita, Computer Science Student

If you found this project useful, consider giving it a ⭐ on GitHub and an upvote on Kaggle!
