komalharshita/EDA-Sugarcane-Project


💚 Global Sugarcane Production — EDA

Exploring production patterns, acreage, and yield across countries and continents




Overview

Sugarcane is one of the world's most economically significant crops, accounting for approximately 80% of global sugar production and serving as a primary feedstock for sugar, biofuel, molasses, and ethanol. It is cultivated across tropical and subtropical regions spanning multiple continents.

This project performs a focused Exploratory Data Analysis (EDA) on a dataset of global sugarcane production statistics — examining how production volume, cultivated acreage, and yield efficiency vary across countries and continents.


Table of Contents

  1. Dataset
  2. Project Structure
  3. Data Cleaning Steps
  4. Analysis Sections
  5. Key Findings
  6. Tech Stack
  7. How to Run
  8. Author

Dataset

| Attribute | Details |
|---|---|
| Source | Sugarcane Dataset — Kaggle (komalharshita) |
| File | List of Countries by Sugarcane Production.csv |
| Rows | One row per country |
| Target | No target — this is an unsupervised EDA |

Feature Summary

| # | Column | Type | Description |
|---|---|---|---|
| 1 | Country | Categorical | Name of the country |
| 2 | Continent | Categorical | Continent the country belongs to |
| 3 | Production (Tons) | Numerical | Total sugarcane produced (metric tons) |
| 4 | Production per Person (Kg) | Numerical | Per-capita sugarcane production (kg) |
| 5 | Acreage (Hectare) | Numerical | Area under sugarcane cultivation (hectares) |
| 6 | Yield (Kg / Hectare) | Numerical | Productivity per unit area (kg per hectare) |

Project Structure

sugarcane-eda/
│
├── sugarcane_eda.ipynb                             # Main EDA notebook (portfolio version)
├── List of Countries by Sugarcane Production.csv   # Raw dataset
└── README.md                                       # This file

Data Cleaning Steps

The raw dataset requires significant cleaning before analysis is possible:

  1. String formatting — Numerical columns use European formatting (. as thousands separator, , as decimal separator). The thousands separators are stripped and the decimal commas replaced with dots so the values can be parsed as Python floats.
  2. Column renaming — Spaces and special characters replaced to allow clean pandas attribute access.
  3. Missing value handling — Null rows identified and removed; index reset.
  4. Dropping redundant columns — index and Unnamed: 0 columns removed after reset.
  5. Type conversion — All numerical columns cast from object (string) to float.
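
The five steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the inline sample rows are made up, the new column names and the `clean_sugarcane` helper are hypothetical, and reading everything as strings (`dtype=str`) is an assumption made so that values such as `10.000` are not silently parsed as the float 10.0.

```python
import io
import pandas as pd

# Illustrative sample in the raw layout (values are invented, not real data).
RAW_CSV = """,Country,Continent,Production (Tons),Production per Person (Kg),Acreage (Hectare),Yield (Kg / Hectare)
0,Alpha,Asia,1.234.567,"12,5",10.000,"123.456,7"
1,Beta,Africa,2.000,"0,4",,
2,Gamma,Europe,50.000,"3,25",400,"125.000,0"
"""

# Hypothetical attribute-friendly names; the notebook may use different ones.
RENAMES = {
    "Production (Tons)": "Production_Tons",
    "Production per Person (Kg)": "Production_per_Person_Kg",
    "Acreage (Hectare)": "Acreage_Hectare",
    "Yield (Kg / Hectare)": "Yield_Kg_per_Hectare",
}


def clean_sugarcane(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the five cleaning steps to a frame read with dtype=str."""
    df = raw.rename(columns=RENAMES)          # step 2: column renaming
    numeric_cols = list(RENAMES.values())
    for col in numeric_cols:                  # step 1: string formatting
        df[col] = (
            df[col]
            .str.replace(".", "", regex=False)   # drop thousands separators
            .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
        )
    df = df.dropna().reset_index()            # step 3: drop null rows, reset index
    df = df.drop(columns=["index", "Unnamed: 0"])  # step 4: redundant columns
    return df.astype({col: float for col in numeric_cols})  # step 5: cast to float


cleaned = clean_sugarcane(pd.read_csv(io.StringIO(RAW_CSV), dtype=str))
print(cleaned)
```

As an alternative, `pd.read_csv` accepts `thousands="."` and `decimal=","`, which may make the manual string swap unnecessary for a file formatted this way.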

Analysis Sections

| Section | Description |
|---|---|
| 6.1 | Continent-wise country count — bar chart of producing countries per continent |
| 6.2 | Distribution plots — histogram + KDE for all four numerical features |
| 6.3 | Boxplots — spread, IQR, and outlier detection per feature |
| 6.4 | Summary statistics — mean, std, quartiles, min/max |
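
To make the outlier detection in section 6.3 concrete, here is a small sketch of the conventional 1.5 × IQR rule, which is also the default whisker length for matplotlib and seaborn boxplots. The `iqr_outliers` helper and the sample values are hypothetical and are not taken from the notebook or the dataset.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Illustrative numbers only: six similar values and one very large one.
production = pd.Series([10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 500.0],
                       name="Production_Tons")

print(production.describe())     # mean, std, quartiles, min/max (section 6.4)
print(iqr_outliers(production))  # the single extreme value is flagged
```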

Key Findings

  1. Asia and South America account for the largest number of sugarcane-producing countries, reflecting the crop's preference for tropical and subtropical climates.

  2. Production (Tons) and Acreage (Hectare) are both strongly right-skewed — a small number of countries produce the vast majority of the world's sugarcane (notably Brazil and India).

  3. Yield (Kg/Hectare) shows less extreme skew than raw production, suggesting that some smaller-acreage countries achieve competitive yield efficiency through advanced agricultural practices.

  4. Production per Person (Kg) varies widely — countries with small populations but large cultivation areas show very high per-capita values, even if total absolute production is modest.

  5. The boxplots confirm that outliers dominate the production and acreage distributions. Log-transformation of these columns is recommended before any downstream modelling.
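
As a rough illustration of why a log transform is suggested for heavily right-skewed columns, the sketch below compares sample skewness before and after `np.log1p`. The values are invented to mimic a few dominant producers and are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative values: most are small, two are orders of magnitude larger.
production = pd.Series([1e3, 5e3, 2e4, 8e4, 3e5, 4e7, 7e8],
                       name="Production_Tons")

raw_skew = production.skew()            # skewness on the raw scale
log_skew = np.log1p(production).skew()  # skewness after log(1 + x)

print(f"raw skew: {raw_skew:.2f}")
print(f"log skew: {log_skew:.2f}")
```

`log1p` is used here instead of a plain logarithm so that a zero value, if one were present, would not produce negative infinity.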


Recommended Next Steps

  • Apply log-transformation to Production, Acreage, and Yield before further analysis
  • Perform bivariate analysis: Acreage vs Production, Yield vs Continent
  • Identify the top 10 producers and visualise them on a world map (geopandas / plotly)
  • Investigate whether higher acreage always leads to proportionally higher production
  • Add a correlation heatmap to surface correlations between all numerical features
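
Several of these steps can be started with plain pandas. The sketch below is only a starting point under stated assumptions: the data frame is invented, and the column names (`Production_Tons`, `Acreage_Hectare`, `Tons_per_Hectare`) are hypothetical stand-ins for whatever the cleaned notebook columns are called.

```python
import pandas as pd

# Illustrative frame; neither the countries nor the numbers are real.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Production_Tons": [900.0, 400.0, 150.0, 60.0, 20.0],
    "Acreage_Hectare": [10.0, 5.0, 2.0, 1.0, 0.5],
})

# Top producers (use 10 instead of 3 on the full dataset).
top = df.nlargest(3, "Production_Tons")

# Pairwise Pearson correlations between the numerical columns.
corr = df[["Production_Tons", "Acreage_Hectare"]].corr()

# If production were strictly proportional to acreage, this ratio would be constant.
df["Tons_per_Hectare"] = df["Production_Tons"] / df["Acreage_Hectare"]

print(top)
print(corr)
print(df[["Country", "Tons_per_Hectare"]])
```

The `corr` matrix could then be passed to `seaborn.heatmap` for the heatmap, and a ratio that varies across countries would suggest that higher acreage does not translate proportionally into higher production.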

Tech Stack

| Library | Purpose |
|---|---|
| pandas | Data loading, cleaning, and manipulation |
| matplotlib | Base plotting framework |
| seaborn | Statistical visualisations |
| IPython.display | Custom HTML/CSS rendering in notebook |

How to Run

Option 1 — Kaggle (recommended)

  1. Open the notebook on Kaggle.
  2. Fork the notebook and run all cells directly in the Kaggle environment — the dataset path is pre-configured.

Option 2 — Local

# Clone the repository
git clone https://github.com/komalharshita/EDA-Sugarcane-Project.git
cd EDA-Sugarcane-Project

# Install dependencies
pip install pandas matplotlib seaborn notebook

# Launch Jupyter
jupyter notebook sugarcane_eda.ipynb

Note: Update the dataset path in the loading cell:

import pandas as pd
df = pd.read_csv('List of Countries by Sugarcane Production.csv')


Author

Komal Harshita, Computer Science Student


If you found this project useful, consider giving it a ⭐ on GitHub and an upvote on Kaggle!
