Supply Chain Emissions Modeling Using Industry and Commodity Data (2010–2016) Problem Statement:
You have annual supply chain emission data from 2010–2016 categorized into industries and commodities. The goal is to develop a regression model that can predict the Supply Chain Emission Factors with Margins based on descriptive and quality metrics (substance, unit, reliability, temporal/geographical/technological/data collection correlations, etc.).
To analyze and predict greenhouse gas (GHG) emissions from various U.S. industries and commodities using official supply chain data sourced from data.gov.
The aim is to build robust regression models that can predict GHG Emissions (kg CO₂ equivalent) based on industry, commodity, and quality-related descriptive features.
- Dataset Name: Supply Chain Greenhouse Gas Emission Factors
- Source: data.gov
- Years Covered: 2010 – 2016
The dataset contains annual supply chain GHG emission factors across various U.S. industries and commodities.
- Multiple Excel sheets loaded from 2010–2016.
- Combined Industry and Commodity data into one dataset.
- Handle missing values (drop or impute).
- Convert data types to appropriate formats.
- Standardize column names.
- Remove duplicate rows.
- Encode categorical features (e.g., Industry_Name, Commodity, Substance, Source).
- Normalize/scale numeric columns for model compatibility.
- Analyzed distributions of GHG emissions.
- Investigated relationships between descriptive metrics and emission factors.
- Detected outliers and patterns across industries and commodities.
GHG_Emissions_kgCO2e
- Industry & Commodity Codes
- Reliability and Correlation scores
- Year, Source, Substance, Units (encoded)
- Linear Regression (Baseline)
- Random Forest Regression
- (Optional: Other regressors like XGBoost, SVR for comparison)
| Metric | Description |
|---|---|
| RMSE | Root Mean Squared Error |
| MAE | Mean Absolute Error |
| R² Score | Coefficient of Determination |
- Language: Python 3.x
- Libraries:
- Pandas
- NumPy
- Scikit-learn
- Matplotlib
- Seaborn
1️⃣ Step 1: Import Required Libraries
2️⃣ Step 2: Load Dataset
3️⃣ Step 3: Data Preprocessing (Cleaning + EDA + Encoding)
4️⃣ Step 4: Split Dataset (Train-Test Split)
5️⃣ Step 5: Model Training
6️⃣ Step 6: Prediction and Evaluation
7️⃣ Step 7: Hyperparameter Tuning
8️⃣ Step 8: Comparative Study and Best Model Selection
Satyam Mishra