A machine learning solution for the Kaggle Forest Cover Type Prediction competition, achieving competitive scores through advanced feature engineering and ensemble methods.
The goal is to predict the forest cover type (7 categories) for 30m x 30m cells in Roosevelt National Forest, Colorado, using only cartographic variables—no remotely sensed data.
Cover Types:
- Spruce/Fir
- Lodgepole Pine
- Ponderosa Pine
- Cottonwood/Willow
- Aspen
- Douglas-fir
- Krummholz
| Feature Category | Description |
|---|---|
| Distance Metrics | Euclidean & Manhattan distances to hydrology |
| Distance Ratios | Hydrology/Road, Hydrology/Fire, Road/Fire ratios |
| Elevation Interactions | Combined elevation with vertical hydrology distance |
| Hillshade Statistics | Mean and range across time points (9am, noon, 3pm) |
| Aspect Decomposition | North-South and East-West components (cos/sin transform) |
| Categorical IDs | Wilderness Area ID, Soil Type ID from one-hot columns |
| Domain Features | Climate zone, stony soil indicator |
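A minimal pandas sketch of how these feature groups can be derived. The column names are the standard ones from the Kaggle dataset; the exact formulas are illustrative and may differ from the notebook:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of the feature groups listed above."""
    out = df.copy()
    hx = out["Horizontal_Distance_To_Hydrology"]
    hy = out["Vertical_Distance_To_Hydrology"]
    # Distance metrics: Euclidean and Manhattan distance to hydrology
    out["Euclidean_To_Hydrology"] = np.sqrt(hx**2 + hy**2)
    out["Manhattan_To_Hydrology"] = hx.abs() + hy.abs()
    # Distance ratios (epsilon guards against division by zero)
    eps = 1e-6
    road = out["Horizontal_Distance_To_Roadways"]
    fire = out["Horizontal_Distance_To_Fire_Points"]
    out["Hydro_Road_Ratio"] = hx / (road + eps)
    out["Hydro_Fire_Ratio"] = hx / (fire + eps)
    out["Road_Fire_Ratio"] = road / (fire + eps)
    # Elevation interaction with vertical hydrology distance
    out["Elevation_Minus_VHydro"] = out["Elevation"] - hy
    # Hillshade statistics across the three time points
    shade = out[["Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm"]]
    out["Hillshade_Mean"] = shade.mean(axis=1)
    out["Hillshade_Range"] = shade.max(axis=1) - shade.min(axis=1)
    # Aspect decomposition into North-South / East-West components
    aspect_rad = np.deg2rad(out["Aspect"])
    out["Aspect_NS"] = np.cos(aspect_rad)
    out["Aspect_EW"] = np.sin(aspect_rad)
    # Categorical IDs recovered from the one-hot columns
    wa_cols = [c for c in out.columns if c.startswith("Wilderness_Area")]
    st_cols = [c for c in out.columns if c.startswith("Soil_Type")]
    out["Wilderness_Id"] = out[wa_cols].to_numpy().argmax(axis=1)
    out["Soil_Id"] = out[st_cols].to_numpy().argmax(axis=1)
    return out
```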
Cross-validation target encoding with smoothing to prevent data leakage (sketched after this list):
- Encoding computed only on training folds
- Test set receives averaged encodings across all folds
- Smoothing blends category mean with global mean for rare categories
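A minimal sketch of the scheme, assuming a pandas/scikit-learn workflow. The helper `target_encode` and its smoothing constant are illustrative, and the label is treated as a single numeric target for brevity; the notebook's multiclass handling (e.g. per-class encodings) may differ:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def target_encode(train, test, col, target, n_folds=7, smoothing=10.0, seed=42):
    """Out-of-fold target encoding with additive smoothing toward the global mean."""
    global_mean = train[target].mean()
    oof = pd.Series(np.nan, index=train.index, dtype=float)
    test_parts = []
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in skf.split(train, train[target]):
        stats = train.iloc[fit_idx].groupby(col)[target].agg(["mean", "count"])
        # Blend the category mean with the global mean; rare categories
        # (small count) are pulled toward the global mean.
        enc = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
            stats["count"] + smoothing
        )
        # Encoding is computed only on the fitting folds, then applied to the held-out fold
        oof.iloc[enc_idx] = train[col].iloc[enc_idx].map(enc).fillna(global_mean).to_numpy()
        test_parts.append(test[col].map(enc).fillna(global_mean))
    # The test set receives the encoding averaged across all folds
    test_encoded = pd.concat(test_parts, axis=1).mean(axis=1)
    return oof, test_encoded
```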
Three gradient boosting models with optimized hyperparameters (an illustrative setup follows the table):
| Model | Key Strengths |
|---|---|
| LightGBM | Fast training, efficient memory usage |
| XGBoost | Robust regularization, histogram-based |
| CatBoost | Excellent categorical handling, symmetric trees |
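As a rough illustration of how the three models might be instantiated. The hyperparameter values below are placeholders, not the tuned values from the notebook:

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

SEED = 42  # Matches the project's reproducibility seed

# Placeholder configurations; the actual tuned hyperparameters live in the notebook.
models = {
    "lgbm": LGBMClassifier(n_estimators=1000, learning_rate=0.05,
                           num_leaves=63, random_state=SEED),
    "xgb": XGBClassifier(n_estimators=1000, learning_rate=0.05, max_depth=8,
                         tree_method="hist", random_state=SEED),
    "cat": CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=8,
                              random_seed=SEED, verbose=0),
}
```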
- Stacking: Logistic Regression meta-learner trained on out-of-fold predictions
- Blending: Weighted average (LightGBM: 45%, CatBoost: 35%, XGBoost: 20%); both ensembles are sketched below
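A condensed sketch of both ensembles, assuming NumPy feature matrices and the `models` dict from the snippet above. `stack_and_blend` is a hypothetical helper, not the actual code in `stacking.py`:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def stack_and_blend(models, X, y, X_test, n_folds=7, seed=42):
    """Out-of-fold stacking with a LogisticRegression meta-learner, plus a weighted blend."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    classes = np.unique(y)
    oof = {name: np.zeros((len(X), len(classes))) for name in models}
    test_proba = {name: np.zeros((len(X_test), len(classes))) for name in models}
    for tr_idx, va_idx in skf.split(X, y):
        for name, model in models.items():
            m = clone(model).fit(X[tr_idx], y[tr_idx])       # fresh clone per fold
            oof[name][va_idx] = m.predict_proba(X[va_idx])   # out-of-fold predictions
            test_proba[name] += m.predict_proba(X_test) / n_folds  # fold-averaged test probs
    # Stacking: meta-learner trained on the concatenated out-of-fold probabilities
    meta = LogisticRegression(max_iter=1000)
    meta.fit(np.hstack([oof[n] for n in models]), y)
    stack_pred = meta.predict(np.hstack([test_proba[n] for n in models]))
    # Blending: weighted average of class probabilities (weights from above)
    weights = {"lgbm": 0.45, "cat": 0.35, "xgb": 0.20}
    blend_proba = sum(w * test_proba[n] for n, w in weights.items())
    blend_pred = classes[blend_proba.argmax(axis=1)]
    return stack_pred, blend_pred
```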
Project structure:
```
├── forest_cover_type_ensemble.ipynb   # Main notebook (run this)
├── stacking.py                        # Original Python script
├── train.csv                          # Training data
├── test-full.csv                      # Test data
├── full_submission.csv                # Sample submission format
├── submission_stacking_*.csv          # Stacking predictions
├── submission_blending_*.csv          # Blending predictions
└── README.md
```
Requirements:
```
numpy
pandas
scikit-learn
lightgbm
xgboost
catboost
tqdm
```
Install the dependencies and launch the notebook:
```bash
pip install numpy pandas scikit-learn lightgbm xgboost catboost tqdm
jupyter notebook forest_cover_type_ensemble.ipynb
```
Run the cells sequentially. The notebook includes detailed explanations for each step.
Alternatively, run the original script directly:
```bash
python stacking.py
```
Key parameters in the notebook/script:
```python
N_FOLDS = 7  # Cross-validation folds
SEED = 42    # Random seed for reproducibility
```
Model hyperparameters are pre-tuned for optimal performance. Training takes approximately 15-30 minutes depending on hardware.
The solution generates two submission files:
| File | Description |
|---|---|
| `submission_stacking_*.csv` | Meta-learner ensemble |
| `submission_blending_*.csv` | Weighted average ensemble |
Submit both to Kaggle to compare leaderboard performance.
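If the official Kaggle CLI is installed and authenticated, submission can be done from the command line. The competition slug below is an assumption based on the competition name, and the filename is a placeholder for the generated file:

```bash
# Replace the filename with your generated submission_stacking_*.csv / submission_blending_*.csv
kaggle competitions submit -c forest-cover-type-prediction -f submission.csv -m "stacking ensemble"
```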
- Hyperparameter optimization with Optuna/GridSearch
- Additional models (ExtraTrees, Neural Networks)
- Feature selection using permutation importance
- Pseudo-labeling with high-confidence test predictions
- More aggressive feature interactions
This project is for educational purposes as part of HEC coursework.