PREPARE_2_social

Code for the PREPARE Challenge - Phase 2: Model Arena (Social Determinants Track), hosted by the National Institute on Aging via DrivenData.

Files

ViewRawData.R: R code that uses the ggplot2 and gridExtra packages to plot the relationship between each individual feature and the target variable (composite cognitive score).
HandingTemporalFeaturesandTargets.py: Community code contribution. Details different ways to handle survey year and score year so that the training data better represents the temporal relationships.
scikit_models.py: Python code that uses scikit-learn and supporting packages (e.g. pandas, NumPy) to create and train the models used in the main file.
train_scikit.py: Main training file for the scikit-learn (statistical machine learning) models. Cleans the raw data, creates train/validation splits, and calls the training functions from scikit_models.py. Aggregates the results, selects the model with the best RMSE, applies it to the test set, and formats the resulting target predictions for submission.
manipulateSavedModels.py: Extracts the features used by each best model and the weights assigned to those features, to assess feature importance for interpretation.
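
The selection step described for train_scikit.py (fit several models, compare validation RMSE, keep the best) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the repository's code: the function name `pick_best_model`, the candidate models, and the 80/20 split are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def pick_best_model(X, y, candidates, seed=0):
    """Fit each candidate on a train split; return the one with the lowest validation RMSE.

    Hypothetical helper for illustration; not taken from scikit_models.py.
    """
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scores = {}
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)
        mse = mean_squared_error(y_val, model.predict(X_val))
        scores[name] = float(np.sqrt(mse))  # RMSE on the validation split
    best = min(scores, key=scores.get)
    return best, candidates[best], scores


# Synthetic stand-in for the cleaned feature matrix and composite score.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

candidates = {"lasso": Lasso(alpha=0.1), "ridge": Ridge(alpha=1.0)}
best_name, best_model, val_rmse = pick_best_model(X, y, candidates)
print(best_name, val_rmse)
```

The returned `best_model` is already fitted on the training split, so it can be applied directly to held-out test features with `best_model.predict(...)`.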

Folders

/two_scores: model and train files designed around data transformed to predict both the 2016 and 2021 scores for every uid.
/year_and_score: model and train files designed around a target file that includes a column of years and a column of scores.
/score_only: model and train files designed around score prediction only (no year).
/two_model: trains each model on only one target year, then pulls year-specific predictions from the corresponding model based on the year in the submission file.
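
As a rough illustration of the /two_model idea, the sketch below fits one model per target year and then routes each submission row to the model for its year. The column names (`uid`, `year`, `x`, `composite_score`), the toy data, and the use of `LinearRegression` are assumptions for the example; the actual files in the folder may be organized differently.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data: one feature, two target years (hypothetical columns).
train = pd.DataFrame({
    "uid": ["a", "b", "c", "d", "e", "f"],
    "year": [2016, 2016, 2016, 2021, 2021, 2021],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "composite_score": [10.0, 20.0, 30.0, 110.0, 120.0, 130.0],
})
submission = pd.DataFrame({
    "uid": ["g", "h"],
    "year": [2016, 2021],
    "x": [4.0, 4.0],
})

# Train one model per target year, using only that year's rows.
models = {
    year: LinearRegression().fit(group[["x"]], group["composite_score"])
    for year, group in train.groupby("year")
}

# Pull each prediction from the model that matches the row's year.
submission["composite_score"] = 0.0
for year, rows in submission.groupby("year"):
    submission.loc[rows.index, "composite_score"] = models[year].predict(rows[["x"]])

print(submission)
```

Keeping the models in a dictionary keyed by year means the routing step stays the same if a different estimator is chosen for each year.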

Results

Best results obtained:

two_scores
- test RMSE: 39.215 || model: xgboost + pls + kernelRidge || features: top 4 features of the decomposed space | all features
year_and_score
- test RMSE: || model: || features:
score_only
- test RMSE: 40.966 || model: Lasso || features: feature set reduced by removing features with >80% NaN or <4% absolute correlation with the target
two_model
- test RMSE: 40.8529 || model: 2021: linear SVR | 2016: Lasso || features: feature set reduced by removing features with >80% NaN or <4% absolute correlation with the target
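
The feature reduction mentioned in the results (dropping features with more than 80% missing values or less than 4% absolute correlation with the target) could look something like the sketch below. The function name `filter_features`, its parameter names, and the toy columns are hypothetical; the use of Pearson correlation via `DataFrame.corrwith` is an assumption, since the correlation measure is not stated here.

```python
import numpy as np
import pandas as pd


def filter_features(X, y, max_nan_frac=0.80, min_abs_corr=0.04):
    """Drop columns that are mostly missing or nearly uncorrelated with y.

    Hypothetical helper; assumes all columns of X are numeric.
    """
    nan_frac = X.isna().mean()                 # fraction of NaN per column
    kept = X.loc[:, nan_frac <= max_nan_frac]  # remove columns with >80% NaN
    abs_corr = kept.corrwith(y).abs()          # Pearson correlation with target
    # A NaN correlation (e.g. a constant column) compares False and is dropped.
    return kept.loc[:, abs_corr >= min_abs_corr]


n = 100
y = pd.Series(np.arange(n, dtype=float), name="target")
X = pd.DataFrame({
    "strong": 2.0 * y + 1.0,                               # perfectly correlated
    "mostly_nan": np.where(np.arange(n) < 90, np.nan, y),  # 90% missing
    "uncorrelated": (y - y.mean()) ** 2,                   # symmetric, so corr ~ 0
})

filtered = filter_features(X, y)
print(list(filtered.columns))
```

Both thresholds are exposed as arguments so the 80% and 4% cut-offs can be varied when comparing feature sets.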

Code info

Code was developed using

Windows 10
Python 3.8 (Anaconda distribution, Spyder 5.5.1)
- NumPy 1.24.3
- pandas 2.0.3
- scikit-learn 1.3.0
- lightgbm 4.5.0
- py-xgboost 2.1.1
R 3.6.2 (RStudio)
- ggplot2 3.2.1
- gridExtra 2.3
