Scrape Data
- sos_list_scraper.py
  - scrapes strength of schedule data
- gamelog_scraper.py
  - scrapes team gamelog data
- player_scraper.py
  - scrapes roster and player per 100 possession data
Transform Data
- gamelog_stats_transform.py
- data_merger.py
  - concats seasons
  - merges roster and player_per100 data
- position_cluster.py
  - creates position clusters
  - creates team experience factor
- matchup_creator.py
  - merges gamelogs, clustering and experience dataframes
  - slices up dataframe to create matchups
  - saves final_model_data
Modelling
- model_optimization.py - skipping this step for now
- model_dumper.py
- model_tests.py
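The update notes further down describe model_dumper.py as training its testing models on all data up to the current season and testing them on the current season. A minimal sketch of that kind of season hold-out split is shown here. The `split_by_season` helper, the `season` key, and the toy rows are hypothetical and are not taken from the repo. "Up to the current season" is read as strictly earlier seasons, which is an assumption.

```python
def split_by_season(rows, current_season, season_key="season"):
    """Split matchup rows into (train, test) for a season hold-out.

    Assumption: each row is a dict carrying a numeric season under
    `season_key`; the real model data may be laid out differently.
    """
    # Train on every season strictly before the current one.
    train = [r for r in rows if r[season_key] < current_season]
    # Test only on the current season.
    test = [r for r in rows if r[season_key] == current_season]
    return train, test


# Made-up rows, only to show the shape of the split.
rows = [
    {"season": 2019, "team": "team_a", "won": 1},
    {"season": 2020, "team": "team_b", "won": 0},
    {"season": 2021, "team": "team_c", "won": 1},
]
train, test = split_by_season(rows, current_season=2021)
print(len(train), len(test))
```

Holding out the newest season keeps the test set out of training, which matches how the models are used in practice: fit on the past, then predict a tournament that has not happened yet.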
Predicting
- win_or_lose.py
- bracket_generator.py
Utils
- scraping_utils.py
- filters.py
TODO for next update
- update how Game Type dates are handled in add_game_type in gamelog_scraper.py - use csv and more efficient logic
- repull historic data
- add season to file names on files that need to be archived
  - player_per100_full_date.pkl
  - player_stats_full.pkl
  - roster_full_data.csv
  - season-full_game_log_stats_data.pkl
  - team_clusters.pkl
  - team_experience.pkl
  - exp_gamelog_clust
- add full update shell script
  - archive past year files
  - run all scripts in order
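One possible shape for the "run all scripts in order" part of the full update script, written in Python rather than shell to match the rest of the repo. This is only a sketch: the order follows the sections above, model_optimization.py and model_tests.py are left out, and the `build_commands` and `run_pipeline` names, the `script_dir` argument, and the assumption that every script can be run with no arguments are all hypothetical. Archiving past year files is not covered here.

```python
import subprocess
import sys
from pathlib import Path

# Order taken from the Scrape / Transform / Modelling / Predicting sections.
PIPELINE = [
    "sos_list_scraper.py",
    "gamelog_scraper.py",
    "player_scraper.py",
    "gamelog_stats_transform.py",
    "data_merger.py",
    "position_cluster.py",
    "matchup_creator.py",
    "model_dumper.py",
    "bracket_generator.py",
]


def build_commands(scripts, script_dir="."):
    """Return one command (argument list) per script, in pipeline order."""
    return [[sys.executable, str(Path(script_dir) / name)] for name in scripts]


def run_pipeline(scripts=PIPELINE, script_dir=".", dry_run=False):
    """Run each script in order, stopping at the first failure."""
    for cmd in build_commands(scripts, script_dir):
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps do not run after an earlier step has failed.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing anything, which is a cheap way to check the order before a real update.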
TODO
- Save final brackets from each tourney
- vectorize height data in the data_merger.py player_roster_merger func
- fill missing height data in rosters instead of dropping all rows with NaNs
- create archiving script to move all yearly data files to respective archive folders
- Create a testing framework to see which models are best
  - for full bracket
  - for each round
    - i.e. is tcf better in early rounds?
- Can I create some ensemble of these models for better performance?
Annual Update Process
- Test all scraping scripts early in case the website formatting has changed
- Add new season to seasons_list.txt
- sos_list_scraper.py
  - creates sos_list{season}.csv in 0_scraped_data dir
- gamelog_scraper.py
  - update add_game_type func
    - add new year season/tourney start and end dates
    - update if else section with new conditions
  - saves season_{season}_gamelog_data.pkl to 0_scraped_data dir
- player_scraper.py
  - saves player_per100_{season}_data.pkl & roster_{season}_data.csv to 0_scraped_data dir
- gamelog_stats_transform.py
  - saves season_{season}_gamelog_stats_data.pkl and season_{season}_gamelog_final_stats_data.pkl to 1_transformed_data dir
- data_merger.py
  - Manual: archive data to year specific folder
  - saves files to 2_full_season_data dir
    - player_per100_full-{season}_data.pkl
    - roster_full-{season}_data.csv
    - season_full-{season}_gamelog_stats_data.pkl
    - player_stats_full-{season}.pkl
- position_cluster.py
  - saves files to 2_full_season_data dir
    - team_clusters-{season}.pkl
    - team_experience-{season}.pkl
- matchup_creator.py
  - Manual: archive data to year specific folder
  - creates gamelog_exp_clust-{season}.pkl & season{season}_final_stats.pkl in 3_model_data dir
- model_dumper.py
  - Manual: archive fit models to year specific folder
  - trains logistic regression, random forest and gradient boosting models for testing and prediction
  - testing models are trained on all data up to the current season and tested on the current season to assess model hyperparameter performance
  - prediction models are trained on the current season with optimal hyperparameters for use in bracket creation
  - saves models in fit_models dir
- model_test.py
  - Not updated
  - tests models in fit_models dir and prints results
- winner_predictor.py
  - Not updated
  - can run to manually predict outcome of a matchup
- bracket_generator.py
  - Manual: archive past year's brackets
  - create new initial bracket for current season's tournament
  - Note: added functionality to pick the winner of each game based on each team's probability of success.
- bracket_scorer.py
  - Manual: add actual bracket
    - https://www.ncaa.com/brackets/basketball-men/d1/2021
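The bracket_generator.py note about picking winners based on each team's probability of success can be illustrated with a small sketch. This is not the repo's implementation: `pick_winner` is a hypothetical helper, the 0.7 win probability is made up, and in practice the probability would presumably come from one of the fit models.

```python
import random


def pick_winner(team_a, team_b, prob_a_wins, rng=random):
    """Return team_a with probability prob_a_wins, otherwise team_b."""
    if not 0.0 <= prob_a_wins <= 1.0:
        raise ValueError("prob_a_wins must be between 0 and 1")
    # rng.random() is uniform on [0, 1), so the comparison below is
    # True with probability prob_a_wins.
    return team_a if rng.random() < prob_a_wins else team_b


# Seeded generator so a generated bracket can be reproduced.
rng = random.Random(2021)
picks = [pick_winner("gonzaga", "virginia", 0.7, rng) for _ in range(1000)]
print(picks.count("gonzaga"), "of", len(picks), "simulated games went to gonzaga")
```

Sampling the winner instead of always taking the favorite lets repeated runs produce different brackets with upsets in proportion to the predicted probabilities. Always advancing the higher-probability team would produce the same bracket on every run.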
Kyles $50:
- virginia
Loop $75:
- virginia
- gonzaga
- connecticut
Nates $40:
- michigan - splendid
- connecticut - swell
- virginia - splendid
- gonzaga - marvelous