This project builds a reproducible forecasting workflow for WMATA daily ridership. It prepares raw WMATA ridership exports, creates forecast-safe modeling datasets, compares several forecasting approaches, selects production models, generates 30-day Bus and Rail forecasts, and writes validation tables, diagnostics, and presentation-ready graphics.
The main goal is not only to predict ridership, but also to make the prediction process understandable and reviewable. The workflow is designed so a WMATA reviewer can trace how raw data becomes model inputs, how models are compared, why the selected models are used, and what limitations remain.
- Raw WMATA ridership exports are placed in data/raw/.
- The pipeline standardizes those exports into bronze data.
- Bronze data is aggregated into daily Bus, Rail station, Rail unassigned, and system total tables.
- Forecast-safe features are added, such as calendar fields, lags, rolling means, and same-weekday history.
- Models are trained and evaluated with chronological validation.
- The selected models produce future 30-day forecasts.
- Tables, figures, diagnostics, and slideshow-ready charts are written to output folders.
The full pipeline is orchestrated by _targets.R, which defines each step as a reproducible target.
- Use only information that would be available at prediction time.
- Validate models through time, not with random train/test splits.
- Compare advanced models against simple baselines before selecting them.
- Keep Bus and Rail modeling separate where their operating patterns differ.
- Model Rail at the station level where possible, then aggregate to systemwide forecasts.
- Preserve unusual historical days, but allow known disruptions to be excluded from model training.
- Generate outputs that are useful for both technical review and operational communication.
Raw data comes from WMATA Daily Ridership Portal exports and is expected under data/raw/.
The project expects two source exports:
- A full detail export.
- A daily summary totals export.
These files are read as UTF-16 tab-separated data and cleaned into consistent column names and date formats.
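As an illustration of this step, here is a minimal base R sketch that reads a UTF-16 tab-separated file, standardizes the column names, parses dates, and re-saves the result as UTF-8 CSV. The helper name read_utf16_export(), the demo columns, the "UTF-16LE" byte order, and the %m/%d/%Y date format are all assumptions made for the example, not details confirmed by the project.

```r
# Sketch only: helper name, demo columns, byte order, and date format are assumed.
read_utf16_export <- function(path, encoding = "UTF-16LE") {
  raw <- read.delim(path, fileEncoding = encoding,
                    stringsAsFactors = FALSE, check.names = FALSE)
  # Standardize names: lower case, runs of non-alphanumerics become "_".
  clean <- tolower(gsub("[^A-Za-z0-9]+", "_", trimws(names(raw))))
  names(raw) <- gsub("^_+|_+$", "", clean)
  raw
}

# Demo: write a tiny UTF-16LE tab-separated file, then read it back.
demo_path <- tempfile(fileext = ".csv")
con <- file(demo_path, open = "w", encoding = "UTF-16LE")
writeLines(c("Service Date\tMode\tEntries",
             "01/02/2024\tBus\t100",
             "01/02/2024\tRail\t250"), con)
close(con)

bronze <- read_utf16_export(demo_path)
bronze$service_date <- as.Date(bronze$service_date, format = "%m/%d/%Y")

# Re-save as a UTF-8 CSV, mirroring what the bronze layer is described as doing.
bronze_path <- tempfile(fileext = ".csv")
write.csv(bronze, bronze_path, row.names = FALSE, fileEncoding = "UTF-8")
```

If a real export starts with a byte-order mark, passing encoding = "UTF-16" instead may be the safer choice.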
Bronze data is the cleaned version of the raw exports. It keeps the source-level structure while converting the exports into UTF-8 CSVs.
Key outputs:
- data/processed/bronze/wmata_ridership_full_utf8.csv
- data/processed/bronze/wmata_ridership_totals_utf8.csv
Silver data creates analysis-ready daily ridership tables:
- Bus daily ridership.
- Rail station daily ridership.
- Rail unassigned daily ridership.
- Mode-level daily totals.
Key outputs:
- data/processed/silver/ridership_bus_daily.csv
- data/processed/silver/ridership_rail_station_daily.csv
- data/processed/silver/ridership_rail_unassigned_daily.csv
- data/processed/silver/ridership_totals_daily.csv
Gold data is the model-ready layer. It adds forecast-safe features and separates Rail stations into main modeling and fallback groups.
Key outputs:
- data/processed/gold/bus_model_frame.csv
- data/processed/gold/rail_station_model_frame.csv
- data/processed/gold/rail_station_fallback_frame.csv
- data/processed/gold/rail_unassigned_frame.csv
- data/processed/gold/rail_system_daily.csv
The model features are intentionally limited to information that can be known before the forecast date.
Calendar and service features:
- Year.
- Month.
- ISO week of year.
- Day of week.
- Weekend flag.
- Holiday flag.
- Service type.
- Weekday/Saturday/Sunday grouping.
Historical ridership features:
- Prior-day ridership.
- 7-, 14-, 21-, and 28-day lags.
- Shifted rolling means over 7, 14, and 28 days.
- Average of prior same-weekday values.
Rail-specific features:
- Station name, for models that can use station identity.
- Station age in days.
- New-station flag.
The rolling features are shifted by one day so the model does not use the target day itself. This is a key leakage-prevention decision.
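The shifting rule can be sketched in base R as follows. The project lists slider for rolling features; this sketch uses plain vectors instead so it stays self-contained. The helper names and the four-week window for the same-weekday average are assumptions for illustration.

```r
# Sketch only: helper names and the 4-week same-weekday window are assumed.

# Lag a vector by k days; the first k entries have no history and are NA.
lag_vec <- function(x, k) c(rep(NA_real_, k), x)[seq_along(x)]

# Mean of the k days BEFORE day i (window i-k .. i-1), never day i itself.
shifted_roll_mean <- function(x, k) {
  vapply(seq_along(x), function(i) {
    if (i <= k) NA_real_ else mean(x[(i - k):(i - 1)])
  }, numeric(1))
}

# Average of the same weekday over the previous `weeks` weeks.
same_weekday_mean <- function(x, weeks = 4) {
  lags <- sapply(seq_len(weeks) * 7, function(k) lag_vec(x, k))
  rowMeans(lags)
}

ridership <- as.numeric(1:35)  # toy daily series, one value per day
features <- data.frame(
  ridership = ridership,
  lag_1 = lag_vec(ridership, 1),
  lag_7 = lag_vec(ridership, 7),
  roll_mean_7 = shifted_roll_mean(ridership, 7),
  same_weekday_4 = same_weekday_mean(ridership)
)
```

Because every window ends at day i-1, changing the value on day i leaves that day's features untouched, which is the property the leakage rule depends on.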
Bus is modeled as one systemwide daily ridership series.
This design is appropriate because the project is forecasting total daily MetroBus ridership rather than individual routes or stops. The Bus model uses the shared forecast-safe calendar and ridership-history features.
The Bus pipeline:
- Builds one daily modeling frame.
- Compares candidate models through rolling validation.
- Selects a final model using accuracy and stability criteria.
- Produces a 30-day forecast with empirical prediction intervals.
Rail is modeled station-first, then aggregated to systemwide Rail demand.
This is one of the most important design choices in the project. Rail ridership patterns differ by station, and station-level modeling preserves those differences better than a single aggregate-only model.
Rail stations are split into two groups:
- Main cohort: stations with at least 90% overall coverage and at least two years of pre-holdout history.
- Fallback cohort: newer or lower-coverage stations that should still be forecast, but should not drive the main station model.
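One way to express the cohort rule is sketched below. The column names, the holdout date, the definition of coverage (share of days between a station's first and last record that have data), and the reading of "two years" as 730 days are assumptions for illustration; the project may define them differently.

```r
# Sketch only: column names, coverage definition, and thresholds are assumed.
assign_cohort <- function(station_days, holdout_start,
                          min_coverage = 0.90, min_history_days = 730) {
  by_station <- split(station_days$date, station_days$station)
  rows <- lapply(names(by_station), function(s) {
    d <- by_station[[s]]
    span_days <- as.numeric(max(d) - min(d)) + 1
    coverage <- length(unique(d)) / span_days
    history_days <- as.numeric(holdout_start - min(d))
    data.frame(
      station = s,
      coverage = coverage,
      history_days = history_days,
      cohort = if (coverage >= min_coverage && history_days >= min_history_days)
        "main" else "fallback",
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

holdout_start <- as.Date("2024-01-01")  # hypothetical holdout boundary
old_dates <- seq(as.Date("2021-01-01"), as.Date("2023-12-31"), by = "day")
new_dates <- seq(as.Date("2023-06-01"), as.Date("2023-12-31"), by = "day")
gappy_dates <- old_dates[c(TRUE, FALSE)]  # every other day is missing

station_days <- data.frame(
  station = c(rep("Mature Station", length(old_dates)),
              rep("New Station", length(new_dates)),
              rep("Gappy Station", length(gappy_dates))),
  date = c(old_dates, new_dates, gappy_dates),
  stringsAsFactors = FALSE
)
cohorts <- assign_cohort(station_days, holdout_start)
```

In this toy data only the mature station lands in the main cohort: the new station has too little history, and the gappy station has roughly 50% coverage.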
Unassigned Rail rows are handled separately. They are excluded from station-level model training, tracked in QA outputs, forecast separately, and added back into systemwide Rail totals.
The final Rail system forecast is assembled from:
- Main station forecasts.
- Fallback station forecasts.
- Unassigned Rail forecasts.
The project does not assume that a complex model is automatically better. It compares a ladder of models:
- Annual seasonal naive benchmark.
- 7-day lag benchmark.
- Linear regression.
- GLMNET regularized regression.
- XGBoost.
The simple benchmarks are important because ridership has strong weekly and seasonal structure. A more complex model must beat these baselines to justify its use.
XGBoost is a gradient-boosted decision tree model. It builds many small regression trees in sequence, where each new tree tries to correct errors from the previous trees.
In this project, XGBoost predicts transformed ridership, using log1p(ridership), from forecast-safe features such as recent lags, rolling averages, holidays, service type, and station metadata for Rail.
XGBoost is useful here because ridership patterns can be nonlinear. For example, weekend behavior, holiday behavior, and recent ridership levels may interact in ways that a simple linear model may not capture.
The XGBoost tuning grid tests combinations of:
- Number of trees.
- Tree depth.
- Learning rate.
- Minimum node size.
- Loss reduction.
- Sample size.
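A grid over these six parameters could be built as in the sketch below. The column names follow the argument names of parsnip's boost_tree(); the candidate values are hypothetical and are not the project's actual grid.

```r
# Sketch only: the values below are illustrative, not the project's grid.
xgb_grid <- expand.grid(
  trees = c(300, 600),          # number of trees
  tree_depth = c(3, 6),         # tree depth
  learn_rate = c(0.03, 0.1),    # learning rate
  min_n = c(5, 20),             # minimum node size
  loss_reduction = c(0, 1),     # loss reduction required to split
  sample_size = c(0.7, 1.0)     # share of rows sampled per tree
)
n_combinations <- nrow(xgb_grid)
```

Two values per parameter already give 64 combinations, which is why tuning of this kind is often run in parallel.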
The production model is selected only after comparison against the full model ladder.
The project uses chronological validation, not random splitting.
Rolling validation tests monthly forecast origins across a calendar year. For each origin, the model trains on the past and forecasts the next 30 days. Errors are then summarized across all validation windows.
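A rolling-origin schedule of this kind might be built as in the sketch below. The function name, the validation year, and the choice of twelve month-start origins are assumptions for illustration.

```r
# Sketch only: function name, validation year, and origin count are assumed.
make_rolling_splits <- function(dates, first_origin,
                                n_origins = 12, horizon = 30) {
  origins <- seq(first_origin, by = "month", length.out = n_origins)
  lapply(seq_along(origins), function(i) {
    origin <- origins[i]
    list(
      origin = origin,
      # Train strictly on the past.
      train_dates = dates[dates < origin],
      # Score the next `horizon` days starting at the origin.
      test_dates = dates[dates >= origin & dates < origin + horizon]
    )
  })
}

dates <- seq(as.Date("2022-01-01"), as.Date("2024-12-31"), by = "day")
splits <- make_rolling_splits(dates, first_origin = as.Date("2024-01-01"))
```

Each split trains only on dates before its origin, so no validation window can see its own future.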
The main validation metric is MAE, or mean absolute error. MAE is the average absolute difference between predicted ridership and actual ridership. It is measured in riders, making it easy to interpret operationally.
The project also reports:
- RMSE.
- Bias.
- MAPE.
- SMAPE.
- R-squared.
- Horizon-specific performance.
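The metrics above can be written out in a few lines of base R. The project lists yardstick for this; the sketch below is only meant to make the definitions concrete. The sign convention for bias (predicted minus actual) and the R-squared form (one minus the ratio of residual to total sum of squares) are assumptions, and MAPE is undefined when an actual value is zero.

```r
# Sketch only: bias sign convention and R-squared form are assumed.
forecast_metrics <- function(actual, predicted) {
  err <- predicted - actual
  c(
    mae = mean(abs(err)),
    rmse = sqrt(mean(err^2)),
    bias = mean(err),
    mape = 100 * mean(abs(err) / abs(actual)),
    smape = 100 * mean(2 * abs(err) / (abs(actual) + abs(predicted))),
    r_squared = 1 - sum(err^2) / sum((actual - mean(actual))^2)
  )
}

actual <- c(100, 200, 300, 400)
predicted <- c(110, 190, 330, 380)
m <- forecast_metrics(actual, predicted)
```

For this toy input the MAE is 17.5 riders and the bias is 2.5 riders, meaning the predictions run slightly high on average.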
Holdout evaluation is reserved for a later chronological period so that model performance can be checked on data not used for model selection.
Model selection balances accuracy, stability, and interpretability.
The selection logic:
- Computes validation metrics for each candidate model.
- Requires candidate models to beat simple baselines.
- Prefers simpler models when accuracy is effectively tied.
- Allows XGBoost only when it clears accuracy and stability checks.
- Falls back to the 7-day lag baseline if no candidate model improves on the benchmarks.
This keeps model choice evidence-based instead of assuming the most complex model should win.
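The selection rules could be sketched roughly as follows. The 2% tie tolerance, the complexity ranking, the single logical stability flag, and all MAE values are assumptions for illustration; the project's actual thresholds are not stated here.

```r
# Sketch only: tolerance, complexity ranks, and the stable flag are assumed.
select_model <- function(results, fallback = "lag_7", tie_tolerance = 0.02) {
  # results: one row per model with columns
  # model, mae, stable, is_baseline, complexity.
  best_baseline_mae <- min(results$mae[results$is_baseline])
  # Candidates must beat every baseline and pass the stability check.
  ok <- results[!results$is_baseline &
                  results$mae < best_baseline_mae &
                  results$stable, ]
  if (nrow(ok) == 0) return(fallback)
  # Models within tie_tolerance of the best MAE count as tied;
  # among tied models, prefer the simplest.
  tied <- ok[ok$mae <= min(ok$mae) * (1 + tie_tolerance), ]
  tied$model[which.min(tied$complexity)]
}

results <- data.frame(
  model = c("seasonal_naive", "lag_7", "linear", "glmnet", "xgboost"),
  mae = c(9000, 7000, 6000, 5950, 5900),   # made-up validation MAE
  stable = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  is_baseline = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  complexity = c(0, 0, 1, 2, 3),
  stringsAsFactors = FALSE
)
chosen <- select_model(results)
```

With these made-up numbers all three candidates fall within 2% of the best MAE, so the sketch picks the linear model rather than XGBoost.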
Future forecasts are generated recursively. That means when forecasting multiple days ahead, earlier forecasted values become part of the history used to forecast later days.
The recursive forecast process:
- Starts from the latest available actual ridership history.
- Builds future calendar rows.
- Recomputes lag and rolling features day by day.
- Predicts each future date.
- Caps extreme predictions using recent history and fallback estimates.
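The steps above can be sketched with a toy predictor standing in for the fitted model. The function names, the two features used, and the cap of 1.5 times the 28-day maximum are assumptions for illustration; the project's capping rule also involves fallback estimates, which this sketch omits.

```r
# Sketch only: toy predictor, feature set, and capping rule are assumed.
recursive_forecast <- function(history, predict_fn,
                               horizon = 30, cap_mult = 1.5) {
  series <- history
  # Cap is computed once, from actual history only.
  cap <- cap_mult * max(tail(history, 28))
  for (h in seq_len(horizon)) {
    n <- length(series)
    # Features for day n + 1, rebuilt from actuals plus earlier forecasts.
    feats <- list(
      lag_7 = series[n - 6],
      roll_mean_7 = mean(series[(n - 6):n])
    )
    pred <- predict_fn(feats)
    pred <- min(max(pred, 0), cap)   # clamp to [0, cap]
    series <- c(series, pred)        # the forecast becomes history
  }
  tail(series, horizon)
}

# Toy stand-in for a fitted model.
toy_model <- function(f) 0.5 * f$lag_7 + 0.5 * f$roll_mean_7

week <- c(100, 110, 120, 115, 105, 60, 50)
history <- rep(week, 8)              # 56 days of made-up ridership
fc <- recursive_forecast(history, toy_model)
```

From the eighth forecast day onward, the lag-7 feature is itself a forecast, which is why errors can compound at longer horizons.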
Prediction intervals are empirical. They are based on historical backtest residuals rather than a theoretical distribution, which grounds the intervals in observed model behavior.
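A minimal sketch of an empirical interval is shown below. It assumes residuals are defined as actual minus predicted, uses an 80% level, and pools all residuals together; the function name and those choices are illustrative, not confirmed project settings.

```r
# Sketch only: residual definition, 80% level, and pooling are assumed.
add_empirical_interval <- function(point_forecast, backtest_residuals,
                                   level = 0.80) {
  alpha <- (1 - level) / 2
  # Empirical quantiles of past errors, no distributional assumption.
  q <- quantile(backtest_residuals, probs = c(alpha, 1 - alpha),
                names = FALSE)
  data.frame(
    forecast = point_forecast,
    lower = pmax(point_forecast + q[1], 0),  # ridership cannot be negative
    upper = point_forecast + q[2]
  )
}

residuals <- seq(-100, 100, by = 1)   # made-up backtest residuals
iv <- add_empirical_interval(c(1000, 50), residuals)
```

If backtest errors are skewed, the interval is asymmetric around the point forecast, which a symmetric theoretical interval would not capture.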
Fallback forecasts are used for lower-coverage Rail stations, unassigned Rail rows, and model failure cases.
The fallback hierarchy uses:
- 7-day lag when available.
- Same-weekday recent history.
- Recent rolling averages.
- Broader historical averages if needed.
Annual seasonal naive remains a benchmark, but the operational fallback defaults to more recent same-weekday behavior.
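The hierarchy can be read as "use the first estimate that is available". A sketch under assumed details (a four-week same-weekday window, a 28-day rolling window, and a hypothetical function name) follows; missing days are represented as NA.

```r
# Sketch only: window lengths and the function name are assumed.
fallback_estimate <- function(history) {
  n <- length(history)
  at <- function(i) if (i >= 1) history[i] else NA_real_
  mean_or_na <- function(v) {
    if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)
  }
  # Estimates for day n + 1, in order of preference.
  candidates <- c(
    lag_7 = at(n - 6),
    same_weekday = mean_or_na(sapply(n + 1 - 7 * (1:4), at)),
    rolling_28 = mean_or_na(tail(history, 28)),
    overall = mean_or_na(history)
  )
  # First non-missing candidate; its name records which rule fired.
  candidates[!is.na(candidates)][1]
}

full_history <- as.numeric(1:28)
est_full <- fallback_estimate(full_history)

gap_history <- full_history
gap_history[22] <- NA                 # the 7-day lag is missing
est_gap <- fallback_estimate(gap_history)

est_short <- fallback_estimate(c(NA, NA, 50, 70))  # very little history
```

Keeping the rule name on the returned value makes it easy to log which level of the hierarchy produced each fallback forecast.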
Known disruptions can be excluded from training through:
data/training_exclusions.csv
This file allows known closures, weather disruptions, shutdowns, or one-off anomalies to remain in the historical record while being excluded from model fitting.
Exclusions can apply to:
- All modes.
- Bus only.
- Rail only.
- Specific Rail stations.
This is a practical design decision: the model should not be forced to learn from days that WMATA already knows were abnormal and not representative of ordinary demand.
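As a sketch of how such a file could be applied, the example below flags rows instead of deleting them, so the historical record stays intact. The column layout (date, mode, station, with a blank station meaning the whole mode), the mode labels, and the function name are assumptions; the actual schema of data/training_exclusions.csv is not shown in this document.

```r
# Sketch only: the CSV schema, mode labels, and function name are assumed.
exclusions <- read.csv(text = "date,mode,station
2024-01-16,all,
2024-03-10,rail,Example Station
2024-05-05,bus,", stringsAsFactors = FALSE)

flag_exclusions <- function(frame, exclusions, frame_mode) {
  # frame needs a date column; Rail station frames also have station.
  station <- if ("station" %in% names(frame)) frame$station else
    rep("", nrow(frame))
  excluded <- rep(FALSE, nrow(frame))
  for (i in seq_len(nrow(exclusions))) {
    ex <- exclusions[i, ]
    ex_station <- if (is.na(ex$station)) "" else ex$station
    mode_hit <- ex$mode %in% c("all", frame_mode)
    station_hit <- ex_station == "" | station == ex_station
    excluded <- excluded |
      (mode_hit & frame$date == as.Date(ex$date) & station_hit)
  }
  frame$exclude_from_training <- excluded
  frame
}

rail <- data.frame(
  date = as.Date(c("2024-01-16", "2024-01-16", "2024-03-10", "2024-03-10")),
  station = c("Example Station", "Other Station",
              "Example Station", "Other Station"),
  stringsAsFactors = FALSE
)
rail <- flag_exclusions(rail, exclusions, frame_mode = "rail")
rail_training <- rail[!rail$exclude_from_training, ]

bus <- data.frame(date = as.Date(c("2024-05-05", "2024-03-10")))
bus <- flag_exclusions(bus, exclusions, frame_mode = "bus")
```

Only the training subset drops the flagged rows; the full frame still contains every day for reporting and QA.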
The discovery layer produces diagnostic insights beyond the core forecast.
It analyzes:
- Weekday ridership structure.
- Winter versus non-winter error.
- Bus/Rail divergence patterns.
- Concentration of forecast error.
- Station sensitivity to holidays, weekends, and Tuesday-Thursday commuting patterns.
- Future high-demand and high-uncertainty trigger days.
These diagnostics help explain where the model works well, where it struggles, and where WMATA may want additional monitoring or context.
These are the main future forecast outputs.
Important files:
- outputs/tables/bus_forecast_output.csv
- outputs/tables/rail_system_forecast_output.csv
- outputs/tables/rail_station_forecast_output.csv
- outputs/tables/future_forecast_summary.txt
These summarize how candidate models performed.
Important files:
- outputs/tables/overall_model_comparison.csv
- outputs/tables/rolling_validation_summary.csv
- outputs/tables/holdout_performance.csv
- outputs/tables/horizon_bucket_performance.csv
These explain Rail station handling and unassigned ridership.
Important files:
- outputs/tables/station_completeness_fallback.csv
- outputs/tables/rail_unassigned_qa.csv
These support operational insight and diagnostic review.
Important files:
- outputs/tables/discovery_insight_cards.md
- outputs/tables/discovery_trigger_counts.csv
- outputs/tables/discovery_trigger_days.csv
- outputs/tables/discovery_station_sensitivity.csv
- outputs/tables/discovery_station_error_pareto.csv
- outputs/tables/discovery_winter_error.csv
These are the broader pipeline-generated graphics under:
outputs/figures/
They cover forecast trends, holdout fit, model comparisons, prediction intervals, horizon decay, station risk, winter behavior, and other diagnostics.
These are curated presentation graphics under:
slideshowGraphs/
The manifest is:
slideshowGraphs/graph_manifest.csv
These charts are designed for stakeholder communication rather than full technical inspection.
These store fitted model objects and pipeline results for later inspection.
Important files:
- outputs/diagnostics/wmata_model_artifacts.qs
- outputs/diagnostics/discovery_artifacts.qs
- targets: Runs the project as a reproducible pipeline.
- tarchetypes: Adds helper patterns for the targets workflow.
- renv: Locks package versions so the project can be recreated more reliably.
- tidyverse: Provides the core data cleaning, transformation, and plotting tools.
- tidymodels: Provides a consistent framework for recipes, workflows, model fitting, and prediction.
- xgboost: Provides the gradient-boosted tree model.
- glmnet: Provides regularized regression.
- future and furrr: Run model tuning and backtests in parallel.
- yardstick: Computes model performance metrics.
- slider: Builds rolling historical features.
- ggplot2, patchwork, ggrepel, and ggtext: Build report and presentation graphics.
- qs2: Saves diagnostic artifacts efficiently.
- here and fs: Keep file paths and directory creation consistent.
- Defines the full pipeline from raw data through outputs.
- Sources all major R modules.
- Connects data preparation, modeling, discovery diagnostics, tables, figures, and artifacts.
- Creates required project directories.
- Restores packages from renv.lock when available.
- Installs any missing runtime packages.
- Contains shared utilities used across the pipeline.
- Defines project packages, parallel settings, output directories, plotting theme, holiday calendar logic, forecast calendar creation, and prediction interval helpers.
- Locates and reads WMATA raw exports.
- Cleans raw export fields into bronze data.
- Aggregates silver Bus, Rail station, Rail unassigned, and total daily tables.
- Reads and applies training exclusion flags.
- Adds forecast-safe model features.
- Builds station completeness flags.
- Splits Rail into main station cohort, fallback station cohort, and unassigned Rail frame.
- Writes model-ready gold datasets.
- Defines model input columns.
- Builds modeling recipes.
- Defines linear regression, GLMNET, and XGBoost specifications.
- Defines tuning grids for GLMNET and XGBoost.
- Builds tuning and rolling-validation schedules.
- Runs the model ladder.
- Selects final models.
- Produces holdout predictions and future forecasts.
- Handles Bus, Rail main station, Rail fallback, and Rail unassigned pipelines.
- Implements recursive forecasting.
- Recomputes lag and rolling features during future prediction.
- Provides model-based forecasts and baseline forecasts.
- Builds future calendar panels.
- Computes model performance metrics.
- Scores model stability.
- Selects the final model using accuracy, stability, and interpretability rules.
- Adds empirical prediction intervals.
- Builds the main tables and figures under outputs/.
- Creates model comparison, forecast, interval, horizon, residual, station-risk, recovery, and monitoring graphics.
- Saves model diagnostic artifacts.
- Builds additional diagnostic and operational insight tables.
- Analyzes weekday patterns, winter effects, mode divergence, error concentration, station sensitivity, and forecast triggers.
- Writes discovery figures and artifacts.
- Reads forecast tables.
- Creates a concise 30-day text summary for Bus, Rail, and combined ridership.
- Writes outputs/tables/future_forecast_summary.txt.
- Reads pipeline outputs and diagnostic artifacts.
- Produces curated, slideshow-ready graphs.
- Writes slideshowGraphs/graph_manifest.csv.
- Provides quick-start instructions.
- Describes project structure, input data expectations, commands, modeling scope, and leakage-prevention choices.
- Records package versions used by the project.
- Supports reproducible setup through renv::restore().
- User-maintained file for excluding known abnormal dates from training.
- Supports mode-wide and station-specific exclusions.
The project uses time-based validation because ridership forecasting is a time-dependent problem. Random splits would allow future patterns to influence training and would overstate performance.
The feature set avoids weather, gas prices, economic indicators, and unknown future disruptions. These may be useful in theory, but they are unavailable at forecast time, hard to forecast reliably, or liable to make historical performance look better than real future performance.
The statistical models predict log1p(ridership). This compresses large ridership differences, and transforming back with expm1() bounds every prediction below at -1 rider, so predictions stay effectively nonnegative in rider counts.
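The transform and its inverse can be sketched in two small helpers. The function names and the final clamp at zero are assumptions for illustration; the clamp only matters for model-scale predictions below zero, which expm1() maps to values between -1 and 0.

```r
# Sketch only: helper names and the clamp at zero are assumed.
to_model_scale <- function(riders) log1p(riders)        # log(1 + riders)
to_rider_scale <- function(pred) pmax(expm1(pred), 0)   # exp(pred) - 1, floored at 0

riders <- c(0, 10, 1000, 250000)      # made-up daily counts
model_scale <- to_model_scale(riders)
round_trip <- to_rider_scale(model_scale)

# Even a very negative model-scale prediction maps to zero riders.
floor_case <- to_rider_scale(-5)
```

log1p() is used instead of log() because it is defined at zero ridership, which matters for closed stations or missing service days.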
Prediction intervals are based on observed backtest residuals. This keeps uncertainty estimates tied to actual model errors.
Complete, mature stations are modeled directly. Newer or incomplete stations are forecast with fallback logic. This avoids forcing sparse stations into a model that assumes consistent long-term history.
The project writes both technical outputs and stakeholder-facing charts. This reduces the gap between model development and communication.
The model does not automatically know about future disruptions, emergency closures, special events, weather shocks, major service changes, or economic changes unless those factors are explicitly represented in future-available data.
Excluded or intentionally limited inputs include:
- Weather.
- Gas prices.
- Economic indicators.
- Unknown future disruptions.
- Future actual ridership.
- Rolling features that include the forecast date itself.
The model should therefore be used as a planning forecast, not as an automatic operating decision system. Human review remains important for abnormal days, event-driven demand, major service changes, and future conditions not represented in the feature set.
- Reproducible end-to-end pipeline.
- Clear data layers from raw exports to model-ready tables.
- Forecast-safe feature design.
- Time-based validation aligned with real forecasting.
- Explicit comparison against simple baselines.
- Separate Bus and Rail modeling choices.
- Station-level Rail forecasts with fallback handling.
- Empirical uncertainty bands.
- Diagnostic outputs for model trust and operational review.
- Presentation-ready graphics for stakeholder communication.
Start with these files:
- README.md for setup, project scope, and commands.
- _targets.R for the full pipeline order.
- data_prep.R and feature_engineering.R for how raw data becomes model inputs.
- model_spec.R, model_fit.R, forecasting.R, and evaluation.R for modeling decisions.
- graph_pipeline.R and discovery_layer.R for outputs and diagnostics.
- outputs/tables/ for model results and forecast tables.
- slideshowGraphs/ for curated stakeholder visuals.