WMATA Ridership Forecasting Project Outline

Purpose

This project builds a reproducible forecasting workflow for WMATA daily ridership. It prepares raw WMATA ridership exports, creates forecast-safe modeling datasets, compares several forecasting approaches, selects production models, generates 30-day Bus and Rail forecasts, and writes validation tables, diagnostics, and presentation-ready graphics.

The main goal is not only to predict ridership, but to make the prediction process understandable and reviewable. The workflow is designed so a WMATA reviewer can trace how raw data becomes model inputs, how models are compared, why selected models are used, and what limitations remain.

High-Level Workflow

  1. Raw WMATA ridership exports are placed in data/raw/.
  2. The pipeline standardizes those exports into bronze data.
  3. Bronze data is aggregated into daily Bus, Rail station, Rail unassigned, and system total tables.
  4. Forecast-safe features are added, such as calendar fields, lags, rolling means, and same-weekday history.
  5. Models are trained and evaluated with chronological validation.
  6. The selected models produce future 30-day forecasts.
  7. Tables, figures, diagnostics, and slideshow-ready charts are written to output folders.

The full pipeline is orchestrated by _targets.R, which defines each step as a reproducible target.

Design Principles

  • Use only information that would be available at prediction time.
  • Validate models through time, not with random train/test splits.
  • Compare advanced models against simple baselines before selecting them.
  • Keep Bus and Rail modeling separate where their operating patterns differ.
  • Model Rail at the station level where possible, then aggregate to systemwide forecasts.
  • Preserve unusual historical days, but allow known disruptions to be excluded from model training.
  • Generate outputs that are useful for both technical review and operational communication.

Data Flow

Raw Data

Raw data comes from WMATA Daily Ridership Portal exports and is expected under data/raw/.

The project expects two source exports:

  • A full detail export.
  • A daily summary totals export.

These files are read as UTF-16 tab-separated data, then cleaned so that column names and date formats are consistent.

Bronze Data

Bronze data is the cleaned version of the raw exports. It keeps the source-level structure while converting the exports into UTF-8 CSVs.

Key outputs:

  • data/processed/bronze/wmata_ridership_full_utf8.csv
  • data/processed/bronze/wmata_ridership_totals_utf8.csv
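
The raw-to-bronze conversion described above can be sketched as follows. The project itself is written in R, so this Python snippet is only an illustration of the idea, not the project's code. The function name, the header-cleaning rule, and the sample column names are assumptions.

```python
import csv
import io


def utf16_tsv_to_csv_text(raw_bytes: bytes) -> str:
    """Decode a UTF-16 tab-separated export and return it as CSV text."""
    # The "utf-16" codec uses the byte-order mark, when present, to pick endianness.
    text = raw_bytes.decode("utf-16")
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
    if not rows:
        return ""
    # Assumed cleaning rule: trim, lowercase, and snake_case the header names.
    header = [name.strip().lower().replace(" ", "_") for name in rows[0]]
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return out.getvalue()


# Hypothetical two-column export; the real column names may differ.
sample = "Service Date\tEntries\n2024-01-02\t1200\n".encode("utf-16")
csv_text = utf16_tsv_to_csv_text(sample)
print(csv_text)
```

Writing `csv_text` to disk with `encoding="utf-8"` would give a file shaped like the bronze outputs listed above.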

Silver Data

Silver data creates analysis-ready daily ridership tables:

  • Bus daily ridership.
  • Rail station daily ridership.
  • Rail unassigned daily ridership.
  • Mode-level daily totals.

Key outputs:

  • data/processed/silver/ridership_bus_daily.csv
  • data/processed/silver/ridership_rail_station_daily.csv
  • data/processed/silver/ridership_rail_unassigned_daily.csv
  • data/processed/silver/ridership_totals_daily.csv

Gold Data

Gold data is the model-ready layer. It adds forecast-safe features and separates Rail stations into main modeling and fallback groups.

Key outputs:

  • data/processed/gold/bus_model_frame.csv
  • data/processed/gold/rail_station_model_frame.csv
  • data/processed/gold/rail_station_fallback_frame.csv
  • data/processed/gold/rail_unassigned_frame.csv
  • data/processed/gold/rail_system_daily.csv

Forecast-Safe Features

The model features are intentionally limited to information that can be known before the forecast date.

Calendar and service features:

  • Year.
  • Month.
  • ISO week of year.
  • Day of week.
  • Weekend flag.
  • Holiday flag.
  • Service type.
  • Weekday/Saturday/Sunday grouping.

Historical ridership features:

  • Prior-day ridership.
  • 7-, 14-, 21-, and 28-day lags.
  • Shifted rolling means over 7, 14, and 28 days.
  • Average of prior same-weekday values.

Rail-specific features:

  • Station name, for models that can use station identity.
  • Station age in days.
  • New-station flag.

The rolling features are shifted by one day so the model does not use the target day itself. This is a key leakage-prevention decision.
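
A minimal sketch of the historical features, assuming a gap-free daily series held in a plain list. It is written in Python for illustration only (the project uses R), and the function names and the four-week default for the same-weekday average are assumptions. The point to notice is that every feature for position `i` reads only positions before `i`.

```python
from statistics import mean
from typing import Optional


def lag(values: list[float], i: int, k: int) -> Optional[float]:
    """Ridership observed k days before position i, or None if unavailable."""
    return values[i - k] if i - k >= 0 else None


def shifted_rolling_mean(values: list[float], i: int, window: int) -> Optional[float]:
    """Mean of the `window` days ending the day before position i."""
    if i < window:
        return None
    return mean(values[i - window:i])  # the slice stops before values[i]


def prior_same_weekday_mean(values: list[float], i: int, weeks: int = 4) -> Optional[float]:
    """Average of the same weekday over up to `weeks` prior weeks (assumed default)."""
    prior = [values[i - 7 * w] for w in range(1, weeks + 1) if i - 7 * w >= 0]
    return mean(prior) if prior else None


# Thirty consecutive days with ridership 1.0, 2.0, ..., 30.0; the target is the last day.
series = [float(day) for day in range(1, 31)]
target = 29
features = {
    "lag_7": lag(series, target, 7),
    "roll_mean_7": shifted_rolling_mean(series, target, 7),
    "same_weekday_mean": prior_same_weekday_mean(series, target),
}
```

Changing the target day's own value (`series[29]`) leaves every entry in `features` unchanged, which is the leakage property the shift is meant to guarantee.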

Bus Modeling Design

Bus is modeled as one systemwide daily ridership series.

This design is appropriate because the project forecasts total daily Metrobus ridership rather than ridership on individual routes or stops. The Bus model uses the shared forecast-safe calendar and ridership-history features.

The Bus pipeline:

  • Builds one daily modeling frame.
  • Compares candidate models through rolling validation.
  • Selects a final model using accuracy and stability criteria.
  • Produces a 30-day forecast with empirical prediction intervals.

Rail Modeling Design

Rail is modeled station-first, then aggregated to systemwide Rail demand.

This is one of the most important design choices in the project. Rail ridership patterns differ by station, and station-level modeling preserves those differences better than a single aggregate-only model.

Rail stations are split into two groups:

  • Main cohort: stations with at least 90% overall coverage and at least two years of pre-holdout history.
  • Fallback cohort: newer or lower-coverage stations that should still be forecast, but should not drive the main station model.
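
The cohort rule can be sketched as below. This is an illustrative Python snippet, not the project's R code. It assumes coverage is the share of expected days with a recorded value and treats "two years" as 730 days; both definitions, along with the function and parameter names, are assumptions.

```python
from datetime import date


def assign_cohort(
    observed_days: int,
    expected_days: int,
    first_date: date,
    holdout_start: date,
    min_coverage: float = 0.90,
    min_history_days: int = 730,  # assumed reading of "two years"
) -> str:
    """Return "main" or "fallback" for one Rail station."""
    coverage = observed_days / expected_days if expected_days else 0.0
    history_days = (holdout_start - first_date).days
    if coverage >= min_coverage and history_days >= min_history_days:
        return "main"
    return "fallback"


holdout = date(2024, 1, 1)
mature = assign_cohort(700, 730, date(2021, 1, 1), holdout)
sparse = assign_cohort(300, 365, date(2021, 1, 1), holdout)
new_station = assign_cohort(365, 365, date(2023, 1, 1), holdout)
```

Here `mature` is assigned to the main cohort, while `sparse` (low coverage) and `new_station` (short history) both land in the fallback cohort.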

Unassigned Rail rows are handled separately. They are excluded from station-level model training, tracked in QA outputs, forecast separately, and added back into systemwide Rail totals.

The final Rail system forecast is assembled from:

  • Main station forecasts.
  • Fallback station forecasts.
  • Unassigned Rail forecasts.
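
As a sketch of that assembly step, the three components can be summed by date. The snippet is illustrative Python rather than the project's R code, and the row layout of `(date, forecast)` pairs is an assumption.

```python
from collections import defaultdict


def assemble_rail_system(main, fallback, unassigned):
    """Sum station-level and unassigned forecasts into one total per date.

    Each argument is an iterable of (date_string, forecast) pairs.
    """
    totals = defaultdict(float)
    for component in (main, fallback, unassigned):
        for day, value in component:
            totals[day] += value
    return dict(sorted(totals.items()))


# Hypothetical forecasts: two main stations, one fallback station, one unassigned row.
system = assemble_rail_system(
    main=[("2025-01-01", 100.0), ("2025-01-01", 50.0)],
    fallback=[("2025-01-01", 10.0)],
    unassigned=[("2025-01-01", 5.0)],
)
```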

Model Ladder

The project does not assume that a complex model is automatically better. It compares a ladder of models:

  1. Annual seasonal naive benchmark.
  2. 7-day lag benchmark.
  3. Linear regression.
  4. GLMNET regularized regression.
  5. XGBoost.

The simple benchmarks are important because ridership has strong weekly and seasonal structure. A more complex model must beat these baselines to justify its use.

What XGBoost Is Doing Here

XGBoost is a gradient-boosted decision tree model. It builds many small regression trees in sequence, where each new tree tries to correct errors from the previous trees.

In this project, XGBoost predicts transformed ridership, using log1p(ridership), from forecast-safe features such as recent lags, rolling averages, holidays, service type, and station metadata for Rail.

XGBoost is useful here because ridership patterns can be nonlinear. For example, weekend behavior, holiday behavior, and recent ridership levels may interact in ways that a simple linear model may not capture.

The XGBoost tuning grid tests combinations of:

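  • Number of trees.
  • Maximum tree depth.
  • Learning rate.
  • Minimum node size (the fewest observations a node needs before it can be split further).
  • Loss reduction (the minimum improvement required to make a further split).
  • Sample size (the share of training rows sampled for each tree).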

The production model is selected only after comparison against the full model ladder.

Validation Design

The project uses chronological validation, not random splitting.

Rolling validation evaluates each model at monthly forecast origins spanning a calendar year. For each origin, the model is trained only on data before the origin and forecasts the next 30 days. Errors are then summarized across all validation windows.
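
A minimal sketch of such a validation schedule, in Python for illustration (the project builds its schedule in R). Placing each origin on the first day of a month is an assumption, as are the function names.

```python
from datetime import date, timedelta


def monthly_origins(year: int) -> list[date]:
    """One forecast origin per month (assumed to fall on the 1st)."""
    return [date(year, month, 1) for month in range(1, 13)]


def validation_window(origin: date, horizon_days: int = 30):
    """Return the last training date and the list of dates to be forecast."""
    train_end = origin - timedelta(days=1)  # training never reaches the origin
    test_days = [origin + timedelta(days=d) for d in range(horizon_days)]
    return train_end, test_days


origins = monthly_origins(2024)
train_end, test_days = validation_window(origins[2])  # the March origin
```

For the March 2024 origin, training ends on 2024-02-29 and the test window runs from 2024-03-01 through 2024-03-30, so no test date is ever visible during training.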

The main validation metric is MAE, or mean absolute error. MAE is the average absolute difference between predicted ridership and actual ridership. It is measured in riders, making it easy to interpret operationally.

The project also reports:

  • RMSE.
  • Bias.
  • MAPE.
  • SMAPE.
  • R-squared.
  • Horizon-specific performance.
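
The error metrics can be sketched as follows. This is illustrative Python, not the project's code, which relies on the yardstick R package. The sign convention for bias (predicted minus actual) and the SMAPE variant shown are assumptions, since several definitions are in common use. R-squared is omitted for brevity.

```python
from math import sqrt


def forecast_metrics(actual: list[float], predicted: list[float]) -> dict[str, float]:
    """Compute MAE, RMSE, bias, MAPE, and SMAPE for paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    # MAPE skips days with zero actual ridership to avoid dividing by zero.
    pct = [abs(e) / a for e, a in zip(errors, actual) if a != 0]
    sym = [
        2 * abs(e) / (abs(a) + abs(p))
        for e, a, p in zip(errors, actual, predicted)
        if abs(a) + abs(p) != 0
    ]
    return {
        "mae": sum(abs(e) for e in errors) / n,      # in riders
        "rmse": sqrt(sum(e * e for e in errors) / n),
        "bias": sum(errors) / n,                     # positive = over-forecast
        "mape": 100 * sum(pct) / len(pct) if pct else float("nan"),
        "smape": 100 * sum(sym) / len(sym) if sym else float("nan"),
    }


metrics = forecast_metrics(actual=[100.0, 200.0], predicted=[110.0, 180.0])
```

In this two-day example the errors are +10 and -20 riders, so MAE is 15 riders and bias is -5 riders: the forecasts miss by 15 on average and run slightly low overall.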

Holdout evaluation is reserved for a later chronological period so that model performance can be checked on data not used for model selection.

Model Selection Rules

Model selection balances accuracy, stability, and interpretability.

The selection logic:

  • Computes validation metrics for each candidate model.
  • Requires candidate models to beat simple baselines.
  • Prefers simpler models when accuracy is effectively tied.
  • Allows XGBoost only when it clears accuracy and stability checks.
  • Falls back to the 7-day lag baseline if no candidate model improves on the benchmarks.
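
One way to express rules of this kind is sketched below in Python. It is not the project's selection code: the 2% tie tolerance, the complexity ranking, the model labels, and the use of a single boolean stability flag per model are all assumptions made for illustration.

```python
# Assumed ranking, from simplest to most complex.
COMPLEXITY = {"linear": 1, "glmnet": 2, "xgboost": 3}
BASELINES = ("seasonal_naive", "lag7")


def select_model(
    mae: dict[str, float],
    stable: dict[str, bool],
    tie_tolerance: float = 0.02,  # hypothetical "effectively tied" threshold
) -> str:
    """Pick the simplest stable model that beats both baselines."""
    best_baseline = min(mae[name] for name in BASELINES)
    candidates = [
        name
        for name in COMPLEXITY
        if name in mae and mae[name] < best_baseline and stable.get(name, False)
    ]
    if not candidates:
        return "lag7"  # documented fallback when nothing beats the benchmarks
    best = min(mae[name] for name in candidates)
    tied = [name for name in candidates if mae[name] <= best * (1 + tie_tolerance)]
    return min(tied, key=COMPLEXITY.__getitem__)


all_stable = {"linear": True, "glmnet": True, "xgboost": True}
close_race = {"seasonal_naive": 1200.0, "lag7": 1000.0,
              "linear": 900.0, "glmnet": 895.0, "xgboost": 890.0}
choice = select_model(close_race, all_stable)
```

With these hypothetical scores, XGBoost has the lowest MAE but linear regression is within the tolerance, so `choice` is the simpler linear model.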

This keeps model choice evidence-based instead of assuming the most complex model should win.

Forecasting Design

Future forecasts are generated recursively. That means when forecasting multiple days ahead, earlier forecasted values become part of the history used to forecast later days.

The recursive forecast process:

  • Starts from the latest available actual ridership history.
  • Builds future calendar rows.
  • Recomputes lag and rolling features day by day.
  • Predicts each future date.
  • Caps extreme predictions using recent history and fallback estimates.
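
The recursive loop can be sketched as follows. This is illustrative Python under simplifying assumptions, not the project's R implementation: only two features are rebuilt, the `predict` callable stands in for a fitted model, and the capping rule (a multiple of the recent maximum) is hypothetical.

```python
from statistics import mean
from typing import Callable


def recursive_forecast(
    history: list[float],
    horizon: int,
    predict: Callable[[dict[str, float]], float],
    cap_multiple: float = 3.0,
) -> list[float]:
    """Forecast `horizon` days ahead, feeding each prediction back as history."""
    if len(history) < 7:
        raise ValueError("need at least 7 days of history for the 7-day lag")
    series = list(history)
    # Hypothetical cap: a multiple of the highest value in the last 28 actual days.
    ceiling = cap_multiple * max(history[-28:])
    forecasts = []
    for _ in range(horizon):
        # Features are rebuilt from the series as it stands, which includes
        # earlier forecasts once the horizon moves past the first day.
        features = {"lag_7": series[-7], "roll_mean_7": mean(series[-7:])}
        value = min(max(predict(features), 0.0), ceiling)
        series.append(value)
        forecasts.append(value)
    return forecasts


# Toy stand-in for a fitted model: repeat the value from seven days earlier.
weekly = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
path = recursive_forecast(weekly, horizon=9, predict=lambda f: f["lag_7"])
```

From the eighth forecast day onward, `lag_7` is itself a forecast rather than an actual value, which is why errors can compound as the horizon lengthens.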

Prediction intervals are empirical: they are built from historical backtest residuals rather than from an assumed theoretical distribution, which grounds them in observed model behavior.
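
A minimal sketch of an empirical interval, in Python for illustration. The 10th and 90th percentiles, the residual definition (actual minus predicted), and the floor at zero are assumptions; the project's own interval helpers live in its R code and may, for example, use different coverage levels or horizon-specific residuals.

```python
from statistics import quantiles


def empirical_interval(
    point: float,
    residuals: list[float],
    lower_pct: int = 10,
    upper_pct: int = 90,
) -> tuple[float, float]:
    """Shift a point forecast by percentiles of backtest residuals."""
    if not 1 <= lower_pct < upper_pct <= 99:
        raise ValueError("percentiles must satisfy 1 <= lower < upper <= 99")
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = quantiles(residuals, n=100, method="inclusive")
    lower = point + cuts[lower_pct - 1]
    upper = point + cuts[upper_pct - 1]
    return max(lower, 0.0), upper  # ridership cannot be negative


# Hypothetical backtest residuals (actual minus predicted), in riders.
backtest_residuals = [float(r) for r in range(-50, 51)]
interval = empirical_interval(1000.0, backtest_residuals)
```

With residuals spread evenly from -50 to +50 riders, a point forecast of 1,000 gets the interval (960, 1040). A skewed residual history would produce an asymmetric band, which a symmetric theoretical distribution could not.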

Fallback Strategy

Fallback forecasts are used for lower-coverage Rail stations, unassigned Rail rows, and model failure cases.

The fallback hierarchy uses:

  • 7-day lag when available.
  • Same-weekday recent history.
  • Recent rolling averages.
  • Broader historical averages if needed.
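
Read as a cascade, the hierarchy takes the first estimate that is available. The sketch below is illustrative Python, not the project's R code, and assumes a strict first-available rule; the project may blend or weight these sources differently. The argument names are hypothetical.

```python
from typing import Optional


def fallback_forecast(
    lag_7: Optional[float],
    same_weekday_recent: Optional[float],
    rolling_recent: Optional[float],
    historical_mean: Optional[float],
) -> tuple[float, str]:
    """Return the first available estimate and the name of its source."""
    ladder = [
        ("lag_7", lag_7),
        ("same_weekday_recent", same_weekday_recent),
        ("rolling_recent", rolling_recent),
        ("historical_mean", historical_mean),
    ]
    for source, value in ladder:
        if value is not None:
            return value, source
    return 0.0, "no_history"  # assumed last resort when nothing is known


# A station with no value seven days ago but some recent same-weekday history.
estimate = fallback_forecast(None, 420.0, 400.0, 380.0)
```

Returning the source alongside the value makes it easy to report how often each rung of the hierarchy was actually used.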

Annual seasonal naive remains a benchmark, but the operational fallback defaults to more recent same-weekday behavior.

Training Exclusions

Known disruptions can be excluded from training through:

data/training_exclusions.csv

This file allows known closures, weather disruptions, shutdowns, or one-off anomalies to remain in the historical record while being excluded from model fitting.

Exclusions can apply to:

  • All modes.
  • Bus only.
  • Rail only.
  • Specific Rail stations.
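
The scoping rules above could be applied as in the following Python sketch. The actual column layout of data/training_exclusions.csv is not described here, so the `date`, `mode`, and `station` fields, the `"all"` keyword, and the station label are hypothetical.

```python
def is_excluded(day: str, mode: str, station: str, exclusions: list[dict]) -> bool:
    """Check whether one observation matches any exclusion rule."""
    for rule in exclusions:
        if rule["date"] != day:
            continue
        if rule["mode"] not in ("all", mode):
            continue
        # An empty station field means the rule covers the whole mode.
        if rule.get("station") and rule["station"] != station:
            continue
        return True
    return False


# Hypothetical rules: a systemwide closure, a Bus-only event, one station closure.
rules = [
    {"date": "2024-01-16", "mode": "all", "station": ""},
    {"date": "2024-02-05", "mode": "bus", "station": ""},
    {"date": "2024-03-10", "mode": "rail", "station": "Station A"},
]
flags = [
    is_excluded("2024-01-16", "rail", "Station B", rules),
    is_excluded("2024-02-05", "rail", "Station B", rules),
    is_excluded("2024-03-10", "rail", "Station A", rules),
    is_excluded("2024-03-10", "rail", "Station B", rules),
]
```

Flagged rows stay in the historical tables; they are only filtered out when a training set is built.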

This is a practical design decision: the model should not be forced to learn from days that WMATA already knows were abnormal and not representative of ordinary demand.

Discovery Layer

The discovery layer produces diagnostic insights beyond the core forecast.

It analyzes:

  • Weekday ridership structure.
  • Winter versus non-winter error.
  • Bus/Rail divergence patterns.
  • Concentration of forecast error.
  • Station sensitivity to holidays, weekends, and Tuesday-Thursday commuting patterns.
  • Future high-demand and high-uncertainty trigger days.

These diagnostics help explain where the model works well, where it struggles, and where WMATA may want additional monitoring or context.

Output Groups

Forecast Tables

These are the main future forecast outputs.

Important files:

  • outputs/tables/bus_forecast_output.csv
  • outputs/tables/rail_system_forecast_output.csv
  • outputs/tables/rail_station_forecast_output.csv
  • outputs/tables/future_forecast_summary.txt

Validation Tables

These summarize how candidate models performed.

Important files:

  • outputs/tables/overall_model_comparison.csv
  • outputs/tables/rolling_validation_summary.csv
  • outputs/tables/holdout_performance.csv
  • outputs/tables/horizon_bucket_performance.csv

Rail QA and Cohort Tables

These explain Rail station handling and unassigned ridership.

Important files:

  • outputs/tables/station_completeness_fallback.csv
  • outputs/tables/rail_unassigned_qa.csv

Discovery Tables

These support operational insight and diagnostic review.

Important files:

  • outputs/tables/discovery_insight_cards.md
  • outputs/tables/discovery_trigger_counts.csv
  • outputs/tables/discovery_trigger_days.csv
  • outputs/tables/discovery_station_sensitivity.csv
  • outputs/tables/discovery_station_error_pareto.csv
  • outputs/tables/discovery_winter_error.csv

Main Figures

These are the broader pipeline-generated graphics under:

outputs/figures/

They cover forecast trends, holdout fit, model comparisons, prediction intervals, horizon decay, station risk, winter behavior, and other diagnostics.

Slideshow-Ready Graphs

These are curated presentation graphics under:

slideshowGraphs/

The manifest is:

slideshowGraphs/graph_manifest.csv

These charts are designed for stakeholder communication rather than full technical inspection.

Diagnostic Artifacts

These store fitted model objects and pipeline results for later inspection.

Important files:

  • outputs/diagnostics/wmata_model_artifacts.qs
  • outputs/diagnostics/discovery_artifacts.qs

Key Libraries and Why They Matter

  • targets: Runs the project as a reproducible pipeline.
  • tarchetypes: Adds helper patterns for the targets workflow.
  • renv: Locks package versions so the project can be recreated more reliably.
  • tidyverse: Provides the core data cleaning, transformation, and plotting tools.
  • tidymodels: Provides a consistent framework for recipes, workflows, model fitting, and prediction.
  • xgboost: Provides the gradient-boosted tree model.
  • glmnet: Provides regularized regression.
  • future and furrr: Run model tuning and backtests in parallel.
  • yardstick: Computes model performance metrics.
  • slider: Builds rolling historical features.
  • ggplot2, patchwork, ggrepel, and ggtext: Build report and presentation graphics.
  • qs2: Saves diagnostic artifacts efficiently.
  • here and fs: Keep file paths and directory creation consistent.

File-by-File Map

_targets.R

  • Defines the full pipeline from raw data through outputs.
  • Sources all major R modules.
  • Connects data preparation, modeling, discovery diagnostics, tables, figures, and artifacts.

setup.R

  • Creates required project directories.
  • Restores packages from renv.lock when available.
  • Installs any missing runtime packages.

functions.R

  • Contains shared utilities used across the pipeline.
  • Defines project packages, parallel settings, output directories, plotting theme, holiday calendar logic, forecast calendar creation, and prediction interval helpers.

data_prep.R

  • Locates and reads WMATA raw exports.
  • Cleans raw export fields into bronze data.
  • Aggregates silver Bus, Rail station, Rail unassigned, and total daily tables.
  • Reads and applies training exclusion flags.

feature_engineering.R

  • Adds forecast-safe model features.
  • Builds station completeness flags.
  • Splits Rail into main station cohort, fallback station cohort, and unassigned Rail frame.
  • Writes model-ready gold datasets.

model_spec.R

  • Defines model input columns.
  • Builds modeling recipes.
  • Defines linear regression, GLMNET, and XGBoost specifications.
  • Defines tuning grids for GLMNET and XGBoost.

model_fit.R

  • Builds tuning and rolling-validation schedules.
  • Runs the model ladder.
  • Selects final models.
  • Produces holdout predictions and future forecasts.
  • Handles Bus, Rail main station, Rail fallback, and Rail unassigned pipelines.

forecasting.R

  • Implements recursive forecasting.
  • Recomputes lag and rolling features during future prediction.
  • Provides model-based forecasts and baseline forecasts.
  • Builds future calendar panels.

evaluation.R

  • Computes model performance metrics.
  • Scores model stability.
  • Selects the final model using accuracy, stability, and interpretability rules.
  • Adds empirical prediction intervals.

graph_pipeline.R

  • Builds the main tables and figures under outputs/.
  • Creates model comparison, forecast, interval, horizon, residual, station-risk, recovery, and monitoring graphics.
  • Saves model diagnostic artifacts.

discovery_layer.R

  • Builds additional diagnostic and operational insight tables.
  • Analyzes weekday patterns, winter effects, mode divergence, error concentration, station sensitivity, and forecast triggers.
  • Writes discovery figures and artifacts.

future_forecast.R

  • Reads forecast tables.
  • Creates a concise 30-day text summary for Bus, Rail, and combined ridership.
  • Writes outputs/tables/future_forecast_summary.txt.

wmata_prod_graphs.R

  • Reads pipeline outputs and diagnostic artifacts.
  • Produces curated, slideshow-ready graphs.
  • Writes slideshowGraphs/graph_manifest.csv.

README.md

  • Provides quick-start instructions.
  • Describes project structure, input data expectations, commands, modeling scope, and leakage-prevention choices.

renv.lock

  • Records package versions used by the project.
  • Supports reproducible setup through renv::restore().

data/training_exclusions.csv

  • User-maintained file for excluding known abnormal dates from training.
  • Supports mode-wide and station-specific exclusions.

Important Design Decisions

Chronological Splits

The project uses time-based validation because ridership forecasting is a time-dependent problem. Random splits would allow future patterns to influence training and would overstate performance.

Forecast-Safe Features

The feature set avoids weather, gas prices, economic indicators, and unknown future disruptions. These may be useful in theory, but they are either unavailable at forecast time, hard to forecast reliably, or risk making historical performance look better than real future performance.

Log-Transformed Target

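The statistical models predict log1p(ridership) rather than raw counts. This compresses the gap between high- and low-ridership observations. Because expm1 of any real number is greater than -1, back-transformed predictions cannot become large negative counts, so at most a clip at zero is needed to keep forecasts nonnegative.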
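
The transform and its inverse can be sketched as below. This is illustrative Python rather than the project's R code, and the clip at zero on the way back is an assumption about how a slightly negative back-transformed value would be handled.

```python
from math import expm1, log1p


def to_model_scale(ridership: float) -> float:
    """log1p handles zero-ridership days, since log1p(0) == 0."""
    return log1p(ridership)


def to_rider_scale(prediction: float) -> float:
    """Invert log1p; expm1 is always above -1, so clipping at 0 is enough."""
    return max(expm1(prediction), 0.0)


round_trip = to_rider_scale(to_model_scale(12345.0))
clipped = to_rider_scale(-0.5)  # a negative model-scale prediction
```

A plain log transform would fail on days with zero recorded ridership, which is the usual reason for choosing log1p.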

Empirical Prediction Intervals

Prediction intervals are based on observed backtest residuals. This keeps uncertainty estimates tied to actual model errors.

Rail Station Cohorts

Complete, mature stations are modeled directly. Newer or incomplete stations are forecast with fallback logic. This avoids forcing sparse stations into a model that assumes consistent long-term history.

Presentation Outputs

The project writes both technical outputs and stakeholder-facing charts. This reduces the gap between model development and communication.

Limitations

The model does not automatically know about future disruptions, emergency closures, special events, weather shocks, major service changes, or economic changes unless those factors are explicitly represented in future-available data.

Excluded or intentionally limited inputs include:

  • Weather.
  • Gas prices.
  • Economic indicators.
  • Unknown future disruptions.
  • Future actual ridership.
  • Rolling features that include the forecast date itself.

The model should therefore be used as a planning forecast, not as an automatic operating decision system. Human review remains important for abnormal days, event-driven demand, major service changes, and future conditions not represented in the feature set.

Main Benefits

  • Reproducible end-to-end pipeline.
  • Clear data layers from raw exports to model-ready tables.
  • Forecast-safe feature design.
  • Time-based validation aligned with real forecasting.
  • Explicit comparison against simple baselines.
  • Separate Bus and Rail modeling choices.
  • Station-level Rail forecasts with fallback handling.
  • Empirical uncertainty bands.
  • Diagnostic outputs for model trust and operational review.
  • Presentation-ready graphics for stakeholder communication.

How to Read This Project Without Running It

Start with these files:

  1. README.md for setup, project scope, and commands.
  2. _targets.R for the full pipeline order.
  3. data_prep.R and feature_engineering.R for how raw data becomes model inputs.
  4. model_spec.R, model_fit.R, forecasting.R, and evaluation.R for modeling decisions.
  5. graph_pipeline.R and discovery_layer.R for outputs and diagnostics.
  6. outputs/tables/ for model results and forecast tables.
  7. slideshowGraphs/ for curated stakeholder visuals.