Skip to content

BigQuery for Modeling Data

Bill Hereth edited this page Oct 15, 2025 · 1 revision

Overview

The WFRC BigQuery environment serves as the centralized processing platform for the agency’s modeling data, including preprocessing of the 2023 Household Travel Survey (HTS) for Travel Demand Model (TDM) and related use cases.
This framework ensures that all modeling datasets—whether raw, consultant-delivered, or WFRC-produced—are organized, consistent, and securely managed in a single, scalable environment.

Location and Infrastructure

  • Platform: Google BigQuery — a SQL-based cloud data warehouse.
  • Purpose: Hosts a unified environment for processing and storing survey and model data (e.g., HTS, TDM, ABM, and commercial vehicle datasets).
  • Architecture: Cloud-native and modular, designed to scale and adapt without structural overhaul.

Centralized Access and Data Management

  • Single Access Point: All datasets are stored within a unified BigQuery project (wfrc-modeling-data), ensuring consistency, ease of maintenance, and streamlined collaboration.
  • Modular Design: Each dataset represents a specific processing stage (src_, ext_, prd_), promoting clear lineage and simplified governance.
  • Persistent Views: SQL-based views are centrally maintained and automatically propagate updates to all users, supporting consistent analytics across teams.
  • Integrated Spatial Processing: Spatial joins (e.g., TAZ assignments or coordinate-based joins) are applied centrally during preprocessing, preserving spatial integrity and reducing the need for repetitive spatial queries by users.

Security and Governance

  • Role-Based Access Control (RBAC): Access is governed by roles, restricting sensitive operations to authorized users and aligning with WFRC’s data governance standards.
  • Protection of Sensitive Geolocation Data: Personally identifiable location data (e.g., latitude/longitude, projected X/Y coordinates, Census block group IDs) is excluded from TDM-preprocessed tables to protect respondent privacy while preserving analytical utility.

Usability and Collaboration

  • Cross-Functional Accessibility: Built for planners, analysts, and engineers alike—providing approachable yet powerful access to shared datasets.
  • Web GUI Access: Users can query, preview, and manage datasets directly through the BigQuery web interface in the Google Cloud Console—ideal for interactive analysis and data exploration without coding.
  • Python API Access: Analysts and developers can connect using the BigQuery Python client library, supporting reproducible workflows, automated processing, and integration with Jupyter notebooks or other analytical pipelines.
  • Hybrid Use Case Support: Both methods access the same datasets and permission structure, ensuring a unified and consistent environment across graphical and programmatic workflows.

Dataset Naming Convention

Each dataset name follows this structure:

<stage>_<source>_<project_or_model>_<content>_<year>

Element Meaning Example
stage Stage of data in lifecycle src, ext, prd
source Data source (not used for ) rsg, udot, locus
project_or_model Model system or project name tdm, abm, lu
content Content of data household_travel_survey, commercial_vehicle, taz
year (optional) Primary data year 2023

Example:

src_rsg_household_travel_survey_2023

Stage Definitions

Stage Meaning Description Owned by
src Source Data ingested from external systems or raw sources (e.g., HTS survey, UDOT feeds, Locus exports). External system
ext External Consultant-delivered or partner-processed datasets not yet WFRC-final. Consultant / Partner
prd Production WFRC-approved, final datasets ready for modeling or publication. WFRC
stg Staging Temporary or intermediate processing steps (optional use). WFRC

Example data flow: src_rsg_household_travel_survey_2023ext_rsg_hts_2023prd_tdm_hts_2023

Project/Model Abbreviations

Term Abbreviation Example
Travel Demand Model tdm prd_tdm_taz_2021
Activity-Based Model abm prd_abm_lu_2025
Real Estate Market Model remm prd_remm_parcel_2025

Content Abbreviations

For content at the src level use full descriptions and at all other levels use abbreviations.

Term Abbreviation Example
Household Travel Survey hts prd_tdm_hts_2023
On-Board Survey obs prd_tdm_obs_2023
Land Use lu prd_abm_lu_2025
Personal Vehicle pv ext_tdm_pv_2023
Commercial Vehicle cv ext_tdm_cv_2023
Transportation Analysis Zone taz prd_tdm_taz_2021

Example dataset structure

wfrc-modeling-data
├── src_rsg_household_travel_survey_2023
├── src_locus_commercial_vehicle_2023
├── ext_insight_cv_2023
├── ext_utah_on_board_survey_2023
├── prd_tdm_hts_2023
├── prd_tdm_taz
├── prd_abm_lu_2025

Best Practices

  • Keep names lowercase and use underscores (_) as separators.
  • Include year when datasets are versioned or time-specific.
  • Use descriptive names in source dataset for clarity.
  • Use abbreviations for descriptive names for production tables.
  • Never overwrite production (prd) datasets — use another dataset for new year/version.

Clone this wiki locally