French railway analytics platform built on SNCF open data. Ingests train performance metrics, transforms them with dbt, and provides an interactive dashboard for analysis.
ferrodata/
├── ingestion/ # Data pipeline (fetch from SNCF API, load to DuckDB)
├── ferrodata/ # dbt project (data transformation and modeling)
├── streamlit_app/ # Interactive dashboard
├── data/ # Raw data cache
├── ferrodata.duckdb # Local database (gitignored)
└── pyproject.toml # UV workspace configuration
- Ingestion: Python scripts fetch data from SNCF Open Data API and load into DuckDB
- Transformation: dbt models clean, transform, and build analytics tables
- Visualization: Streamlit app queries DuckDB and renders interactive charts
All data comes from SNCF Open Data:
- TGV punctuality by route (monthly, 2018-present)
- Intercites punctuality by route (monthly, 2014-present)
- TER punctuality by region (monthly, 2013-present)
- Station master list (network metadata)
# Clone repository
git clone <repository-url>
cd ferrodata
# Install dependencies (UV workspace)
uv sync
# Install dbt packages
cd ferrodata
dbtf deps
cd ..
Create .env files in each workspace if needed:
# ingestion/.env (optional - for BigQuery target)
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
GCP_PROJECT_ID=your-project-id
# ferrodata/.env (dbt profile if using BigQuery)
DBT_BIGQUERY_PROJECT=your-project-id
DBT_BIGQUERY_DATASET=your-dataset
Fetch data from SNCF API and load into DuckDB:
# Run from project root
uv run --package ferrodata-ingestion ferrodata-ingest
This creates ferrodata.duckdb with raw data in the raw_sncf schema.
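To confirm that ingestion produced what the text above describes, a quick check like the following may help. It is a minimal sketch, assuming the `duckdb` Python package is installed and that the database sits at `ferrodata.duckdb` in the current directory; the helper names `list_tables` and `check_ingestion` are hypothetical and not part of the project.

```python
from pathlib import Path

RAW_SCHEMA = "raw_sncf"  # schema name taken from the text above


def list_tables(con, schema: str) -> list[str]:
    """Return the sorted table names found in `schema`."""
    rows = con.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = ? ORDER BY table_name",
        [schema],
    ).fetchall()
    return [name for (name,) in rows]


def check_ingestion(db_path: Path = Path("ferrodata.duckdb")) -> list[str]:
    """Open the database read-only and list the raw tables (hypothetical helper)."""
    if not db_path.exists():
        raise FileNotFoundError(f"{db_path} not found; run the ingestion step first")
    # Imported here so list_tables stays usable without the duckdb package.
    import duckdb

    con = duckdb.connect(str(db_path), read_only=True)
    try:
        return list_tables(con, RAW_SCHEMA)
    finally:
        con.close()
```

Calling `check_ingestion()` from the project root should return one table name per ingested dataset. An empty list suggests the ingestion step ran but loaded nothing into the raw schema.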
Run dbt models to build staging and analytics tables:
cd ferrodata
# Run all models
dbt run
# Run specific model
dbt run --select stg_sncf__regularite_tgv
# Run tests
dbt test
# Generate documentation
dbt docs generate
dbt docs serve
Output schemas:
- analytics_staging: Cleaned source data
- analytics_analytics: Marts and aggregations
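The doubled name analytics_analytics is what dbt's default schema naming produces when a custom schema is configured: the target schema and the custom schema are joined with an underscore. The sketch below mimics that rule in Python for illustration. It assumes the project keeps dbt's default `generate_schema_name` behavior, that the target schema is `analytics`, and that the models are configured with custom schemas `staging` and `analytics`; none of this is confirmed by the project files shown here, and `resolve_schema` is a hypothetical name.

```python
from typing import Optional


def resolve_schema(target_schema: str, custom_schema: Optional[str]) -> str:
    """Mimic dbt's default rule: <target_schema>_<custom_schema>."""
    if custom_schema is None:
        # No custom schema configured: the model lands in the target schema.
        return target_schema
    return f"{target_schema}_{custom_schema.strip()}"


# Assumed values: target schema "analytics", custom schemas "staging" and "analytics".
for custom in ("staging", "analytics", None):
    print(resolve_schema("analytics", custom))
# prints:
# analytics_staging
# analytics_analytics
# analytics
```

If shorter schema names are preferred, dbt allows this rule to be overridden with a project-level `generate_schema_name` macro.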
Start the Streamlit app:
# From project root
cd streamlit_app
uv run streamlit run Home.py
# Or using the streamlit command directly
streamlit run streamlit_app/Home.py
Access at http://localhost:8501
Cleaned and typed source data:
- stg_sncf__gares: Station master list
- stg_sncf__regularite_tgv: TGV punctuality
- stg_sncf__regularite_intercites: Intercites punctuality
- stg_sncf__regularite_ter: TER regional punctuality
Analytics-ready tables:
- dim_stations: Station dimension with geography and service metadata
- fct_train_punctuality: Unified punctuality metrics across all services
- fct_tgv_delays_by_cause: Delay cause analysis for TGV
- agg_monthly_service_performance: Monthly trends by service type
- agg_station_performance: Station-level performance metrics
- agg_route_performance: Route-level performance ratings
- Home: Overview metrics and trends
- Station Map: Interactive map of all stations
- Route Analysis: Performance by origin-destination pair
- Delay Causes: Deep dive into delay attribution (TGV only)
# Lint with ruff
uv run ruff check .
# Format
uv run ruff format .
# Run tests
uv run pytest
The project supports both DuckDB (local) and BigQuery (cloud):
DuckDB (default):
- Fast local development
- No credentials needed
- Single-file database
BigQuery:
- Production-ready
- Requires GCP credentials
- Set target: bigquery in ferrodata/profiles.yml
Models use dbt macros for database portability:
-- Instead of date_diff() or datediff()
{{ dbt.datediff("start_date", "end_date", "day") }}
-- Instead of current_timestamp()
{{ dbt.current_timestamp() }}
-- And more: dateadd, date_trunc, concat, split_part, etc.
Check SNCF API status or update URLs in ingestion/config.py
Ensure ferrodata.duckdb exists in project root after running ingestion
- Check DuckDB file path in streamlit_app/utils/db.py
- Verify dim_stations table exists with lat/lon data
- Try clearing cache: Settings > Clear Cache
MIT
Slimane Lakehal