Two-Stage Deep Learning Recommendation System for Transit Route Prediction
MSc Thesis · Data Science & Artificial Intelligence · Lima, Perú 🇵🇪
NextStopAI is a sequential recommendation system that predicts the next flight route a B2C customer is most likely to purchase. The system is built as a full end-to-end MLOps pipeline on Databricks, moving data from raw Bronze tables through a rigorous curation process, training a two-stage deep learning model, and registering it for serving.
| Stage | Notebook | Output |
|---|---|---|
| 1. Data Curation | 01_curation.ipynb | recsys_silver.dataset_curado_v2 (103 cols) |
| 2. Model Training | 02_training.ipynb | Registered model in MLflow + Delta logs |
| 3. Mass Inference | 03_inference.ipynb | Top-5 predictions per active user |
┌──────────────────────────────────────────────────────────────────┐
│ BRONZE LAYER │
│ recsys_bronze.interactions_raw_v2 · recsys_bronze.airports │
└─────────────────────────┬────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 01 · DATA CURATION │
│ │
│ Identity Doc Curation ──► Route (IATA) Curation │
│ │ │ │
│ ▼ ▼ │
│ DOC_VALIDO_V2 RUTA_CURADA + RUTA_VALIDA │
│ ID_PERSONA_V2 Broadcast Join vs airports │
│ │ │ │
│ └────────────┬────────────┘ │
│ ▼ │
│ Feature Engineering (103 cols) │
│ Geo · Haversine Distance · Route Complexity │
│ Popularity · Hub Scores · User Profile · Dates │
│ ▼ │
│ recsys_silver.dataset_curado_v2 │
│ Z-ORDER by (ANIO_EMISION, MES_EMISION) │
└─────────────────────────┬────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 02 · MODEL TRAINING │
│ │
│ STAGE 1 — SASRec (Transformer) │
│ User Sequence → Self-Attention (4 heads) → Top-200 Candidates │
│ │
│ STAGE 2 — LambdaRank (LightGBM) │
│ 200 Candidates + 80 Features → NDCG@5 Optimized → Top-5 │
│ Optuna HPO · 30 trials · Hard Negatives 80% │
└─────────────────────────┬────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ MLOPS LAYER │
│ MLflow Tracking · Model Registry · Delta Audit Table │
│ recsys_gold.training_logs_v2 │
└─────────────────────────┬────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 03 · MASS INFERENCE │
│ Batch predictions · Top-5 per active user │
└──────────────────────────────────────────────────────────────────┘
NextStopAI/
│
├── README.md ← Project overview (this file)
├── requirements.txt ← Python dependencies
├── .gitignore
│
├── notebooks/
│ ├── 01_curation.ipynb ← Data curation pipeline (PySpark)
│ ├── 02_training.ipynb ← SASRec + LambdaRank training
│ └── 03_inference.ipynb ← Mass inference pipeline
│
├── docs/
│ ├── 01_curation.md ← Curation pipeline deep-dive
│ ├── 02_training.md ← Training pipeline deep-dive
│ ├── 03_inference.md ← Inference documentation
│ └── data_schema.md ← Delta Lake schema reference
│
├── data/
│ └── README.md ← Data sources (no raw data stored)
│
└── models/
└── README.md ← Model registry & artifacts reference
Full PySpark pipeline that transforms raw transactional data into a clean, feature-rich Silver table.
Curation stages:
1 · Identity Document Curation
Normalizes document types (DNI, RUC, CE, Passport, CI), applies per-type format validation rules, rejects blacklisted placeholder values (TEST, 000000, S/N...), and flags mass-use RUCs ("RUCs Masivos": more than 1,000 records per entity) using Spark window functions. Produces DOC_VALIDO_V2 and ID_PERSONA_V2.
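The validation logic can be illustrated in plain Python. This is a minimal sketch, not the notebook's PySpark code: the regular expressions for CE, Passport, and CI, the all-zeros rule, and the helper names (`normalize`, `is_valid_document`, `mass_rucs`) are assumptions made for illustration. Only the 8-digit DNI and 11-digit RUC lengths, the listed placeholders, and the 1,000-record threshold come from known facts or from the text above.

```python
import re
from collections import Counter

# Per-type format rules. DNI (8 digits) and RUC (11 digits) follow the
# Peruvian standard; the other three patterns are illustrative assumptions.
DOC_PATTERNS = {
    "DNI": re.compile(r"\d{8}"),
    "RUC": re.compile(r"\d{11}"),
    "CE": re.compile(r"[A-Z0-9]{6,12}"),
    "PASSPORT": re.compile(r"[A-Z0-9]{6,12}"),
    "CI": re.compile(r"[A-Z0-9]{5,12}"),
}
BLACKLIST = {"TEST", "000000", "S/N"}  # placeholders named in this README
MASS_RUC_THRESHOLD = 1000              # "> 1,000 records per entity"


def normalize(doc_number):
    """Upper-case the value and drop spaces, dots, and hyphens."""
    return re.sub(r"[\s.\-]", "", doc_number.upper())


def is_valid_document(doc_type, doc_number):
    """Return True when the number passes the blacklist and format checks."""
    value = normalize(doc_number)
    # Assumption: empty and all-zero values are treated as placeholders too.
    if value in BLACKLIST or set(value) <= {"0"}:
        return False
    pattern = DOC_PATTERNS.get(doc_type.upper())
    return bool(pattern and pattern.fullmatch(value))


def mass_rucs(records):
    """Return the RUCs that appear in more than MASS_RUC_THRESHOLD records.

    `records` is an iterable of (doc_type, doc_number) pairs.
    """
    counts = Counter(
        normalize(number) for doc_type, number in records
        if doc_type.upper() == "RUC"
    )
    return {ruc for ruc, n in counts.items() if n > MASS_RUC_THRESHOLD}
```

In Spark, the per-entity count would typically be computed with a window partitioned by the document number rather than a `Counter`.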
2 · B2C / B2B Segmentation
Classifies customers as B2C (DNI, CE, Passport, CI) or B2B (RUC). All downstream modeling uses the B2C segment exclusively.
3 · Route (IATA) Curation
Extracts IATA airport codes from raw route strings using regexp_extract_all, removes consecutive duplicates with a custom UDF (limpiar_y_unir_ruta), and validates each token against the commercial airport catalog (SCHEDULED_SERVICE = yes). Produces RUTA_CURADA and RUTA_VALIDA.
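The three steps (extract, de-duplicate, validate) can be sketched in plain Python. This is an illustration under assumptions, not the body of `limpiar_y_unir_ruta`: the function name `curate_route`, the `-` separator, the word-boundary regex, and the "at least two airports" rule are all hypothetical.

```python
import re

# Assumption: an IATA token is a standalone run of exactly three letters.
IATA_TOKEN = re.compile(r"\b[A-Z]{3}\b")


def curate_route(raw_route, valid_airports):
    """Return (curated_route, is_valid) for a raw route string.

    `valid_airports` is the set of IATA codes from the airport catalog
    (rows with SCHEDULED_SERVICE = yes).
    """
    tokens = IATA_TOKEN.findall(raw_route.upper())
    # Drop a token when it repeats the one immediately before it.
    deduped = [
        tok for i, tok in enumerate(tokens)
        if i == 0 or tok != tokens[i - 1]
    ]
    curated = "-".join(deduped)
    # Assumption: a valid route has at least two airports, all in the catalog.
    is_valid = len(deduped) >= 2 and all(t in valid_airports for t in deduped)
    return curated, is_valid
```

For example, `curate_route("lim/cuz/cuz/lim", {"LIM", "CUZ"})` collapses the repeated `CUZ` and keeps the return leg, since only consecutive duplicates are removed.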
4 · Feature Engineering (103 features)
| Feature Group | Columns | Method |
|---|---|---|
| Route structure | N_TOKENS, N_TRAMOS, N_ESCALAS, ES_IDA_VUELTA, ES_MULTITRAMO | Native Spark |
| Geographic (OD) | ORIG_COUNTRY, DEST_COUNTRY, ORIG_CONT, CRUZA_CONTINENTE_OD, TIPO_VUELO_OD | Broadcast join vs airports catalog |
| Distance | DIST_KM_OD, DIST_BIN_OD, DELTA_ELEV_FT_OD | Haversine in Spark SQL (no UDF) |
| Complex route metrics | DIST_KM_TOTAL_RUTA, DIST_KM_MAX/MEAN/STD_TRAMO, N_PAISES_VISITADOS, N_CONTINENTES_VISITADOS | Broadcast UDF with airport dict |
| Popularity & Hub | LOG_POP_RUTA, LOG_POP_OD, HUB_SCORE_ORIG, HUB_SCORE_DEST | Window aggregations |
| User profile | U_N_TRIPS, U_PCT_INTL, U_AVG_DIST_OD, U_MAX_DIST_OD, U_AVG_TRAMOS | GroupBy aggregations |
| Temporal | ANIO_EMISION, MES_EMISION, DIA_SEMANA_EMISION, ANTICIPACION_DIAS, DURACION_VIAJE | Date parsing (coalesce multi-format) |
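The distance features rest on the haversine great-circle formula. The pipeline computes it in Spark SQL; the plain-Python sketch below shows the same formula for reference. The function names, the `coords` dictionary layout, and the 6,371 km mean Earth radius are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def total_route_km(route, coords):
    """Sum leg distances along a route, in the spirit of DIST_KM_TOTAL_RUTA.

    `route` is a list of airport codes; `coords` maps code -> (lat, lon).
    """
    return sum(
        haversine_km(*coords[a], *coords[b])
        for a, b in zip(route, route[1:])
    )
```

Applied to the origin and destination airports only, `haversine_km` corresponds to an origin-destination distance such as DIST_KM_OD; summed over consecutive legs it gives a whole-route total.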
5 · Optimized Save
Writes to Delta Lake with OPTIMIZE + ZORDER BY (ANIO_EMISION, MES_EMISION) for fast temporal queries at training time.
Full details: docs/01_curation.md
Self-attentive sequential Transformer that learns from user purchase sequences to generate 200 candidate routes per user.
| Parameter | Value |
|---|---|
| Embedding dimension | 128 |
| Attention heads | 4 |
| Max sequence length | 10 |
| Epochs | 40 |
| Batch size | 2,048 |
| Loss | CrossEntropyLoss |
| Optimizer | Adam (lr=0.001) |
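With a maximum sequence length of 10, each user's purchase history has to be truncated or padded before it reaches the Transformer. The sketch below shows one common way to do this, assuming left-padding with a reserved ID of 0 and a "predict the last purchase" target, which fits the CrossEntropyLoss setup in the table. The helper names and the padding convention are assumptions, not taken from the notebook.

```python
MAX_SEQ_LEN = 10  # "Max sequence length" from the table above
PAD_ID = 0        # assumption: route ID 0 is reserved for padding


def pad_or_truncate(route_ids, max_len=MAX_SEQ_LEN, pad_id=PAD_ID):
    """Keep the most recent `max_len` routes and left-pad shorter histories."""
    recent = route_ids[-max_len:]
    return [pad_id] * (max_len - len(recent)) + recent


def make_training_example(history, max_len=MAX_SEQ_LEN):
    """Split a chronological history into (input_sequence, target_route).

    The input is everything before the last purchase; the target is the
    last purchase. Users with fewer than two purchases yield no example.
    """
    if len(history) < 2:
        return None
    return pad_or_truncate(history[:-1], max_len), history[-1]
```

Left-padding keeps the most recent purchase in the final position, which is the position whose representation is usually used to score the next route.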
LightGBM ranking model that re-scores the 200 candidates using rich contextual features, optimizing for NDCG@5.
| Aspect | Detail |
|---|---|
| Objective | LambdaRank |
| HPO | Optuna · TPE Sampler · 30 trials |
| Negative sampling | 80% hard (top-2,500 popular routes) + 20% uniform |
| Training users | 22,000 |
| Negatives per positive | 80 |
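The 80/20 negative-sampling mix can be sketched as follows. This is a minimal illustration, assuming sampling without replacement and routes represented as hashable IDs; the function name and its parameters are hypothetical. With 80 negatives per positive, the split works out to 64 hard negatives and 16 uniform ones.

```python
import random


def sample_negatives(positive, popular_routes, all_routes,
                     n_neg=80, hard_frac=0.8, rng=None):
    """Sample negatives for one positive route.

    `popular_routes` stands for the top-2,500 most popular routes and
    `all_routes` for the full route vocabulary.
    """
    if rng is None:
        rng = random.Random(42)
    n_hard = round(n_neg * hard_frac)

    # Hard negatives: popular routes the user did not actually buy.
    hard_pool = [r for r in popular_routes if r != positive]
    hard = rng.sample(hard_pool, min(n_hard, len(hard_pool)))

    # Uniform negatives: any remaining route, excluding those already drawn.
    taken = set(hard) | {positive}
    uniform_pool = [r for r in all_routes if r not in taken]
    n_uniform = min(n_neg - len(hard), len(uniform_pool))
    uniform = rng.sample(uniform_pool, n_uniform)

    return hard + uniform
```

Each positive and its sampled negatives would then form one query group for the LambdaRank objective.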
Full details: docs/02_training.md
Loads the registered model from Databricks Model Registry and generates Top-5 route predictions for all active B2C users in batch mode.
Full details: docs/03_inference.md
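At inference time the two stages chain together: retrieve 200 candidates, re-rank them, keep the top 5. The sketch below shows that control flow only. `retrieve_scores` (a route-to-score mapping standing in for SASRec output) and `rerank_score` (a callable standing in for the LightGBM ranker) are hypothetical placeholders, not the project's actual interfaces.

```python
import heapq


def recommend_top_k(user_history, retrieve_scores, rerank_score,
                    n_candidates=200, k=5):
    """Two-stage recommendation for a single user.

    Stage 1 keeps the `n_candidates` routes with the highest retrieval
    score; stage 2 re-scores those candidates and returns the best `k`.
    """
    candidates = heapq.nlargest(
        n_candidates, retrieve_scores, key=retrieve_scores.get
    )
    reranked = sorted(
        candidates,
        key=lambda route: rerank_score(user_history, route),
        reverse=True,
    )
    return reranked[:k]
```

Batch mode would apply this per active user; in practice both stages would be scored in vectorized batches rather than one user at a time.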
The model is evaluated on heavy users (at least 5 trips in the test holdout) under a temporal split, with the last 6 months held out as the test set.
| Metric | Description |
|---|---|
| HR@5 | Hit Rate — at least 1 correct route in Top-5 |
| Precision@5 | Fraction of Top-5 that are relevant |
| Recall@5 | Fraction of relevant routes recovered |
| MRR | Mean Reciprocal Rank |
| NDCG@5 | Normalized Discounted Cumulative Gain |
| MAP@5 | Mean Average Precision |
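For reference, the six metrics can be computed per user as in the sketch below and then averaged across users. This is a minimal sketch under assumptions: binary relevance, no duplicate routes in the recommendation list, reciprocal rank computed within the top 5, and average precision normalized by min(number of relevant routes, 5). The notebook may use different conventions.

```python
import math


def ranking_metrics(recommended, relevant, k=5):
    """Per-user ranking metrics for a ranked list and a set of relevant routes."""
    top_k = recommended[:k]
    hits = [1 if route in relevant else 0 for route in top_k]
    n_hits = sum(hits)
    n_ideal = min(len(relevant), k)

    # Reciprocal rank of the first hit within the top k (0 if none).
    rr = next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0)

    # DCG with binary gains; IDCG assumes all relevant routes ranked first.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(n_ideal))

    # Average precision: mean of precision@i taken at each hit position.
    running, ap_sum = 0, 0.0
    for i, h in enumerate(hits):
        if h:
            running += 1
            ap_sum += running / (i + 1)

    return {
        "HR@5": 1.0 if n_hits > 0 else 0.0,
        "Precision@5": n_hits / k,
        "Recall@5": n_hits / len(relevant) if relevant else 0.0,
        "MRR": rr,
        "NDCG@5": dcg / idcg if idcg else 0.0,
        "MAP@5": ap_sum / n_ideal if n_ideal else 0.0,
    }
```

The dictionary keys are hard-coded for k=5 to match the table; averaging each value over all evaluated users gives the reported figures.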
Results per run are tracked in MLflow under /Shared/RecSys_NextStopAI_Experiment and persisted to recsys_gold.training_logs_v2.
| Component | Tool | Purpose |
|---|---|---|
| Compute | Databricks | PySpark + GPU cluster |
| Raw data | Delta Lake Bronze | Transactional interactions + Airport catalog |
| Curated data | Delta Lake Silver | Feature-rich dataset (103 cols) |
| Audit log | Delta Lake Gold | Historical training records |
| Experiment tracking | MLflow | Metrics, params, plots, artifacts |
| Model registry | MLflow + Databricks | Model versioning and serving |
| HPO | Optuna (TPE) | Hyperparameter optimization |
- Scalar metrics: HR@5, Precision@5, Recall@5, MRR, NDCG@5, MAP@5
- Best Optuna hyperparameters
- SASRec training loss curve (plots/sasrec_loss.png)
- Evaluation metrics bar chart (plots/metrics_bar.png)
- SASRec model weights (sasrec_model/sasrec_weights.pth)
- Inference context (context_artifacts/inference_context.pkl)
- LightGBM model with input signature for serving
import os
import random
import numpy as np
import torch

# Fix every source of randomness so runs are repeatable
SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED); np.random.seed(SEED)
torch.manual_seed(SEED); torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
# Optuna: TPESampler(seed=SEED)
# LightGBM: deterministic=True, force_col_wise=True

Databricks (recommended)
%pip install optuna lightgbm
dbutils.library.restartPython()

Local environment
pip install torch optuna lightgbm mlflow scikit-learn pandas numpy matplotlib seaborn pyspark

Key configuration
EXPERIMENT_NAME = "/Shared/RecSys_NextStopAI_Experiment"
MODEL_REGISTRY_NAME = "RecSys_NextStopAI_Model"
TABLE_LOG_METRICS = "recsys_gold.training_logs_v2"
TABLE_SILVER = "recsys_silver.dataset_curado_v2"
TABLE_BRONZE = "recsys_bronze.interactions_raw_v2"
TABLE_AIRPORTS = "recsys_bronze.airports"

| Field | Details |
|---|---|
| Authors | Jose Casiano · Marco Anchiraico · Miguel Silva |
| Degree | MSc Data Science & Artificial Intelligence |
| Year | 2025 |
| GitHub | @ergodicdev |
| Contact | nextstopai.research@gmail.com |