Major refactor and enhance step CLI with new features and fixes by martjanz · Pull Request #243 · EL-BID/UrbanTrips

martjanz · 2026-06-08T21:12:09Z

This pull request restructures the project's automation and documentation infrastructure, introduces a performance review document, and adds a formal package dependency specification. The main themes are: (1) splitting and modernizing CI/CD workflows, (2) introducing a Makefile for local developer tasks, (3) documenting architectural performance bottlenecks and optimization patterns, and (4) specifying dependencies with a Pipfile.

CI/CD Workflow Restructuring and Improvements

The main build workflow (.github/workflows/build.yml) is replaced with more focused workflows: publishing is now triggered only on version tags, and the process is simplified to only build and publish distributions to PyPI, removing SonarCloud analysis, linting, and testing steps.
New workflows are added for continuous integration (.github/workflows/tests.yml) covering linting, unit tests, integration tests, and package build, and for documentation generation (.github/workflows/docs.yml).

Developer Experience and Local Automation

A Makefile is introduced to standardize local development tasks for linting, running unit and integration tests, building the package, and running all CI steps together.

Performance Documentation and Optimization

A comprehensive performance review document (PERF_REVIEW.md) is added. It details architectural bottlenecks, prioritizes them, and provides actionable recommendations and a checklist for micro-optimizations, including code patterns to fix and step-by-step guidance.

Dependency Management

A Pipfile is added to formally specify both runtime and development dependencies, as well as the Python version, improving reproducibility and onboarding.

Resolves 18 files with conflict markers. Strategy: keep HEAD's StorageContext/DuckDB architecture throughout, integrate VV's functional improvements:

- Rename assign_gps_destination → assign_time_distances; now computes distance_od, distance_route, distance_route_gps and saves to travel_times_legs/trips tables
- preparo_dashboard: query etapas/viajes with LEFT JOIN to travel_times_* to pull pre-computed distance and speed columns
- kpi.py: read_data_for_daily_kpi joins travel_times_legs; add inf cleanup and rounding to compute_kpi_by_line_day; add pd.set_option for pandas
- misc.py: persist_indicators queries use distance_od from travel_times_trips
- run_process.py: procesar_transacciones uses assign_time_distances, moves rearrange_trip_id_same_od before it, removes redundant add_distance calls
- routes.py: improved assert error message from VV
- kpi_lineas.py: column renames (distance_km→distance_route, distancia→distance_od, travel_speed→kmh_od) and vehiculos_operativos logic
- configs: alias_db_insumos key, new GPS column names preserved from VV

Replace per-worker DuckDB reconnects with a single scan per CPU-chunk: each chunk loads only its batches' rows, capping peak RAM to chunk_size × batch_size. Saves overlap the next chunk read via a background thread.

- Hash-based batch partitioning replaces stored batch_id column
- Workers receive pre-split DataFrames; no in-worker file I/O
- DuckDB buffer pool configurable via tuning.yaml or auto-tuned to 25% of total RAM (floor 1 GB)
- n_batches auto-tune accounts for CPU workers to keep them fed
- Add usa_archivo_gps config flag as alternative to naming the GPS file explicitly
- Vectorize timestamp conversion in ingestion (drop row-wise apply)
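
As a rough illustration of the partitioning and buffer-pool sizing described above, here is a minimal sketch. The function names, the choice of CRC32 as the hash, the key being a string identifier, and the reading of "1 GB" as 1 GiB are all assumptions made for the example; the PR does not show the actual implementation.

```python
import zlib
from typing import List, Optional

GIB = 1024 ** 3  # assumption: the "1 GB" floor is treated as 1 GiB here


def batch_of(key: str, n_batches: int) -> int:
    """Map a key to a stable batch index in [0, n_batches).

    zlib.crc32 is deterministic across processes, unlike the built-in
    hash() for strings, so no batch_id column needs to be stored.
    """
    return zlib.crc32(key.encode("utf-8")) % n_batches


def chunk_batches(n_batches: int, cpu_workers: int) -> List[List[int]]:
    """Group batch indices into chunks of at most cpu_workers batches."""
    ids = list(range(n_batches))
    return [ids[i:i + cpu_workers] for i in range(0, n_batches, cpu_workers)]


def buffer_pool_bytes(total_ram_bytes: int, configured: Optional[int] = None) -> int:
    """Use the configured size if given, else 25% of total RAM with a 1 GiB floor."""
    if configured is not None:
        return configured
    return max(GIB, total_ram_bytes // 4)
```

With these helpers, a 16 GiB machine would get a 4 GiB buffer pool, while a 2 GiB machine would fall back to the 1 GiB floor.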

…earrange

Use a targeted `update_leg_trip_ids` port method that updates only `id_viaje`/`id_etapa` columns matched by `id`, avoiding the expensive parquet staging and full DELETE+INSERT cycle used by `save_legs`. Also switch batch-mode DELETE to filter by `batch_id` (indexed) instead of joining on `id` from a parquet scan, preventing an O(n²) scan as the etapas table grows across batches.
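
The difference between a targeted update and a full rewrite can be sketched as follows. This example uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs anywhere. The table layout, the `modo` column, and the signature of `update_leg_trip_ids` are assumptions for illustration and may not match the real port method.

```python
import sqlite3


def update_leg_trip_ids(conn, rows):
    """Update only id_viaje/id_etapa for rows matched by id.

    rows is an iterable of (id, id_viaje, id_etapa) tuples (assumed shape).
    Other columns are left untouched, so no DELETE+INSERT is needed.
    """
    conn.executemany(
        "UPDATE etapas SET id_viaje = ?, id_etapa = ? WHERE id = ?",
        [(id_viaje, id_etapa, id_) for id_, id_viaje, id_etapa in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etapas ("
    "id INTEGER PRIMARY KEY, batch_id INTEGER, "
    "id_viaje INTEGER, id_etapa INTEGER, modo TEXT)"
)
conn.execute("CREATE INDEX idx_etapas_batch ON etapas (batch_id)")
conn.executemany(
    "INSERT INTO etapas VALUES (?, ?, ?, ?, ?)",
    [(1, 0, 1, 1, "bus"), (2, 0, 1, 2, "bus"), (3, 1, 1, 1, "metro")],
)

# Targeted update: leg 2 becomes the first leg of trip 2.
update_leg_trip_ids(conn, [(2, 2, 1)])
result = conn.execute(
    "SELECT id, id_viaje, id_etapa, modo FROM etapas ORDER BY id"
).fetchall()

# Batch-mode delete filtered by the indexed batch_id column.
conn.execute("DELETE FROM etapas WHERE batch_id = ?", (1,))
remaining = [row[0] for row in conn.execute("SELECT id FROM etapas ORDER BY id")]
```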

- Filter legs by min_fecha_d < 20 before computing h3 distances, reducing the number of h3.grid_distance calls
- Deduplicate (h3_d_gps_res, h3) pairs before iterating to avoid redundant calls, then merge results back

feat(kpi): include route distances in service KPI demand queries

- Join travel_times_legs in both GPS and non-GPS demand CTEs to propagate distance_route and distance_route_gps columns
- Fix distance_col reference: 'distance' → 'distance_od'
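
The deduplication idea can be sketched in plain Python: compute the distance once per unique pair, then map the results back onto the original rows. The helper name `distances_for_pairs` is hypothetical, and `fake_distance` is only a stand-in for a real call such as `h3.grid_distance`.

```python
def distances_for_pairs(pairs, distance_fn):
    """Return one distance per input pair, calling distance_fn once per unique pair."""
    # Evaluate the expensive function only on distinct pairs.
    cache = {pair: distance_fn(*pair) for pair in set(pairs)}
    # "Merge back": look up each original pair in the cache, preserving order.
    return [cache[pair] for pair in pairs]


calls = []


def fake_distance(a, b):
    """Stand-in for an expensive distance call; records each invocation."""
    calls.append((a, b))
    return abs(a - b)


pairs = [(1, 4), (2, 2), (1, 4), (1, 4)]
out = distances_for_pairs(pairs, fake_distance)
```

Here four input rows trigger only two distance computations, since three rows share the same pair.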

martjanz · 2026-06-12T18:06:37Z

Outdated. Superseded by #244

martjanz added 30 commits May 27, 2026 10:55

feat: major refactor

3d27d0c

fix: unregister DuckDB views and include dia in trip groupby

645f2e0

fix: guard save_legs against empty DataFrame early return

fc7562b

Add unit tests for DuckDB Arrow-registration memory safety, verifying every register call is balanced by unregister and that save_legs uses zero Arrow registrations via parquet staging.

docs: add step CLI design spec

2bcaa02

docs: add step CLI implementation plan

417c7e3

docs: add incremental run behaviour notes to step CLI spec

094ce35

refactor: extract run_ingest/legs/outputs/dashboard public step functions

f05c6f5

feat: add check_prerequisites step guard

52b3a7f

feat: add --step and --through CLI flags for single-step execution

5c796c5
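
One way such flags could be wired up with `argparse` is sketched below. The step names are taken from the `run_ingest/legs/outputs/dashboard` functions mentioned in this PR, but the semantics shown (`--step` runs one step, `--through` runs from the first step up to and including the named one, and the two are mutually exclusive) are assumptions, not the confirmed behaviour of the actual CLI.

```python
import argparse

# Assumed step order, based on the step functions named in this PR.
STEPS = ["ingest", "legs", "outputs", "dashboard"]


def build_parser():
    parser = argparse.ArgumentParser(prog="urbantrips-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--step", choices=STEPS, help="run only this step")
    group.add_argument(
        "--through", choices=STEPS, help="run from the first step through this one"
    )
    return parser


def steps_to_run(args):
    """Resolve parsed flags into the ordered list of steps to execute."""
    if args.step:
        return [args.step]
    if args.through:
        return STEPS[: STEPS.index(args.through) + 1]
    return list(STEPS)  # no flag: run the whole pipeline


parser = build_parser()
only_legs = steps_to_run(parser.parse_args(["--step", "legs"]))
up_to_outputs = steps_to_run(parser.parse_args(["--through", "outputs"]))
everything = steps_to_run(parser.parse_args([]))
```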

fix: use h3 column name and polygon centroid correctly

86fdc06

Replace stale `h3_o` column references with `h3` in geo equivalence and desire-line helpers. Compute polygon centroid from geometry instead of relying on dropped `polygon_lat/lon` mean columns.

docs: add scaling considerations and memory guidelines to step CLI spec

951de58

feat: auto-tune n_batches from available RAM when not configured

529cb30

feat: improve n_batches logging with RAM stats, source, and decision rationale

8ffa687

docs: add merge report for validar_velocidad into refactor-clean

b05b747

fix: remove dead VV code block using undefined conn in create_trips_from_legs_and_fex

7d09822

fix: align run_basic_kpi to use kmh_route_veh_h (renamed by VV helper functions)

ea2b5b9

fix: align compute_basic_kpi_line_typeday/hr_typeday schema to match run_basic_kpi output

7c4e850

fix: wire assign_time_distances into _enrich_all_legs; safe DELETE for first-run tables

ebe0789

fix: drop travel_times tables before recreating to enforce current schema

1c0c0ee

fix: update travel_times schema to 12/10 cols matching assign_time_distances output

fc54f17

fix: rename old column names (distancia, travel_speed, dmt_mean, ipk, fo_mean) to VV names throughout dashboard and schema

2a5eb17

fix: restore metric_cols definition and fix all/all_linea/all_ramal in resumen_x_linea

d0ae531

fix: update services schema and callers to use distance_route instead of distance_km

59bb791

feat: add --config flag to point to an alternative YAML config file

fc4bb7d

feat: separate alias_db (outputs) from alias_db_insumos, add db_path config key

7ea699f

fix: remove legacy SQLite block from borrar_corridas that crashed on configs without alias_db_insumos

47298bb

fix: URBANTRIPS_CONFIG env var only overrides autogenerado=False reads, not autogenerated config

fb93158

refactor: bypass autogenerado config in run_process; derive filenames inline from corrida name

48d93f9

fix: remove stray compute_distance_km_gps call with invalid use_pandana argument

d5d177d

martjanz and others added 23 commits June 2, 2026 20:38

perf(b3): DBSCAN grid reduction, early stopping, and tuning config

2e8922c

perf(b5): push scalar derived columns into DuckDB SQL in load_and_process_data

a0c75da

perf(b1): replace 9 pandas groupby scans with DuckDB queries in construyo_indicadores

d1415eb

Merge perf/b2-remove-copies (resolve test file conflict)

518fc05

Merge perf/b4-h3-polygon-fill (resolve test file conflict)

d273744

Merge perf/b3-dbscan-grid (resolve test file conflict)

9b82b54

Merge perf/b5-derived-columns-sql (resolve test file conflict)

ffe1214

Merge perf/b1-construyo-indicadores (resolve test file conflict)

e3dea3f

Fix typo in B5 test fixture (outros -> otros)

9e2d6dc

perf(micro): fix remaining apply(h3.cell_to_parent) in viz.py, mark Part 2 complete

ba5c3e0

fix_destinations_legs

87f7149

fix in destination for min distance and fix in legs to use use_gps instead of nombre_archivo_gps

add batches to config

b2ad30c

fix_destinations_legs

3ead46a

fix in destination for min distance and fix in legs to use use_gps instead of nombre_archivo_gps

add batches to config

8540a05

refactor(run_process): align n_batches to cpu_workers multiple

f34b3e9

Round the auto-tuned batch count up to the nearest multiple of `cpu_workers` so every parallel chunk is fully packed, avoiding a partial last chunk that would leave workers idle.
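
The rounding rule amounts to a ceiling division followed by a multiplication. A minimal sketch, with a hypothetical function name:

```python
def align_batches(n_batches: int, cpu_workers: int) -> int:
    """Round n_batches up to the nearest multiple of cpu_workers."""
    if n_batches < 1 or cpu_workers < 1:
        raise ValueError("n_batches and cpu_workers must be positive")
    # Integer ceiling division avoids floating-point rounding issues.
    chunks = (n_batches + cpu_workers - 1) // cpu_workers
    return chunks * cpu_workers
```

For example, 10 batches with 4 workers would become 12 (three full chunks of 4), rather than two full chunks plus a partial chunk of 2 that leaves two workers idle.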

update_configuraciones.xlsx

96ddaeb

perf(legs): parallelize GPS destination imputation

f5b1267

Replace sequential dia/hora loop with ThreadPoolExecutor, extract _process_dia_hora helper, and pre-filter GPS rows to vehicle boarding times to reduce join cardinality.
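
The general shape of this change can be sketched as below. Only the helper name `_process_dia_hora` and the use of `ThreadPoolExecutor` come from the commit message. The record layout (`vehicle`, `time` keys), the `process_all` wrapper, and the body of the helper are assumptions; the real helper presumably performs the destination imputation rather than just counting matching rows.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def _process_dia_hora(key, legs, gps_rows):
    """Process one (dia, hora) group; here it only counts matching GPS rows."""
    # Pre-filter: keep only GPS rows that match a boarding (vehicle, time)
    # in this group, so any later join works on far fewer rows.
    boardings = {(leg["vehicle"], leg["time"]) for leg in legs}
    relevant = [g for g in gps_rows if (g["vehicle"], g["time"]) in boardings]
    return key, len(relevant)


def process_all(legs, gps_rows, max_workers=4):
    """Group legs by (dia, hora) and process the groups concurrently."""
    groups = defaultdict(list)
    for leg in legs:
        groups[(leg["dia"], leg["hora"])].append(leg)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(_process_dia_hora, key, group, gps_rows)
            for key, group in groups.items()
        ]
        return dict(f.result() for f in futures)


legs = [
    {"dia": "2026-01-01", "hora": 8, "vehicle": "A", "time": 100},
    {"dia": "2026-01-01", "hora": 8, "vehicle": "B", "time": 105},
    {"dia": "2026-01-01", "hora": 9, "vehicle": "A", "time": 200},
]
gps = [
    {"vehicle": "A", "time": 100},
    {"vehicle": "A", "time": 150},
    {"vehicle": "B", "time": 105},
    {"vehicle": "A", "time": 200},
]
counts = process_all(legs, gps)
```

Threads tend to help most when the per-group work releases the GIL, for example inside database or NumPy calls; for pure-Python work the gain may be small.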

fix: macOS GPS serial fallback, selective destination UPDATE

8d32346

fix: correct hora_inicio/hora_fin type to TEXT, add migration

43784ef

Merge feat/refactor-clean-merge-vv-improv into feat/improvements-tmp

60b3f47

martjanz self-assigned this Jun 8, 2026

martjanz added the documentation and enhancement labels Jun 8, 2026

martjanz requested review from alephcero and sanapolsky June 8, 2026 21:49

martjanz closed this Jun 12, 2026

martjanz wants to merge 79 commits into dev from feat/major-refactor

3 participants