Major refactor and enhance step CLI with new features and fixes #243

Closed
martjanz wants to merge 79 commits into dev from feat/major-refactor


martjanz (Collaborator) commented Jun 8, 2026

This pull request restructures the project's automation and documentation infrastructure, introduces a performance review document, and adds a formal package dependency specification. The main themes are: (1) splitting and modernizing CI/CD workflows, (2) introducing a Makefile for local developer tasks, (3) documenting architectural performance bottlenecks and optimization patterns, and (4) specifying dependencies with a Pipfile.

CI/CD Workflow Restructuring and Improvements

  • The main build workflow (.github/workflows/build.yml) is replaced by more focused workflows. Publishing now runs only on version tags and only builds and publishes distributions to PyPI; the SonarCloud analysis, linting, and testing steps are removed from it.
  • New workflows are added for continuous integration (.github/workflows/tests.yml) covering linting, unit tests, integration tests, and package build, and for documentation generation (.github/workflows/docs.yml).

Developer Experience and Local Automation

  • A Makefile is introduced to standardize local development tasks for linting, running unit and integration tests, building the package, and running all CI steps together.

Performance Documentation and Optimization

  • A comprehensive performance review document (PERF_REVIEW.md) is added. It details architectural bottlenecks, prioritizes them, and provides actionable recommendations and a checklist for micro-optimizations, including code patterns to fix and step-by-step guidance.

Dependency Management

  • A Pipfile is added to formally specify both runtime and development dependencies, as well as the Python version, improving reproducibility and onboarding.

martjanz added 30 commits May 27, 2026 10:55
Add unit tests for DuckDB Arrow-registration memory safety,
verifying every register call is balanced by unregister and
that save_legs uses zero Arrow registrations via parquet staging.
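
For readers who want to picture the balance check described in this commit, here is a minimal sketch. It does not reproduce the PR's actual tests: `FakeConnection` and `insert_via_registration` are hypothetical names, and the fake only mimics the `register(name, obj)` / `unregister(name)` pair that a DuckDB connection exposes.

```python
import unittest


class FakeConnection:
    """Hypothetical stand-in for a DuckDB connection; it only records registrations."""

    def __init__(self):
        self.active = set()        # view names currently registered
        self.register_calls = 0
        self.unregister_calls = 0

    def register(self, name, obj):
        self.register_calls += 1
        self.active.add(name)

    def unregister(self, name):
        self.unregister_calls += 1
        self.active.discard(name)


def insert_via_registration(con, name, table):
    """Hypothetical helper: register a table, use it, and always unregister it."""
    con.register(name, table)
    try:
        return len(table)          # placeholder for the real INSERT ... SELECT
    finally:
        con.unregister(name)       # runs even if the work above raises


class RegistrationBalanceTest(unittest.TestCase):
    def test_every_register_is_unregistered(self):
        con = FakeConnection()
        for i in range(3):
            insert_via_registration(con, f"tmp_{i}", [1, 2, 3])
        self.assertEqual(con.register_calls, con.unregister_calls)
        self.assertEqual(con.active, set())
```

The `try`/`finally` is the design point: the registration is released on every exit path, which is what a balance assertion of this kind would catch if it were missing.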
Replace stale `h3_o` column references with `h3` in geo equivalence
and desire-line helpers. Compute polygon centroid from geometry
instead of relying on dropped `polygon_lat/lon` mean columns.
Resolves 18 files with conflict markers. Strategy: keep HEAD's StorageContext/
DuckDB architecture throughout, integrate VV's functional improvements:

- Rename assign_gps_destination → assign_time_distances; now computes
  distance_od, distance_route, distance_route_gps and saves to
  travel_times_legs/trips tables
- preparo_dashboard: query etapas/viajes with LEFT JOIN to travel_times_*
  to pull pre-computed distance and speed columns
- kpi.py: read_data_for_daily_kpi joins travel_times_legs; add inf cleanup
  and rounding to compute_kpi_by_line_day; add pd.set_option for pandas
- misc.py: persist_indicators queries use distance_od from travel_times_trips
- run_process.py: procesar_transacciones uses assign_time_distances, moves
  rearrange_trip_id_same_od before it, removes redundant add_distance calls
- routes.py: improved assert error message from VV
- kpi_lineas.py: column renames (distance_km→distance_route, distancia→
  distance_od, travel_speed→kmh_od) and vehiculos_operativos logic
- configs: alias_db_insumos key, new GPS column names preserved from VV
… fo_mean) to VV names throughout dashboard and schema
martjanz and others added 23 commits June 2, 2026 20:38
fix in destination for min distance and fix in legs to use use_gps instead of nombre_archivo_gps
fix in destination for min distance and fix in legs to use use_gps instead of nombre_archivo_gps
Replace per-worker DuckDB reconnects with a single scan per
CPU-chunk: each chunk loads only its batches' rows, capping peak
RAM to chunk_size × batch_size. Saves overlap the next chunk read
via a background thread.

- Hash-based batch partitioning replaces stored batch_id column
- Workers receive pre-split DataFrames; no in-worker file I/O
- DuckDB buffer pool configurable via tuning.yaml or auto-tuned
  to 25 % of total RAM (floor 1 GB)
- n_batches auto-tune accounts for CPU workers to keep them fed
- Add usa_archivo_gps config flag as alternative to naming the
  GPS file explicitly
- Vectorize timestamp conversion in ingestion (drop row-wise apply)
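
A rough sketch of two ideas from this commit, hash-based batch assignment and the auto-tuned buffer pool size. The function names, the choice of CRC32 as the hash, and the reading of "1 GB" as 1024³ bytes are all assumptions made for illustration; the PR's actual implementation may differ.

```python
import zlib


def batch_of(key: str, n_batches: int) -> int:
    """Map a key to a batch deterministically, with no stored batch_id column."""
    # crc32 is stable across processes, unlike Python's salted built-in hash().
    return zlib.crc32(key.encode("utf-8")) % n_batches


def chunks_of_batches(n_batches: int, chunk_size: int):
    """Yield consecutive groups of batch numbers, one group per parallel chunk."""
    for start in range(0, n_batches, chunk_size):
        yield list(range(start, min(start + chunk_size, n_batches)))


def buffer_pool_bytes(total_ram_bytes: int) -> int:
    """Auto-tuned pool size: 25 % of total RAM, never below 1 GB (assumed 1024**3)."""
    one_gb = 1024 ** 3
    return max(total_ram_bytes // 4, one_gb)


# Example: 10 batches processed 4 at a time gives chunks of 4, 4 and 2 batches.
groups = list(chunks_of_batches(10, 4))
```

With a stable hash, any worker can recompute which batch a row belongs to, so the batch number does not need to be persisted alongside the data.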
Round the auto-tuned batch count up to the nearest multiple of
`cpu_workers` so every parallel chunk is fully packed, avoiding
a partial last chunk that would leave workers idle.
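
The rounding rule in this commit can be written with integer arithmetic alone. This is a sketch; `round_up_to_multiple` is a hypothetical name, not one taken from the PR.

```python
def round_up_to_multiple(n_batches: int, cpu_workers: int) -> int:
    """Round n_batches up to the nearest multiple of cpu_workers."""
    if cpu_workers < 1:
        raise ValueError("cpu_workers must be at least 1")
    n_batches = max(n_batches, 1)  # always keep at least one full chunk
    # Ceiling division without floats: -(-a // b) == ceil(a / b) for positive b.
    full_chunks = -(-n_batches // cpu_workers)
    return full_chunks * cpu_workers
```

For example, 10 batches with 4 workers becomes 12, so the last chunk holds 4 batches rather than 2, while 8 batches with 4 workers stays at 8.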
…earrange

Use a targeted `update_leg_trip_ids` port method that updates only
`id_viaje`/`id_etapa` columns matched by `id`, avoiding the expensive
parquet staging and full DELETE+INSERT cycle used by `save_legs`.

Also switch batch-mode DELETE to filter by `batch_id` (indexed) instead
of joining on `id` from a parquet scan, preventing an O(n²) scan as
the etapas table grows across batches.
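
To illustrate the difference between a targeted update and a full rewrite, here is a sketch that uses the standard-library `sqlite3` module as a stand-in for DuckDB. The table layout, the index name, and the signature of `update_leg_trip_ids` are assumptions; only the method name and the column names `id`, `id_viaje`, `id_etapa`, and `batch_id` come from the commit message.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory etapas table (assumed schema, for illustration only)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE etapas ("
        "id INTEGER PRIMARY KEY, id_viaje INTEGER, id_etapa INTEGER, batch_id INTEGER)"
    )
    con.execute("CREATE INDEX idx_etapas_batch ON etapas (batch_id)")
    con.executemany(
        "INSERT INTO etapas VALUES (?, ?, ?, ?)",
        [(1, 10, 1, 0), (2, 10, 2, 0), (3, 11, 1, 1)],
    )
    return con


def update_leg_trip_ids(con, rows):
    """Update only id_viaje/id_etapa for the given ids; rows are (id, id_viaje, id_etapa)."""
    con.executemany(
        "UPDATE etapas SET id_viaje = ?, id_etapa = ? WHERE id = ?",
        [(id_viaje, id_etapa, leg_id) for leg_id, id_viaje, id_etapa in rows],
    )


def delete_batch(con, batch_id):
    """Delete one batch by filtering on the indexed batch_id column."""
    con.execute("DELETE FROM etapas WHERE batch_id = ?", (batch_id,))
```

The update touches two columns on the matched rows and leaves everything else in place, and the delete filters on a single indexed value instead of joining against a list of ids.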
- Filter legs by min_fecha_d < 20 before computing h3 distances,
  reducing the number of h3.grid_distance calls
- Deduplicate (h3_d_gps_res, h3) pairs before iterating to avoid
  redundant calls, then merge results back
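
The deduplication step can be sketched in plain Python as follows. `distances_for_pairs` is a hypothetical name, and `distance_fn` stands in for whatever distance call is used (the commit mentions `h3.grid_distance`); the sketch does not call the h3 library itself.

```python
def distances_for_pairs(pairs, distance_fn):
    """Call distance_fn once per distinct pair, then map results back to every row."""
    unique_pairs = set(pairs)                       # drop repeated (origin, destination) pairs
    cache = {pair: distance_fn(*pair) for pair in unique_pairs}
    return [cache[pair] for pair in pairs]          # same length and order as the input


# Usage with a toy distance function that records how often it is called.
calls = []


def toy_distance(a, b):
    calls.append((a, b))
    return abs(a - b)


result = distances_for_pairs([(1, 4), (2, 2), (1, 4)], toy_distance)
```

Here three input rows trigger only two distance computations, because the pair `(1, 4)` appears twice.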

feat(kpi): include route distances in service KPI demand queries

- Join travel_times_legs in both GPS and non-GPS demand CTEs to
  propagate distance_route and distance_route_gps columns
- Fix distance_col reference: 'distance' → 'distance_od'
Replace sequential dia/hora loop with ThreadPoolExecutor,
extract _process_dia_hora helper, and pre-filter GPS rows
to vehicle boarding times to reduce join cardinality.
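
A minimal sketch of replacing a sequential day/hour loop with a thread pool. Only the helper name `_process_dia_hora` and the use of `ThreadPoolExecutor` come from the commit message; the helper's body, its arguments, and `process_all` are placeholders invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def _process_dia_hora(dia, hora):
    """Placeholder for the per-(dia, hora) work; the real helper would query and join data."""
    return (dia, hora, f"{dia}T{hora:02d}")


def process_all(dias, horas, max_workers=4):
    """Run every (dia, hora) combination in a thread pool instead of a nested loop."""
    tasks = [(dia, hora) for dia in dias for hora in horas]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the submitted tasks.
        return list(pool.map(lambda task: _process_dia_hora(*task), tasks))
```

Threads tend to help here when each task spends most of its time waiting on I/O or on library code that releases the GIL; for pure-Python CPU work the gain would likely be small.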
martjanz self-assigned this Jun 8, 2026
martjanz added the documentation and enhancement labels Jun 8, 2026
martjanz requested review from alephcero and sanapolsky June 8, 2026 21:49
martjanz (Collaborator, Author)

Outdated. Superseded by #244

martjanz closed this Jun 12, 2026
3 participants