Major refactor and enhance step CLI with new features and fixes #243
Closed
martjanz wants to merge 79 commits into
Add unit tests for DuckDB Arrow-registration memory safety, verifying every register call is balanced by unregister and that save_legs uses zero Arrow registrations via parquet staging.
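A minimal sketch of how such a balance check could look, using a mock in place of a real DuckDB connection. The `registered` helper and the view names are hypothetical and not taken from the PR; only `register` and `unregister` correspond to real connection methods.

```python
from contextlib import contextmanager
from unittest.mock import MagicMock


@contextmanager
def registered(con, name, table):
    """Register `table` under `name` and always unregister it on exit."""
    con.register(name, table)
    try:
        yield name
    finally:
        # Runs on success and on error, so the calls stay balanced.
        con.unregister(name)


# Hypothetical stand-in for a DuckDB connection; no database is opened.
con = MagicMock()

with registered(con, "tmp_legs", object()):
    con.execute("SELECT * FROM tmp_legs")

try:
    with registered(con, "tmp_trips", object()):
        raise RuntimeError("simulated failure mid-query")
except RuntimeError:
    pass

# (register calls, unregister calls) recorded by the mock.
balance = (con.register.call_count, con.unregister.call_count)
```

A test in this style asserts that both counts match and that the registered names equal the unregistered names, including on the failure path.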
Replace stale `h3_o` column references with `h3` in geo equivalence and desire-line helpers. Compute polygon centroid from geometry instead of relying on dropped `polygon_lat/lon` mean columns.
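For illustration only, the sketch below computes an area-weighted centroid directly from polygon vertices with the shoelace formula. It is an assumption that this resembles the geometry-based calculation; the PR most likely relies on a geometry library, and `polygon_centroid` is a hypothetical name.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) pairs."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # The signed area keeps the result correct for either orientation.
    return cx / (3 * twice_area), cy / (3 * twice_area)


# Unit square with one extra vertex on the bottom edge.
square = [(0, 0), (0.5, 0), (1, 0), (1, 1), (0, 1)]
centroid = polygon_centroid(square)
vertex_mean = (
    sum(x for x, _ in square) / len(square),
    sum(y for _, y in square) / len(square),
)
```

The extra vertex shifts the plain vertex mean away from the true centre, while the geometry-based centroid is unaffected.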
Resolves 18 files with conflict markers. Strategy: keep HEAD's StorageContext/DuckDB architecture throughout, integrate VV's functional improvements:
- Rename assign_gps_destination → assign_time_distances; now computes distance_od, distance_route, distance_route_gps and saves to travel_times_legs/trips tables
- preparo_dashboard: query etapas/viajes with LEFT JOIN to travel_times_* to pull pre-computed distance and speed columns
- kpi.py: read_data_for_daily_kpi joins travel_times_legs; add inf cleanup and rounding to compute_kpi_by_line_day; add pd.set_option for pandas
- misc.py: persist_indicators queries use distance_od from travel_times_trips
- run_process.py: procesar_transacciones uses assign_time_distances, moves rearrange_trip_id_same_od before it, removes redundant add_distance calls
- routes.py: improved assert error message from VV
- kpi_lineas.py: column renames (distance_km→distance_route, distancia→distance_od, travel_speed→kmh_od) and vehiculos_operativos logic
- configs: alias_db_insumos key, new GPS column names preserved from VV
…run_basic_kpi output
…r first-run tables
… fo_mean) to VV names throughout dashboard and schema
…n resumen_x_linea
…configs without alias_db_insumos
…s, not autogenerated config
…s inline from corrida name
fix in destination for min distance and fix in legs to use use_gps instead of nombre_archivo_gps
Replace per-worker DuckDB reconnects with a single scan per CPU-chunk: each chunk loads only its batches' rows, capping peak RAM to chunk_size × batch_size. Saves overlap the next chunk read via a background thread.
- Hash-based batch partitioning replaces stored batch_id column
- Workers receive pre-split DataFrames; no in-worker file I/O
- DuckDB buffer pool configurable via tuning.yaml or auto-tuned to 25% of total RAM (floor 1 GB)
- n_batches auto-tune accounts for CPU workers to keep them fed
- Add usa_archivo_gps config flag as alternative to naming the GPS file explicitly
- Vectorize timestamp conversion in ingestion (drop row-wise apply)
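The partitioning and sizing rules above can be sketched with the standard library alone. Everything here is an assumption for illustration: the function names, the use of CRC32 as the hash, the card-id key, and reading "1 GB" as 2**30 bytes are not taken from the PR.

```python
import zlib
from collections import defaultdict


def batch_of(key, n_batches):
    """Deterministically map a key to a batch index in [0, n_batches)."""
    return zlib.crc32(key.encode("utf-8")) % n_batches


def chunk_batches(n_batches, chunk_size):
    """Group batch indices into chunks of at most chunk_size batches."""
    return [
        list(range(start, min(start + chunk_size, n_batches)))
        for start in range(0, n_batches, chunk_size)
    ]


def buffer_pool_bytes(total_ram_bytes):
    """25% of total RAM, never below 1 GiB (assumed meaning of the floor)."""
    return max(total_ram_bytes // 4, 1 << 30)


# Hypothetical partition key: a card identifier.
cards = ["card-%d" % i for i in range(100)]
n_batches = 8
batches = defaultdict(list)
for card in cards:
    batches[batch_of(card, n_batches)].append(card)

# With 4 batches per chunk, only one chunk's rows need to be in memory at once.
chunks = chunk_batches(n_batches, chunk_size=4)
```

Because the batch index is derived from the key, no batch_id column has to be stored, and the same key always lands in the same batch across runs.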
Round the auto-tuned batch count up to the nearest multiple of `cpu_workers` so every parallel chunk is fully packed, avoiding a partial last chunk that would leave workers idle.
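The rounding rule amounts to an integer ceiling. A minimal sketch, with a hypothetical function name:

```python
def round_up_to_multiple(n_batches, cpu_workers):
    """Smallest multiple of cpu_workers that is >= n_batches."""
    if cpu_workers < 1:
        raise ValueError("cpu_workers must be at least 1")
    # Integer ceiling division, then scale back up.
    return ((n_batches + cpu_workers - 1) // cpu_workers) * cpu_workers


# 10 batches on 4 workers would leave a last chunk of 2; 12 packs 3 full chunks.
packed = round_up_to_multiple(10, 4)
```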
…earrange

Use a targeted `update_leg_trip_ids` port method that updates only `id_viaje`/`id_etapa` columns matched by `id`, avoiding the expensive parquet staging and full DELETE+INSERT cycle used by `save_legs`.

Also switch batch-mode DELETE to filter by `batch_id` (indexed) instead of joining on `id` from a parquet scan, preventing an O(n²) scan as the etapas table grows across batches.
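The two statement shapes can be illustrated as follows. This sketch uses `sqlite3` as a stand-in for DuckDB so that it runs with the standard library; the function signatures, the `h3` payload column, and the index name are assumptions, not the PR's actual port API.

```python
import sqlite3


def update_leg_trip_ids(con, rows):
    """Update only id_viaje/id_etapa for rows matched by id.

    `rows` is an iterable of (id, id_viaje, id_etapa) tuples (assumed shape).
    """
    con.executemany(
        "UPDATE etapas SET id_viaje = ?, id_etapa = ? WHERE id = ?",
        [(id_viaje, id_etapa, leg_id) for leg_id, id_viaje, id_etapa in rows],
    )
    con.commit()


def delete_batch(con, batch_id):
    """Batch-mode delete that filters on batch_id instead of joining on id."""
    con.execute("DELETE FROM etapas WHERE batch_id = ?", (batch_id,))
    con.commit()


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE etapas ("
    "id INTEGER PRIMARY KEY, id_viaje INTEGER, id_etapa INTEGER, "
    "batch_id INTEGER, h3 TEXT)"
)
con.execute("CREATE INDEX idx_etapas_batch ON etapas (batch_id)")
con.executemany(
    "INSERT INTO etapas VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 0, "a"), (2, 1, 2, 0, "b"), (3, 2, 1, 1, "c")],
)

# Only the two id columns change; batch_id and h3 are left untouched.
update_leg_trip_ids(con, [(2, 2, 1), (3, 3, 1)])
after_update = con.execute("SELECT * FROM etapas ORDER BY id").fetchall()

delete_batch(con, 0)
after_delete = con.execute("SELECT id FROM etapas ORDER BY id").fetchall()
```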
- Filter legs by min_fecha_d < 20 before computing h3 distances, reducing the number of h3.grid_distance calls
- Deduplicate (h3_d_gps_res, h3) pairs before iterating to avoid redundant calls, then merge results back

feat(kpi): include route distances in service KPI demand queries
- Join travel_times_legs in both GPS and non-GPS demand CTEs to propagate distance_route and distance_route_gps columns
- Fix distance_col reference: 'distance' → 'distance_od'
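The deduplicate-then-merge step for (h3_d_gps_res, h3) pairs can be sketched as below. To stay self-contained it uses a fake distance function instead of `h3.grid_distance`, and omits the min_fecha_d filter; the helper name and the output column `distance` are assumptions.

```python
def attach_pair_distance(rows, distance_fn):
    """Call distance_fn once per unique pair, then map results back to rows."""
    unique_pairs = {(r["h3_d_gps_res"], r["h3"]) for r in rows}
    lookup = {pair: distance_fn(*pair) for pair in unique_pairs}
    return [
        dict(r, distance=lookup[(r["h3_d_gps_res"], r["h3"])]) for r in rows
    ]


calls = []


def fake_grid_distance(a, b):
    """Stand-in for an expensive distance call; records each invocation."""
    calls.append((a, b))
    return abs(len(a) - len(b))


rows = [
    {"id": 1, "h3_d_gps_res": "aa", "h3": "a"},
    {"id": 2, "h3_d_gps_res": "aa", "h3": "a"},
    {"id": 3, "h3_d_gps_res": "bbbb", "h3": "b"},
    {"id": 4, "h3_d_gps_res": "aa", "h3": "a"},
]
result = attach_pair_distance(rows, fake_grid_distance)
```

Four rows trigger only two distance calls here, one per distinct pair.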
Replace sequential dia/hora loop with ThreadPoolExecutor, extract _process_dia_hora helper, and pre-filter GPS rows to vehicle boarding times to reduce join cardinality.
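A rough sketch of that structure, under stated assumptions: the signature and return value of `_process_dia_hora`, the row dictionaries, and the `process_all` wrapper are hypothetical, and the real helper presumably performs a join rather than a count.

```python
from concurrent.futures import ThreadPoolExecutor


def _process_dia_hora(dia, hora, boardings, gps_rows):
    """Count GPS points for vehicles that have a boarding in this slot."""
    vehicles = {
        b["vehicle"] for b in boardings if b["dia"] == dia and b["hora"] == hora
    }
    # Pre-filter: drop GPS rows that cannot match any boarding in the slot.
    matched = [
        g
        for g in gps_rows
        if g["dia"] == dia and g["hora"] == hora and g["vehicle"] in vehicles
    ]
    return (dia, hora, len(matched))


def process_all(boardings, gps_rows, max_workers=4):
    """Run every (dia, hora) slot in a thread pool, keeping slot order."""
    slots = sorted({(b["dia"], b["hora"]) for b in boardings})
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(_process_dia_hora, dia, hora, boardings, gps_rows)
            for dia, hora in slots
        ]
        return [f.result() for f in futures]


boardings = [
    {"dia": "2024-01-01", "hora": 8, "vehicle": "A"},
    {"dia": "2024-01-01", "hora": 9, "vehicle": "B"},
]
gps_rows = [
    {"dia": "2024-01-01", "hora": 8, "vehicle": "A"},
    {"dia": "2024-01-01", "hora": 8, "vehicle": "A"},
    {"dia": "2024-01-01", "hora": 8, "vehicle": "B"},
    {"dia": "2024-01-01", "hora": 9, "vehicle": "B"},
    {"dia": "2024-01-01", "hora": 9, "vehicle": "C"},
]
results = process_all(boardings, gps_rows)
```

Collecting results from the futures in submission order keeps the output deterministic even though slots finish in arbitrary order.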
Outdated. Superseded by #244.
This pull request restructures the project's automation and documentation infrastructure, introduces a performance review document, and adds a formal package dependency specification. The main themes are: (1) splitting and modernizing CI/CD workflows, (2) introducing a Makefile for local developer tasks, (3) documenting architectural performance bottlenecks and optimization patterns, and (4) specifying dependencies with a Pipfile.

CI/CD Workflow Restructuring and Improvements
- The existing workflow (.github/workflows/build.yml) is replaced with more focused workflows: publishing is now triggered only on version tags, and the process is reduced to building and publishing distributions to PyPI, removing the SonarCloud analysis, linting, and testing steps.
- New workflows are added for CI checks (.github/workflows/tests.yml), covering linting, unit tests, integration tests, and package build, and for documentation generation (.github/workflows/docs.yml).

Developer Experience and Local Automation
- A Makefile is introduced to standardize local development tasks: linting, running unit and integration tests, building the package, and running all CI steps together.

Performance Documentation and Optimization
- A performance review document (PERF_REVIEW.md) is added. It details architectural bottlenecks, prioritizes them, and provides actionable recommendations and a checklist for micro-optimizations, including code patterns to fix and step-by-step guidance.

Dependency Management
- A Pipfile is added to formally specify both runtime and development dependencies, as well as the Python version, improving reproducibility and onboarding.