Refactor to minimize memory usage #21

Open
austensen wants to merge 30 commits into level2 from refactor
@austensen austensen commented Jun 3, 2026

I got a little carried away working on this refactor and a lot changed, but the core parsing logic is untouched and the basic processing flow is the same. The changes speed things up where possible and minimize memory usage so the pipeline can be deployed on JustFix's existing Kubernetes cluster, whose nodes have about 2GB of memory. I did some benchmarking: typical weekly runs are a lot faster, and it runs on our k8s cluster without the out-of-memory failure I got previously.

Initially I was going to try to make a few minimal changes to solve the memory issue. But after I accidentally lost the prod oca_addresses table and had to re-run everything, I was already getting into the code, so I thought I'd take the opportunity to implement a lot of the backlog improvements we had noted and to use some new patterns (like s3-rds direct operations) more extensively.

As of now I've run all of the raw files through the new system, saving everything to a refactor/ subfolder in S3 and a separate refactor schema in the DB. Next I'm going to move the current public/ files on S3 into an archive/ folder, copy over the new refactor ones, and test the import into our DB using the CSV files as a final check that everything is working properly. Then, if things look OK to you both, I can remove the refactor s3/rds setting and run it for real on k8s for new files starting this weekend.

Here's a list of some of the key changes, plus things that might not be clear from the new readme and doc files. Those docs explain how things are set up now and are the best place to get an overview of how everything works.

  • I updated the base python docker image we're using since I was getting some security warnings about the existing one
  • I tried to simplify the pipeline steps and also keep things safe in case of failures (e.g. previously, during geocoding, the prod oca_addresses table was dropped and then replaced, and if the upload failed you couldn't roll back the drop). Now the final update of the production tables is done in a single transaction that's rolled back if there's a problem, so tables never get lost or out of sync
  • To speed up the XML parsing step, I added a batching system for writing to the local staging DuckDB - so now it prepares the INSERT and DELETE statements for multiple cases before committing them.
    • I did a bunch of benchmarking and checks that everything is the same in the old and new process when adding the batched-parsing system. I pulled all of that extra logic out to keep things clean, but you can see it in Batch parse & staging export #20
  • To avoid having to read full table CSVs into memory to fix DuckDB-Postgres formatting, this is now handled in SQL during the export of CSVs from DuckDB (this was a todo in the code already)
  • To align everything in steps more clearly, all the geocoding happens on the staging csv before any uploads to s3/rds.
  • To keep the amount of geocoding/csv reading down, each run only geocodes the files being processed. If we ever need to backfill with geocoding there is a separate script that can be run for that (and it's probably easier to just do this locally as a one off)
  • The geocoding columns are now also set up in the initial definition of the tables and the addition of the geom column is removed from the sql that generates the views to keep that to just views
  • The files are all reorganized to use etl_stages.py for the core logic for each step and then the main etl.py is able to be short and readable so you can more easily understand the flow of the whole pipeline.
  • A few new "manifest" tables are added that track the running of the etl process itself. Largely this was to help as I was reprocessing everything, but I think it should be nice going forward for debugging issues and tracking the progress of the automated jobs
  • I forgot about this open PR Bug Fix: Duplicated appearanceids #19 when I started, and by then the git situation was gonna be crazy so I just directly incorporated the fixes
  • There are a ton of tests that we probably don't need. I was using some AI tools to help with the refactor, and the tests helped that process and helped ensure nothing broke the way things originally worked. Since they don't hurt I've just left them in, but there's no need to review them.
  • I removed the interactive progress bar since it makes the k8s logs unreadable and since we won't be running it interactively often I didn't bother trying to keep both behaviors.
  • I also made a few changes mostly relevant to reprocessing everything, as I had to do:
    • In Handle failed parse #23 I added some extra error handling for when parsing of a case fails for some reason
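
To illustrate the single-transaction update of production tables mentioned in the list above, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for Postgres so it can run anywhere; the `promote` function, the `staging_addresses` table and the column names are hypothetical and are not taken from this PR's code.

```python
import sqlite3


def promote(conn, fail=False):
    """Replace oca_addresses from staging in one transaction (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM oca_addresses")
            conn.execute("INSERT INTO oca_addresses SELECT * FROM staging_addresses")
            if fail:
                # Stand-in for an upload or load step failing partway through
                raise RuntimeError("simulated failure")
    except RuntimeError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oca_addresses (id TEXT, city TEXT)")
conn.execute("CREATE TABLE staging_addresses (id TEXT, city TEXT)")
conn.execute("INSERT INTO oca_addresses VALUES ('old-1', 'Bronx')")
conn.execute("INSERT INTO staging_addresses VALUES ('new-1', 'Queens')")
conn.commit()

failed_result = promote(conn, fail=True)
rows_after_failure = conn.execute("SELECT id FROM oca_addresses").fetchall()

success_result = promote(conn)
rows_after_success = conn.execute("SELECT id FROM oca_addresses").fetchall()
```

After the simulated failure the old rows are still in place, and only the successful run swaps in the staging data. The real pipeline may structure this quite differently; the point is only that the delete and the reload live inside the same transaction.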

austensen added 30 commits May 27, 2026 08:19
The refactor work thus far has only touched the process after the initial parsing into DuckDB and export to csv. This PR adds on some improvements to the parsing and duckdb -> csv export.

Previously, during parsing, each row written to each table in the staging DuckDB was committed separately; this PR changes that to a batching approach. The INSERT & DELETE statements are collected in a buffer and then written to the DuckDB in a single transaction covering multiple cases (the number is configurable). On a typical weekly update this makes the parsing stage about 80% faster.
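
The batching idea can be sketched as below. This is an illustration only, not the code in this PR: it uses sqlite3 as a runnable stand-in for the staging DuckDB, and the `BatchWriter` class, the `batch_size` parameter, and the `oca_index` table layout are all assumptions made for the example.

```python
import sqlite3


class BatchWriter:
    """Buffer SQL statements and write them in one transaction per batch."""

    def __init__(self, conn, batch_size=100):
        self.conn = conn
        self.batch_size = batch_size  # cases per commit (hypothetical setting)
        self.buffer = []              # (sql, params) pairs waiting to be written
        self.cases_buffered = 0
        self.commits = 0

    def add_case(self, statements):
        """Queue all statements for one case; flush when the batch is full."""
        self.buffer.extend(statements)
        self.cases_buffered += 1
        if self.cases_buffered >= self.batch_size:
            self.flush()

    def flush(self):
        """Write everything buffered so far in a single transaction."""
        if not self.buffer:
            return
        with self.conn:  # one commit for the whole batch
            for sql, params in self.buffer:
                self.conn.execute(sql, params)
        self.buffer.clear()
        self.cases_buffered = 0
        self.commits += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oca_index (indexnumberid TEXT PRIMARY KEY, court TEXT)")

writer = BatchWriter(conn, batch_size=3)
for i in range(7):
    case_id = f"case-{i}"
    writer.add_case([
        ("DELETE FROM oca_index WHERE indexnumberid = ?", (case_id,)),
        ("INSERT INTO oca_index VALUES (?, ?)", (case_id, "Kings")),
    ])
writer.flush()  # write the final partial batch

row_count = conn.execute("SELECT COUNT(*) FROM oca_index").fetchone()[0]
```

With seven cases and a batch size of three, the writer commits three times instead of fourteen (one per statement). The final explicit `flush()` matters: without it the last partial batch would never be written.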

Previously we exported CSVs from the DuckDB and then preprocessed them to adjust for differences in how DuckDB and Postgres represent some data types, so that the COPY (via s3-to-rds) works. This PR incorporates those preprocessing changes directly into the COPY export from DuckDB, so the CSVs are ready for upload to s3/rds without any extra reading of CSVs.

While working out these changes I did a bunch of benchmarking and testing to make sure none of the behavior changed. I've since stripped that out to keep the code more readable, but the version with benchmarking and other checks is preserved on this branch: https://github.com/housing-data-coalition/oca/tree/batch-parsing-eval
In the OCA XML files some cases are marked for permanent deletion. This wasn't handled correctly if files were ever reprocessed out of order, since previously deleted cases could be restored. This PR adds extra protection against that issue by explicitly deleting cases marked for deletion (which we already record in oca_metadata.deletedate) before promoting staging data to main, and adds a one-off backfill script to purge all such cases.
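
A rough sketch of that purge step follows. Only `oca_metadata.deletedate` comes from the description above; the `purge_deleted_cases` function, the `indexnumberid` key column, and the `staging_index` table are assumptions for illustration, and sqlite3 stands in for the real database so the example is runnable.

```python
import sqlite3


def purge_deleted_cases(conn, tables):
    """Remove rows for every case that oca_metadata marks as deleted.

    `tables` must be a fixed, trusted list of table names, since the names
    are interpolated into the SQL text.
    """
    with conn:  # all tables are purged in one transaction
        for table in tables:
            conn.execute(
                f"DELETE FROM {table} WHERE indexnumberid IN "
                "(SELECT indexnumberid FROM oca_metadata "
                "WHERE deletedate IS NOT NULL)"
            )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oca_metadata (indexnumberid TEXT, deletedate TEXT);
    CREATE TABLE staging_index (indexnumberid TEXT, court TEXT);
    INSERT INTO oca_metadata VALUES ('case-1', NULL), ('case-2', '2026-05-01');
    INSERT INTO staging_index VALUES ('case-1', 'Kings'), ('case-2', 'Queens');
""")

purge_deleted_cases(conn, ["staging_index"])
remaining = conn.execute(
    "SELECT indexnumberid FROM staging_index ORDER BY indexnumberid"
).fetchall()
```

Because the purge keys off the recorded delete date rather than the order in which files arrive, reprocessing an older file cannot bring a deleted case back in this sketch.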
Previously, failures during XML case parsing were printed but not handled in any other way, so it was easy to miss the problem. This adds some safety measures to make it clear when there is a parsing failure and when it needs to be corrected. By default the rest of the file(s) continue through the rest of the pipeline: I think it's better to update with a few cases missing and reprocess later to correct it than to fail right away when there might not be time to reprocess before the data is needed (for context, JustFix sends out an email Monday morning that uses the data). If multiple files are being processed at once, only the single file that had the parsing failure needs to be rerun later.

Details on the failures are recorded in the "manifest" tables that were already added as part of the refactor.
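
The continue-on-failure behavior described above might look something like this sketch. The `parse_all` function, the shape of the failure records, and the toy parser are hypothetical; the actual manifest tables and their columns are described in the repo docs, not here.

```python
def parse_all(raw_cases, parse_case):
    """Parse every case, collecting failures instead of stopping at the first."""
    parsed = []
    failures = []
    for case_id, raw in raw_cases:
        try:
            parsed.append(parse_case(raw))
        except Exception as exc:
            # Record enough detail to find and reprocess the case later
            failures.append({
                "case_id": case_id,
                "error_type": type(exc).__name__,
                "message": str(exc),
            })
    return parsed, failures


def toy_parse(raw):
    """Stand-in for the real XML case parser."""
    return int(raw)


raw_cases = [("case-1", "10"), ("case-2", "not-a-number"), ("case-3", "30")]
parsed, failures = parse_all(raw_cases, toy_parse)
needs_reprocessing = bool(failures)  # a flag a manifest row could carry
```

Here the two good cases still flow through, and the failure record identifies which case to reprocess.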