Refactor to minimize memory usage by austensen · Pull Request #21 · housing-data-coalition/oca

austensen · 2026-06-03T16:30:32Z

I got a little carried away working on this refactor and a lot was changed, but it keeps the core parsing logic untouched and the basic processing flow is the same, with some changes to speed things up where possible and minimize memory usage so it can be deployed on JustFix's existing kubernetes cluster, which has nodes of about 2GB. I did some benchmarking on the process: typical weekly runs are a lot faster, and it runs on our k8s without the out-of-memory failure I got previously.

Initially I was going to try to make a few minimal changes to solve the memory issue, but then I accidentally lost the prod oca_addresses table and had to re-run everything. Since I was already getting into the code, I thought I'd take the opportunity to implement a lot of the backlog improvements we had noted and use some new patterns (like s3-rds direct operations) more extensively.

As of now I've run all of the raw files through the new system, saving everything to a refactor/ subfolder in S3 and a separate refactor schema in the DB. I'm going to move the current public/ files on S3 into an archive/ folder, copy over the new refactor ones, and test out the import into our DB using the CSV files as a final check that everything is working properly. Then, if things look ok to you both, I can remove the refactor s3/rds setting and run it for real on k8s for new files starting this weekend.

Here's a list of some of the key changes, plus things that might not be clear from the new readme and doc files that explain how things are set up now. Those docs are still the best place to get an overview of how everything works.

- I updated the base python docker image we're using, since I was getting some security warnings about the existing one.
- I tried to simplify the pipeline steps and also keep things safe in case of failures (e.g. previously, during geocoding, the prod oca_addresses table was dropped and then replaced, and if the upload failed you couldn't roll back the drop). Now the final update of the production tables is done in a single transaction that's rolled back if there's a problem, so tables never get lost or out of sync.
- To speed up the XML parsing step, I added a batching system for writing to the local staging DuckDB, so now it prepares the INSERT and DELETE statements for multiple cases before committing them.
  - I did a bunch of benchmarking and checks that everything is the same in the old and new process when adding the batched-parsing system. I pulled all of that extra logic out to keep things clean, but you can see it in Batch parse & staging export #20.
- To avoid having to read full table CSVs into memory to fix DuckDB-Postgres formatting, this is now handled in SQL during the export of CSVs from DuckDB (this was a todo in the code already).
- To align everything in steps more clearly, all the geocoding happens on the staging csv before any uploads to s3/rds.
- To keep the amount of geocoding/csv reading down, each run only geocodes the files being processed. If we ever need to backfill with geocoding, there is a separate script that can be run for that (and it's probably easier to just do this locally as a one-off).
- The geocoding columns are now also set up in the initial definition of the tables, and the addition of the geom column is removed from the sql that generates the views, to keep that to just views.
- The files are all reorganized to use etl_stages.py for the core logic for each step, so the main etl.py can be short and readable and you can more easily understand the flow of the whole pipeline.
- A few new "manifest" tables are added that track the running of the etl process itself. Largely this was to help as I was reprocessing everything, but I think it should be nice going forward for debugging issues and tracking the progress of the automated jobs.
- I forgot about this open PR Bug Fix: Duplicated appearanceids #19 when I started, and by then the git situation was gonna be crazy, so I just directly incorporated the fixes.
- There are a ton of tests that we probably don't need, but I was using some AI tools to help with the refactor, and the tests helped that process and helped ensure things weren't breaking the original way things worked. Since they don't hurt I've just left them in, but no need to review them.
- I removed the interactive progress bar since it makes the k8s logs unreadable, and since we won't be running it interactively often I didn't bother trying to keep both behaviors.
- I also made a few changes mostly relevant to reprocessing everything, as I had to do:
  - in handle cases marked for deletion #22 I fixed an issue where cases marked for deletion could be recreated if files were processed out of order (mostly only relevant when it's all getting rebuilt, as I was doing)
  - in add env flag to skip csv s3 publish when reprocessing #24 I added an option to skip the final export of CSVs from the prod db to s3
  - in Handle failed parse #23 I added some extra error handling for when parsing of a case fails for some reason
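
To illustrate the single-transaction update of the production tables mentioned in the list above, here is a minimal sketch. It is not the code in this PR: it uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere, and the function name `promote_staging` and the `staging_` table prefix are assumptions made up for the example.

```python
import sqlite3


def promote_staging(conn, tables):
    """Swap staging rows into production tables in one transaction.

    Returns True on success. On any database error everything is rolled
    back and the production tables keep their previous contents.
    Table names are trusted input here (hypothetical, illustration only).
    """
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            for table in tables:
                conn.execute(f"DELETE FROM {table}")
                conn.execute(
                    f"INSERT INTO {table} SELECT * FROM staging_{table}"
                )
    except sqlite3.Error:
        return False
    return True
```

Because the delete and the insert for every table share one transaction, a failure partway through (for example a missing staging table) leaves all production tables exactly as they were, which is the property the old drop-then-replace flow lacked.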

The refactor work thus far has only touched the process after the initial parsing into DuckDB and export to csv. This PR adds some improvements to the parsing and the duckdb -> csv export.

Previously, during parsing, each row written to each table in the staging DuckDB was committed separately; this changes to a batching approach. The INSERT & DELETE statements are collected in a buffer and then written to the DuckDB in a single transaction for multiple cases (the number is configurable). On a typical weekly update this makes the parsing stage about 80% faster.

Previously we exported CSVs from the DuckDB and then did some preprocessing of them to adjust for differences in how DuckDB and Postgres represent some data types, so that the COPY (via s3-to-rds) works. This changes to incorporate the preprocessing directly into the COPY export from DuckDB, so those CSVs are ready for upload to s3/rds without any extra reading of CSVs.

While working out these changes I did a bunch of benchmarking and testing to make sure none of the behavior changed. I've since stripped that out to keep the code more readable, but the version with benchmarking and other checks is preserved on this branch: https://github.com/housing-data-coalition/oca/tree/batch-parsing-eval
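
The buffered, multi-case commit described above could look roughly like the sketch below. This is only an illustration of the pattern, not the parser's actual code: sqlite3 stands in for the staging DuckDB, and the `BatchWriter` class, its `cases_per_batch` parameter, and the statement format are all hypothetical.

```python
import sqlite3


class BatchWriter:
    """Buffer per-case statements and commit them in one transaction."""

    def __init__(self, conn, cases_per_batch=100):
        self.conn = conn
        self.cases_per_batch = cases_per_batch  # hypothetical setting
        self.buffer = []        # (sql, params) pairs waiting to be written
        self.pending_cases = 0  # cases queued since the last flush

    def add_case(self, statements):
        """Queue every DELETE/INSERT for one parsed case."""
        self.buffer.extend(statements)
        self.pending_cases += 1
        if self.pending_cases >= self.cases_per_batch:
            self.flush()

    def flush(self):
        """Write everything buffered so far in a single transaction."""
        if self.buffer:
            with self.conn:  # one commit for the whole batch
                for sql, params in self.buffer:
                    self.conn.execute(sql, params)
        self.buffer.clear()
        self.pending_cases = 0
```

A caller would invoke `flush()` once more after the last case so a partially filled batch is not left in the buffer.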

In the OCA XML files some cases are marked for permanent deletion. This wasn't handled correctly if files were ever reprocessed out of order, since previously deleted cases could be restored. This PR adds extra protection against that issue by explicitly deleting cases marked for deletion (which we already record in oca_metadata.deletedate) before promoting staging data to main, and adds a one-off backfill script to purge all such cases.
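
As a rough sketch of that purge step, assuming it boils down to one DELETE per table keyed on oca_metadata.deletedate: sqlite3 stands in for Postgres here, and the function name `purge_deleted_cases` and the `indexnumberid` key column are assumptions for the example, not taken from this PR.

```python
import sqlite3


def purge_deleted_cases(conn, tables, key="indexnumberid"):
    """Delete rows for cases that oca_metadata marks as deleted.

    `tables` and `key` are trusted, hypothetical names. Returns the
    total number of rows removed across all tables.
    """
    deleted = 0
    with conn:  # all tables are purged in one transaction
        for table in tables:
            cur = conn.execute(
                f"DELETE FROM {table} WHERE {key} IN "
                f"(SELECT {key} FROM oca_metadata "
                "WHERE deletedate IS NOT NULL)"
            )
            deleted += cur.rowcount
    return deleted
```

Running this before promotion means a case that an older, out-of-order file re-inserted is removed again as long as its deletedate is still recorded.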

Previously, failures during XML case parsing were printed but not handled in any other way, so it would be easy to miss the problem. This adds some safety measures to make it clear when there is a parsing failure and when it needs to be corrected. By default the rest of the file(s) continue through the rest of the pipeline, since I think it's better to update with a few cases missing and then reprocess later to correct it than to fail right away when there might not be time to reprocess before the data is needed (for context, JustFix sends out an email Monday morning that uses the data). If multiple files are being processed at once, then later only the single file that had the parsing failure can be rerun. Details on the failures are recorded in the "manifest" tables that were already added as part of the refactor.
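
The continue-by-default behavior described above can be sketched like this. It is a simplified illustration under assumed names: `parse_file`, the `fail_fast` switch, and the shape of the failure records are hypothetical, and in the real pipeline the failure details go to the manifest tables rather than a list.

```python
def parse_file(source_file, cases, parse_case, fail_fast=False):
    """Parse every case in a file, collecting failures instead of hiding them.

    `cases` is an iterable of (case_id, raw_case) pairs and `parse_case`
    is the per-case parser. With fail_fast=False (the default here) a bad
    case is recorded and the remaining cases are still parsed.
    """
    parsed, failures = [], []
    for case_id, raw_case in cases:
        try:
            parsed.append(parse_case(raw_case))
        except Exception as exc:
            if fail_fast:
                raise
            # Record enough detail to find and rerun the file later.
            failures.append(
                {"file": source_file, "case": case_id, "error": repr(exc)}
            )
    return parsed, failures
```

Returning the failures alongside the parsed cases lets the caller decide what to do with them (log loudly, write a manifest row, flag the file for reprocessing) without stopping the weekly update.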

austensen added 30 commits May 27, 2026 08:19

update ignore files

16d166f

update docker and python deps for security

f2495a0

Runtime controls + schema plumbing + reprocess selectors

b61884d

Run manifest + stage checkpointing + single-run locking

f60d2c5

ETL module structure (flat split, move-only)

a10632e

Parse/load memory reduction and CSV preprocessing reduction

dcc3c32

PR #19 appearanceid + S3 publish fixes

596e208

Incremental geocoding with delta extraction and DB upsert

00840e6

Multi-address geocode row keys

57de240

Atomic staging->main promotion hardening

ee8d3d7

Publish optimization + operational hardening

db12b02

Schema-safe table bootstrap before staging import

570ed86

remove unused prep_db

285a4c1

update readmes

5909343

prevent dropped remote db connections for long processes

35c7eda

add DB connection issue protections, suppress geosupport logs

92a5dc2

adjust geocode candidate selection

0674af2

reorder steps for better grouping (publish all files together) and avoid duplicate address export

ae51651

Geom SQL (schema, promotion, views, upsert)

bb68234

CSV geocode + weekly pipeline rewire

1a30fff

RDS backfill CLI

e45d3c1

update Documentation for geocoding changes

1abb194

fix bug uploading tempfiles to s3

e80073e

update ignore files

e6a2d6f

simplify parser progress logging for non-interactive

d9e53e6

add env flag to skip csv s3 publish when reprocessing (#24)

51c5dd8

fix minor bug in threading for parsers that raised exception but not a problem

1ab7c41

austensen wants to merge 30 commits into level2 from refactor
