Conversation
The refactor work thus far has only touched the process after the initial parsing into DuckDB and export to csv. This PR adds on some improvements to the parsing and duckdb -> csv export. Previously during parsing each row written to each table in the staging DuckDB was committed separately, and this changes to use a batching approach. The INSERT & DELETE statements are collected in a buffer and then written to the DuckDB in a single transaction for multiple cases (configurable). On a typical weekly update this makes the parsing stage about 80% faster. Previously we exported CSVs from the DuckDB and then did some preprocessing of them to adjust from differences in how DuckDB and Postgres represent some data types so that the COPY (via s3-to-rds) works. This changes to incorporate the preprocessing changes directly into the COPY export from DuckDB so those CSVs are ready for upload to s3/rds without any extra reading of CSVs. During the process of working out these changes I did a bunch of benchmarking and testing to make sure all the behavior wasn't changed at all. I've since stripped that out to keep the code more readable, but the version with benchmarking and other checks is preserved on this branch: https://github.com/housing-data-coalition/oca/tree/batch-parsing-eval
…oid duplicate address export
In the OCA XML files some cases are marked for permanent deletion and this wasn't handled correctly if files are ever reprocessed out of order since previously deleted cases could be restored. This PR adds extra protection against that issue by explicitly deleting cases marked for deletion (which we already record in oca_metadata.deletedate) before promoting staging data to main, and adds a one-off backfill script to purge all cases.
Previously failures during xml case parsing were printed but not handled in any other way so it would be easy to miss the problem. This adds some additional safety measures to make sure it's clear when there is a parsing failure and when it needs to be corrected. By default the rest of the file(s) continue the rest of the pipeline, since I think it's better to update with a few cases missing and then reprocess later to correct it rather than to fail right away when there might not be time to reprocess before the data is needed (context is justfix sends out an email monday morning that uses the data) and if multiple files are being processed at once then later only the single fail that had the parsing failure can be rerun. Details on the failures are recorded in the "manifest" tables that were already added as part of the refactor.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I got a little carried away working on this refactor and a lot was changed, but it keeps the core parsing logic untouched and the basic processing flow is the same with some changes to speed things up where possible and minimize memory usage so it can be deployed on JustFix's existing kubernetes cluster that has nodes of about 2GB. I did some benchmarking on the process and for typical weekly runs a lot faster, and it runs on our k8s without the out-of-memory failure I got previously.
Initially I was going to try to make a few minimal changes to solve the memory issue, but after I accidentally lost the prod
oca_addressestable so had to re-run everything and I was already getting into the code I thought I'd take the opportunity to implement a lot of the backlog improvements we had noted and use some new patterns (like s3-rds direct operations) more extensively.As of now I've run all of the raw files through the new system saving everything to a
refactor/subfolder in S3 and a separaterefactorschema in the DB. I'm going to move the currentpublic/files on S3 into anarchive/folder and copy of the new refactor ones and test out the import into our DB using the CSV files as a final check that everything is working properly, and then if things look ok to you both I can remove therefactors3/rds setting and run it for real on k8s for new files starting this weekend.Here's a list of the some of the key changes or things that might not be clear from the new readme and doc files that explain how things are set up now, but those docs are best for getting an overview of how everything works now.
etl_stages.pyfor the core logic for each step and then the mainetl.pyis able to be short and readable so you can more easily understand the flow of the whole pipeline.