Reduce data duplication and temp tables during transform #72
Open
brycekbargar wants to merge 28 commits into library-data-platform:release-v4.0.0 from
Conversation
I've learned a lot working on this new implementation, and many of the initial assumptions and design choices I made broke down at the scale of the Inventory table. In trying to get the Inventory table transforming, more and more code was bolted on to just barely make it work. Any given transform was one bad Postgres query plan away from exploding the server. Before releasing the next version I'd like to make sure it is well-behaved when run by other people, and actually maintainable so I can follow up with quick bug fixes.
The biggest issues with the existing design were:
This combines all the various control structures into the Node hierarchy. The hierarchy is used to generate the paths and names of the various columns, as well as to store the type information. It remains responsible for generating the correct SQL to expand/infer the structure itself.
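As an illustration of the idea (the class, method, and column names below are hypothetical, not taken from this PR's code), a node hierarchy like that might derive both the flattened column name and the jsonb extraction SQL from a single path:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One element of the inferred document structure (sketch).

    Each node knows its key name, its inferred jsonb type, and its
    children, and can derive the flattened column name and the SQL
    path expression used to extract its value.
    """
    name: str
    jsonb_type: str = "object"          # e.g. "object", "array", "string"
    children: list["Node"] = field(default_factory=list)
    parent: "Node | None" = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> list[str]:
        """Key path from the root down to this node."""
        if self.parent is None:
            return []
        return self.parent.path() + [self.name]

    def column_name(self) -> str:
        """Flattened column name, joined with underscores."""
        return "_".join(self.path())

    def sql_expr(self, column: str = "data") -> str:
        """jsonb extraction expression for this node's value."""
        keys = "".join(f" -> '{k}'" for k in self.path()[:-1])
        return f"{column}{keys} ->> '{self.path()[-1]}'"

root = Node("$")
author = root.add(Node("author"))
name = author.add(Node("name", "string"))
print(name.column_name())   # author_name
print(name.sql_expr())      # data -> 'author' ->> 'name'
```

Keeping naming, pathing, and type information on one structure means every generated statement agrees on what a column is called and where its value lives.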
Instead of incrementally building the result tables through a series of temporary tables, this switches to inferring as much information as possible from the origin jsonb column. For arrays, temp tables still save time/computation because jsonb_array_elements only needs to be called once during the expansion. The information expanded into the new Array temp table is kept to a minimum, and only the final CREATE TABLE statement copies in the "carryover" columns. This tames the disk-usage spikes and simplifies the transformation logic by removing all the intermediate DROP TABLE statements.
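The two-step shape described above might look roughly like this (a sketch of the pattern only; the table names, join key, and carryover columns are placeholders, not the PR's actual SQL):

```python
def array_expansion_sql(origin: str, array_col: str,
                        carryover: list[str]) -> tuple[str, str]:
    """Build the two statements for the array-expansion pattern:
    a lean temp table that calls jsonb_array_elements exactly once,
    and a final table that joins the carryover columns back in.
    """
    # Step 1: expand the array once, keeping only the join key,
    # element ordinality, and the element itself.
    temp = (
        f"CREATE TEMP TABLE {array_col}_elements AS\n"
        f"SELECT o.id, e.ordinality AS idx, e.value AS element\n"
        f"FROM {origin} o\n"
        f"CROSS JOIN LATERAL jsonb_array_elements(o.data -> '{array_col}')\n"
        f"     WITH ORDINALITY AS e(value, ordinality);"
    )
    # Step 2: only the final CREATE TABLE copies in the carryover
    # columns, so the wide rows are never duplicated in a temp table.
    final = (
        f"CREATE TABLE {array_col}_final AS\n"
        f"SELECT t.idx, t.element, "
        + ", ".join("o." + c for c in carryover) + "\n"
        f"FROM {array_col}_elements t\n"
        f"JOIN {origin} o USING (id);"
    )
    return temp, final

temp_sql, final_sql = array_expansion_sql(
    "inventory", "items", ["barcode", "location"])
print(temp_sql)
print(final_sql)
```

Because the temp table carries only the key, the index, and the element, its footprint stays proportional to the array data rather than to the full width of the origin rows.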
We're now using jsonb_each to find the keys and their types in one query. Finding the type of array elements, and more specific types, are now separate steps performed only when necessary. This has greatly reduced the amount of time spent inferring type information and also eliminated the accidental and unpredictable query-plan explosions.
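The single-pass key/type discovery could be expressed along these lines (table and column names are placeholders, assuming the documents live in a jsonb column called data):

```python
def infer_keys_sql(table: str, doc_col: str = "data") -> str:
    """One query that returns every key and its jsonb type across the
    whole table, so type inference needs a single scan rather than a
    separate probe per key."""
    return (
        f"SELECT e.key, jsonb_typeof(e.value) AS jsonb_type, count(*)\n"
        f"FROM {table},\n"
        f"     jsonb_each({doc_col}) AS e(key, value)\n"
        f"GROUP BY e.key, jsonb_typeof(e.value);"
    )

print(infer_keys_sql("inventory"))
```

jsonb_each flattens each document into (key, value) pairs, and jsonb_typeof classifies each value, so one grouped scan yields the full key/type inventory; only ambiguous cases (array element types, narrower scalar types) need a follow-up query.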