Reduce data duplication and temp tables during transform #72

Open
brycekbargar wants to merge 28 commits into library-data-platform:release-v4.0.0 from Five-Colleges-Incorporated:fewer-temp-tables

@brycekbargar
Collaborator

I've learned a lot working on this new implementation, and many of my initial assumptions and design choices broke down at the scale of the Inventory table. In trying to get the Inventory table to transform, more and more code was bolted on to just barely make it work. Any given transform was one bad Postgres query plan away from exploding the server. Before releasing the next version I'd like to make sure it is well-behaved when run by other people and maintainable enough that I can follow up with quick bug fixes.

The biggest issues with the existing design were:

  • The Carcinisation of the Nodes, Metadata, and Path/Name calculations
  • Having to constantly create and drop temporary tables in order to try to keep disk usage under control
  • Having a single query find the type of a column, the type of the values if that column was an array, and more specialized type information such as whether a value was a UUID

This PR combines all the various control structures into the Node hierarchy. The hierarchy is used to generate the paths and names of the various columns and to store the type information. It continues to be responsible for generating the correct SQL to expand/infer the structure itself.
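To illustrate the idea of one hierarchy owning paths, names, types, and SQL generation, here is a minimal sketch. It is not the class from this PR: `Node`, `add`, `column_name`, `select_expr`, the `__` separator, and the `jsonb` source column name are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One key in the JSON document (hypothetical, not the PR's actual class)."""

    key: str
    json_type: str = "object"  # e.g. "object", "array", "string", "number"
    # Excluded from repr/eq so parent <-> child links cannot recurse.
    parent: Node | None = field(default=None, repr=False, compare=False)
    children: list[Node] = field(default_factory=list)

    def add(self, key: str, json_type: str) -> Node:
        """Attach a child node and return it."""
        child = Node(key, json_type, parent=self)
        self.children.append(child)
        return child

    @property
    def path(self) -> list[str]:
        """Keys from the root down to this node, excluding the root itself."""
        if self.parent is None:
            return []
        return [*self.parent.path, self.key]

    @property
    def column_name(self) -> str:
        """Flattened column name; the "__" separator is an assumption."""
        return "__".join(self.path)

    def select_expr(self, source: str = "jsonb") -> str:
        """SQL expression that pulls this node's value out of the source column."""
        if not self.path:
            return source
        *init, last = self.path
        expr = source + "".join(f" -> '{k}'" for k in init)
        # Scalars are extracted as text; objects and arrays stay jsonb.
        op = "->>" if self.json_type in ("string", "number", "boolean") else "->"
        return f"{expr} {op} '{last}'"


root = Node("root")
name = root.add("status", "object").add("name", "string")
print(name.column_name)    # status__name
print(name.select_expr())  # jsonb -> 'status' ->> 'name'
```

Keeping the path, name, and type on the same object means there is only one place to update when the inferred structure changes.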

Instead of incrementally building the result tables through a series of temporary tables, this switches to inferring as much information as possible from the origin jsonb column. For arrays, temp tables still save time and computation because jsonb_array_elements only needs to be called once during the expansion. The information expanded into the new Array temp table is kept to a minimum, and only the final create table statement copies in the "carryover" columns. This tames the disk usage spikes and simplifies the transformation logic by removing all the intermediate drop tables.
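The array handling described above could look roughly like the following. This is a sketch only, not the SQL the PR generates: the function name, the `__id` key column, the `jsonb` source column, and the `_tmp` suffix are assumptions, and identifiers are interpolated without quoting, so it presumes trusted names.

```python
def array_expansion_sql(
    source: str, array_key: str, carryover: list[str], result: str
) -> tuple[str, str]:
    """Build (expand, final) statements for one array key (hypothetical helper)."""
    temp = f"{result}_tmp"
    # Step 1: expand the array once into a narrow temp table that holds only
    # the parent key, the element position, and the element itself.
    expand = (
        f"CREATE TEMPORARY TABLE {temp} AS\n"
        f"SELECT s.__id AS parent_id, a.ord, a.elem\n"
        f"FROM {source} AS s\n"
        f"CROSS JOIN LATERAL jsonb_array_elements(s.jsonb -> '{array_key}')\n"
        f"    WITH ORDINALITY AS a(elem, ord);"
    )
    # Step 2: only the final table copies in the carryover columns.
    cols = "".join(f", s.{c}" for c in carryover)
    final = (
        f"CREATE TABLE {result} AS\n"
        f"SELECT t.parent_id, t.ord, t.elem{cols}\n"
        f"FROM {temp} AS t\n"
        f"JOIN {source} AS s ON s.__id = t.parent_id;"
    )
    return expand, final


expand, final = array_expansion_sql(
    "inventory", "holdings", ["hrid"], "inventory__holdings"
)
print(expand)
print(final)
```

Because the carryover columns appear only in the second statement, the temp table stays small no matter how wide the source rows are.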

We're now using jsonb_each to find the keys and their types in one query. Finding the type of array elements and finding more specific types are now separate steps, run only if necessary. This has greatly reduced the amount of time spent inferring type information and also eliminates the accidental and unpredictable query plan explosions.
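As a rough illustration of the single key/type pass, the query below pairs `jsonb_each` with `jsonb_typeof`, and the Python functions mirror what it would return so the logic can be tried without a database. The table name `source_table`, the column name `jsonb`, and the helper names are assumptions, not the PR's actual code.

```python
from collections import defaultdict

# One pass over the top-level keys: each key with the JSON type of its value.
# "source_table" and the "jsonb" column are placeholder names.
KEY_TYPES_SQL = """
SELECT e.key, jsonb_typeof(e.value) AS json_type, count(*) AS n
FROM source_table AS s
CROSS JOIN LATERAL jsonb_each(s.jsonb) AS e(key, value)
GROUP BY e.key, jsonb_typeof(e.value);
"""


def json_typeof(value: object) -> str:
    """Return the type name jsonb_typeof would report for a decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before the number check
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def key_types(rows: list[dict]) -> dict[str, set[str]]:
    """Collect the set of JSON types seen for each top-level key."""
    found: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        for key, value in row.items():
            found[key].add(json_typeof(value))
    return dict(found)


rows = [
    {"id": "a", "tags": ["x"], "active": True},
    {"id": "b", "tags": None, "count": 3},
]
print(key_types(rows))
```

Keys that turn out to be arrays, or strings that might be UUIDs, would then get their own follow-up query rather than being folded into this one.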
