Reduce data duplication and temp tables during transform by brycekbargar · Pull Request #72 · library-data-platform/ldlite

brycekbargar · 2026-03-26T14:56:36Z

I've learned a lot working on this new implementation, and many of the initial assumptions and design choices I made broke down at the scale of the Inventory table. In trying to get the Inventory table to transform, more and more code was bolted on to just barely make it work. Any given transform was one bad Postgres query plan away from exploding the server. Before releasing the next version, I'd like to make sure it is well-behaved when run by other people and maintainable enough that I can follow up with quick bug fixes.

The biggest issues with the existing design were:

The Carcinisation of the Nodes, Metadata, and Path/Name calculations
Having to constantly create and drop temporary tables in order to try to keep disk usage under control
Having a single query to find the type of a column, the type of the values if that column was an array, and more specialized type information like whether it was a uuid

This PR combines all the various control structures into the Node hierarchy. The hierarchy is used to generate the paths and names of the columns as well as to store the type information. It remains responsible for generating the correct SQL to expand and infer the structure itself.
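As a rough illustration of one hierarchy owning paths, names, types, and SQL generation, here is a minimal sketch. The `Node` fields, the `snake_case` helper, the `__` separator, and the `jsonb` source column name are all assumptions made for this example and are not taken from the ldlite code.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


def snake_case(name: str) -> str:
    """Convert a camelCase JSON key to a snake_case column name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


@dataclass
class Node:
    """Hypothetical node: one JSON key plus everything derived from it."""

    key: str
    json_type: str  # e.g. "string", "number", "object", "array"
    parent: Node | None = field(default=None, repr=False, compare=False)
    children: list[Node] = field(default_factory=list)

    def add(self, key: str, json_type: str) -> Node:
        """Attach a child node and return it so calls can be chained."""
        child = Node(key, json_type, parent=self)
        self.children.append(child)
        return child

    @property
    def path(self) -> list[str]:
        """JSON keys from the root object down to this node."""
        if self.parent is None:
            return []
        return [*self.parent.path, self.key]

    @property
    def column_name(self) -> str:
        """Output column name derived from the path (assumed convention)."""
        return "__".join(snake_case(k) for k in self.path)

    def select_expr(self, source: str = "jsonb") -> str:
        """SQL extracting this node's value as text; non-root nodes only.

        Keys are interpolated without escaping, which is only acceptable
        in a sketch like this one.
        """
        *parents, leaf = self.path
        expr = source + "".join(f"->'{k}'" for k in parents)
        return f"{expr}->>'{leaf}' AS {self.column_name}"
```

With this shape, `Node("", "object").add("metadata", "object").add("createdDate", "string")` carries its own path, column name, and select expression, so no parallel metadata structure has to be kept in sync.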

Instead of incrementally building the result tables through a series of temporary tables, this switches to inferring as much information as possible from the original jsonb column. For arrays, temp tables still save time and computation because jsonb_array_elements only needs to be called once during the expansion. The information expanded into the new Array temp table is kept to a minimum, and only the final create table statement copies in the "carryover" columns. This tames the disk usage spikes and simplifies the transformation logic by removing all the intermediate drop table statements.
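The two-step array handling described above could look roughly like the following. This is a sketch under stated assumptions: the `__id` row key, the `jsonb` column name, and both helper functions are hypothetical, and the generated SQL targets Postgres only (`jsonb_array_elements ... WITH ORDINALITY`).

```python
from __future__ import annotations


def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Single-quote a SQL string literal, escaping embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def stage_array_sql(raw: str, array_key: str, staged: str) -> str:
    """Expand the array once, keeping only row id, position, and element."""
    return (
        f"CREATE TEMPORARY TABLE {quote_ident(staged)} AS "
        f"SELECT r.__id, a.ord, a.value "
        f"FROM {quote_ident(raw)} AS r "
        f"CROSS JOIN LATERAL "
        f"jsonb_array_elements(r.jsonb->{quote_literal(array_key)}) "
        f"WITH ORDINALITY AS a(value, ord)"
    )


def final_table_sql(raw: str, staged: str, output: str, carryover: list[str]) -> str:
    """Copy the carryover columns in only when the final table is created."""
    cols = "".join(
        f", r.jsonb->>{quote_literal(key)} AS {quote_ident(key)}"
        for key in carryover
    )
    return (
        f"CREATE TABLE {quote_ident(output)} AS "
        f"SELECT s.__id, s.ord{cols}, s.value "
        f"FROM {quote_ident(staged)} AS s "
        f"JOIN {quote_ident(raw)} AS r USING (__id)"
    )
```

The staged table holds three narrow columns per array element, so the wide carryover values are written to disk once, in the final table, rather than in every intermediate table.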

We're now using jsonb_each to find the keys and their types in one query. Finding the type of array elements and more specific types are now separate steps, run only if necessary. This has greatly reduced the time spent inferring type information and also eliminates the accidental and unpredictable query plan explosions.
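To make the single-pass idea concrete, here is a small Python model of what a `jsonb_each` plus `jsonb_typeof` query returns, alongside one possible shape for the Postgres query. The table name `raw_table`, the column name `jsonb`, and the helper names are assumptions for illustration, not the query this PR actually runs.

```python
from __future__ import annotations

from typing import Any

# One possible Postgres query for the same result (names are hypothetical).
KEYS_AND_TYPES_SQL = """
SELECT e.key, jsonb_typeof(e.value) AS json_type, count(*) AS occurrences
FROM raw_table AS r
CROSS JOIN LATERAL jsonb_each(r.jsonb) AS e(key, value)
GROUP BY e.key, jsonb_typeof(e.value)
"""


def jsonb_typeof(value: Any) -> str:
    """Return the type name Postgres' jsonb_typeof reports for a JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before int: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def infer_key_types(rows: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Collect every top-level key and the JSON types seen for it in one pass."""
    seen: dict[str, set[str]] = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(jsonb_typeof(value))
    return seen
```

Only keys reported as `array` or `string` would then need a follow-up step, for element types or for narrower types such as uuid.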

brycekbargar added 28 commits March 23, 2026 19:40

Begin the rewrite with new Node class and implement the fixed value

d2de294

Implement getting the keys and types for objects

d8dcbda

Remove the identity node type

25ae3e5

Implement staging the array as a temp file and getting the type

1c460d6

Implement the new simplified transform algorithm

c1c2356

Track scan progress

755d631

Refactor sql statement generation location

0427c60

Share more of the json object traversal code

a016de6

Skeleton out the create table sql statements

dd40621

Build sql with output tables and columns

154644f

Clear the indexing time when the transform finishes

8e00886

Refactor to use the rewrite expansion

b3bed23

Use postgres sql format properly

04b41c9

Add a duckdb jsonb_each shim

3cf5217

Various syntax fixes

60d158e

Fix aliasing for basic datatypes column test

df0f2f8

Fix snake case naming

72449e9

Fix basic object expansion

bc7c490

WIP: Making array expansion work

5d55799

Infer NodeContext from Node structure

1f521f2

Implement json_depth

a7dc2f2

Replace legacy version with the rewrite

2552364

Remove delete from returning construct

7d25e53

Inline the ANALYZE statements necessary for the transform

351c9fa

Invert json_depth check and simplify unparenting empty arrays

7955d21

Analyze important column in raw table for postgres

870dc1c

Re-enable flaky source records CI test

e2bec99

Cleanup transform_progress

82a7ffd

brycekbargar wants to merge 28 commits into library-data-platform:release-v4.0.0 from Five-Colleges-Incorporated:fewer-temp-tables
