Skip to content

Allow adding TRUNCATECOLUMNS option to Redshift COPY#43

Open
stefankeidel wants to merge 12 commits into
datamill-co:masterfrom
stefankeidel:master
Open

Allow adding TRUNCATECOLUMNS option to Redshift COPY#43
stefankeidel wants to merge 12 commits into
datamill-co:masterfrom
stefankeidel:master

Conversation

@stefankeidel

Copy link
Copy Markdown

This adds a configuration parameter (defaulting to False) which triggers the TRUNCATECOLUMNS option in every Redshift COPY statement sent by the target.

The use case for us is the combination with tap-intercom where some of the content can exceed 64k, but the content for those few records/fields where that happens can be safely ignored. I couldn't find another way to truncate the content before sending to Redshift.

It might be useful to at some point add a more flexible way to include other options as well, but this should work for now.

@AlexanderMann

Copy link
Copy Markdown
Collaborator

@stefankeidel this seems really straightforward and practical. I think it'd make sense to have a test which creates an (insanely) large record, uses this option, and then makes sure we read a truncated version out of Redshift since unlike some of our other configuration options, this is pretty straightforward to actually test for.

Curious if @awm33 has any opinions/thoughts here?

This test adds 100 cats with a long description and asserts that they
all insert correctly (Redshift bails if the content is too long if the
TRUNCATECOLUMNS option is not set) and that the longest record for
that column equals the max column length.

Tested using the docker setup for this project:

    source /code/venv--target-redshift/bin/activate
    pytest tests/test_target_redshift.py -k 'test_truncate_columns'
@stefankeidel

Copy link
Copy Markdown
Author

Good idea! Added a test that does roughly that. Lmk if that works

@AlexanderMann

Copy link
Copy Markdown
Collaborator

Test looks good. If we can get @awm33 to weigh in here, I think this is good to merge. Really nice work @stefankeidel!

@AlexanderMann AlexanderMann self-requested a review May 15, 2020 13:37
@awm33

awm33 commented May 16, 2020

Copy link
Copy Markdown
Member

@AlexanderMann @stefankeidel I'm wondering if we should create a subobject for Redshift COPY options to group them?

@stefankeidel Did you try unselecting ("selected": false") the column/field in the tap catalog? Or do you actually need the column for analysis?

@AlexanderMann

Copy link
Copy Markdown
Collaborator

@awm33 that seems like a reasonable thing to do. I think a good enhancement for all of our config would be grouping all of the various things. Like, for psycopg2 we can make the connection object just a 1:1 mapping in a sub-object etc.

You're suggesting doing it here so folks don't have a bunch of work to do in the future?

@stefankeidel

stefankeidel commented May 18, 2020

Copy link
Copy Markdown
Author

@stefankeidel Did you try unselecting ("selected": false") the column/field in the tap catalog? Or do you actually need the column for analysis?

Yeah, this is for tap-intercom. We don't care about every individual message being imported in full but would still like to have most of them :)

Regarding a subgrouping: Makes sense! Wdyt about something like this?

{
	"redshift_host": "...",
	"redshift_port": 5439,
	"redshift_database": "...",
	"redshift_username": "...",
	"redshift_password": "...",
	"redshift_schema": "test_schema",
	"default_column_length": 1000,
	"max_batch_rows": 16000000,
	"max_buffer_size": 30064771072,
	"target_s3": {
		"aws_access_key_id": "...",
		"aws_secret_access_key": "....",
		"bucket": "...",
		"key_prefix": "singer-io/"
	},
	"redshift_copy_options": {
		"truncate_columns": true
	}
}

@AlexanderMann

Copy link
Copy Markdown
Collaborator

If we're going this route, I'd prefer nested values ie: redshift: { copy: ...

Also, I'm wondering if we want to simply make this an array of strings which get passed right through to the COPY statement, that way you can override anything without needing another PR.

@stefankeidel

Copy link
Copy Markdown
Author

If we're going this route, I'd prefer nested values ie: redshift: { copy: ...

Hmm, we already have prefixed redshift_ values for the connection params. Would you want to move those in there as well and break existing configs? Either way it should be consistent I think.

Also, I'm wondering if we want to simply make this an array of strings which get passed right through to the COPY statement, that way you can override anything without needing another PR.

I like this idea! Not sure if we should do some verification or if we can just assume people that are using such an option know what they're doing?

This allows to pass a list of options to redshift's copy command
instead of just enabling to set a single option.
@stefankeidel

Copy link
Copy Markdown
Author

I implemented it using the prefix redshift_* for now, but I'm happy to change the config format obviously. I just thought this is most consistent with the existing redshift_host etc.

@awm33

awm33 commented May 19, 2020

Copy link
Copy Markdown
Member

@stefankeidel @AlexanderMann I kind of regret us prefixing everything with redshift :). Other targets have COPY commands (postgres and snowflake) and snowflake has a TRUNCATECOLUMNS options too.

I propose something that we can use with other targets as well, since they offer something similar:

{
	"redshift_host": "...",
	"redshift_port": 5439,
	"redshift_database": "...",
	"redshift_username": "...",
	"redshift_password": "...",
	"redshift_schema": "test_schema",
	"default_column_length": 1000,
	"max_batch_rows": 16000000,
	"max_buffer_size": 30064771072,
	"target_s3": {
		"aws_access_key_id": "...",
		"aws_secret_access_key": "....",
		"bucket": "...",
		"key_prefix": "singer-io/"
	},
	"copy_options": {
		"truncate_columns": true
	}
}

@stefankeidel

Copy link
Copy Markdown
Author

@awm33 I like that format, but what do you think about the copy-options-as-array-of-strings idea posted by @AlexanderMann above and implemented in this latest revision? Do we want keywords for every single option we want to support or just assume users know what they're doing?

@AlexanderMann

Copy link
Copy Markdown
Collaborator

@awm33 on that note...should this really be something we put into target-postgres then and have the other targets inherit? Is this something we can assume is a SQLBase type configuration?

@Tom-E

Tom-E commented Jan 5, 2021

Copy link
Copy Markdown

Is this ready to be merged + released? Happen to be looking for exactly this config option =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants