@devzer01 devzer01 commented Dec 25, 2022

The column count verifier was not treating enclosed (quoted) string columns as a single column and was splitting columns on ',' inside the string. I did a quick fix for it; I hope it's good enough. Also some compiler warnings in flags.c about pointers being compared to non-pointer types, etc.
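
To illustrate the behaviour described above, here is a minimal sketch of quote-aware column counting. It is not the code from this PR: `count_columns` is a hypothetical name, and the sketch assumes double quotes as the enclosing character with `""` as the escape for a literal quote.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not taken from the PR): counts the fields in one
 * CSV record. A delimiter inside a double-quoted field does not start a
 * new column, and a doubled quote ("") inside a quoted field is treated
 * as an escaped quote. */
size_t count_columns(const char *line, char delim)
{
    size_t columns = 1;      /* even an empty line has one (empty) field */
    bool in_quotes = false;

    for (const char *p = line; *p != '\0'; p++) {
        if (in_quotes) {
            if (*p == '"') {
                if (p[1] == '"') {
                    p++;                /* escaped quote, stay in the field */
                } else {
                    in_quotes = false;  /* closing quote */
                }
            }
        } else if (*p == '"') {
            in_quotes = true;           /* opening quote */
        } else if (*p == delim) {
            columns++;
        } else if (*p == '\n' || *p == '\r') {
            break;                      /* end of record outside quotes */
        }
    }
    return columns;
}
```

With this, `count_columns("a,\"b,c\",d", ',')` counts three columns rather than four, because the comma inside the quoted field is skipped.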

@mifrandir
Owner

Thank you for the contribution!

I believe that the problem is a bit more subtle, though. Until now the implementation has been naive in the sense that I didn't try to follow a standard. If we are to make this extension, then we need to choose a spec and follow it, e.g. https://csv-spec.org/.

On the implementation level, there are things that I am currently not happy with and that I shall address.

@devzer01
Author

Sure,

Should we implement a flag to ignore errors, or offer an option to skip them? I was dealing with a file of 249 million lines and splitting it into files of 500,000 lines each. I picked your tool because it includes the headers in each split file. However, the program exits on an error without any useful information and with no reasonable way to resume. Recovering took some math with dd and skipped bytes, then sed to append the header, and then cut (-d, -f) to take the first column, build an index, and figure out which line was broken.

So maybe:

  1. Shall we write the good portion of the lines read so far to the active file before exiting on an error? At least that way it's easy to spot the broken line.
  2. Add an option to skip error lines?
  3. Print error lines to standard error so the user can handle broken lines themselves?
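
The three options could be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool's actual code: `filter_lines`, `count_fields`, and the `skip_errors` parameter are hypothetical names, the column check is a simplified stand-in, and lines are assumed to fit in a fixed 4096-byte buffer.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the tool's column check: counts commas that
 * are outside double quotes. */
static size_t count_fields(const char *line)
{
    size_t fields = 1;
    bool in_quotes = false;
    for (const char *p = line; *p != '\0'; p++) {
        if (*p == '"') {
            in_quotes = !in_quotes;
        } else if (*p == ',' && !in_quotes) {
            fields++;
        }
    }
    return fields;
}

/* Copies well-formed lines from `in` to `out`. Each malformed line is
 * reported on `err` together with its line number (option 3). With
 * skip_errors set, processing continues and the number of skipped lines
 * is returned (option 2). Otherwise the good lines written so far are
 * flushed and -1 is returned (option 1).
 * Assumption: every line fits in the buffer. */
long filter_lines(FILE *in, FILE *out, FILE *err,
                  size_t expected, bool skip_errors)
{
    char buf[4096];
    unsigned long line_no = 0;
    long skipped = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {
        line_no++;
        size_t found = count_fields(buf);
        if (found == expected) {
            fputs(buf, out);
            continue;
        }
        /* Print the broken line without its trailing newline. */
        fprintf(err, "line %lu: expected %zu columns, found %zu: %.*s\n",
                line_no, expected, found,
                (int)strcspn(buf, "\r\n"), buf);
        if (!skip_errors) {
            fflush(out);
            return -1;
        }
        skipped++;
    }
    return skipped;
}
```

Called as `filter_lines(stdin, stdout, stderr, 2, true)`, this would pass two-column lines through and report everything else on standard error, so the line number of a broken record is visible without any dd or sed work.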

I will check the spec out

@devzer01
Author

It looks like, if we want to stick to the spec, we'd be better off integrating with https://github.com/rgamble/libcsv.

What do you think?

@mifrandir
Owner

Sorry for ghosting you.

I think this tool wants to be standalone, for learning purposes; of course you can fork and do whatever you want with it, no hard feelings.

However, options 2 and 3 seem quite sensible and possibly the easiest to implement?
