Conversation
|
Any news or thoughts about this? |
|
My apologies @czaefferer, I have not been able to find the time to review this properly. Hopefully soon. |
This comment has been minimized.
This comment has been minimized.
| joe,sam,ja"n | ||
| joe,sam,"ja"n" | ||
| joe,"sa | ||
| "m",jan |
There was a problem hiding this comment.
this test here should actually fail. each line in a csv file indicates a row, and a newline indicates the start of a new row. the result in the snapshot isn't accurate.
There was a problem hiding this comment.
I don't think you're right there, this is from the RFC 4180 (https://tools.ietf.org/html/rfc4180):
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
As I understand this, a newline within a quoted field does not indicate the start of a new line (I wish it would, that would have saved me so much trouble in the past).
There was a problem hiding this comment.
hm you may be right there. I'm giving things a big refresh in order to try and alleviate the pain of working through tokenizing things like this, remove some complication. hang tight and I'll follow up when I have that work done.
There was a problem hiding this comment.
Of curious note is this bit:
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
That would seem to indicate that joe,sam,"ja"n" is invalid, would it not?
There was a problem hiding this comment.
Yes, it should be invalid. However, if a single double-quote is found which isn't followed by a separator or a newline, it can be identified as invalid, and could then be treated like any other character. This does indeed take the standard a bit more lax, but then again, so do a lot of other options like configurable separators, quote- and escape-characters. To me it would be totally fine if this behaviour would need to be enabled by configuration. But considering how often I see this intentionally used in real world files, I think this is a far better solution than to invalidate the whole record (which, given the alternative, would not be acceptable in my use-cases).
Right now however, the parser does neither invalidate rows nor treat misplaced double-quotes as regular text, usually it merges two rows, but I've also seen it merge 8K out of 40K rows due to a single misplaced double-quote.
There was a problem hiding this comment.
Yeah it certainly doesn't follow the spec at present. I'm working on a proper tokenizer for this that'll make implementing parts of the spec far easier. I work with the PostCSS folks quite a bit and maintain the postcss-less, postcss-values-parser modules which both follow the PostCSS tokenizer model and it's amazing how much easier it is to work with versus an ad-hoc parser like what's in the module at present
|
FWIW I confirmed that Excel does export as |
This PR contains:
Breaking Changes?
Unescaped quotes that are not preceded or succeeded by a separator or Linebreak are handled like any other character.
Please Describe Your Changes
fixes issue #70