-
-
Notifications
You must be signed in to change notification settings - Fork 34.4k
src: improve StringBytes::Encode perf on UTF8 #61131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
5b2b040 to
aee5408
Compare
aee5408 to
118db5f
Compare
|
As #61119 landed, this is now ready. Rebased. |
118db5f to
f1d3a0e
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #61131 +/- ##
==========================================
+ Coverage 88.52% 88.54% +0.01%
==========================================
Files 704 704
Lines 208802 208808 +6
Branches 40318 40315 -3
==========================================
+ Hits 184842 184884 +42
+ Misses 15947 15907 -40
- Partials 8013 8017 +4
🚀 New features to boost your workflow:
|
| // We know that we are non-ASCII (and are unlikely Latin1), use 2-byte | ||
| // In the most likely case of valid UTF-8, we can use this fast impl | ||
| size_t u16size = simdutf::utf16_length_from_utf8(buf, buflen); | ||
| uint16_t* dst = node::UncheckedMalloc<uint16_t>(u16size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not a null check here?
| uint16_t* dst = node::UncheckedMalloc<uint16_t>(u16size); | |
| uint16_t* dst = node::UncheckedMalloc<uint16_t>(u16size); | |
| if (u16size != 0 && dst == nullptr) { | |
| isolate->ThrowException(node::ERR_MEMORY_ALLOCATION_FAILED(isolate)); | |
| return MaybeLocal<Value>(); | |
| } |
| if (simdutf::validate_utf8(buf, buflen)) { | ||
| // We know that we are non-ASCII (and are unlikely Latin1), use 2-byte | ||
| // In the most likely case of valid UTF-8, we can use this fast impl | ||
| size_t u16size = simdutf::utf16_length_from_utf8(buf, buflen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think we need a guard here to not allocate for no reason:
| size_t u16size = simdutf::utf16_length_from_utf8(buf, buflen); | |
| size_t u16size = simdutf::utf16_length_from_utf8(buf, buflen); | |
| if (u16size > static_cast<size_t>(v8::String::kMaxLength)) { | |
| isolate->ThrowException(node::ERR_STRING_TOO_LONG(isolate)); | |
| return MaybeLocal<Value>(); | |
| } |
Tracking: #61041
Most data is valid utf-8, no need to wait for v8 optimizations or for simdutf implementing fast replacement.
We can just check + simdutf in fast case.
This is a 2x-10x speedup according to https://github.com/lemire/jstextdecoderbench bench (+ I added extra cases)
There is still room for improvement here (e.g. avoiding triple scans), but this change alone improves results significantly
We can improve further iteratively
This performs mallocs only for valid strings, instead of optimistically malloc-ing and decoding until error
Switching that behavior to optimistic would be a separate PR (perf needs to be checked against this not main or #61119)
Buffer#toString() - utf8
pre-#61119:
main with #61119 (landed):
PR:
TextDecoder, loose
pre-#61119:
main with #61119 (landed):
PR:
TextDecoder, fatal
pre-#61119:
main with #61119 (landed):
PR:
cc @nodejs/performance