src: improve StringBytes::Encode perf on UTF8 #61131

ChALkeR · 2025-12-20T00:03:59Z

Tracking: #61041

Most data is valid UTF-8, so there is no need to wait for V8 optimizations or for simdutf to implement fast replacement.
We can just validate the input and use simdutf in the fast case.

This is a 2x-10x speedup according to the https://github.com/lemire/jstextdecoderbench benchmark (with extra cases I added).

There is still room for improvement here (e.g. avoiding the triple scan), but this change alone improves results significantly.
We can improve further iteratively.
This performs mallocs only for valid strings, instead of optimistically malloc-ing and decoding until an error is hit.
Switching to the optimistic behavior would be a separate PR (its perf would need to be checked against this PR, not against main or #61119).
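
For readers who want the shape of the approach without the Node.js sources, here is a minimal standalone sketch of "validate first, then size, then convert". It is not the code in this PR: the scalar helpers `NextCodePoint`, `IsValidUtf8`, `Utf16Length` and `DecodeIfValid` are hypothetical stand-ins for the simdutf calls, and `std::u16string` stands in for the malloc-ed buffer. It does show the three scans mentioned above and that nothing is allocated for invalid input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes one code point starting at s[i] and advances i past it.
// Returns false on malformed input: bad lead byte, truncation, bad
// continuation byte, overlong form, surrogate, or value above U+10FFFF.
static bool NextCodePoint(const std::string& s, size_t& i, uint32_t& cp) {
  static constexpr uint32_t kMinForLength[] = {0, 0, 0x80, 0x800, 0x10000};
  const uint8_t lead = static_cast<uint8_t>(s[i]);
  size_t len;
  if (lead < 0x80) { len = 1; cp = lead; }
  else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
  else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
  else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
  else return false;
  if (len > s.size() - i) return false;  // truncated sequence
  for (size_t k = 1; k < len; k++) {
    const uint8_t c = static_cast<uint8_t>(s[i + k]);
    if ((c & 0xC0) != 0x80) return false;
    cp = (cp << 6) | (c & 0x3F);
  }
  if (cp < kMinForLength[len] || cp > 0x10FFFF ||
      (cp >= 0xD800 && cp <= 0xDFFF)) {
    return false;
  }
  i += len;
  return true;
}

// Scan 1: validation only, no allocation.
bool IsValidUtf8(const std::string& s) {
  size_t i = 0;
  uint32_t cp = 0;
  while (i < s.size()) {
    if (!NextCodePoint(s, i, cp)) return false;
  }
  return true;
}

// Scan 2: exact UTF-16 length. Only meaningful for valid UTF-8.
size_t Utf16Length(const std::string& s) {
  size_t n = 0;
  for (unsigned char b : s) {
    if ((b & 0xC0) != 0x80) n++;  // start of a code point
    if (b >= 0xF0) n++;           // 4-byte sequence needs a surrogate pair
  }
  return n;
}

// Scan 3: conversion into an exactly sized buffer.
// Returns false and leaves *out untouched when the input is invalid, so
// the caller can fall back to a slower path that handles invalid input.
bool DecodeIfValid(const std::string& s, std::u16string* out) {
  if (!IsValidUtf8(s)) return false;
  out->clear();
  out->reserve(Utf16Length(s));
  size_t i = 0;
  uint32_t cp = 0;
  while (i < s.size() && NextCodePoint(s, i, cp)) {
    if (cp < 0x10000) {
      out->push_back(static_cast<char16_t>(cp));
    } else {
      cp -= 0x10000;
      out->push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out->push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  assert(out->size() == Utf16Length(s));  // the precomputed size was exact
  return true;
}
```

A caller would try `DecodeIfValid` first and only take the slow path when it returns false. The optimistic alternative mentioned above would instead allocate up front and bail out at the first error.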

Buffer#toString() - utf8

pre-#61119:

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	18.21 GiB/s	0.005 ms
Arabic lipsum	79.771 KiB	0.29 GiB/s	0.266 ms
Chinese lipsum	68.203 KiB	0.34 GiB/s	0.192 ms
Arabic + 2 * ASCII	249.575 KiB	0.73 GiB/s	0.329 ms

main with #61119 (landed):

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	36.75 GiB/s	0.002 ms
Arabic lipsum	79.771 KiB	0.28 GiB/s	0.273 ms
Chinese lipsum	68.203 KiB	0.33 GiB/s	0.197 ms
Arabic + 2 * ASCII	249.575 KiB	0.69 GiB/s	0.344 ms

PR:

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	36.84 GiB/s	0.002 ms
Arabic lipsum	79.771 KiB	2.03 GiB/s	0.038 ms
Chinese lipsum	68.203 KiB	4.06 GiB/s	0.016 ms
Arabic + 2 * ASCII	249.577 KiB	3.42 GiB/s	0.072 ms
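
As a reading aid for the tables above, the per-case speedup is just the ratio of the two throughput columns. A tiny sketch (the `Speedup` helper is hypothetical, not part of the benchmark):

```cpp
#include <cassert>

// Speedup of `after` relative to `before`; both values must use the same
// throughput unit (GiB/s in the tables above).
double Speedup(double before_gib_s, double after_gib_s) {
  assert(before_gib_s > 0.0);
  return after_gib_s / before_gib_s;
}
```

With the Buffer#toString() numbers for main and this PR, `Speedup(0.28, 2.03)` is roughly 7.2 for Arabic, `Speedup(0.33, 4.06)` roughly 12.3 for Chinese, and `Speedup(0.69, 3.42)` roughly 5.0 for the mixed case, while the ASCII case is unchanged within noise.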

TextDecoder, loose

pre-#61119:

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	17.99 GiB/s	0.005 ms
Arabic lipsum	79.771 KiB	0.28 GiB/s	0.270 ms
Chinese lipsum	68.203 KiB	0.34 GiB/s	0.194 ms
Arabic + 2 * ASCII	249.577 KiB	0.71 GiB/s	0.333 ms

main with #61119 (landed):

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	36.59 GiB/s	0.002 ms
Arabic lipsum	79.771 KiB	0.28 GiB/s	0.271 ms
Chinese lipsum	68.203 KiB	0.34 GiB/s	0.192 ms
Arabic + 2 * ASCII	249.577 KiB	0.70 GiB/s	0.340 ms

PR:

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	36.78 GiB/s	0.002 ms
Arabic lipsum	79.771 KiB	2.03 GiB/s	0.038 ms
Chinese lipsum	68.203 KiB	4.01 GiB/s	0.016 ms
Arabic + 2 * ASCII	249.577 KiB	3.42 GiB/s	0.072 ms

TextDecoder, fatal

pre-#61119:

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	15.31 GiB/s	0.006 ms
Arabic lipsum	79.771 KiB	0.27 GiB/s	0.279 ms
Chinese lipsum	68.203 KiB	0.34 GiB/s	0.194 ms
Arabic + 2 * ASCII	249.577 KiB	0.71 GiB/s	0.338 ms

main with #61119 (landed):

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	36.63 GiB/s	0.002 ms
Arabic lipsum	79.771 KiB	0.28 GiB/s	0.272 ms
Chinese lipsum	68.203 KiB	0.33 GiB/s	0.197 ms
Arabic + 2 * ASCII	249.577 KiB	0.68 GiB/s	0.351 ms

PR:

Test	Size	Throughput	Mean Time
Latin lipsum (ASCII)	84.902 KiB	36.71 GiB/s	0.002 ms
Arabic lipsum	79.771 KiB	1.70 GiB/s	0.046 ms
Chinese lipsum	68.203 KiB	2.97 GiB/s	0.022 ms
Arabic + 2 * ASCII	249.577 KiB	3.01 GiB/s	0.082 ms

cc @nodejs/performance

ChALkeR · 2026-01-17T11:01:41Z

As #61119 landed, this is now ready. Rebased.

codecov · 2026-01-17T12:04:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.54%. Comparing base (955d347) to head (f1d3a0e).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #61131      +/-   ##
==========================================
+ Coverage   88.52%   88.54%   +0.01%     
==========================================
  Files         704      704              
  Lines      208802   208808       +6     
  Branches    40318    40315       -3     
==========================================
+ Hits       184842   184884      +42     
+ Misses      15947    15907      -40     
- Partials     8013     8017       +4

Files with missing lines	Coverage Δ
src/encoding_binding.cc	`52.73% <ø> (ø)`
src/string_bytes.cc	`70.31% <100.00%> (+0.56%)`	⬆️

... and 34 files with indirect coverage changes


gurgunday · 2026-01-17T16:07:33Z

src/string_bytes.cc

+        // We know that we are non-ASCII (and are unlikely Latin1), use 2-byte
+        // In the most likely case of valid UTF-8, we can use this fast impl
+        size_t u16size = simdutf::utf16_length_from_utf8(buf, buflen);
+        uint16_t* dst = node::UncheckedMalloc<uint16_t>(u16size);


Why not a null check here?

Suggested change

-        uint16_t* dst = node::UncheckedMalloc<uint16_t>(u16size);
+        uint16_t* dst = node::UncheckedMalloc<uint16_t>(u16size);
+        if (u16size != 0 && dst == nullptr) {
+          isolate->ThrowException(node::ERR_MEMORY_ALLOCATION_FAILED(isolate));
+          return MaybeLocal<Value>();
+        }
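
A standalone sketch of why the suggested guard also tests `u16size != 0`: an unchecked allocator may return `nullptr` for a zero-element request as well as on failure, so a bare `dst == nullptr` test would misreport empty input as out-of-memory. `UncheckedAllocUtf16` and `AllocationFailed` are hypothetical names built on `std::malloc`, not Node.js APIs, and the zero-size behavior is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical unchecked allocator: returns nullptr on failure, on size
// overflow, and (by assumption here) for zero-element requests.
uint16_t* UncheckedAllocUtf16(size_t n) {
  if (n == 0) return nullptr;
  if (n > std::numeric_limits<size_t>::max() / sizeof(uint16_t)) {
    return nullptr;  // n * sizeof(uint16_t) would overflow
  }
  return static_cast<uint16_t*>(std::malloc(n * sizeof(uint16_t)));
}

// True only when a non-empty request produced no buffer.
bool AllocationFailed(const uint16_t* ptr, size_t n) {
  return n != 0 && ptr == nullptr;
}
```

In the PR's context the `true` branch corresponds to throwing `ERR_MEMORY_ALLOCATION_FAILED` and returning an empty `MaybeLocal<Value>`.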

gurgunday · 2026-01-17T16:11:04Z

src/string_bytes.cc

+      if (simdutf::validate_utf8(buf, buflen)) {
+        // We know that we are non-ASCII (and are unlikely Latin1), use 2-byte
+        // In the most likely case of valid UTF-8, we can use this fast impl
+        size_t u16size = simdutf::utf16_length_from_utf8(buf, buflen);


Again, I think we need a guard here to not allocate for no reason:

Suggested change

-        size_t u16size = simdutf::utf16_length_from_utf8(buf, buflen);
+        size_t u16size = simdutf::utf16_length_from_utf8(buf, buflen);
+        if (u16size > static_cast<size_t>(v8::String::kMaxLength)) {
+          isolate->ThrowException(node::ERR_STRING_TOO_LONG(isolate));
+          return MaybeLocal<Value>();
+        }
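
The ordering this suggestion asks for can be sketched in isolation: compare the computed UTF-16 length against the engine's limit before touching the allocator, so an oversized input costs no memory. `PrepareUtf16Buffer`, `PrepareResult` and the `max_length` parameter are hypothetical. In the PR the limit would be `v8::String::kMaxLength`, and `std::vector` stands in here for the raw buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class PrepareResult { kReady, kStringTooLong };

// Sizes *buf for u16size code units, unless that exceeds max_length, in
// which case *buf is left untouched and nothing is allocated.
PrepareResult PrepareUtf16Buffer(size_t u16size, size_t max_length,
                                 std::vector<uint16_t>* buf) {
  assert(buf != nullptr);
  if (u16size > max_length) return PrepareResult::kStringTooLong;
  buf->resize(u16size);
  return PrepareResult::kReady;
}
```

The `kStringTooLong` branch corresponds to throwing `ERR_STRING_TOO_LONG` in the suggestion above.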

nodejs-github-bot added the buffer, c++, and needs-ci labels Dec 20, 2025

ChALkeR force-pushed the chalker/non-ascii/0 branch 2 times, most recently from 5b2b040 to aee5408 Compare December 20, 2025 05:49

RafaelGSS added the performance label Dec 29, 2025

RafaelGSS self-requested a review December 29, 2025 20:49

ChALkeR force-pushed the chalker/non-ascii/0 branch from aee5408 to 118db5f Compare January 17, 2026 11:01

ChALkeR marked this pull request as ready for review January 17, 2026 11:01

src: improve StringBytes::Encode perf on UTF8

f1d3a0e

ChALkeR force-pushed the chalker/non-ascii/0 branch from 118db5f to f1d3a0e Compare January 17, 2026 11:06

gurgunday reviewed Jan 17, 2026
