Skip to content

varchar() measures characters instead of bytes, causing truncation with multibyte UTF-8 strings (SQL Server) #960

@lachlangh

Description

@lachlangh

I believe there’s a bug in the varchar() helper when used with SQL Server.

odbc/R/aaa-odbc-data-type.R

Lines 119 to 128 in c8a2e26

varchar <- function(x, type = "varchar") {
# at least 255 characters, use max if more than 8000:
max_length <- max(c(255, nchar(as.character(x))), na.rm = TRUE)
if (max_length > 8000) {
max_length <- "max"
}
paste0(type, "(", max_length, ")")
}

The SQL Server documentation for VARCHAR(n) defines n as the number of bytes, not characters.
Currently, varchar() determines n using nchar(type = "chars"), which underestimates the required width for multibyte UTF-8 strings. This can cause string truncation when inserting data longer than 255 bytes.

Reprex

string <- paste0("\xe2\x80\x99", paste(rep("a", 255), collapse = ""))
nchar(string)
#> [1] 256
nchar(string, type = "bytes")
#> [1] 258
odbc:::varchar(string) # not 258
#> [1] "varchar(256)"

Created on 2025-10-29 with reprex v2.1.1

Expected

varchar() should use the byte length of the input (nchar(x, type = "bytes")) rather than character count.
That aligns with SQL Server’s definition and avoids truncation with UTF-8 input.
Using byte length instead of character count should not adversely affect other database backends.

Proposed change

I think that this would be the cleanest fix, given nchar() supports returning the byte count.

# before
max_length <- max(c(255, nchar(as.character(x))), na.rm = TRUE)

# after
max_length <- max(c(255, nchar(as.character(x), type = "bytes")), na.rm = TRUE)

I plan to open a PR with this proposed change.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions