From 9b1885820ece8ce555f1573c1a1020e9169d1caf Mon Sep 17 00:00:00 2001
From: Bart Turczynski <142225707+bart-turczynski@users.noreply.github.com>
Date: Tue, 16 Jun 2026 21:12:27 +0200
Subject: [PATCH] docs: citation/documentation sweep (PUNY-ccaxoxfa)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cite the standards behaviour we already implement and make the UTS #46 !=
IDNA2008 distinction explicit. No behaviour change.

- normalization-contract.md: new "Standards and references" section mapping
  each standard to where it is used — RFC 3492 (Punycode transform), RFC 5890
  (xn-- ACE framing), RFC 5891 (canonical A-label, cited for that check only),
  RFC 5892 (ContextJ via CheckJoiners; NOT full CONTEXTO), RFC 5893 (Bidi via
  CheckBidi), UAX #15 (NFC), UAX #44 (UCD properties), STD 3 = RFC 952+1123
  (UseSTD3), RFC 8753 (version pinning). Notes IDNA2003/Nameprep (RFC
  3490/3491/3454) as the rejected alternative and the URL-surface citations
  as scope-conditional.
- host_normalize roxygen: spell out "UTS #46 profile, not IDNA2008/RFC 5891
  conformance" with the coffee.example example and per-standard pointers.
- README: two-tier overview, UTS #46 != IDNA2008 note, host_normalize/
  normalization_profile_info added to the feature list, link to the contract.

Co-Authored-By: Claude Opus 4.8
---
 R/normalize.R                  |  9 ++++++
 README.Rmd                     | 10 +++++--
 README.md                      | 27 +++++++++++++----
 docs/normalization-contract.md | 54 ++++++++++++++++++++++++++++++++++
 man/host_normalize.Rd          |  9 ++++++
 5 files changed, 101 insertions(+), 8 deletions(-)

diff --git a/R/normalize.R b/R/normalize.R
index 7c97670..d7b3a5e 100644
--- a/R/normalize.R
+++ b/R/normalize.R
@@ -12,6 +12,15 @@
 #' The profile is fixed at one pinned Unicode version per release; see
 #' [normalization_profile_info()] for the machine-readable identity.
 #'
+#' This is a **UTS #46 profile, not IDNA2008 / RFC 5891 conformance.** UTS #46
+#' is compatibility processing and deliberately differs from IDNA2008 — it
+#' accepts labels IDNA2008 would reject (e.g. `"☕.example"` becomes
+#' `"xn--53h.example"`). The pipeline draws on RFC 3492 (the Punycode
+#' transform), NFC per UAX #15, the RFC 5892 ContextJ rules via `CheckJoiners`
+#' (ZWJ/ZWNJ only — full RFC 5892 CONTEXTO is **not** checked), the RFC 5893
+#' Bidi rule via `CheckBidi`, and STD 3 (RFC 952 + RFC 1123) host-name rules via
+#' `UseSTD3ASCIIRules`. IDNA2003 / Nameprep (RFC 3490/3491/3454) is not used.
+#'
 #' The default applies the full strict UTS #46 profile
 #' (`uts46-nontransitional-std3-v1`). The `check_hyphens`, `use_std3`, and
 #' `verify_dns_length` arguments are UTS #46 processing flags that can each be
diff --git a/README.Rmd b/README.Rmd
index b03a377..78e8946 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -24,7 +24,12 @@ High-performance Unicode and Punycode encoding/decoding for internationalized do
 
 ## Overview
 
-The `punycoder` package addresses critical gaps in R's URL processing capabilities by providing reliable, fast conversion between Unicode and ASCII representations of domain names. It follows RFC 3492 standards and is designed for robust handling of internationalized domain names in web scraping, data analysis, and URL processing workflows.
+The `punycoder` package provides fast, standards-based conversion between Unicode and ASCII representations of domain names, across two distinct surfaces:
+
+- a **low-level Punycode codec** — `puny_encode()` / `puny_decode()` — the raw RFC 3492 transform with `xn--` A-label framing (RFC 5890/5891) and letter-digit-hyphen checks, **not** an IDNA normalization API (no Unicode NFC, UTS #46 mapping, or case folding);
+- an **IDNA/UTS-46 host-normalization surface** — `host_normalize()` — mapping a host name to its canonical lowercase ASCII comparison form under a pinned UTS #46 non-transitional profile.
+
+`host_normalize()` is a **UTS #46 profile, not IDNA2008 conformance** — UTS #46 is compatibility processing and deliberately accepts labels IDNA2008 would reject (e.g. `☕.example` → `xn--53h.example`). See [`docs/normalization-contract.md`](docs/normalization-contract.md) for the normative profile and full standards references (RFC 3492/5890/5891/5892/5893, UTS #46, UAX #15/#44, STD 3, RFC 8753).
 
 ## Dependencies
 
@@ -144,7 +149,8 @@ validate_domain(c("valid.com", "invalid..domain"))
 
 `punycoder` currently provides:
 
-- Domain encoding/decoding: `puny_encode()`, `puny_decode()`
+- Low-level Punycode codec: `puny_encode()`, `puny_decode()`
+- IDNA/UTS-46 host normalization: `host_normalize()`, `normalization_profile_info()`
 - Best-effort URL host rewriting/extraction (not URL parsing/canonicalization): `url_encode()`, `url_decode()`, `parse_url()`
 - Domain validation utilities: `is_punycode()`, `is_idn()`, `validate_domain()`
 - Vectorized operations and strict/non-strict handling for malformed input
diff --git a/README.md b/README.md
index 25f87c4..da79cf7 100644
--- a/README.md
+++ b/README.md
@@ -15,11 +15,24 @@ internationalized domain names (IDNs) in R.
 
 ## Overview
 
-The `punycoder` package addresses critical gaps in R’s URL processing
-capabilities by providing reliable, fast conversion between Unicode and
-ASCII representations of domain names. It follows RFC 3492 standards and
-is designed for robust handling of internationalized domain names in web
-scraping, data analysis, and URL processing workflows.
+The `punycoder` package provides fast, standards-based conversion
+between Unicode and ASCII representations of domain names, across two
+distinct surfaces:
+
+- a **low-level Punycode codec** — `puny_encode()` / `puny_decode()` —
+  the raw RFC 3492 transform with `xn--` A-label framing (RFC 5890/5891)
+  and letter-digit-hyphen checks, **not** an IDNA normalization API (no
+  Unicode NFC, UTS \#46 mapping, or case folding);
+- an **IDNA/UTS-46 host-normalization surface** — `host_normalize()` —
+  mapping a host name to its canonical lowercase ASCII comparison form
+  under a pinned UTS \#46 non-transitional profile.
+
+`host_normalize()` is a **UTS \#46 profile, not IDNA2008 conformance** —
+UTS \#46 is compatibility processing and deliberately accepts labels
+IDNA2008 would reject (e.g. `☕.example` → `xn--53h.example`). See
+[`docs/normalization-contract.md`](docs/normalization-contract.md) for
+the normative profile and full standards references (RFC
+3492/5890/5891/5892/5893, UTS \#46, UAX \#15/#44, STD 3, RFC 8753).
 
 ## Dependencies
 
@@ -153,7 +166,9 @@ validate_domain(c("valid.com", "invalid..domain"))
 
 `punycoder` currently provides:
 
-- Domain encoding/decoding: `puny_encode()`, `puny_decode()`
+- Low-level Punycode codec: `puny_encode()`, `puny_decode()`
+- IDNA/UTS-46 host normalization: `host_normalize()`,
+  `normalization_profile_info()`
 - Best-effort URL host rewriting/extraction (not URL
   parsing/canonicalization): `url_encode()`, `url_decode()`,
   `parse_url()`
diff --git a/docs/normalization-contract.md b/docs/normalization-contract.md
index 195b167..4a1bfd8 100644
--- a/docs/normalization-contract.md
+++ b/docs/normalization-contract.md
@@ -246,3 +246,57 @@ punycoder release.
 
 Contract ratified. Implementation (PSLR-encodgsk), parity tests (PSLR-gnmvyymh),
 and release (PSLR-obdfxkqb) may proceed.
+
+## 10. Standards and references
+
+`host_normalize()` implements a **pinned UTS #46 non-transitional ToASCII
+profile**. UTS #46 is *compatibility processing* and is **deliberately not
+identical to IDNA2008**: it accepts labels IDNA2008 would reject (e.g.
+`☕.example` → `xn--53h.example`), and WHATWG specifies UTS #46 "and not
+IDNA2008". This function must therefore be described as a UTS #46 profile,
+**never** as IDNA2008 / RFC 5891 conformance.
+
+Standards this profile draws on, mapped to where each is used:
+
+- **UTS #46** (*Unicode IDNA Compatibility Processing*) — the overall mapping +
+  validation profile (section 3; algorithm steps 3a/3c/3d). UTS #46 §6 also
+  *recommends* UTR #36 / UTS #39 confusable checks as application/UI-layer
+  steps — out of scope here (section 1, Non-goals).
+- **RFC 3492** (*Punycode*, the IDNA parameterization of Bootstring) — the
+  deterministic A-label ↔ U-label transform (step 4, and the re-encode check in
+  step 3d). The only step the optional libidn2 backend may serve (section 6).
+- **RFC 5890** (*IDNA2008 Definitions/Framework*) — the `xn--` ACE prefix and
+  the A-label / U-label / LDH vocabulary. `puny_encode()` emitting `xn--` is RFC
+  5890 framing, not RFC 3492.
+- **RFC 5891** (*IDNA2008 Protocol*) — the §5.4 canonical-A-label requirement
+  enforced in step 3d. Cited for that specific check only; see the UTS #46 ≠
+  IDNA2008 note above — we do **not** claim RFC 5891 conformance.
+- **RFC 5892** (*The Unicode Code Points and IDNA*) — derived property values
+  and contextual rules. Our `CheckJoiners` uses the RFC 5892 **ContextJ** rules
+  for ZWJ/ZWNJ (`punycoder_normalize.cpp:69-86`, `Joining_Type` tables). We do
+  **NOT** implement full RFC 5892 **CONTEXTO** validation: CONTEXTO code points
+  that the UTS #46 mapping table marks `valid` (e.g. U+00B7 middle dot, Greek
+  keraia U+0375, Hebrew geresh/gershayim, Katakana middle dot U+30FB) are
+  accepted with no CONTEXTO check. That is conformant UTS #46 (which mandates
+  CheckJoiners, not CONTEXTO); adding CONTEXTO would be a separate, deliberate
+  change.
+- **RFC 5893** (*Right-to-Left Scripts for IDNA*) — the Bidi rule enforced by
+  `CheckBidi`.
+- **UAX #15** (*Unicode Normalization Forms*) — NFC (step 3b).
+- **UAX #44** (*Unicode Character Database*) — the property tables consumed:
+  `Bidi_Class`, `Joining_Type`, `General_Category`, `Canonical_Combining_Class`.
+- **STD 3** (= **RFC 952** + **RFC 1123**) — the host-name letter/digit/hyphen
+  rules behind `UseSTD3ASCIIRules`.
+- **RFC 8753** (*IDNA Review for New Unicode Versions*) — rationale for pinning
+  one Unicode version per release (sections 7, 8).
+
+Consciously **not** used (rejected alternative): **RFC 3490 / 3491 / 3454**
+(IDNA2003 / Nameprep / Stringprep). The UTS #46 non-transitional profile
+supersedes them. **RFC 5894** (IDNA2008 rationale, informational) is background
+reading, not a normative dependency.
+
+The best-effort URL helpers (`url_*` / `parse_url()`, `punycoder_url.cpp`) are
+**not** conformant to **RFC 3986** (URI), **RFC 3987** (IRI), the **WHATWG URL
+Standard**, **RFC 5952** (IPv6 text form), **RFC 4291** (IPv6 addressing), or
+**RFC 6874** (IPv6 zone IDs); those citations belong with that surface and move
+with it if it migrates to a dedicated URL package.
diff --git a/man/host_normalize.Rd b/man/host_normalize.Rd
index 68fcfae..dbeb77d 100644
--- a/man/host_normalize.Rd
+++ b/man/host_normalize.Rd
@@ -46,6 +46,15 @@ Unlike [puny_encode()], invalid input is reported by returning
 The profile is fixed at one pinned Unicode version per release; see
 [normalization_profile_info()] for the machine-readable identity.
 
+This is a **UTS #46 profile, not IDNA2008 / RFC 5891 conformance.** UTS #46
+is compatibility processing and deliberately differs from IDNA2008 — it
+accepts labels IDNA2008 would reject (e.g. `"☕.example"` becomes
+`"xn--53h.example"`). The pipeline draws on RFC 3492 (the Punycode
+transform), NFC per UAX #15, the RFC 5892 ContextJ rules via `CheckJoiners`
+(ZWJ/ZWNJ only — full RFC 5892 CONTEXTO is **not** checked), the RFC 5893
+Bidi rule via `CheckBidi`, and STD 3 (RFC 952 + RFC 1123) host-name rules via
+`UseSTD3ASCIIRules`. IDNA2003 / Nameprep (RFC 3490/3491/3454) is not used.
+
 The default applies the full strict UTS #46 profile
 (`uts46-nontransitional-std3-v1`). The `check_hyphens`, `use_std3`, and
 `verify_dns_length` arguments are UTS #46 processing flags that can each be