9 changes: 9 additions & 0 deletions R/normalize.R
@@ -12,6 +12,15 @@
#' The profile is fixed at one pinned Unicode version per release; see
#' [normalization_profile_info()] for the machine-readable identity.
#'
#' This is a **UTS #46 profile, not IDNA2008 / RFC 5891 conformance.** UTS #46
#' is compatibility processing and deliberately differs from IDNA2008 — it
#' accepts labels IDNA2008 would reject (e.g. `"☕.example"` becomes
#' `"xn--53h.example"`). The pipeline draws on RFC 3492 (the Punycode
#' transform), NFC per UAX #15, the RFC 5892 ContextJ rules via `CheckJoiners`
#' (ZWJ/ZWNJ only — full RFC 5892 CONTEXTO is **not** checked), the RFC 5893
#' Bidi rule via `CheckBidi`, and STD 3 (RFC 1122/1123) host-name rules via
#' `UseSTD3ASCIIRules`. IDNA2003 / Nameprep (RFC 3490/3491/3454) is not used.
#'
#' The default applies the full strict UTS #46 profile
#' (`uts46-nontransitional-std3-v1`). The `check_hyphens`, `use_std3`, and
#' `verify_dns_length` arguments are UTS #46 processing flags that can each be
@@ -57,9 +66,9 @@

# Derive the coarse `profile` cache token from a flag set. The default profile
# (all checks on) yields the byte-stable historical token; any deviation appends
# a deterministic, fixed-order tag so a token minted under one flag set can never

[GitHub Actions / lint] Check warning on line 69 in R/normalize.R (col 81): [line_length_linter] Lines should not be more than 80 characters. This line is 81 characters.
# `identical()`-match one minted under another. The token is a COARSE cache key
# only: the precise identity lives in the per-parameter columns, which downstream

[GitHub Actions / lint] Check warning on line 71 in R/normalize.R (col 81): [line_length_linter] Lines should not be more than 80 characters. This line is 81 characters.
# keys on (PUNY-nblrvplp). check_bidi / check_joiners / transitional are not
# knobs (fixed by the profile), so they never enter the token.
.normalization_profile_token <- function(check_hyphens, use_std3,
10 changes: 8 additions & 2 deletions README.Rmd
@@ -24,7 +24,12 @@ High-performance Unicode and Punycode encoding/decoding for internationalized do

## Overview

-The `punycoder` package addresses critical gaps in R's URL processing capabilities by providing reliable, fast conversion between Unicode and ASCII representations of domain names. It follows RFC 3492 standards and is designed for robust handling of internationalized domain names in web scraping, data analysis, and URL processing workflows.
+The `punycoder` package provides fast, standards-based conversion between Unicode and ASCII representations of domain names, across two distinct surfaces:
+
+- a **low-level Punycode codec** — `puny_encode()` / `puny_decode()` — the raw RFC 3492 transform with `xn--` A-label framing (RFC 5890/5891) and letter-digit-hyphen checks, **not** an IDNA normalization API (no Unicode NFC, UTS #46 mapping, or case folding);
+- an **IDNA/UTS-46 host-normalization surface** — `host_normalize()` — mapping a host name to its canonical lowercase ASCII comparison form under a pinned UTS #46 non-transitional profile.
+
+`host_normalize()` is a **UTS #46 profile, not IDNA2008 conformance** — UTS #46 is compatibility processing and deliberately accepts labels IDNA2008 would reject (e.g. `☕.example` → `xn--53h.example`). See [`docs/normalization-contract.md`](docs/normalization-contract.md) for the normative profile and full standards references (RFC 3492/5890/5891/5892/5893, UTS #46, UAX #15/#44, STD 3, RFC 8753).

## Dependencies

@@ -144,7 +149,8 @@ validate_domain(c("valid.com", "invalid..domain"))

`punycoder` currently provides:

-- Domain encoding/decoding: `puny_encode()`, `puny_decode()`
+- Low-level Punycode codec: `puny_encode()`, `puny_decode()`
+- IDNA/UTS-46 host normalization: `host_normalize()`, `normalization_profile_info()`
- Best-effort URL host rewriting/extraction (not URL parsing/canonicalization): `url_encode()`, `url_decode()`, `parse_url()`
- Domain validation utilities: `is_punycode()`, `is_idn()`, `validate_domain()`
- Vectorized operations and strict/non-strict handling for malformed input
27 changes: 21 additions & 6 deletions README.md
@@ -15,11 +15,24 @@ internationalized domain names (IDNs) in R.

## Overview

-The `punycoder` package addresses critical gaps in R’s URL processing
-capabilities by providing reliable, fast conversion between Unicode and
-ASCII representations of domain names. It follows RFC 3492 standards and
-is designed for robust handling of internationalized domain names in web
-scraping, data analysis, and URL processing workflows.
+The `punycoder` package provides fast, standards-based conversion
+between Unicode and ASCII representations of domain names, across two
+distinct surfaces:
+
+- a **low-level Punycode codec** — `puny_encode()` / `puny_decode()` —
+the raw RFC 3492 transform with `xn--` A-label framing (RFC 5890/5891)
+and letter-digit-hyphen checks, **not** an IDNA normalization API (no
+Unicode NFC, UTS \#46 mapping, or case folding);
+- an **IDNA/UTS-46 host-normalization surface** — `host_normalize()` —
+mapping a host name to its canonical lowercase ASCII comparison form
+under a pinned UTS \#46 non-transitional profile.
+
+`host_normalize()` is a **UTS \#46 profile, not IDNA2008 conformance** —
+UTS \#46 is compatibility processing and deliberately accepts labels
+IDNA2008 would reject (e.g. `☕.example` → `xn--53h.example`). See
+[`docs/normalization-contract.md`](docs/normalization-contract.md) for
+the normative profile and full standards references (RFC
+3492/5890/5891/5892/5893, UTS \#46, UAX \#15/#44, STD 3, RFC 8753).

## Dependencies

@@ -153,7 +166,9 @@ validate_domain(c("valid.com", "invalid..domain"))

`punycoder` currently provides:

-- Domain encoding/decoding: `puny_encode()`, `puny_decode()`
+- Low-level Punycode codec: `puny_encode()`, `puny_decode()`
+- IDNA/UTS-46 host normalization: `host_normalize()`,
+`normalization_profile_info()`
- Best-effort URL host rewriting/extraction (not URL
parsing/canonicalization): `url_encode()`, `url_decode()`,
`parse_url()`
54 changes: 54 additions & 0 deletions docs/normalization-contract.md
@@ -246,3 +246,57 @@ punycoder release.

Contract ratified. Implementation (PSLR-encodgsk), parity tests
(PSLR-gnmvyymh), and release (PSLR-obdfxkqb) may proceed.

## 10. Standards and references

`host_normalize()` implements a **pinned UTS #46 non-transitional ToASCII
profile**. UTS #46 is *compatibility processing* and is **deliberately not
identical to IDNA2008**: it accepts labels IDNA2008 would reject (e.g.
`☕.example` → `xn--53h.example`), and WHATWG specifies UTS #46 "and not
IDNA2008". This function must therefore be described as a UTS #46 profile,
**never** as IDNA2008 / RFC 5891 conformance.

Standards this profile draws on, mapped to where each is used:

- **UTS #46** (*Unicode IDNA Compatibility Processing*) — the overall mapping +
validation profile (section 3; algorithm steps 3a/3c/3d). UTS #46 §6 also
*recommends* UTR #36 / UTS #39 confusable checks as application/UI-layer
steps — out of scope here (section 1, Non-goals).
- **RFC 3492** (*Punycode*, the IDNA parameterization of Bootstring) — the
deterministic A-label ↔ U-label transform (step 4, and the re-encode check in
step 3d). The only step the optional libidn2 backend may serve (section 6).
- **RFC 5890** (*IDNA2008 Definitions/Framework*) — the `xn--` ACE prefix and
the A-label / U-label / LDH vocabulary. `puny_encode()` emitting `xn--` is RFC
5890 framing, not RFC 3492.
- **RFC 5891** (*IDNA2008 Protocol*) — the §5.4 canonical-A-label requirement
enforced in step 3d. Cited for that specific check only; see the UTS #46 ≠
IDNA2008 note above — we do **not** claim RFC 5891 conformance.
- **RFC 5892** (*The Unicode Code Points and IDNA*) — derived property values
and contextual rules. Our `CheckJoiners` uses the RFC 5892 **ContextJ** rules
for ZWJ/ZWNJ (`punycoder_normalize.cpp:69-86`, `Joining_Type` tables). We do
**NOT** implement full RFC 5892 **CONTEXTO** validation: CONTEXTO code points
that the UTS #46 mapping table marks `valid` (e.g. U+00B7 middle dot, Greek
keraia U+0375, Hebrew geresh/gershayim, Katakana middle dot U+30FB) are
accepted with no CONTEXTO check. That is conformant UTS #46 (which mandates
CheckJoiners, not CONTEXTO); adding CONTEXTO would be a separate, deliberate
change.
- **RFC 5893** (*Right-to-Left Scripts for IDNA*) — the Bidi rule enforced by
`CheckBidi`.
- **UAX #15** (*Unicode Normalization Forms*) — NFC (step 3b).
- **UAX #44** (*Unicode Character Database*) — the property tables consumed:
`Bidi_Class`, `Joining_Type`, `General_Category`, `Canonical_Combining_Class`.
- **STD 3** (= **RFC 1122** + **RFC 1123**; RFC 1123 §2.1 relaxes the
  **RFC 952** host-name syntax) — the host-name letter/digit/hyphen rules
  behind `UseSTD3ASCIIRules`.
- **RFC 8753** (*IDNA Review for New Unicode Versions*) — rationale for pinning
one Unicode version per release (sections 7, 8).
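
The RFC 3492 / RFC 5890 split in the list above can be made concrete with a
small base-R sketch. This is **not** punycoder's implementation: the function
names `punycode_encode_sketch()` and `to_a_label_sketch()` are hypothetical,
and the code performs no NFC, no UTS #46 mapping or validation, and no overflow
checks. It only shows that the Bootstring transform (RFC 3492) produces the bare
digits, while the `xn--` prefix is separate RFC 5890 framing.

```r
# Illustrative sketch only (hypothetical helpers, not the punycoder API).
# RFC 3492 Bootstring encoder with the Punycode parameters, base R only.
punycode_encode_sketch <- function(label) {
  base <- 36L; tmin <- 1L; tmax <- 26L; skew <- 38L; damp <- 700L
  digits <- c(letters, 0:9)              # digit values 0..35 map to a-z, 0-9

  # Bias adaptation, RFC 3492 section 6.1.
  adapt <- function(delta, numpoints, first) {
    delta <- if (first) delta %/% damp else delta %/% 2L
    delta <- delta + delta %/% numpoints
    k <- 0L
    while (delta > ((base - tmin) * tmax) %/% 2L) {
      delta <- delta %/% (base - tmin)
      k <- k + base
    }
    k + ((base - tmin + 1L) * delta) %/% (delta + skew)
  }

  cps <- utf8ToInt(enc2utf8(label))
  basic <- cps[cps < 128L]
  out <- intToUtf8(basic, multiple = TRUE)  # basic code points are copied first
  b <- length(basic)
  if (b > 0L) out <- c(out, "-")            # delimiter only if any were copied

  n <- 128L; delta <- 0L; bias <- 72L; h <- b
  while (h < length(cps)) {
    m <- min(cps[cps >= n])                 # next code point to insert
    delta <- delta + (m - n) * (h + 1L)
    n <- m
    for (cp in cps) {
      if (cp < n) delta <- delta + 1L
      if (cp == n) {
        q <- delta
        k <- base
        repeat {                            # emit delta as a variable-length integer
          thr <- if (k <= bias) tmin else if (k >= bias + tmax) tmax else k - bias
          if (q < thr) break
          out <- c(out, digits[thr + (q - thr) %% (base - thr) + 1L])
          q <- (q - thr) %/% (base - thr)
          k <- k + base
        }
        out <- c(out, digits[q + 1L])
        bias <- adapt(delta, h + 1L, h == b)
        delta <- 0L
        h <- h + 1L
      }
    }
    delta <- delta + 1L
    n <- n + 1L
  }
  paste(out, collapse = "")
}

# RFC 5890 framing: prefix "xn--" only on labels that contain non-ASCII.
to_a_label_sketch <- function(host) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  enc <- vapply(labels, function(l) {
    if (all(utf8ToInt(l) < 128L)) l else paste0("xn--", punycode_encode_sketch(l))
  }, character(1), USE.NAMES = FALSE)
  paste(enc, collapse = ".")
}

punycode_encode_sketch("\u2615")       # "53h"  (RFC 3492 output, no prefix)
to_a_label_sketch("\u2615.example")    # "xn--53h.example"
```

The sketch accepts `☕` because it validates nothing, which is why a raw codec
on its own cannot stand in for either the UTS #46 profile or IDNA2008.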

Consciously **not** used (rejected alternative): **RFC 3490 / 3491 / 3454**
(IDNA2003 / Nameprep / Stringprep). The UTS #46 non-transitional profile
supersedes them. **RFC 5894** (IDNA2008 rationale, informational) is background
reading, not a normative dependency.

The best-effort URL helpers (`url_*` / `parse_url()`, `punycoder_url.cpp`) are
**not** conformant to **RFC 3986** (URI), **RFC 3987** (IRI), the **WHATWG URL
Standard**, **RFC 5952** (IPv6 text form), **RFC 4291** (IPv6 addressing), or
**RFC 6874** (IPv6 zone IDs); those citations belong with that surface and move
with it if it migrates to a dedicated URL package.
9 changes: 9 additions & 0 deletions man/host_normalize.Rd

Some generated files are not rendered by default.