rurl is a lightweight, vectorized toolkit for URL parsing,
normalization, extraction, and matching in R.
Current package capabilities include:
- Robust parsing via safe_parse_url() and safe_parse_urls()
- URL normalization with fine-grained controls for protocol, www, case, trailing slashes, index pages, path normalization, scheme-relative URLs, host encoding, and path encoding
- URL component extractors (get_* helpers)
- URL-based joins with canonical_join()
- Built-in memoization caches with introspection and configuration (rurl_cache_info(), rurl_cache_config(), rurl_clear_caches())

# From CRAN
install.packages("rurl")
# Development version from GitHub
# install.packages("remotes")
remotes::install_github("bart-turczynski/rurl")

- Parsing and normalization: safe_parse_url(), safe_parse_urls(), get_clean_url()
- Accessors: get_scheme(), get_host(), get_subdomain(), get_domain(), get_tld(), get_path(), get_query(), get_fragment(), get_port(), get_user(), get_password(), get_userinfo(), get_parse_status()
- Matching/joining: canonical_join() for deterministic canonical-key joins
- Cache control: rurl_cache_info(), rurl_cache_config(), rurl_clear_caches()

safe_parse_url() is the core workhorse. It returns parsed components
and a normalized clean_url.

library(rurl)

parsed <- safe_parse_url(
  "HTTP://www.Example.com/a//b/../index.html?x=1#frag",
  protocol_handling = "https",
  www_handling = "strip",
  case_handling = "lower_host",
  trailing_slash_handling = "strip",
  index_page_handling = "strip",
  path_normalization = "both",
  host_encoding = "idna",
  path_encoding = "encode"
)

parsed$clean_url
#> [1] "https://example.com/a"

parsed$parse_status
#> [1] "ok"

clean_url is a normalized canonical key built from scheme, host, and
path only. Port, query, fragment, and userinfo are intentionally
excluded — read them from the dedicated components (get_port(),
get_query(), get_fragment(), get_userinfo()) instead. With
path_encoding = "decode" the path is shown decoded, so clean_url is
human-readable rather than guaranteed URL-safe.
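
To make the "scheme, host, and path only" rule concrete, here is a minimal base-R sketch. It is not rurl's implementation: split_url() and canonical_key() are hypothetical helpers, the splitting relies on the generic regular expression from RFC 3986, Appendix B, and no normalization is applied.

```r
# Hypothetical helpers for illustration only; not part of rurl.
# Generic URI-splitting regex from RFC 3986, Appendix B.
uri_pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"

split_url <- function(url) {
  # m[1] is the full match; capture groups start at m[2].
  m <- regmatches(url, regexec(uri_pattern, url))[[1]]
  authority <- m[5]
  # Drop userinfo ("user:pass@") and a trailing ":port" from the authority.
  host <- sub(":[0-9]*$", "", sub("^[^@]*@", "", authority))
  list(scheme = m[3], host = host, path = m[6], query = m[8], fragment = m[10])
}

canonical_key <- function(url) {
  p <- split_url(url)
  # Port, query, fragment, and userinfo are deliberately left out of the key.
  paste0(p$scheme, "://", p$host, p$path)
}

key <- canonical_key("https://user:pass@example.com:8080/a/b?x=1#frag")
key
#> [1] "https://example.com/a/b"
```

Two URLs that differ only in query string or port collapse to the same key under this scheme, which is why those parts have to be read from their own components when they matter.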
Scheme-relative URL handling is configurable:
safe_parse_url("//example.com/path", scheme_relative_handling = "keep")$parse_status
#> [1] "ok-scheme-relative"
safe_parse_url("//example.com/path", scheme_relative_handling = "https")$clean_url
#> [1] "https://example.com/path"

For vectors, use safe_parse_urls():

safe_parse_urls(c("example.com", "https://www.example.com/path"))[, c("original_url", "clean_url", "parse_status")]

safe_parse_url() and get_clean_url() support these controls:

- protocol_handling: keep, none, strip, http, https
- www_handling: none, strip, keep, if_no_subdomain
- case_handling: lower_host (default), keep, lower, upper
- trailing_slash_handling: none, keep, strip
- index_page_handling: keep, strip
- path_normalization: none, collapse_slashes, dot_segments, both
- scheme_relative_handling: keep, http, https, error
- host_encoding: keep, idna, unicode
- path_encoding: keep, encode, decode
- subdomain_levels_to_keep: NULL, 0, or N > 0

Subdomain retention is applied after www_handling:
get_host("http://www.three.two.one.example.com", www_handling = "strip", subdomain_levels_to_keep = 1)
#> [1] "one.example.com"
get_clean_url("http://www.deep.sub.example.com/path", subdomain_levels_to_keep = 0)
#> [1] "http://www.example.com/path"

Host and path encoding controls:

get_clean_url("http://münich.com/a%20b",
              host_encoding = "idna",
              path_encoding = "encode",
              case_handling = "lower_host")
#> [1] "http://xn--mnich-kva.com/a%20b"

get_clean_url("http://xn--mnich-kva.com/a%20b",
              host_encoding = "unicode",
              path_encoding = "decode",
              case_handling = "keep")
#> [1] "http://münich.com/a b"

u <- "https://user:pass@www.blog.example.co.uk/path/to/page?a=1&b=2#frag"
get_scheme(u)
get_host(u)
get_subdomain(u)
get_domain(u)
get_tld(u)
get_path(u)
get_query(u)
get_query(u, format = "list")
get_fragment(u)
get_port(u)
get_user(u)
get_password(u)
get_userinfo(u)
get_parse_status(c(u, "mailto:test@example.com"))

canonical_join() matches on one canonicalized key per URL and is the
preferred option for large datasets:

A <- data.frame(URL = c("http://Example.com/Page", "http://example.com/Other"),
                ValA = 1:2, stringsAsFactors = FALSE)
B <- data.frame(URL = c("https://www.example.com/Page/", "http://example.com/Miss"),
                ValB = c("x", "y"), stringsAsFactors = FALSE)

canonical_join(
  A, B,
  protocol_handling = "strip",
  www_handling = "strip",
  case_handling = "lower_host",
  trailing_slash_handling = "strip"
)

rurl memoizes parse/domain/TLD work to speed repeated operations over
large URL vectors. Inspect, clear, and configure the caches:
rurl_cache_info() # entries / enabled / max per cache
rurl_clear_caches() # free memory in a long-running session
rurl_cache_config(max_full_parse = 1e5) # bound the full-parse cache
rurl_cache_config(domain = FALSE) # disable a cache entirely

The full_parse cache is unbounded by default (max_full_parse = Inf);
set a bound to cap its peak memory. The domain and tld caches grow
with the number of unique hosts and can be disabled for workloads with
very many of them.
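
The trade-off between an unbounded and a bounded cache can be sketched in base R. This is only an illustration of the idea: make_memoized() is a hypothetical helper, and the policy used once the bound is reached (stop adding entries) is an assumption of this sketch, not a description of what rurl does internally.

```r
# Hypothetical bounded memoizer for illustration; rurl's caches may differ.
make_memoized <- function(fn, max_entries = Inf) {
  cache <- new.env(parent = emptyenv())
  size <- function() length(ls(cache, all.names = TRUE))
  list(
    call = function(key) {
      # Keys are assumed to be non-empty strings.
      if (exists(key, envir = cache, inherits = FALSE)) {
        return(get(key, envir = cache, inherits = FALSE))
      }
      value <- fn(key)
      # Assumed policy: store new results only while below the bound.
      if (size() < max_entries) assign(key, value, envir = cache)
      value
    },
    info = function() list(entries = size(), max = max_entries),
    clear = function() rm(list = ls(cache, all.names = TRUE), envir = cache)
  )
}

calls <- 0
counted_upper <- function(x) {
  calls <<- calls + 1  # count how often the underlying function really runs
  toupper(x)
}

memo <- make_memoized(counted_upper, max_entries = 2)
out <- vapply(c("a", "b", "a", "c"), memo$call, character(1), USE.NAMES = FALSE)

calls                               # 3: the second "a" was served from the cache
entries_before <- memo$info()$entries  # 2: "c" was computed but not stored
memo$clear()
```

With max_entries = Inf every distinct key is retained, which is fastest for repeated lookups but lets memory grow with the number of unique inputs.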
Domain and TLD extraction is delegated to the
pslr package, which implements the
Public Suffix List (PSL) with correct handling of
wildcard (*.) and exception (!) rules and IDN hosts. rurl maps its
source argument ("all", "icann", "private") onto the corresponding
pslr section and always returns domains/TLDs in Unicode form.
pslr ships its own bundled copy of the list and can refresh it via
pslr::psl_refresh(); see the pslr documentation for details.
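
For readers unfamiliar with how wildcard and exception rules interact, the following base-R sketch applies the PSL matching rules to a toy rule set. It is not pslr's or rurl's implementation: public_suffix(), registrable_domain(), and rule_matches() are hypothetical names, the rule list is a five-entry excerpt, and IDN handling is left out.

```r
# Toy excerpt for illustration; the real list has thousands of entries.
rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# A rule matches when its labels equal the host's rightmost labels,
# with "*" standing in for any single label.
rule_matches <- function(rule_labels, host_labels) {
  n <- length(rule_labels)
  if (n > length(host_labels)) return(FALSE)
  tail_labels <- utils::tail(host_labels, n)
  all(rule_labels == "*" | rule_labels == tail_labels)
}

public_suffix <- function(host, rules) {
  host_labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  is_exception <- startsWith(rules, "!")
  rule_labels <- strsplit(sub("^!", "", rules), ".", fixed = TRUE)
  hits <- vapply(rule_labels, rule_matches, logical(1), host_labels = host_labels)

  if (any(hits & is_exception)) {
    # An exception rule wins; the suffix is the rule minus its leftmost label.
    n <- length(rule_labels[[which(hits & is_exception)[1]]]) - 1
  } else if (any(hits)) {
    # Otherwise the matching rule with the most labels wins.
    n <- max(lengths(rule_labels[hits]))
  } else {
    n <- 1  # implicit default rule "*": the last label is the suffix
  }
  paste(utils::tail(host_labels, n), collapse = ".")
}

# Registrable domain = public suffix plus one more label to the left.
registrable_domain <- function(host, rules) {
  host_labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(strsplit(public_suffix(host, rules), ".", fixed = TRUE)[[1]])
  if (length(host_labels) <= n) return(NA_character_)  # host is itself a suffix
  paste(utils::tail(host_labels, n + 1), collapse = ".")
}

public_suffix("www.blog.example.co.uk", rules)       # "co.uk"
registrable_domain("www.blog.example.co.uk", rules)  # "example.co.uk"
public_suffix("foo.bar.ck", rules)                   # "bar.ck" via the wildcard
registrable_domain("a.www.ck", rules)                # "www.ck" via the exception
```

The exception case is the one naive suffix matching gets wrong: without !www.ck, the wildcard would treat www.ck as a public suffix rather than a registrable domain.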
MIT © 2025 Bart Turczynski