rurl

rurl is a small, pipe-friendly, and vectorized R package that helps you construct, parse, and clean URLs from various components. It is designed to make URL manipulation and HTTP endpoint generation readable, composable, and easy to integrate into modern R workflows.

It includes helpers to:

- Get cleaned URLs with fine-grained control over protocols, 'www' prefixes, letter casing, and trailing slashes.
- Extract domains, paths, schemes, hosts, and top-level domains (TLDs).
- Normalize, strip, or enforce specific protocols.
- Handle 'www' prefixes in hostnames.
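
Because the helpers are vectorized and pipe-friendly, they work on whole character vectors and drop straight into pipelines. A quick, illustrative sketch (outputs assume the package defaults described below):

library(rurl)

urls <- c("Http://Www.Example.Com/A/", "HTTPS://Example.ORG/B")

# One call cleans the whole vector
get_clean_url(urls)
#> [1] "http://www.example.com/a/" "https://example.org/b"

# Works with the native pipe, too
urls |> get_domain()
#> [1] "example.com" "example.org"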

Installation

# Install the development version from GitHub
# install.packages("devtools")
devtools::install_github("bart-turczynski/rurl")

Core Functionality

The primary workhorse of the package is safe_parse_url(). It parses a URL, applies the requested transformations, and returns a detailed list of URL components along with a cleaned URL string.

library(rurl)

# Example of detailed parsing and transformation
parsed_details <- safe_parse_url(
  "Http://Www.Example.Com/Some/Path/?Query=1#Frag",
  protocol_handling = "https",      # Force https
  www_handling = "strip",         # Remove www
  case_handling = "lower",        # Convert to lowercase (the default)
  trailing_slash_handling = "strip" # Remove trailing slash
)

# The cleaned URL (scheme, host, path only by default)
print(parsed_details$clean_url)
#> [1] "https://example.com/some/path"

# Full list of parsed and derived components
# print(parsed_details)
# Output would include:
# $original_url: "Http://Www.Example.Com/Some/Path/?Query=1#Frag"
# $scheme: "https"
# $host: "example.com"
# $port: NULL (or parsed port if present and not stripped)
# $path: "/some/path" (after trailing slash handling)
# $query: "Query=1"
# $fragment: "Frag"
# $domain: "example.com"
# $tld: "com"
# $is_ip_host: FALSE
# $clean_url: "https://example.com/some/path"
# $parse_status: "ok"

Most other exported functions are convenient wrappers around safe_parse_url() to extract specific parts of a URL or just the clean_url string.
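
The same pattern is easy to extend in your own code. A hypothetical sketch (my_get_fragment is not an exported function; the component names match the parsed list shown above):

# Hypothetical wrapper: pull a single component out of the
# safe_parse_url() result
my_get_fragment <- function(url, ...) {
  safe_parse_url(url, ...)$fragment
}

my_get_fragment("http://example.com/page#Section")
#> [1] "Section"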

Examples of Helper Functions

library(rurl)

# Get cleaned URL with specific handling
# Note: case_handling defaults to "lower", trailing_slash_handling to "none"

get_clean_url("Http://Example.Com/MyPath/")
#> [1] "http://example.com/mypath/"

get_clean_url("Http://Example.Com/MyPath/",
              case_handling = "keep",
              trailing_slash_handling = "strip")
#> [1] "Http://example.com/MyPath"

get_clean_url("ftp://Sub.Example.ORG/anotherPath",
              protocol_handling = "strip", # Removes ftp://
              www_handling = "keep",       # Ensures www.
              case_handling = "upper",     # Converts to uppercase
              trailing_slash_handling = "keep") # Ensures trailing slash
#> [1] "WWW.EXAMPLE.ORG/ANOTHERPATH/"

get_clean_url("example.com:8080/path", trailing_slash_handling = "keep")
#> [1] "http://example.com/path/" # Port is not part of clean_url by default

# Extracting specific components
get_domain("https://sub.example.co.uk/page")
#> [1] "example.co.uk"

get_scheme("Example.com/Test", protocol_handling = "https")
#> [1] "https"

get_scheme("Example.com/Test", protocol_handling = "none") # No scheme forced or kept
#> [1] NA

# Host is lowercased due to default case_handling in the underlying safe_parse_url call
get_host("Http://User:Pass@MyHost.Com:8080/SomeWhere")
#> [1] "myhost.com"

# Path is lowercased; trailing slash from input is preserved with trailing_slash_handling = "none" (default)
get_path("HTTP://EXAMPLE.NET/A/B/C/?p=1")
#> [1] "/a/b/c/"

get_tld("www.sub.example.co.uk")
#> [1] "co.uk"

get_parse_status("mailto:test@example.com")
#> [1] "error"
get_parse_status("http://example.com")
#> [1] "ok"

Handling Subdomain Levels

Functions like get_host() and get_clean_url() also accept a subdomain_levels_to_keep argument, which controls how many subdomain levels are kept in the host after www_handling is applied:

- NULL (the default) keeps all subdomain levels (beyond www handling).
- 0 strips all subdomains (e.g., one.two.example.com becomes example.com).
- N > 0 keeps N subdomain levels (e.g., one.two.example.com with N = 1 becomes two.example.com).

# With www_handling = "strip" and subdomain_levels_to_keep = 1:
get_host("http://www.three.two.one.example.com", www_handling = "strip", subdomain_levels_to_keep = 1)
#> [1] "one.example.com"

# With subdomain_levels_to_keep = 0 (the default www_handling = "none"
# leaves the www prefix in place):
get_clean_url("http://www.deep.sub.example.com/path", subdomain_levels_to_keep = 0)
#> [1] "http://www.example.com/path"

Data Sources

This package includes a processed copy of the Public Suffix List (PSL), which is used to extract registered domains and top-level domains. The list is updated manually via data-raw/update_psl.R; it is never downloaded at runtime. The original list is maintained by Mozilla and hosted at https://publicsuffix.org/list/public_suffix_list.dat. The data is included in accordance with the Mozilla Public License 2.0; see inst/LICENSE.psl for the full license text.
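
Because domain extraction is PSL-backed, multi-label public suffixes are recognized correctly. A brief, illustrative sketch (outputs consistent with the helper examples above):

get_domain(c("https://a.example.com/", "https://b.example.co.uk/"))
#> [1] "example.com"    "example.co.uk"

get_tld(c("a.example.com", "b.example.co.uk"))
#> [1] "com"   "co.uk"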

To refresh it:

# From the root of the package:
source("data-raw/update_psl.R")

This regenerates the internal sysdata.rda file used for domain parsing.

License

MIT © 2025 Bart Turczynski