The link checker code in this post is also available in this GitHub Gist.
Previously on link checking…
We discussed the link rot issue before. We also built a simple link checker for blogdown projects based on the XML sitemap generated by Hugo.
In reality, I maintain many more bookdown projects and Quarto projects than blogdown projects. Searching GitHub for site: bookdown::bookdown_site also shows that there are 10,000+ bookdown projects hosted on GitHub alone.
Can we build a general-purpose link checker to check them all?
Conceptually, one can make the link checker a lot more generic by checking the collection of source Markdown-like files directly, for example, the .Rmd, .md, and .qmd files in bookdown, blogdown, or Quarto projects.
To do this, I still prefer leveraging the awesome urlchecker package because it marks the document with precise positions of the problematic links and explanations from rigorous rules that satisfy high standards.
✖ Error: vignettes/tlf-overview.Rmd:20:43 Error: Failed to connect to bitbucket.cdisc.org port 443 after 8572 ms: Network is unreachable
[pilot project following ICH E3 guidance](https://bitbucket.cdisc.org/projects/CED/repos/sdtm-adam-pilot-project/browse).
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
✖ Error: vignettes/slides⧸workshop-slides.Rmd:118:34 404: Not Found
[Section 1.1 of R Packages book](https://r-pkgs.org/intro.html#intro-phil) and quote here.
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
! Warning: vignettes/tlf-overview.Rmd:101:44 Moved
RStudio provided outstanding [cheatsheets](https://www.rstudio.com/resources/cheatsheets/)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
https://posit.co/resources/cheatsheets/
✖ Error: vignettes/tlf-overview.Rmd:98:28 Error: CRAN URL not in canonical form
[the tidy tools manifesto](https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Requirements
We face a few unique challenges when using urlchecker to check the source files. Most importantly, urlchecker is built almost exclusively for checking R packages, and it can only check other types of projects with modifications. Therefore, to check a list of Markdown-like files, a simple workaround is to check them as “vignettes” in a fake, skeleton R package. This results in the following technical considerations.
① Hallmarks of vignette
We should figure out the minimal requirements (e.g., directory structure, metadata) for the files copied into the fake package—then urlchecker can recognize them as vignettes and perform the check.
- All files should be copied into a single, flat vignettes/ directory under the fake package without creating any subdirectory structures.
- To get the metadata requirements, we need to know how the package vignettes are located and identified in urlchecker.
② File type support
We should cover most file types containing URL content, including .Rmd, .md, and .qmd files.
Note that the .ipynb files in (a small number of) Quarto projects are out of scope here for simplicity’s sake. However, one can convert them to Markdown first with knitr::pandoc() easily.
③ File name mapping
The source R Markdown or Quarto project files can be stored in different subdirectories. They may also share identical file names (base name, base name + extension).
We aim to create an informative and predictable mapping for these file names (and potentially, their subdirectory path) in the destination directory for urlchecker to give good hints on which source files have broken links.
④ File name conflicts
For everything above, a common goal is to make minimal assumptions about the source file name and directory structure patterns and avoid duplicated file names in the destination directory.
Implementation
A “flatten copy” of the source files is a common enough task that it is worth writing a reusable function for it.
- I renamed the .qmd and .md files by adding a .Rmd extension so that they can be captured as vignettes.
- I used a trick to replace the forward slashes in the path with the Unicode character U+29F8: ⧸ (big solidus) so that they are visually similar to the directory separator / while still allowed in file names.
This approach ensures an informative, non-conflicting, one-to-one mapping.
Edge cases may exist, say, if you have example.md.Rmd and example.md under the same directory, but I would consider them rare.
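To make that edge case concrete, here is a minimal sketch of how such a collision could be detected before copying. The file names are made up for illustration, and the renaming rule is a simplified restatement of the scheme described above, not the function used in this post.

```r
# Two hypothetical source files that sit in the same directory
src <- c("example.md", "example.md.Rmd")

# Apply the renaming rule: anything that is not .Rmd gets an extra .Rmd
dst <- ifelse(grepl("\\.Rmd$", src), src, paste0(src, ".Rmd"))

# Destination names that appear more than once would collide
collisions <- dst[duplicated(dst)]
collisions
```

Because file.copy() does not overwrite existing files by default, the second file in a colliding pair would be skipped rather than copied, so a check like this could be used to warn about it up front.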
Overall, the file name mapping scheme looks like this.
/source-project/           →    /tempdir/pkg/vignettes/
├── README.md              →    ├── README.md.Rmd
├── ch01.Rmd               →    ├── ch01.Rmd
├── ch02.qmd               →    ├── ch02.qmd.Rmd
├── dir1                        │
│   └── example.Rmd        →    ├── dir1⧸example.Rmd
└── dir2                        │
    └── dir3                    │
        └── example.Rmd    →    └── dir2⧸dir3⧸example.Rmd
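The mapping can also be written as a pair of small pure functions, which makes it easy to translate a file name reported by urlchecker back to the original source path. This is only a sketch: flat_name() and source_name() are hypothetical helper names introduced here for illustration, and they operate on relative paths as plain strings without touching the file system.

```r
# Hypothetical helper: relative source path -> flat vignette file name
flat_name <- function(path) {
  # Add an extra .Rmd extension to anything that is not already .Rmd
  out <- ifelse(grepl("\\.Rmd$", path), path, paste0(path, ".Rmd"))
  # Swap the directory separator for U+29F8 (big solidus)
  gsub("/", "\u29F8", out, fixed = TRUE)
}

# Hypothetical helper: flat vignette file name -> relative source path
source_name <- function(name) {
  # Drop the extra .Rmd only when it follows .md or .qmd
  out <- sub("(\\.q?md)\\.Rmd$", "\\1", name)
  gsub("\u29F8", "/", out, fixed = TRUE)
}

paths <- c("README.md", "ch01.Rmd", "ch02.qmd", "dir2/dir3/example.Rmd")
flat <- flat_name(paths)
flat

# Round trip back to the original relative paths
source_name(flat)
```

The inverse is ambiguous for the example.md.Rmd edge case mentioned earlier, which is one more reason to treat that case as unsupported.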
#' Flatten copy
#'
#' @param from Source directory path.
#' @param to Destination directory path.
#'
#' @return Destination directory path.
#'
#' @details
#' Copy all `.Rmd`, `.qmd`, and `.md` files from source to destination,
#' rename the `.qmd` and `.md` files with an additional `.Rmd` extension,
#' and get a flat destination structure with path-preserving file names.
flatten_copy <- function(from, to) {
  # List paths relative to `from` so that destination names
  # do not include the source directory prefix
  rmd <- list.files(from, pattern = "\\.Rmd$", recursive = TRUE)
  xmd <- list.files(from, pattern = "\\.qmd$|\\.md$", recursive = TRUE)

  src <- file.path(from, c(rmd, xmd))
  dst <- c(rmd, paste0(xmd, ".Rmd"))

  # Replace the forward slash in path with Unicode big solidus
  dst <- gsub("/", replacement = "\u29F8", x = dst)

  file.copy(src, to = file.path(to, dst))

  invisible(to)
}
OK, now it finally comes to the minimal package structure. If we look into urlchecker, it calls tools::pkgVignettes()$docs to locate the vignettes. Inspecting the relevant logic in that function further shows that two core criteria are used to identify .Rmd files as vignettes:
- the VignetteBuilder field in the DESCRIPTION file;
- the VignetteEngine metadata entry in each .Rmd file.
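The two criteria can be checked in isolation with a throwaway skeleton package. The sketch below assumes that the knitr package is installed (it registers the knitr::rmarkdown engine); the package directory and file names are made up for illustration.

```r
# Build a throwaway package skeleton (all file names here are made up)
pkg <- tempfile("fakepkg")
dir.create(file.path(pkg, "vignettes"), recursive = TRUE)

# Criterion 1: the VignetteBuilder field in DESCRIPTION
writeLines("VignetteBuilder: knitr", file.path(pkg, "DESCRIPTION"))

link <- "See [R](https://www.r-project.org/)."
engine <- c("---", "vignette: >", "  %\\VignetteEngine{knitr::rmarkdown}", "---")

# Criterion 2: one file declares the engine, the other does not
writeLines(c(engine, "", link), file.path(pkg, "vignettes", "with-engine.Rmd"))
writeLines(link, file.path(pkg, "vignettes", "without-engine.Rmd"))

# Ask the same function urlchecker relies on which files count as vignettes
docs <- tools::pkgVignettes(dir = pkg)$docs
basename(docs)
```

If the criteria work as described, only with-engine.Rmd should be listed, which is why every copied file needs the VignetteEngine entry.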
Then the logic to construct the package becomes natural:
#' Check URLs in an R Markdown or Quarto project
#'
#' @param input Path to the project directory.
#'
#' @return URL checking results from `urlchecker::url_check()`
#' for all `.Rmd`, `.qmd`, and `.md` files in the project.
#'
#' @details
#' The `tools::pkgVignettes()$docs` call in urlchecker requires
#' two core criteria (`VignetteBuilder` and `VignetteEngine`)
#' to recognize `.Rmd` files as package vignettes.
check_url <- function(input = ".") {
  # Create a source package directory
  pkg <- tempfile()
  dir.create(pkg)

  # Flatten copy relevant files
  vig <- file.path(pkg, "vignettes")
  dir.create(vig)
  flatten_copy(input, vig)

  # Create a minimal DESCRIPTION file
  write("VignetteBuilder: knitr", file = file.path(pkg, "DESCRIPTION"))

  # Make the copied files look like vignettes.
  # The leading newline keeps the appended block off the last source line
  # when a file does not end with a newline.
  lapply(
    list.files(vig, full.names = TRUE),
    function(x) {
      write(
        "\n---\nvignette: >\n  %\\VignetteEngine{knitr::rmarkdown}\n---",
        file = x, append = TRUE
      )
    }
  )

  urlchecker::url_check(pkg)
}
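One property of appending the metadata block, rather than prepending it, is that the original lines keep their positions, so the line and column numbers in the report still point at the right place in the source file. A small sketch with a made-up file illustrates this; the leading newline in the appended text is there so the block never shares a line with the last line of the source.

```r
# A made-up source file with a link on line 3
f <- tempfile(fileext = ".Rmd")
writeLines(c("# Title", "", "See [R](https://www.r-project.org/)."), f)
before <- readLines(f)

# Append the vignette metadata at the end of the file
write(
  "\n---\nvignette: >\n  %\\VignetteEngine{knitr::rmarkdown}\n---",
  file = f, append = TRUE
)
after <- readLines(f)

# The original lines are untouched and still come first
unchanged <- identical(after[seq_along(before)], before)
unchanged
```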
Dogfooding
With the help of the URL checker, I found and corrected all the broken URLs by running check_url() for our bookdown project R for Clinical Study Reports and Submission.
It feels so good to see:
fetching [ 126 / 126 ]
✔ All URLs are correct!