A General-Purpose Link Checker for R Markdown and Quarto Projects

Nan Xiao January 16, 2023 6 min read

The link checker code in this post is also available in this GitHub Gist.

Puffins in the rain, Iceland. Photo by Yves Alarie.

Requirements

Using urlchecker to check source files directly poses a few challenges. Most importantly, urlchecker is built almost exclusively for checking R packages, and it needs modifications before it can check other types of projects. A simple workaround for checking a list of Markdown-like files is therefore to treat them as “vignettes” in a fake, skeleton R package. This leads to the following technical considerations.

① Hallmarks of vignette

We should figure out the minimal requirements (e.g., directory structure, metadata) for the files copied into the fake package—then urlchecker can recognize them as vignettes and perform the check.

  • All files should be copied into a single, flat vignettes/ directory under the fake package without creating any subdirectory structures.

  • To get the metadata requirements, we need to know how the package vignettes are located and identified in urlchecker.

② File type support

We should cover the most common file types that contain URLs: .Rmd, .md, and .qmd files. For simplicity, the .ipynb files found in a small number of Quarto projects are out of scope here, although one can convert them to Markdown first with knitr::pandoc().
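As a quick illustration of this scope, the sketch below filters a vector of file names with a single regular expression. The file names are made up for the example; only the three extensions come from the text above.

```r
# Hypothetical file names, for illustration only
files <- c("index.qmd", "README.md", "ch01.Rmd", "analysis.ipynb", "notes.txt")

# Keep .Rmd, .qmd, and .md files; everything else is out of scope
in_scope <- grepl("\\.(Rmd|qmd|md)$", files)
files[in_scope]
```

The same pattern could be passed to list.files() through its pattern argument.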

③ File name mapping

The source R Markdown or Quarto project files can be stored in different subdirectories. They may also share identical file names (base name, base name + extension).

We aim to map these file names (and, where needed, their subdirectory paths) to informative and predictable names in the destination directory, so that the urlchecker output points clearly to the source files with broken links.

④ File name conflicts

For everything above, a common goal is to make minimal assumptions about the source file name and directory structure patterns and avoid duplicated file names in the destination directory.

Implementation

A “flatten copy” of source files is a common task, so it is worth writing a reusable function for it.

  • I renamed the .qmd and .md files by adding a .Rmd extension so that they can be captured as vignettes.
  • I used a trick: replace the forward slashes in the path with the Unicode character U+29F8 (⧸, big solidus), which looks similar to the directory separator / but is still allowed in file names.
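The two renaming rules above can be sketched on a plain character vector before any files are touched. This is only an illustration: the relative paths are hypothetical and the variable names are my own.

```r
# Hypothetical relative paths inside a source project
rel <- c(
  "README.md", "ch01.Rmd", "ch02.qmd",
  "dir1/example.Rmd", "dir2/dir3/example.Rmd"
)

# Append `.Rmd` to anything that is not already an `.Rmd` file
is_rmd <- grepl("\\.Rmd$", rel)
dst <- ifelse(is_rmd, rel, paste0(rel, ".Rmd"))

# Swap the directory separator for U+29F8 (big solidus)
dst <- gsub("/", "\u29F8", dst, fixed = TRUE)

# Show the source-to-destination mapping side by side
data.frame(source = rel, destination = dst)
```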

This approach gives an informative, non-conflicting, one-to-one mapping. Edge cases exist (for example, example.md.Rmd and example.md in the same directory would collide), but I consider them rare. Overall, the file name mapping scheme looks like this.

/source-project/         →  /tempdir/pkg/vignettes/
├── README.md            →  ├── README.md.Rmd
├── ch01.Rmd             →  ├── ch01.Rmd
├── ch02.qmd             →  ├── ch02.qmd.Rmd
├── dir1                    │
│   └── example.Rmd      →  ├── dir1⧸example.Rmd
└── dir2                    │
    └── dir3                │
        └── example.Rmd  →  └── dir2⧸dir3⧸example.Rmd

#' Flatten copy
#'
#' @param from Source directory path.
#' @param to Destination directory path.
#'
#' @return Destination directory path.
#'
#' @details
#' Copy all `.Rmd`, `.qmd`, and `.md` files from source to destination,
#' rename the `.qmd` and `.md` files with an additional `.Rmd` extension,
#' and get a flat destination structure with path-preserving file names.
flatten_copy <- function(from, to) {
  # List paths relative to `from` so that the source directory itself
  # never leaks into the destination file names
  rmd <- list.files(from, pattern = "\\.Rmd$", recursive = TRUE)
  xmd <- list.files(from, pattern = "\\.qmd$|\\.md$", recursive = TRUE)

  src <- file.path(from, c(rmd, xmd))
  dst <- c(rmd, paste0(xmd, ".Rmd"))

  # Replace the forward slash in path with Unicode big solidus
  dst <- gsub("/", replacement = "\u29F8", x = dst)

  # Existing destination files are not overwritten (file.copy() default)
  file.copy(src, to = file.path(to, dst))

  invisible(to)
}
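
The edge case mentioned earlier (example.md next to example.md.Rmd) could be detected before copying. Here is a minimal sketch, assuming the same renaming rules as flatten_copy(); the helper find_name_conflicts() is hypothetical and not part of the original code.

```r
# Hypothetical helper: map relative source paths to flat names and
# report every destination name claimed by more than one source
find_name_conflicts <- function(rel) {
  is_rmd <- grepl("\\.Rmd$", rel)
  dst <- ifelse(is_rmd, rel, paste0(rel, ".Rmd"))
  dst <- gsub("/", "\u29F8", dst, fixed = TRUE)

  # Keep only the entries whose destination name occurs more than once
  clash <- dst %in% dst[duplicated(dst)]
  # Group the conflicting sources by the destination they map to
  split(rel[clash], dst[clash])
}

# `example.md` and `example.md.Rmd` collide; `ch01.Rmd` does not
conflicts <- find_name_conflicts(c("ch01.Rmd", "example.md", "example.md.Rmd"))
conflicts
```

A caller could stop with an error when length(conflicts) > 0. Otherwise, because file.copy() does not overwrite by default, the second file of a conflicting pair would be skipped without any message.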

OK, now we finally come to the minimal package structure. Looking into urlchecker, we see that it calls tools::pkgVignettes()$docs to locate the vignettes. Inspecting the relevant logic in that function shows that two core criteria identify .Rmd files as vignettes:

  • the VignetteBuilder field in the DESCRIPTION file;
  • the VignetteEngine metadata entry in each .Rmd file.
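
To make the two criteria concrete, the sketch below builds a throwaway package skeleton and verifies both pieces of metadata with base R. The file name and body text are made up. The final step asks tools::pkgVignettes() directly and assumes knitr is installed, so it is guarded.

```r
# Build a throwaway "package" in a temporary directory
pkg <- tempfile()
dir.create(file.path(pkg, "vignettes"), recursive = TRUE)

# Criterion 1: the VignetteBuilder field in DESCRIPTION
writeLines("VignetteBuilder: knitr", file.path(pkg, "DESCRIPTION"))

# Criterion 2: a VignetteEngine entry in the file (body text is made up)
doc <- file.path(pkg, "vignettes", "example.Rmd")
writeLines(
  c(
    "See <https://example.com>.",
    "---",
    "vignette: >",
    "  %\\VignetteEngine{knitr::rmarkdown}",
    "---"
  ),
  doc
)

# Confirm both pieces of metadata are present, using base R only
builder <- unname(
  read.dcf(file.path(pkg, "DESCRIPTION"), fields = "VignetteBuilder")[1, 1]
)
has_engine <- any(grepl("%\\VignetteEngine{", readLines(doc), fixed = TRUE))

# Optionally ask tools::pkgVignettes() which files it recognizes
if (requireNamespace("knitr", quietly = TRUE)) {
  print(basename(tools::pkgVignettes(dir = pkg)$docs))
}
```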

Given these two criteria, the logic to construct the package follows naturally:

#' Check URLs in an R Markdown or Quarto project
#'
#' @param input Path to the project directory.
#'
#' @return URL checking results from `urlchecker::url_check()`
#' for all `.Rmd`, `.qmd`, and `.md` files in the project.
#'
#' @details
#' The `tools::pkgVignettes()$docs` call in urlchecker requires
#' two core criteria (`VignetteBuilder` and `VignetteEngine`)
#' to recognize `.Rmd` files as package vignettes.
check_url <- function(input = ".") {
  # Create a source package directory
  pkg <- tempfile()
  dir.create(pkg)

  # Flatten copy relevant files
  vig <- file.path(pkg, "vignettes")
  dir.create(vig)
  flatten_copy(input, vig)

  # Create a minimal DESCRIPTION file
  write("VignetteBuilder: knitr", file = file.path(pkg, "DESCRIPTION"))

  # Make the copied files look like vignettes
  lapply(
    list.files(vig, full.names = TRUE),
    function(x) {
      write(
        "---\nvignette: >\n  %\\VignetteEngine{knitr::rmarkdown}\n---",
        file = x, append = TRUE
      )
    }
  )

  urlchecker::url_check(pkg)
}

Dogfooding

With the help of this link checker, I found and fixed all the broken URLs in our bookdown project, R for Clinical Study Reports and Submission, by running check_url(). It feels so good to see:

fetching [ 126 / 126 ]
✔ All URLs are correct!