I have been blogging with Hugo/blogdown for a while. One housekeeping task I have always wanted to automate with R is scanning the entire website to ensure that all the links are still working. It is essential for maintaining an enjoyable reading experience without archiving too many external links.
Conceptually, the requirement for a generic broken link checker is quite simple:
- Get the links to all pages on the site.
- Scrape and parse the pages to get the links contained.
- Check the link status.
However, getting the links to all pages may involve recursively scraping and parsing the site. That can make the program's behavior unpredictable and requires many extra lines of exception handling.
It is dramatically easier if we consider a more specific checker for Hugo and blogdown websites with certain configurations. For example, the built-in sitemap in Hugo allows us to discover the links to all internal pages with one simple step of XML parsing.
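As a minimal sketch of that XML parsing step, the snippet below reads a sitemap and pulls out every `<loc>` entry with xml2. The inline `sitemap_xml` string and its example.com URLs are made-up stand-ins for a real `sitemap.xml`, and `sitemap_pages()` is a hypothetical helper name, not something from Hugo or blogdown.

```r
library(xml2)

# A tiny stand-in for the sitemap.xml that Hugo generates (URLs are made up)
sitemap_xml <- '<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/hello/</loc></url>
</urlset>'

# Hypothetical helper: return the page URLs listed in a sitemap.
# `x` can be literal XML, a file path, or a URL.
sitemap_pages <- function(x) {
  doc <- read_xml(x)
  # Drop the default sitemap namespace so plain XPath works
  xml_ns_strip(doc)
  xml_text(xml_find_all(doc, "//url/loc"))
}

pages <- sitemap_pages(sitemap_xml)
print(pages)
```

Passing the address of a live sitemap instead of the inline string should work the same way, since `read_xml()` also accepts URLs.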
We also have a decent infrastructure in R for page parsing and link checking:
- xml2 and rvest to parse and extract elements from the XML and HTML;
- urlchecker to check links in parallel, with informative feedback. It was initially built for checking the links in R packages before submitting to CRAN but can be easily repurposed.
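To illustrate the parsing half, here is a small sketch that extracts the `href` of every `<a>` tag from a page with rvest. The inline `page_html` string is a made-up example page and `page_links()` is a hypothetical helper name. Keeping only `http(s)` links is my own assumption, chosen to skip anchors such as `#top` and `mailto:` addresses.

```r
library(rvest)

# A made-up page standing in for one rendered blog post
page_html <- '<html><body>
  <a href="https://example.com/about/">About</a>
  <a href="https://www.r-project.org/">R</a>
  <a href="https://example.com/about/">About again</a>
  <a href="#top">Back to top</a>
  <a>No href here</a>
</body></html>'

# Hypothetical helper: unique absolute http(s) links found in <a> tags.
# `x` can be literal HTML, a file path, or a URL.
page_links <- function(x) {
  href <- html_attr(html_elements(read_html(x), "a"), "href")
  href <- href[!is.na(href)]
  unique(grep("^https?://", href, value = TRUE))
}

links <- page_links(page_html)
print(links)
```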
Let’s try building a minimum viable link checker with these in mind.
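The sketch below is one way the pieces could fit together. It is not necessarily the exact code behind this post. The function name `check_site_links()` and its `sitemap_url` argument are hypothetical. It also rests on two assumptions worth verifying against your installed versions: that the unexported helper `tools:::url_db()` builds the URL database from a vector of URLs and a vector of parent pages, and that `urlchecker::url_check()` accepts such a database through its `db` argument.

```r
library(xml2)
library(rvest)

# Hypothetical helper: check every absolute <a href> link found on the
# pages listed in a Hugo sitemap.
check_site_links <- function(sitemap_url) {
  # Step 1: discover internal pages from the sitemap
  sitemap <- read_xml(sitemap_url)
  xml_ns_strip(sitemap)
  pages <- xml_text(xml_find_all(sitemap, "//url/loc"))

  # Step 2: collect the absolute http(s) links on each page
  links <- lapply(pages, function(page) {
    href <- html_attr(html_elements(read_html(page), "a"), "href")
    unique(grep("^https?://", href, value = TRUE))
  })

  # Step 3: build the URL database and check it.
  # Assumption: tools:::url_db() is internal to R and may change.
  db <- tools:::url_db(
    urls = unlist(links),
    parents = rep(pages, lengths(links))
  )
  urlchecker::url_check(db = db)
}

# Usage (needs network access; the URL is a placeholder):
# check_site_links("https://example.com/sitemap.xml")
```

A page that fails to download would stop this sketch with an error. Wrapping the `read_html()` call in `tryCatch()` would be a natural next step.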
A few limitations of this simple checker:
- This checker did not handle the case where relative URLs are used for linking internal pages. It worked for me because
  - I use absolute URLs to link all pages when creating content, even if it is an internal resource;
  - I also did not enable `relativeURLs` in the Hugo configuration. Therefore, all rendered links to internal pages are absolute URLs.
- This checker did not check resources linked using a tag other than `<a>`, such as images linked with `<img>`.
- It is trivial to extend it to handle resources linked with other HTML tags and attributes.
- For external resources such as images, I have the habit of saving and serving a local copy while explicitly linking the source with
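
The first two limitations above could be relaxed roughly as follows. This is a sketch under stated assumptions, not part of the checker described in this post: the page URL and HTML are made up, and the idea is simply to collect `src` attributes from `<img>` tags alongside `href` attributes from `<a>` tags, then resolve everything against the page address with `xml2::url_absolute()`.

```r
library(xml2)

# Made-up page address and content, for illustration only
page_url <- "https://example.com/blog/hello/"
page_html <- '<html><body>
  <a href="/about/">About</a>
  <a href="../world/">World</a>
  <img src="plot.png">
</body></html>'

doc <- read_html(page_html)

# Gather link targets from both <a href> and <img src>
targets <- c(
  xml_attr(xml_find_all(doc, "//a[@href]"), "href"),
  xml_attr(xml_find_all(doc, "//img[@src]"), "src")
)

# Resolve relative targets against the URL of the page they came from
resolved <- url_absolute(targets, base = page_url)
print(resolved)
```

The resolved vector could then be checked in the same way as links that were already absolute.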