Parsing Human-Readable Text Data with Readability.js and R

Nan Xiao August 2, 2022 8 min read

The R and JavaScript code to reproduce the results in this post is available from https://github.com/nanxstats/r-readability-parser.

Photo by Nick Hillier.

Update (2023-09-04): Browserify support was removed in jsdom v22.0.0. This post has been updated to recommend installing jsdom v20.0.0 to ensure the code examples work as intended. The original versions of JS libraries in the GitHub repo remain unchanged, as they continue to function correctly.

Readability.js

Maybe you have used tools like rvest to harvest text data from web pages. Naturally, this often requires considerable manual effort up front to understand the structure of the target website.

The picture looks quite different when we think at web scale. To parse the content of many more sites and many more types of pages, we need a tool adaptive enough to extract the most relevant text instead of relying purely on manually crafted logic. We can tolerate missing some useful text and picking up some irrelevant text, because such noise probably won’t matter once the text data we collect is big enough.

Fortunately, Readability.js offers a tool for parsing human-readable text from any web page. It was built for the Reader View feature in Firefox but is also usable as an open source, standalone JavaScript library.

In this post, I will create an R wrapper for Readability.js using the R package V8.

Packing JS dependencies

Before we write the wrapper, the first step is identifying and packing the JavaScript dependencies to run in the V8 engine. The three key dependencies are @mozilla/readability, jsdom, and dompurify.

Following the V8 vignette on using NPM packages, we bundle them with browserify as follows. The commands assume macOS with Homebrew for installing Node.js.

brew install node
npm install -g browserify

Pack Readability.js:

npm install @mozilla/readability
echo "window.Readability = require('@mozilla/readability');" > in.js
browserify in.js -o readability.js

Pack jsdom for converting HTML into operable DOM document objects:

npm install jsdom@20.0.0
echo "window.jsdom = require('jsdom');" > in.js
browserify in.js -o jsdom.js

Pack DOMPurify, which the Readability.js security recommendations suggest for sanitizing output to avoid script injection:

npm install dompurify
echo "window.dompurify = require('dompurify');" > in.js
browserify in.js -o dompurify.js

Writing an R binding

We will write some wrapper JavaScript functions to implement the workflow that uses all three JS libraries above.

function readabilityParser(html, url, candidates, threshold) {
  // Build a DOM document with jsdom, then parse it with Readability.js
  let doc = new jsdom.JSDOM(
    html,
    { url: url }
  );
  let reader = new Readability.Readability(
    doc.window.document,
    { nbTopCandidates: candidates, charThreshold: threshold }
  );
  let res = reader.parse();

  // Sanitize results to avoid script injection
  const purifyWindow = new jsdom.JSDOM('').window;
  const DOMPurify = dompurify(purifyWindow);

  let clean = DOMPurify.sanitize(res.content);
  res.content = clean;

  return res;
}

function isReadable(html, min_content_length, min_score) {
  let doc = new jsdom.JSDOM(html);
  return Readability.isProbablyReaderable(
    doc.window.document,
    { minContentLength: min_content_length, minScore: min_score }
  );
}
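
One edge case is worth guarding against. To my knowledge, Readability's parse() returns null when it cannot identify any article content, and in that case reading res.content in readabilityParser() would throw. Below is a minimal sketch of a guard. Note that sanitizeArticle and stripScripts are hypothetical names, not part of the original wrapper, and the sanitizer is passed in as a function so the snippet runs without jsdom or DOMPurify.

```javascript
// Hypothetical helper (not in the original wrapper): sanitize a Readability
// result only when parsing actually produced an article.
function sanitizeArticle(res, sanitize) {
  // Assumption: parse() yields null when no article content was found.
  if (res === null || res === undefined) {
    return null;
  }
  // Copy the result so the caller's object is left untouched.
  return Object.assign({}, res, { content: sanitize(res.content) });
}

// Stand-in sanitizer for illustration only; inside the wrapper this role
// would be played by (html) => DOMPurify.sanitize(html).
const stripScripts = (html) => html.replace(/<script[\s\S]*?<\/script>/gi, "");

const parsed = { title: "Demo", content: "<p>Hi</p><script>alert(1)</script>" };
console.log(sanitizeArticle(parsed, stripScripts).content);
console.log(sanitizeArticle(null, stripScripts));
```

With a guard like this, a page that Readability cannot parse should surface as an empty result on the R side rather than as a JavaScript error, so the caller can decide how to handle it.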

The R wrapper is quite straightforward if you follow the V8 introduction vignette. As the vignette suggests, the interactive JavaScript console via ct$console() is both fun and useful when debugging.

readability <- function(html, url, candidates = 5L, threshold = 500L) {
  ct <- V8::v8(global = "window")

  ct$eval("function setTimeout(){}")
  ct$eval("function clearInterval(){}")
  ct$source("js/encoding.min.js")
  ct$source("js/jsdom.js")
  ct$source("js/dompurify.js")
  ct$source("js/readability.js")
  ct$eval(readLines("js/readability-parser.js"))

  # ct$get(V8::JS("Object.keys(window)"))
  ct$call("readabilityParser", html, url, candidates, threshold)
}

is_readable <- function(html, min_content_length = 140, min_score = 20) {
  ct <- V8::v8(global = "window")

  ct$eval("function setTimeout(){}")
  ct$eval("function clearInterval(){}")
  ct$source("js/encoding.min.js")
  ct$source("js/jsdom.js")
  ct$source("js/readability.js")
  ct$eval(readLines("js/readability-parser.js"))

  # ct$get(V8::JS("Object.keys(window)"))
  ct$call("isReadable", html, min_content_length, min_score)
}

Example

Let’s parse a recipe page (pasta with caramelized peppers, anchovies, and ricotta) from NYT Cooking.

Check whether the page is likely to be suitable for readability parsing:

url <- "https://cooking.nytimes.com/recipes/1021246-pasta-with-caramelized-peppers-anchovies-and-ricotta"

html <- url |>
  rvest::read_html() |>
  as.character()

html |> is_readable()
#> [1] TRUE

We can get the title and the clean, plain text corpus, usable for downstream text data modeling:

lst <- html |> readability(url = url)
cat(lst$title)
#> Pasta With Caramelized Peppers, Anchovies and Ricotta Recipe
lst$textContent |>
  gsub("\\n", " ", x = _, perl = TRUE) |>
  gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x = _, perl = TRUE) |>
  stringr::str_wrap(width = 74) |>
  cat()
#> Time 30 minutes Rating 4(1,237) Notes Read community notes Caramelized
#> slivers of soft bell peppers and whole cloves of garlic serve as the
#> sweet vegetable-based sauce for this summery pasta. The ricotta gives
#> everything creaminess and body, while herbs and scallions lend freshness,
#> and anchovies depth. If you have an open bottle of wine on hand, you
#> can add a splash to deglaze the tasty browned bits on the bottom of
#> the pan. But don’t bother opening something new: A little water or
#> dry vermouth does the trick nearly as well. Featured in: What’s Better
#> Than Caramelized Onions? Caramelized Peppers Ingredients Yield:4 to 6
#> servings Kosher salt 12ounces short pasta, such as radiatori, fusilli or
#> campanelle 3tablespoons extra-virgin olive oil, plus more for drizzling
#> 8 to 10anchovy fillets, chopped, or use a dash or two of soy sauce
#> 2large rosemary sprigs 6garlic cloves, smashed and peeled Large pinch
#> of red-pepper flakes 2sweet bell peppers (red, orange or yellow), thinly
#> sliced 2tablespoons dry red, white or rosé wine, or use dry vermouth or
#> water 1tablespoon unsalted butter Fresh lemon juice ½cup fresh ricotta
#> 2scallions, thinly sliced, or ¼ cup sliced red onion Freshly ground black
#> pepper ¼cup finely chopped fresh mint, basil or thyme, plus torn mint
#> or basil leaves and tender sprigs, for garnish Freshly grated Parmesan
#> (optional) Ingredient Substitution Guide Nutritional analysis per serving
#> (6 servings) 364 calories; 13 grams fat; 4 grams saturated fat; 0 grams
#> trans fat; 7 grams monounsaturated fat; 1 gram polyunsaturated fat; 48
#> grams carbohydrates; 4 grams dietary fiber; 3 grams sugars; 12 grams
#> protein; 356 milligrams sodium Note: The information shown is Edamam’s
#> estimate based on available ingredients and preparation. It should not be
#> considered a substitute for a professional nutritionist’s advice. Powered
#> by Preparation Bring a large pot of heavily salted water to a boil. Add
#> the pasta and cook, according to package instructions, until the pasta is
#> just al dente. As pasta cooks, heat a large sauté pan over medium-high,
#> and add 3 tablespoons olive oil. When the oil is hot, add the anchovies
#> and rosemary, and sauté until the anchovies start to dissolve, about 1
#> minute. Add the garlic and red-pepper flakes, and sauté until the garlic
#> turns pale golden in spots, about 1 to 2 minutes. Add the bell peppers
#> and a large pinch of salt to the pan, and sauté until the bell peppers are
#> very soft and well caramelized, 10 to 15 minutes, lowering the heat if the
#> peppers start becoming too dark. Add the wine (or water) and the butter,
#> and sauté, scraping up the browned bits on the bottom of the pan. Taste
#> and season with lemon juice and more salt as needed. Put ¼ cup ricotta
#> and the scallions in a large serving bowl, and season aggressively with
#> black pepper. Use a coffee mug or measuring cup to scoop about ½ cup pasta
#> water from the pot. Drain the pasta, then add it to the bowl with the
#> ricotta and scallions, tossing well. Add the bell pepper mixture and the
#> herbs, and toss well, adding a splash or two of pasta water if the mixture
#> looks dry. Taste and season with more salt if needed. Spoon pasta into
#> bowls, and top with dollops of the remaining ¼ cup ricotta, a drizzle of
#> oil and a little Parmesan, if you like. Shower torn herb leaves over all.
#> Ratings Have you cooked this? or to mark this recipe as cooked. Private
#> Notes Leave a Private Note on this recipe and see it here. Cooking Notes
#> There aren’t any notes yet. Be the first to leave one.There aren’t any
#> notes yet. Be the first to leave one.Private notes are only visible to
#> you. Trending on Cooking Cooking Guides Cooking Guide Basic Knife Skills
#> By Julia Moskin Cooking Guide How to Make Soufflé By Melissa Clark Cooking
#> Guide How to Make Rice By Tejal Rao Cooking Guide How to Make Stuffing
#> By Melissa Clark Cooking Guide How to Make Cooking Substitutions By Alexa
#> Weibel Cooking Guide How to Make Ice Cream By Melissa Clark Cooking Guide
#> How to Make Yogurt By Melissa Clark Cooking Guide How to Cook Potatoes By
#> Julia Moskin
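
The whitespace cleanup done with gsub() above could also happen on the JavaScript side before the result is returned to R. The sketch below shows one way to do that. The function name normalizeWhitespace is hypothetical, and unlike the R version it turns every whitespace run (including tabs) into a single space.

```javascript
// Hypothetical helper: collapse every run of whitespace (spaces, tabs,
// newlines) into one space and trim both ends of the string.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Example input resembling raw textContent with stray newlines and indents.
const raw = "\n   Time 30 minutes\n\n   Rating 4(1,237)\t Notes  \n";
console.log(normalizeWhitespace(raw));
```

Inside readabilityParser(), this could be applied to res.textContent before returning, assuming the parse result is not null.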

We also got the clean HTML that preserves more structural information. We can process it further, for example, using xml2 or pandoc.

lst$content |>
  htmltools::HTML() |>
  htmltools::browsable()

You can preview the clean HTML here.

Common issues

I encountered and resolved two common issues when using the JS libraries.

TextEncoder is not defined

I used the hints here and saved text-encoding explicitly as another dependency. Doing this will eliminate the error ReferenceError: TextEncoder is not defined when sourcing jsdom.js with ct$source().
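
When tracking down errors of this kind, it can help to check which expected globals are missing before sourcing the bundles, in the spirit of the commented-out Object.keys(window) call in the R wrapper. Here is a small sketch. The helper name missingGlobals is hypothetical, and a plain object stands in for the V8 context's window so the snippet runs anywhere.

```javascript
// Hypothetical helper: return the names that are undefined on a scope
// object (for example, the `window` global of a V8 context).
function missingGlobals(scope, names) {
  return names.filter((name) => typeof scope[name] === "undefined");
}

// Names behind the two errors discussed in this section.
const requiredNames = ["TextEncoder", "setTimeout", "clearInterval"];

// A bare object standing in for a context where only TextEncoder exists.
const fakeWindow = { TextEncoder: function () {} };
console.log(missingGlobals(fakeWindow, requiredNames));
```

In the V8 context itself, the same function could be called with window as the scope once it has been defined via ct$eval().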

setTimeout/clearInterval is not defined

Timer functions such as setTimeout() are web APIs rather than part of the JavaScript language itself, so a bare V8 context does not provide them. I followed the suggestions here and defined stubs for setTimeout() and clearInterval() to avoid errors like ReferenceError: setTimeout is not defined when running jsdom.
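
A slightly more defensive variant of the stubs defines them only when they are absent, so a real implementation is never overwritten. This is a sketch under assumptions: defineTimerStubs is a hypothetical name, and a plain object again stands in for the V8 global.

```javascript
// Hypothetical helper: add no-op timer functions to a scope object, but only
// for names that are not already defined there as functions.
function defineTimerStubs(scope) {
  const added = [];
  for (const name of ["setTimeout", "clearInterval"]) {
    if (typeof scope[name] !== "function") {
      scope[name] = function () {}; // no-op: callbacks are never scheduled
      added.push(name);
    }
  }
  return added; // names of the stubs that were actually installed
}

// Demonstrate on a plain object that already has its own setTimeout.
const demoScope = { setTimeout: () => 0 };
console.log(defineTimerStubs(demoScope));
```

Because the stubs do nothing, any code that relies on a timer actually firing will silently not run. That appears acceptable here since the parsing workflow in this post is synchronous.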