Parsing Human-Readable Text Data with Readability.js and R

Nan Xiao 2022-08-02 7 min read

The R and JavaScript code to reproduce the results in this post is available from https://github.com/nanxstats/r-readability-parser.

Photo by Nick Hillier.

Readability.js

Maybe you have used tools like rvest to harvest text data from web pages. Naturally, this often requires considerable manual effort up front to understand the structure of the target website.

The picture looks quite different when we think at the web scale. To parse the content of many more sites and many more types of pages, we need a tool adaptive enough to extract the most relevant text without relying purely on manually crafted logic. We can tolerate missing some useful text and including some irrelevant text, because such errors probably won’t matter when the collected text data is large enough.

Fortunately, Readability.js offers a tool for parsing human-readable text from any web page. It was built for the Reader View feature in Firefox but is also usable as an open source, standalone JavaScript library.

In this post, I will create an R wrapper for Readability.js using the R package V8.

Packing JS dependencies

Before we write the wrapper, the first step is identifying and packing the JavaScript dependencies to run in the V8 engine. The three key dependencies are @mozilla/readability, jsdom, and dompurify.

Following the V8 vignette “Using NPM packages in V8”, we pack them as follows. First, install Node.js (here via Homebrew on macOS) and browserify:

brew install node
npm install -g browserify

Pack Readability.js:

npm install @mozilla/readability
echo "window.Readability = require('@mozilla/readability');" > in.js
browserify in.js -o readability.js

Pack jsdom for converting HTML into operable DOM document objects:

npm install jsdom
echo "window.jsdom = require('jsdom');" > in.js
browserify in.js -o jsdom.js

Pack DOMPurify, which the Readability.js security recommendation suggests for sanitizing the output to avoid script injection:

npm install dompurify
echo "window.dompurify = require('dompurify');" > in.js
browserify in.js -o dompurify.js

Writing an R binding

We will write two wrapper JavaScript functions that implement the workflow using all three JS libraries above. The R code later loads them from js/readability-parser.js.

function readabilityParser(html, url, candidates, threshold) {
  // Build a DOM with jsdom, then parse it with Readability.js
  let doc = new jsdom.JSDOM(
    html,
    { url: url }
  );
  let reader = new Readability.Readability(
    doc.window.document,
    { nbTopCandidates: candidates, charThreshold: threshold }
  );
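  let res = reader.parse();

  // parse() returns null when no article content can be extracted
  if (res === null) {
    return null;
  }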

  // Sanitize results to avoid script injection
  const purifyWindow = new jsdom.JSDOM('').window;
  const DOMPurify = dompurify(purifyWindow);

  let clean = DOMPurify.sanitize(res.content);
  res.content = clean;

  return res;
}

function isReadable(html, min_content_length, min_score) {
  let doc = new jsdom.JSDOM(html);
  return Readability.isProbablyReaderable(
    doc.window.document,
    { minContentLength: min_content_length, minScore: min_score }
  );
}
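
When only a few fields are needed on the R side, the parsed result can be trimmed before it is returned. The sketch below assumes the result object carries properties such as title, textContent, and content, as Readability.js results do. pickFields is a hypothetical helper, not part of any of the three libraries, and the sample object is made up for illustration.

```javascript
// Hypothetical helper: keep only the requested fields of a parse result.
// Returns null when the parser found no article (res is null or undefined).
function pickFields(res, fields) {
  if (res === null || res === undefined) {
    return null;
  }
  const out = {};
  for (const field of fields) {
    // Missing fields become null so the shape seen by R stays predictable
    out[field] = field in res ? res[field] : null;
  }
  return out;
}

// Made-up stand-in for a Readability.js result
const sample = {
  title: "A title",
  textContent: "Body text",
  content: "<p>Body text</p>"
};
const slim = pickFields(sample, ["title", "textContent", "excerpt"]);
console.log(JSON.stringify(slim));
```

Returning a smaller object also means less data to serialize when the result crosses from V8 into R.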

The R wrapper is quite straightforward if you follow the V8 introduction vignette. As the vignette suggests, the interactive JavaScript console opened by ct$console() is both fun and useful for debugging.

readability <- function(html, url, candidates = 5L, threshold = 500L) {
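  # Name the global object `window`, as the browserified bundles expect
  ct <- V8::v8(global = "window")

  # Stub web APIs that jsdom references but V8 does not provide
  ct$eval("function setTimeout(){}")
  ct$eval("function clearInterval(){}")
  # Load the TextEncoder polyfill before jsdom, then the bundled libraries
  ct$source("js/encoding.min.js")
  ct$source("js/jsdom.js")
  ct$source("js/dompurify.js")
  ct$source("js/readability.js")
  ct$eval(readLines("js/readability-parser.js"))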

  # ct$get(V8::JS("Object.keys(window)"))
  ct$call("readabilityParser", html, url, candidates, threshold)
}

is_readable <- function(html, min_content_length = 140, min_score = 20) {
  # Same setup as readability(), minus DOMPurify (no HTML is returned here)
  ct <- V8::v8(global = "window")

  ct$eval("function setTimeout(){}")
  ct$eval("function clearInterval(){}")
  ct$source("js/encoding.min.js")
  ct$source("js/jsdom.js")
  ct$source("js/readability.js")
  ct$eval(readLines("js/readability-parser.js"))

  # ct$get(V8::JS("Object.keys(window)"))
  ct$call("isReadable", html, min_content_length, min_score)
}

Example

Let’s parse a recipe page (pasta with caramelized peppers, anchovies, and ricotta) from NYT Cooking.

First, check whether the page is likely to be suitable for readability parsing:

url <- "https://cooking.nytimes.com/recipes/1021246-pasta-with-caramelized-peppers-anchovies-and-ricotta"

html <- url |>
  rvest::read_html() |>
  as.character()

html |> is_readable()
#> [1] TRUE

We can get the title and the clean, plain text corpus, usable for downstream text data modeling:

lst <- html |> readability(url = url)
cat(lst$title)
#> Pasta With Caramelized Peppers, Anchovies and Ricotta Recipe
lst$textContent |>
  gsub("\\n", " ", x = _, perl = TRUE) |>
  gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x = _, perl = TRUE) |>
  stringr::str_wrap(width = 74) |>
  cat()
#> Time 30 minutes Rating 4(1071) Notes Read community notes Caramelized
#> slivers of soft bell peppers and whole cloves of garlic serve as the
#> sweet vegetable-based sauce for this summery pasta. The ricotta gives
#> everything creaminess and body, while herbs and scallions lend freshness,
#> and anchovies depth. If you have an open bottle of wine on hand, you can
#> add a splash to deglaze the tasty browned bits on the bottom of the pan.
#> But don’t bother opening something new: A little water or dry vermouth
#> does the trick nearly as well. Featured in: What’s Better Than Caramelized
#> Onions? Caramelized Peppers Kosher salt 12ounces short pasta, such as
#> radiatori, fusilli or campanelle 3tablespoons extra-virgin olive oil,
#> plus more for drizzling 8 to 10anchovy fillets, chopped, or use a dash
#> or two of soy sauce 2large rosemary sprigs 6garlic cloves, smashed and
#> peeled Large pinch of red-pepper flakes 2sweet bell peppers (red, orange
#> or yellow), thinly sliced 2tablespoons dry red, white or rosé wine, or use
#> dry vermouth or water 1tablespoon unsalted butter Fresh lemon juice ½cup
#> fresh ricotta 2scallions, thinly sliced, or ¼ cup sliced red onion Freshly
#> ground black pepper ¼cup finely chopped fresh mint, basil or thyme, plus
#> torn mint or basil leaves and tender sprigs, for garnish Freshly grated
#> Parmesan (optional) Ingredient Substitution Guide Nutritional analysis per
#> serving (6 servings) 364 calories; 13 grams fat; 4 grams saturated fat; 0
#> grams trans fat; 7 grams monounsaturated fat; 1 gram polyunsaturated fat;
#> 48 grams carbohydrates; 4 grams dietary fiber; 3 grams sugars; 12 grams
#> protein; 356 milligrams sodium Note: The information shown is Edamam’s
#> estimate based on available ingredients and preparation. It should not
#> be considered a substitute for a professional nutritionist’s advice.
#> Powered by Bring a large pot of heavily salted water to a boil. Add the
#> pasta and cook, according to package instructions, until the pasta is
#> just al dente. As pasta cooks, heat a large sauté pan over medium-high,
#> and add 3 tablespoons olive oil. When the oil is hot, add the anchovies
#> and rosemary, and sauté until the anchovies start to dissolve, about 1
#> minute. Add the garlic and red-pepper flakes, and sauté until the garlic
#> turns pale golden in spots, about 1 to 2 minutes. Add the bell peppers
#> and a large pinch of salt to the pan, and sauté until the bell peppers
#> are very soft and well caramelized, 10 to 15 minutes, lowering the heat
#> if the peppers start becoming too dark. Add the wine (or water) and the
#> butter, and sauté, scraping up the browned bits on the bottom of the
#> pan. Taste and season with lemon juice and more salt as needed. Put ¼ cup
#> ricotta and the scallions in a large serving bowl, and season aggressively
#> with black pepper. Use a coffee mug or measuring cup to scoop about ½ cup
#> pasta water from the pot. Drain the pasta, then add it to the bowl with
#> the ricotta and scallions, tossing well. Add the bell pepper mixture and
#> the herbs, and toss well, adding a splash or two of pasta water if the
#> mixture looks dry. Taste and season with more salt if needed. Spoon pasta
#> into bowls, and top with dollops of the remaining ¼ cup ricotta, a drizzle
#> of oil and a little Parmesan, if you like. Shower torn herb leaves over
#> all. Have you cooked this? or to mark this recipe as cooked. Private Notes
#> Leave a Private Note on this recipe and see it here. There aren’t any
#> notes yet. Be the first to leave one. Trending on Cooking Cooking Guides
#> Cooking Guide Basic Knife Skills By Julia Moskin Cooking Guide How to
#> Make a Gingerbread House By Julia Moskin Cooking Guide How to Make Yogurt
#> By Melissa Clark Cooking Guide How to Cook Cauliflower By Alison Roman
#> Cooking Guide How to Make Pancakes By Alison Roman Cooking Guide How to
#> Make Soup By Samin Nosrat Cooking Guide How to Make an Omelet By Melissa
#> Clark Cooking Guide How to Make Pommes Anna By Melissa Clark
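
The whitespace cleanup can also be done on the JavaScript side before the text is handed back to R. A minimal sketch, assuming a hypothetical helper named normalizeWhitespace that is not part of the original wrapper. It collapses every whitespace run to a single space, which is close to, though not identical with, the two gsub() calls above.

```javascript
// Hypothetical helper: collapse runs of whitespace (spaces, tabs, newlines)
// into single spaces and strip leading and trailing whitespace.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Made-up input resembling raw textContent with layout whitespace
const raw = "  Time 30 minutes\n\n   Rating\t4  ";
const tidy = normalizeWhitespace(raw);
console.log(tidy);
```

Line wrapping for display, as done with stringr::str_wrap(), would still happen in R.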

We also get clean HTML that preserves more structural information, which we can process further with tools such as xml2 or pandoc.

lst$content |>
  htmltools::HTML() |>
  htmltools::browsable()

You can preview the clean HTML here.

Common issues

I encountered and resolved two common issues when using the JS libraries.

TextEncoder is not defined

I used the hints here and saved the text-encoding polyfill explicitly as another dependency (sourced as js/encoding.min.js in the R wrappers above). Doing this eliminates the error ReferenceError: TextEncoder is not defined when sourcing jsdom.js with ct$source().
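
Before sourcing a large bundle, it can help to list which globals the context lacks. A minimal sketch, assuming a hypothetical helper findMissingGlobals and a hand-picked list of names. The scope argument stands in for the window object of the V8 context, and a bare object plays that role here.

```javascript
// Hypothetical helper: report which of the given names are undefined on scope.
function findMissingGlobals(scope, names) {
  return names.filter((name) => typeof scope[name] === "undefined");
}

// A bare object stands in for a context that only provides TextDecoder
const bareScope = { TextDecoder: function () {} };
const missing = findMissingGlobals(
  bareScope,
  ["TextEncoder", "TextDecoder", "setTimeout"]
);
console.log(missing.join(", "));
```

Each name reported this way is a candidate for a polyfill or a stub to load before the bundle.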

setTimeout/clearInterval is not defined

Some web APIs, such as timers, are supplied by the host environment (a browser or Node.js) rather than by the V8 engine itself. I followed the suggestions here and defined stubs for setTimeout() and clearInterval() to avoid errors like ReferenceError: setTimeout is not defined when running jsdom.
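
The stubbing idea can be generalized so that a function is only defined when it is absent. This is a sketch under assumptions: installTimerStubs is a hypothetical helper, and covering clearTimeout and setInterval as well goes beyond the two stubs used in this post. Note that such stubs never run the callbacks they receive, so they only suit code paths that do not depend on timers firing.

```javascript
// Hypothetical helper: define no-op timer functions on target where missing.
// Returns the names that were stubbed; existing functions are left untouched.
function installTimerStubs(target) {
  const names = ["setTimeout", "clearTimeout", "setInterval", "clearInterval"];
  const added = [];
  for (const name of names) {
    if (typeof target[name] !== "function") {
      target[name] = function () {}; // no-op: callbacks are never invoked
      added.push(name);
    }
  }
  return added;
}

// A plain object stands in for the `window` global of a bare V8 context
const existing = function () {};
const fakeWindow = { setTimeout: existing };
const stubbed = installTimerStubs(fakeWindow);
console.log(stubbed.join(", "));
```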