More Than 1,500 File Extensions Are Used Inside R Packages

Nan Xiao, December 1, 2021

The R code to reproduce the results is available from the GitHub repo nanxstats/cran-file-exts.

Photo by Pawel Czerwinski.

When applied correctly, file extensions can be informative. They are the very first clue to how a file should be handled, available without parsing the file's content.

To properly capture and classify files in source R packages, I wanted to learn which file extensions R packages use most frequently.

We can achieve this easily by downloading all R packages available from CRAN one at a time and collecting the file extensions inside each of them:

library("curl")
library("tools")

repo <- "https://cran.rstudio.com/"
db <- as.data.frame(available.packages(paste0(repo, "src/contrib/")), stringsAsFactors = FALSE)
pkgs <- db$Package
files <- paste0(pkgs, "_", db$Version, ".tar.gz")
links <- paste0(repo, "src/contrib/", files)

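# Return the unique, non-empty file extensions found in a source tarball
find_ext <- function(path) {
  # List the archive contents without extracting anything
  x <- unique(file_ext(untar(path, list = TRUE)))
  x[nzchar(x)]
}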

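# Download each tarball, record its extensions, then delete it.
# exts.txt ends up with one tab-separated line per package.
for (i in seq_along(pkgs)) {
  cat("Downloading", i, "/", length(pkgs), "\n")
  curl_download(links[i], destfile = files[i])
  x <- find_ext(files[i])
  write(paste0(x, collapse = "\t"), file = "exts.txt", append = TRUE)
  unlink(files[i])
}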
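To make the extraction step more concrete, here is a small sketch of how tools::file_ext() treats a few paths. The paths are made up for illustration and are not taken from any particular package. Files without a dot yield an empty string, which find_ext() drops, while a dotfile such as .Rinstignore is reported as if its whole name were an extension.

```r
library("tools")

# Hypothetical paths, shaped like untar(..., list = TRUE) output
paths <- c(
  "pkg/DESCRIPTION",
  "pkg/R/utils.R",
  "pkg/src/Makevars.win",
  "pkg/inst/extdata/data.tar.gz",
  "pkg/.Rinstignore"
)

# Only the part after the last dot counts as the extension
exts <- file_ext(paths)

# Keep unique, non-empty extensions, mirroring find_ext()
found <- unique(exts[nzchar(exts)])
print(found)
```

Note that a compound suffix like .tar.gz is recorded only as gz, so the counts later on describe the final suffix of each file name.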

Since the collected data is one-dimensional (just a long list of extensions), the natural first step is a frequency table:

x <- readLines("exts.txt")
x <- tolower(unlist(strsplit(x, split = "\t")))
y <- sort(table(x), decreasing = TRUE)
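As a rough illustration of the tallying step, the sketch below runs the same pipeline on three made-up lines in the format of exts.txt (one tab-separated line per package). The toy_lines values are assumptions for demonstration only.

```r
# Three hypothetical packages, one tab-separated line each
toy_lines <- c("R\tRd\tmd", "r\tRd\tcpp", "R\tRd")

# Split into individual extensions and fold case, so "R" and "r" merge
toy_x <- tolower(unlist(strsplit(toy_lines, split = "\t")))

# Count how many packages contain each extension, most common first
toy_y <- sort(table(toy_x), decreasing = TRUE)
print(toy_y)
```

Because each line holds the unique extensions of one package, every count is the number of packages in which that extension appears.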

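It looks like we have 1,529 unique file extensions. The distribution is also heavy-tailed: the roughly 5% of extensions that show up in more than 50 packages account for about 96% of all occurrences. Note that find_ext() deduplicates extensions within each package, so these counts are numbers of packages, not numbers of files.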

length(y)
#> [1] 1529
z <- y[y > 50L]
length(z) / length(y)
#> [1] 0.04905167
sum(z) / sum(y)
#> [1] 0.9611313
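The two ratios above can be wrapped in a small helper to try other thresholds. This is only a sketch: the function name coverage() and the toy counts are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: the share of extensions above a threshold and
# the share of all occurrences those extensions account for
coverage <- function(counts, threshold) {
  top <- counts[counts > threshold]
  c(
    ext_share = length(top) / length(counts),
    occ_share = sum(top) / sum(counts)
  )
}

# Made-up counts: two common extensions and three rare ones
toy <- c(r = 100, rd = 60, cpp = 3, md = 2, txt = 1)
res <- coverage(toy, threshold = 50)
print(round(res, 3))
```

With the real data, coverage(y, 50) should reproduce the two numbers printed above.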

We can also cluster this frequency data with any one-dimensional clustering algorithm, such as maximum homogeneity clustering, which is implemented in my R package oneclust. Say we are interested in file extensions that appear at least 5 times:

library("oneclust")

eoi <- y[y > 4L]
cl <- oneclust(eoi, 4)
cl$cluster
#>   [1] 4 4 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [33] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [65] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [97] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [129] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [161] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [193] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [225] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [257] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
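The long printout is easier to digest as a table of cluster sizes. The sketch below rebuilds the label vector from run lengths read off the printout above (an assumption worth double-checking against your own run) and tabulates it with base R.

```r
# Run lengths read off the printed cl$cluster output above
labels <- c(4L, 3L, 2L, 1L)
runs <- c(2L, 5L, 11L, 258L)
cluster <- rep(labels, times = runs)

# Number of extensions that fall into each cluster
sizes <- table(cluster)
print(sizes)
```

With the actual result object, table(cl$cluster) gives the same summary directly. Since the extensions are sorted by decreasing count, the small clusters at the front of the vector hold the most common extensions.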

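Create a table for everything, ready to be displayed with the awesome DT package. The cluster labels are reversed so that cluster 1 holds the most frequent extensions: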

df <- data.frame(
  "ext" = names(eoi),
  "mime" = mime::guess_type(paste0(".", names(eoi))),
  "count" = as.vector(eoi),
  "cluster" = dplyr::recode(cl$cluster, `1` = 4, `2` = 3, `3` = 2, `4` = 1)
)
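For the display step, a minimal sketch on toy data might look like the following. The counts and cluster labels are made up, the relabeling uses base R arithmetic instead of dplyr::recode(), and the DT call is guarded so the code still runs when DT is not installed.

```r
# Toy stand-ins for eoi and cl$cluster (values are made up)
counts <- c(r = 900L, rd = 850L, cpp = 40L, md = 6L)
cluster <- c(4L, 4L, 2L, 1L)

toy_df <- data.frame(
  ext = names(counts),
  count = as.vector(counts),
  # Reverse the labels so that 1 marks the most frequent group;
  # with 4 clusters this gives the same mapping as the recode() call
  cluster = 5L - cluster
)

# Build an interactive table widget only if DT is available
if (requireNamespace("DT", quietly = TRUE)) {
  widget <- DT::datatable(toy_df)
}
```

Printing widget at the console, or placing it in an R Markdown chunk, renders a sortable and searchable table. For the real data, DT::datatable(df) is the equivalent call.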

After looking through the table, what interesting discoveries did you make?