# More Than 1,500 File Extensions Are Used Inside R Packages

The R code to reproduce the results is available from the GitHub repo nanxstats/cran-file-exts.

When applied correctly, file extensions can be informative. They are the very first clue on handling a specific file without parsing the file content.
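As a quick illustration of how much an extension alone can tell us, here is a small sketch using `tools::file_ext()`. The file paths are made up for the example and are not taken from any particular package.

```r
library("tools")

# Hypothetical paths of the kind found inside a source package
paths <- c(
  "R/utils.R",
  "src/Makevars.win",
  "README.md",
  "DESCRIPTION",
  "pkg_1.0.tar.gz"
)

# file_ext() returns the text after the last dot,
# or an empty string when there is no extension
exts <- file_ext(paths)
exts
# values: "R", "win", "md", "", "gz"
```

Note that only the last component is returned, so `pkg_1.0.tar.gz` yields `gz` rather than `tar.gz`, and extensionless files such as `DESCRIPTION` yield an empty string.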

To properly capture and classify files in source R packages, I am interested in learning what file extensions are frequently used by R packages.

We can achieve this easily by downloading every source package available from CRAN, one at a time, and collecting the file extensions inside each tarball:

library("curl")
library("tools")

repo <- "https://cran.rstudio.com/"
db <- as.data.frame(available.packages(paste0(repo, "src/contrib/")), stringsAsFactors = FALSE)
pkgs <- db$Package
files <- paste0(pkgs, "_", db$Version, ".tar.gz")

find_ext <- function(path) {
  x <- unique(file_ext(untar(path, list = TRUE)))
  x[!(x %in% "")]
}

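for (i in seq_along(pkgs)) {
  # Download the source tarball into the working directory before listing it
  curl_download(paste0(repo, "src/contrib/", files[i]), files[i])
  x <- find_ext(files[i])
  # One tab-separated line of unique extensions per package
  write(paste0(x, collapse = "\t"), file = "exts.txt", append = TRUE)
}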
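To try the extension-listing step without downloading anything from CRAN, the following sketch builds a tiny fake source tarball in a temporary directory and runs the same `find_ext()` logic on it. The package name `demopkg` and its contents are invented for illustration.

```r
library("tools")

find_ext <- function(path) {
  x <- unique(file_ext(untar(path, list = TRUE)))
  x[!(x %in% "")]
}

# Build a minimal, hypothetical package tree in a temporary directory
tmp <- tempfile("demo")
dir.create(file.path(tmp, "demopkg", "R"), recursive = TRUE)
writeLines("Package: demopkg", file.path(tmp, "demopkg", "DESCRIPTION"))
writeLines("f <- function() 1", file.path(tmp, "demopkg", "R", "a.R"))
writeLines("g <- function() 2", file.path(tmp, "demopkg", "R", "b.R"))
writeLines("# demopkg", file.path(tmp, "demopkg", "README.md"))

# Create the tarball from inside tmp so paths start with "demopkg/"
tarball <- file.path(tmp, "demopkg_0.0.1.tar.gz")
old <- setwd(tmp)
tar(tarball, files = "demopkg", compression = "gzip")
setwd(old)

# Two .R files collapse to a single "R"; DESCRIPTION has no extension
exts <- find_ext(tarball)
```

Because `find_ext()` deduplicates within a tarball, each package contributes each extension at most once per spelling, no matter how many files carry it.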

Since this data is one-dimensional, a frequency table is the natural first look:

x <- readLines("exts.txt")
x <- tolower(unlist(strsplit(x, split = "\t")))
y <- sort(table(x), decreasing = TRUE)
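To make the counting step concrete, here is the same pipeline applied to three made-up lines standing in for `exts.txt`, one line per package. The input values are assumptions for illustration only.

```r
# Three hypothetical packages, each with its tab-separated unique extensions
lines <- c("R\tRd\tmd", "r\tRd\tcpp", "R\tRd")

# Flatten to one vector and fold case so "R" and "r" are counted together
x <- tolower(unlist(strsplit(lines, split = "\t")))

# Frequency table, most common extension first
y <- sort(table(x), decreasing = TRUE)

y[["r"]]   # number of packages containing .R or .r files
y[["cpp"]] # number of packages containing .cpp files
```

Lowercasing happens after the per-package deduplication, so a package that ships both `.R` and `.r` files would be counted twice for `r`.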

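It looks like we have 1,529 unique file extensions. The distribution is heavy-tailed: the roughly 5% of extensions that occur more than 50 times account for about 96% of all occurrences. Because extensions are deduplicated within each package before counting, an occurrence here is approximately one package using that extension, not one file.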

length(y)
#> [1] 1529
z <- y[y > 50L]
length(z) / length(y)
#> [1] 0.04905167
sum(z) / sum(y)
#> [1] 0.9611313
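Another way to see the heavy tail is a cumulative share curve: sort the counts and ask what fraction of all occurrences the top k extensions cover. The sketch below uses invented counts that sum to 1,000, not the real CRAN numbers.

```r
# Hypothetical counts per extension (not the CRAN data)
counts <- c(
  r = 500, rd = 300, md = 100, cpp = 50, h = 20,
  txt = 10, csv = 5, png = 5, yml = 5, js = 5
)
counts <- sort(counts, decreasing = TRUE)

# Share of all occurrences covered by the top k extensions
cum_share <- cumsum(counts) / sum(counts)

# Fraction of unique extensions represented by the top k
frac_types <- seq_along(counts) / length(counts)

# In this toy data, the top 20% of extensions cover 80% of occurrences
data.frame(ext = names(counts), frac_types, cum_share)
```

Applied to the real table, the same two vectors could be plotted against each other to show how quickly the curve saturates.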

We can also cluster this frequency data with any one-dimensional clustering algorithm, such as maximum homogeneity clustering, implemented in my R package oneclust. Say we are interested in file extensions that appeared at least 5 times:

library("oneclust")

eoi <- y[y > 4L]
cl <- oneclust(eoi, 4)
cl$cluster
#>   [1] 4 4 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [33] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [65] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [97] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [129] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [161] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [193] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [225] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [257] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Create a table for everything and display it with the awesome DT:

df <- data.frame(
  "ext" = names(eoi),
  "mime" = mime::guess_type(paste0(".", names(eoi))),
  "count" = as.vector(eoi),
  # Reverse the labels so the most frequent group becomes cluster 1
  "cluster" = dplyr::recode(cl$cluster, `1` = 4, `2` = 3, `3` = 2, `4` = 1)
)
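The cluster labels come out with the most frequent extensions in cluster 4, so the table reverses them to put that group first. As a minimal base R sketch of the same relabeling, assuming labels run from 1 to the number of clusters, the reversal can be written arithmetically or as a lookup table. The `cluster` vector below is a shortened, made-up stand-in for `cl$cluster`.

```r
# Hypothetical cluster labels, ordered from most to least frequent extension
cluster <- c(4, 4, 3, 3, 2, 2, 2, 1, 1, 1)

# Arithmetic reversal: 4 -> 1, 3 -> 2, 2 -> 3, 1 -> 4
reversed <- max(cluster) + 1 - cluster

# Equivalent lookup-table form; numeric names need backticks in R
lookup <- c(`1` = 4, `2` = 3, `3` = 2, `4` = 1)
recoded <- unname(lookup[as.character(cluster)])

# With the DT package installed, a data frame built from these columns
# could then be shown as an interactive table:
# DT::datatable(df)
```

The backticks matter: a bare `1 = 4` is not valid R syntax for a named argument or element, which is why the numeric names are quoted this way.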

After looking through the table, what do you find most interesting?