David Robinson has recorded several screencasts in which he analyzes a Tidy Tuesday dataset live. I have watched a few of them and found them interesting and instructive. Since I don’t use R on a daily basis, I have not kept up with the latest additions to the tidyverse, so the screencasts introduced me to functions I was not aware of. Because I sometimes forget which functions I learnt, I wanted to extract all the functions used in the screencasts so that I can easily look up the ones I don’t know yet but should learn.

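The approach I took is to download each analysis file from the screencast repo and extract the libraries and functions it uses.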

# Load libraries
library(tidyverse)
library(rvest)
library(DT)

First, we get all the Rmd files from Dave Robinson’s screencast GitHub repo.

# get list of files that have screencast analysis
githubrepo = read_html("https://github.com/dgrtwo/data-screencasts")
rmdfilelist = githubrepo %>% html_nodes(".content") %>% html_text() %>% str_trim()
rmdfilelist
##  [1] "Failed to load latest commit information."
##  [2] "women-workplace-app"                      
##  [3] "README.md"                                
##  [4] "baltimore_bridges.Rmd"                    
##  [5] "bike_traffic.Rmd"                         
##  [6] "bird-collisions.Rmd"                      
##  [7] "board-games.Rmd"                          
##  [8] "cetaceans.Rmd"                            
##  [9] "college-majors.Rmd"                       
## [10] "data-screencasts.Rproj"                   
## [11] "french-trains.Rmd"                        
## [12] "golden-age-tv.Rmd"                        
## [13] "grand-slams.Rmd"                          
## [14] "malaria.Rmd"                              
## [15] "media-franchises.Rmd"                     
## [16] "medium-datasci.Rmd"                       
## [17] "movie-profit.Rmd"                         
## [18] "nobel-prize.Rmd"                          
## [19] "nyc-restaurants.Rmd"                      
## [20] "plastic-waste.Rmd"                        
## [21] "r-downloads.Rmd"                          
## [22] "ramen-ratings.Rmd"                        
## [23] "seattle-pets.Rmd"                         
## [24] "space-launches.Rmd"                       
## [25] "student-teacher-ratios.Rmd"               
## [26] "thanksgiving.Rmd"                         
## [27] "tidytuesday-tweets.Rmd"                   
## [28] "trees.Rmd"                                
## [29] "umbrella-week.Rmd"                        
## [30] "us-dairy.Rmd"                             
## [31] "us-wind.Rmd"                              
## [32] "us_phds.Rmd"                              
## [33] "wine-ratings.Rmd"                         
## [34] "women-workplace.Rmd"                      
## [35] "womens-world-cup.Rmd"
# keep only the Rmd files; the other entries (status message, folder, README, .Rproj) are not analysis files
rmdfilelist = str_subset(rmdfilelist, "\\.Rmd$")

# get the link to raw file for each Rmd file
rmdfileListlink = paste0("https://raw.githubusercontent.com/dgrtwo/data-screencasts/master/", rmdfilelist)

Read each Rmd file and extract the libraries and functions used

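The approach to extracting functions is based on the blog post “Top 100 most used R functions on GitHub”. It looks for words that precede a left parenthesis. It is not perfect (for example, it also picks up keywords such as if and ordinary words that happen to be followed by a parenthesis), but it is good enough for this analysis.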
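To make the pattern concrete, here is a minimal sketch that applies the same regular expression used in the loop below to a short, made-up snippet. The snippet text and the variable names snippet, matches and found are only illustrative assumptions, not part of the original analysis.

```r
library(stringr)

# Same pattern as in the loop below: a name followed by "(" or by " ("
fnpattern <- "([a-zA-Z][a-zA-Z0-9_.]{0,43}[(])|([a-zA-Z][a-zA-Z0-9_.]{0,43}[ ][(])"

# Hypothetical text standing in for the contents of an Rmd file
snippet <- '
ratings %>%
  filter(!is.na(score)) %>%
  mutate(log_score = log(score))
if (nrow(ratings) > 0) print("ok")  # see note (below)
'

matches <- str_extract_all(snippet, fnpattern)[[1]]

# Drop the parenthesis and any space in front of it
found <- str_trim(str_remove(matches, "\\("))
found
```

On this snippet the result contains the real function names (filter, is.na, mutate, log, nrow, print) but also if and note, which shows the kind of false positive the pattern lets through.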

liblistL = list()
fnlistL = list()

for(i in seq_along(rmdfileListlink)) {
  
  #print(i)
  
  # Read the Rmd file
  rmdfile = read_file(rmdfileListlink[i])
  
  # get list of libraries used in the file
  liblist = str_extract_all(rmdfile, "library\\(.*\\)") %>% unlist() %>%
    gsub("(library|\\(|\\))", "", .)
  liblistL[[i]] = tibble(liblist)
  
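  # pattern to look for function: a name followed by "(" or by " ("
  fnpattern = "([a-zA-Z][a-zA-Z0-9_.]{0,43}[(])|([a-zA-Z][a-zA-Z0-9_.]{0,43}[ ][(])"
  # drop the parenthesis and any space before it, so "if (" and "if(" count as the same name
  fnlist = str_extract_all(rmdfile, fnpattern)[[1]] %>% str_remove("\\(") %>% str_trim()
  fnlistL[[i]] = tibble(fnlist = fnlist)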
  
}

# dataframe with list of libraries
liblistdf = bind_rows(liblistL)
libcountdf = liblistdf %>% count(liblist, sort = TRUE)
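Note that count() tallies rows, so the counts equal the number of analyses only if each library appears at most once per file. The sketch below, on made-up data, shows one way to make that explicit by keeping a file identifier and removing duplicates before counting. The names toyL and toycount and the toy library lists are assumptions for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical per-file results, shaped like the elements of liblistL
toyL <- list(
  tibble(liblist = c("tidyverse", "lubridate", "tidyverse")),  # tidyverse loaded twice
  tibble(liblist = c("tidyverse", "broom"))
)

toycount <- bind_rows(toyL, .id = "file") %>%  # .id records which list element a row came from
  distinct(file, liblist) %>%                  # one row per (file, library) pair
  count(liblist, sort = TRUE)
toycount
```

Here tidyverse is counted twice (once per file) rather than three times, which matches the "number of analyses" reading of the plot.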

The plot below shows the number of analyses in which each package was used.

ggplot(libcountdf) + geom_col(aes(x = fct_reorder(liblist, n), y = n), fill = "blue") + 
  labs(x = "", y = "# analyses that used the library") + theme_bw() + coord_flip()