Keeping up with Tidyverse Functions using Tidy Tuesday Screencasts

David Robinson has done several screencasts in which he analyzes a Tidy Tuesday dataset live. I have watched a few of them and found them very interesting and instructive. As I don’t use R on a daily basis, I have not kept up with the latest in the Tidyverse, so the screencasts showed me functions that I was not aware of. Since I sometimes forget which functions I learnt, I wanted to extract all the functions used in the screencasts so that it is easier to look up the ones I don't know yet but should learn.

The approach I took is:

Get all the Rmd analysis files from the screencast GitHub repo.
Extract the list of libraries and functions used in each .Rmd file.
Plot frequencies of function use and review the functions that I am not aware of.

# Load libraries
library(tidyverse)
library(rvest)
library(DT)

First we get all the Rmd files from David Robinson’s screencast GitHub repo.

# get list of files that have screencast analysis
githubrepo = read_html("https://github.com/dgrtwo/data-screencasts")
rmdfilelist = githubrepo %>% html_nodes(".content") %>% html_text() %>% str_trim()
rmdfilelist

##  [1] "Failed to load latest commit information."
##  [2] "women-workplace-app"                      
##  [3] "README.md"                                
##  [4] "baltimore_bridges.Rmd"                    
##  [5] "bike_traffic.Rmd"                         
##  [6] "bird-collisions.Rmd"                      
##  [7] "board-games.Rmd"                          
##  [8] "cetaceans.Rmd"                            
##  [9] "college-majors.Rmd"                       
## [10] "data-screencasts.Rproj"                   
## [11] "french-trains.Rmd"                        
## [12] "golden-age-tv.Rmd"                        
## [13] "grand-slams.Rmd"                          
## [14] "malaria.Rmd"                              
## [15] "media-franchises.Rmd"                     
## [16] "medium-datasci.Rmd"                       
## [17] "movie-profit.Rmd"                         
## [18] "nobel-prize.Rmd"                          
## [19] "nyc-restaurants.Rmd"                      
## [20] "plastic-waste.Rmd"                        
## [21] "r-downloads.Rmd"                          
## [22] "ramen-ratings.Rmd"                        
## [23] "seattle-pets.Rmd"                         
## [24] "space-launches.Rmd"                       
## [25] "student-teacher-ratios.Rmd"               
## [26] "thanksgiving.Rmd"                         
## [27] "tidytuesday-tweets.Rmd"                   
## [28] "trees.Rmd"                                
## [29] "umbrella-week.Rmd"                        
## [30] "us-dairy.Rmd"                             
## [31] "us-wind.Rmd"                              
## [32] "us_phds.Rmd"                              
## [33] "wine-ratings.Rmd"                         
## [34] "women-workplace.Rmd"                      
## [35] "womens-world-cup.Rmd"

# keep only the .Rmd analysis files (drops the status message, the app folder,
# README.md and data-screencasts.Rproj)
rmdfilelist = str_subset(rmdfilelist, "\\.Rmd$")

# get the link to raw file for each Rmd file
rmdfileListlink = paste0("https://raw.githubusercontent.com/dgrtwo/data-screencasts/master/", rmdfilelist)

Next, we read each Rmd file and extract:

List of libraries used
List of functions used

The approach to extract functions is based on the blog post "Top 100 most used R functions on GitHub". It looks for words that precede a left parenthesis. It is not perfect but is good enough for this analysis.
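As a minimal sketch of the idea, the two patterns can be tried on a small piece of text before running them on the real files. The sample text below is made up for illustration; the patterns are the ones used in this post.

```r
library(stringr)

# Hypothetical text standing in for the contents of one Rmd file
snippet <- paste(
  "library(tidyverse)",
  "library(lubridate)",
  "df %>% filter(x > 1) %>% mutate(y = log(x)) %>% count (y)",
  "if (nrow(df) > 0) print(df)",
  sep = "\n"
)

# Packages: everything inside library(...)
libs <- str_extract_all(snippet, "library\\(.*\\)")[[1]]
libs <- gsub("(library|\\(|\\))", "", libs)

# Functions: a name followed by "(" or by " ("
fnpattern <- "([a-zA-Z][a-zA-Z0-9_.]{0,43}[(])|([a-zA-Z][a-zA-Z0-9_.]{0,43}[ ][(])"
fns <- str_extract_all(snippet, fnpattern)[[1]]
# Drop the parenthesis and the optional space in front of it
fns <- str_trim(str_remove(fns, fixed("(")))

libs
fns
```

Note that `fns` also contains `library` and `if`: the pattern cannot tell a function call from a control-flow keyword, which is one reason the result is only approximate.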

liblistL = list()
fnlistL = list()

# pattern to look for a function: a name followed by "(" or by " ("
fnpattern = "([a-zA-Z][a-zA-Z0-9_.]{0,43}[(])|([a-zA-Z][a-zA-Z0-9_.]{0,43}[ ][(])"

for(i in seq_along(rmdfileListlink)) {
  
  # Read the Rmd file
  rmdfile = read_file(rmdfileListlink[i])
  
  # get list of libraries used in the file
  # (unique, so that each package is counted once per analysis)
  liblist = str_extract_all(rmdfile, "library\\(.*\\)") %>% unlist() %>%
    gsub("(library|\\(|\\))", "", .) %>% unique()
  liblistL[[i]] = tibble(liblist)
  
  # extract function names; drop the "(" and any space before it
  fnlist = str_extract_all(rmdfile, fnpattern)[[1]] %>%
    gsub("\\(", "", .) %>% str_trim()
  fnlistL[[i]] = tibble(fnlist = fnlist)
  
}

# dataframe with list of libraries
liblistdf = bind_rows(liblistL)
libcountdf = liblistdf %>% count(liblist, sort = TRUE)

The plot below shows the number of analyses in which each package was used.

ggplot(libcountdf) + geom_col(aes(x = fct_reorder(liblist, n), y = n), fill = "blue") + 
  labs(x = "", y = "# analyses that used the package") + theme_bw() + coord_flip()

It is no surprise that tidyverse is the top package. It is interesting that lubridate is second. broom is used quite a bit because, after the exploratory analysis, David explores some models in the screencasts. There are several packages that I was not aware of; I will probably look up widyr, fuzzyjoin, glue, janitor and patchwork, and the context in which they were used in the screencasts.

# dataframe with list of functions
fnlistdf = bind_rows(fnlistL)
fncountdf = fnlistdf %>% count(fnlist, sort = TRUE)

The table below lists the functions extracted from Rmd files.

datatable(fncountdf)

The current logic extracts things such as aes, c and scale_x_continuous, which are not of interest to us here. We will combine the above data with the list of functions in the tidyverse packages to clean up the list of functions of interest.
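A minimal sketch of this filtering step on toy data is shown here. The extracted names are made up, and base R's stats package stands in for the tidyverse packages only because it ships with every R installation. The sketch also uses getNamespaceExports(), which returns exported names only, whereas the code in this post lists every object in the namespace.

```r
library(dplyr)
library(tibble)

# Hypothetical extracted names; "my_helper" stands in for a user-defined function
extracted <- tibble(fnlist = c("median", "sd", "my_helper", "median"))
fncount <- extracted %>% count(fnlist, sort = TRUE)

# Lookup table of package functions (stats is an assumption for illustration)
lookup <- tibble(pkg = "stats", fnlist = getNamespaceExports("stats"))

# The inner join keeps only names that match a function in the lookup table
kept <- inner_join(fncount, lookup, by = "fnlist")
kept
```

Names without a match, such as `my_helper`, drop out of the result, and each remaining name carries the package it came from.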

# get all functions in tidyverse packages
# https://stackoverflow.com/questions/8696158/find-all-functions-including-private-in-a-package/8696442#8696442
#
pkglist = tidyverse_packages()
pkglist[which(pkglist == "readxl\n(>=")] = "readxl"
pkgfunsL = list()
for(i in 1:length(pkglist)) {
  #print(i)
  pkgfuns = ls(getNamespace(pkglist[i]), all.names=TRUE)
  pkgfunsL[[i]] = tibble(pkg = pkglist[i], pkgfuns = pkgfuns)
}

pkgfunsdf = bind_rows(pkgfunsL)

# keep extracted functions that have an associated tidyverse package
fncountdf2 = inner_join(fncountdf, pkgfunsdf %>% rename(fnlist = pkgfuns), by = "fnlist")

fncountpltdf = fncountdf2 %>%
       mutate(fnlist2 = fct_reorder(fnlist, n))

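# Count number of functions used in each package
# (the column is named explicitly because count() would call it n or nn
# depending on the dplyr version)
pkgfncount = fncountdf2 %>% group_by(pkg) %>% 
  summarise(nfuns = n_distinct(fnlist)) %>% arrange(desc(nfuns))

The plot below shows the number of distinct functions used from each package.

ggplot(pkgfncount) + geom_col(aes(x = fct_reorder(pkg, nfuns), y = nfuns), fill = "blue") + 
  labs(x = "", y = "# functions") + theme_bw() + coord_flip()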

As expected, most of the functions used come from ggplot2, dplyr and tidyr, since the screencasts involve a lot of exploratory analysis and visualization of data.

We next plot the list of functions used from each package.

getplot = function(df) {
  p = ggplot() + geom_col(data = df, aes(x = fnlist2, y = n), fill = "blue") 
  p = p + facet_wrap(~pkg, scales = "free") + labs(x = "", y = "")
  p = p + coord_flip() + theme_bw()
  p 
}

getplot(fncountpltdf %>% 
          filter(pkg %in% c("ggplot2", "dplyr"))
)

getplot(fncountpltdf %>% 
          filter(pkg %in% c("tidyr", "lubridate", "stringr"))
)

getplot(fncountpltdf %>% 
          filter(!(pkg %in% c("ggplot2", "dplyr", "tidyr", "lubridate", "stringr")))
)

Based on the above figures, I am listing below some functions that I was not aware of and should learn:

The count function in dplyr, as an easier way to count rows in each group or sum a variable for each group.
The geom_col function in ggplot2, for bar graphs.
The forcats package for working with factors. fct_reorder and fct_lump from the package were used frequently.
The tidyr functions nest/unnest, crossing and separate_rows.
Several stringr functions that were used in the screencasts. I realized that I know only a few functions in stringr and should learn more.
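To make a few of these concrete, here is a short sketch of count, fct_lump, separate_rows and crossing. The pets data is invented for illustration and is not taken from any of the screencasts.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical toy data: one row per pet, with comma-separated tags
pets <- tibble(
  species = c("dog", "dog", "cat", "parrot", "gecko"),
  weight  = c(20, 30, 4, 1, 0.1),
  tags    = c("friendly,large", "large", "small", "small,loud", "small")
)

# count(): number of rows per group ...
n_by_species <- pets %>% count(species, sort = TRUE)
# ... or the sum of a variable per group via wt
weight_by_species <- pets %>% count(species, wt = weight)

# fct_lump(): keep the most frequent level, lump the rest into "Other"
lumped <- pets %>% mutate(species = fct_lump(species, n = 1))

# separate_rows(): one row per tag instead of one row per pet
long_tags <- pets %>% separate_rows(tags, sep = ",")

# crossing(): all combinations of the supplied values
grid <- crossing(species = c("dog", "cat"), year = 2018:2019)
```

Printing each intermediate object is a quick way to see what the function does before meeting it in a longer pipeline.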

Session Info

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.13.6 (unknown)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2  DT_0.7          rvest_0.3.2     xml2_1.1.1     
##  [5] forcats_0.3.0   stringr_1.3.1   dplyr_0.7.8     purrr_0.2.4    
##  [9] readr_1.1.1     tidyr_0.8.2     tibble_1.4.2    ggplot2_3.1.0  
## [13] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       lubridate_1.7.1  lattice_0.20-33  assertthat_0.2.0
##  [5] rprojroot_1.3-2  digest_0.6.13    psych_1.7.8      mime_0.5        
##  [9] R6_2.2.2         cellranger_1.1.0 plyr_1.8.4       backports_1.1.2 
## [13] reprex_0.2.1     evaluate_0.10.1  httr_1.3.1       pillar_1.3.0    
## [17] rlang_0.3.0.1    lazyeval_0.2.0   curl_2.0         readxl_1.0.0    
## [21] rstudioapi_0.7   rmarkdown_1.8    labeling_0.3     selectr_0.3-1   
## [25] foreign_0.8-66   htmlwidgets_1.3  munsell_0.5.0    shiny_1.0.3     
## [29] broom_0.4.2      httpuv_1.3.5     modelr_0.1.2     pkgconfig_2.0.1 
## [33] mnormt_1.5-5     htmltools_0.3.6  tidyselect_0.2.5 XML_3.98-1.4    
## [37] crayon_1.3.4     dbplyr_1.2.2     withr_2.1.2      grid_3.3.1      
## [41] nlme_3.1-128     jsonlite_1.5     xtable_1.8-2     gtable_0.2.0    
## [45] DBI_1.0.0        magrittr_1.5     scales_1.0.0     cli_1.0.1       
## [49] stringi_1.2.4    reshape2_1.4.1   fs_1.2.6         tools_3.3.1     
## [53] glue_1.3.0       hms_0.4.2        crosstalk_1.0.0  parallel_3.3.1  
## [57] yaml_2.1.16      colorspace_1.2-6 knitr_1.19       bindr_0.1.1     
## [61] haven_1.1.0


Notes of Dabbler

2019-08-07
