Exploring Kaggle User Profiles

I extracted user information for roughly the top 500 ranked Kaggle users and created exploratory plots of the best user rank by country and the number of users by country.

Scraping User Information

The Kaggle users page (http://www.kaggle.com/users) lists users in rank order across multiple pages. About 13 pages cover users with ranks up to 520. I extracted each user's name, rank, number of competitions, and country. Not all users listed a country, so the data contains some unknown locations.

# set working directory
setwd("~/notesofdabbler/githubfolder/blog_notesofdabbler/exploreKaggle/")

# load libraries
library(rvest)
## Warning: package 'rvest' was built under R version 3.1.2
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(googleVis)
## Warning: package 'googleVis' was built under R version 3.1.2
## 
## Welcome to googleVis version 0.5.7
## 
## Please read the Google API Terms of Use
## before you start using the package:
## https://developers.google.com/terms/
## 
## Note, the plot method of googleVis will by default use
## the standard browser to display its output.
## 
## See the googleVis package vignettes for more details,
## or visit http://github.com/mages/googleVis.
## 
## To suppress this message use:
## suppressPackageStartupMessages(library(googleVis))
op <- options(gvis.plot.tag='chart')

library(ggplot2)

# Get url of pages that contain users
# 13 pages correspond to about top 500 ranks

url = "http://www.kaggle.com/users"
urllist = paste(url,seq(2,13),sep="?page=")
urllist = c(url,urllist)
head(urllist)
## [1] "http://www.kaggle.com/users"       
## [2] "http://www.kaggle.com/users?page=2"
## [3] "http://www.kaggle.com/users?page=3"
## [4] "http://www.kaggle.com/users?page=4"
## [5] "http://www.kaggle.com/users?page=5"
## [6] "http://www.kaggle.com/users?page=6"
# Function to parse user information from each page
getURL = function(url){
    # get list of users
    users = html(url) %>% html_nodes(".users-list")
    # get user names
    usrname = users %>% html_nodes(".profilelink") %>% html_text()
    # get user rank
    usrrnk = users %>% html_nodes(".rank") %>% html_text()
    # get number of competitions a user participated in 
    usrnumcomp = users %>% html_nodes(".comps") %>% html_text()
    # get user country 
    usrloc = users %>% html_nodes("li") %>% html_text()
    usrloc2 = sapply(as.list(usrloc),function(x) {
        xsplit = strsplit(x,"\r\n")[[1]]
        xtrim = str_trim(xsplit)
        xnoblank = xtrim[xtrim != ""]
        xloc = xnoblank[length(xnoblank)]
        return(xloc)
      })
    # combine into a dataframe
    usrdf = data.frame(usrname,usrrnk,usrnumcomp,usrloc2,stringsAsFactors = FALSE)
    return(usrdf)
}

# compile user information 
usrdf = list()
length(usrdf) = length(urllist)

for(i in seq_along(urllist)){
  usrdf[[i]] = getURL(urllist[i])
}

usrdf2 = rbind_all(usrdf)
head(usrdf2)
##               usrname usrrnk      usrnumcomp            usrloc2
## 1                Owen    1st 28 competitions      United States
## 2     BreakfastPirate    2nd 30 competitions      United States
## 3        David Thaler    3rd 16 competitions      United States
## 4 Alexander D'yakonov    4th 28 competitions Russian Federation
## 5     Yasser Tabandeh    5th 28 competitions               Iran
## 6            KazAnova    6th 39 competitions             Greece
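
To make the location logic inside getURL concrete, here is a minimal sketch on made-up list-item text (the strings and their layout are assumptions for illustration, not actual Kaggle markup). The helper name last_nonblank is hypothetical, and base R gsub stands in for str_trim so the sketch runs without stringr.

```r
# Hypothetical helper mirroring the anonymous function inside getURL
last_nonblank <- function(x) {
  parts <- strsplit(x, "\r\n")[[1]]          # split the item text into lines
  parts <- gsub("^\\s+|\\s+$", "", parts)    # trim whitespace (base-R stand-in for str_trim)
  parts <- parts[parts != ""]                # drop empty lines
  parts[length(parts)]                       # keep the last remaining line
}

# Made-up list-item text: one user with a country, one without
with_country <- "\r\n  1st\r\n  Owen\r\n  28 competitions\r\n  United States\r\n"
no_country   <- "\r\n  42nd\r\n  Some User\r\n  12 competitions\r\n"

loc_known   <- last_nonblank(with_country)
loc_missing <- last_nonblank(no_country)
print(loc_known)
print(loc_missing)
```

When no country is listed, the last non-blank line is the competition count, which is why the location field needs the clean-up step that follows.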

The user country field needed some manual cleaning to match the country names that the googleVis geo chart recognizes. I exported the list of countries from the extracted Kaggle user data to a CSV file, manually added the corresponding googleVis country name for each entry, and imported the result.

# clean up location field
# the extraction logic takes the last non-blank entry of each list item as the location
# if a user listed no location, that last entry is the number of competitions instead
# so any location containing "competition" is set to "unknown"
tmpqc = usrdf2[grepl("competition",usrdf2[["usrloc2"]]),]
usrdf2[["usrloc2"]][grepl("competition",usrdf2[["usrloc2"]])] = "unknown"

# Convert user rank to a numeric value (strip every non-digit character)
usrdf2["usrrnk2"] = as.numeric(gsub("[^0-9]","",usrdf2[["usrrnk"]]))
# Convert number of competitions to a numeric value
usrdf2["usrnumcomp2"] = as.numeric(gsub("[^0-9]","",usrdf2[["usrnumcomp"]]))

# Get list of countries
cntrycnt = usrdf2 %>% group_by(usrloc2) %>% summarize(cntrycnt = n()) %>% arrange(desc(cntrycnt))
cntrylist = cntrycnt$usrloc2

# write country list to a csv file for manual cleaning
# Needed to manually assign country name that googleVis chart recognizes
#write.csv(cntrylist,"cntrylist.csv")

# read in the cleaned country list
cntrylist_cleaned = read.csv("cntrylist_cleaned.csv",sep=",")

# merge cleaned country list
usrdf3 = merge(usrdf2,cntrylist_cleaned,by.x=c("usrloc2"),by.y=c("cntryName"))

head(usrdf3)
##      usrloc2         usrname usrrnk      usrnumcomp usrrnk2 usrnumcomp2
## 1  Argentina   Martin Martin   69th 12 competitions      69          12
## 2 Austin, TX   Jason Sumpter  211th 15 competitions     211          15
## 3  Australia            tund  331st  8 competitions     331           8
## 4  Australia James Petterson  166th  9 competitions     166           9
## 5  Australia   Elliot Dawson  147th 30 competitions     147          30
## 6  Australia  Scott Thompson  267th 17 competitions     267          17
##      cntryName2
## 1     Argentina
## 2 United States
## 3     Australia
## 4     Australia
## 5     Australia
## 6     Australia
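
The cleaning and lookup steps can be tried end to end on a few made-up rows. This is a sketch under stated assumptions: the user names and the contents of the lookup table are invented, and only the column names (usrloc2, cntryName, cntryName2) follow the code and output above.

```r
# Made-up rows in the shape of usrdf2 (values are illustrative only)
toy <- data.frame(
  usrname    = c("user_a", "user_b", "user_c"),
  usrrnk     = c("69th", "211th", "331st"),
  usrnumcomp = c("12 competitions", "15 competitions", "8 competitions"),
  usrloc2    = c("Argentina", "Austin, TX", "8 competitions"),
  stringsAsFactors = FALSE
)

# A location that still contains "competition" means no country was listed
toy$usrloc2[grepl("competition", toy$usrloc2)] <- "unknown"

# Strip every non-digit character, then convert to numeric
toy$usrrnk2     <- as.numeric(gsub("[^0-9]", "", toy$usrrnk))
toy$usrnumcomp2 <- as.numeric(gsub("[^0-9]", "", toy$usrnumcomp))

# Hypothetical stand-in for cntrylist_cleaned.csv
lookup <- data.frame(
  cntryName  = c("Argentina", "Austin, TX", "unknown"),
  cntryName2 = c("Argentina", "United States", "unknown"),
  stringsAsFactors = FALSE
)

# Raw location on the left, cleaned country name on the right
toy_clean <- merge(toy, lookup, by.x = "usrloc2", by.y = "cntryName")
print(toy_clean[, c("usrname", "usrloc2", "usrrnk2", "usrnumcomp2", "cntryName2")])
```

By default merge performs an inner join, so any raw location missing from the lookup table is dropped from the result without a warning. Passing all.x = TRUE keeps those rows with NA in cntryName2, which makes gaps in the manual list easier to spot.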

Analyze and Visualize Data

I summarize the data by country to get the following:

  • Number of users
  • Average rank
  • Best rank
  • Competitions per user

# Summary by country
#  - number of users
#  - average rank of users
#  - best rank of users
#  - competitions per user
cntrycnt3 = usrdf3 %>% group_by(cntryName2)%>%
                  summarize(cntrycnt = n(),avgrnk = mean(usrrnk2),bestrnk = min(usrrnk2),
                            totcomp = sum(usrnumcomp2)) %>% 
                  arrange(bestrnk)
cntrycnt3["compperusr"] = cntrycnt3[["totcomp"]]/cntrycnt3[["cntrycnt"]]
head(cntrycnt3)
## Source: local data frame [6 x 6]
## 
##           cntryName2 cntrycnt   avgrnk bestrnk totcomp compperusr
## 1      United States      148 260.6486       1    1980   13.37838
## 2 Russian Federation       28 198.1429       4     464   16.57143
## 3               Iran        1   5.0000       5      28   28.00000
## 4             Greece        4 262.5000       6      86   21.50000
## 5             Turkey        1   7.0000       7      23   23.00000
## 6              Spain       14 233.2143       8     171   12.21429
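
For readers who want to check the arithmetic without dplyr, the following base R sketch computes the same per-country quantities on a made-up data frame. The country labels "A" and "B" and all values are invented; only the column names follow the code above.

```r
# Made-up users (illustrative values, not from the scraped data)
toy <- data.frame(
  cntryName2  = c("A", "A", "B", "B", "B"),
  usrrnk2     = c(10, 30, 5, 50, 95),
  usrnumcomp2 = c(4, 6, 10, 2, 3),
  stringsAsFactors = FALSE
)

# Build one summary row per country, then stack the rows
by_country <- do.call(rbind, lapply(split(toy, toy$cntryName2), function(d) {
  data.frame(
    cntryName2 = d$cntryName2[1],
    cntrycnt   = nrow(d),              # number of users
    avgrnk     = mean(d$usrrnk2),      # average rank
    bestrnk    = min(d$usrrnk2),       # best rank is the smallest number
    totcomp    = sum(d$usrnumcomp2),   # total competitions entered
    stringsAsFactors = FALSE
  )
}))

# Competitions per user = total competitions / number of users
by_country$compperusr <- by_country$totcomp / by_country$cntrycnt

# Sort so the country with the best-ranked user comes first
by_country <- by_country[order(by_country$bestrnk), ]
print(by_country)
```

The same relationship holds in the output above: for the United States, 1980 competitions across 148 users gives about 13.38 competitions per user.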

The plot below shows the best rank in each country.

# plot user data (best rank by country) in a geochart
pltdf = cntrycnt3 %>% select(-totcomp)
names(pltdf) = c("country","NumberOfUsers","AverageRank","BestRank","CompetitionsPerUser")
G = gvisGeoChart(pltdf, locationvar = "country",
                        colorvar = "BestRank",
                        options=list(height=600,width=600))
# name the table Tbl rather than T, so the built-in shorthand for TRUE is not masked
Tbl = gvisTable(pltdf,options = list(height=600,width=600),
                formats=list(AverageRank = "#",
                             CompetitionsPerUser = "#.#"))

GT = gvisMerge(G,Tbl,horizontal = TRUE)
plot(GT)

The plot below shows the number of users by country (caveat: this is based only on the top 500 ranked users).

# Plot number of users by country
G2 = gvisGeoChart(pltdf, locationvar = "country",
                  colorvar = "NumberOfUsers",
                  options=list(height=600,width=600))
plot(G2)

Session Info

All analysis was done with RStudio 0.98.1062.

sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.0   googleVis_0.5.7 dplyr_0.2       stringr_0.6.2  
## [5] rvest_0.1.0    
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5  
##  [5] formatR_1.0      grid_3.1.1       gtable_0.1.2     htmltools_0.2.6 
##  [9] httr_0.5         knitr_1.8        magrittr_1.0.1   MASS_7.3-33     
## [13] munsell_0.4.2    parallel_3.1.1   plyr_1.8.1       proto_0.3-10    
## [17] Rcpp_0.11.2      RCurl_1.95-4.3   reshape2_1.4     RJSONIO_1.3-0   
## [21] rmarkdown_0.3.3  scales_0.2.4     selectr_0.2-2    tools_3.1.1     
## [25] XML_3.98-1.1