I extracted user information from Kaggle for roughly the top 500 ranked users and created exploratory plots of the best rank per country and the number of users per country.
The Kaggle users page (http://www.kaggle.com/users) lists users in rank order across multiple pages. About 13 pages cover users with ranks up to 520. From each page I extracted the user name, user rank, number of competitions, and location (country). Not all users listed a country, so the data contains some unknown locations.
# set working directory
setwd("~/notesofdabbler/githubfolder/blog_notesofdabbler/exploreKaggle/")
# load libraries
library(rvest)
## Warning: package 'rvest' was built under R version 3.1.2
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(googleVis)
## Warning: package 'googleVis' was built under R version 3.1.2
##
## Welcome to googleVis version 0.5.7
##
## Please read the Google API Terms of Use
## before you start using the package:
## https://developers.google.com/terms/
##
## Note, the plot method of googleVis will by default use
## the standard browser to display its output.
##
## See the googleVis package vignettes for more details,
## or visit http://github.com/mages/googleVis.
##
## To suppress this message use:
## suppressPackageStartupMessages(library(googleVis))
op <- options(gvis.plot.tag='chart')
library(ggplot2)
# Get url of pages that contain users
# 13 pages correspond to about top 500 ranks
url = "http://www.kaggle.com/users"
urllist = paste(url,seq(2,13),sep="?page=")
urllist = c(url,urllist)
head(urllist)
## [1] "http://www.kaggle.com/users"
## [2] "http://www.kaggle.com/users?page=2"
## [3] "http://www.kaggle.com/users?page=3"
## [4] "http://www.kaggle.com/users?page=4"
## [5] "http://www.kaggle.com/users?page=5"
## [6] "http://www.kaggle.com/users?page=6"
# Function to parse user information from each page
getURL = function(url){
  # get list of users
  users = html(url) %>% html_nodes(".users-list")
  # get user names
  usrname = users %>% html_nodes(".profilelink") %>% html_text()
  # get user rank
  usrrnk = users %>% html_nodes(".rank") %>% html_text()
  # get number of competitions a user participated in
  usrnumcomp = users %>% html_nodes(".comps") %>% html_text()
  # get user country
  usrloc = users %>% html_nodes("li") %>% html_text()
  usrloc2 = sapply(as.list(usrloc),function(x) {
    xsplit = strsplit(x,"\r\n")[[1]]
    xtrim = str_trim(xsplit)
    xnoblank = xtrim[xtrim != ""]
    xloc = xnoblank[length(xnoblank)]
    return(xloc)
  })
  # combine into a dataframe
  usrdf = data.frame(usrname,usrrnk,usrnumcomp,usrloc2,stringsAsFactors = FALSE)
  return(usrdf)
}
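To see what the location step in `getURL` does, the sketch below applies the same split, trim, and last-entry logic to two made-up list-item strings. The sample text is an assumption about the page layout, not text captured from Kaggle, and base R's `trimws` stands in for `str_trim` so the example runs without extra packages.

```r
# Hypothetical text of two <li> entries; the real page layout may differ
sample_li <- c(
  "\r\n  Owen\r\n  1st\r\n  28 competitions\r\n  United States\r\n",
  "\r\n  Some User\r\n  500th\r\n  3 competitions\r\n"   # no location listed
)

# Same idea as in getURL: split on line breaks, trim, drop blanks, keep the last entry
last_entry <- function(x) {
  parts <- trimws(strsplit(x, "\r\n")[[1]])
  parts <- parts[parts != ""]
  parts[length(parts)]
}

locs <- sapply(sample_li, last_entry, USE.NAMES = FALSE)
print(locs)
```

For the second entry the last non-blank piece is the competition count rather than a country, which is why the location field needs a cleanup pass after scraping.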
# compile user information
usrdf = list()
length(usrdf) = length(urllist)
for(i in 1:length(urllist)){
usrdf[[i]] = getURL(urllist[i])
}
usrdf2 = rbind_all(usrdf)
head(usrdf2)
## usrname usrrnk usrnumcomp usrloc2
## 1 Owen 1st 28 competitions United States
## 2 BreakfastPirate 2nd 30 competitions United States
## 3 David Thaler 3rd 16 competitions United States
## 4 Alexander D'yakonov 4th 28 competitions Russian Federation
## 5 Yasser Tabandeh 5th 28 competitions Iran
## 6 KazAnova 6th 39 competitions Greece
The user country field needed some manual cleaning to match the country names that the googleVis geo chart recognizes. I exported the list of locations from the extracted Kaggle user data to a CSV file, manually added the matching googleVis country names, and imported it back.
# clean up location field
# the extraction logic takes the last entry of each user's text vector as the location
# if the location is missing, that last entry is the number of competitions instead
# so any location field containing "competition" is set to "unknown"
tmpqc = usrdf2[grepl("competition",usrdf2[["usrloc2"]]),]
usrdf2[["usrloc2"]][grepl("competition",usrdf2[["usrloc2"]])] = "unknown"
# Convert user rank to a numeric value (strip every non-digit character)
usrdf2["usrrnk2"] = as.numeric(gsub("[^0-9]","",usrdf2[["usrrnk"]]))
# Convert number of competitions to a numeric value
usrdf2["usrnumcomp2"] = as.numeric(gsub("[^0-9]","",usrdf2[["usrnumcomp"]]))
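As a quick check of the cleanup steps, here is a minimal sketch that runs the same kind of conversions on a few sample strings in the format shown in the scraped output. The vector names `rnk`, `comp`, and `loc` are made up for illustration.

```r
# Sample values in the format shown in the scraped output
rnk  <- c("1st", "2nd", "211th")
comp <- c("28 competitions", "9 competitions")

# Drop every non-digit character, then convert to numeric
rnk_num  <- as.numeric(gsub("[^0-9]", "", rnk))
comp_num <- as.numeric(gsub("[^0-9]", "", comp))
print(rnk_num)
print(comp_num)

# A "location" that is really a competition count becomes "unknown"
loc <- c("United States", "3 competitions")
loc[grepl("competition", loc)] <- "unknown"
print(loc)
```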
# Get list of countries
cntrycnt = usrdf2 %>% group_by(usrloc2) %>% summarize(cntrycnt = n()) %>% arrange(desc(cntrycnt))
cntrylist = cntrycnt$usrloc2
# write country list to a csv file for manual cleaning
# Needed to manually assign country name that googleVis chart recognizes
#write.csv(cntrylist,"cntrylist.csv")
# read in the cleaned country list
cntrylist_cleaned = read.csv("cntrylist_cleaned.csv",sep=",")
# merge cleaned country list
usrdf3 = merge(usrdf2,cntrylist_cleaned,by.x=c("usrloc2"),by.y=c("cntryName"))
head(usrdf3)
## usrloc2 usrname usrrnk usrnumcomp usrrnk2 usrnumcomp2
## 1 Argentina Martin Martin 69th 12 competitions 69 12
## 2 Austin, TX Jason Sumpter 211th 15 competitions 211 15
## 3 Australia tund 331st 8 competitions 331 8
## 4 Australia James Petterson 166th 9 competitions 166 9
## 5 Australia Elliot Dawson 147th 30 competitions 147 30
## 6 Australia Scott Thompson 267th 17 competitions 267 17
## cntryName2
## 1 Argentina
## 2 United States
## 3 Australia
## 4 Australia
## 5 Australia
## 6 Australia
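One thing to watch with this step: `merge` performs an inner join by default, so any raw location missing from the cleaned lookup is dropped without warning. The sketch below uses toy stand-ins for `usrdf2` and `cntrylist_cleaned` (the values, including the unmatched "Atlantis", are invented) to show one way to list unmatched locations with `all.x = TRUE`.

```r
# Toy stand-ins for usrdf2 and cntrylist_cleaned (values are illustrative)
users <- data.frame(
  usrname = c("A", "B", "C"),
  usrloc2 = c("Austin, TX", "Australia", "Atlantis"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  cntryName  = c("Austin, TX", "Australia"),
  cntryName2 = c("United States", "Australia"),
  stringsAsFactors = FALSE
)

# Default merge is an inner join: "Atlantis" has no lookup row and is dropped
inner <- merge(users, lookup, by.x = "usrloc2", by.y = "cntryName")
print(nrow(inner))

# all.x = TRUE keeps every user; unmatched locations get NA in cntryName2
left <- merge(users, lookup, by.x = "usrloc2", by.y = "cntryName", all.x = TRUE)
unmatched <- left$usrloc2[is.na(left$cntryName2)]
print(unmatched)
```

If `unmatched` is non-empty, those locations still need a row in the lookup file before the country summaries are complete.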
I summarize the data by country to get the following:
# Summary by country
# - number of users
# - average rank of users
# - best rank of users
# - competitions per user
cntrycnt3 = usrdf3 %>% group_by(cntryName2) %>%
  summarize(cntrycnt = n(), avgrnk = mean(usrrnk2), bestrnk = min(usrrnk2),
            totcomp = sum(usrnumcomp2)) %>%
  arrange(bestrnk)
cntrycnt3["compperusr"] = cntrycnt3[["totcomp"]]/cntrycnt3[["cntrycnt"]]
head(cntrycnt3)
## Source: local data frame [6 x 6]
##
## cntryName2 cntrycnt avgrnk bestrnk totcomp compperusr
## 1 United States 148 260.6486 1 1980 13.37838
## 2 Russian Federation 28 198.1429 4 464 16.57143
## 3 Iran 1 5.0000 5 28 28.00000
## 4 Greece 4 262.5000 6 86 21.50000
## 5 Turkey 1 7.0000 7 23 23.00000
## 6 Spain 14 233.2143 8 171 12.21429
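To make the summary columns concrete, here is a small base R sketch that computes the same quantities on toy data shaped like `usrdf3`. The countries "X" and "Y" and their values are invented, and base R is used here only so the example runs without dplyr; it is not the method used above.

```r
# Toy data in the shape of usrdf3 (values are illustrative, not scraped)
toy <- data.frame(
  cntryName2  = c("X", "X", "Y"),
  usrrnk2     = c(1, 11, 5),
  usrnumcomp2 = c(28, 12, 30),
  stringsAsFactors = FALSE
)

# One row per country: user count, average rank, best (minimum) rank, total competitions
by_cntry <- split(toy, toy$cntryName2)
smry <- do.call(rbind, lapply(by_cntry, function(d) {
  data.frame(
    cntryName2 = d$cntryName2[1],
    cntrycnt   = nrow(d),
    avgrnk     = mean(d$usrrnk2),
    bestrnk    = min(d$usrrnk2),
    totcomp    = sum(d$usrnumcomp2),
    stringsAsFactors = FALSE
  )
}))
smry$compperusr <- smry$totcomp / smry$cntrycnt

# Sort so the country with the best-ranked user comes first
smry <- smry[order(smry$bestrnk), ]
print(smry)
```

For a country with a single user, the average rank equals the best rank, as with Iran and Turkey in the output above, so the average is only informative for countries with several users.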
The plot below shows the best rank achieved by a user in each country.
# plot user data (best rank by country) in a geochart
pltdf = cntrycnt3 %>% select(-totcomp)
names(pltdf) = c("country","NumberOfUsers","AverageRank","BestRank","CompetitionsPerUser")
G = gvisGeoChart(pltdf, locationvar = "country",
                 colorvar = "BestRank",
                 options=list(height=600,width=600))
# named Tbl rather than T, since T is R's shorthand for TRUE
Tbl = gvisTable(pltdf,options = list(height=600,width=600),
                format=list(AverageRank = "#",
                            CompetitionsPerUser = "#.#"))
GT = gvisMerge(G,Tbl,horizontal = TRUE)
plot(GT)
The plot below shows the number of users by country (caveat: this is based only on the top 500 ranked users).
# Plot number of users by country
G2 = gvisGeoChart(pltdf, locationvar = "country",
colorvar = "NumberOfUsers",
options=list(height=600,width=600))
plot(G2)
All analysis was done with RStudio 0.98.1062.
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.0 googleVis_0.5.7 dplyr_0.2 stringr_0.6.2
## [5] rvest_0.1.0
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5
## [5] formatR_1.0 grid_3.1.1 gtable_0.1.2 htmltools_0.2.6
## [9] httr_0.5 knitr_1.8 magrittr_1.0.1 MASS_7.3-33
## [13] munsell_0.4.2 parallel_3.1.1 plyr_1.8.1 proto_0.3-10
## [17] Rcpp_0.11.2 RCurl_1.95-4.3 reshape2_1.4 RJSONIO_1.3-0
## [21] rmarkdown_0.3.3 scales_0.2.4 selectr_0.2-2 tools_3.1.1
## [25] XML_3.98-1.1