Explore Kaggle Competition data

I wanted to explore the competition data on Kaggle and see if I could answer the following questions:

  • Is prize money a motivating factor for participation?

I tried to answer this question by looking at a scatter plot of the number of participating teams vs. prize money. My hypothesis before starting the analysis was that prize money is not a key motivator, and that the satisfaction of tackling a challenging problem is what drives participation.

  • Is prize money set based on the perceived difficulty of the problem?

I took the number of days a competition ran as a surrogate for the difficulty as perceived by the sponsor of the competition (this may not be the best metric, but in the absence of anything else I decided to use it). I tried to answer this question by looking at a scatter plot of prize money vs. duration of the competition.

  • Which knowledge competitions are popular?

I looked at a bar graph of the number of teams participating in each knowledge competition to answer this question.

Scrape Kaggle Competition data

The data I need for this exercise is the list of all competitions that were run, along with the prize money, number of teams, and duration of each competition. All of this data is available at the following site. The page only shows active competitions by default. To view all competitions, I needed to click a checkbox on the webpage, which runs a JavaScript component to retrieve the completed competitions. I saved the webpage locally (using the browser option to save a complete webpage), since that produces an HTML file that contains the results of the executed JavaScript and hence has both active and completed competitions.

# set working directory
setwd("~/notesofdabbler/githubfolder/blog_notesofdabbler/exploreKaggle/")

# load libraries
library(rvest)
library(dplyr)
library(googleVis)
op <- options(gvis.plot.tag='chart')
library(ggplot2)

# Get list of competitions
localfile = "kaggleCompetitionList.html"
# read_html() replaces html(), which was deprecated in later rvest releases
compdata = read_html(localfile)

# get names of competitions
compnames = compdata %>% html_nodes(".competition-details h4") %>% html_text()
# get links for each competition
complinks = compdata %>% html_nodes(".competition-details") %>% html_node("a") %>% html_attr("href") 
# get the competition type (knowledge, job or prize amount)
comptype = compdata %>% html_nodes(xpath = "//tr//td[2]") %>% html_text()

# Assign prizeCompetition label to competitions that have prizes
comptype2 = comptype
comptype2[grepl("[$]",comptype)] = "prizeCompetition"

# get numeric value of prize for prize competitions and set to 0 for other competition types
compprize = ifelse(comptype2 == "prizeCompetition",comptype,0)
compprize = as.numeric(gsub("[$,]","",compprize))

# get number of teams
compnumteams = compdata %>% html_nodes(xpath = "//tr//td[3]") %>% html_text()
compnumteams = as.numeric(compnumteams)

# combine into a dataframe
compdf = data.frame(compnames,complinks,comptype,comptype2,compprize,compnumteams,stringsAsFactors = FALSE)

head(compdf)
##                                              compnames
## 1                                Heritage Health Prize
## 2                                      GE Flight Quest
## 3 Flight Quest 2: Flight Optimization, Milestone Phase
## 4      Flight Quest 2: Flight Optimization, Main Phase
## 5     Flight Quest 2: Flight Optimization, Final Phase
## 6                           National Data Science Bowl
##                                    complinks comptype        comptype2
## 1   http://www.heritagehealthprize.com/c/hhp $500,000 prizeCompetition
## 2            http://www.gequest.com/c/flight $250,000 prizeCompetition
## 3 http://www.gequest.com/c/flight2-milestone $250,000 prizeCompetition
## 4      http://www.gequest.com/c/flight2-main $220,000 prizeCompetition
## 5     http://www.gequest.com/c/flight2-final $220,000 prizeCompetition
## 6    http://www.kaggle.com/c/datasciencebowl $175,000 prizeCompetition
##   compprize compnumteams
## 1    500000         1353
## 2    250000          173
## 3    250000          130
## 4    220000          122
## 5    220000           33
## 6    175000          148
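The prize parsing above can be checked on a small hand-made vector. This is a minimal sketch using only base R. The values in `sample_type` are made up for illustration, and the "Jobs" label in particular is an assumption about how the page labels job competitions.

```r
# Hypothetical values standing in for the second table column of the page
sample_type = c("$500,000", "Knowledge", "Jobs", "$250")

# Entries containing a dollar sign are treated as prize competitions
is_prize = grepl("[$]", sample_type)

# Keep the prize text for prize competitions and "0" otherwise,
# then strip "$" and "," before converting to numeric
sample_prize = ifelse(is_prize, sample_type, "0")
sample_prize = as.numeric(gsub("[$,]", "", sample_prize))

print(sample_prize)
```

Non-prize entries end up with a prize of 0, which is why the scatter plots later filter on `compprize > 0`.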

To get the duration of a competition, I needed to get the data from the page for the competition.

# function to extract number of days a competition was run
# input is the specific page for the competition
# the total days is in parenthesis and regex is used to extract that
getDays = function(htmlnode){
  txt = htmlnode %>% html_text()
  txtlefttrim = gsub("^.*\\(","",txt)
  txtrttrim = gsub("\\).*$","",txtlefttrim)
  numdays = gsub("[,a-zA-Z ]","",txtrttrim)
  numdays = as.numeric(numdays)
  return(numdays)
}

# Get duration of each competition

# preallocate; competitions without a duration note keep the default of 0
duration = rep(0,length(complinks))
for(i in seq_along(complinks)){
  # read_html() replaces the deprecated html()
  comppg = read_html(complinks[i])
  durationNode = comppg %>% html_nodes("#end-time-note")
  if(length(durationNode) > 0){
    duration[i] = getDays(durationNode)
  }
}

compdf["duration"] = duration

head(compdf)
##                                              compnames
## 1                                Heritage Health Prize
## 2                                      GE Flight Quest
## 3 Flight Quest 2: Flight Optimization, Milestone Phase
## 4      Flight Quest 2: Flight Optimization, Main Phase
## 5     Flight Quest 2: Flight Optimization, Final Phase
## 6                           National Data Science Bowl
##                                    complinks comptype        comptype2
## 1   http://www.heritagehealthprize.com/c/hhp $500,000 prizeCompetition
## 2            http://www.gequest.com/c/flight $250,000 prizeCompetition
## 3 http://www.gequest.com/c/flight2-milestone $250,000 prizeCompetition
## 4      http://www.gequest.com/c/flight2-main $220,000 prizeCompetition
## 5     http://www.gequest.com/c/flight2-final $220,000 prizeCompetition
## 6    http://www.kaggle.com/c/datasciencebowl $175,000 prizeCompetition
##   compprize compnumteams duration
## 1    500000         1353      730
## 2    250000          173      103
## 3    250000          130       50
## 4    220000          122      114
## 5    220000           33       35
## 6    175000          148       91
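To see what the regular expressions in getDays do, the same three substitutions can be applied to plain strings. This is a sketch only: `extractDays` is a hypothetical helper that mirrors getDays without the html_text() step, and the sample strings are an assumed format for the text of the end-time note, not copied from a real page.

```r
# Hypothetical helper: same regex steps as getDays, but on a plain string
extractDays = function(txt){
  inner = gsub("^.*\\(", "", txt)             # drop everything up to the last "("
  inner = gsub("\\).*$", "", inner)           # drop everything from the first ")" on
  as.numeric(gsub("[,a-zA-Z ]", "", inner))   # remove commas, letters and spaces
}

# Assumed note formats, for illustration only
note1 = "Started: 4 Apr 2011 Ended: 3 Apr 2013 (730 total days)"
note2 = "Some long-running competition (1,095 total days)"

print(extractDays(note1))
print(extractDays(note2))
```

The approach relies on the day count being the only parenthesized number in the note. If a page used a different layout, the result would be NA or a wrong number, so it is worth spot-checking a few values of `duration`.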

Analyze Kaggle competition data

First, a tooltip field containing the competition name, prize money, duration and number of teams is created for use in the googleVis charts.

# create a field to show as tooltip in googleVis scatter plot
# this has the following information:
# Competition name, prize, duration, number of teams
compdf[["pop.html.tooltip"]] = paste(compdf[["compnames"]],"</br>",
                                     "Prize ($):",compdf[["compprize"]],"</br>",
                                     "Duration (days):",compdf[["duration"]],"</br>",
                                     "Number of teams:",compdf[["compnumteams"]],sep="")
compdf[["pop.html.tooltip"]][1]
## [1] "Heritage Health Prize</br>Prize ($):5e+05</br>Duration (days):730</br>Number of teams:1353"
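The tooltip above shows the prize as 5e+05 because paste() converts the number to text using R's default formatting. One possible way to get a more readable value, sketched here with base R's formatC() on a made-up vector, is to format the prize before pasting. The variable names are hypothetical and this is not part of the original analysis.

```r
# Hypothetical prize values for illustration
prize = c(500000, 175000, 250)

# Fixed notation with thousands separators instead of scientific notation
prize_txt = formatC(prize, format = "d", big.mark = ",")

# Build a tooltip fragment the same way as above, using the formatted text
tip = paste("Prize ($):", prize_txt, sep = "")
print(tip)
```

In the tooltip code this would amount to replacing `compdf[["compprize"]]` with its formatted version inside paste().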
  • Is prize money a motivating factor for participation?

The scatter plot below of number of teams vs. prize money doesn't show much of a trend, which suggests that prize money is not a key motivating factor for participation. The data only includes competitions that offered prize money and are public. (Note: you can zoom by dragging with the left mouse button and reset the zoom by right-clicking the plot. You can also hover over a point to see information on the competition it represents.)

# plot of number of teams vs prize
pltdf = compdf[,c("compprize","compnumteams","pop.html.tooltip")] %>% filter(compprize > 0)
plt = gvisScatterChart(pltdf,options=list(tooltip="{isHtml:'true'}",
                                          explorer="{actions: ['dragToZoom', 
                                          'rightClickToReset'],
                                          maxZoomIn:0.05}",
                                          vAxis="{title:'# teams'}",
                                          hAxis="{title:'Prize ($)'}",
                                          width=600,height=600))
plot(plt)
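A visual impression of "no trend" can be backed up with a rank correlation. The sketch below is an assumption about how one might do that and is not part of the original analysis. It uses only the six rows printed by head(compdf) earlier, which are the largest prizes and therefore not a representative sample, so the number it prints illustrates the call rather than a result.

```r
# The six rows shown in head(compdf) above; illustration only
prize = c(500000, 250000, 250000, 220000, 220000, 175000)
teams = c(1353, 173, 130, 122, 33, 148)

# Spearman correlation works on ranks, so a few very large prizes
# do not dominate the way they would with Pearson correlation
rho = cor(prize, teams, method = "spearman")
print(rho)

# On the full data this would be something like:
# with(pltdf, cor(compprize, compnumteams, method = "spearman"))
```

A value near 0 on the full data would be consistent with the reading of the scatter plot, while a value near 1 or -1 would call for a second look.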
  • Is prize money set based on the perceived difficulty of the problem?

The scatter plot of prize money vs. competition duration (a surrogate for difficulty) does not show a trend. The caveat is that competition duration might not be the right measure of difficulty. One hypothesis is that the sponsor sets the prize money based more on the value they expect to gain from implementing the solution than on the difficulty of solving the problem.

# plot of duration vs prize
pltdf = compdf[,c("compprize","duration","pop.html.tooltip")] %>% filter(compprize > 0,duration > 0)
plt = gvisScatterChart(pltdf,options=list(tooltip="{isHtml:'true'}",
                                          explorer="{actions: ['dragToZoom', 
                                          'rightClickToReset'],
                                           maxZoomIn:0.05}",
                                          vAxis="{title:'Duration (days)'}",
                                          hAxis="{title:'Prize ($)'}",
                                          width=600,height=600))
plot(plt)
  • Which knowledge competitions are popular?

The bar graph below shows the knowledge competitions in decreasing order of number of participating teams. The top 2 competitions are “Titanic Machine Learning” and “Bike Sharing Demand”.

p=ggplot(data=compdf %>% filter(comptype2 == "Knowledge"),
       aes(x=reorder(compnames,compnumteams),y=compnumteams))+geom_bar(stat="identity")+
       xlab("")+ylab("# Teams")+
       coord_flip()+theme_bw()
print(p)
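The ordering of the bars comes from reorder(), which sorts the levels of the name factor by the team counts. A minimal base R sketch with made-up names and counts (not real competition data) shows the effect:

```r
# Toy data for illustration only
toy_names = c("comp A", "comp B", "comp C")
toy_teams = c(10, 300, 50)

# reorder() returns a factor whose levels are sorted by increasing toy_teams
f = reorder(toy_names, toy_teams)
print(levels(f))
```

The levels come out in increasing order of team count. In the plot, coord_flip() then draws the first level at the bottom, so the competition with the most teams ends up at the top of the chart.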