I wanted to explore how to scrape web data using R. I chose to scrape hotel reviews from TripAdvisor. I show snippets of code below for illustrative purposes. The full code for scraping the data is in the following location.
First I looked up the hotel's URL by typing the hotel name into the TripAdvisor site. I chose the JW Marriott in Indianapolis as an example. The main page URL is
But the reviews span multiple pages. The URLs follow a pattern; for example, page 3 of the search results has the URL
I scraped the data for a hotel by looping through each page. First I extracted the page contents using the following code:
# load the XML package (provides htmlTreeParse, getNodeSet, xmlValue and xmlAttrs)
library(XML)
# get html page content
doc=htmlTreeParse(urllink,useInternalNodes=TRUE)
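If htmlTreeParse cannot retrieve a page directly (I am assuming this can happen with HTTPS links, since the parser's built-in fetching is limited), one workaround is to download the page text first and then parse it as a string. Below is a minimal sketch of that idea. The helper name fetch_doc is hypothetical, and the demo reads a local temporary file rather than a live page.

```r
library(XML)

# Hypothetical helper: read the page source as text, then parse the text.
# readLines() accepts either a local path or a URL.
fetch_doc <- function(urllink) {
  txt <- paste(readLines(urllink, warn = FALSE), collapse = "\n")
  htmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
}

# Demo on a local file standing in for a downloaded page (made-up content)
tmp <- tempfile(fileext = ".html")
writeLines("<html><body><p class='partial_entry'>Great stay</p></body></html>", tmp)
doc_demo <- fetch_doc(tmp)
demo_text <- xmlValue(getNodeSet(doc_demo, "//p[@class='partial_entry']")[[1]])
```

The returned object can be passed to getNodeSet in the same way as doc.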
Then I looked through the source of the page to find the DOM elements/tags that can be used to get a particular hotel review. The id for each review appears as part of the href:
The set of all such links is extracted:
## get node sets
# review id
ns_id=getNodeSet(doc,"//div[@class='quote']/a[@href]")
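To see what this XPath expression selects, it may help to run it on a tiny document. The fragment below is made up for illustration; it only imitates the div/a/span nesting that the query assumes and is not the real page markup.

```r
library(XML)

# Made-up fragment imitating the structure targeted by the XPath above
html <- "<html><body>
  <div class='quote'><a href='/review-one' id='rn111'><span>First quote</span></a></div>
  <div class='quote'><a href='/review-two' id='rn222'><span>Second quote</span></a></div>
  <div class='other'><a href='/not-a-review' id='x1'>Ignored</a></div>
</body></html>"
doc_demo <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Only <a> elements that carry an href and sit directly inside a div
# with class 'quote' are matched; the link in div.other is skipped
ns_demo <- getNodeSet(doc_demo, "//div[@class='quote']/a[@href]")
n_links <- length(ns_demo)
first_attrs <- xmlAttrs(ns_demo[[1]])
```

Here first_attrs is a named character vector holding both the href and the id of the first matched link.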
By similarly looking through the page source to see where the other elements occur, I used the following code to extract the nodes that hold the top quote, the partial review entry, the rating date and the star rating:
# top quote for a review
ns_topquote=getNodeSet(doc,"//div[@class='quote']/a[@href]/span")
# get partial entry for review that shows in the page
ns_partialentry=getNodeSet(doc,"//div[@class='col2of2']//p[@class='partial_entry'][1]")
# date of rating
ns_ratingdt=getNodeSet(doc,"//div[@class='col2of2']//span[@class='ratingDate relativeDate' or @class='ratingDate']")
# rating (number of stars)
ns_rating=getNodeSet(doc,"//div[@class='col2of2']//span[@class='rate sprite-rating_s rating_s']/img[@alt]")
The actual content of interest is extracted from each node set using two key functions in the XML package: xmlValue, which returns the text inside a node, and xmlAttrs, which returns the node's attributes.
# get actual values extracted from node sets
# review id
id=sapply(ns_id,function(x) xmlAttrs(x)["id"])
# top quote for the review
topquote=sapply(ns_topquote,function(x) xmlValue(x))
# rating date (couple of formats seem to be used and hence a and b below)
ratingdta=sapply(ns_ratingdt,function(x) xmlAttrs(x)["title"])
ratingdtb=sapply(ns_ratingdt,function(x) xmlValue(x))
# rating (number of stars)
rating=sapply(ns_rating,function(x) xmlAttrs(x)["alt"])
# partial entry for review
partialentry=sapply(ns_partialentry,function(x) xmlValue(x))
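Since the results span several pages, the per-page extraction can be wrapped in a function, applied to each page, and the pieces stacked into one data frame. The following is a sketch under assumptions: the name scrape_page is hypothetical, only two of the fields are extracted, and the pages are made-up inline strings rather than downloaded pages.

```r
library(XML)

# Hypothetical per-page extractor covering two of the fields above
scrape_page <- function(doc) {
  ns_id <- getNodeSet(doc, "//div[@class='quote']/a[@href]")
  ns_topquote <- getNodeSet(doc, "//div[@class='quote']/a[@href]/span")
  data.frame(
    id = unname(sapply(ns_id, function(x) xmlAttrs(x)["id"])),
    topquote = sapply(ns_topquote, xmlValue),
    stringsAsFactors = FALSE
  )
}

# Two made-up pages standing in for consecutive result pages
pages <- c(
  "<html><body><div class='quote'><a href='/r1' id='rn1'><span>Quote one</span></a></div></body></html>",
  "<html><body><div class='quote'><a href='/r2' id='rn2'><span>Quote two</span></a></div></body></html>"
)

# Parse each page, extract its rows, and stack the results
dfall <- do.call(rbind, lapply(pages, function(p) {
  scrape_page(htmlTreeParse(p, asText = TRUE, useInternalNodes = TRUE))
}))
```

With real pages, the lapply would run over the page URLs instead of inline strings.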
To get the full text of a review, I clicked on the review. The URL of a specific review has the following form (it includes the review id):
I used the following function with each review link to extract the full review. The code just gets the node set and extracts its contents, as I did above.
# function to extract the full review given the review id and the full review urllink
getfullrev=function(urllink,id){
  # get html content of page containing full review
  docrev=htmlTreeParse(urllink,useInternalNodes=TRUE)
  # build the XPath query for the <p> node whose id is "review_" followed by the review id
  revid=paste("review_",id,sep="")
  qry=paste("//p[@id='",revid,"']",sep="")
  # extract node set containing full review
  ns_fullrev=getNodeSet(docrev,qry)
  # return NA if the page has no matching node
  if(length(ns_fullrev)==0) return(NA_character_)
  # get full review content
  return(xmlValue(ns_fullrev[[1]]))
}
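When this is run over many review links, a single page that fails to load would stop the whole loop. One option is to wrap the lookup so that failures yield NA instead. This is a sketch: safe_fullrev is a hypothetical name, and the demo uses a local temporary file with made-up content in place of a real review page.

```r
library(XML)

# Hypothetical wrapper: return NA instead of stopping when parsing raises an
# error or the page does not contain the expected <p id='review_...'> node
safe_fullrev <- function(urllink, id) {
  tryCatch({
    docrev <- htmlTreeParse(urllink, useInternalNodes = TRUE)
    qry <- paste0("//p[@id='review_", id, "']")
    ns_fullrev <- getNodeSet(docrev, qry)
    if (length(ns_fullrev) == 0) NA_character_ else xmlValue(ns_fullrev[[1]])
  }, error = function(e) NA_character_)
}

# Demo with a local file standing in for a review page
tmp <- tempfile(fileext = ".html")
writeLines("<html><body><p id='review_123'>Full text here</p></body></html>", tmp)
found <- safe_fullrev(tmp, "123")
missing_id <- safe_fullrev(tmp, "999")
```

Rows whose full review comes back as NA can then be retried or dropped later.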
The full code is in the following location. The first 6 records of the dataset created with this code are shown below
## id
## 1 rn220694072
## 2 rn220524586
## 3 rn220162069
## 4 rn220108997
## 5 rn219995558
## 6 rn219906283
## topquote ratingdt
## 1 Excellent Hotel--especially if you can get a government rate! 2014-08-09
## 2 Close to football and baseball 2014-08-08
## 3 Best hotel experience ever! 2014-08-07
## 4 Great hotel 2014-08-06
## 5 Room with a VIEW 2014-08-06
## 6 Probably the best place to stay in Indy, plus great gym 2014-08-05
## rating
## 1 5 of 5 stars
## 2 5 of 5 stars
## 3 5 of 5 stars
## 4 5 of 5 stars
## 5 5 of 5 stars
## 6 4 of 5 stars
## partialentry
## 1 \nJW Marriott Hotels are among my favorite in the Marriott chain and on the rare occasion that they offer a government rate, they are an excellent place to stay. One of the largest hotels in the Midwest, the JW Indianapolis is about 5 blocks from the center of Indianapolis. My room was exceptionally clean and comfortable. I had access to...\n\n\nMore \n\n
## 2 \nThis property is close to their minor league baseball team and just a short distance from Lucas Field, home of the Colts. It is across the street from the convention center and government buildings. It is about 4 -5 blocks from the city center. The staff is exceptionally friendly and helpful. The rooms were cleaned-up on time. There was no...\n\n\nMore \n\n
## 3 \nAll the services from hotel staff, to the FedEx office to the restaurants were top notch! All the wait staff for both the restaurants gave excellent service as well as personable service! The Fedex store was extremely helpful and sympathetic to my crazy stressful circumstance of my books not showing up in time from my publisher. The hotel staff offered...\n\n\nMore \n\n
## 4 \nI've stayed at this hotel several times over the last year and have always enjoyed the experience. The location is ideal for access to Indianapolis sites and restaurants. Great hotel for the money - my first choice when traveling to Indianapolis.\n
## 5 \nMy husband's company had their annual sales meeting there...I was very lucky to accompany him. I loved this hotel! Very nice open lobby that one comes to expect from a Marriott. However our room, VIEW, 21st floor, had three floor to ceiling widows that looked over the White River and surrounding park, walk ways, museums, ball parks, and zoo. I...\n\n\nMore \n\n
## 6 \nStayed here for a few months.\n3 restaurants (sports bar, lounge bar, italian) all worth eating at!\nProperty is HUGE, used for many conferences.\nStarbucks on the 2nd floor - go up the elevator to the left of the entrance.\nThe gym is one of the best I have used, and rarely packed. Free weights always available.\nClose to the...\n\n\nMore \n\n
## ratingnum id2
## 1 5 220694072
## 2 5 220524586
## 3 5 220162069
## 4 5 220108997
## 5 5 219995558
## 6 4 219906283
## fullrev
## 1 \nJW Marriott Hotels are among my favorite in the Marriott chain and on the rare occasion that they offer a government rate, they are an excellent place to stay. One of the largest hotels in the Midwest, the JW Indianapolis is about 5 blocks from the center of Indianapolis. My room was exceptionally clean and comfortable. I had access to the concierge lounge which was well stocked in the evening/mornings and all of the staff I encountered were exceptionally professional and courteous. Wifi is free to Gold/Platinum Marriott members. Lots of choices for dining on the property and they are really close to downtown Indianapolis and one of my favorite restaurants (Palominos).\n
## 2 \nThis property is close to their minor league baseball team and just a short distance from Lucas Field, home of the Colts. It is across the street from the convention center and government buildings. It is about 4 -5 blocks from the city center. The staff is exceptionally friendly and helpful. The rooms were cleaned-up on time. There was no back-up in checking in as there often is when large groups check in. Concierge was helpful. Free wi-fi in the lobby. Starbucks on the second level\n
## 3 \nAll the services from hotel staff, to the FedEx office to the restaurants were top notch! All the wait staff for both the restaurants gave excellent service as well as personable service! The Fedex store was extremely helpful and sympathetic to my crazy stressful circumstance of my books not showing up in time from my publisher. The hotel staff offered great local suggestions and were even courteous during extremely busy times!\n
## 4 \nI've stayed at this hotel several times over the last year and have always enjoyed the experience. The location is ideal for access to Indianapolis sites and restaurants. Great hotel for the money - my first choice when traveling to Indianapolis.\n
## 5 \nMy husband's company had their annual sales meeting there...I was very lucky to accompany him. I loved this hotel! Very nice open lobby that one comes to expect from a Marriott. However our room, VIEW, 21st floor, had three floor to ceiling widows that looked over the White River and surrounding park, walk ways, museums, ball parks, and zoo. I am sure I am forgetting something because it was so expansive...while my husband was at meetings I could spend my days taking in all this area had to offer! The hotel staff was friendly and accommodating, food was good at all of his meetings. This JW was the best location ever for a business meeting...close to all the above plus downtown shopping and restaurants! Great stay:)\n
## 6 \nStayed here for a few months.3 restaurants (sports bar, lounge bar, italian) all worth eating at!Property is HUGE, used for many conferences.Starbucks on the 2nd floor - go up the elevator to the left of the entrance.The gym is one of the best I have used, and rarely packed. Free weights always available.Close to the river/canal walks,other hotels, the Indianapolis state museum.What would have made it a 5-star?Rooms are quiet, rarely heard anything\n
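The output above includes two derived columns, ratingnum and id2, that the snippets did not cover. The sketch below shows one way they could be computed from the raw strings. This is my assumption about the derivation, not necessarily what the full code does; the input values are copied from the output above.

```r
# Raw strings copied from the output above
id <- c("rn220694072", "rn219906283")
rating <- c("5 of 5 stars", "4 of 5 stars")

# Numeric rating: keep the number that precedes " of"
ratingnum <- as.numeric(sub("^([0-9.]+) of.*$", "\\1", rating))

# Numeric part of the review id: drop the leading "rn"
id2 <- sub("^rn", "", id)
```

Here id2 is kept as a character string; it could be converted with as.numeric if a number is needed.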
Next I did some exploratory analysis with this data, which is described here.