Recently, Instacart released a dataset of ~3 million orders made by ~200,000 users at different days of week and times of day. There is also an ongoing Kaggle competition to predict which products a user will buy again. My goal here is more modest where I just wanted to explore the dataset to find patterns of purchasing behaviour by hour of day, day of week and number of days since prior order. An example of this kind of analysis is also shown in their blog. Here I wanted to explore if I can find such kind of patters by using the very common and popular dimension reduction technique - Principal Component Analysis (PCA). There are several great resources that introduce PCA if you are not familiar with PCA. One of the resources is the set of video lectures on machine learning by Prof. Hastie and Prof. Tibshirani.
The general approach that I have followed is:
Spoiler Alert: Since my analysis is basic, don’t be disappointed if there are no big Aha moments (there will be none). But I think it is still fun to see how we can extract such information directly from data.
We first load the libraries needed for the analysis
# load libraries
library(readr)
library(dplyr)
library(ggplot2)
library(scales)
library(tidyr)
library(htmlTable)
# source functions that are used for PCA analysis, extracting key information from PCA results
source("pca_fns.R")
I downloaded the data from the following link. The data dictionary is in the following link. Since the datasets are big, they are stored in tmpdata folder and not currently in the git repository. So somebody trying to reproduce this code will need to manually download the data and extract it to tmpdata subdirectory to be able to run this code.
# load datasets
aisles = read_csv("tmpdata/aisles.csv")
depts = read_csv("tmpdata/departments.csv")
prods = read_csv("tmpdata/products.csv")
orders = read_csv("tmpdata/orders.csv")
orders$order_hour_of_day = as.numeric(orders$order_hour_of_day)
ordprd_prior = read_csv("tmpdata/order_products__prior.csv")
ordprd_train = read_csv("tmpdata/order_products__train.csv")
# stack order-product datasets (prior and train) together
ordprd = bind_rows(ordprd_prior, ordprd_train)
# Basic Data Info
#
# number of users
length(unique(orders$user_id))
## [1] 206209
# number of orders
nrow(orders)
## [1] 3421083
Below are some basic info on the datasets
Next, we find products that account for top 80% of orders. We will work with this product set through the rest of the analysis
# Find products that account for top 80% of total orders
prod_orders = ordprd %>% group_by(product_id) %>% summarize(numorders = n())
prod_orders = inner_join(prod_orders, prods, by = "product_id")
prod_orders = prod_orders %>% arrange(desc(numorders)) %>% mutate(pctorders = cumsum(numorders)/sum(numorders)*100)
prod_top80pct = prod_orders %>% filter(prod_orders$pctorders <= 80)
The number of products are ~50K or which ~5K account for 80% of total orders
We first explore the number of orders by day of week and hour of day.
# orders by day of week/hour of day
wkhr_orders = orders %>% group_by(order_dow, order_hour_of_day) %>% summarize(numorders = n())
# Plot of orders by hour of day for each week of day
ggplot(data = wkhr_orders) +
geom_line(aes(x = order_hour_of_day, y = numorders, color = factor(order_dow))) +
xlab("Hour of Day") + ylab("# Orders") + scale_color_discrete(name = "Day of Week") +
theme_bw()
It is not clear whether day 0 is Saturday or Sunday. Since the number of orders is higher for days 0 and 1, maybe 0 and 1 are Saturday and Sunday. But certain other analysis further down seem to indicate that 0 is Sunday.
# create a dataset that calculates for each product, the percentage of product orders at each hour of day
ordprd2 = inner_join(ordprd, orders[,c("order_id","order_hour_of_day")], by = "order_id")
ordprd2a = ordprd2 %>% group_by(product_id, order_hour_of_day) %>%
summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd2a = inner_join(ordprd2a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")
Find products with different patterns of purchase timing by hour of day with PCA. Dataset for PCA has for each product (rows), the percentage of product orders at each hour of day (column)
ordprd2a_s = ordprd2a %>% select(-numorders) %>% spread(order_hour_of_day, pctorders, fill = 0)
Here I am plotting the average purchasing pattern by hour of day for a sample product item (chocolate sandwich cookies).
# Plot a row (product_id - 1, chocolate sandwich cookies)
ordprd2a_sample = data.frame(hour_of_day = names(ordprd2a_s)[-1], pctorders = t(ordprd2a_s[ordprd2a_s$product_id == 24852,-1]))
ggplot() + geom_line(data = ordprd2a_sample, aes(x = as.numeric(hour_of_day), y = pctorders)) + theme_bw()
Next we will do a PCA on this data. Since all the data is in percentages, I didn’t do any further scaling of data
# Perform a PCA (rows - products, variables - hour of day)
pcdata = ordprd2a_s
pc_mdl = prcomp(x = pcdata[,-1], center = TRUE, scale. = FALSE)
# Get the plot of cumulative variance explained by each principal component
pc_mdl_var = cumsum(pc_mdl$sdev^2)/sum(pc_mdl$sdev^2)
pc_mdl_vardf = data.frame(pc = 1:length(pc_mdl_var), cumvar = pc_mdl_var)
p_cumvar = ggplot() +
geom_line(data = pc_mdl_vardf, aes(x = pc, y = cumvar)) +
geom_point(data = pc_mdl_vardf, aes(x = pc, y = cumvar)) +
xlab("PC") + ylab("Cumulative Variance") +
scale_x_continuous(breaks = 1:length(pc_mdl_var)) +
scale_y_continuous(labels = percent, breaks = seq(0,1,by=0.1)) + expand_limits(y = 0) +
theme_bw(20)
p_cumvar
The plot of cumulative variance shows that first component accounts for 44% of variance, first two account for 58% and first 3 account for 67% of variance. Next, we will look at the first two loadings since first 2 components account for 58% of variance.
# Get loadings plot for 2 principal components
numcomp = 2
pc_loadings = data.frame(pc_mdl$rotation)
pc_loadings$hour_of_day = seq(0,23)
pc_loadings_g = pc_loadings %>% gather(pc, pcload, -hour_of_day)
pcfilt = paste0("PC",seq(1,numcomp))
pc_loadings_g = pc_loadings_g[pc_loadings_g$pc %in% pcfilt,]
p_pcload = ggplot() + geom_line(data = pc_loadings_g, aes(x = hour_of_day, y = pcload, color = pc)) +
theme_bw(20)
p_pcload
First principal component loading PC1 indicates a pattern of either higher percentage of purcahses in the morning or evening. The second principal component loading indicates a pattern where there is higher purchase around 11am and 4pm. To check which product items follow these patterns, we look at products that either have high scores or low scores on a principal component. So here we take the top 20 and bottom 20 products in terms of their scores on PC1. The actual pattern still may not quite match the loading plot since the overall pattern is a combination of all principal component loadings.
# Scatter plot of PC1 and PC2 scores
pc_scores = data.frame(pc_mdl$x)
pc_scores$product_id = pcdata$product_id
pc_scores = inner_join(pc_scores, prods[,c("product_id", "product_name")], by = "product_id")
# sort products based on PC1 scores
pc_scores_pick = pc_scores %>% arrange(desc(PC1))
# get top and bottom 20 products based on the scores
prod_top20 = pc_scores_pick$product_name[1:20]
prod_bot20 = pc_scores_pick$product_name[(nrow(pc_scores_pick) - 19):nrow(pc_scores_pick)]
# plot hourly purchase profile for a list of products that are in top and bottom 20 scores of PC1
prodlist_topbot20 = data.frame(top_scores = prod_top20, bottom_scores = prod_bot20)
prodlist = c(prod_top20, rev(prod_bot20))
prodlist_label = c(rep("top 20 scores", 20), rep("bottom 20 scores", 20))
scoreid = c(seq(1,20), seq(1,20))
prodlistdf = data.frame(product_name = prodlist, prodlist_label, scoreid, stringsAsFactors = FALSE)
prodlistdf = inner_join(prodlistdf, prods[, c("product_id", "product_name")], by = "product_name")
prodlist_profile = inner_join(ordprd2a_s, prodlistdf[, c("product_id", "product_name", "prodlist_label")], by = "product_id")
prodlist_profile_g = prodlist_profile %>% gather(hour_of_day, pctorders, -product_name, -product_id, -prodlist_label)
ggplot() +
geom_line(data = prodlist_profile_g, aes(x = as.numeric(hour_of_day), y = pctorders,
group = product_name, color = prodlist_label)) +
xlab("Hour of Day") + ylab("% orders") +
scale_x_continuous(breaks = seq(0,23)) + scale_y_continuous(labels = percent) +
theme_bw(20) + theme(legend.title = element_blank())
Below is the table that lists the actual products that are in top and bottom scores of PC1. Ice cream purchases tend to occur more in the evening. Items like granola bars, krispie treats, apples are purchased more in the morning.
htmlTable(prodlist_topbot20)
top_scores | bottom_scores | |
---|---|---|
1 | Salted Caramel Core Ice Cream | Pistachios |
2 | Chocolate Fudge Brownie Ice Cream | Sparkling Water, Bottles |
3 | The Tonight Doughâ„¢ Ice Cream | Popcorn |
4 | Cashew Milk Salted Caramel Cluster Non-Dairy Frozen Dessert | Extra Fancy Unsalted Mixed Nuts |
5 | Ice Cream, Super Premium, Mint Chocolate Chip | Cheez-It Cheddar Cracker |
6 | Raspberry Chocolate Chip Gelato | Variety |
7 | Coconut Milk Non Dairy Frozen Dessert No Sugar Added Mint Chip | Peanut Milk Chocolate XXL Bag |
8 | Organic Cookies & Cream Ice Cream | Cheez-It Baked Snack Crackers |
9 | Half Baked® Ice Cream | Crunchy Oats 'n Honey Granola Bars |
10 | Half Baked Frozen Yogurt | 0% Greek Strained Yogurt |
11 | Chocolate Chocolate Chip Ice Cream | Original Rice Krispies Treats |
12 | New York Super Fudge Chunk® Ice Cream | Milk Chocolate Almonds |
13 | Butter Pecan Ice Cream | Apples |
14 | Vanilla, Chocolate, Strawberry Ice Cream | Trail Mix |
15 | Mint Chocolate Cookie Ice Cream | Nutri Grain Bars Multi Pack |
16 | Caramel Cone Ice Cream | Dry Roasted Almonds |
17 | Chocolate Peanut Butter Cup Gelato | Sweet & Salty Nut Granola Bars Peanut |
18 | Americone Dream® Ice Cream | Pub Mix |
19 | Cherry Garcia Ice Cream | Half And Half Ultra Pasteurized |
20 | Dulce de Leche Caramel Ice Cream | Wint-O-Green |
Dataset for PCA has for each product (rows), the percentage of product orders at each day of week (column)
ordprd3 = inner_join(ordprd, orders[, c("order_id", "order_dow")], by = "order_id")
ordprd3a = ordprd3 %>% group_by(product_id, order_dow) %>%
summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd3a = inner_join(ordprd3a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")
ordprd3a_s = ordprd3a %>% select(-numorders) %>% spread(order_dow, pctorders, fill = 0)
The different analysis for PCA done for the case of purchase by hour of day have been wrapped into 3 functions that are defined in pca_funs.R
(code)[pca_funs.R]
# perform PCA and plot cumulative variance
pcmdl_dow = getpca(pcdata = ordprd3a_s)
pcmdl_dow$p_cumvar
In this case, the first two principal components account for ~90% of variance. The loadings for first two components are plotted next.
pcplots_dow = get_pcaplots(pcmdl_dow, numcomp = 2, ordprd3a_s, ordprd3a_s, prods, seq(0,6), "day of week")
pcplots_dow$p_pcload
The products with high scores on first component would tend to be purchased more at the ends of the week and products with low scores on first component would tend to be purchased less at the ends of the week. The plot of products having top and bottom 20 scores on PC1 is shown next.
pcplots_dow$p_items[[1]]
The table below shows the names of products with top and bottom 20 scores on PC1. It looks like cookies and chocolates are bought less in weekends compared to cooking items.
htmlTable(pcplots_dow$prodlist_topbot20[[1]])
top_scores | bottom_scores | |
---|---|---|
1 | Black Beans Reduced Sodium | Pub Mix |
2 | 99% Fat Free Natural Goodness Chicken Broth | Wint-O-Green |
3 | Mild Diced Green Chiles | Peanut Milk Chocolate XXL Bag |
4 | 90% Lean Ground Beef | White Multifold Towels |
5 | Organic Brussels Sprouts | Sweet & Salty Nut Granola Bars Peanut |
6 | Jumbo Lump Crab Cake | Trail Mix |
7 | Organic No Salt Added Diced Tomatoes | Milk Chocolate Almonds |
8 | Spaghetti Squash | Nutri Grain Bars Multi Pack |
9 | Boneless Beef Chuck Roast | Cheez-It Baked Snack Crackers |
10 | Tikka Masala Simmer Sauce | 49 Flavors Jelly Belly Jelly Beans |
11 | Golden Sweet Potato | Milk Chocolate M&Ms |
12 | No Salt Added Black Beans | Kit Kat & Reese's Assorted Miniatures |
13 | Poblano Pepper | Fruit Snacks |
14 | Crushed Tomatoes With Basil | Cookie Tray |
15 | Organic Tomato Sauce No Salt Added | Original Rice Krispies Treats |
16 | Organic Great Northern Beans | Crunchy Oats 'n Honey Granola Bars |
17 | Mexican Style Hominy | Half And Half Ultra Pasteurized |
18 | Boneless Skinless Chicken Breasts | Dry Roasted Almonds |
19 | Dark Red Kidney Beans No Salt Added | Extra Fancy Unsalted Mixed Nuts |
20 | Petite Diced Tomatoes | Variety |
The products with high scores on second component would tend to be purchased more at the end of the week and products with low scores on first component would tend to be purchased more at the beginning of the week. The plot of products having top and bottom 20 scores on PC2 is shown next.
pcplots_dow$p_items[[2]]
Looks like some wine and vodka is purchased more as the week progresses.
htmlTable(pcplots_dow$prodlist_topbot20[[2]])
top_scores | bottom_scores | |
---|---|---|
1 | Handmade Vodka From Austin, Texas | BBQ Chopped Salad |
2 | Little Sumpin' Sumpin' Ale | Teriyaki Beef Jerky |
3 | IPA | Honey Turkey |
4 | Handmade Vodka | Blueberry Pecan Plus Fiber Fruit & Nut Bar |
5 | Humboldt Fog | Extra Lean Ground Turkey Breast |
6 | Belgian White Wheat Ale | Turkey |
7 | Prosecco Sparkling Wine | Chocolate Mint Builder's Bar |
8 | Lemon-Lime Soda | Ground Turkey Breast |
9 | Vanilla Cupcakes | Fresh Cut Baby Carrots |
10 | Camembert | Organic Chocolate Chip ZBar Kids Energy Snack |
11 | Challah Bread | Builder's Chocolate Peanut Butter Bar |
12 | Hash Browns Country Style | Whole Grain Brown Ready Rice |
13 | Tostitos Scoops | Boneless And Skinless Chicken Breast |
14 | Pure Cane Confectioners 10-X Powdered Sugar | Dark Chocolate Nuts & Sea Salt Bars |
15 | Milk Chocolate Candy Bars | Taco Seasoning |
16 | Beer | Organic Boneless Skinless Chicken Breast |
17 | Tonic Water | Plus Cranberry Almond + Antioxidants with Macadamia Nuts Bar |
18 | Ice | Crispy Cheddar Crackers |
19 | Premium Lager Beer | Organic Zucchini Spirals |
20 | Puffs Cheese Flavored Snacks | Crunchy Peanut Butter Builder's Bar |
Dataset for PCA has for each product (rows), the percentage of product orders at each of days since prior order (column)
ordprd4 = inner_join(ordprd, orders[!is.na(orders$days_since_prior_order), c("order_id", "days_since_prior_order")], by = "order_id")
ordprd4a = ordprd4 %>% group_by(product_id, days_since_prior_order) %>%
summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd4a = inner_join(ordprd4a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")
ordprd4a_s = ordprd4a %>% select(-numorders) %>% spread(days_since_prior_order, pctorders, fill = 0)
# PCA of purchase timing (days since prior order)
pcmdl_prior = getpca(pcdata = ordprd4a_s)
pcmdl_prior$p_cumvar
First two principal components account for 75% of variance. The loadings for the first two components is shown next.
pcplots_prior = get_pcaplots(pcmdl_prior, numcomp = 2, ordprd4a_s, ordprd4a_s, prods, seq(0, 30), "days since prior order")
pcplots_prior$p_pcload
Products with high scores in first principal component will tend to be purchased more in orders that are less than a week from previous order and products with low scores on the first principal component will tend to be purchased in orders that are more than a month after the previous order
pcplots_prior$p_items[[1]]
Items like trash bag, laundary detergents, towels are in orders that are made less often (longer than once per month) whereas the food/drink related items are in orders that are made more frequently (less than a week)
htmlTable(pcplots_prior$prodlist_topbot20[[1]])
top_scores | bottom_scores | |
---|---|---|
1 | Ultra-Purified Water | Tall Kitchen Bag With Febreze Odor Shield |
2 | Grass milk Raw Cheddar Cheese | Fresh Scent Laundry Detergent |
3 | White with 30% Cacao Content Organic Chocolate | All Natural White Vinegar |
4 | Baby Food Pouch - Butternut Squash, Carrot & Chickpea | Fabric Softener Dryer Sheet Outdoor Fresh 160CT Fabric Enhancers |
5 | Organics Big Bird's Apple 100% Juice | XL Pick-A-Size Paper Towel Rolls |
6 | Stage 1 Apples Sweet Potatoes Pumpkin & Blueberries Organic Pureed Baby Food | April Fresh Liquid Fabric Softener |
7 | Organic Whole Milk | Shredded Parmigiano Reggiano |
8 | Whole Organic Omega 3 Milk | Makeup Remover Cleansing Towelettes |
9 | Just Sweet Potato Baby Food | Glass Cleaner |
10 | Organic Homogenized Whole Milk | Premium Sliced Bacon |
11 | Honeydew Chunks | Organic Variety Pack |
12 | 100% Whole Wheat Bagels | Chicken Apple Sausage |
13 | Half And Half Ultra Pasteurized | Solid White Albacore Tuna |
14 | Organic Stage 2 Winter Squash Baby Food Puree | Versatile Stain Remover |
15 | Very Berry Flavor Sparking Mineral Water | Crunchy Rice Rollers |
16 | Baby Food Pouch - Roasted Carrot Spinach & Beans | Organic Quinoa |
17 | Organic Apples, Carrots and Parsnips Puree | Seltzer Water |
18 | Unsweetened Premium Iced Tea | Avocado Oil |
19 | Bag of Red Delicious Apples | Non-Scratch Scrub Sponges |
20 | Organic Chamomile with Lavender Herbal Tea Bags | Spa Cuisine Butternut Squash Ravioli |
Products with high scores in principal component 2 will tend to have higher percentage of orders with days since previous order of less than a week and products with low scores in principal component 2 will tend to have a higher percentage of orders that are weekly in frequency.
pcplots_prior$p_items[[2]]
Juices, milk and wine tend to be in orders who frequency is less than a week. Items like Greek Yogurt, Apples and Granola bar are in orders who frequency is weekly.
htmlTable(pcplots_prior$prodlist_topbot20[[2]])
top_scores | bottom_scores | |
---|---|---|
1 | Ultra-Purified Water | Dove Dark Chocolate Snowflake Promises |
2 | Ice | Crunchy Peanut Butter Granola |
3 | Vodka | Wheat Thins Original |
4 | Organic European Style Sweet Unsalted Butter | Plus Cranberry Almond + Antioxidants with Macadamia Nuts Bar |
5 | 80 Vodka Holiday Edition | Pub Mix |
6 | Chardonnay Wine | Dark Chocolate Cinnamon Pecan Bar |
7 | Organic Coconut Bliss Vanilla Island Ice Cream | Non-Fat Blueberry on the Bottom Greek Yogurt |
8 | Natural Artesian Bottled Water | Pomegranate on the Bottom Non-Fat Greek Yogurt |
9 | Chardonnay | Almond & Apricot Bar |
10 | Grass milk Raw Cheddar Cheese | Apples |
11 | Handmade Vodka From Austin, Texas | Coconut Chocolate Chip |
12 | Lemon Love Juice Drink | Trail Mix Fruit & Nut Chewy Granola Bars |
13 | Pure Cane Confectioners Powdered Sugar | Total 0% Blueberry Acai Greek Yogurt |
14 | Premium Ice Cubes | Pears |
15 | Honest Face, Hand, & Baby Wipes | Builder's Chocolate Peanut Butter Bar |
16 | Organic Kale Chips | Fruit Snacks |
17 | Sauvignon Blanc | Black Pepper Snapea Crisps |
18 | Cacao Powder | Teriyaki Beef Jerky |
19 | Garden Vegetable Salad Roll | Cheez-It Cheddar Cracker |
20 | Tall Kitchen Bag With Febreze Odor Shield | Sweet & Salty Nut Almond Granola Bars |
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] htmlTable_1.9 tidyr_0.6.0 scales_0.4.1 ggplot2_2.2.1 dplyr_0.5.0
## [6] readr_1.1.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 knitr_1.15.1 magrittr_1.5 hms_0.3
## [5] munsell_0.4.3 colorspace_1.2-6 R6_2.1.3 stringr_1.2.0
## [9] plyr_1.8.4 tools_3.3.1 grid_3.3.1 checkmate_1.8.2
## [13] gtable_0.2.0 DBI_0.5-1 htmltools_0.3.6 lazyeval_0.2.0
## [17] assertthat_0.1 rprojroot_1.2 digest_0.6.12 tibble_1.2
## [21] codetools_0.2-14 htmlwidgets_0.7 evaluate_0.10 rmarkdown_1.5
## [25] labeling_0.3 stringi_1.1.5 backports_1.0.5