Recently, Instacart released a dataset of ~3 million orders made by ~200,000 users at different days of week and times of day. There is also an ongoing Kaggle competition to predict which products a user will buy again. My goal here is more modest where I just wanted to explore the dataset to find patterns of purchasing behaviour by hour of day, day of week and number of days since prior order. An example of this kind of analysis is also shown in their blog. Here I wanted to explore if I can find such kind of patters by using the very common and popular dimension reduction technique - Principal Component Analysis (PCA). There are several great resources that introduce PCA if you are not familiar with PCA. One of the resources is the set of video lectures on machine learning by Prof. Hastie and Prof. Tibshirani.

The general approach that I have followed is:

Spoiler Alert: Since my analysis is basic, don’t be disappointed if there are no big Aha moments (there will be none). But I think it is still fun to see how we can extract such information directly from data.

We first load the libraries needed for the analysis

# load libraries
library(readr)
library(dplyr)
library(ggplot2)
library(scales)
library(tidyr)
library(htmlTable)

# source functions that are used for PCA analysis, extracting key information from PCA results
source("pca_fns.R")

I downloaded the data from the following link. The data dictionary is in the following link. Since the datasets are big, they are stored in tmpdata folder and not currently in the git repository. So somebody trying to reproduce this code will need to manually download the data and extract it to tmpdata subdirectory to be able to run this code.

# load datasets
aisles = read_csv("tmpdata/aisles.csv")
depts = read_csv("tmpdata/departments.csv")
prods = read_csv("tmpdata/products.csv")

orders = read_csv("tmpdata/orders.csv")
orders$order_hour_of_day = as.numeric(orders$order_hour_of_day)

ordprd_prior = read_csv("tmpdata/order_products__prior.csv")
ordprd_train = read_csv("tmpdata/order_products__train.csv")

# stack order-product datasets (prior and train) together
ordprd = bind_rows(ordprd_prior, ordprd_train)

# Basic Data Info
#

# number of users
length(unique(orders$user_id))
## [1] 206209
# number of orders
nrow(orders)
## [1] 3421083

Below are some basic info on the datasets

Next, we find products that account for top 80% of orders. We will work with this product set through the rest of the analysis

# Find products that account for top 80% of total orders
prod_orders = ordprd %>% group_by(product_id) %>% summarize(numorders = n())
prod_orders = inner_join(prod_orders, prods, by = "product_id")
prod_orders = prod_orders %>% arrange(desc(numorders)) %>% mutate(pctorders = cumsum(numorders)/sum(numorders)*100)
prod_top80pct = prod_orders %>% filter(prod_orders$pctorders <= 80)

The number of products are ~50K or which ~5K account for 80% of total orders

PCA to find patterns of purchase by hour of day

We first explore the number of orders by day of week and hour of day.

# orders by day of week/hour of day
wkhr_orders = orders %>% group_by(order_dow, order_hour_of_day) %>% summarize(numorders = n())

# Plot of orders by hour of day for each week of day
ggplot(data = wkhr_orders) + 
  geom_line(aes(x = order_hour_of_day, y = numorders, color = factor(order_dow))) + 
  xlab("Hour of Day") + ylab("# Orders") + scale_color_discrete(name = "Day of Week") + 
  theme_bw() 

It is not clear whether day 0 is Saturday or Sunday. Since the number of orders is higher for days 0 and 1, maybe 0 and 1 are Saturday and Sunday. But certain other analysis further down seem to indicate that 0 is Sunday.

# create a dataset that calculates for each product, the percentage of product orders at each hour of day
ordprd2 = inner_join(ordprd, orders[,c("order_id","order_hour_of_day")], by = "order_id")
ordprd2a = ordprd2 %>% group_by(product_id, order_hour_of_day) %>% 
            summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd2a = inner_join(ordprd2a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")

Find products with different patterns of purchase timing by hour of day with PCA. Dataset for PCA has for each product (rows), the percentage of product orders at each hour of day (column)

ordprd2a_s = ordprd2a %>% select(-numorders) %>% spread(order_hour_of_day, pctorders, fill = 0)

Here I am plotting the average purchasing pattern by hour of day for a sample product item (chocolate sandwich cookies).

# Plot a row (product_id - 1, chocolate sandwich cookies)
ordprd2a_sample = data.frame(hour_of_day = names(ordprd2a_s)[-1], pctorders = t(ordprd2a_s[ordprd2a_s$product_id == 24852,-1]))
ggplot() + geom_line(data = ordprd2a_sample, aes(x = as.numeric(hour_of_day), y = pctorders)) + theme_bw()

Next we will do a PCA on this data. Since all the data is in percentages, I didn’t do any further scaling of data

# Perform a PCA (rows - products, variables - hour of day)
pcdata = ordprd2a_s
pc_mdl = prcomp(x = pcdata[,-1], center = TRUE, scale. = FALSE)

# Get the plot of cumulative variance explained by each principal component
pc_mdl_var = cumsum(pc_mdl$sdev^2)/sum(pc_mdl$sdev^2)
pc_mdl_vardf = data.frame(pc = 1:length(pc_mdl_var), cumvar = pc_mdl_var)
p_cumvar = ggplot() + 
           geom_line(data = pc_mdl_vardf, aes(x = pc, y = cumvar)) + 
           geom_point(data = pc_mdl_vardf, aes(x = pc, y = cumvar)) + 
           xlab("PC") + ylab("Cumulative Variance") + 
           scale_x_continuous(breaks = 1:length(pc_mdl_var)) + 
           scale_y_continuous(labels = percent, breaks = seq(0,1,by=0.1)) + expand_limits(y = 0) + 
           theme_bw(20)
p_cumvar

The plot of cumulative variance shows that first component accounts for 44% of variance, first two account for 58% and first 3 account for 67% of variance. Next, we will look at the first two loadings since first 2 components account for 58% of variance.

# Get loadings plot for 2 principal components
numcomp = 2

pc_loadings = data.frame(pc_mdl$rotation)
pc_loadings$hour_of_day = seq(0,23)
pc_loadings_g = pc_loadings %>% gather(pc, pcload, -hour_of_day)

pcfilt = paste0("PC",seq(1,numcomp))
pc_loadings_g = pc_loadings_g[pc_loadings_g$pc %in% pcfilt,]

p_pcload = ggplot() + geom_line(data = pc_loadings_g, aes(x = hour_of_day, y = pcload, color = pc)) +
             theme_bw(20)
p_pcload

First principal component loading PC1 indicates a pattern of either higher percentage of purcahses in the morning or evening. The second principal component loading indicates a pattern where there is higher purchase around 11am and 4pm. To check which product items follow these patterns, we look at products that either have high scores or low scores on a principal component. So here we take the top 20 and bottom 20 products in terms of their scores on PC1. The actual pattern still may not quite match the loading plot since the overall pattern is a combination of all principal component loadings.

# Scatter plot of PC1 and PC2 scores
pc_scores = data.frame(pc_mdl$x)
pc_scores$product_id = pcdata$product_id
pc_scores = inner_join(pc_scores, prods[,c("product_id", "product_name")], by = "product_id")

# sort products based on PC1 scores
pc_scores_pick = pc_scores %>% arrange(desc(PC1))

# get top and bottom 20 products based on the scores
prod_top20 = pc_scores_pick$product_name[1:20]
prod_bot20 = pc_scores_pick$product_name[(nrow(pc_scores_pick) - 19):nrow(pc_scores_pick)]

# plot hourly purchase profile for a list of products that are in top and bottom 20 scores of PC1
prodlist_topbot20 = data.frame(top_scores = prod_top20, bottom_scores = prod_bot20)
prodlist = c(prod_top20, rev(prod_bot20))
prodlist_label = c(rep("top 20 scores", 20), rep("bottom 20 scores", 20))
scoreid = c(seq(1,20), seq(1,20))
prodlistdf = data.frame(product_name = prodlist, prodlist_label, scoreid, stringsAsFactors = FALSE)
prodlistdf = inner_join(prodlistdf, prods[, c("product_id", "product_name")], by = "product_name")

prodlist_profile = inner_join(ordprd2a_s, prodlistdf[, c("product_id", "product_name", "prodlist_label")], by = "product_id")

prodlist_profile_g = prodlist_profile %>% gather(hour_of_day, pctorders, -product_name, -product_id, -prodlist_label)

ggplot() + 
  geom_line(data = prodlist_profile_g, aes(x = as.numeric(hour_of_day), y = pctorders, 
                                           group = product_name, color = prodlist_label)) + 
         xlab("Hour of Day") + ylab("% orders") + 
         scale_x_continuous(breaks = seq(0,23)) + scale_y_continuous(labels = percent) + 
         theme_bw(20) + theme(legend.title = element_blank())

Below is the table that lists the actual products that are in top and bottom scores of PC1. Ice cream purchases tend to occur more in the evening. Items like granola bars, krispie treats, apples are purchased more in the morning.

htmlTable(prodlist_topbot20)
top_scores bottom_scores
1 Salted Caramel Core Ice Cream Pistachios
2 Chocolate Fudge Brownie Ice Cream Sparkling Water, Bottles
3 The Tonight Doughâ„¢ Ice Cream Popcorn
4 Cashew Milk Salted Caramel Cluster Non-Dairy Frozen Dessert Extra Fancy Unsalted Mixed Nuts
5 Ice Cream, Super Premium, Mint Chocolate Chip Cheez-It Cheddar Cracker
6 Raspberry Chocolate Chip Gelato Variety
7 Coconut Milk Non Dairy Frozen Dessert No Sugar Added Mint Chip Peanut Milk Chocolate XXL Bag
8 Organic Cookies & Cream Ice Cream Cheez-It Baked Snack Crackers
9 Half Baked® Ice Cream Crunchy Oats 'n Honey Granola Bars
10 Half Baked Frozen Yogurt 0% Greek Strained Yogurt
11 Chocolate Chocolate Chip Ice Cream Original Rice Krispies Treats
12 New York Super Fudge Chunk® Ice Cream Milk Chocolate Almonds
13 Butter Pecan Ice Cream Apples
14 Vanilla, Chocolate, Strawberry Ice Cream Trail Mix
15 Mint Chocolate Cookie Ice Cream Nutri Grain Bars Multi Pack
16 Caramel Cone Ice Cream Dry Roasted Almonds
17 Chocolate Peanut Butter Cup Gelato Sweet & Salty Nut Granola Bars Peanut
18 Americone Dream® Ice Cream Pub Mix
19 Cherry Garcia Ice Cream Half And Half Ultra Pasteurized
20 Dulce de Leche Caramel Ice Cream Wint-O-Green

PCA to find patterns of purchase by day of week

Dataset for PCA has for each product (rows), the percentage of product orders at each day of week (column)

ordprd3 = inner_join(ordprd, orders[, c("order_id", "order_dow")], by = "order_id")

ordprd3a = ordprd3 %>% group_by(product_id, order_dow) %>% 
  summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd3a = inner_join(ordprd3a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")
ordprd3a_s = ordprd3a %>% select(-numorders) %>% spread(order_dow, pctorders, fill = 0)

The different analysis for PCA done for the case of purchase by hour of day have been wrapped into 3 functions that are defined in pca_funs.R (code)[pca_funs.R]

# perform PCA and plot cumulative variance
pcmdl_dow = getpca(pcdata = ordprd3a_s)
pcmdl_dow$p_cumvar

In this case, the first two principal components account for ~90% of variance. The loadings for first two components are plotted next.

pcplots_dow = get_pcaplots(pcmdl_dow, numcomp = 2, ordprd3a_s, ordprd3a_s, prods, seq(0,6), "day of week")
pcplots_dow$p_pcload

The products with high scores on first component would tend to be purchased more at the ends of the week and products with low scores on first component would tend to be purchased less at the ends of the week. The plot of products having top and bottom 20 scores on PC1 is shown next.

pcplots_dow$p_items[[1]]

The table below shows the names of products with top and bottom 20 scores on PC1. It looks like cookies and chocolates are bought less in weekends compared to cooking items.

htmlTable(pcplots_dow$prodlist_topbot20[[1]])
top_scores bottom_scores
1 Black Beans Reduced Sodium Pub Mix
2 99% Fat Free Natural Goodness Chicken Broth Wint-O-Green
3 Mild Diced Green Chiles Peanut Milk Chocolate XXL Bag
4 90% Lean Ground Beef White Multifold Towels
5 Organic Brussels Sprouts Sweet & Salty Nut Granola Bars Peanut
6 Jumbo Lump Crab Cake Trail Mix
7 Organic No Salt Added Diced Tomatoes Milk Chocolate Almonds
8 Spaghetti Squash Nutri Grain Bars Multi Pack
9 Boneless Beef Chuck Roast Cheez-It Baked Snack Crackers
10 Tikka Masala Simmer Sauce 49 Flavors Jelly Belly Jelly Beans
11 Golden Sweet Potato Milk Chocolate M&Ms
12 No Salt Added Black Beans Kit Kat & Reese's Assorted Miniatures
13 Poblano Pepper Fruit Snacks
14 Crushed Tomatoes With Basil Cookie Tray
15 Organic Tomato Sauce No Salt Added Original Rice Krispies Treats
16 Organic Great Northern Beans Crunchy Oats 'n Honey Granola Bars
17 Mexican Style Hominy Half And Half Ultra Pasteurized
18 Boneless Skinless Chicken Breasts Dry Roasted Almonds
19 Dark Red Kidney Beans No Salt Added Extra Fancy Unsalted Mixed Nuts
20 Petite Diced Tomatoes Variety

The products with high scores on second component would tend to be purchased more at the end of the week and products with low scores on first component would tend to be purchased more at the beginning of the week. The plot of products having top and bottom 20 scores on PC2 is shown next.

pcplots_dow$p_items[[2]]

Looks like some wine and vodka is purchased more as the week progresses.

htmlTable(pcplots_dow$prodlist_topbot20[[2]])
top_scores bottom_scores
1 Handmade Vodka From Austin, Texas BBQ Chopped Salad
2 Little Sumpin' Sumpin' Ale Teriyaki Beef Jerky
3 IPA Honey Turkey
4 Handmade Vodka Blueberry Pecan Plus Fiber Fruit & Nut Bar
5 Humboldt Fog Extra Lean Ground Turkey Breast
6 Belgian White Wheat Ale Turkey
7 Prosecco Sparkling Wine Chocolate Mint Builder's Bar
8 Lemon-Lime Soda Ground Turkey Breast
9 Vanilla Cupcakes Fresh Cut Baby Carrots
10 Camembert Organic Chocolate Chip ZBar Kids Energy Snack
11 Challah Bread Builder's Chocolate Peanut Butter Bar
12 Hash Browns Country Style Whole Grain Brown Ready Rice
13 Tostitos Scoops Boneless And Skinless Chicken Breast
14 Pure Cane Confectioners 10-X Powdered Sugar Dark Chocolate Nuts & Sea Salt Bars
15 Milk Chocolate Candy Bars Taco Seasoning
16 Beer Organic Boneless Skinless Chicken Breast
17 Tonic Water Plus Cranberry Almond + Antioxidants with Macadamia Nuts Bar
18 Ice Crispy Cheddar Crackers
19 Premium Lager Beer Organic Zucchini Spirals
20 Puffs Cheese Flavored Snacks Crunchy Peanut Butter Builder's Bar

PCA to find patterns of purchase by days since prior order

Dataset for PCA has for each product (rows), the percentage of product orders at each of days since prior order (column)

ordprd4 = inner_join(ordprd, orders[!is.na(orders$days_since_prior_order), c("order_id", "days_since_prior_order")], by = "order_id")

ordprd4a = ordprd4 %>% group_by(product_id, days_since_prior_order) %>% 
  summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd4a = inner_join(ordprd4a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")
ordprd4a_s = ordprd4a %>% select(-numorders) %>% spread(days_since_prior_order, pctorders, fill = 0)

# PCA of purchase timing (days since prior order)
pcmdl_prior = getpca(pcdata = ordprd4a_s)
pcmdl_prior$p_cumvar

First two principal components account for 75% of variance. The loadings for the first two components is shown next.

pcplots_prior = get_pcaplots(pcmdl_prior, numcomp = 2, ordprd4a_s, ordprd4a_s, prods, seq(0, 30), "days since prior order")
pcplots_prior$p_pcload

Products with high scores in first principal component will tend to be purchased more in orders that are less than a week from previous order and products with low scores on the first principal component will tend to be purchased in orders that are more than a month after the previous order

pcplots_prior$p_items[[1]]

Items like trash bag, laundary detergents, towels are in orders that are made less often (longer than once per month) whereas the food/drink related items are in orders that are made more frequently (less than a week)

htmlTable(pcplots_prior$prodlist_topbot20[[1]])
top_scores bottom_scores
1 Ultra-Purified Water Tall Kitchen Bag With Febreze Odor Shield
2 Grass milk Raw Cheddar Cheese Fresh Scent Laundry Detergent
3 White with 30% Cacao Content Organic Chocolate All Natural White Vinegar
4 Baby Food Pouch - Butternut Squash, Carrot & Chickpea Fabric Softener Dryer Sheet Outdoor Fresh 160CT Fabric Enhancers
5 Organics Big Bird's Apple 100% Juice XL Pick-A-Size Paper Towel Rolls
6 Stage 1 Apples Sweet Potatoes Pumpkin & Blueberries Organic Pureed Baby Food April Fresh Liquid Fabric Softener
7 Organic Whole Milk Shredded Parmigiano Reggiano
8 Whole Organic Omega 3 Milk Makeup Remover Cleansing Towelettes
9 Just Sweet Potato Baby Food Glass Cleaner
10 Organic Homogenized Whole Milk Premium Sliced Bacon
11 Honeydew Chunks Organic Variety Pack
12 100% Whole Wheat Bagels Chicken Apple Sausage
13 Half And Half Ultra Pasteurized Solid White Albacore Tuna
14 Organic Stage 2 Winter Squash Baby Food Puree Versatile Stain Remover
15 Very Berry Flavor Sparking Mineral Water Crunchy Rice Rollers
16 Baby Food Pouch - Roasted Carrot Spinach & Beans Organic Quinoa
17 Organic Apples, Carrots and Parsnips Puree Seltzer Water
18 Unsweetened Premium Iced Tea Avocado Oil
19 Bag of Red Delicious Apples Non-Scratch Scrub Sponges
20 Organic Chamomile with Lavender Herbal Tea Bags Spa Cuisine Butternut Squash Ravioli

Products with high scores in principal component 2 will tend to have higher percentage of orders with days since previous order of less than a week and products with low scores in principal component 2 will tend to have a higher percentage of orders that are weekly in frequency.

pcplots_prior$p_items[[2]]

Juices, milk and wine tend to be in orders who frequency is less than a week. Items like Greek Yogurt, Apples and Granola bar are in orders who frequency is weekly.

htmlTable(pcplots_prior$prodlist_topbot20[[2]])
top_scores bottom_scores
1 Ultra-Purified Water Dove Dark Chocolate Snowflake Promises
2 Ice Crunchy Peanut Butter Granola
3 Vodka Wheat Thins Original
4 Organic European Style Sweet Unsalted Butter Plus Cranberry Almond + Antioxidants with Macadamia Nuts Bar
5 80 Vodka Holiday Edition Pub Mix
6 Chardonnay Wine Dark Chocolate Cinnamon Pecan Bar
7 Organic Coconut Bliss Vanilla Island Ice Cream Non-Fat Blueberry on the Bottom Greek Yogurt
8 Natural Artesian Bottled Water Pomegranate on the Bottom Non-Fat Greek Yogurt
9 Chardonnay Almond & Apricot Bar
10 Grass milk Raw Cheddar Cheese Apples
11 Handmade Vodka From Austin, Texas Coconut Chocolate Chip
12 Lemon Love Juice Drink Trail Mix Fruit & Nut Chewy Granola Bars
13 Pure Cane Confectioners Powdered Sugar Total 0% Blueberry Acai Greek Yogurt
14 Premium Ice Cubes Pears
15 Honest Face, Hand, & Baby Wipes Builder's Chocolate Peanut Butter Bar
16 Organic Kale Chips Fruit Snacks
17 Sauvignon Blanc Black Pepper Snapea Crisps
18 Cacao Powder Teriyaki Beef Jerky
19 Garden Vegetable Salad Roll Cheez-It Cheddar Cracker
20 Tall Kitchen Bag With Febreze Odor Shield Sweet & Salty Nut Almond Granola Bars

Session Info

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] htmlTable_1.9 tidyr_0.6.0   scales_0.4.1  ggplot2_2.2.1 dplyr_0.5.0  
## [6] readr_1.1.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     knitr_1.15.1     magrittr_1.5     hms_0.3         
##  [5] munsell_0.4.3    colorspace_1.2-6 R6_2.1.3         stringr_1.2.0   
##  [9] plyr_1.8.4       tools_3.3.1      grid_3.3.1       checkmate_1.8.2 
## [13] gtable_0.2.0     DBI_0.5-1        htmltools_0.3.6  lazyeval_0.2.0  
## [17] assertthat_0.1   rprojroot_1.2    digest_0.6.12    tibble_1.2      
## [21] codetools_0.2-14 htmlwidgets_0.7  evaluate_0.10    rmarkdown_1.5   
## [25] labeling_0.3     stringi_1.1.5    backports_1.0.5