Exploring Instacart Dataset with PCA

Recently, Instacart released a dataset of ~3 million orders made by ~200,000 users at different days of week and times of day. There is also an ongoing Kaggle competition to predict which products a user will buy again. My goal here is more modest where I just wanted to explore the dataset to find patterns of purchasing behaviour by hour of day, day of week and number of days since prior order. An example of this kind of analysis is also shown in their blog. Here I wanted to explore if I can find such kind of patters by using the very common and popular dimension reduction technique - Principal Component Analysis (PCA). There are several great resources that introduce PCA if you are not familiar with PCA. One of the resources is the set of video lectures on machine learning by Prof. Hastie and Prof. Tibshirani.

The general approach that I have followed is:

Do principal component analysis on the data (each row is a product, each column is a time period (hour of day, day of week or number of days since prior order))
Review the loading plots of first two principal components to see purchase patterns and check the variance accounted by them.
Identify top 20 products that have high absolute scores in either first or the second principal component
Check the purchasing pattern by computing the average number of orders in each time period for the products that were identified as having top scores in one of the principal components.

Spoiler Alert: Since my analysis is basic, don’t be disappointed if there are no big Aha moments (there will be none). But I think it is still fun to see how we can extract such information directly from data.

We first load the libraries needed for the analysis

# load libraries
library(readr)
library(dplyr)
library(ggplot2)
library(scales)
library(tidyr)
library(htmlTable)

# source functions that are used for PCA analysis, extracting key information from PCA results
source("pca_fns.R")

I downloaded the data from the following link. The data dictionary is in the following link. Since the datasets are big, they are stored in tmpdata folder and not currently in the git repository. So somebody trying to reproduce this code will need to manually download the data and extract it to tmpdata subdirectory to be able to run this code.

# load datasets
aisles = read_csv("tmpdata/aisles.csv")
depts = read_csv("tmpdata/departments.csv")
prods = read_csv("tmpdata/products.csv")

orders = read_csv("tmpdata/orders.csv")
orders$order_hour_of_day = as.numeric(orders$order_hour_of_day)

ordprd_prior = read_csv("tmpdata/order_products__prior.csv")
ordprd_train = read_csv("tmpdata/order_products__train.csv")

# stack order-product datasets (prior and train) together
ordprd = bind_rows(ordprd_prior, ordprd_train)

# Basic Data Info
#

# number of users
length(unique(orders$user_id))

## [1] 206209

# number of orders
nrow(orders)

## [1] 3421083

Below are some basic info on the datasets

The number of users are ~200,000.
The number of orders are ~3.4M.

Next, we find products that account for top 80% of orders. We will work with this product set through the rest of the analysis

# Find products that account for top 80% of total orders
prod_orders = ordprd %>% group_by(product_id) %>% summarize(numorders = n())
prod_orders = inner_join(prod_orders, prods, by = "product_id")
prod_orders = prod_orders %>% arrange(desc(numorders)) %>% mutate(pctorders = cumsum(numorders)/sum(numorders)*100)
prod_top80pct = prod_orders %>% filter(prod_orders$pctorders <= 80)

The number of products are ~50K or which ~5K account for 80% of total orders

PCA to find patterns of purchase by hour of day

We first explore the number of orders by day of week and hour of day.

# orders by day of week/hour of day
wkhr_orders = orders %>% group_by(order_dow, order_hour_of_day) %>% summarize(numorders = n())

# Plot of orders by hour of day for each week of day
ggplot(data = wkhr_orders) + 
  geom_line(aes(x = order_hour_of_day, y = numorders, color = factor(order_dow))) + 
  xlab("Hour of Day") + ylab("# Orders") + scale_color_discrete(name = "Day of Week") + 
  theme_bw()

It is not clear whether day 0 is Saturday or Sunday. Since the number of orders is higher for days 0 and 1, maybe 0 and 1 are Saturday and Sunday. But certain other analysis further down seem to indicate that 0 is Sunday.

# create a dataset that calculates for each product, the percentage of product orders at each hour of day
ordprd2 = inner_join(ordprd, orders[,c("order_id","order_hour_of_day")], by = "order_id")
ordprd2a = ordprd2 %>% group_by(product_id, order_hour_of_day) %>% 
            summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd2a = inner_join(ordprd2a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")

Find products with different patterns of purchase timing by hour of day with PCA. Dataset for PCA has for each product (rows), the percentage of product orders at each hour of day (column)

ordprd2a_s = ordprd2a %>% select(-numorders) %>% spread(order_hour_of_day, pctorders, fill = 0)

Here I am plotting the average purchasing pattern by hour of day for a sample product item (chocolate sandwich cookies).

# Plot a row (product_id - 1, chocolate sandwich cookies)
ordprd2a_sample = data.frame(hour_of_day = names(ordprd2a_s)[-1], pctorders = t(ordprd2a_s[ordprd2a_s$product_id == 24852,-1]))
ggplot() + geom_line(data = ordprd2a_sample, aes(x = as.numeric(hour_of_day), y = pctorders)) + theme_bw()

Next we will do a PCA on this data. Since all the data is in percentages, I didn’t do any further scaling of data

# Perform a PCA (rows - products, variables - hour of day)
pcdata = ordprd2a_s
pc_mdl = prcomp(x = pcdata[,-1], center = TRUE, scale. = FALSE)

# Get the plot of cumulative variance explained by each principal component
pc_mdl_var = cumsum(pc_mdl$sdev^2)/sum(pc_mdl$sdev^2)
pc_mdl_vardf = data.frame(pc = 1:length(pc_mdl_var), cumvar = pc_mdl_var)
p_cumvar = ggplot() + 
           geom_line(data = pc_mdl_vardf, aes(x = pc, y = cumvar)) + 
           geom_point(data = pc_mdl_vardf, aes(x = pc, y = cumvar)) + 
           xlab("PC") + ylab("Cumulative Variance") + 
           scale_x_continuous(breaks = 1:length(pc_mdl_var)) + 
           scale_y_continuous(labels = percent, breaks = seq(0,1,by=0.1)) + expand_limits(y = 0) + 
           theme_bw(20)
p_cumvar

The plot of cumulative variance shows that first component accounts for 44% of variance, first two account for 58% and first 3 account for 67% of variance. Next, we will look at the first two loadings since first 2 components account for 58% of variance.

# Get loadings plot for 2 principal components
numcomp = 2

pc_loadings = data.frame(pc_mdl$rotation)
pc_loadings$hour_of_day = seq(0,23)
pc_loadings_g = pc_loadings %>% gather(pc, pcload, -hour_of_day)

pcfilt = paste0("PC",seq(1,numcomp))
pc_loadings_g = pc_loadings_g[pc_loadings_g$pc %in% pcfilt,]

p_pcload = ggplot() + geom_line(data = pc_loadings_g, aes(x = hour_of_day, y = pcload, color = pc)) +
             theme_bw(20)
p_pcload

First principal component loading PC1 indicates a pattern of either higher percentage of purcahses in the morning or evening. The second principal component loading indicates a pattern where there is higher purchase around 11am and 4pm. To check which product items follow these patterns, we look at products that either have high scores or low scores on a principal component. So here we take the top 20 and bottom 20 products in terms of their scores on PC1. The actual pattern still may not quite match the loading plot since the overall pattern is a combination of all principal component loadings.

# Scatter plot of PC1 and PC2 scores
pc_scores = data.frame(pc_mdl$x)
pc_scores$product_id = pcdata$product_id
pc_scores = inner_join(pc_scores, prods[,c("product_id", "product_name")], by = "product_id")

# sort products based on PC1 scores
pc_scores_pick = pc_scores %>% arrange(desc(PC1))

# get top and bottom 20 products based on the scores
prod_top20 = pc_scores_pick$product_name[1:20]
prod_bot20 = pc_scores_pick$product_name[(nrow(pc_scores_pick) - 19):nrow(pc_scores_pick)]

# plot hourly purchase profile for a list of products that are in top and bottom 20 scores of PC1
prodlist_topbot20 = data.frame(top_scores = prod_top20, bottom_scores = prod_bot20)
prodlist = c(prod_top20, rev(prod_bot20))
prodlist_label = c(rep("top 20 scores", 20), rep("bottom 20 scores", 20))
scoreid = c(seq(1,20), seq(1,20))
prodlistdf = data.frame(product_name = prodlist, prodlist_label, scoreid, stringsAsFactors = FALSE)
prodlistdf = inner_join(prodlistdf, prods[, c("product_id", "product_name")], by = "product_name")

prodlist_profile = inner_join(ordprd2a_s, prodlistdf[, c("product_id", "product_name", "prodlist_label")], by = "product_id")

prodlist_profile_g = prodlist_profile %>% gather(hour_of_day, pctorders, -product_name, -product_id, -prodlist_label)

ggplot() + 
  geom_line(data = prodlist_profile_g, aes(x = as.numeric(hour_of_day), y = pctorders, 
                                           group = product_name, color = prodlist_label)) + 
         xlab("Hour of Day") + ylab("% orders") + 
         scale_x_continuous(breaks = seq(0,23)) + scale_y_continuous(labels = percent) + 
         theme_bw(20) + theme(legend.title = element_blank())

Below is the table that lists the actual products that are in top and bottom scores of PC1. Ice cream purchases tend to occur more in the evening. Items like granola bars, krispie treats, apples are purchased more in the morning.

htmlTable(prodlist_topbot20)

	top_scores	bottom_scores
1	Salted Caramel Core Ice Cream	Pistachios
2	Chocolate Fudge Brownie Ice Cream	Sparkling Water, Bottles
3	The Tonight Dough™ Ice Cream	Popcorn
4	Cashew Milk Salted Caramel Cluster Non-Dairy Frozen Dessert	Extra Fancy Unsalted Mixed Nuts
5	Ice Cream, Super Premium, Mint Chocolate Chip	Cheez-It Cheddar Cracker
6	Raspberry Chocolate Chip Gelato	Variety
7	Coconut Milk Non Dairy Frozen Dessert No Sugar Added Mint Chip	Peanut Milk Chocolate XXL Bag
8	Organic Cookies & Cream Ice Cream	Cheez-It Baked Snack Crackers
9	Half Baked® Ice Cream	Crunchy Oats 'n Honey Granola Bars
10	Half Baked Frozen Yogurt	0% Greek Strained Yogurt
11	Chocolate Chocolate Chip Ice Cream	Original Rice Krispies Treats
12	New York Super Fudge Chunk® Ice Cream	Milk Chocolate Almonds
13	Butter Pecan Ice Cream	Apples
14	Vanilla, Chocolate, Strawberry Ice Cream	Trail Mix
15	Mint Chocolate Cookie Ice Cream	Nutri Grain Bars Multi Pack
16	Caramel Cone Ice Cream	Dry Roasted Almonds
17	Chocolate Peanut Butter Cup Gelato	Sweet & Salty Nut Granola Bars Peanut
18	Americone Dream® Ice Cream	Pub Mix
19	Cherry Garcia Ice Cream	Half And Half Ultra Pasteurized
20	Dulce de Leche Caramel Ice Cream	Wint-O-Green

PCA to find patterns of purchase by day of week

Dataset for PCA has for each product (rows), the percentage of product orders at each day of week (column)

ordprd3 = inner_join(ordprd, orders[, c("order_id", "order_dow")], by = "order_id")

ordprd3a = ordprd3 %>% group_by(product_id, order_dow) %>% 
  summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd3a = inner_join(ordprd3a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")
ordprd3a_s = ordprd3a %>% select(-numorders) %>% spread(order_dow, pctorders, fill = 0)

The different analysis for PCA done for the case of purchase by hour of day have been wrapped into 3 functions that are defined in pca_funs.R (code)[pca_funs.R]

# perform PCA and plot cumulative variance
pcmdl_dow = getpca(pcdata = ordprd3a_s)
pcmdl_dow$p_cumvar

In this case, the first two principal components account for ~90% of variance. The loadings for first two components are plotted next.

pcplots_dow = get_pcaplots(pcmdl_dow, numcomp = 2, ordprd3a_s, ordprd3a_s, prods, seq(0,6), "day of week")
pcplots_dow$p_pcload

The products with high scores on first component would tend to be purchased more at the ends of the week and products with low scores on first component would tend to be purchased less at the ends of the week. The plot of products having top and bottom 20 scores on PC1 is shown next.

pcplots_dow$p_items[[1]]

The table below shows the names of products with top and bottom 20 scores on PC1. It looks like cookies and chocolates are bought less in weekends compared to cooking items.

htmlTable(pcplots_dow$prodlist_topbot20[[1]])

	top_scores	bottom_scores
1	Black Beans Reduced Sodium	Pub Mix
2	99% Fat Free Natural Goodness Chicken Broth	Wint-O-Green
3	Mild Diced Green Chiles	Peanut Milk Chocolate XXL Bag
4	90% Lean Ground Beef	White Multifold Towels
5	Organic Brussels Sprouts	Sweet & Salty Nut Granola Bars Peanut
6	Jumbo Lump Crab Cake	Trail Mix
7	Organic No Salt Added Diced Tomatoes	Milk Chocolate Almonds
8	Spaghetti Squash	Nutri Grain Bars Multi Pack
9	Boneless Beef Chuck Roast	Cheez-It Baked Snack Crackers
10	Tikka Masala Simmer Sauce	49 Flavors Jelly Belly Jelly Beans
11	Golden Sweet Potato	Milk Chocolate M&Ms
12	No Salt Added Black Beans	Kit Kat & Reese's Assorted Miniatures
13	Poblano Pepper	Fruit Snacks
14	Crushed Tomatoes With Basil	Cookie Tray
15	Organic Tomato Sauce No Salt Added	Original Rice Krispies Treats
16	Organic Great Northern Beans	Crunchy Oats 'n Honey Granola Bars
17	Mexican Style Hominy	Half And Half Ultra Pasteurized
18	Boneless Skinless Chicken Breasts	Dry Roasted Almonds
19	Dark Red Kidney Beans No Salt Added	Extra Fancy Unsalted Mixed Nuts
20	Petite Diced Tomatoes	Variety

The products with high scores on second component would tend to be purchased more at the end of the week and products with low scores on first component would tend to be purchased more at the beginning of the week. The plot of products having top and bottom 20 scores on PC2 is shown next.

pcplots_dow$p_items[[2]]

Looks like some wine and vodka is purchased more as the week progresses.

htmlTable(pcplots_dow$prodlist_topbot20[[2]])

	top_scores	bottom_scores
1	Handmade Vodka From Austin, Texas	BBQ Chopped Salad
2	Little Sumpin' Sumpin' Ale	Teriyaki Beef Jerky
3	IPA	Honey Turkey
4	Handmade Vodka	Blueberry Pecan Plus Fiber Fruit & Nut Bar
5	Humboldt Fog	Extra Lean Ground Turkey Breast
6	Belgian White Wheat Ale	Turkey
7	Prosecco Sparkling Wine	Chocolate Mint Builder's Bar
8	Lemon-Lime Soda	Ground Turkey Breast
9	Vanilla Cupcakes	Fresh Cut Baby Carrots
10	Camembert	Organic Chocolate Chip ZBar Kids Energy Snack
11	Challah Bread	Builder's Chocolate Peanut Butter Bar
12	Hash Browns Country Style	Whole Grain Brown Ready Rice
13	Tostitos Scoops	Boneless And Skinless Chicken Breast
14	Pure Cane Confectioners 10-X Powdered Sugar	Dark Chocolate Nuts & Sea Salt Bars
15	Milk Chocolate Candy Bars	Taco Seasoning
16	Beer	Organic Boneless Skinless Chicken Breast
17	Tonic Water	Plus Cranberry Almond + Antioxidants with Macadamia Nuts Bar
18	Ice	Crispy Cheddar Crackers
19	Premium Lager Beer	Organic Zucchini Spirals
20	Puffs Cheese Flavored Snacks	Crunchy Peanut Butter Builder's Bar

PCA to find patterns of purchase by days since prior order

Dataset for PCA has for each product (rows), the percentage of product orders at each of days since prior order (column)

ordprd4 = inner_join(ordprd, orders[!is.na(orders$days_since_prior_order), c("order_id", "days_since_prior_order")], by = "order_id")

ordprd4a = ordprd4 %>% group_by(product_id, days_since_prior_order) %>% 
  summarize(numorders = n()) %>% mutate(pctorders = numorders/sum(numorders))
ordprd4a = inner_join(ordprd4a, prod_top80pct[,c("product_id"), drop = FALSE], by = "product_id")
ordprd4a_s = ordprd4a %>% select(-numorders) %>% spread(days_since_prior_order, pctorders, fill = 0)

# PCA of purchase timing (days since prior order)
pcmdl_prior = getpca(pcdata = ordprd4a_s)
pcmdl_prior$p_cumvar

First two principal components account for 75% of variance. The loadings for the first two components is shown next.

pcplots_prior = get_pcaplots(pcmdl_prior, numcomp = 2, ordprd4a_s, ordprd4a_s, prods, seq(0, 30), "days since prior order")
pcplots_prior$p_pcload

Products with high scores in first principal component will tend to be purchased more in orders that are less than a week from previous order and products with low scores on the first principal component will tend to be purchased in orders that are more than a month after the previous order

pcplots_prior$p_items[[1]]

Items like trash bag, laundary detergents, towels are in orders that are made less often (longer than once per month) whereas the food/drink related items are in orders that are made more frequently (less than a week)

htmlTable(pcplots_prior$prodlist_topbot20[[1]])

	top_scores	bottom_scores
1	Ultra-Purified Water	Tall Kitchen Bag With Febreze Odor Shield
2	Grass milk Raw Cheddar Cheese	Fresh Scent Laundry Detergent
3	White with 30% Cacao Content Organic Chocolate	All Natural White Vinegar
4	Baby Food Pouch - Butternut Squash, Carrot & Chickpea	Fabric Softener Dryer Sheet Outdoor Fresh 160CT Fabric Enhancers
5	Organics Big Bird's Apple 100% Juice	XL Pick-A-Size Paper Towel Rolls
6	Stage 1 Apples Sweet Potatoes Pumpkin & Blueberries Organic Pureed Baby Food	April Fresh Liquid Fabric Softener
7	Organic Whole Milk	Shredded Parmigiano Reggiano
8	Whole Organic Omega 3 Milk	Makeup Remover Cleansing Towelettes
9	Just Sweet Potato Baby Food	Glass Cleaner
10	Organic Homogenized Whole Milk	Premium Sliced Bacon
11	Honeydew Chunks	Organic Variety Pack
12	100% Whole Wheat Bagels	Chicken Apple Sausage
13	Half And Half Ultra Pasteurized	Solid White Albacore Tuna
14	Organic Stage 2 Winter Squash Baby Food Puree	Versatile Stain Remover
15	Very Berry Flavor Sparking Mineral Water	Crunchy Rice Rollers
16	Baby Food Pouch - Roasted Carrot Spinach & Beans	Organic Quinoa
17	Organic Apples, Carrots and Parsnips Puree	Seltzer Water
18	Unsweetened Premium Iced Tea	Avocado Oil
19	Bag of Red Delicious Apples	Non-Scratch Scrub Sponges
20	Organic Chamomile with Lavender Herbal Tea Bags	Spa Cuisine Butternut Squash Ravioli

Products with high scores in principal component 2 will tend to have higher percentage of orders with days since previous order of less than a week and products with low scores in principal component 2 will tend to have a higher percentage of orders that are weekly in frequency.

pcplots_prior$p_items[[2]]

Juices, milk and wine tend to be in orders who frequency is less than a week. Items like Greek Yogurt, Apples and Granola bar are in orders who frequency is weekly.

htmlTable(pcplots_prior$prodlist_topbot20[[2]])

	top_scores	bottom_scores
1	Ultra-Purified Water	Dove Dark Chocolate Snowflake Promises
2	Ice	Crunchy Peanut Butter Granola
3	Vodka	Wheat Thins Original
4	Organic European Style Sweet Unsalted Butter	Plus Cranberry Almond + Antioxidants with Macadamia Nuts Bar
5	80 Vodka Holiday Edition	Pub Mix
6	Chardonnay Wine	Dark Chocolate Cinnamon Pecan Bar
7	Organic Coconut Bliss Vanilla Island Ice Cream	Non-Fat Blueberry on the Bottom Greek Yogurt
8	Natural Artesian Bottled Water	Pomegranate on the Bottom Non-Fat Greek Yogurt
9	Chardonnay	Almond & Apricot Bar
10	Grass milk Raw Cheddar Cheese	Apples
11	Handmade Vodka From Austin, Texas	Coconut Chocolate Chip
12	Lemon Love Juice Drink	Trail Mix Fruit & Nut Chewy Granola Bars
13	Pure Cane Confectioners Powdered Sugar	Total 0% Blueberry Acai Greek Yogurt
14	Premium Ice Cubes	Pears
15	Honest Face, Hand, & Baby Wipes	Builder's Chocolate Peanut Butter Bar
16	Organic Kale Chips	Fruit Snacks
17	Sauvignon Blanc	Black Pepper Snapea Crisps
18	Cacao Powder	Teriyaki Beef Jerky
19	Garden Vegetable Salad Roll	Cheez-It Cheddar Cracker
20	Tall Kitchen Bag With Febreze Odor Shield	Sweet & Salty Nut Almond Granola Bars

Session Info

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] htmlTable_1.9 tidyr_0.6.0   scales_0.4.1  ggplot2_2.2.1 dplyr_0.5.0  
## [6] readr_1.1.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     knitr_1.15.1     magrittr_1.5     hms_0.3         
##  [5] munsell_0.4.3    colorspace_1.2-6 R6_2.1.3         stringr_1.2.0   
##  [9] plyr_1.8.4       tools_3.3.1      grid_3.3.1       checkmate_1.8.2 
## [13] gtable_0.2.0     DBI_0.5-1        htmltools_0.3.6  lazyeval_0.2.0  
## [17] assertthat_0.1   rprojroot_1.2    digest_0.6.12    tibble_1.2      
## [21] codetools_0.2-14 htmlwidgets_0.7  evaluate_0.10    rmarkdown_1.5   
## [25] labeling_0.3     stringi_1.1.5    backports_1.0.5

Exploring Instacart Dataset with PCA

Notesofdabbler

2017-05-22

PCA to find patterns of purchase by hour of day

PCA to find patterns of purchase by day of week

PCA to find patterns of purchase by days since prior order

Session Info