Market Basket Analysis in R - Part 3

Claire Matuka
Mar 10, 2022

If you have been following along on the market basket analysis series, you, my friend, are the GOAT.

Now, if you haven't, I strongly encourage you to first understand the theory of Market Basket Analysis. This article replicates what was done in the Market Basket Analysis on Python article. The only difference is that we are now coding in R.

As I stated in the Python article, finding real data is not easy. Even if you do, the data has already been analyzed and used by all the data analysts and data scientists in the world.

So for this particular tutorial, I will use the arules package to carry out market basket analysis. Let us take a minute to appreciate packages and how they save us so much time. The arules package has an inbuilt dataset "Groceries", which is what has been used in most tutorials. Of course, I did not use this dataset. We like having fun and honestly I am not particularly keen on knowing about onions today.

I will be using the SHEIN dataset I generated in Python. SHEIN is an online apparel store or boutique. I will be gathering insights on graphic t-shirts (graphic tees) as they are a must-have in the sunny Nairobi weather.

If you however want more, you can also check out these two datasets.

This tutorial involves the following steps:

1. Import dataset
2. Transform dataset into class transactions
3. Get frequent item sets
4. Create association rules
5. Draw conclusions and recommendations

1. Import dataset

The following code was used to import the data into R:

#import the data
library(readxl)
shein_data <- read_excel("C:/Users/USER/Downloads/SHEIN Generated Graphic Tee data.xlsx")
View(shein_data)

You will have to change the path depending on the location of the dataset on your computer.
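
If you do not have the spreadsheet, a stand-in with the same shape lets you run the rest of the code. This is only a rough sketch in base R: the item names (Tee 1 to Tee 25), the invoice format and the 5% purchase probability are all made up for illustration. It is not the data used in this article, so your numbers will differ from the ones shown here.

```r
# Hypothetical stand-in for the SHEIN spreadsheet (NOT the data used in this article)
set.seed(42)                      # make the random data reproducible
n_trans <- 2000                   # number of transactions (rows)
n_items <- 25                     # number of graphic tees (columns)

# Each cell is 1 if the tee was bought in that transaction, 0 otherwise
purchases <- matrix(rbinom(n_trans * n_items, size = 1, prob = 0.05),
                    nrow = n_trans, ncol = n_items)
colnames(purchases) <- paste("Tee", seq_len(n_items))   # made-up item names

# First column holds the invoice number, columns 2 to 26 hold the items
shein_data <- data.frame(`Invoice No` = sprintf("INV%04d", seq_len(n_trans)),
                         purchases, check.names = FALSE)
dim(shein_data)
```

Because every item here is drawn independently, the lift values you get from this stand-in should hover around 1. It is useful for checking that the code runs, not for finding interesting rules.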

The resulting dataset was as follows:

2. Transform dataset into class transactions

The SHEIN dataset consists of 25 items (graphic tees) and a total of 2000 transactions. It is simply a matrix of ones and zeros, such that each row represents a transaction, and each column represents an item.

Now, I thought I could simply just import the data, run the code and get the results. I played myself.

To run apriori in R, your dataset needs to be of the class transactions. Luckily, this can be done in the arules package. This is why it is important to read the documentation.

#get the class of the imported data
class(shein_data)

The output shows that the imported dataset is a data.frame (read_excel returns a tibble, which is a special kind of data.frame). According to the arules documentation, it is possible to transform a list, a long- or wide-format data.frame, or even a 0-1 matrix into an object of class transactions. Check out the documentation and, depending on the format of your data, pick the most appropriate method.
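
To make the different input formats a bit more concrete, here is a small sketch with three made-up transactions (the item names tee_a, tee_b and tee_c are hypothetical). It uses the as(x, "transactions") coercion from arules, once on a list and once on a logical matrix. Treat it as an illustration rather than the only way to do it.

```r
library(arules)

# Format 1: a list, with one character vector of items per transaction
toy_list <- list(c("tee_a", "tee_b"),
                 c("tee_b"),
                 c("tee_a", "tee_b", "tee_c"))
trans_from_list <- as(toy_list, "transactions")

# Format 2: a 0-1 matrix (rows = transactions, columns = items),
# turned into TRUE/FALSE before coercion
toy_matrix <- matrix(c(1, 1, 0,
                       0, 1, 0,
                       1, 1, 1),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(NULL, c("tee_a", "tee_b", "tee_c")))
trans_from_matrix <- as(toy_matrix == 1, "transactions")

# Both objects describe the same three baskets
size(trans_from_list)      # number of items in each transaction
size(trans_from_matrix)
```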

For this particular SHEIN data, I first converted it into a matrix, then converted the matrix into a transactions object.

#convert into a data matrix
shein_matrix <- data.matrix(shein_data[2:26])

#label the rows and columns
## Set item names (columns) and transaction labels (rows)
colnames(shein_matrix) <- colnames(shein_data[2:26])
rownames(shein_matrix) <- shein_data$`Invoice No`

View(shein_matrix)

As can be seen, the row names of the matrix are the Invoice Number of each transaction, while the column names are the names of the graphic tees.

To convert the matrix into a transactions object, the following code was used:

#convert into transactions class
## Create transactions
library(arules)
shein_trans <- transactions(shein_matrix)
shein_trans

Just like that, it's done.

3. Get frequent item sets

When using Python, it is necessary to get the frequent item sets before you can get the association rules. In R however, this is an optional step.

In my opinion, it is good measure and good practice just so that you are able to understand the data better.

In the theory article, we mentioned Support, which measures how frequently an itemset appears in the dataset.
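
As a quick illustration of what support means in a 0-1 matrix like ours, the sketch below computes it by hand in base R for a made-up set of five transactions and three hypothetical items. It is only meant to show the arithmetic, not to replace arules.

```r
# Toy 0-1 matrix: 5 transactions (rows) x 3 hypothetical items (columns)
toy <- matrix(c(1, 1, 0,
                1, 0, 0,
                0, 1, 1,
                1, 1, 0,
                0, 0, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("tee_a", "tee_b", "tee_c")))

# Support of a single item = share of transactions that contain it
item_support <- colMeans(toy)
item_support

# Support of a pair = share of transactions that contain both items
pair_support <- mean(toy[, "tee_a"] == 1 & toy[, "tee_b"] == 1)
pair_support
```

Here tee_a appears in 3 of the 5 transactions, so its support is 0.6, while the pair tee_a and tee_b appears together in 2 of the 5, giving a support of 0.4.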

Take note:

I chose to focus on the association between two items although it is possible to have an association that involves more than two items. Your code may take slightly longer to run if you choose to try and get the association between several items.
I also selected a minimum support of 0.01. You are free to select any value depending on what you consider as frequent. For me, I was simply interested in items that appeared in at least 1% of all transactions. This is around 20 transactions out of the total 2000 transactions.
I used the apriori function from the arules package to generate the frequent item sets.

#apriori
#getting the frequent itemsets 
freq_item <- apriori(shein_trans, parameter = list(supp = 0.01, maxlen=2, target = "frequent itemsets"))

## Display the 5 itemsets with the highest support
freq_item_top5 <- sort(freq_item)[1:5]
inspect(freq_item_top5)

This resulted in the following list of item sets:

It is clear that the "Letter & Tape Print Tee" is the most frequently purchased item as it was purchased in 11.3% of all the transactions.

Here is a cute visual of the t-shirt as obtained from the SHEIN website.

4. Create association rules

Other than Support, we also mentioned two other metrics: Confidence and Lift.

Confidence measures the percentage of all transactions satisfying X that also satisfy Y.

Lift measures the ratio of the observed support to that expected if X and Y were independent. If Lift > 1, the items occur together more often than would be expected if they were independent. A higher lift value indicates a stronger association between the items.
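
To see how the two metrics relate, here is a small hand calculation in base R. The two 0-1 vectors are made up (ten hypothetical transactions, with x and y standing in for any two tees), so only the arithmetic matters.

```r
# Ten hypothetical transactions: 1 = item bought, 0 = not bought
x <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)   # item X is in 5 of 10 transactions
y <- c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0)   # item Y is in 4 of 10 transactions

support_x  <- mean(x)                   # 5/10
support_y  <- mean(y)                   # 4/10
support_xy <- mean(x == 1 & y == 1)     # both bought together in 3/10

# Confidence depends on the direction of the rule
confidence_x_to_y <- support_xy / support_x   # share of X-buyers who also bought Y
confidence_y_to_x <- support_xy / support_y   # share of Y-buyers who also bought X

# Lift compares the observed joint support with what independence would give
lift_xy <- support_xy / (support_x * support_y)

c(confidence_x_to_y = confidence_x_to_y,
  confidence_y_to_x = confidence_y_to_x,
  lift = lift_xy)
```

In this toy case the confidence is 0.6 for X to Y and 0.75 for Y to X, while the lift is 1.5 in both directions. Lift is symmetric, which is why confidence is the metric to look at when choosing a direction for a rule.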

To generate the association rules, we still use the arules library. There are two different methods that can be used.

Method 1

In this method, we simply use the apriori function to generate the rules. We do not use frequent item sets in this method. Please note that in the parameters:

I chose to focus on the association rules between two items and so I set minlen = 2 and maxlen = 2
The default confidence level is set at 0.8, which is quite high. This can possibly result in having no rules generated. I chose to set it at 0.01 since I am more concerned about the lift value when it comes to association rules.
I set target = "rules" because I wanted to get the association rules. It is very important to specify this.

#getting the rules
rules_method1 <- apriori(shein_trans, parameter = list(supp = 0.01, minlen = 2, maxlen=2, confidence = 0.01, target = "rules"))

inspect(head(rules_method1, by = "lift"))

Method 2

In this method, we use the ruleInduction function to generate the rules. The difference from Method 1 is that this function takes the frequent item sets we generated earlier as its input. Please note that in the parameters:

The default confidence level is set at 0.8, which is quite high. This can possibly result in having no rules generated. I chose to set it at 0.01 since I am more concerned about the lift value when it comes to association rules.

#second way of getting the rules
rules_method2 <- ruleInduction(freq_item, transactions = shein_trans, confidence = 0.01, method = "apriori")

inspect(head(rules_method2, by = "lift"))

The outputs for Method 1 and Method 2 are exactly the same. It is up to you to choose which one you prefer.

For this particular tutorial, I am keen on just the top 2 item sets with the highest association. You are however free to focus on any number of associations.

Images from shein.com

The "Rhinestone Letter Crop Tee" and "DAZY Letter Graphic Tee" have the highest lift and can therefore be said to be the most associated items. The confidence for "Rhinestone Letter Crop Tee" ---> "DAZY Letter Graphic Tee" is 0.1695906, which is greater than 0.1457286, the confidence for "DAZY Letter Graphic Tee" ---> "Rhinestone Letter Crop Tee". We will therefore select the first rule, which indicates that customers who purchase the "Rhinestone Letter Crop Tee" are more likely to also buy the "DAZY Letter Graphic Tee".

The "Letter and Rocket Print Tee" and "Rhinestone Letter Crop Tee" have the second highest lift and can therefore be said to be among the most associated items. The confidence for "Letter and Rocket Print Tee" ---> "Rhinestone Letter Crop Tee" is 0.1428571, which is less than 0.1611111, the confidence for "Rhinestone Letter Crop Tee" ---> "Letter and Rocket Print Tee". We will therefore select the second rule, which indicates that customers who purchase the "Rhinestone Letter Crop Tee" are more likely to also buy the "Letter and Rocket Print Tee".
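
The selection logic used for these two pairs can be sketched in base R. The confidence values are the four quoted in the text; the data frame and its column names (lhs, rhs, pair) are my own choice for illustration and are not output copied from arules.

```r
# Confidence values quoted in the text for the two pairs with the highest lift
top_rules <- data.frame(
  lhs = c("Rhinestone Letter Crop Tee", "DAZY Letter Graphic Tee",
          "Letter and Rocket Print Tee", "Rhinestone Letter Crop Tee"),
  rhs = c("DAZY Letter Graphic Tee", "Rhinestone Letter Crop Tee",
          "Rhinestone Letter Crop Tee", "Letter and Rocket Print Tee"),
  confidence = c(0.1695906, 0.1457286, 0.1428571, 0.1611111),
  stringsAsFactors = FALSE
)

# A pair is the same regardless of direction, so build an order-free key
top_rules$pair <- apply(top_rules[, c("lhs", "rhs")], 1,
                        function(items) paste(sort(items), collapse = " + "))

# Within each pair, keep the direction with the higher confidence
best <- do.call(rbind, lapply(split(top_rules, top_rules$pair),
                              function(d) d[which.max(d$confidence), ]))
best[, c("lhs", "rhs", "confidence")]
```

Both surviving rules have the "Rhinestone Letter Crop Tee" on the left-hand side, which matches the two rules selected above.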

Image from shein.com

5. Draw conclusions and recommendations

Based on the insights gathered, I would recommend that:

On the website, after a customer views, adds to their cart, or purchases the "Rhinestone Letter Crop Tee", they should get recommended the "DAZY Letter Graphic Tee" and "Letter and Rocket Print Tee".
Two new products that are actually product bundles could be added to the website, at a discounted price.

**************

Hope you enjoyed this tutorial. Which graphic tee do you like the most?

Disclaimer:

Just in case you missed it, none of this data is actually real. It was just generated for the purpose of this tutorial. I have no access to information on SHEIN transactions.