Instacart Market Basket Analysis Challenge
This blog is about my first Kaggle challenge and everything I did to solve it. Hopefully, it will give you a good understanding of building recommendation systems through classification.
Instacart is an American company that provides grocery delivery and pick-up services in the U.S. and Canada, via a website and a mobile app. Unlike most e-commerce websites, which ship products directly from seller to customer, Instacart lets users buy products from participating vendors, and the shopping is done by a personal shopper.
Task: The task is to recommend products to users. The challenge is that the recommendations should be useful to the users and have a high chance of being reordered.
[Q]. Why don’t we hire someone to do this manually?
[A]. Okay! Let’s assume 2–3 people get hired for this task. At most, how many recommendations can they make daily? 100–200? Also, can we be sure that those recommendations are correct and that the users will actually buy those products?
To reduce this overhead, Instacart hosted this challenge on Kaggle, through which they could get a good recommendation system and also have some good Machine Learning/Deep Learning engineers join their team.
Table Of Contents:
1. Business Problem
2. Mapping to ML Problem
3. Data Acquisition
4. Data Overview
5. Checking & Handling Missing Data
6. Data Visualization
7. Feature Engineering
8. Train-Test Data Creation
9. Trying Various Models
10. Creating Submission Files
11. Comparing Results
12. Final Functions
13. Future Work
14. References
1. Business Problem
A proper understanding of the business problem is the most important step for any Machine Learning/Deep Learning task, so let’s try to understand it first.
We are provided with 7 data frames, each with its own features. Our task is to use all the features and, given an order_id and product_id, predict whether that specific product is going to be reordered or not, and then return, for each order_id, the list of all products predicted to be reordered.
2. Mapping To ML Problem
Going through the business problem, it looks like a recommendation problem, and it actually is one. But the approach to solving it here is a bit unique: we have to build a classification model.
Using this model, we predict which products will be reordered. We then create a list of all products for a given ‘order_id’ that are predicted to be reordered. This way we get the list of products to recommend.
Evaluation Metric: F1-Score. Since it’s a Kaggle problem, the evaluation metric is predefined.
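For reference, the score for each order compares the predicted set of products P with the actual set A of reordered products; the competition then averages this F1 across orders:

```latex
\text{precision} = \frac{|P \cap A|}{|P|}, \qquad
\text{recall} = \frac{|P \cap A|}{|A|}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```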
3. Data Acquisition
“Data is the new Oil”: Clive Humby
Data matters to everyone, from a school-going kid to a machine learning professional. Data acquisition is the initial step of the data science lifecycle and also one of the most important: for any machine learning or deep learning model, data is the soul of that model.
So data should be acquired wisely and carefully.
Since this is a Kaggle problem, the data is already given, in the form of .csv files.
CSV being the most common format, it is fairly easy to work with.
4. Data Overview
We have the data, and the entire model is going to be built on top of it, so we should take a glance at it: what do we have, and how can we use it later?
The attributes of each CSV file are given below.
1. Aisles.csv
1. Aisle Id: A unique id to represent each aisle (Integer: int16)
2. Aisle: The name of the aisle, based on the products it contains (String)

2. Departments.csv
1. Department Id: A unique integer to represent each department (Integer: int16)
2. Department: The name of the department, based on the products it contains (String)

3. Order Products Prior.csv
1. Order Id: A unique integer to represent each order (Integer: int32)
2. Product Id: A unique integer for each product (Integer: int32)
3. Add To Cart Order: The position at which the product was added to the cart (Integer: int16)
4. Reordered: Binary variable (0 = not reordered / 1 = reordered)

4. Order Products Train.csv
1. Order Id: A unique integer to represent each order (Integer: int)
2. Product Id: A unique integer for each product (Integer: int32)
3. Add To Cart Order: The position at which the product was added to the cart (Integer: int16)
4. Reordered: Binary variable (0 = not reordered / 1 = reordered)

5. Orders.csv
1. Order Id: A unique integer to represent each order (Integer: int64)
2. User Id: A unique integer to represent each user (Integer: int64)
3. Eval Set: Tells whether the order belongs to the prior, train, or test set (String)
4. Order Number: The sequence number of the order for that customer (Integer: int64)
5. Order Dow: Day of week, ranging from 0–6 where 0 = Sunday and 6 = Saturday (Integer: int16)
6. Order Hour Of Day: Ranges from 0–23 where 0 = 12 a.m. and 23 = 11 p.m. (Integer: int16)
7. Days Since Prior Order: Number of days since the user’s last order was placed (Integer: int16)

6. Products.csv
1. Product Id: A unique id for each product; since there is a huge number of products, it has a large range (Integer: int64)
2. Product Name: Name of the product (String)
3. Aisle Id: Id of the aisle where the product is found (Integer: int64)
4. Department Id: Id of the product’s department (Integer: int32)

7. Sample Submission.csv
1. Order Id: Id of each order placed (Integer: int64)
2. Products: Space-delimited list of the product ids predicted to be ordered (String)
[Q]. What’s the difference between Order Products Prior & Order Products Train?
[A]. Both have the same structure, but Order Products Prior contains the contents of each user’s earlier orders, whereas Order Products Train contains each train user’s most recent order, which provides the labels we train on.
5. Checking & Handling Missing Values
First, we load all the CSV files.
It’s clearly visible that the feature ‘days_since_prior_order’ has null values.
To handle these null values, we could replace them with 0 or 1, but I wanted to replace them with the mean value instead, so here is the code for it.
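A minimal sketch of the loading and imputation, assuming the file names from the Kaggle download sit in the working directory:

```python
import pandas as pd

# Load all the given CSV files (paths assumed relative to the working directory)
aisles = pd.read_csv("aisles.csv")
departments = pd.read_csv("departments.csv")
order_products_prior = pd.read_csv("order_products__prior.csv")
order_products_train = pd.read_csv("order_products__train.csv")
orders = pd.read_csv("orders.csv")
products = pd.read_csv("products.csv")

# 'days_since_prior_order' is null for every user's first order,
# so impute those nulls with the column mean
print(orders["days_since_prior_order"].isnull().sum())
orders["days_since_prior_order"] = orders["days_since_prior_order"].fillna(
    orders["days_since_prior_order"].mean()
)
```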
Other than this, there are no null values, so we move on to the visualization step.
6. Data Visualization
Visualizing the data is also an important step, as it gives us insights such as the mean, max, and min values of features, and we can spot outliers just by looking at the plots.
The original notebook EDA.ipynb contains many plots, and adding all of them here would make the blog tedious for readers, so I am including only the important ones.
1. Eval Set Distribution
2. Orders By Day Of Week
3. Products Ordered By Hour Of Day
4. HeatMap For Orders Placed By Day Of Week & Hour Of Day
5. Order By Day Since Prior Order Features
6. Departments vs Products In them
7. Reordered Plot
8. Products v/s Times Ordered
9. Departments v/s Time Ordered From
10. Aisles v/s Time Ordered
11. Number Of Reorders v/s Days
12. Number Of Reorders v/s Time Of Day
13. Reordering v/s Day Of Week & Time Of Day
14. Pie Chart For Department
15. Pie Chart For Reorders Placed Day Wise
There are more plots for each feature, which can be checked in the GitHub repository linked at the bottom of the page.
Conclusion From EDA
1. Most orders are placed between 9 a.m. and 4 p.m.
2. The largest number of orders is placed at 10 a.m. on Sunday.
3. Many users order with a gap of 1 day since their prior order.
4. The start and the end of the week see the most orders placed, with the 30th day having the highest number of orders.
5. The maximum number of items placed by users is 51 most of the time, but a few orders contained 100 items.
6. Almost all of the most-ordered products are vegetables or fruits.
7. Banana is the most ordered and most reordered item.
8. Produce is the department with the most orders and reorders.
9. Bulk is the department with the fewest orders.
10. The most reorders happen at 10 a.m., and 3 a.m. has the fewest reorders.
11. If an item is added to the cart at an earlier position, it has a higher chance of being reordered.
7. Feature Engineering
Feature engineering is an art that comes with experience. In every ML problem, we are given a set of features which may or may not be informative, and from these we derive more features. Newly engineered features can bring drastic improvements in model performance, and well-chosen features generally lead to better predictions.
Starting from the given features, I came up with multiple new ones. Some are taken directly from Kaggle, some are extensions of those, and some are totally new. There are also some intermediate features that were created only to support other features and were dropped later; I mark those with ‘(I)’ next to their names. Here is the list of features I created.
- Max Number Of Orders: The largest number of products the user ever ordered in a single order.
- Mean Number Of Orders: The average number of products the user orders per order.
- Min Number Of Orders: The smallest number of products the user ever ordered in a single order.
- Total Products Per Order: The number of products the user orders in each order.
- Average User Product: The average of the above features.
- Max Order Day: The day on which the user ordered the most products.
- Max Order By Hour: The hour of the day when the user ordered the most products.
- Total Reorders: The total number of reorders made by the user.
- Total Non-Reorders: The number of times the user did not reorder a product.
- Reorder Ratio: The ratio total reorders / (total reorders + total non-reorders).
- Times Product Ordered: The number of times a product was reordered.
- Product Reorder Ratio: The ratio of times the product was reordered to times the product was ordered.
- Product Not Reorder Time: The number of times the product was not reordered.
- Average Product Cart Position: The average position at which the product is added to the cart.
- Late Product Cart Position: The latest position at which the product was ever added to the cart.
- Early Product Cart Position: The earliest position at which the product was ever added to the cart.
- Average Product Order Day: The average day on which the product gets ordered.
- Late Product Order Day: The latest day of the week on which the product was ordered.
- Early Product Order Day: The earliest day of the week on which the product was ordered.
- Average Product Hour: The average time of day when the product gets ordered.
- Late Product Hour: The latest time of day when the product was ordered.
- Early Product Hour: The earliest time of day when the product was ordered.
- Times Product By User (I): The number of times a specific user ordered a specific product. Similar to an earlier feature, but created separately because we modify and drop it later.
- Times Bought (I): The number of times the user bought a specific product.
- Total Orders By User (I): The maximum number of orders placed by the user.
- Order Range (I): Total Orders By User - (Earliest Order Position) + 1.
- Recent Ten: For a given user and product, the number of times the product was ordered in the user’s 10 most recent orders.
- Recent Ten Order Ratio: The above feature divided by 10, i.e. the ratio over the 10 most recent orders.
Once we have all these features, we keep merging the data frames on their common keys. Below is a sketch of the code for some of the features.
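The full code lives in the notebook; as a hedged pandas sketch (column names follow the data overview above, and only a few of the listed features are shown):

```python
# Join the prior order contents with the order metadata so each row carries user and order info
prior = order_products_prior.merge(orders, on="order_id", how="left")

# --- User features: basket-size statistics and reorder ratio ---
basket_sizes = (prior.groupby(["user_id", "order_id"])["product_id"]
                     .count().reset_index(name="basket_size"))
user_feats = basket_sizes.groupby("user_id")["basket_size"].agg(
    max_number_of_orders="max",
    mean_number_of_orders="mean",
    min_number_of_orders="min",
).reset_index()

user_reorders = prior.groupby("user_id")["reordered"].agg(
    total_reorders="sum", total_products="count"
).reset_index()
user_reorders["reorder_ratio"] = user_reorders["total_reorders"] / user_reorders["total_products"]

# --- Product features: popularity, reorder ratio, average cart position ---
product_feats = prior.groupby("product_id").agg(
    times_product_ordered=("reordered", "count"),
    product_reorder_ratio=("reordered", "mean"),
    average_product_cart_position=("add_to_cart_order", "mean"),
).reset_index()

# Keep merging the feature tables back on their common keys
features = (prior.merge(user_feats, on="user_id")
                 .merge(user_reorders, on="user_id")
                 .merge(product_feats, on="product_id"))
```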
8. Train-Test Data Creation (Biggest Challenge)
With so many CSV files given, it is really hard to identify which data to use for train and test. ‘Eval_Set’ is a feature in ‘Orders.csv’ with 3 values (train, test, prior); it tells which set a data point belongs to.
Also, we have order_products_prior and order_products_train, which creates confusion about which data to pick for training.
Solution:
1. First, we collect the data points whose eval_set value is train or test into a single data frame called train_test_data.
2. Then we merge this train_test_data with the feature data frame we got after creating and merging the features.
3. Next, we separate train_test_data into train_data and test_data based on the eval_set value.
4. Now we merge this train_data with the given order_products_train on product_id and order_id.
While creating features we used order_products_prior, and now that we merge with order_products_train, we end up using all the given data.
Finally, we get the train and test data.
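A rough sketch of those four steps; user_product_features is a stand-in name for the merged user × product feature table from section 7, so treat it as an assumption of this sketch:

```python
# 1. Keep only the orders whose eval_set value is train or test
train_test_data = orders[orders["eval_set"].isin(["train", "test"])]

# 2. Merge in the engineered features; joining on user_id expands each order into
#    one candidate row per (user, previously bought product)
train_test_data = train_test_data.merge(user_product_features, on="user_id", how="left")

# 3. Separate the train and test rows by eval_set
train_data = train_test_data[train_test_data["eval_set"] == "train"]
test_data = train_test_data[train_test_data["eval_set"] == "test"]

# 4. The train rows get their 'reordered' labels from order_products_train;
#    candidates missing from that file were not reordered, hence the fillna(0)
train_data = train_data.merge(
    order_products_train[["order_id", "product_id", "reordered"]],
    on=["order_id", "product_id"], how="left",
)
train_data["reordered"] = train_data["reordered"].fillna(0)
```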
9. Trying Various Models
Machine Learning is one of the fastest-growing fields, with tons of research papers published every month and new algorithms coming up all the time. So, how do we pick the best one?
There are no rules or guidelines saying that a particular model works well in every case.
Machine Learning is largely about trying things out and keeping what gives the best results. During this trial-and-error approach, we get to learn a lot of new concepts. Also, some basic domain knowledge helps in picking a model.
Since it is a classification problem, I am going to try various algorithms I am familiar with. After creating a submission file for each model, I will submit the results to Kaggle and see which model performs best on the test data.
Before training a model, there are a few steps to be done, like a train-test split, null-value checks, and normalizing the data.
1. Logistic Regression: It is one of the simplest and most popular classification algorithms. Logistic Regression tries to find a hyperplane that separates points with different labels. Let’s get into the code directly.
Every model has various hyperparameters, and no one knows in advance which values will perform well, so we do hyperparameter tuning.
Hyperparameter tuning can be done using various approaches, like a custom for-loop, Randomized Search CV, or Grid Search CV.
In my case, I am going to use a custom for-loop, as I want to keep track of each parameter’s score.
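A minimal sketch of that for-loop tuning; X_train, X_val, y_train, y_val are assumed to come from the train-test split and normalization step above, and the C grid is just illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

scores = {}
# Try a few values of the inverse regularization strength C and keep each score
for c in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=c, max_iter=1000)
    clf.fit(X_train, y_train)
    scores[c] = f1_score(y_val, clf.predict(X_val))
    print(f"C={c}: validation F1 = {scores[c]:.4f}")

best_c = max(scores, key=scores.get)
best_lr = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
```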
2. Decision Tree: The name itself says it is a tree-based algorithm. A decision tree is a flowchart-like structure that keeps growing deeper as long as its split conditions keep passing.
3. Random Forest: Random Forest is a flexible, easy-to-use machine learning algorithm that often produces good results even without hyper-parameter tuning. It is also one of the most used algorithms because of its simplicity and versatility (it can be used for both classification and regression tasks).
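A quick sketch of fitting both tree-based models on the same split; the depth and estimator values are illustrative placeholders, not the tuned ones:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Single decision tree (placeholder depth, class_weight helps with the label imbalance)
dt = DecisionTreeClassifier(max_depth=10, class_weight="balanced")
dt.fit(X_train, y_train)
print("Decision Tree F1:", f1_score(y_val, dt.predict(X_val)))

# Random forest with placeholder hyperparameters
rf = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1, class_weight="balanced")
rf.fit(X_train, y_train)
print("Random Forest F1:", f1_score(y_val, rf.predict(X_val)))
```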
4. LightGBM: Earlier I wanted to try XGBoost, but it was too slow to train, and as an alternative I couldn’t find a better option than LightGBM. LightGBM is comparatively fast to train, performs well, and requires less memory, so it works well within limited computational power.
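A hedged sketch using LightGBM’s scikit-learn interface; the parameter values are placeholders rather than the tuned ones:

```python
import lightgbm as lgb
from sklearn.metrics import f1_score

# Gradient-boosted trees with placeholder hyperparameters
lgbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=64, n_jobs=-1)
lgbm.fit(X_train, y_train, eval_set=[(X_val, y_val)])
print("LightGBM F1:", f1_score(y_val, lgbm.predict(X_val)))
```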
As my next model, I am going to try a Stacking Classifier using the above models, with the expectation of good results.
5. Stacking Classifier: A stacking classifier works on the principle of “many are better than one”. The more distinct the models, the better the performance, as every model learns different things and, in the final result, all of their learning is merged. While using the stacking classifier, we use the best parameter values each model got during tuning.
Since all the models are already tuned, we aren’t going to do any hyperparameter tuning here.
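A minimal sketch with scikit-learn’s StackingClassifier, reusing the tuned models from above (best_lr, dt, rf, lgbm) as base estimators; the logistic-regression final estimator is an assumption of this sketch:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Combine the already-tuned base models; their out-of-fold predictions feed the final estimator
stack = StackingClassifier(
    estimators=[("lr", best_lr), ("dt", dt), ("rf", rf), ("lgbm", lgbm)],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
    n_jobs=-1,
)
stack.fit(X_train, y_train)
print("Stacking F1:", f1_score(y_val, stack.predict(X_val)))
```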
6. Meta Classifier: Since the results from the above models were not that good, I gave something totally new a try.
[Q]. What is a meta classifier? Also, how does it work?
[A]. A meta classifier is a classifier that classifies on top of other classifiers’ predictions. Not sure what that means? Let’s see it step by step (a code sketch follows the steps).
- We split our training data 50–50.
- Using random sampling on the first half, we create ‘n’ random samples.
- ‘n’ base models are built on these ‘n’ randomly sampled data frames.
- We then predict on the 50% held out in step 1 and store these ‘n’ predictions from the ‘n’ models in a single data frame; let’s call it meta_dataframe.
- Next, we build a meta-model trained on the meta_dataframe. This model is our meta-classifier.
- For test points, each test point is passed through every base model and predictions are made. From these predictions another data frame is created, and the meta-classifier then predicts on it.
- Depending on personal choice, any model can be used as a base model and as the meta-classifier.
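Here is a simplified sketch of those steps; the choice of LightGBM for the base models, the number of models, and the sample sizes are illustrative assumptions, and X_test is the held-out test feature matrix:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Step 1: split the training data 50-50
X_a, X_b, y_a, y_b = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

n_models = 10
base_models = []
meta_features = np.zeros((len(X_b), n_models))
rng = np.random.RandomState(42)

# Steps 2-4: train n base models on random samples of the first half,
# then predict on the second half to build the meta data frame
for i in range(n_models):
    idx = rng.choice(len(X_a), size=len(X_a) // 2, replace=True)
    model = lgb.LGBMClassifier(n_estimators=200)
    model.fit(X_a.iloc[idx], y_a.iloc[idx])
    base_models.append(model)
    meta_features[:, i] = model.predict(X_b)

# Step 5: the meta classifier is trained on the base models' predictions
meta_clf = lgb.LGBMClassifier(n_estimators=200)
meta_clf.fit(meta_features, y_b)

# Step 6: test points go through every base model first, then the meta classifier decides
test_meta = np.column_stack([m.predict(X_test) for m in base_models])
final_pred = meta_clf.predict(test_meta)
```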
Trying the meta-classifier, I realised that the more random samples and base models there are, the better the results. I came to this conclusion after seeing the difference in results between 5 base models and 30. Because of computational resource limitations, I wasn’t able to use more base models.
I used LightGBM as the meta classifier. The reason is that it has a better validation F1 score, and at a smaller depth it gives results that a Decision Tree or Random Forest would only reach at a greater depth.
So far I tried 6 models. I also tried a Multi-Layer Perceptron, but the results were not as good as expected, so I am leaving it out of the blog.
The results of almost all the models are very similar, so to choose the best one we have to make a submission on Kaggle and observe their test scores.
10. Creating Submission File
[Q]. What does creating a submission file mean?
[A]. The models used here are classification models which return 0 or 1. But for the Kaggle submission, we have to submit a file with an ‘order_id’ and the list of ‘product_id’s predicted to be reordered.
[Q]. How do we create submission files when we have 0 and 1 as labels?
[A]. We create a dictionary with ‘order_id’ as its key. For each ‘order_id’, we look for all products predicted with reordered label == 1, build a list of those ‘product_id’s, and set that list as the dictionary value. That is how we get our dictionary.
We then convert this dictionary to a pandas data frame, write it out as a CSV, and submit it on Kaggle.
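A sketch of that conversion, assuming test_data carries the model’s 0/1 predictions in a ‘reordered’ column; the ‘None’ placeholder for empty orders follows the competition’s sample submission format:

```python
import pandas as pd

submission = {}
# Group the predicted-positive rows by order and join their product ids with spaces
for order_id, group in test_data.groupby("order_id"):
    products = group.loc[group["reordered"] == 1, "product_id"].astype(str).tolist()
    submission[order_id] = " ".join(products) if products else "None"

sub_df = pd.DataFrame(list(submission.items()), columns=["order_id", "products"])
sub_df.to_csv("submission.csv", index=False)
```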
11. Comparing Results
We created submission files for each model using the final_predict_function and create_submission_file functions. After submitting each model’s file to Kaggle, here are the results.
Although there isn’t much difference between Decision Tree, Random Forest, and LightGBM, the Decision Tree wins the race by a small margin of 0.00001.
So, the final Kaggle score is 0.36646.
12. Final Functions
Till now we trained various models and made predictions. But we should also have a function such that, if the model gets deployed, it can make predictions at runtime and also report a metric score. The reason the predict function takes a test point and a data frame is that in this problem we have to predict whether a product will be reordered or not; for a brand-new user, there is no notion of a product being reordered. So we take a data frame of the user’s history, create the features from it, and then predict.
1. Predict Function
- This function takes 2 inputs.
- The first is a small data frame for a specific user; the second is the test point for which we have to make predictions.
- Since we have all the columns in the given data frame, we create multiple new data frames, each with its own features, based on the column values of the given small data frame.
- Later in the function, we merge all the data frames created. The given data frame also has the reordered column, which is our true label, so we get the reorder-based features too.
- Our test point does not have the reordered column, as that is what we have to predict.
- We merge a few columns of the given data frame with the merged feature data frame.
- We then drop ‘reordered’ and merge this data frame with the test point, so the test point now has all the features too.
- Next, we load our best model and predict on the data frame as well as on the test point.
- Finally, we add the results as a column in the data frame and build a dictionary from it, with ‘order_id’ as the keys and the list of predicted products for that specific ‘order_id’ and ‘user_id’ as the values. This is our final prediction.
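A compressed, hedged sketch of that flow; create_features is a hypothetical helper standing in for the feature-engineering code of section 7, and best_model.pkl is an assumed path for the saved best model:

```python
import joblib

def final_predict(user_history_df, test_point):
    # Rebuild the engineered features from the user's small history data frame
    feats = create_features(user_history_df)      # hypothetical helper wrapping section 7
    X = feats.drop(columns=["reordered"])         # drop the true label before predicting

    model = joblib.load("best_model.pkl")         # assumed path of the saved best model
    feats["prediction"] = model.predict(X)

    # Attach the same features to the test point and predict on it as well
    x_test = test_point.merge(X, on=["order_id", "product_id"], how="left")
    test_pred = model.predict(x_test[X.columns])

    # Final prediction: order_id -> list of product ids predicted to be reordered
    positives = feats[feats["prediction"] == 1]
    return positives.groupby("order_id")["product_id"].apply(list).to_dict(), test_pred
```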
2. Evaluate Metrics
- The function takes a test_point as input. We look for a data point with the same user_id, product_id, and order_id.
- If the test_point exists in the train data, we get its true label.
- We then load the model, predict on the test_point, and calculate the f1_score against the true label.
- If the test_point doesn’t exist in the train data, we don’t have the true label and can’t compute the f1_score, so we return the message “Since the test point doesn’t exist in the train data, we don’t have a real label, so we can’t return a metric score.”
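A sketch of the metric function under the same assumptions; train_data is the labelled table from section 8, test_point is a dict or Series with user_id, product_id, and order_id, and best_model.pkl is again the assumed saved model:

```python
import joblib
from sklearn.metrics import f1_score

def evaluate_metric(test_point):
    # Look for the same user/product/order combination in the labelled train data
    match = train_data[
        (train_data["user_id"] == test_point["user_id"])
        & (train_data["product_id"] == test_point["product_id"])
        & (train_data["order_id"] == test_point["order_id"])
    ]
    if match.empty:
        return ("Since the test point doesn't exist in the train data, we don't have "
                "a real label, so we can't return a metric score.")

    model = joblib.load("best_model.pkl")         # assumed path of the saved best model
    y_true = match["reordered"].values
    y_pred = model.predict(match.drop(columns=["reordered"]))  # feature columns only in the real code
    return f1_score(y_true, y_pred, average="micro")
```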
13. Future Work
As a further extension of my solution, I would like to try out some Deep Learning Models.
Also, I am looking to try Association Rules and Apriori Algorithm to get some more features.
Improvements To Current Approach
As an improvement to the existing approach, I tried out some new features focused on the ‘reordered’ feature.
I also tried the meta classifier. It worked fine and could have provided a really good score if computational resources had not been a limitation.
14. References
I went through various Kaggle discussions and other sources to get a better understanding of the problem statement, draw some insights, and come up with a good solution.
https://www.lexjansen.com/sesug/2019/SESUG2019_Paper-252_Final_PDF.pdf
https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
https://pdfs.semanticscholar.org/449e/7116d7e2cff37b4d3b1357a23953231b4709.pdf
https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/38126
https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/38161
https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/38159
https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/38100
To check out my entire work you can visit my GitHub repository: https://github.com/Vishal-Mendekar/Instacart-Market-Basket-Analysis