SANTANDER BANK
PRODUCT RECOMMENDATION SYSTEM
Dhairya Kothari, Michigan Technological University
Rahul Gowla, Michigan Technological University
Santander Bank is a wholly owned subsidiary of the Spanish Santander Group.
The bank wants to improve customer satisfaction by personalizing its product
recommendations, suggesting appropriate products based on each customer's
previous purchases. With such a recommendation system, the bank can meet
individual customer needs. In this project, we concentrated on building a User
Based Collaborative Filtering (UBCF) model because it gives more accurate
results than other methods. Our approach follows four steps: collecting the
required amount of data for the analysis, data preprocessing, implementing
UBCF, and drawing conclusions. Results are evaluated by checking the accuracy
of the model.
Concepts: • Data mining → Recommender systems → Collaborative filtering → Using Jaccard distance
Key Words: Recommender systems, User Based Collaborative Filtering (UBCF),
Jaccard distance, Product Recommender system, R programming
1 INTRODUCTION
Recommender systems carry significant business value because they generate
revenue for the companies that use them. Today, companies such as Amazon and
Netflix, among many others, use recommender systems because they produce useful
results and save their customers a lot of time.
In this project, we are provided with 1.5 years of customer purchasing behavior
for Santander Bank products. The challenge is to predict which products the
customers may buy in the future. To be precise, data from 2015-01-28 to
2016-05-28 is provided, and we need to predict which products those customers
may purchase on 2016-06-28 (the last month). Because we are predicting products
for customers, this problem is classified as a recommender system problem.
2 BACKGROUND
Depending on what needs to be predicted, a recommender system can be designed
as a content-based system or a collaborative filtering system. Content-based
systems examine the properties of an item and recommend items with similar
features. Collaborative filtering systems, on the other hand, recommend items
based on a similarity measure between items or between users. Because we need
to predict the products that customers will purchase in the future, we build a
user based collaborative filtering (UBCF) model. We also deal briefly with
association rules, which mine strong relations between variables. For example,
if a customer buys product A, there is a high chance that the customer also
buys product B. In general, UBCF finds users with similar behavior and
recommends products based on that behavior. A distance is computed between
users, and each user is recommended the products held by the users at minimum
distance. For ease of understanding, we walk through a simple example of how
UBCF works. Figure 2 shows the items that the users purchased in the past.
customer | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 |
A        | X      | X      | X      |        | X      | X      |
B        | X      |        | X      | X      | X      | X      |
C        |        |        | X      | X      | X      |        |
Figure 2
In this example, users A and B have similar behavior. Similarity can be
computed with various methods. In our project, we used Jaccard distance because
it does not work with weights: it considers only two cases, yes if a product
was bought and no if it was not. Because we need to predict which products are
purchased, which is a yes/no outcome, Jaccard distance is a better fit for our
dataset than other methods.
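The comparison in Figure 2 can be checked by hand with a short sketch. The code below is an illustration in Python, not the R code used in the project. The `purchases` dictionary is transcribed from Figure 2, and the function name `jaccard_distance` and the handling of two empty sets are assumptions made for this example.

```python
from itertools import combinations

# Purchases transcribed from Figure 2: customer -> set of item numbers held.
purchases = {
    "A": {1, 2, 3, 5, 6},
    "B": {1, 3, 4, 5, 6},
    "C": {3, 4, 5},
}


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| (0 = identical sets, 1 = disjoint sets)."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty baskets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Distance for every pair of customers.
distances = {
    (u, v): jaccard_distance(purchases[u], purchases[v])
    for u, v in combinations(sorted(purchases), 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))

# The closest pair, and the items the second user holds that the first lacks.
closest = min(distances, key=distances.get)
u, v = closest
print("closest pair:", closest)
print("candidate items for", u, "from", v, ":", sorted(purchases[v] - purchases[u]))
```

Under these assumptions, A and B share four of the six items in their union, giving a distance of about 0.333, the smallest of the three pairs. The only item B holds that A lacks is Item 4, so Item 4 is the candidate recommendation for A.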
3 BUILDING THE MODEL
In this project, we programmed the model in the R programming language. R has a
package called “recommenderlab”, which deals with recommender systems.
Developing the recommender system followed these steps:
3.1 Collecting the required amount of data for the analysis
3.2 Data preprocessing
3.3 Implementing UBCF
3.4 Evaluation
3.1 COLLECTION OF DATASET
The data is provided by kaggle.com, a hub for datasets. Downloading the dataset
was the first challenge: because it is huge (approx. 3 GB), it was hard to
download and to read the file. Excel cannot handle such a large file. We tried
many methods and finally found an R function, “fread” in the data.table
package, which was helpful for handling larger datasets.
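The general idea of reading a large file without loading every column can be sketched as follows. This is a Python illustration using only the standard library, not the `fread` call used in the project. The helper `read_columns` and the column names in the toy CSV are hypothetical and do not come from the dataset.

```python
import csv
import io


def read_columns(handle, keep):
    """Stream a CSV one row at a time, keeping only the columns in `keep`."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield {name: row[name] for name in keep}


# Toy stand-in for the large file; the column names are made up.
sample_csv = io.StringIO(
    "month,customer_id,age,savings_account\n"
    "2015-01-28,1001,23,0\n"
    "2015-01-28,1002,47,1\n"
)

rows = list(read_columns(sample_csv, keep=["customer_id", "savings_account"]))
print(rows)
```

Because rows are produced one at a time, memory use depends on the columns kept rather than on the full width of the file.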
3.2 DATA PREPROCESSING
We did not use all the variables in the dataset. We dropped variables that are
not significant and used only a few of them for our prediction. The
distribution of the age variable is interestingly bimodal, with one peak around
ages 18 to 20 and another around ages 45 to 50 (Figure 3.2). The dataset has a
total of 13647309 rows and 48 variables, which is very large. Because it is
expensive to run our model on the entire dataset, we sampled it. The user id
column holds the unique customer ID given to each customer of Santander Bank.
We found that on average there were 625457 customers in each month and that
each customer's behavior was recorded in every month. For convenience, we
randomly sampled customers from one month and kept only those customers in all
the months of our dataset. Specifically, we took a sample of 100000 customers
in each month so that the data is balanced and not biased in one direction.
Figure 3.2
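The sampling step described in this section can be sketched as follows. This is a Python illustration under stated assumptions, not the project's R code. The function `sample_customers`, the field names `month` and `customer_id`, the fixed seed, and the toy data are all hypothetical.

```python
import random


def sample_customers(rows, month, n, seed=0):
    """Pick n customer ids at random from one month, then keep every row
    (from any month) that belongs to one of the sampled customers."""
    ids_in_month = sorted({r["customer_id"] for r in rows if r["month"] == month})
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = set(rng.sample(ids_in_month, min(n, len(ids_in_month))))
    return [r for r in rows if r["customer_id"] in chosen]


# Toy data: 10 customers observed in each of 3 months.
months = ["2015-01-28", "2015-02-28", "2015-03-28"]
rows = [{"month": m, "customer_id": c} for m in months for c in range(10)]

sampled = sample_customers(rows, month="2015-01-28", n=4)
per_month = {m: sum(r["month"] == m for r in sampled) for m in months}
print(per_month)
```

Sampling customer IDs once and filtering all months by those IDs keeps the same customers in every month, which is what makes the monthly samples equal in size in this toy case.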
3.3 IMPLEMENTING UBCF
In this project, we implemented the User Based Collaborative Filtering method.
UBCF computes the similarity between users, finds the users with similar
behavior, and recommends products based on their past purchases.
R has a package called recommenderlab, which deals with recommender systems.
Using this package, we can easily build a UBCF model. The distance calculation
method can be specified in the model parameters. Here we used Jaccard
similarity to compute the similarity between users.
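To make the mechanics concrete, here is a minimal from-scratch sketch of user based collaborative filtering with Jaccard similarity. It is written in Python for illustration and is not the recommenderlab implementation or the project's code. The function names, the `nn` and `top_n` parameters, and the similarity-weighted scoring rule are assumptions. The toy data is the purchase table from Figure 2.

```python
def jaccard_similarity(a, b):
    """|a & b| / |a | b|, with 0.0 when both sets are empty (a convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, nn=5, top_n=3):
    """Recommend up to top_n items for `target` from its nn most similar users."""
    owned = purchases[target]
    # Rank all other users by similarity to the target and keep the top nn.
    neighbours = sorted(
        (
            (jaccard_similarity(owned, items), user)
            for user, items in purchases.items()
            if user != target
        ),
        reverse=True,
    )[:nn]
    # Score each item the target lacks by the summed similarity of the
    # neighbours who hold it (an assumed scoring rule for this sketch).
    scores = {}
    for sim, user in neighbours:
        for item in purchases[user] - owned:
            scores[item] = scores.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


# Purchases transcribed from Figure 2.
purchases = {
    "A": {1, 2, 3, 5, 6},
    "B": {1, 3, 4, 5, 6},
    "C": {3, 4, 5},
}

print(recommend("A", purchases, nn=1))
print(recommend("C", purchases, nn=2))
```

With one neighbor, customer A's nearest user is B, so the sketch suggests Item 4. With two neighbors, customer C receives Items 1 and 6 first, since both neighbors hold them, followed by Item 2.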
4 RESULTS/EVALUATION
Evaluating the results was a difficult task. Because our data contains a very
large number of samples, we calculated traditional accuracy on a sample. First,
we took the products purchased in May as testing data. There were more than
800000 observations in this month, which was too expensive to compute on, so we
sampled 5000 of them (chosen to represent the median points of their nearest
data points, and so to approximate the complete dataset). Using 5 nearest
neighbors, we computed the traditional accuracy by comparing what we
recommended with what the customer bought anyway without the system. We
obtained 22%. We consider this decent accuracy because the model was run on
only a small fraction of the dataset. We expect that using 10 nearest neighbors
with a much larger number of samples would increase the accuracy further. Based
on the model, we recommend products to the customers.
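The accuracy measure is not defined precisely in this paper. One possible reading, sketched below in Python, counts a customer as a hit when at least one recommended product is among the products that customer actually bought in the test month. The function `hit_rate`, this definition of a hit, and the toy data are assumptions for illustration and may differ from the computation behind the reported figure.

```python
def hit_rate(recommended, bought):
    """Share of customers with at least one recommended product among
    the products they actually bought in the test month."""
    customers = [c for c in recommended if c in bought]
    if not customers:
        return 0.0
    hits = sum(1 for c in customers if set(recommended[c]) & set(bought[c]))
    return hits / len(customers)


# Toy data: recommendations and actual test-month purchases for 4 customers.
recommended = {
    "u1": ["loan", "card"],
    "u2": ["pension"],
    "u3": ["card"],
    "u4": ["funds"],
}
bought = {
    "u1": {"card"},
    "u2": {"loan"},
    "u3": set(),
    "u4": {"funds", "card"},
}

print(hit_rate(recommended, bought))
```

In this toy case, customers u1 and u4 each bought a recommended product, so the hit rate is 2 out of 4.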
We also suggest implementing more informative performance metrics to track the
system, such as the number of clicks, the number of people actually buying the
products after the system is deployed, and the percentage increase in sales.
In this project, the main challenge we faced was the dataset. Because it was
massive, we could not read it directly into R, and we had to do a lot of
preprocessing. After that, we were able to recommend products to the customers
for the month of May. We also examined the age histogram of the customers and
observed that the age groups 18 to 25 and 40 to 50 hold the largest number of
accounts and perform the largest number of transactions. Concentrating on
customers in these age groups and recommending products to them is likely to
boost the company's profits.
Because our dataset is quite large and code execution takes significant time,
for convenience we have saved the rating matrix in RatingMatrix.csv
(dimensions: users × products) and the final recommendations for the test users
in Recommendation.csv.
5 FUTURE WORK
Cold start: in this project, we recommended products to each customer based on
their purchase history compared with other users with similar behavior. A new
customer in the dataset has no history, so it is hard to recommend products to
them. This is called the cold start problem. We did not deal with the cold
start problem in this project; it can be addressed in future work.
REFERENCES
[1] https://www.kaggle.com/c/santander-product-recommendation
[2] http://www.snet.tu-berlin.de/fileadmin/fg220/courses/SS11/snet-project/recommender-systems_asanov.pdf
[3] https://recsys.acm.org/recsys16/
[4] Markus Zanker and Markus Jessenitschnig. Collaborative feature-combination
recommender exploiting explicit and implicit user feedback. University
Klagenfurt, Intelligent Systems and Business Informatics Research Group,
Klagenfurt, Austria.
[5] https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/
[6] Lior Rokach and Bracha Shapira. Recommender Systems Handbook.
[7] P. Resnick and H. R. Varian, “Recommender systems,” Commun. ACM, vol. 40,
pp. 56–58, March 1997.