Every day users around the world receive a huge number of mailings: the MailChimp service alone sends about a billion emails daily, and only 20.81% of them are opened.

Every month, users of our sites receive newsletters with materials selected by the editors. About 21% of readers open these letters.

To increase this number, the mailings can be personalized. One way to do this is to add a recommendation system that suggests materials likely to interest a particular reader.

In this article I will describe how to implement a recommendation system from scratch based on collaborative filtering.

The first part of the article covers the theory behind the recommendation system; school-level math is enough to understand it. The second part describes a Python implementation for the data of our site.
Collaborative filtering is probably the simplest approach to building a recommender system. It is based on the idea that similar users like similar objects, for example articles.

How do we determine how similar Vasily is to Ivan, or an article about SQL Server to an article about PostgreSQL?

Let's look at an example. Suppose we have four users: Vasily, Ivan, Inna and Anna, and the site has five articles: Article 1 through Article 5. In the table below, the number at the intersection of a user and an article is the user's rating of that article on a five-point scale; a zero means the user has not rated the article. For example, Vasily liked Articles 1, 3 and 4.
Table 1
| | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 |
|---|---|---|---|---|---|
| Vasily | 4 | 0 | 5 | 5 | 0 |
| Ivan | 0 | 0 | 4 | 5 | 0 |
| Inna | 4 | 2 | 4 | 0 | 0 |
| Anna | 5 | 5 | 0 | 0 | 5 |
Intuitively, we can assume that if users like the same articles, their tastes coincide. Whose interests do you think are most similar to Vasily's?

Vasily's interests are more similar to those of Ivan and Inna and less similar to Anna's. Why this is so will become clear below.

For further work we need to formalize and measure the "similarity" of Vasily and Ivan, or of Inna and Anna.
The easiest way to do this is to treat user ratings as a description of the user's profile. In the example, each row of the table describes one user. The first row, Vasily's description, is a vector of five numbers: [4, 0, 5, 5, 0]; the second, Ivan's, is [0, 0, 4, 5, 0]; the third, Inna's, is [4, 2, 4, 0, 0]; and the fourth, Anna's, is [5, 5, 0, 0, 5].

Now we can introduce a "measure of similarity" between user descriptions.
One way to measure the "similarity" of users is to calculate the cosine distance between the vectors that describe them.
The cosine distance is calculated by the formula:

$$d(u, v) = 1 - \cos(u, v) = 1 - \frac{u \cdot v}{\|u\| \, \|v\|}$$

where $u$ and $v$ are the user description vectors, $u \cdot v$ is their scalar product, and $\|u\|$, $\|v\|$ are the lengths of the vectors.

The meaning of the cosine distance is as follows: if two user description vectors $u$ and $v$ are "similar", the angle between them tends to zero, and the cosine of this angle tends to one. In the ideal case, when the "interests" of two users coincide, the cosine distance between them is zero.
Cosine distance between Vasily and Ivan:

$$d = 1 - \frac{4 \cdot 0 + 0 \cdot 0 + 5 \cdot 4 + 5 \cdot 5 + 0 \cdot 0}{\sqrt{66} \cdot \sqrt{41}} = 1 - \frac{45}{52.02} \approx 0.135$$

Similarly, the cosine distance between Vasily and Anna is about 0.715. That is, Vasily's interests are more like Ivan's than Anna's.
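As a quick sanity check, the same distances can be computed with scipy, whose `cosine` function returns exactly this quantity (one minus the cosine similarity):

```python
import numpy as np
from scipy.spatial.distance import cosine  # returns 1 - cosine similarity

vasily = np.array([4, 0, 5, 5, 0])
ivan = np.array([0, 0, 4, 5, 0])
anna = np.array([5, 5, 0, 0, 5])

print(cosine(vasily, ivan))  # ≈ 0.1349
print(cosine(vasily, anna))  # ≈ 0.7157
```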
Now for the most interesting part: predicting ratings. There are many ways to do it; below we consider two simple options.
The easiest option for calculating the predicted rating is to look at the ratings that "similar" users gave the article and take their average:

$$\hat{r}_{uj} = \frac{1}{|S(u)|} \sum_{v \in S(u)} r_{vj}$$

In this formula, $\hat{r}_{uj}$ is the predicted rating of article $j$ for user $u$, $S(u)$ is the set of users most similar to $u$, and $r_{vj}$ is the rating given to article $j$ by user $v$.
A slightly more complicated option takes the degree of similarity into account: ratings of more similar users should influence the final rating more than ratings of less similar ones:

$$\hat{r}_{uj} = \frac{\sum_{v \in S(u)} \bigl(1 - d(u, v)\bigr) \, r_{vj}}{\sum_{v \in S(u)} \bigl|1 - d(u, v)\bigr|}$$

In this formula, $d(u, v)$ is the cosine distance between users $u$ and $v$, so $1 - d(u, v)$ is their similarity; the remaining notation is the same as above.
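To make both strategies concrete, here is a small sketch on the toy data from Table 1: it predicts Vasily's rating for Article 2 from the ratings of Ivan and Inna, the two users closest to him.

```python
import numpy as np
from scipy.spatial.distance import cosine

vasily = np.array([4, 0, 5, 5, 0])
neighbors = {
    'Ivan': np.array([0, 0, 4, 5, 0]),
    'Inna': np.array([4, 2, 4, 0, 0]),
}

article = 1  # zero-based index of Article 2, which Vasily has not rated

# Option 1: plain average of the neighbors' ratings
mean_rating = np.mean([r[article] for r in neighbors.values()])

# Option 2: average weighted by similarity, 1 - d(u, v)
weights = {name: 1 - cosine(vasily, r) for name, r in neighbors.items()}
weighted_rating = (
    sum(weights[name] * r[article] for name, r in neighbors.items())
    / sum(abs(w) for w in weights.values())
)

print(mean_rating)                # 1.0
print(round(weighted_rating, 2))  # ≈ 0.92
```

The weighted prediction is lower than the plain average because Ivan, who did not rate Article 2, is more similar to Vasily than Inna, who rated it with a 2.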
When creating any recommendation system, you should choose a metric for evaluating the quality of the model, that is, how well the system suggests new materials to the user. One such metric is the root mean square error ($RMSE$): the square root of the average squared error over all user ratings. Formally:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{r}_i - r_i\bigr)^2}$$

In this formula, $\hat{r}_i$ is the predicted rating, $r_i$ is the actual rating given by the user, and $n$ is the number of ratings. In the ideal case, when the predicted ratings coincide with the user's actual ones, $RMSE$ is equal to zero.
Consider an example. Two recommendation systems predicted ratings for Vasily. The results are in the table below.
| | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 |
|---|---|---|---|---|---|
| Vasily | 4 | 0 | 5 | 5 | 0 |
| Recommender system 1 | 1 | 3 | 5 | 2 | 2 |
| Recommender system 2 | 4 | 1 | 5 | 3 | 0 |
It is intuitively clear that the second recommendation system predicted the ratings better than the first. Let's compute $RMSE$ for both:

$$RMSE_1 = \sqrt{\frac{(1-4)^2 + (3-0)^2 + (5-5)^2 + (2-5)^2 + (2-0)^2}{5}} = \sqrt{6.2} \approx 2.49$$

$$RMSE_2 = \sqrt{\frac{(4-4)^2 + (1-0)^2 + (5-5)^2 + (3-5)^2 + (0-0)^2}{5}} = \sqrt{1} = 1$$

As expected, the error of the second recommendation system is significantly lower.
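The same check takes a few lines of numpy:

```python
import numpy as np

actual = np.array([4, 0, 5, 5, 0])  # Vasily's actual ratings
rs1 = np.array([1, 3, 5, 2, 2])     # predictions of recommender system 1
rs2 = np.array([4, 1, 5, 3, 0])     # predictions of recommender system 2

def rmse(predicted, actual):
    return np.sqrt(((predicted - actual) ** 2).mean())

print(round(rmse(rs1, actual), 2))  # 2.49
print(round(rmse(rs2, actual), 2))  # 1.0
```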
We have at our disposal most of the site's data on articles and users: article metadata, tags, user likes, and so on. To implement collaborative filtering, user ratings alone are sufficient.
Hereinafter, the code is written in a deliberately straightforward way to demonstrate the logic of the recommendation system. In real life, it is better to rely on the vectorized operations of numpy and pandas.
```python
import pandas as pd
import numpy as np

ratings_df = pd.read_csv('./input/Ratings.csv')

print('Total ratings:', ratings_df.shape[0])
print('Positive ratings:', ratings_df[ratings_df['Rate']].shape[0])

unique_user_ids = ratings_df[ratings_df['Rate']]['UserId'].unique()
print('Active users:', len(unique_user_ids))

ratings_df.head()
```
```
Total ratings: 15313
Positive ratings: 15121
Active users: 1007
```
| | Id | DocumentId | Rate | UserId |
|---|---|---|---|---|
| 0 | 1 | 1 | True | 5000 |
| 1 | 2 | 878 | True | 2441 |
| 2 | 3 | 1512 | True | 678 |
| 3 | 4 | 1515 | True | 678 |
| 4 | 5 | 877 | True | 5110 |
1007 active users gave 15313 "ratings", of which 15121 are "likes".

The data contains four columns: the row identifier from the database (Id), the object identifier (DocumentId), a flag indicating that the user liked the article (Rate), and the user identifier (UserId).

For convenience, let's add a RateInt column: 1 means the user liked the article, -1 means they did not.
```python
ratings_df['RateInt'] = ratings_df['Rate'].apply(lambda x: 1 if x else -1)
ratings_df.head()
```
| | Id | DocumentId | Rate | UserId | RateInt |
|---|---|---|---|---|---|
| 0 | 1 | 1 | True | 5000 | 1 |
| 1 | 2 | 878 | True | 2441 | 1 |
| 2 | 3 | 1512 | True | 678 | 1 |
| 3 | 4 | 1515 | True | 678 | 1 |
| 4 | 5 | 877 | True | 5110 | 1 |
For further work we need to split the data set into training and test parts: the training set is used to fit the model, and the test set to measure the quality of its predictions.
```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(ratings_df, test_size=0.2)
```
For convenience, we transform each set into a table where rows correspond to user identifiers and columns to article identifiers, by analogy with the example at the beginning of the article.
```python
from tqdm import tqdm_notebook

# All users present in the data
all_users_ids = ratings_df['UserId'].unique()

def create_matrix(df):
    ratings_per_user = []
    post_ids = df['DocumentId'].unique()
    for user_id in tqdm_notebook(all_users_ids):
        row = {'user_id': user_id}
        ratings = df[df['UserId'] == user_id]['DocumentId'].values
        for post_id in post_ids:
            row[str(post_id)] = 1 if post_id in ratings else 0
        ratings_per_user.append(row)
    result = pd.DataFrame(ratings_per_user)
    # Keep user_id as the last column: the code below relies on this order
    return result[[c for c in result.columns if c != 'user_id'] + ['user_id']]

train_df = create_matrix(train)
test_df = create_matrix(test)
```
The matrix matching users to the articles they liked allows us to calculate the cosine distance between users. We only consider users who liked at least one article in the training set:
```python
from scipy import spatial

def cos_distance(x1, x2):
    return spatial.distance.cosine(x1, x2)

# Users with at least one liked article in the training set
feature_columns = train_df.columns[:-1]  # every column except user_id
train_valuable_df = train_df[train_df[feature_columns].sum(axis=1) > 0]
at_least_one_fav_post_users = list(train_valuable_df['user_id'].values)

def calculate_distances(df):
    columns = df.columns[:-1]
    cp = at_least_one_fav_post_users.copy()
    data = []
    for user_id_1 in tqdm_notebook(at_least_one_fav_post_users):
        row = {'user_id': user_id_1}
        for user_id_2 in cp:
            x1 = df[df['user_id'] == user_id_1][columns].values[0]
            x2 = df[df['user_id'] == user_id_2][columns].values[0]
            row[str(user_id_2)] = cos_distance(x1, x2)
        data.append(row)
    return pd.DataFrame(data)

train_distances = calculate_distances(train_valuable_df)
```
Now everything is ready to suggest articles to users that we believe they will like.

We implement the two strategies for calculating recommendations described above: the plain average and the similarity-weighted average of ratings among similar users.

First, we take the 10 users closest to the current one and predict the rating of each article as the average of their ratings:
```python
import heapq

def rmse(predicted, actual):
    return ((predicted - actual) ** 2).mean() ** 0.5

def get_similar(id, n):
    df = train_distances[train_distances['user_id'] == id]
    d = df.to_dict('records')[0]
    # Take n + 1 smallest distances: the closest "neighbor" is the user themselves
    top_similar_ids = heapq.nsmallest(n + 1, d, key=d.get)
    top_similar = df[top_similar_ids]
    return top_similar.to_dict('records')[0]

def get_predictions(id, n):
    top_similar_users = get_similar(id, n)
    top_similar_users_ids = [int(x) for x in top_similar_users.keys()]
    ratings_for_top_similar = train_df[train_df['user_id'].isin(top_similar_users_ids)]
    predicted_ratings = {}
    for article_id in train_df.columns[:-1]:
        predicted_ratings[article_id] = ratings_for_top_similar[article_id].mean()
    return predicted_ratings

# Estimate the error on 50 random users
rand_n_users = train_distances.sample(50)['user_id'].values

err = 0
for u in tqdm_notebook(rand_n_users):
    pred = get_predictions(u, 10)
    err += rmse(test_df[test_df['user_id'] == u][list(pred.keys())].values,
                pd.DataFrame(pred, index=[0]).values)

print(err / len(rand_n_users))
```
For the first approach we got an error of 0.855. Below are the articles with the highest predicted ratings for one of the users:
Article | Predicted rating |
---|---|
DIRECTUM 5.6. New full-text search | 0.6364 |
DIRECTUM 5.6 - more options for comfortable work | 0.6364 |
Development tool development in DIRECTUM 5.5 | 0.6364 |
DIRECTUM Introduces DirectumRX | 0.5455 |
The annual release of DIRECTUM is now 5.1! | 0.5455 |
A to K. DIRECTUM 5.0 is updated again | 0.5455 |
DIRECTUM Jazz - a new mobile solution from DIRECTUM | 0.5455 |
Have you already updated DIRECTUM? | 0.5455 |
DIRECTUM 5.6. Super Columns and Folder Actions | 0.5455 |
GitLab ISBL syntax highlighting | 0.5455 |
The second method takes the degree of user similarity into account. Its implementation is almost identical to the first:
```python
def get_predictions(id, n):
    similar_users = get_similar(id, n)
    user_ids = list(similar_users.keys())
    predicted_ratings = {}
    for article_id in train_df.columns[:-1]:
        numerator = 0
        denominator = 0
        for user_id in user_ids:
            rating = train_df[train_df['user_id'] == int(user_id)][article_id].values[0]
            # Weight each rating by the user's similarity: 1 - cosine distance
            similarity = 1 - similar_users[user_id]
            numerator += rating * similarity
            denominator += np.abs(similarity)
        predicted_ratings[article_id] = numerator / denominator
    return predicted_ratings

err = 0
for u in tqdm_notebook(rand_n_users):
    pred = get_predictions(u, 10)
    err += rmse(test_df[test_df['user_id'] == u][list(pred.keys())].values,
                pd.DataFrame(pred, index=[0]).values)

print(err / len(rand_n_users))
```
In this case we got an error of 0.866, slightly higher than in the first case.
Article | Rating |
---|---|
DIRECTUM 5.6. New full-text search | 0.3095 |
DIRECTUM 5.6 - more options for comfortable work | 0.3095 |
Development tool development in DIRECTUM 5.5 | 0.3095 |
Many DIRECTUM Services - One Administration Tool | 0.2833 |
A to K. DIRECTUM 5.0 is updated again | 0.2809 |
The annual release of DIRECTUM is now 5.1! | 0.2784 |
DIRECTUM Introduces DirectumRX | 0.2778 |
Have you already updated DIRECTUM? | 0.2778 |
DIRECTUM 5.6. Super Columns and Folder Actions | 0.2758 |
DIRECTUM Ario - a new intelligent solution | 0.2732 |
The results can be used in different scenarios: for example, in monthly newsletters with new articles, or in a "You may also be interested" section on the site.
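As an illustration, a minimal helper for such a section might look like the sketch below; the `top_n_articles` function is hypothetical and simply ranks the output of the `get_predictions` function defined above.

```python
def top_n_articles(user_id, n=5):
    # Hypothetical helper: rank articles by predicted rating
    # and return the identifiers of the n best ones
    pred = get_predictions(user_id, 10)
    return sorted(pred, key=pred.get, reverse=True)[:n]

# Article ids to show in the "You may also be interested" block
print(top_n_articles(5000))
```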
In this article I tried to show in detail, using a real task as an example, how to build a recommendation system based on collaborative filtering.

The advantage of this approach is its versatility: the recommendations do not depend on what kind of objects are being recommended. The same system can be used both for blog articles and for products in an online store.
The disadvantages include the cold-start problem: the system cannot recommend anything to a new user, and a new article will not be recommended until ratings for it accumulate. In addition, computing pairwise distances between all users becomes expensive as the audience grows.
In the next article we will consider another approach, based on an analysis of the objects themselves.