Every day users around the world receive a huge number of mailings: the MailChimp service alone sends about a billion emails daily, and only 20.81% of them are opened.

Every month, users of our sites receive newsletters with materials selected by the editors. About 21% of readers open these letters.

To increase this number, the mailings can be personalized. One way to do this is to add a recommendation system that suggests materials likely to interest a particular reader.

In this article I will describe how to implement a recommendation system from scratch based on collaborative filtering.

The first part of the article covers the theory behind the recommendation system; school-level math is enough to understand it. The second part describes a Python implementation for the data of our site.
Collaborative filtering is probably the simplest approach to building a recommender system. It is based on the idea that similar users like similar objects, for example articles.

How do we determine how similar Vasily is to Ivan, or an article about SQL Server to an article about PostgreSQL?

Let's look at an example. Suppose we have four users: Vasily, Ivan, Inna and Anna, and the site has five articles: Article 1 through Article 5. In the table below, the number at the intersection of a user and an article is the user's rating of that article on a five-point scale; a zero means the user has not rated the article. For example, Vasily liked Articles 1, 3 and 4.
Table 1
| | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 |
|---|---|---|---|---|---|
| Vasily | 4 | 0 | 5 | 5 | 0 |
| Ivan | 0 | 0 | 4 | 5 | 0 |
| Inna | 4 | 2 | 4 | 0 | 0 |
| Anna | 5 | 5 | 0 | 0 | 5 |
Intuitively, we can assume that if users like the same articles, their tastes coincide. Whose interests do you think are most similar to Vasily's?

Vasily's interests are more similar to those of Ivan and Inna and less similar to Anna's. Why this is so will become clear below.

For further work we need to formalize and measure the "similarity" of Vasily and Ivan, or of Inna and Anna.
The easiest way to do this is to treat user ratings as a description of the user's profile. In the example, each row of the table describes one user. The first row, Vasily's description, is a vector of five numbers: [4, 0, 5, 5, 0]; the second, Ivan's, is [0, 0, 4, 5, 0]; the third, Inna's, is [4, 2, 4, 0, 0]; and the fourth, Anna's, is [5, 5, 0, 0, 5].

Now we can introduce a "measure of similarity" between user descriptions.
One way to measure the "similarity" of users is to calculate the cosine distance between the vectors that describe them.
The cosine distance is calculated by the formula:

$$d(u, v) = 1 - \cos(u, v) = 1 - \frac{u \cdot v}{\|u\| \, \|v\|}$$

where $u$ and $v$ are the user description vectors, $u \cdot v$ is their scalar product, and $\|u\|$, $\|v\|$ are the lengths of the vectors.

The meaning of the cosine distance is as follows: if two user description vectors $u$ and $v$ are "similar", the angle between them tends to zero, and the cosine of this angle tends to one. In the ideal case, when the "interests" of two users coincide, the cosine distance between them is zero.
Cosine distance between Vasily and Ivan:

$$d = 1 - \frac{4 \cdot 0 + 0 \cdot 0 + 5 \cdot 4 + 5 \cdot 5 + 0 \cdot 0}{\sqrt{66} \cdot \sqrt{41}} = 1 - \frac{45}{52.02} \approx 0.135$$

Similarly, the cosine distance between Vasily and Anna is about 0.715. That is, Vasily's interests are more like Ivan's than Anna's.
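As a quick sanity check, the same distances can be computed with scipy, whose `cosine` function returns exactly this quantity (one minus the cosine similarity):

```python
import numpy as np
from scipy.spatial.distance import cosine  # returns 1 - cosine similarity

vasily = np.array([4, 0, 5, 5, 0])
ivan = np.array([0, 0, 4, 5, 0])
anna = np.array([5, 5, 0, 0, 5])

print(cosine(vasily, ivan))  # ≈ 0.1349
print(cosine(vasily, anna))  # ≈ 0.7157
```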
Now for the most interesting part: predicting ratings. There are many ways to do it; below we consider two simple options.
The easiest option for calculating the predicted rating is to look at the ratings that "similar" users gave the article and take their average:

$$\hat{r}_{uj} = \frac{1}{|S(u)|} \sum_{v \in S(u)} r_{vj}$$

In this formula, $\hat{r}_{uj}$ is the predicted rating of article $j$ for user $u$, $S(u)$ is the set of users most similar to $u$, and $r_{vj}$ is the rating given to article $j$ by user $v$.
A slightly more complicated option takes the degree of similarity into account: ratings of more similar users should influence the final rating more than ratings of less similar ones:

$$\hat{r}_{uj} = \frac{\sum_{v \in S(u)} \bigl(1 - d(u, v)\bigr) \, r_{vj}}{\sum_{v \in S(u)} \bigl|1 - d(u, v)\bigr|}$$

In this formula, $d(u, v)$ is the cosine distance between users $u$ and $v$, so $1 - d(u, v)$ is their similarity; the remaining notation is the same as above.
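To make both strategies concrete, here is a small sketch on the toy data from Table 1: it predicts Vasily's rating for Article 2 from the ratings of Ivan and Inna, the two users closest to him.

```python
import numpy as np
from scipy.spatial.distance import cosine

vasily = np.array([4, 0, 5, 5, 0])
neighbors = {
    'Ivan': np.array([0, 0, 4, 5, 0]),
    'Inna': np.array([4, 2, 4, 0, 0]),
}

article = 1  # zero-based index of Article 2, which Vasily has not rated

# Option 1: plain average of the neighbors' ratings
mean_rating = np.mean([r[article] for r in neighbors.values()])

# Option 2: average weighted by similarity, 1 - d(u, v)
weights = {name: 1 - cosine(vasily, r) for name, r in neighbors.items()}
weighted_rating = (
    sum(weights[name] * r[article] for name, r in neighbors.items())
    / sum(abs(w) for w in weights.values())
)

print(mean_rating)                # 1.0
print(round(weighted_rating, 2))  # ≈ 0.92
```

The weighted prediction is lower than the plain average because Ivan, who did not rate Article 2, is more similar to Vasily than Inna, who rated it with a 2.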
When creating any recommendation system, you should choose a metric for evaluating the quality of the model, that is, how well the system suggests new materials to the user. One such metric is the root mean square error ($RMSE$): the square root of the average squared error over all user ratings. Formally:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{r}_i - r_i\bigr)^2}$$

In this formula, $\hat{r}_i$ is the predicted rating, $r_i$ is the actual rating given by the user, and $n$ is the number of ratings. In the ideal case, when the predicted ratings coincide with the user's actual ones, $RMSE$ is equal to zero.
Consider an example. Two recommendation systems predicted ratings for Vasily. The results are in the table below.
| | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 |
|---|---|---|---|---|---|
| Vasily | 4 | 0 | 5 | 5 | 0 |
| Recommender system 1 | 1 | 3 | 5 | 2 | 2 |
| Recommender system 2 | 4 | 1 | 5 | 3 | 0 |
It is intuitively clear that the second recommendation system predicted the ratings better than the first. Let's compute $RMSE$ for both:

$$RMSE_1 = \sqrt{\frac{(1-4)^2 + (3-0)^2 + (5-5)^2 + (2-5)^2 + (2-0)^2}{5}} = \sqrt{6.2} \approx 2.49$$

$$RMSE_2 = \sqrt{\frac{(4-4)^2 + (1-0)^2 + (5-5)^2 + (3-5)^2 + (0-0)^2}{5}} = \sqrt{1} = 1$$

As expected, the error of the second recommendation system is significantly lower.
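The same check takes a few lines of numpy:

```python
import numpy as np

actual = np.array([4, 0, 5, 5, 0])  # Vasily's actual ratings
rs1 = np.array([1, 3, 5, 2, 2])     # predictions of recommender system 1
rs2 = np.array([4, 1, 5, 3, 0])     # predictions of recommender system 2

def rmse(predicted, actual):
    return np.sqrt(((predicted - actual) ** 2).mean())

print(round(rmse(rs1, actual), 2))  # 2.49
print(round(rmse(rs2, actual), 2))  # 1.0
```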
We have at our disposal most of the site's data on articles and users: article metadata, tags, user likes, and so on. To implement collaborative filtering, user ratings alone are sufficient.
Hereinafter, the code is written in a deliberately straightforward way to demonstrate the logic of the recommendation system. In real life, it is better to rely on the vectorized operations of numpy and pandas.
```python
import pandas as pd
import numpy as np

ratings_df = pd.read_csv('./input/Ratings.csv')

print('Total ratings:', ratings_df.shape[0])
print('Positive ratings:', ratings_df[ratings_df['Rate']].shape[0])

unique_user_ids = ratings_df[ratings_df['Rate']]['UserId'].unique()
print('Active users:', len(unique_user_ids))

ratings_df.head()
```
```
Total ratings: 15313
Positive ratings: 15121
Active users: 1007
```
| | Id | DocumentId | Rate | UserId |
|---|---|---|---|---|
| 0 | 1 | 1 | True | 5000 |
| 1 | 2 | 878 | True | 2441 |
| 2 | 3 | 1512 | True | 678 |
| 3 | 4 | 1515 | True | 678 |
| 4 | 5 | 877 | True | 5110 |
1007 active users gave 15313 "ratings", of which 15121 are "likes".

The data contains four columns: the row identifier from the database (Id), the object identifier (DocumentId), a flag indicating that the user liked the article (Rate), and the user identifier (UserId).

For convenience, let's add a RateInt column: 1 means the user liked the article, -1 means they did not.
```python
ratings_df['RateInt'] = ratings_df['Rate'].apply(lambda x: 1 if x else -1)
ratings_df.head()
```
| | Id | DocumentId | Rate | UserId | RateInt |
|---|---|---|---|---|---|
| 0 | 1 | 1 | True | 5000 | 1 |
| 1 | 2 | 878 | True | 2441 | 1 |
| 2 | 3 | 1512 | True | 678 | 1 |
| 3 | 4 | 1515 | True | 678 | 1 |
| 4 | 5 | 877 | True | 5110 | 1 |
For further work we need to split the data set into training and test parts: the training set is used to fit the model, and the test set to measure the quality of its predictions.
```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(ratings_df, test_size=0.2)
```
For convenience, we transform each set into a table where rows correspond to user identifiers and columns to article identifiers, by analogy with the example at the beginning of the article.
```python
from tqdm import tqdm_notebook

# All users present in the data
all_users_ids = ratings_df['UserId'].unique()

def create_matrix(df):
    ratings_per_user = []
    post_ids = df['DocumentId'].unique()
    for user_id in tqdm_notebook(all_users_ids):
        row = {'user_id': user_id}
        ratings = df[df['UserId'] == user_id]['DocumentId'].values
        for post_id in post_ids:
            row[str(post_id)] = 1 if post_id in ratings else 0
        ratings_per_user.append(row)
    result = pd.DataFrame(ratings_per_user)
    # Keep user_id as the last column: the code below relies on this order
    return result[[c for c in result.columns if c != 'user_id'] + ['user_id']]

train_df = create_matrix(train)
test_df = create_matrix(test)
```
The matrix matching users to the articles they liked allows us to calculate the cosine distance between users. We only consider users who liked at least one article in the training set:
```python
from scipy import spatial

def cos_distance(x1, x2):
    return spatial.distance.cosine(x1, x2)

# Users with at least one liked article in the training set
feature_columns = train_df.columns[:-1]  # every column except user_id
train_valuable_df = train_df[train_df[feature_columns].sum(axis=1) > 0]
at_least_one_fav_post_users = list(train_valuable_df['user_id'].values)

def calculate_distances(df):
    columns = df.columns[:-1]
    cp = at_least_one_fav_post_users.copy()
    data = []
    for user_id_1 in tqdm_notebook(at_least_one_fav_post_users):
        row = {'user_id': user_id_1}
        for user_id_2 in cp:
            x1 = df[df['user_id'] == user_id_1][columns].values[0]
            x2 = df[df['user_id'] == user_id_2][columns].values[0]
            row[str(user_id_2)] = cos_distance(x1, x2)
        data.append(row)
    return pd.DataFrame(data)

train_distances = calculate_distances(train_valuable_df)
```
Now everything is ready to suggest articles to users that we believe they will like.

We implement the two strategies for calculating recommendations described above: the plain average and the similarity-weighted average of ratings among similar users.

First, we take the 10 users closest to the current one and predict the rating of each article as the average of their ratings:
```python
import heapq

def rmse(predicted, actual):
    return ((predicted - actual) ** 2).mean() ** 0.5

def get_similar(id, n):
    df = train_distances[train_distances['user_id'] == id]
    d = df.to_dict('records')[0]
    # Take n + 1 smallest distances: the closest "neighbor" is the user themselves
    top_similar_ids = heapq.nsmallest(n + 1, d, key=d.get)
    top_similar = df[top_similar_ids]
    return top_similar.to_dict('records')[0]

def get_predictions(id, n):
    top_similar_users = get_similar(id, n)
    top_similar_users_ids = [int(x) for x in top_similar_users.keys()]
    ratings_for_top_similar = train_df[train_df['user_id'].isin(top_similar_users_ids)]
    predicted_ratings = {}
    for article_id in train_df.columns[:-1]:
        predicted_ratings[article_id] = ratings_for_top_similar[article_id].mean()
    return predicted_ratings

# Estimate the error on 50 random users
rand_n_users = train_distances.sample(50)['user_id'].values

err = 0
for u in tqdm_notebook(rand_n_users):
    pred = get_predictions(u, 10)
    err += rmse(test_df[test_df['user_id'] == u][list(pred.keys())].values,
                pd.DataFrame(pred, index=[0]).values)

print(err / len(rand_n_users))
```
For the first approach we got an error of 0.855. Below are the articles with the highest predicted ratings for one of the users:
Article | Predicted rating |
---|---|
DIRECTUM 5.6. New full-text search | 0.6364 |
DIRECTUM 5.6 - more options for comfortable work | 0.6364 |
Development tool development in DIRECTUM 5.5 | 0.6364 |
DIRECTUM Introduces DirectumRX | 0.5455 |
The annual release of DIRECTUM is now 5.1! | 0.5455 |
A to K. DIRECTUM 5.0 is updated again | 0.5455 |
DIRECTUM Jazz - a new mobile solution from DIRECTUM | 0.5455 |
Have you already updated DIRECTUM? | 0.5455 |
DIRECTUM 5.6. Super Columns and Folder Actions | 0.5455 |
GitLab ISBL syntax highlighting | 0.5455 |
The second method takes the degree of user similarity into account. Its implementation is almost identical to the first:
```python
def get_predictions(id, n):
    similar_users = get_similar(id, n)
    user_ids = list(similar_users.keys())
    predicted_ratings = {}
    for article_id in train_df.columns[:-1]:
        numerator = 0
        denominator = 0
        for user_id in user_ids:
            rating = train_df[train_df['user_id'] == int(user_id)][article_id].values[0]
            # Weight each rating by the user's similarity: 1 - cosine distance
            similarity = 1 - similar_users[user_id]
            numerator += rating * similarity
            denominator += np.abs(similarity)
        predicted_ratings[article_id] = numerator / denominator
    return predicted_ratings

err = 0
for u in tqdm_notebook(rand_n_users):
    pred = get_predictions(u, 10)
    err += rmse(test_df[test_df['user_id'] == u][list(pred.keys())].values,
                pd.DataFrame(pred, index=[0]).values)

print(err / len(rand_n_users))
```
In this case we got an error of 0.866, slightly higher than in the first case.
Article | Rating |
---|---|
DIRECTUM 5.6. New full-text search | 0.3095 |
DIRECTUM 5.6 - more options for comfortable work | 0.3095 |
Development tool development in DIRECTUM 5.5 | 0.3095 |
Many DIRECTUM Services - One Administration Tool | 0.2833 |
A to K. DIRECTUM 5.0 is updated again | 0.2809 |
The annual release of DIRECTUM is now 5.1! | 0.2784 |
DIRECTUM Introduces DirectumRX | 0.2778 |
Have you already updated DIRECTUM? | 0.2778 |
DIRECTUM 5.6. Super Columns and Folder Actions | 0.2758 |
DIRECTUM Ario - a new intelligent solution | 0.2732 |
The results can be used in different scenarios: for example, in monthly newsletters with new articles, or in a "You may also be interested" section on the site.
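As an illustration, a minimal helper for such a section might look like the sketch below; the `top_n_articles` function is hypothetical and simply ranks the output of the `get_predictions` function defined above.

```python
def top_n_articles(user_id, n=5):
    # Hypothetical helper: rank articles by predicted rating
    # and return the identifiers of the n best ones
    pred = get_predictions(user_id, 10)
    return sorted(pred, key=pred.get, reverse=True)[:n]

# Article ids to show in the "You may also be interested" block
print(top_n_articles(5000))
```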
In this article I tried to show in detail, using a real task as an example, how to build a recommendation system based on collaborative filtering.

The advantage of this approach is its versatility: the recommendations do not depend on what kind of objects are being recommended. The same system can be used both for blog articles and for products in an online store.
The disadvantages include the cold-start problem: the system cannot recommend anything to a new user, and a new article will not be recommended until ratings for it accumulate. In addition, computing pairwise distances between all users becomes expensive as the audience grows.
In the next article we will consider another approach, based on an analysis of the objects themselves.