Sentiment analysis of movie reviews from Kinopoisk

Introduction



Natural Language Processing (NLP) is a popular and important area of machine learning. In this post I will describe my first project: sentiment analysis of movie reviews, written in Python. Sentiment analysis is a common task among those who want to master the basic concepts of NLP, and it can serve as the 'Hello world' of this field.



In this article, we will go through all the main stages of the Data Science process: creating our own dataset, processing it and extracting features with the NLTK library, and finally training and tuning a model with scikit-learn. The task itself is to classify reviews into three classes: negative, neutral and positive.



Data Corpus Formation



To solve this problem, one could use a ready-made, annotated corpus of IMDB reviews, of which there are many on GitHub. But I decided to build my own, with reviews in Russian taken from Kinopoisk. To avoid copying them manually, we will write a web parser. I will use the requests library to send HTTP requests and BeautifulSoup to process the HTML pages. First, we define a function that takes a link to a movie's reviews and retrieves them. So that Kinopoisk does not recognize us as a bot, we must pass the headers argument to requests.get, which simulates the behavior of a browser: a dictionary with the keys User-Agent, Accept-Language and Accept, whose values can be found in the browser's developer tools. Then a parser object is created and the reviews are retrieved from the page; they live in HTML elements with the class _reachbanner_.
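Note that the load_data function below relies on a headers dictionary defined beforehand. A minimal sketch of what it could look like is shown here; the values are only placeholders, and you should copy the real ones from your own browser's developer tools.

headers = {
    # illustrative values only - take the real ones from your browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36',
    'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}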



import requests
from bs4 import BeautifulSoup
import numpy as np
import time
import os

def load_data(url):
    r = requests.get(url, headers=headers)            # send the http request
    soup = BeautifulSoup(r.text, 'html.parser')       # parse the html page
    reviews = soup.find_all(class_='_reachbanner_')   # find the review blocks
    reviews_clean = []
    for review in reviews:                            # strip the html markup from each review
        reviews_clean.append(review.find_all(text=True))
    return reviews_clean





We got rid of the HTML markup, but our reviews are still BeautifulSoup objects and need to be converted to strings. The convert function does just that. We will also write a function that retrieves the movie's title, which will later be used to name the saved reviews.



def convert(reviews):
    # convert the BeautifulSoup text fragments into plain strings
    review_converted = []
    for review in reviews:
        review = ''.join(map(str, review))   # glue the text fragments into one string
        review_converted.append(review)
    return review_converted

def get_name(url):
    # extract the movie title, used later to name the saved files
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    name = soup.find(class_='alternativeHeadline')
    name_clean = name.find_all(text=True)
    # the title is wrapped in extra markup, so take the first text fragment
    return str(name_clean[0])





The last function of the parser takes a link to the movie's main page, a review class (status) and a path where the reviews should be saved. The function also defines delays between requests, which are necessary to avoid a ban. It contains a loop that retrieves and stores reviews starting from the first page until it hits a non-existent page, for which load_data returns an empty list and the loop breaks.



def parsing(url, status, path):
    page = 1
    delays = [11, 12, 13, 11.5, 12.5, 13.5, 11.2, 12.3, 11.8]
    name = get_name(url)
    time.sleep(np.random.choice(delays))   # random pause between requests
    while True:
        loaded_data = load_data(url + 'reviews/ord/date/status/{}/perpage/200/page/{}/'.format(status, page))
        if loaded_data == []:
            break
        else:
            # if there is no folder for this class yet, create it
            if not os.path.exists(path + r'\{}'.format(status)):
                os.makedirs(path + r'\{}'.format(status))
            converted_data = convert(loaded_data)
            # save each review to its own text file
            for i, review in enumerate(converted_data):
                with open(path + r'\{}\{}_{}_{}.txt'.format(status, name, page, i), 'w', encoding='utf-8') as output:
                    output.write(review)
            page += 1
            time.sleep(np.random.choice(delays))





Then, using the loop below, we can extract reviews for the movies listed in urles. The list of movies has to be created manually. One could, for example, write a function that extracts links from Kinopoisk's top 250 films so as not to do this by hand, but 15-20 films are enough to form a small dataset of about a thousand reviews per class. Also, if you get banned, the program will print on which film and class the parser stopped, so you can continue from the same place once the ban is lifted.



path =    # path to the folder where the reviews will be saved
urles =   # list of links to the movies' main pages
statuses = ['good', 'bad', 'neutral']
delays = [15, 20, 13, 18, 12.5, 13.5, 25, 12.3, 23]

for url in urles:
    for status in statuses:
        try:
            parsing(url=url, status=status, path=path)
            print('one category done')
            time.sleep(np.random.choice(delays))
        # when we get banned, the request fails with AttributeError
        except AttributeError:
            print('stopped at: {}, {}'.format(url, status))
            break
    # the else block runs only if the inner loop finished without a break,
    # i.e. all three review classes of this movie were parsed successfully
    else:
        print('one url done')
        continue
    break





Preprocessing



After writing the parser, feeding it a few random films and collecting several bans from Kinopoisk, I shuffled the reviews in the folders and selected 900 reviews from each class for training, leaving the rest as the control group. Now we need to preprocess the corpus, namely tokenize and normalize it. Tokenizing means breaking the text down into components, in this case into words, since we will use the bag-of-words representation. Normalization consists of converting words to lower case, removing stop words and other noise, stemming, and any other tricks that help reduce the feature space.



We import the necessary libraries.



from nltk.corpus import PlaintextCorpusReader
from nltk.stem.snowball import SnowballStemmer
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk import bigrams
from nltk import pos_tag
from collections import OrderedDict, defaultdict
from sklearn.metrics import classification_report, accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.utils import shuffle
from multiprocessing import Pool
import numpy as np
from scipy.sparse import csr_matrix







We start by defining a few small functions for text preprocessing. The first, lower_pos_tag, takes a list of words, converts them to lower case and stores each token in a tuple together with its part of speech. Annotating words with their part of speech is called part-of-speech (POS) tagging and is often used in NLP, for example to extract entities. In our case, the parts of speech are used in the next function to filter words.



def lower_pos_tag(words):
    # lower-case every token and tag it with its part of speech
    lower_words = []
    for i in words:
        lower_words.append(i.lower())
    pos_words = pos_tag(lower_words, lang='rus')
    return pos_words





The texts contain many words that occur too often to be useful for the model (so-called stop words). These are mainly prepositions, conjunctions and pronouns from which it is impossible to tell which class a review belongs to. The clean function keeps only nouns, adjectives, verbs and adverbs. Note that it also drops the part-of-speech tags, since they are not needed by the model itself. You may also notice that this function applies stemming, which strips suffixes and endings from words. This reduces the dimensionality of the feature space, since words differing only in gender or case are reduced to the same token. There is a more powerful alternative to stemming, lemmatization, which restores the initial form of a word; however, it works more slowly, and NLTK does not ship a Russian lemmatizer.
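As a quick illustration of what stemming does to different forms of the same word, here is a tiny sketch with made-up examples (the exact stems may differ slightly between NLTK versions):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("russian")
for word in ['хороший', 'хорошая', 'хорошего']:
    print(word, '->', stemmer.stem(word))
# all three forms collapse to the same stem (roughly 'хорош'),
# which is exactly how stemming shrinks the feature space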



def clean(words):
    # keep only nouns (S), adjectives (A), verbs (V) and adverbs (ADV), then stem them
    stemmer = SnowballStemmer("russian")
    cleaned_words = []
    for i in words:
        if i[1] in ['S', 'A', 'V', 'ADV']:
            cleaned_words.append(stemmer.stem(i[0]))
    return cleaned_words





Next, we write the final function, which takes a class label and processes all reviews of that class. To read the corpus we use the raw method of a PlaintextCorpusReader object, which extracts the text from a given file. Tokenization is then done with a RegexpTokenizer, which works from a regular expression. In addition to individual words, I added bigrams to the model, i.e. all pairs of neighboring words. This function also uses a FreqDist object, which returns word frequencies; here it removes words that appear only once across all reviews of the class (such words are called hapaxes). As a result, the function returns a dictionary containing the documents of the class represented as bags of words and a list of all the words of that class.
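Before looking at the function itself, here is a small toy example (not the real corpus) of the two NLTK helpers it relies on, bigrams and FreqDist:

from nltk import bigrams
from nltk.probability import FreqDist

tokens = ['хорош', 'фильм', 'хорош', 'игра', 'актер']

print(list(bigrams(tokens)))
# [('хорош', 'фильм'), ('фильм', 'хорош'), ('хорош', 'игра'), ('игра', 'актер')]

freq = FreqDist(tokens)
print(freq.hapaxes())        # words that occur exactly once
print(freq.most_common(2))   # the two most frequent tokens with their counts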



corpus_root =   # path to the folder with the review corpus

def process(label):
    # Word_matrix - the documents of this class as bags of words
    # All_words   - all words of this class (without hapaxes)
    data = {'Word_matrix': [], 'All_words': []}
    # temporary list of all words from all reviews of this class
    templist_allwords = []
    # corpus reader for the folder with this class of reviews
    corpus = PlaintextCorpusReader(corpus_root + '\\' + label, '.*', encoding='utf-8')
    # names of all files in the folder
    names = corpus.fileids()
    # tokenizer based on a regular expression
    tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')
    for i in range(len(names)):
        # tokenize, lower-case, POS-tag, filter and stem each review
        bag_words = tokenizer.tokenize(corpus.raw(names[i]))
        lower_words = lower_pos_tag(bag_words)
        cleaned_words = clean(lower_words)
        final_words = list(bigrams(cleaned_words)) + cleaned_words
        data['Word_matrix'].append(final_words)
        templist_allwords.extend(cleaned_words)
    # find the hapaxes - words that occur only once in this class
    templistfreq = FreqDist(templist_allwords)
    hapaxes = templistfreq.hapaxes()
    # keep everything except the hapaxes
    for word in templist_allwords:
        if word not in hapaxes:
            data['All_words'].append(word)
    return {label: data}





The preprocessing stage is the longest, so it makes sense to parallelize it. This can be done with the multiprocessing module. In the next piece of code, I start three processes that simultaneously process the three folders with the different classes, and the results are then collected into a single dictionary. This completes the preprocessing.



if __name__ == '__main__':
    data = {}
    labels = ['neutral', 'bad', 'good']
    p = Pool(3)                       # three worker processes, one per class folder
    result = p.map(process, labels)
    for i in result:
        data.update(i)
    p.close()





Vectorization



After preprocessing the corpus, we have a dictionary where each class label maps to a list of reviews that have been tokenized, normalized and enriched with bigrams, as well as a list of words from all reviews of that class. Since the model cannot work with natural language the way we do, the task now is to represent the reviews numerically. To do this, we build a shared vocabulary of unique tokens and use it to vectorize each review.



To begin with, we create a list that contains the reviews of all classes together with their labels. Next, we build the shared vocabulary, taking from each class the 10,000 most common words with the most_common method of the same FreqDist. As a result, I got a vocabulary of about 17,000 words.



# list of labeled documents in the form:
# [([tokens of the review], class_label)]
labels = ['neutral', 'bad', 'good']
labeled_data = []
for label in labels:
    for document in data[label]['Word_matrix']:
        labeled_data.append((document, label))

# take the 10,000 most frequent words of each class
all_words = []
for label in labels:
    frequency = FreqDist(data[label]['All_words'])
    common_words = frequency.most_common(10000)
    words = [i[0] for i in common_words]
    all_words.extend(words)

# shared vocabulary of unique tokens
unique_words = list(OrderedDict.fromkeys(all_words))





There are several ways to vectorize text, the most popular being TF-IDF, one-hot encoding and frequency (count) encoding. I used frequency encoding, where each review is represented as a vector whose elements are the number of occurrences of each vocabulary word in that review. NLTK has its own classifiers and you can use them too, but they are slower than their scikit-learn counterparts and have fewer settings; the NLTK-style encoding is still shown below. However, I will use the Naive Bayes model from scikit-learn, encoding the reviews with the help of a sparse matrix from SciPy and storing the class labels in a separate NumPy array.
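To make the idea of frequency encoding concrete, here is a toy sketch with a made-up vocabulary and review (not the real data): each review becomes a vector of counts over the shared vocabulary.

vocabulary = ['фильм', 'отличн', 'скучн', 'актер']   # toy vocabulary of stemmed tokens
review_tokens = ['отличн', 'фильм', 'отличн']        # one toy preprocessed review

vector = [review_tokens.count(word) for word in vocabulary]
print(vector)   # [1, 2, 0, 0]

scikit-learn's CountVectorizer implements the same idea for raw text, but here the encoding is done manually because the corpus is already tokenized.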



# encoding for the NLTK classifiers in the form:
# [({word: number of occurrences in the review}, class_label)]
prepared_data = []
for x in labeled_data:
    d = defaultdict(int)
    for word in unique_words:
        d[word] = x[0].count(word)   # count of this vocabulary word in the review
    prepared_data.append((d, x[1]))

# encoding for the scikit-learn classifier
# feature matrix: one row per review, one column per vocabulary token
matrix_vec = csr_matrix((len(labeled_data), len(unique_words)), dtype=np.int8).toarray()
# array of class labels (object dtype so the full label string fits)
target = np.zeros(len(labeled_data), dtype=object)
for index_doc, document in enumerate(labeled_data):
    for index_word, word in enumerate(unique_words):
        # number of occurrences of each vocabulary word in the review
        matrix_vec[index_doc, index_word] = document[0].count(word)
    target[index_doc] = document[1]

# shuffle features and labels in unison
X, Y = shuffle(matrix_vec, target)





Since in the dataset the reviews with a given label follow one another (first all neutral, then all negative, and so on), they need to be shuffled. For this we can use the shuffle function from scikit-learn. It is perfectly suited to the situation where features and class labels live in different arrays, because it permutes both arrays in unison.
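A minimal sketch of this behavior on made-up arrays: shuffle applies the same random permutation to both arrays, so the correspondence between rows and labels is preserved.

from sklearn.utils import shuffle
import numpy as np

features = np.array([[1, 0], [0, 2], [3, 1]])
labels = np.array(['good', 'bad', 'neutral'])

# the same permutation is applied to both arrays,
# so features_s[i] still corresponds to labels_s[i]
features_s, labels_s = shuffle(features, labels, random_state=42)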



Model training



Now it remains to train the model and check its accuracy on the control group. We will use a Naive Bayes classifier. Scikit-learn offers three Naive Bayes models depending on the distribution of the data: binary, discrete and continuous. Since our features are discrete counts, we choose MultinomialNB.
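For reference, the three variants correspond to the following scikit-learn classes (just a reminder, not code from the project):

from sklearn.naive_bayes import BernoulliNB    # binary (presence/absence) features
from sklearn.naive_bayes import MultinomialNB  # discrete counts, e.g. a bag of words
from sklearn.naive_bayes import GaussianNB     # continuous features assumed to be Gaussian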



The Bayesian classifier has the hyperparameter alpha, which is responsible for smoothing. Naive Bayes computes the probability of a review belonging to each class by multiplying the conditional probabilities of all the review's words given that class. But if some word of a review did not occur in the training set, its conditional probability is zero, which zeroes out the probability of the review belonging to any class. To avoid this, one is added by default to all conditional word counts, i.e. alpha equals one. However, this value may not be optimal, so it is worth searching for alpha with grid search and cross-validation.
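To make the role of alpha more tangible, here is a small hand-worked sketch of additive (Laplace) smoothing with made-up counts; the numbers are purely illustrative.

# conditional probability of a word given a class in multinomial Naive Bayes:
# P(w | c) = (count(w, c) + alpha) / (total_words_in_c + alpha * vocabulary_size)
count_w_c = 0         # the word never occurred in class c during training
total_words_c = 5000  # made-up total token count of class c
vocab_size = 17000    # roughly the vocabulary size used in this article

for alpha in [0, 0.1, 1]:
    denominator = total_words_c + alpha * vocab_size
    p = (count_w_c + alpha) / denominator if denominator else 0.0
    print(alpha, p)
# with alpha = 0 an unseen word gets probability 0 and wipes out the whole product;
# any alpha > 0 keeps the probability small but non-zero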



parameter = [1, 0, 0.1, 0.01, 0.001, 0.0001]
param_grid = {'alpha': parameter}
grid_search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
grid_search.fit(X, Y)
Alpha, best_score = grid_search.best_params_, grid_search.best_score_





In my case, the grid search gives an optimal hyperparameter value of 0 with an accuracy of 0.965. However, such a value is clearly not optimal for the control dataset, since it contains many words never seen in the training set. On the control dataset this model has an accuracy of 0.598. But if alpha is increased to 0.1, the accuracy on the training data drops to 0.82 while on the control data it rises to 0.62. On a larger dataset the difference would most likely be even more significant.



model = MultinomialNB(alpha=0.1)
model.fit(X, Y)
# X_control, Y_control are built from the control reviews in the same way as X and Y
# predictions of the model on the control data
predicted = model.predict(X_control)
# accuracy on the control data
score_test = accuracy_score(Y_control, predicted)
# per-class precision, recall and f1-score
report = classification_report(Y_control, predicted)







Conclusion



The model is meant to be used on reviews whose words were not used to build the vocabulary. Therefore, its quality should be judged by the accuracy on the control part of the data, which is 0.62. That is almost twice as good as random guessing, but the accuracy is still rather low.



The classification report shows that the model copes worst with neutral reviews (0.47 versus 0.68 for positive and 0.76 for negative). Indeed, neutral reviews contain words characteristic of both positive and negative ones. The accuracy could probably be improved by enlarging the dataset, since a corpus of about three thousand reviews is rather modest. One could also reduce the problem to a binary classification into positive and negative reviews, which would likewise increase the accuracy.



Thanks for reading.



P.S. If you want to practice on your own, my dataset can be downloaded via the link below.



Link to dataset


