Task: extracting key expressions from Russian-language text. Python NLP

What was needed at the very beginning:





I believe many people face this problem, at least implicitly, every day after writing or analyzing an article, post, comment, note, report, etc. In my line of work I had to deal with it many times a day, so it is fair to say that "laziness", in the good sense of the word, led me to the idea of automation.



Now, as I write this article, the idea has survived, but the data set of the final result has changed considerably:





The results of the nrlpk software (Natural Russian Language Processing by the Keys) prepare data for:





Quality



So as not to drag through the whole article those who trust only numbers rather than words, and those who expect absolute quality and accept nothing less...



The quality of the sample is in the range of 95–100% when testing articles of up to 3,500 words. The scatter depends on the quality and complexity of the writing. Here is an example of one of the articles involved in the testing, along with the result of its automatic analysis.



About 7–10% should be subtracted from this quality interval, i.e. the actual quality level will likely be 85–93%. This is due to the fact that:





A full list of the articles that were tested, along with detailed statistics of the results, can be found on GitHub.



What specifically affected the quality of the result in each article can be found in the Reasons file on GitHub.



How to read the results



Each folder for a given analyzed article contains five files with the data set in Unicode:



  1. words.csv - a list of relevant words, including unidentified ones;
  2. keys.csv - a list of key expressions; besides marked expressions, it now also includes words repeated in the text at least a specified number of times - in this case, at least 4 times;
  3. garbage.csv - a list of unidentified words;
  4. descr_words.csv - a description (statistics) of the list of all words in the text;
  5. descr_keys.csv - a description (statistics) of the list of key expressions;


In addition, reasons_quality.txt is an optional list of expressions from the article that were selected manually and either missed the keys or got into them incorrectly (in the opinion of the author of nrlpk).
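The repetition rule behind keys.csv can be sketched in a few lines of Python. This is a minimal illustration of the idea only: the naive whitespace tokenization and the function name are assumptions here, not the actual nrlpk pipeline.

```python
from collections import Counter

def frequent_words(text, min_repeats=4):
    """Return words occurring at least `min_repeats` times (naive split)."""
    counts = Counter(text.lower().split())
    return {w: n for w, n in counts.items() if n >= min_repeats}

# "данные" appears 4 times and passes the threshold; "текст" (2 times) does not.
sample = "данные данные данные данные текст текст"
print(frequent_words(sample))  # {'данные': 4}
```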



You can learn how to read these files from the Legend file on GitHub.



nrlpk allows you to get any data set in one of the following formats:





Testing methodology



  1. Software (automatic) analysis of the text.
  2. Manual (by eye) identification and marking of key expressions, then reconciliation of that list of key expressions with the list obtained automatically.
  3. Calculation of the quality percentage: the ratio of (the number of expressions that were missed or got into the keys incorrectly, plus the number of words in the garbage) to the total number of words in the text.
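Step 3 reads directly as code. A minimal sketch; the function name and the numbers in the usage line are invented for illustration:

```python
def quality_percent(missed_or_wrong: int, garbage_words: int, total_words: int) -> float:
    """Quality as described above: errors are missed or wrongly captured
    expressions plus garbage words; quality is the remaining share of
    the total word count, in percent."""
    error_share = (missed_or_wrong + garbage_words) / total_words
    return round((1 - error_share) * 100, 1)

# Illustrative numbers only: 35 problem words out of 3,500 gives 99.0%.
print(quality_percent(missed_or_wrong=12, garbage_words=23, total_words=3500))
```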


Instruments



nrlpk is written in Python 3.7.0. Already during development of the future nrlpk software, two mandatory requirements emerged:





These requirements cast doubt on using NLTK and pymorphy2, which otherwise could have solved some of the tasks.



To resolve the doubts, a selection of media texts was labeled manually, taken from VPK.Name, the largest Russian-language news aggregator on the military-industrial complex. Analysis of the labeling revealed:





In addition, already at this stage it became obvious that a variety of statistical information about the processed objects had to be collected and stored.



Given these factors, Pandas was chosen as the base package for working with data; in addition to the tasks described above, it made it possible to carry out batch lemmatization.
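As a rough illustration of how batch lemmatization with Pandas might look, a single vectorized merge against a form-to-lemma table lemmatizes an entire token list at once. The dictionary layout below is an assumption for illustration, not the actual nrlpk structure:

```python
import pandas as pd

# Hypothetical dictionary fragment: word form -> lemma.
dictionary = pd.DataFrame({
    "form":  ["танки", "танк", "летят", "лететь"],
    "lemma": ["танк", "танк", "лететь", "лететь"],
})

tokens = pd.DataFrame({"form": ["танки", "летят"]})

# One merge lemmatizes the whole token list instead of looking up
# words one by one.
lemmatized = tokens.merge(dictionary, on="form", how="left")
print(lemmatized["lemma"].tolist())  # ['танк', 'лететь']
```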



After analyzing the available dictionaries of the Russian language, OpenCorpora was taken as the basis (which, incidentally, is also used by pymorphy2).

It was transformed into a form convenient for working with Pandas, after which the following dictionaries were extracted from it:





The dictionaries are saved as Unicode in a simple text format, so they can be managed from any external program.
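Because the dictionaries are plain Unicode text, they can be opened in any editor or loaded back into Pandas in one call. The tab-separated form/lemma/grammeme layout below is an assumption for illustration, not the actual nrlpk file format:

```python
import io
import pandas as pd

# An in-memory stand-in for one dictionary file; a real file would be
# read the same way with pd.read_csv("dict.txt", sep="\t", ...).
raw = "танки\tтанк\tNOUN\nлетят\tлететь\tVERB\n"
dictionary = pd.read_csv(io.StringIO(raw), sep="\t",
                         names=["form", "lemma", "grammeme"])
print(dictionary.shape)  # (2, 3)
```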



Since nrlpk and pymorphy2 share the same dictionary basis, the designations of parts of speech (grammemes) are identical. The number of markers (non-standard grammemes) is currently 16, and most of them, when the marked expression does not consist of several words, also carry the part-of-speech designation of the base grammeme in addition to the marker. Where markers (non-standard grammemes) coincide with pymorphy2, their designations are identical, in particular:





Incidentally, for expressions containing numerical data, nrlpk uses the following markers in addition to NUMB and ROMN:





What is a multi-word keyword expression? For example, NUSR:





Why descriptions of words and keys were needed



At first they were needed to verify the operation of the nrlpk algorithms: whether any words were lost, whether an unnecessary merge occurred, what the proportion of keys in the text is, and so on.
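One of those sanity checks, the share of keys in the text, is trivial to compute from the statistics files. A minimal sketch; the function name and numbers are invented for illustration:

```python
def key_share(num_key_words: int, num_total_words: int) -> float:
    """Share of key words relative to all words in the text, in percent."""
    return round(100 * num_key_words / num_total_words, 1)

print(key_share(420, 3500))  # 12.0
```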



But as the software was being debugged, certain "regularities" began to emerge whose identification had never been set as a task for nrlpk:





Analyzing combinations of the statistics indicators can be no less interesting, for example:





Why is it all written



nrlpk is ready to work at its current quality level for processing Russian texts, but it is not provided as a service. The author has clear and understandable directions of development toward raising and stabilizing the quality percentage. Taking the project to the stated goals requires a strategic investor and/or a new copyright holder ready to develop it further.



PS



The tags for this text (the original; Habr changed it slightly), shown below, were generated automatically by nrlpk with the following parameters:





Detailed data on nrlpk's processing of this article can be found on GitHub.



Posted by: avl33


