What was needed at the very beginning:
- a program that “fishes out” unique product names of a specific industry from raw Russian text. Raw text here means text a person wrote simply to express their thoughts, without caring about formatting or about compiling a list of words;
- an automatically obtained list of words;
- minimal manual or automated processing to turn that list into a set of hashtags or keywords for the text.
I believe many people run into this problem, implicitly, every day after writing or analyzing an article, post, comment, note, report, etc. In my line of work I had to deal with it many times a day, so it is fair to say that “laziness”, in the good sense of the word, led me to the idea of automation.
Now, as I write this article, the idea has been preserved, but the data set of the final result has changed considerably:
- it is not words that are selected but key expressions, which may include single words;
- the list of key expressions is marked with 16 different markers;
- all words of the text (including non-key ones) are lemmatized, i.e. reduced to their initial form or unified to the display format;
- each word in the text carries additional analytics related to its position in the text and the number of repetitions.
The results of the nrlpk software (Natural Russian Language Processing by the Keys) prepare data for:
- analysis of texts on an unlimited range of topics and industries (development and testing were carried out on materials about industry and the defense sector, the Military-Industrial Complex);
- automatic rubrication, classification, cataloging and materialization of materials (online sites);
- content monitoring and filtering with configurable system responses (security services and systems, in closed loops or online);
- multilayer markup of texts (AI).
Quality
So as not to drag those who believe only in numbers, not words, and those who expect absolute quality and accept nothing less, through the whole article ...
The quality of the selection is in the range of 95–100% when testing on articles of no more than 3,500 words. The scatter depends on the quality and complexity of the writing. Here is an example of one of the articles involved in the testing, and the result of its automatic analysis.
About 7–10% should be subtracted from this quality interval, i.e. the actual quality level will likely be 85–93%. This is because:
- during testing, the requirements for the selected data keep changing; I did not notice some of them before, and I am sure I do not notice everything now;
- manual reconciliation reflects my subjective opinion of what exactly can be recognized as a key in an article and what cannot, and it most likely does not fully coincide with the opinion of the articles' authors.
A full list of the articles that were tested, and detailed statistics of the results, can be found on GitHub.
What specifically affected the quality of the result in each article can be found in the Reasons file on GitHub.
How to read the results
In each folder for a specific analyzed article there are 5 files with data sets in Unicode (a short loading sketch follows below):
- words.csv - a list of relevant words, including the unidentified ones;
- keys.csv - a list of keywords; besides marked expressions, it now also contains words repeated in the text at least a specified number of times - in this case, at least 4 times;
- garbage.csv - a list of unidentified words;
- descr_words.csv - description (statistics) for the list of all words of the text;
- descr_keys.csv - description (statistics) for the list of keywords.
There is also an optional reasons_quality.txt - a list of expressions from the article that were selected manually and were either missed by the keys or included in them incorrectly (in the opinion of the author of nrlpk).
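For illustration, here is a minimal sketch of loading one article's result set with pandas; the folder path is hypothetical and the files are assumed to be regular UTF-8 CSVs, since the exact nrlpk layout is not shown here.

```python
from pathlib import Path

import pandas as pd

# Hypothetical folder with the result set for one analyzed article.
article_dir = Path("results/some_article")

# The five data-set files described above (UTF-8 encoding assumed).
words = pd.read_csv(article_dir / "words.csv", encoding="utf-8")
keys = pd.read_csv(article_dir / "keys.csv", encoding="utf-8")
garbage = pd.read_csv(article_dir / "garbage.csv", encoding="utf-8")
descr_words = pd.read_csv(article_dir / "descr_words.csv", encoding="utf-8")
descr_keys = pd.read_csv(article_dir / "descr_keys.csv", encoding="utf-8")

print(f"{len(keys)} key expressions, {len(garbage)} unidentified words")
```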
You can learn how to read these files from the Legend file on GitHub.
nrlpk lets you get any data set in one of the following formats (see the conversion sketch after this list):
- Pandas DataFrame (default);
- Python dictionary;
- JSON;
- CSV file.
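The relationship between these formats can be shown with ordinary pandas conversions; this is only a sketch around a toy DataFrame, not the actual nrlpk interface.

```python
import pandas as pd

# A toy data set standing in for any nrlpk result (DataFrame is the default format).
df = pd.DataFrame(
    {"lemma": ["танк", "25 тонн"], "gram": ["NOUN", "NUSR"], "count": [6, 1]}
)

as_dict = df.to_dict(orient="records")                       # Python dictionary (list of records)
as_json = df.to_json(orient="records", force_ascii=False)    # JSON string
df.to_csv("keys_export.csv", index=False, encoding="utf-8")  # CSV file
```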
Testing methodology
- Software (automatic) analysis of the text.
- Manual (visual) identification and marking of key expressions, and reconciliation of the resulting list with the list obtained automatically.
- Calculation of the quality percentage: the number of expressions that were missed or incorrectly included in the keys, plus the number of words in the garbage, relative to the total number of words in the text (roughly as sketched below).
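Roughly, this calculation can be expressed as follows; the variable names are illustrative, not taken from nrlpk.

```python
def quality_percent(missed_or_wrong_keys: int, garbage_words: int, total_words: int) -> float:
    """Quality as the share of the text not affected by errors: expressions that
    were missed or got into the keys incorrectly, plus unidentified (garbage)
    words, relative to the total number of words in the text."""
    errors = missed_or_wrong_keys + garbage_words
    return 100.0 * (1.0 - errors / total_words)

# Example: 12 missed/incorrect expressions and 18 garbage words in a 3,000-word article.
print(round(quality_percent(12, 18, 3000), 1))  # 99.0
```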
Instruments
nrlpk is written in Python 3.7.0. Already while developing the future nrlpk software, two mandatory requirements emerged:
- select expressions, not just words, although an expression may consist of a single word;
- have a dictionary of specialized industry terms.
These requirements cast doubt on the use of NLTK and pymorphy2, which could have solved some of the tasks.
To remove the doubts, manual markup was carried out on a selection of media texts on the military-industrial complex taken from the largest Russian-language news aggregator on this subject, VPK.Name. The labeling analysis revealed:
- a whole layer of data that should not be subjected to word-by-word tokenization and lemmatization;
- the impossibility, in many cases, of sentence tokenization without a serious transformation of the text to correct grammatical inaccuracies that the authors allow in more than 80% of articles. These inaccuracies in no way affect how a person perceives the text, but they very significantly affect how a machine perceives and interprets it.
In addition, already at this stage, the need to collect and store a variety of statistical information about the processed objects became obvious.
Given these factors, Pandas was chosen as the basic package for working with data; in addition to the tasks described above, it made it possible to carry out batch lemmatization.
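Batch lemmatization here essentially means a vectorized lookup of lemmas by word form; the sketch below shows the idea, with the column names and dictionary fragment invented for the example rather than taken from the internal nrlpk structures.

```python
import pandas as pd

# A tiny illustrative fragment of a word-form -> lemma dictionary.
large_dict = pd.DataFrame(
    {
        "form": ["танки", "танка", "летали", "самолётов"],
        "lemma": ["танк", "танк", "летать", "самолёт"],
        "gram": ["NOUN", "NOUN", "VERB", "NOUN"],
    }
)

# Tokens of the text being processed, in text order.
tokens = pd.DataFrame({"word": ["танки", "летали", "неопознанноеслово"]})

# A single merge lemmatizes the whole batch; unmatched words are "garbage" candidates.
lemmatized = tokens.merge(large_dict, how="left", left_on="word", right_on="form")
garbage = lemmatized[lemmatized["lemma"].isna()]
```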
After analyzing the Russian-language dictionaries available for this work, OpenCorpora was taken as the basis; it is, by the way, also used in pymorphy2.
It was transformed into a form convenient for working with Pandas, after which the following dictionaries were selected from it:
- large - the entire base of words;
- bad words - words excluded from text analysis due to lack of significance;
- special - a dictionary of specialized (industry) words. Proper nouns are not included in this dictionary: first names, titles, surnames, addresses, items, products, companies, persons, etc. This is a fundamental and deliberate decision, since in any living industry or field such an approach would require constant monitoring and manual updating of the dictionaries, which ruins the idea of easing labor through automation.
The dictionaries are saved in Unicode as plain text files so that they can be managed from any external program (a loading sketch follows).
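Keeping the dictionaries as plain Unicode text also means they can be read back into pandas with a single call; here is a sketch assuming one entry per line and hypothetical file names.

```python
import pandas as pd

# Each dictionary is assumed to be a plain UTF-8 text file with one entry per line,
# so it can be edited in any text editor or produced by another program.
bad_words = pd.read_csv("bad_words.txt", header=None, names=["word"], encoding="utf-8")
special = pd.read_csv("special.txt", header=None, names=["term"], encoding="utf-8")

# Example use: drop non-significant words before analysis.
tokens = pd.DataFrame({"word": ["и", "танк", "в", "авиация"]})
tokens = tokens[~tokens["word"].isin(bad_words["word"])]
```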
Since the base dictionary in nrlpk and pymorphy2 is the same, the designations of the parts of speech (grammes) are identical as well. The number of markers (non-standard grammes) is currently 16, and most of them, when the marked expression does not consist of several words, carry the designation of the basic part-of-speech gramme in addition to the marker. The designations of the markers (non-standard grammes) that have a match in pymorphy2 are identical to it, in particular:
- NUMB - a number;
- ROMN - a Roman numeral;
- UNKN - the token could not be parsed.
By the way, for expressions containing numerical data, nrlpk additionally uses the following markers besides NUMB and ROMN:
- NUSR - the expression contains one or more pieces of numeric data;
- MATH - the expression contains a mathematical formula.
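For quick reference, the markers mentioned in this text can be kept as a simple mapping; only the grammes named above are listed here, and the full set of 16 is not reproduced.

```python
# Non-standard grammes (markers) named in this text; nrlpk defines 16 in total.
MARKERS = {
    "NUMB": "a number",
    "ROMN": "a Roman numeral",
    "UNKN": "the token could not be parsed",
    "NUSR": "the expression contains one or more pieces of numeric data",
    "MATH": "the expression contains a mathematical formula",
}
```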
What is a multi-word key expression? Some examples with NUSR:
- if the text contains February 25, 2020, the key expression will be February 25, 2020, with the lemma “02.25.2020”, the gramme “NUSR” and the marker NUSR;
- however, if the text says “February 25, 2020”, the key expression will be “February 25, 2020”, with the lemma “2f2g”, the gramme “WIQM” and the marker WIQM;
- if the text contains 25 tons, the key will show “25 tons”, with the lemma “2t”, where “NUSR” is used as both the gramme and the marker.
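As a rough illustration of how an expression with numeric data might be flagged as NUSR rather than NUMB, here is a simple check of my own; it is not the actual nrlpk logic.

```python
import re

_DIGITS = re.compile(r"\d")

def looks_like_nusr(expression: str) -> bool:
    """NUSR candidate: contains numeric data but is not a bare number
    (a bare number would get the NUMB marker instead)."""
    has_digits = bool(_DIGITS.search(expression))
    is_bare_number = expression.strip().replace(".", "", 1).replace(",", "", 1).isdigit()
    return has_digits and not is_bare_number

print(looks_like_nusr("25 тонн"))  # True  -> NUSR
print(looks_like_nusr("2020"))     # False -> NUMB
```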
Why descriptions for words and keys were needed
At first they were needed to check the operation of the nrlpk algorithms: whether words were being lost, whether unnecessary merging took place, what the share of keys in the text was, etc.
But as the software was being debugged, certain “regularities” began to appear whose identification had never been posed to nrlpk as a task:
- identification of words written with spelling errors;
- identification of texts with poor style - bad-% > 35% (a practical observation from testing);
- identification of targeted (narrowly focused, clearly positioned) texts - skeys-% < 5 without numeric keys (a practical observation from testing);
- identification of texts outside the industry topics - skeys-% < 1.
Analyzing combinations of the statistics indicators can be no less interesting, for example (see the sketch after this list):
- identification of “wide scope” texts - keys-% > 45% with ukeys-% tending toward keys-%.
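These observations can be turned into simple checks once the percentages from the descr files are at hand; the sketch below uses the thresholds quoted above, while the function itself and the “tending toward” tolerance are my own assumptions.

```python
def classify_text(bad_pct: float, skeys_pct: float, keys_pct: float, ukeys_pct: float,
                  has_numeric_keys: bool = False) -> list:
    """Heuristic flags based on the practical observations listed above."""
    flags = []
    if bad_pct > 35:
        flags.append("poor style")
    if skeys_pct < 1:
        flags.append("outside the industry topics")
    elif skeys_pct < 5 and not has_numeric_keys:
        flags.append("targeted (narrowly focused) text")
    # "ukeys-% tending toward keys-%" is read here as a difference of a few points.
    if keys_pct > 45 and abs(keys_pct - ukeys_pct) <= 5:
        flags.append("wide scope")
    return flags

print(classify_text(bad_pct=12, skeys_pct=3, keys_pct=50, ukeys_pct=47))
```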
Why all this has been written
nrlpk is ready to work with the current quality indicators for processing Russian texts, but it is not offered as a service. The author has clear and understandable directions of development toward raising the quality percentage and stabilizing it. Developing this further requires a strategic investor and/or a new copyright holder willing to take the project to the stated goals.
PS
The tags for this text (the original one - Habr has changed it slightly), shown below, were generated automatically by nrlpk with the following parameters:
- do not recognize expressions with numerical data as keys;
- recognize words repeated in the text at least 8 times as keys.
Detailed data on the results of processing this article with nrlpk can be found on GitHub.
Posted by:
avl33