
Analyzing and processing natural-language text is a perennially relevant task that has been, is being, and will be tackled by every available means. Today I would like to talk about tools for this task in the Julia language. Because the language is young, it does not yet have analysis tools as mature as Stanford CoreNLP, Apache OpenNLP, or GATE, which exist for Java. Still, the libraries that already exist can be used to solve typical problems and can be recommended as an entry point for students interested in text processing. Julia's syntactic simplicity and advanced mathematical tools also make it easy to dive into text clustering and classification.
The purpose of this article is to review Julia's text processing tools, with brief explanations of how to use them. We will balance between a short list of features for readers who already know NLP (Natural Language Processing) and want to see what Julia offers, and more detailed explanations and examples for those who are entering the field for the first time.
Well, now, let's move on to the package overview.
TextAnalysis.jl
The TextAnalysis.jl package is a basic library that implements a minimal set of typical text processing functions. We start with it. The examples are partially taken from the documentation.
Document
The basic entity is a document.
The following types are supported:
- FileDocument - a document represented by a simple text file on disk

    julia> pathname = "/usr/share/dict/words"
    "/usr/share/dict/words"

    julia> fd = FileDocument(pathname)
    A FileDocument
     * Language: Languages.English()
     * Title: /usr/share/dict/words
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: AA's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah

- StringDocument - a document represented by a UTF-8 string and stored in RAM. The StringDocument structure stores the text as a whole.

    julia> str = "To be or not to be..."
    "To be or not to be..."

    julia> sd = StringDocument(str)
    A StringDocument{String}
     * Language: Languages.English()
     * Title: Untitled Document
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: To be or not to be...

- TokenDocument - a document that is a sequence of UTF-8 tokens (extracted words). The TokenDocument structure stores a set of tokens; however, the full text cannot be restored from it without loss.

    julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
    6-element Array{String,1}:
     "To"
     "be"
     "or"
     "not"
     "to"
     "be..."

    julia> td = TokenDocument(my_tokens)
    A TokenDocument{String}
     * Language: Languages.English()
     * Title: Untitled Document
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

- NGramDocument - a document represented as a set of n-grams in UTF-8, i.e. sequences of n consecutive units (the examples below use words as units), together with a counter of their occurrences. This way of representing a document is one of the simplest ways to sidestep some problems of language morphology, typos, and peculiarities of language constructions in the analyzed texts. The price for this, however, is lower quality of text analysis compared to methods that take language information into account.

    julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2, "or" => 1, "not" => 1, "to" => 1, "be..." => 1)
    Dict{String,Int64} with 6 entries:
      "or"    => 1
      "be..." => 1
      "not"   => 1
      "to"    => 1
      "To"    => 1
      "be"    => 2

    julia> ngd = NGramDocument(my_ngrams)
    A NGramDocument{AbstractString}
     * Language: Languages.English()
     * Title: Untitled Document
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

Or a shorter form:

    julia> str = "To be or not to be..."
    "To be or not to be..."

    julia> ngd = NGramDocument(str, 2)
    NGramDocument{AbstractString}(Dict{AbstractString,Int64}("To be" => 1,"or not" => 1,"be or" => 1,"or" => 1,"not to" => 1,"not" => 1,"to be" => 1,"to" => 1,"To" => 1,"be" => 2…), 2, TextAnalysis.DocumentMetadata( Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

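To make the idea of n-gram counting concrete, here is a minimal plain-Julia sketch. It is not the TextAnalysis.jl implementation: the function name `count_ngrams` and the tokenization rule (drop punctuation, split on whitespace) are assumptions made only for illustration.

```julia
# Count word n-grams of every order from 1 to n.
# Illustrative sketch only; `count_ngrams` is a hypothetical helper.
function count_ngrams(tokens::Vector{<:AbstractString}, n::Int)
    counts = Dict{String,Int}()
    for k in 1:n, i in 1:(length(tokens) - k + 1)
        gram = join(tokens[i:i+k-1], " ")      # k consecutive words
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# Assumed tokenization: remove punctuation, then split on whitespace.
words = String.(split(replace("To be or not to be...", r"[[:punct:]]" => "")))
grams = count_ngrams(words, 2)
```

With this tokenization, `grams` holds both single words and word pairs, so `"be"` is counted twice and `"To be"` once, which is the same kind of dictionary that the short constructor prints above.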
A document can also be created with the generic Document constructor; the library will pick the appropriate document implementation.

    julia> Document("To be or not to be...")
    A StringDocument{String}
     * Language: Languages.English()
     * Title: Untitled Document
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: To be or not to be...

    julia> Document("/usr/share/dict/words")
    A FileDocument
     * Language: Languages.English()
     * Title: /usr/share/dict/words
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: AA's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah

    julia> Document(String["To", "be", "or", "not", "to", "be..."])
    A TokenDocument{String}
     * Language: Languages.English()
     * Title: Untitled Document
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

    julia> Document(Dict{String, Int}("a" => 1, "b" => 3))
    A NGramDocument{AbstractString}
     * Language: Languages.English()
     * Title: Untitled Document
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

As you can see, a document consists of text / tokens and metadata. The text of the document can be obtained using the text(...) method:

    julia> td = TokenDocument("To be or not to be...")
    TokenDocument{String}(["To", "be", "or", "not", "to", "be"], TextAnalysis.DocumentMetadata( Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

    julia> text(td)
    ┌ Warning: TokenDocument's can only approximate the original text
    └ @ TextAnalysis ~/.julia/packages/TextAnalysis/pcFQf/src/document.jl:111
    "To be or not to be"

    julia> tokens(td)
    6-element Array{String,1}:
     "To"
     "be"
     "or"
     "not"
     "to"
     "be"

The example demonstrates a document with automatically parsed tokens. The call to text(td) issued a warning that the text was restored only approximately, since TokenDocument does not store word delimiters. The call to tokens(td) returned exactly the extracted words.
You can request metadata from a document:

    julia> sd = StringDocument("This document has too foo words")
    A StringDocument{String}
     * Language: Languages.English()
     * Title: Untitled Document
     * Author: Unknown Author
     * Timestamp: Unknown Time
     * Snippet: This document has too foo words

    julia> language(sd)
    Languages.English()

    julia> title(sd)
    "Untitled Document"

    julia> author(sd)
    "Unknown Author"

    julia> timestamp(sd)
    "Unknown Time"

All of them can be changed with the corresponding functions. Julia's convention for naming modifying functions is the same as in Ruby: a function that modifies an object has the suffix !:

    julia> using TextAnalysis.Languages

    julia> language!(sd, Languages.Russian())
    Languages.Russian()

    julia> title!(sd, "")
    ""

    julia> author!(sd, " ..")
    " .."

    julia> import Dates:now

    julia> timestamp!(sd, string(now()))
    "2019-11-09T22:53:38.383"

Features of strings with UTF-8
Julia supports UTF-8 encoding when processing strings, so it has no problems with non-Latin alphabets. All the usual character processing operations are available. However, keep in mind that string indices in Julia count bytes, not characters, and each character can occupy a different number of bytes. There are separate methods for working with Unicode characters; see Unicode-and-UTF-8 for details. Here is a simple example. Let's define a string with mathematical Unicode characters separated from x and y by spaces:

    julia> s = "\u2200 x \u2203 y"
    "∀ x ∃ y"

    julia> length(s) # number of characters
    7

    julia> ncodeunits(s) # number of code units (bytes)
    11

Now let's look at the indices:

    julia> s[1]
    '∀': Unicode U+2200 (category Sm: Symbol, math)

    julia> s[2]
    ERROR: StringIndexError("∀ x ∃ y", 2)
    [...]

    julia> s[3]
    ERROR: StringIndexError("∀ x ∃ y", 3)
    Stacktrace:
    [...]

    julia> s[4]
    ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

The example clearly shows that index 1 gave us the symbol ∀, but the subsequent indices up to and including 3 led to an error. Only index 4 produced a space, the next character in the string. To find character boundaries by index, there are the useful functions prevind (previous index), nextind (next index) and thisind (this index). For example, for the space found above, let's ask where the previous character starts:

    julia> prevind(s, 4)
    1

We got index 1, the start of the symbol ∀.

    julia> thisind(s, 3)
    1

We checked index 3 and got the same valid index 1.
If we need to iterate over all the characters, this can be done in at least two simple ways:
1) using a for loop:

    julia> for c in s print(c) end
    ∀ x ∃ y

2) using the eachindex iterator:

    julia> collect(eachindex(s))
    7-element Array{Int64,1}:
      1
      4
      5
      6
      7
     10
     11

    julia> for i in eachindex(s) print(s[i]) end
    ∀ x ∃ y

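The index arithmetic above can also be written out by hand. The following sketch uses only Base functions (`firstindex`, `lastindex`, `nextind`, `ncodeunits`); the helper name `char_positions` is hypothetical and exists only for this illustration.

```julia
# Walk a string by valid byte indices using only Base functions.
s = "\u2200 x \u2203 y"              # "∀ x ∃ y"

# Collect (byte index, character) pairs; `char_positions` is a hypothetical helper.
function char_positions(str::AbstractString)
    positions = Tuple{Int,Char}[]
    i = firstindex(str)
    while i <= lastindex(str)
        push!(positions, (i, str[i]))
        i = nextind(str, i)          # jump to the start of the next character
    end
    return positions
end

pos = char_positions(s)
byte_indices = first.(pos)           # only the valid indices
widths = ncodeunits.(last.(pos))     # bytes occupied by each character
```

Here `byte_indices` matches what `collect(eachindex(s))` returns, and the widths add up to `ncodeunits(s)`, which makes it visible that ∀ and ∃ each take three bytes while the ASCII characters take one.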
Document preprocessing
If the text of a document was obtained from some external representation, the byte stream may well contain encoding errors. To eliminate them, use the remove_corrupt_utf8!(sd) function. Its argument is a document of the kind discussed above.
The main function for processing documents in the TextAnalysis package is prepare!(...). For example, let's remove punctuation marks from the text:

    julia> str = StringDocument("here are some punctuations !!!...")

    julia> prepare!(str, strip_punctuation)

    julia> text(str)
    "here are some punctuations "

Another useful step in text processing is converting all letters to lower case, since this simplifies further comparison of words. Keep in mind that, in general, this can lose important information about the text, for example that a word is a proper name or marks a sentence boundary. It all depends on the model used for further processing. Lowercasing is done by the remove_case!() function.

    julia> sd = StringDocument("Lear is mad")
    A StringDocument{String}

    julia> remove_case!(sd)

    julia> text(sd)
    "lear is mad"

Along the way, we can delete junk words, that is, words that are of no use for information retrieval or for similarity analysis. This can be done explicitly using the remove_words!(…) function and an array of such stop words.

    julia> remove_words!(sd, ["lear"])

    julia> text(sd)
    " is mad"

The words to be deleted also include articles, prepositions, pronouns, numbers and plain stop words, which occur with parasitically high frequency. These dictionaries are specific to each language and are defined in the Languages.jl package. Numbers get in the way because, in the term-document model built later, they can greatly increase the dimension of the matrix without improving, for example, the clustering of texts. In search problems, however, it is not always possible to drop numbers.
Among the available cleaning methods are the following options:
- prepare!(sd, strip_articles)
- prepare!(sd, strip_indefinite_articles)
- prepare!(sd, strip_definite_articles)
- prepare!(sd, strip_prepositions)
- prepare!(sd, strip_pronouns)
- prepare!(sd, strip_stopwords)
- prepare!(sd, strip_numbers)
- prepare!(sd, strip_non_letters)
- prepare!(sd, strip_sparse_terms)
- prepare!(sd, strip_frequent_terms)
- prepare!(sd, strip_html_tags)
Options can be combined. For example, a single call to prepare! can remove articles, numbers and HTML tags at once: prepare!(sd, strip_articles | strip_numbers | strip_html_tags)
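For readers who want to see what such cleaning steps amount to, here is a plain-Julia sketch that chains lowercasing, punctuation removal, number removal and stop-word filtering. It does not use or reproduce the TextAnalysis implementation; the function `clean` and the tiny stop-word list are assumptions made for illustration.

```julia
# A plain-Julia sketch of typical cleaning steps (not the TextAnalysis implementation).
stop_words = Set(["is", "the", "a"])   # hypothetical stop-word list

# `clean` is a hypothetical helper that applies the steps in a fixed order.
function clean(text::AbstractString; stop_words=Set{String}())
    lowered = lowercase(text)                             # unify letter case
    no_punct = replace(lowered, r"[[:punct:]]" => "")     # drop punctuation marks
    no_digits = replace(no_punct, r"\d+" => "")           # drop numbers
    kept = filter(w -> !(w in stop_words), split(no_digits))
    return join(kept, " ")                                # words re-joined by single spaces
end

cleaned = clean("Lear is mad, 42 times!!!"; stop_words=stop_words)
```

Unlike the library calls shown earlier, this sketch also normalizes whitespace, because the words are split and re-joined at the end.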
Another type of processing is stemming: extracting the stem of each word by removing endings and suffixes. This lets you merge different word forms and dramatically reduce the dimensionality of the document representation model. Dictionaries are required for this, so the language of the document must be specified explicitly. An example of processing Russian text:

    julia> sd = StringDocument("   ")
    StringDocument{String}("   ", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

    julia> language!(sd, Languages.Russian())
    Languages.Russian()

    julia> stem!(sd)

    julia> text(sd)
    "   "

Corpus
A corpus is a set of documents that will be processed according to the same rules. The TextAnalysis package implements construction of a term-document matrix, and to build it we need the complete set of documents at once. In a simple example, for the documents:
D1 = "I like databases"
D2 = "I hate databases"
this matrix looks like:
| | I | like | hate | databases |
|---|---|---|---|---|
| D1 | 1 | 1 | 0 | 1 |
| D2 | 1 | 0 | 1 | 1 |
Columns correspond to the words (terms) of the documents, and rows to the identifiers (or indices) of the documents. Accordingly, a cell is 0 if the term does not appear in the document, and 1 if it occurs any number of times. More sophisticated models take into account both the frequency of occurrence (the TF model) and the significance relative to the entire corpus (TF-IDF).
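The binary matrix for D1 and D2 can be reproduced in a few lines of plain Julia. This is only a sketch of the idea under simple assumptions (whitespace tokenization, terms sorted alphabetically), not what the package does internally; the variable names are illustrative.

```julia
# Build a binary document-term matrix by hand.
docs = ["I like databases", "I hate databases"]

tokenized = [String.(split(d)) for d in docs]   # whitespace tokenization (assumption)
terms = sort(unique(vcat(tokenized...)))        # one column per distinct term

# presence[d, t] is 1 if term t occurs in document d, otherwise 0.
presence = [Int(t in toks) for toks in tokenized, t in terms]
```

Because the terms are sorted, the columns come out in the order `I`, `databases`, `hate`, `like`, so the column order differs from the table above while the content is the same.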
We can build a corpus using the Corpus() constructor:

    crps = Corpus([StringDocument("Document 1"), StringDocument("Document 2")])

If we request the list of terms right away, we get:

    julia> lexicon(crps)
    Dict{String,Int64} with 0 entries

But if we force the library to recount all the terms in the corpus using update_lexicon!(crps), we get a different result:

    julia> update_lexicon!(crps)

    julia> lexicon(crps)
    Dict{String,Int64} with 3 entries:
      "1"        => 1
      "2"        => 1
      "Document" => 2

That is, we can see the extracted terms (words and numbers) and the number of their occurrences in the corpus.
We can also find out the relative frequency of a term, for example “Document”:

    julia> lexical_frequency(crps, "Document")
    0.5

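The value 0.5 is consistent with a simple ratio: "Document" occurs 2 times among the 4 tokens of the corpus. The sketch below computes the same numbers in plain Julia; `term_counts` and `relative_frequency` are hypothetical helpers, and the assumption that the frequency is occurrences divided by total tokens is inferred from the output above rather than taken from the library source.

```julia
# Count how many times each term occurs across all documents
# (a plain-Julia sketch, not the TextAnalysis implementation).
function term_counts(docs::Vector{String})
    counts = Dict{String,Int}()
    for d in docs, w in split(d)
        counts[w] = get(counts, w, 0) + 1
    end
    return counts
end

# Assumed definition: occurrences of `term` divided by the total number of tokens.
relative_frequency(counts, term) = get(counts, term, 0) / sum(values(counts))

lex = term_counts(["Document 1", "Document 2"])
freq = relative_frequency(lex, "Document")
```

A term that never occurs gets frequency 0.0 here, since `get` falls back to a count of zero.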
We can also build an inverted index, that is, for each term get the numbers of the documents in the corpus that contain it. This index is used in information retrieval, when you need to find the documents in which the terms from a given list occur:

    julia> update_inverse_index!(crps)

    julia> inverse_index(crps)
    Dict{String,Array{Int64,1}} with 3 entries:
      "1"        => [1]
      "2"        => [2]
      "Document" => [1, 2]

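To show how such an index answers a query, here is a minimal plain-Julia sketch. The functions `build_inverse_index` and `find_all` are hypothetical names introduced for this illustration, and intersecting the posting lists is just one simple way to require that all query terms are present.

```julia
# Build an inverted index: term => list of document numbers that contain it.
# Illustrative sketch; not the TextAnalysis implementation.
function build_inverse_index(docs::Vector{String})
    index = Dict{String,Vector{Int}}()
    for (doc_id, d) in enumerate(docs), w in unique(split(d))
        push!(get!(index, String(w), Int[]), doc_id)
    end
    return index
end

# Documents that contain every term of a non-empty query.
find_all(index, query) = intersect((get(index, t, Int[]) for t in query)...)

idx = build_inverse_index(["Document 1", "Document 2"])
hits = find_all(idx, ["Document", "2"])
```

The query `["Document", "2"]` narrows the two documents containing "Document" down to the single one that also contains "2".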
The same preprocessing functions that work on individual documents can be applied to the corpus as a whole. This uses another method of the prepare! function considered earlier; here the first argument is the corpus.

    julia> crps = Corpus([StringDocument("Document ..!!"), StringDocument("Document ..!!")])

    julia> prepare!(crps, strip_punctuation)

    julia> text(crps[1])
    "Document "

    julia> text(crps[2])
    "Document "

As with individual documents, you can request metadata for the entire corpus.

    julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])

    julia> languages(crps)
    2-element Array{Languages.English,1}:
     Languages.English()
     Languages.English()

    julia> titles(crps)
    2-element Array{String,1}:
     "Untitled Document"
     "Untitled Document"

    julia> authors(crps)
    2-element Array{String,1}:
     "Unknown Author"
     "Unknown Author"

    julia> timestamps(crps)
    2-element Array{String,1}:
     "Unknown Time"
     "Unknown Time"

You can set the same value for the entire corpus at once, or individual values for specific documents by passing an array with one value per document.

    julia> languages!(crps, Languages.German())

    julia> titles!(crps, "")

    julia> authors!(crps, "Me")

    julia> timestamps!(crps, "Now")

    julia> languages!(crps, [Languages.German(), Languages.English()])

    julia> titles!(crps, ["", "Untitled"])

    julia> authors!(crps, ["Ich", "You"])

    julia> timestamps!(crps, ["Unbekannt", "2018"])

Feature extraction
Feature extraction is one of the basic stages of machine learning. It is not directly related to the topic of this article, but the TextAnalysis documentation devotes a rather large section to feature extraction in exactly this sense. That section covers both the construction of the term-document matrix itself and many other methods. https://juliatext.github.io/TextAnalysis.jl/dev/features/
Let's briefly consider the proposed options.
The basic model for representing documents stores a set of words for each document, and their positions are not important. This is why the English-language literature calls it a bag of words. For each word, what matters is either the mere fact of its presence in the document, its frequency of occurrence (TF - Term Frequency), or a weight that also takes into account how often the term occurs in the corpus as a whole (TF-IDF - Term Frequency - Inverse Document Frequency).
Take the simplest example with three documents containing the terms Document, 1, 2, 3.

    julia> using TextAnalysis

    julia> crps = Corpus([StringDocument("Document 1"), StringDocument("Document 2"), StringDocument("Document 1 3")])

We will not use preprocessing, but we will build the full lexicon and the document-term matrix:

    julia> update_lexicon!(crps)

    julia> m = DocumentTermMatrix(crps)
    DocumentTermMatrix(
      [1, 1]  =  1
      [3, 1]  =  1
      [2, 2]  =  1
      [3, 3]  =  1
      [1, 4]  =  1
      [2, 4]  =  1
      [3, 4]  =  1, ["1", "2", "3", "Document"], Dict("1" => 1,"2" => 2,"Document" => 4,"3" => 3))

The variable m has a value of type DocumentTermMatrix. In the printed result, we see that the dimension is 3 documents by 4 terms, which include the word Document and the numbers 1, 2, 3. For further use of the model, we need the matrix in its traditional representation. We can get it using the dtm() method:

    julia> dtm(m)
    3×4 SparseArrays.SparseMatrixCSC{Int64,Int64} with 7 stored entries:
      [1, 1]  =  1
      [3, 1]  =  1
      [2, 2]  =  1
      [3, 3]  =  1
      [1, 4]  =  1
      [2, 4]  =  1
      [3, 4]  =  1

This variant is represented by the SparseMatrixCSC type, which is economical for a very sparse matrix, but only a limited number of libraries support it. The size of the term-document matrix is a problem because the number of terms grows very quickly with the number of processed documents. If documents are not preprocessed, absolutely all words in all their word forms, as well as numbers and dates, end up in this matrix. Even if the number of word forms is reduced by stemming, the number of remaining stems will be on the order of thousands to tens of thousands. The full size of the term-document matrix is the product of this number and the number of processed documents. A full (dense) matrix has to store zeros as well as ones, but it is easier to work with than SparseMatrixCSC. You can get it with the dtm(..., :dense) method or by converting the sparse matrix to a full one with collect():

    julia> dtm(m, :dense)
    3×4 Array{Int64,2}:
     1  0  0  1
     0  1  0  1
     1  0  1  1

If you print the array of terms, it is easy to read the original composition of each document from the corresponding row (the original order of the terms is not preserved).

    julia> m.terms
    4-element Array{String,1}:
     "1"
     "2"
     "3"
     "Document"

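The trade-off between the sparse and the dense form can be tried out with the SparseArrays standard library alone. The sketch below rebuilds the 3×4 matrix from the coordinates printed earlier; the variable names are illustrative, and the coordinates are copied from the output above rather than obtained from the package.

```julia
using SparseArrays

# (row, column) coordinates of the non-zero cells, copied from the printed matrix.
rows = [1, 3, 2, 3, 1, 2, 3]
cols = [1, 1, 2, 3, 4, 4, 4]
vals = ones(Int, length(rows))

S = sparse(rows, cols, vals, 3, 4)   # sparse storage: only non-zero cells are kept
D = Matrix(S)                        # dense storage: every cell, zeros included

stored_sparse = nnz(S)               # number of stored entries in the sparse form
stored_dense = length(D)             # number of cells in the dense form
```

For three tiny documents the difference between 7 and 12 stored values is negligible, but the gap grows quickly once the lexicon reaches thousands of terms.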
The term-document matrix for frequency models can be obtained using the tf() and tf_idf() methods:

    julia> tf(m) |> collect
    3×4 Array{Float64,2}:
     0.5       0.0  0.0       0.5
     0.0       0.5  0.0       0.5
     0.333333  0.0  0.333333  0.333333

It is easy to see the significance of the terms for each document. The first two documents contain two terms each, while the last one contains three, so the weight of each of its terms is lower.
And for the TF-IDF model, with the tf_idf() method:

    julia> tdm = tf_idf(m) |> collect
    3×4 Array{Float64,2}:
     0.202733  0.0       0.0       0.0
     0.0       0.549306  0.0       0.0
     0.135155  0.0       0.366204  0.0

In this model it is easy to see that the term Document, which is found in all documents, has a weight of 0. Term 3 in the third document has gained more weight than term 1 in the same document, since 1 is also found in the first document.
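The numbers printed above are consistent with the textbook formulas tf = (occurrences of the term in the document) / (tokens in the document) and idf = log(N / number of documents containing the term). The plain-Julia sketch below reproduces them under that assumption; it is not the library code, and other versions of the package may use a different IDF variant.

```julia
# Recompute TF and TF-IDF by hand for the three example documents.
docs = [["Document", "1"], ["Document", "2"], ["Document", "1", "3"]]
terms = ["1", "2", "3", "Document"]

# tf_matrix[d, t]: occurrences of term t in document d / number of tokens in d.
tf_matrix = [count(isequal(t), d) / length(d) for d in docs, t in terms]

# idf[t]: log(N / number of documents that contain term t) (assumed variant).
n_docs = length(docs)
idf = [log(n_docs / count(d -> t in d, docs)) for t in terms]

# Scale every column of the TF matrix by the IDF of its term.
tfidf = tf_matrix .* idf'
```

The term "Document" occurs in all three documents, so its IDF is log(3/3) = 0 and its whole column vanishes, which is exactly the effect described in the text.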
The resulting matrices are very easy to use, for example, to solve the problem of clustering documents. For this you will need the Clustering package. We use the simplest algorithm, k-means, which requires the desired number of clusters to be specified. We divide our three documents into two clusters. The input for kmeans is a feature matrix in which rows represent features and columns represent samples, so the matrices obtained above must be transposed.

    julia> using Clustering

    julia> R = kmeans(tdm', 2; maxiter=200, display=:iter)
      Iters               objv        objv-change | affected
    -------------------------------------------------------------
          0       1.386722e-01
          1       6.933608e-02      -6.933608e-02 |        0
          2       6.933608e-02       0.000000e+00 |        0
    K-means converged with 2 iterations (objv = 0.06933608051588186)
    KmeansResult{Array{Float64,2},Float64,Int64}([0.0 0.16894379504506848; 0.5493061443340549 0.0; 0.0 0.1831020481113516; 0.0 0.0], [2, 1, 2], [0.03466804025794093, 0.0, 0.03466804025794093], [1, 2], [1, 2], 0.06933608051588186, 2, true)

    julia> c = counts(R) # number of points in each cluster
    2-element Array{Int64,1}:
     1
     2

    julia> a = assignments(R) # cluster assigned to each point
    3-element Array{Int64,1}:
     2
     1
     2

    julia> M = R.centers # cluster centers
    4×2 Array{Float64,2}:
     0.0       0.168944
     0.549306  0.0
     0.0       0.183102
     0.0       0.0

As a result, we see that the first cluster contains one document and cluster number 2 contains two documents. Moreover, the matrix R.centers, which contains the cluster centers, clearly shows that the first column is "attracted" by term 2. The second column is determined by the presence of terms 1 and 3.
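To make the matrix orientation concrete, here is a minimal sketch in plain Julia that does not call Clustering.jl. The weights in tdm and the two centers are hypothetical values invented for illustration, and sqdist is a helper name introduced here. The sketch only shows that, after transposition, each column is one document and that a document is assigned to the center with the smallest squared Euclidean distance.

```julia
# Hypothetical weights: 3 documents (rows) × 4 terms (columns).
tdm = [0.3 0.0 0.0 0.0;
       0.0 0.5 0.0 0.0;
       0.1 0.0 0.4 0.0]

# 4×3 matrix: rows are features (terms), columns are samples (documents).
X = permutedims(tdm)

# Hypothetical cluster centers: 4 features × 2 clusters, one column per cluster.
centers = [0.0 0.2;
           0.5 0.0;
           0.0 0.2;
           0.0 0.0]

# Squared Euclidean distance between two vectors.
sqdist(a, b) = sum(abs2, a .- b)

# For every document (column of X), pick the index of the nearest center.
assign = [argmin([sqdist(X[:, j], centers[:, k]) for k in 1:size(centers, 2)])
          for j in 1:size(X, 2)]
```

With these made-up numbers the second document lands in cluster 1 and the other two in cluster 2, which mirrors the shape of the result above without reproducing its data.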
The Clustering.jl package contains a typical set of clustering algorithms, among them: K-means, K-medoids, Affinity Propagation, Density-based spatial clustering of applications with noise (DBSCAN), Markov Clustering Algorithm (MCL), Fuzzy C-Means Clustering, and Hierarchical Clustering (Single, Average, Complete, Ward's Linkage). An analysis of their applicability is beyond the scope of this article.
The TextAnalysis.jl package is currently under active development, so some functions are available only when the package is installed directly from the git repository. This is not difficult to do, but it can be recommended only to those who do not plan to put the solution into production in the near future:
julia> ]
(v1.2) pkg> add https://github.com/JuliaText/TextAnalysis.jl
However, these functions should not be left out of the review, so we consider them too.
One of the improvements is the Okapi BM25 ranking function. As with the previous tf and tf_idf models, we call the bm_25(m) method. The resulting matrix is used in the same way as in the previous cases.
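For readers who want to see what such a weighting computes, the following is a sketch of the textbook Okapi BM25 formula in plain Julia. It is not the TextAnalysis.jl implementation: the function name bm25_weights, the parameter names k1 and b with their commonly used default values, and the non-negative variant of the inverse document frequency are all assumptions made for this illustration. The input is a matrix of raw term counts with one row per document.

```julia
# Textbook BM25 weights for a documents × terms matrix of raw counts.
# k1 controls term-frequency saturation, b controls length normalization.
function bm25_weights(counts::AbstractMatrix{<:Real}; k1=1.2, b=0.75)
    ndocs, nterms = size(counts)
    doclen = vec(sum(counts, dims=2))      # number of tokens in each document
    avglen = sum(doclen) / ndocs           # average document length
    df = vec(sum(counts .> 0, dims=1))     # number of documents containing each term
    # Smoothed inverse document frequency (variant that is never negative).
    idf = log.((ndocs .- df .+ 0.5) ./ (df .+ 0.5) .+ 1)
    w = zeros(Float64, ndocs, nterms)
    for d in 1:ndocs, t in 1:nterms
        tf = counts[d, t]
        norm = k1 * (1 - b + b * doclen[d] / avglen)
        w[d, t] = idf[t] * tf * (k1 + 1) / (tf + norm)
    end
    return w
end

# Hypothetical counts: 2 documents, 3 terms.
w = bm25_weights([1 1 0; 1 0 2])
```

A term that is absent from a document gets weight 0, and a term that occurs in fewer documents is weighted more heavily than one that occurs everywhere.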
The sentiment of a text can be analyzed using the following methods:
model = SentimentAnalyzer(doc)
model = SentimentAnalyzer(doc, handle_unknown)
Here, doc is one of the document types described above, and handle_unknown is a function for processing unknown words. Sentiment analysis is implemented with the Flux.jl package, using a model trained on the IMDB dataset. The return value is in the range from 0 to 1.
Document summarization can be performed using the summarize(d, ns) method. The first argument is the document. The second, ns, is the number of sentences in the resulting summary.
julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")

julia> summarize(s, ns=2)
2-element Array{SubString{String},1}:
 "Assume this Short Document as an example."
 "This has too foo sentences."
A very important component of any text analysis library is a part-of-speech (POS) tagger, which is currently under development. There are several options for using it. For details, see Parts of Speech Tagging. It is called tagging because for each word in the source text a tag is produced that denotes its part of speech.
Two implementations are under development. The first is the Averaged Perceptron algorithm. The second is based on a neural network architecture that combines LSTMs, a CNN and the CRF method. Here is an example of tagging a simple sentence.
julia> pos = PoSTagger()

julia> sentence = "This package is maintained by John Doe."
"This package is maintained by John Doe."

julia> tags = pos(sentence)
8-element Array{String,1}:
 "DT"
 "NN"
 "VBZ"
 "VBN"
 "IN"
 "NNP"
 "NNP"
 "."
The list of part-of-speech abbreviations is taken from the Penn Treebank. In particular: DT is a determiner; NN is a noun, singular or mass; VBZ is a verb, 3rd person singular present; VBN is a verb, past participle; IN is a preposition or subordinating conjunction; NNP is a proper noun, singular.
The results of this tagging can also be used as additional features for document classification.
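As one possible way to turn tags into features, the sketch below converts the tag sequence from the example above into a vector of relative tag frequencies. The choice of tagset and the idea of using relative frequencies are assumptions for illustration, not something TextAnalysis.jl prescribes.

```julia
# Tags produced for "This package is maintained by John Doe." in the example above.
tags = ["DT", "NN", "VBZ", "VBN", "IN", "NNP", "NNP", "."]

# A fixed, hypothetical tag inventory defines the feature order.
tagset = ["DT", "IN", "NN", "NNP", "VBN", "VBZ"]

# Share of tokens carrying each tag; punctuation is not in tagset and is ignored.
features = [count(==(t), tags) / length(tags) for t in tagset]
```

Such a vector can be concatenated with the term weights of the same document before clustering or classification.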
Dimensionality Reduction Methods
TextAnalysis provides two options for reducing dimensionality by identifying dependent terms: latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).
The main task of LSA is to decompose the term-document matrix (weighted with TF-IDF) into three matrices whose product approximately reproduces the original.
julia> crps = Corpus([StringDocument("this is a string document"), TokenDocument("this is a token document")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf_idf(m) |> collect
2×6 Array{Float64,2}:
 0.0  0.0  0.0  0.138629  0.0  0.0
 0.0  0.0  0.0  0.0       0.0  0.138629

julia> F2 = lsa(m)
SVD{Float64,Float64,Array{Float64,2}}([1.0 0.0; 0.0 1.0], [0.138629, 0.138629], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 1.0])
      , -, TF-IDF . SVD, , .
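The decomposition into three matrices can be illustrated with the standard library alone. The sketch below applies svd from LinearAlgebra to a small hypothetical weight matrix and rebuilds it from the k largest singular values; the matrix A and the helper truncated are invented for this example and are not part of TextAnalysis.jl.

```julia
using LinearAlgebra

# Hypothetical 3×3 matrix of term weights (documents × terms).
# Rows 1 and 3 are identical, so the matrix has rank 2.
A = [1.0 0.0 1.0;
     0.0 1.0 0.0;
     1.0 0.0 1.0]

F = svd(A)   # A == F.U * Diagonal(F.S) * F.Vt

# Rebuild the matrix from the k largest singular values only.
truncated(F, k) = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]

A2 = truncated(F, 2)   # rank-2 approximation, reproduces A
A1 = truncated(F, 1)   # rank-1 approximation, loses the second row
```

Keeping fewer singular values gives a coarser approximation, and the Frobenius norm of the error of the rank-1 version equals the discarded singular value.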
LDA . Example:
julia> crps = Corpus([StringDocument("This is the Foo Bar Document"), StringDocument("This document has too Foo words")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> k = 2 # number of topics

julia> iterations = 1000 # number of gibbs sampling iterations

julia> α = 0.1 # hyper parameter

julia> β = 0.1 # hyper parameter

julia> ϕ, θ = lda(m, k, iterations, α, β)
(
  [2 ,  1]  =  0.333333
  [2 ,  2]  =  0.333333
  [1 ,  3]  =  0.222222
  [1 ,  4]  =  0.222222
  [1 ,  5]  =  0.111111
  [1 ,  6]  =  0.111111
  [1 ,  7]  =  0.111111
  [2 ,  8]  =  0.333333
  [1 ,  9]  =  0.111111
  [1 , 10]  =  0.111111, [0.5 1.0; 0.5 0.0])
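The parameter k of the lda method sets the number of topics. The method returns ϕ and θ: the first is an ntopics × nwords matrix of word probabilities within each topic, and the second is an ntopics × ndocs matrix of topic probabilities within each document.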
For document classification, the package provides a naive Bayes classifier, NaiveBayesClassifier(). The model is trained with the fit!() method:
using TextAnalysis: NaiveBayesClassifier, fit!, predict
m = NaiveBayesClassifier([:legal, :financial])
fit!(m, "this is financial doc", :financial)
fit!(m, "this is legal doc", :legal)
The trained model is then applied with predict:
julia> predict(m, "this should be predicted as a legal document")
Dict{Symbol,Float64} with 2 entries:
  :legal     => 0.666667
  :financial => 0.333333
As the output shows, the most probable class is :legal.
TextAnalysis.jl . , . MLJ.jl . AdaBoostClassifier, BaggingClassifier, BernoulliNBClassifier, ComplementNBClassifier, ConstantClassifier, XGBoostClassifier, DecisionTreeClassifier. - LSA, . .
TextAnalysis.jl CRF — Conditional Random Fields , Flux.jl, . .
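TextAnalysis.jl also supports named entity recognition (NER). NERTagger() assigns each word one of the following tags:
- PER: person
- LOC: location
- ORG: organization
- MISC: miscellaneous
- O: other (not a named entity)
Example: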
julia> ner = NERTagger()

julia> sentence = "This package is maintained by John Doe."
"This package is maintained by John Doe."

julia> tags = ner(sentence)
8-element Array{String,1}:
 "O"
 "O"
 "O"
 "O"
 "O"
 "PER"
 "PER"
 "O"
       NERTagger
           TextAnalysis.            . 
StringDistances.jl
. , , . . , StringDistances.jl . :
using StringDistances
compare("martha", "martha", Hamming()) #> 1.0
compare("martha", "marhta", Jaro()) #> 0.9444444444444445
compare("martha", "marhta", Winkler(Jaro())) #> 0.9611111111111111
compare("william", "williams", QGram(2)) #> 0.9230769230769231
compare("william", "williams", Winkler(QGram(2))) #> 0.9538461538461539
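The compare function returns a similarity score between 0 and 1: 1 means the strings are identical, and 0 means they have nothing in common.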
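To show how a distance can be turned into a score in the range from 0 to 1, here is a sketch in plain Julia based on the Hamming distance. The function hamming_similarity is a hypothetical helper written for this illustration; it is not the StringDistances.jl implementation and is only defined for strings of equal length.

```julia
# Similarity in [0, 1] derived from the Hamming distance:
# 1.0 for identical strings, 0.0 when every position differs.
function hamming_similarity(a::AbstractString, b::AbstractString)
    ca, cb = collect(a), collect(b)
    length(ca) == length(cb) || throw(ArgumentError("strings must have equal length"))
    isempty(ca) && return 1.0
    mismatches = count(ca .!= cb)          # number of positions that differ
    return 1 - mismatches / length(ca)
end

s_same = hamming_similarity("martha", "martha")
s_swap = hamming_similarity("martha", "marhta")   # two positions differ out of six
s_none = hamming_similarity("abc", "xyz")
```

The same normalization idea (one minus the distance divided by its maximum possible value) applies to other edit distances as well.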
, Jaro-Winkler. , . RatcliffObershelp, , . , . .
compare("mariners vs angels", "angels vs mariners", RatcliffObershelp()) #> 0.44444
compare("mariners vs angels", "angels vs mariners", TokenSort(RatcliffObershelp())) #> 1.0
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro()) #> 0.559904
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro())) #> 0.944444
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp())) #> 0.855
      , , . , TokenSort , . Julia — Julia, .
WordTokenizers.jl
WordTokenizers.jl . , , TextAnalysis.jl.
Tokenization is performed with the tokenize(text) function.
julia> using WordTokenizers

julia> text = "I cannot stand when they say \"Enough is enough.\"";

julia> tokenize(text) |> print # Default tokenizer
SubString{String}["I", "can", "not", "stand", "when", "they", "say", "``", "Enough", "is", "enough", ".", "''"]
WordTokenizers can also split a text into sentences.
julia> text = "The leatherback sea turtle is the largest, measuring six or seven feet (2 m) in length at maturity, and three to five feet (1 to 1.5 m) in width, weighing up to 2000 pounds (about 900 kg). Most other species are smaller, being two to four feet in length (0.5 to 1 m) and proportionally less wide. The Flatback turtle is found solely on the northern coast of Australia.";

julia> split_sentences(text)
3-element Array{SubString{String},1}:
 "The leatherback sea turtle is the largest, measuring six or seven feet (2 m) in length at maturity, and three to five feet (1 to 1.5 m) in width, weighing up to 2000 pounds (about 900 kg). "
 "Most other species are smaller, being two to four feet in length (0.5 to 1 m) and proportionally less wide. "
 "The Flatback turtle is found solely on the northern coast of Australia."

julia> tokenize.(split_sentences(text))
3-element Array{Array{SubString{String},1},1}:
 SubString{String}["The", "leatherback", "sea", "turtle", "is", "the", "largest", ",", "measuring", "six" … "up", "to", "2000", "pounds", "(", "about", "900", "kg", ")", "."]
 SubString{String}["Most", "other", "species", "are", "smaller", ",", "being", "two", "to", "four" … "0.5", "to", "1", "m", ")", "and", "proportionally", "less", "wide", "."]
 SubString{String}["The", "Flatback", "turtle", "is", "found", "solely", "on", "the", "northern", "coast", "of", "Australia", "."]
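The following tokenizers are available:
- Poorman's tokenizer: deletes all punctuation and splits on spaces. In some ways it is worse than plain split.
- Punctuation space tokenize: a slightly improved version of the previous one that deletes only punctuation occurring outside words.
- Penn Tokenizer: the original tokenizer used for the Penn Treebank.
- Improved Penn Tokenizer: the improved Penn Treebank tokenizer from NLTK.
- NLTK Word tokenizer: a further improved tokenizer from NLTK, with better handling of UNICODE characters.
- Reversible Tokenizer: a tokenizer whose output allows the original text to be restored.
- TokTok Tokenizer: a simple, general-purpose tokenizer.
- Tweet Tokenizer: a tokenizer for tweets that handles emoticons and HTML entities.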
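The first entry in the list, Poorman's tokenizer, is simple enough to sketch in Base Julia. The version below assumes that its behaviour is to delete punctuation and split on whitespace; the function name poorman_tokenize is hypothetical, and this is an illustration rather than the WordTokenizers.jl code.

```julia
# Delete every punctuation character, then split on whitespace.
poorman_tokenize(text::AbstractString) = split(filter(c -> !ispunct(c), text))

tokens = poorman_tokenize("I can't, really!")
```

The example also shows the weakness of this approach: the apostrophe inside "can't" is removed together with the sentence punctuation.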
The tokenizer used by tokenize can be changed with set_tokenizer(nltk_word_tokenize).
Embeddings.jl
  Embeddings.jl        .        ,      ,   ,   ,       ,      ,    .          Word2Vec.     ,   : king - man + woman = queen
     .    ,    ,      . ,   ,   ,   Wikipedia,         .  ,            .        «semantic space»,    ,   «semantic distance».         ,       ,    ,    «»  «»      .    ,             ,             ,        . 
, «embedding» , , , . , , , , , , . , -, . , . , . .
Embeddings.jl : Word2Vec, GloVe (English only), FastText. . , , . — , . , , word2vec, 8-16 . , .
, , DataDeps.jl . , (" "). , Embedding.jl , , . , .
 ENV["DATADEPS_ALWAYS_ACCEPT"] = true
The downloaded data is stored in the ~/.julia/datadeps directory.
. — :
using Embeddings

const embtable = load_embeddings(Word2Vec) # or load_embeddings(FastText_Text) or ...
const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))

function get_embedding(word)
    ind = get_word_index[word]
    emb = embtable.embeddings[:,ind]
    return emb
end
      — :
julia> get_embedding("blue")
300-element Array{Float32,1}:
  0.01540828
  0.03409082
  0.0882124
  0.04680265
 -0.03409082
 ...
      WordTokenizers TextAnalysis, . , Julia:
julia> a = rand(5)
5-element Array{Float64,1}:
 0.012300397820243392
 0.13543646950484067
 0.9780602985106086
 0.24647179461578816
 0.18672770774122105

julia> b = ones(5)
5-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0

julia> a+b
5-element Array{Float64,1}:
 1.0123003978202434
 1.1354364695048407
 1.9780602985106086
 1.2464717946157882
 1.186727707741221
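The same element-wise arithmetic is what makes an analogy such as king - man + woman = queen computable. The sketch below uses a tiny dictionary of hypothetical three-dimensional vectors, invented purely for illustration and unrelated to any real pretrained model, and picks the remaining word whose vector has the highest cosine similarity to the result.

```julia
using LinearAlgebra

# Cosine similarity between two vectors.
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
toy = Dict(
    "king"  => [0.9, 0.8, 0.1],
    "queen" => [0.9, 0.1, 0.8],
    "man"   => [0.1, 0.9, 0.1],
    "woman" => [0.1, 0.2, 0.8],
    "apple" => [0.5, 0.5, 0.5],
)

# Vector arithmetic for the analogy.
target = toy["king"] - toy["man"] + toy["woman"]

# Search among the words that were not used to build the query.
candidates = [w for w in keys(toy) if !(w in ("king", "man", "woman"))]
scores = [cosine_similarity(toy[w], target) for w in candidates]
best = candidates[argmax(scores)]
```

With real embeddings the lookup in toy would be replaced by a function such as get_embedding from the example above, and the search would run over the whole vocabulary.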
      Clustering.jl. , — . MLJ.jl. , https://github.com/JuliaStats/Distances.jl , :
- Euclidean distance
- Squared Euclidean distance
- Periodic Euclidean distance
- Cityblock distance
- Total variation distance
- Jaccard distance
- Rogers-Tanimoto distance
- Chebyshev distance
- Minkowski distance
- Hamming distance
- Cosine distance
- Correlation distance
- Chi-square distance
- Kullback-Leibler divergence
- Generalized Kullback-Leibler divergence
- Rényi divergence
- Jensen-Shannon divergence
- Mahalanobis distance
- Squared Mahalanobis distance
- Bhattacharyya distance
- Hellinger distance
- Haversine distance
- Mean absolute deviation
- Mean squared deviation
- Root mean squared deviation
- Normalized root mean squared deviation
- Bray-Curtis dissimilarity
- Bregman divergence
.
Transformers.jl
Transformers.jl — Julia «Transformers», BERT Google. , NER — , .
Transformers.jl Flux.jl , , , Julia- , . Flux.jl CPU GPU, , , , .
BERT , . :
using Transformers
using Transformers.Basic
using Transformers.Pretrain
using Transformers.Datasets
using Transformers.BidirectionalEncoder

using Flux
using Flux: onehotbatch, gradient
import Flux.Optimise: update!
using WordTokenizers

ENV["DATADEPS_ALWAYS_ACCEPT"] = true

const FromScratch = false

#use wordpiece and tokenizer from pretrain
const wordpiece = pretrain"bert-uncased_L-12_H-768_A-12:wordpiece"
const tokenizer = pretrain"bert-uncased_L-12_H-768_A-12:tokenizer"
const vocab = Vocabulary(wordpiece)

const bert_model = gpu(
    FromScratch ? create_bert() : pretrain"bert-uncased_L-12_H-768_A-12:bert_model"
)
Flux.testmode!(bert_model)

function vectorize(str::String)
    tokens = str |> tokenizer |> wordpiece
    text = ["[CLS]"; tokens; "[SEP]"]
    token_indices = vocab(text)
    segment_indices = [fill(1, length(tokens) + 2);]
    sample = (tok = token_indices, segment = segment_indices)
    bert_embedding = sample |> bert_model.embed
    collect(sum(bert_embedding, dims=2)[:])
end
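The vectorize function defined above converts a text into a vector. The resulting vectors can then be compared, for example: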
using Distances
x1 = vectorize("Some test about computers")
x2 = vectorize("Some test about printers")
cosine_dist(x1, x2)
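In the code above, wordpiece and tokenizer are taken from the pretrained model. In the model name, 12 is the number of layers and 768 is the hidden size. The available pretrained models are listed at https://chengchingwen.github.io/Transformers.jl/dev/pretrain/. They are loaded with the Transformers.Pretrain.@pretrain_str macro, written as pretrain"model-description:item".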
, , Transformers.jl , .
Conclusion
, , Julia . . , , . , Julia , . , Julia.
, , , - «», . , , «open source» , , , . , , Julia . , Jupyter Notebook , , — Atom/Juno, VS Code, . , , Julia — 2-3 , ( , , ), C++ .
      ,      ,    Julia,  -,   .   ,         ,  ,  ,        .         ,        2-3   ,         ,     . Julia       .          -    ,        for
           .     —  «   ».   ,   C,    .  Julia,   ,   -  ,     ,    —  Julia-.  , Julia —             ,             . 
, , Julia . , , .
, - Julia — @JuliaLanguage, .
References
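- TextAnalysis.jl: basic text processing library.
- Languages.jl
- WordTokenizers.jl: tokenizers and sentence splitting.
- StringDistances.jl: string comparison.
- Transformers.jl: Transformers and BERT.
- Distances.jl: distance metrics.
- Clustering.jl: clustering algorithms.
- MLJ.jl: machine learning models, including classifiers.
- Flux.jl: neural network library.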