Why is Kaldi good for speech recognition?





Why am I (and, I hope, you) interested in speech recognition? First, it is one of the most popular areas of computational linguistics, since speech recognition technology is now used almost everywhere: from recognizing a simple "yes/no" in a bank's automated call center to holding small talk with smart speakers such as Alice. Second, for a speech recognition system to be of high quality, you need to find the most effective tools for building and tuning such a system (this article is devoted to one of those tools). Finally, for me personally, an undeniable plus of specializing in speech recognition is that research in this area requires both programming and linguistic skills. This is very stimulating, forcing you to acquire knowledge in different disciplines.



Why Kaldi, when there are other frameworks for speech recognition?



To answer this question, it is worth looking at the existing alternatives and the algorithms and technologies they use (the algorithms used in Kaldi are described later in the article):





In the article "Comparative analysis of open source speech recognition systems" ( https://research-journal.org/technical/sravnitelnyj-analiz-sistem-raspoznavaniya-rechi-s-otkrytym-kodom/ ), all of the systems were trained on a 160-hour English corpus and evaluated on a small 10-hour test set. Kaldi turned out to have the highest recognition accuracy while slightly outperforming its competitors in speed. The Kaldi system also offers the user the richest selection of algorithms for different tasks and is very convenient to use. At the same time, the authors stress that working with the documentation may be inconvenient for an inexperienced user, as it is written for speech recognition professionals. But overall, Kaldi is better suited for scientific research than its analogues.



How to install Kaldi



  1. Download the archive from the repository at https://github.com/kaldi-asr/kaldi:

  2. Unpack the archive and go to kaldi-master/tools/extras.
  3. Run ./check_dependencies.sh:



    If you see anything other than "all ok", open the file kaldi-master/tools/INSTALL and follow the instructions there.
  4. Run make (from kaldi-master/tools, not from kaldi-master/tools/extras):

  5. Go to kaldi-master/src.
  6. Run ./configure --shared. You can configure the installation with or without CUDA by specifying the path to the installed CUDA (./configure --cudatk-dir=/usr/local/cuda-8.0) or by changing the default "yes" to "no" (./configure --use-cuda=no), respectively.



    If you see an error message here instead, either you did not complete step 4, or you need to download and install OpenFst yourself: http://www.openfst.org/twiki/bin/view/FST/FstDownload .
  7. Run make depend.
  8. Run make -j. It is recommended to specify the actual number of processor cores you will use for the build, for example make -j 2.
  9. As a result, we get a built Kaldi.
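The installation steps above can be condensed into the following shell sketch. The kaldi-master directory name, the core count, and the CUDA path are examples; adjust them to your setup:

```shell
# 1-3: get the sources and check build dependencies
git clone https://github.com/kaldi-asr/kaldi.git kaldi-master
cd kaldi-master/tools
extras/check_dependencies.sh    # should end with "all ok"

# 4: build the bundled tools (OpenFst and friends) from tools/
make -j 2

# 5-6: configure Kaldi itself, here without CUDA
cd ../src
./configure --shared --use-cuda=no    # or: ./configure --cudatk-dir=/usr/local/cuda-8.0

# 7-8: build, using as many cores as you can spare
make depend -j 2
make -j 2
```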



An example of using a model with Kaldi installed



As an example, I used the kaldi-ru model, version 0.6; you can download it from this link :



  1. After downloading, open the file kaldi-ru-0.6/decode.sh and specify the path to the installed Kaldi; for me it looks like this:





  2. Launch the model, specifying the file whose speech is to be recognized. You can use decoder-test.wav, a special test file that is already in this folder:





  3. And here is what the model recognized:
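Steps 1 and 2 above can be sketched in shell. The assumption here is that decode.sh contains a KALDI_ROOT=... line pointing at the Kaldi checkout (check your copy of the script); to keep the sketch self-contained, it edits a scratch copy of that line rather than the real file:

```shell
# decode.sh ships with the model and sets KALDI_ROOT near the top;
# here we create a scratch stand-in so the edit can be demonstrated safely
printf '#!/bin/bash\nKALDI_ROOT=/path/to/kaldi\n' > decode.sh

# point the script at your Kaldi build (the path is an example)
sed -i 's|^KALDI_ROOT=.*|KALDI_ROOT=/home/user/kaldi-master|' decode.sh
grep '^KALDI_ROOT=' decode.sh

# with the real model directory, recognition is then one command:
# ./decode.sh decoder-test.wav
```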



What algorithms are used, what underlies the work?



Full information about the project can be found at http://kaldi-asr.org/doc/ ; here I will highlight a few key points:
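One trait worth illustrating is Kaldi's command-line style: processing is done by small standalone binaries that read and write typed tables via "scp"/"ark" specifiers. A hedged sketch of feature extraction, assuming the Kaldi binaries are on your PATH and a wav.scp list exists:

```shell
# wav.scp maps utterance ids to audio files, one per line:
#   utt1 /data/utt1.wav
# compute MFCC features for every listed utterance
compute-mfcc-feats scp:wav.scp ark:feats.ark

# inspect the resulting feature matrices as text
copy-feats ark:feats.ark ark,t:- | head
```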





About creating the kaldi-ru-0.6 model



For the Russian language, there is a pre-trained recognition model created by Nikolai Shmyryov, also known on many sites and forums as nsh.







Comparison with Google Speech API and Yandex Speech Kit



Surely some readers, while going through the previous paragraphs, had a question: fine, we have established that Kaldi is superior to its direct counterparts, but what about the recognition systems from Google and Yandex? Perhaps the relevance of the frameworks described earlier is doubtful when tools from these two giants exist? It is a fair question, so let's test them, taking as a dataset the recordings and corresponding text transcripts from the well-known VoxForge.



As a result, after each system had recognized 3677 sound files, I obtained the following WER (Word Error Rate) values:
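For reference, WER is the word-level edit distance between the recognized text and the reference transcript (substitutions + insertions + deletions), divided by the number of reference words. A minimal standalone sketch of the metric (an illustration, not the script used for the measurements above; the reference must be non-empty):

```shell
# wer REFERENCE HYPOTHESIS -> word error rate in percent
wer() {
  awk -v ref="$1" -v hyp="$2" 'BEGIN {
    n = split(ref, r, " "); m = split(hyp, h, " ")
    # d[i,j] = edit distance between first i ref words and first j hyp words
    for (i = 0; i <= n; i++) d[i,0] = i
    for (j = 0; j <= m; j++) d[0,j] = j
    for (i = 1; i <= n; i++)
      for (j = 1; j <= m; j++) {
        cost = (r[i] == h[j]) ? 0 : 1          # substitution cost
        best = d[i-1,j] + 1                    # deletion
        if (d[i,j-1] + 1 < best)    best = d[i,j-1] + 1        # insertion
        if (d[i-1,j-1] + cost < best) best = d[i-1,j-1] + cost # match/sub
        d[i,j] = best
      }
    printf "%.2f\n", 100 * d[n,m] / n
  }'
}

wer "a b c d" "a x c"   # 1 substitution + 1 deletion over 4 reference words: prints 50.00
```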







Summing up, all three systems handled the task at roughly the same level, and Kaldi was not far behind the Yandex Speech Kit and the Google Speech API. But while the Yandex Speech Kit and the Google Speech API are "black boxes" that run somewhere far away on other people's servers and are not accessible for tuning, Kaldi can be adapted to the specifics of the task at hand: characteristic vocabulary (professional terms, jargon, colloquial slang), pronunciation features, and so on. And all this for free and without SMS! The system is a kind of construction kit that any of us can use to create something unusual and interesting.



I work in the LAPDiMO laboratory at NSU:

Website: https://bigdata.nsu.ru/

VK Group: https://vk.com/lapdimo


