Speech Synthesis Technology Overview

Hello! My name is Vlad, and I work as a data scientist on the Tinkoff speech technologies team, whose work powers our voice assistant Oleg.







In this article, I would like to give a short overview of the speech synthesis technologies used in the industry and share our team's experience in building its own synthesis engine.














Speech synthesis



Speech synthesis is the generation of sound from text. Today this problem is solved with two approaches:

  1. Unit selection (concatenative synthesis): gluing together fragments of pre-recorded speech.
  2. Parametric synthesis: building a model that predicts the acoustic properties of the signal from the text.




Unit selection models produce high-quality but low-variability speech and require a large amount of data for training. Parametric models, by contrast, need much less training data and generate more varied intonation, but until recently they suffered from noticeably poorer overall sound quality than the unit selection approach.







However, with the development of deep learning, parametric synthesis models have improved significantly on all quality metrics and can now produce speech that is practically indistinguishable from human speech.







Quality metrics



Before discussing which speech synthesis models are better, we need to define the quality metrics by which the algorithms will be compared.







Since the same text can be read in an infinite number of ways, there is no single a priori correct way to pronounce a given phrase. Speech synthesis quality metrics are therefore usually subjective and depend on the listener's perception.







The standard metric is MOS (mean opinion score): the average naturalness rating that assessors give to synthesized audio on a scale from 1 to 5, where 1 means completely implausible sound and 5 means speech indistinguishable from a human. Recordings of real people usually score around 4.5, and anything above 4 is considered quite high.
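Computationally, MOS is simply the arithmetic mean of the collected ratings. A minimal sketch, with made-up ratings purely for illustration:

```python
# A minimal sketch of computing MOS from assessor ratings.
# The ratings below are made-up values for illustration only.
ratings = {
    "sample_001.wav": [4, 5, 4, 4, 5],
    "sample_002.wav": [3, 4, 4, 3, 4],
}

def mean_opinion_score(scores):
    """Average of the 1-5 naturalness ratings given by assessors."""
    return sum(scores) / len(scores)

for audio, scores in ratings.items():
    print(audio, round(mean_opinion_score(scores), 2))
```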







How speech synthesis works



The first step in building any speech synthesis system is collecting training data. Usually these are high-quality audio recordings in which a voice actor reads specially selected phrases. Unit selection models require roughly 10–20 hours of clean speech [2], while for neural network parametric methods the upper estimate is around 25 hours [4, 5].







Let's look at both synthesis technologies in more detail.







Unit selection










Typically, the recorded speech of the voice actor cannot cover all the possible cases in which the synthesis will be used. The essence of the method is therefore to split the entire audio database into small fragments, called units, which are then glued together with minimal post-processing. Units are usually minimal acoustic units of the language, such as halfphones or diphones [2].

The entire generation process consists of two stages: the NLP frontend, which is responsible for extracting a linguistic representation of the text, and the backend, which computes a penalty function over units given the linguistic features. The NLP frontend includes:







  1. Text normalization: converting all non-letter characters (numbers, percent signs, currency symbols, and so on) into their verbal form. For example, "5%" should be converted to "five percent".
  2. Extraction of linguistic features from the normalized text: phoneme representations, stress, parts of speech, and so on.


Usually the NLP frontend is implemented with hand-written rules for a specific language, but recently there has been a growing shift toward machine learning models [7].
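As a toy illustration of such rule-based normalization (the patterns and vocabulary below are simplified assumptions, not the rules of a production frontend), one could write something like this:

```python
import re

# A toy rule-based text normalizer: expands a few non-letter tokens into words.
# The rules and vocabulary are simplified assumptions for illustration,
# not the rules used in a real NLP frontend.
NUMBERS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}

def normalize(text: str) -> str:
    # "5%" -> "5 percent"
    text = re.sub(r"(\d+)\s*%", r"\1 percent", text)
    # "$5" -> "5 dollars"
    text = re.sub(r"\$(\d+)", r"\1 dollars", text)
    # spell out single digits that survived the rules above
    return " ".join(NUMBERS.get(tok, tok) for tok in text.split())

print(normalize("5% of $3"))  # -> "five percent of three dollars"
```

A real frontend needs far more rules (dates, abbreviations, agreement in case and number, and so on), which is exactly why machine learning approaches [7] are attractive here.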







The penalty estimated by the backend subsystem is the sum of the target cost, which measures how well a unit's acoustic representation matches a particular phoneme, and the concatenation cost, which measures how smoothly two neighboring units join. These cost functions can be computed either with hand-crafted rules or with an already trained acoustic model from parametric synthesis [2]. The sequence of units that is optimal with respect to these penalties is then selected with the Viterbi algorithm [1].
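As a rough sketch of this selection step (where `candidates`, `target_cost`, and `concat_cost` are placeholders for the unit database and the cost functions described above, not a real system's API), the dynamic programming looks roughly like this:

```python
# A Viterbi-style dynamic programming sketch of unit selection.
# candidates[i] holds the candidate units for the i-th target phoneme;
# target_cost and concat_cost are placeholder cost functions.

def select_units(candidates, target_cost, concat_cost):
    # best[i][u] = (cost of the cheapest path ending in unit u, backpointer)
    best = [{u: (target_cost(0, u), None) for u in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for u in candidates[i]:
            prev, cost = min(
                ((p, best[i - 1][p][0] + concat_cost(p, u)) for p in candidates[i - 1]),
                key=lambda pc: pc[1],
            )
            layer[u] = (cost + target_cost(i, u), prev)
        best.append(layer)
    # Backtrack from the cheapest final unit.
    u = min(best[-1], key=lambda v: best[-1][v][0])
    path = [u]
    for i in range(len(candidates) - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    return list(reversed(path))

# Dummy usage: three phonemes with toy candidate units and toy costs.
print(select_units(
    candidates=[["a1", "a2"], ["b1"], ["c1", "c2"]],
    target_cost=lambda i, u: 0.1,
    concat_cost=lambda p, u: 0.0 if p[0] < u[0] else 1.0,
))
```

In a real engine, the candidates come from the recorded unit database, and the costs are produced by the rules or the acoustic model mentioned above.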







Approximate MOS values for unit selection models for English: 3.7–4.1 [2, 4, 5].







Advantages of the unit selection approach:

  - Natural, high-quality sound, since the speech is assembled from fragments of real recordings.

Disadvantages:

  - Low intonational variability.
  - A large amount of recorded data is required for training.

Parametric speech synthesis



The parametric approach is based on the idea of constructing a probabilistic model that estimates the distribution of acoustic features for a given text.

The process of speech generation in parametric synthesis can be divided into four stages:







  1. The NLP frontend: the same text preprocessing stage as in the unit selection approach, which produces a large set of context-dependent linguistic features.
  2. A duration model that predicts the duration of each phoneme.
  3. An acoustic model that recovers the distribution of acoustic features given the linguistic ones. Acoustic features include the fundamental frequency, a spectral representation of the signal, and so on.
  4. A vocoder that converts acoustic features into a sound wave.


The duration and acoustic models can be trained with hidden Markov models [3], deep neural networks, or their recurrent variants [6]. A traditional vocoder is an algorithm based on the source-filter model [3], which assumes that speech is the result of passing a source signal (a periodic pulse train or noise) through a linear filter.
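Put together, the classical pipeline can be sketched as follows; `frontend`, `duration_model`, `acoustic_model`, and `vocoder` here are hypothetical stand-ins for trained components, not a real library API:

```python
# A schematic sketch of the classical parametric pipeline described above.
# All four components are hypothetical objects standing in for trained models.

def synthesize(text, frontend, duration_model, acoustic_model, vocoder):
    # 1. NLP frontend: normalized text -> context-dependent linguistic features
    linguistic_features = frontend.extract(text)
    # 2. Duration model: linguistic features -> per-phoneme durations (in frames)
    durations = duration_model.predict(linguistic_features)
    # 3. Acoustic model: features + durations -> acoustic features
    #    (fundamental frequency, spectral representation, and so on)
    acoustic_features = acoustic_model.predict(linguistic_features, durations)
    # 4. Vocoder: acoustic features -> waveform samples
    return vocoder.generate(acoustic_features)
```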

The overall speech quality of classical parametric methods is rather low due to the large number of independence assumptions about the structure of the sound generation process.







However, with the advent of deep learning, it has become possible to train end-to-end models that predict acoustic features directly from characters. For example, the Tacotron [4] and Tacotron 2 [5] networks take a sequence of characters as input and return a mel spectrogram using the seq2seq approach [8]. Thus, stages 1-3 of the classical pipeline are replaced by a single neural network. The diagram below shows the architecture of the Tacotron 2 network, which achieves fairly high sound quality.







[Figure: Tacotron 2 architecture]
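The target such networks learn to predict is a mel spectrogram computed from the training recordings. A rough sketch of extracting it with librosa (the file name and parameter values are typical placeholder choices on my part, not the exact settings from [4, 5]):

```python
import librosa

# Extract the target mel spectrogram for a Tacotron-like model from a recording.
# "recording.wav" is a placeholder path; the parameters are typical values.
wav, sr = librosa.load("recording.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel)  # the network predicts log-magnitude frames
print(log_mel.shape)  # (80 mel bands, number of frames)
```

In the Tacotron 2 setup [5], frames like these are also what the neural vocoder is conditioned on.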







Another factor behind the significant growth in the quality of synthesized speech was the use of neural network vocoders instead of classical digital signal processing algorithms.







The first such vocoder was the WaveNet network [9], which predicts the amplitude of the sound wave sequentially, sample by sample.







Thanks to a large number of dilated convolutional layers, which capture more context, and skip connections in the network architecture, it achieved an improvement of about 10% in MOS compared to unit selection models. The diagram below shows the architecture of the WaveNet network.







[Figure: WaveNet architecture]
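To give a feel for the dilated convolutions at the heart of WaveNet, here is a deliberately simplified PyTorch sketch of a causal dilated stack; the real model additionally uses gated activations, residual and skip connections, and predicts a distribution over quantized amplitudes rather than a raw value:

```python
import torch
from torch import nn

# A simplified sketch of stacked dilated causal convolutions (the core of WaveNet).
class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, num_layers=8):
        super().__init__()
        # Dilations 1, 2, 4, ..., 128: the receptive field grows exponentially
        # with depth (256 samples here, for 8 layers with kernel size 2).
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            # Left-pad by the dilation so the convolution stays causal:
            # the output at time t depends only on inputs at times <= t.
            x = layer(nn.functional.pad(x, (layer.dilation[0], 0)))
        return x

out = DilatedCausalStack()(torch.randn(1, 32, 16000))  # (batch, channels, samples)
```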







The main disadvantage of WaveNet is its low speed, caused by the sequential sampling scheme. This problem can be solved either by engineering optimizations for specific hardware or by replacing the sampling scheme with a faster one.

Both approaches have been successfully implemented in industry: the first at Tinkoff.ru, and, as part of the second, Google introduced the Parallel WaveNet network [10] in 2017, whose results are used in the Google Assistant.







Approximate MOS values for neural network methods: 4.4–4.5 [5, 11], which means that the synthesized speech is practically indistinguishable from human speech.







Advantages of parametric synthesis:

  - Requires much less training data than unit selection.
  - Produces more varied intonation.
  - With neural network models, achieves sound quality close to human speech.

Disadvantages:

  - Classical parametric methods have rather poor overall sound quality.
  - Neural network vocoders such as WaveNet are slow because of their sequential sampling scheme.

How Tinkoff Speech Synthesis Works



As the overview shows, parametric speech synthesis methods based on neural networks are currently significantly superior in quality to the unit selection approach and much simpler to develop, so we used them to build our own synthesis engine.

To train the models, we used about 25 hours of clean speech from a professional voice actor. The texts for reading were specially selected to cover the phonetics of conversational speech as fully as possible. In addition, to add more intonational variety to the synthesis, we asked the voice actor to read the texts expressively, according to their context.







The architecture of our solution conceptually looks like this:









Thanks to this architecture, our engine generates high-quality, expressive speech in real time, does not require building a phoneme dictionary, and makes it possible to control stress in individual words. Examples of synthesized audio can be heard at the link.







References:



[1] AJ Hunt, AW Black. Unit selection in a concatenative speech synthesis system using a large speech database, ICASSP, 1996.

[2] T. Capes, P. Coles, A. Conkie, L. Golipour, A. Hadjitarkhani, Q. Hu, N. Huddleston, M. Hunt, J. Li, M. Neeracher, K. Prahallad, T. Raitio, R. Rasipuram, G. Townsend, B. Williamson, D. Winarsky, Z. Wu, H. Zhang. Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System, Interspeech, 2017.

[3] H. Zen, K. Tokuda, AW Black. Statistical parametric speech synthesis, Speech Communication, Vol. 51, no. 11, pp. 1039-1064, 2009.

[4] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous. Tacotron: Towards End-to-End Speech Synthesis.

[5] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

[6] Heiga Zen, Andrew Senior, Mike Schuster. Statistical parametric speech synthesis using deep neural networks.

[7] Hao Zhang, Richard Sproat, Axel H. Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, Brian Roark. Neural Models of Text Normalization for Speech Applications.

[8] Ilya Sutskever, Oriol Vinyals, Quoc V. Le. Sequence to Sequence Learning with Neural Networks.

[9] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio.

[10] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, Demis Hassabis. Parallel WaveNet: Fast High-Fidelity Speech Synthesis.

[11] Wei Ping, Kainan Peng, Jitong Chen. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech.

[12] Dario Rethage, Jordi Pons, Xavier Serra. A Wavenet for Speech Denoising.







