A huge open dataset of Russian speech, version 1.0








At the beginning of this year, for a number of reasons, we set out to create the largest open dataset of Russian speech. You can read more about our motivation and how it all began in the earlier article, "A huge open dataset of Russian speech". Since then the project has changed substantially: we have tripled the amount of data, improved its quality, and added speaker labels. Now we are finally ready to present version 1.0.







We are not going to rest on our laurels: in future versions we plan to keep fixing errors and improving the quality of the data already published. Version 1.1 will be devoted to large-scale error correction.







Open STT v1.0 in brief





Domain            Annotation         Phrases   Hours    GB
Radio             Alignment          8.3M      11,996   1367
Public speaking   Alignment          1.7M      2,709    301
YouTube           Subtitles          2.6M      2,117    346
Books             Alignment / ASR    1.3M      1,632    180
Calls             ASR                695K      819      91
Other datasets    TTS, recitation    1.9M      835      95
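
To get a feel for the overall scale, the per-domain figures in the table above can be added up. The snippet below is only a rough sketch: the numbers are copied from the table, the phrase counts are rounded there (so the phrase total is approximate), and the names `DOMAINS`, `totals` and `hour_shares` are made up for this illustration and are not part of the project.

```python
# Per-domain figures copied from the table above: (phrases, hours, GB).
# Phrase counts are rounded in the table, so their sum is approximate.
DOMAINS = {
    "Radio": (8_300_000, 11_996, 1367),
    "Public speaking": (1_700_000, 2_709, 301),
    "YouTube": (2_600_000, 2_117, 346),
    "Books": (1_300_000, 1_632, 180),
    "Calls": (695_000, 819, 91),
    "Other datasets": (1_900_000, 835, 95),
}


def totals(domains):
    """Return (total phrases, total hours, total GB) across all domains."""
    phrases = sum(p for p, _, _ in domains.values())
    hours = sum(h for _, h, _ in domains.values())
    gigabytes = sum(g for _, _, g in domains.values())
    return phrases, hours, gigabytes


def hour_shares(domains):
    """Return each domain's fraction of the total number of hours."""
    total_hours = sum(h for _, h, _ in domains.values())
    return {name: h / total_hours for name, (_, h, _) in domains.items()}


if __name__ == "__main__":
    phrases, hours, gigabytes = totals(DOMAINS)
    print(f"phrases ~ {phrases:,}, hours = {hours:,}, size = {gigabytes:,} GB")
    for name, share in hour_shares(DOMAINS).items():
        print(f"{name:<16} {share:6.1%} of all hours")
```

By these figures the radio domain alone accounts for more than half of all the audio hours, which is worth keeping in mind when sampling training data across domains.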


More detailed statistics can be found in the project repository.









We have made every effort to improve the quality of the annotations:









What tasks can our dataset be useful for?





How do you plan to develop the dataset in the future?





You can learn more about new domains in the repository.







