At the beginning of this year, for a number of reasons, we set out to create the largest open dataset of Russian speech. You can read more about our motivation and how it all began in this article: A huge open dataset of Russian speech. Since then, the project has gone through a series of large-scale changes: we have tripled the amount of data, improved its quality, added speaker labels, and are now finally ready to present version 1.0.

We are not going to rest on our laurels, either: we plan to keep working intensively on errors in future versions and to improve the quality of the data already published. Version 1.1 will be devoted to large-scale bug fixing.

Briefly about Open STT v1.0

More than 20,000 hours of Russian speech audio (our initial target was 10,000 hours), 2.3 TB of data in .wav format (less in .mp3, of course);
A wide variety of domains, from audio recorded on a professional microphone to phone calls:

Domain	Annotation	Phrases	Hours	GB
Radio	Alignment	8.3M	11,996	1367
Public speaking	Alignment	1.7M	2,709	301
Youtube	Subtitles	2.6M	2,117	346
Books	Alignment / ASR	1.3M	1,632	180
Calls	ASR	695K	819	91
Other datasets	TTS, recitation	1.9M	835	95

More detailed statistics can be found in the project repository.
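As a quick sanity check, the per-domain figures in the table can be summed and compared with the headline numbers. The sketch below copies the hours and GB columns from the table; the estimate of raw PCM size assumes the .wav format described in this article (mono, 16 kHz, int16) and ignores file headers, so it is only a rough cross-check.

```python
# (domain, hours, size in GB) copied from the table in this article
DOMAINS = [
    ("Radio", 11_996, 1367),
    ("Public speaking", 2_709, 301),
    ("Youtube", 2_117, 346),
    ("Books", 1_632, 180),
    ("Calls", 819, 91),
    ("Other datasets", 835, 95),
]

total_hours = sum(hours for _, hours, _ in DOMAINS)
total_gb = sum(gb for _, _, gb in DOMAINS)

# Raw PCM size of one hour of mono, 16 kHz, int16 audio (2 bytes per sample)
BYTES_PER_HOUR = 16_000 * 2 * 3600
estimated_pcm_gb = total_hours * BYTES_PER_HOUR / 1e9

print(total_hours)              # 20108
print(total_gb)                 # 2380
print(round(estimated_pcm_gb))  # 2316
```

The totals agree with "more than 20,000 hours" and roughly 2.3 TB, and the estimated raw PCM size lands within a few percent of the reported size.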

The data can now be downloaded at high speed either in .wav format (mono, 16 kHz, int16) via torrent, or in .mp3 format via a direct link;
Added a small manually labeled validation dataset (18 hours) for 3 main domains;
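
After downloading, it can be useful to confirm that a file really has the stated .wav format (mono, 16 kHz, int16). The following is a minimal sketch using only Python's standard `wave` module; the function names `describe_wav` and `matches_expected_format` are illustrative and not part of any Open STT tooling.

```python
import wave

# Format stated in this article: mono, 16 kHz, int16
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE = 16_000
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. int16


def describe_wav(path):
    """Read the header of a PCM .wav file and return its basic parameters."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        sample_width = wav.getsampwidth()
        n_frames = wav.getnframes()
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "sample_width": sample_width,
        "duration_s": n_frames / sample_rate,
    }


def matches_expected_format(info):
    """Return True if the parameters match mono, 16 kHz, int16."""
    return (
        info["channels"] == EXPECTED_CHANNELS
        and info["sample_rate"] == EXPECTED_SAMPLE_RATE
        and info["sample_width"] == EXPECTED_SAMPLE_WIDTH
    )
```

Calling `matches_expected_format(describe_wav(path))` on a downloaded file (with `path` given as a string) returns `True` when the header matches the stated format; summing `duration_s` over files gives the number of seconds of audio.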

We have made every effort to improve the quality of the annotation:

Improved the model used to align new domains;
Used better, more finely tuned STT models for alignment;
Improved the algorithm for normalizing numbers and Latin letters;
Gradually re-annotated or removed the "dirty" data from previous versions;
Fixed a number of the dataset's teething problems, such as:
- Dangling single letters at the beginning and end of sentences;
- Low alignment yield due to low-quality models;
- "Correct" handling of punctuation marks during alignment;
(Coming soon!) Real speaker labels will be added;
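
To illustrate the "dangling single letters" problem mentioned above, here is a minimal sketch of one possible clean-up step. It is not the approach used for Open STT, which this article does not describe. The function name and the allowlist of legitimate one-letter Russian words are assumptions made for the example.

```python
# Assumption: one-letter Russian words that can legitimately stand alone.
# This allowlist is illustrative and may be incomplete.
VALID_SINGLE_LETTER_WORDS = set("аивкосуя")


def _is_dangling(token):
    """A token counts as dangling if it is one letter and not a real word."""
    return (
        len(token) == 1
        and token.isalpha()
        and token.lower() not in VALID_SINGLE_LETTER_WORDS
    )


def strip_dangling_letters(text):
    """Remove stray single letters from the start and end of a transcript."""
    tokens = text.split()
    while tokens and _is_dangling(tokens[0]):
        tokens.pop(0)
    while tokens and _is_dangling(tokens[-1]):
        tokens.pop()
    return " ".join(tokens)


print(strip_dangling_letters("т привет мир п"))  # привет мир
```

A real pipeline would also have to decide what to do with valid one-letter words such as prepositions left at the very end of a phrase, which this sketch deliberately leaves untouched.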

What tasks can our dataset be useful for?

Speech recognition;
Speech synthesis;
Denoising (removing noise from audio);
Speaker identification;
Speaker separation;

How do you plan to develop the dataset in the future?

Improve / re-upload existing datasets and clean up the annotation;
Publish models for speech recognition and post-processing;
Add speaker ID annotation. Some of the new domains already have ready-made annotation, and we are also considering adding speaker labels to the older datasets;
Possibly expand to other languages;
Possibly add several new domains;

You can learn more about new domains in the repository.
