New release of the Silero speech synthesis system

A new public release of the Silero Text-to-Speech neural-network speech synthesis system is available. The project's primary aim is to build a modern, high-quality speech synthesis system that is on par with commercial offerings from large corporations and is available to everyone without expensive server hardware.

The models are distributed under the GNU AGPL license, but the company behind the project does not disclose how the models are trained. The models can be run with PyTorch or with frameworks that support the ONNX format. Speech synthesis in Silero is built on heavily modified modern neural-network algorithms combined with digital signal processing methods.

The authors note that the main problem with modern neural-network speech synthesis is that such systems are often available only as part of paid cloud services, while public offerings have high hardware requirements, lower quality, or are not finished, ready-to-use products. For example, smoothly running VITS, one of the popular new end-to-end synthesis architectures, in inference mode (that is, for synthesis rather than model training) requires a graphics card with more than 16 GB of VRAM.

Contrary to this trend, Silero runs successfully even on a single x86 thread of an Intel processor with AVX2 instructions. On 4 CPU threads, the system synthesizes 30-60 seconds of audio per second of wall-clock time at 8 kHz, 15-20 seconds at 24 kHz, and about 10 seconds at 48 kHz.
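As a rough illustration of what these throughput figures mean in practice, the real-time factor of a synthesizer can be expressed as seconds of audio produced per second of wall-clock time. The numbers below are simply the lower bounds quoted above, not new measurements:

```python
# Real-time factor (RTF) here means seconds of audio produced per
# second of wall-clock time; higher is faster.
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    if wall_seconds <= 0:
        raise ValueError("wall-clock time must be positive")
    return audio_seconds / wall_seconds

# Lower-bound throughput figures quoted in the article for 4 CPU threads;
# actual numbers depend on the hardware and the input text.
quoted = {8_000: 30.0, 24_000: 15.0, 48_000: 10.0}

for rate_hz, audio_per_wall_second in quoted.items():
    # e.g. at 24 kHz, a 60-second paragraph takes roughly 60 / 15 = 4 s
    synthesis_time = 60.0 / audio_per_wall_second
    print(f"{rate_hz} Hz: RTF >= {audio_per_wall_second:.0f}, "
          f"60 s of audio in <= {synthesis_time:.1f} s")
```

At these rates even the 48 kHz mode stays an order of magnitude faster than real time on commodity CPUs.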

Key features of the new Silero release:

  • The model size has been halved, to 50 megabytes;
  • The models can insert pauses;
  • 4 high-quality Russian voices are available (plus an unlimited number of random voices), with pronunciation examples provided;
  • The models have become 10 times faster; in 24 kHz mode, for example, they can synthesize up to 20 seconds of audio per second on 4 CPU threads;
  • All voice options for a given language are packed into a single model;
  • The models accept entire paragraphs of text as input, and SSML tags are supported;
  • Synthesis works at three selectable sampling rates: 8, 24 and 48 kHz;
  • Early "teething problems", such as instability and dropped words, have been fixed;
  • Flags have been added to control automatic stress placement and restoration of the letter "ё".
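The SSML support and pause control mentioned above can be pictured with a snippet like the following. The exact set of tags Silero accepts is an assumption here (`<break>` for explicit pauses is a standard SSML element); consult the project's documentation for the tags it actually supports:

```python
import xml.etree.ElementTree as ET

# Hypothetical SSML input illustrating paragraph-level synthesis with an
# explicit pause; support for tags beyond <speak>/<p>/<break> is not
# claimed here.
ssml_text = """
<speak>
  <p>First sentence of the paragraph.</p>
  <break time="500ms"/>
  <p>Second sentence, spoken after a half-second pause.</p>
</speak>
""".strip()

# Sanity-check that the markup is well-formed XML before handing it
# to a synthesizer.
root = ET.fromstring(ssml_text)
print(root.tag, len(root))  # prints "speak 3"
```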

In the current release, 4 Russian voices are publicly available; the next version, planned for the near future, will bring the following changes:

  • Synthesis speed will increase by another 2-4 times;
  • Synthesis models for CIS languages will be updated: Kalmyk, Tatar, Uzbek and Ukrainian;
  • Models for European languages will be added;
  • Models for Indian languages will be added;
  • Models for English will be added.

Known limitations of the Silero synthesis system include:

  • Unlike more traditional synthesis solutions such as RHVoice, Silero has no SAPI integration, no easy-to-install clients, and no Windows or Android integrations;
  • The speed, although unprecedented for a solution of this kind, may not be sufficient for on-the-fly high-quality synthesis on weak processors;
  • Automatic stress placement does not handle homographs (pairs such as the Russian за́мок "castle" and замо́к "lock") and still makes errors, although this is planned to be fixed in future releases;
  • The current version does not work on processors without AVX2 instructions (unless PyTorch settings are specifically changed), because one of the modules inside the model is quantized;
  • The current version essentially has a single dependency, PyTorch: all of the internals are "hardwired" inside the model and JIT packages. The model sources are not published, nor is code for running the models outside of PyTorch or clients for other languages;
  • The libtorch builds available for mobile platforms are much bulkier than the ONNX runtime, but an ONNX version of the model is not yet provided.
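Because the model requires AVX2, it can be useful to check for the instruction set before loading it. A minimal sketch for Linux, assuming CPU flags are listed in /proc/cpuinfo; the helper below is illustrative and not part of Silero:

```python
def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            if "avx2" in flags.split():
                return True
    return False

# On Linux one would read the real file:
#   with open("/proc/cpuinfo") as f:
#       supported = has_avx2(f.read())
sample = "processor : 0\nflags : fpu sse sse2 avx avx2 fma\n"
print(has_avx2(sample))  # prints "True"
```

On processors without AVX2, PyTorch's quantized backend selection would need to be adjusted instead, as the article notes.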

Source: opennet.ru
