Standard Intelligence has released hertz-dev, the first open source AI model for full-duplex speech synthesis, which can serve as a basis for real-time voice communication or conversational speech generation. The model generates speech that closely resembles the voice data it was trained on, giving a human-like conversational experience without the delays typical of a choppy phone call. The project is released under the Apache 2.0 license.
On a system with an NVIDIA GeForce RTX 4090 GPU, the average latency before generation begins is 120 ms (with a theoretical minimum of 65 ms), which is about twice as fast as existing publicly available models. The published version is built on a transformer architecture, has 8.5 billion parameters, and was trained on 500 billion tokens. The model's context size (the number of tokens it can process and keep in view when generating speech) is 2048 tokens, or about 4 minutes of speech.
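The context figures above imply a rough audio token rate. The following minimal sketch works it out in Python, assuming "about 4 minutes" means exactly 240 seconds; the constant and function names are illustrative and not part of hertz-dev.

```python
# Figures taken from the article; "about 4 minutes" is an approximation,
# so the results below are rough estimates, not official specifications.
CONTEXT_TOKENS = 2048
CONTEXT_SECONDS = 4 * 60  # assumption: exactly 240 seconds


def implied_token_rate(tokens: int, seconds: float) -> float:
    """Return the tokens per second implied by a context window and its duration."""
    return tokens / seconds


rate = implied_token_rate(CONTEXT_TOKENS, CONTEXT_SECONDS)
ms_per_token = 1000 / rate  # milliseconds of audio covered by one token

print(f"{rate:.1f} tokens/s, {ms_per_token:.0f} ms of audio per token")
# prints "8.5 tokens/s, 117 ms of audio per token"
```

Under this assumption, each token covers a little over a tenth of a second of audio.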
Source: opennet.ru
