Code of the Whisper Speech Recognition and Translation System Released

OpenAI, a project that develops public projects in the field of artificial intelligence, has published the Whisper speech recognition system. For English speech, the system is claimed to approach human-level reliability and accuracy of automatic recognition. The reference implementation, based on the PyTorch framework, has been released along with a set of pretrained, ready-to-use models. The code is open source under the MIT license.

The model was trained on 680,000 hours of speech data collected from several collections covering different languages and subject areas. About a third of the speech data used in training is in languages other than English. The system correctly handles accented pronunciation, background noise, and technical jargon. In addition to transcribing speech to text, it can translate speech from an arbitrary language into English and detect the appearance of speech in an audio stream.
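The transcription and translation modes described above are exposed through the project's command-line tool. A minimal sketch, assuming the package name `openai-whisper` on PyPI; the audio file names are placeholders:

```shell
# Install the reference implementation and pretrained models on demand
pip install -U openai-whisper

# Transcribe speech to text in its original language
whisper speech.mp3 --model small

# Translate speech in an arbitrary supported language into English
whisper speech.mp3 --model small --task translate
```

The `--model` flag selects one of the pretrained checkpoints, which are downloaded automatically on first use.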

The models come in two variants: an English-only model and a multilingual model that supports, among other languages, Russian, Ukrainian, and Belarusian. Each variant is offered in five sizes that differ in the number of parameters. The larger the model, the greater the accuracy and quality of recognition, but also the higher the GPU memory requirements and the lower the performance. For example, the smallest model has 39 million parameters and requires 1 GB of video memory, while the largest has 1,550 million parameters and requires 10 GB; the smallest runs about 32 times faster than the largest.


The system is built on the Transformer neural network architecture, with an encoder and a decoder that interact with each other. Audio is split into 30-second fragments, each of which is converted into a log-Mel spectrogram and fed to the encoder. The encoder's output is passed to the decoder, which predicts the text interleaved with special tokens; these tokens allow a single shared model to handle tasks such as language detection, tracking the chronology of phrase pronunciation, transcription of speech in different languages, and translation into English.
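The front-end step described above can be sketched in plain NumPy. This is a simplified illustration, not the reference implementation (which ships a precomputed filterbank); the window, hop, and mel-bin values are the ones Whisper is documented to use (16 kHz audio, 25 ms window, 10 ms hop, 80 mel bins):

```python
import numpy as np

SAMPLE_RATE = 16_000          # Whisper operates on 16 kHz mono audio
N_FFT, HOP = 400, 160         # 25 ms analysis window, 10 ms hop
N_MELS = 80
CHUNK = 30 * SAMPLE_RATE      # 30-second fragments

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale (simplified)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz = 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            if center > left:
                fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[m - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(audio):
    # Pad or trim to exactly 30 seconds, as the article describes
    audio = np.pad(audio[:CHUNK], (0, max(0, CHUNK - len(audio))))
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP : i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(N_MELS, N_FFT, SAMPLE_RATE).T
    return np.log10(np.maximum(mel, 1e-10))

spec = log_mel_spectrogram(np.random.randn(SAMPLE_RATE * 5))  # 5 s of noise
print(spec.shape)  # (frames, 80) — the matrix handed to the encoder
```

Each 30-second fragment thus becomes a fixed-size matrix of 80 mel bands per frame, which is what the Transformer encoder consumes.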

Source: opennet.ru
