
Canonical has introduced Myna, a new speech-to-text system for Ubuntu Desktop. The project aims to provide integrated dictation: the user presses a hotkey, speaks, and the recognized text appears in the active application. The announcement emphasizes that Myna should feel like a natural part of the Ubuntu desktop while respecting the user's privacy. The list of supported input languages had not been announced at the time of publication.
The project's first target is Ubuntu 26.10. At this stage, Canonical isn't attempting to develop a full-fledged voice assistant or voice-based desktop control system. The developers have deliberately limited the scope of the first version to basic, reliable dictation: pressing a key combination, speaking text, and receiving the result in the current input field. The primary environment being tested is Ubuntu Desktop on Wayland with GNOME, but the architecture is intended to remain open enough to support other environments in the future.
Myna is designed for local speech recognition. Once the necessary models are installed, dictation requires no internet connection. The microphone should only be used after explicit user activation. Audio is processed in memory and then discarded, and recordings are not sent to external services. The design specification also states that the solution should avoid storing audio by default and should not silently switch to a cloud service.
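The memory-only audio handling described above can be illustrated with a minimal Python sketch. It is not taken from the Myna code base: the class name `SessionAudioBuffer`, the `max_chunks` parameter, and the choice to drop the oldest audio when the buffer is full are all assumptions made for illustration.

```python
from collections import deque


class SessionAudioBuffer:
    """Hypothetical bounded in-memory buffer for one dictation session."""

    def __init__(self, max_chunks):
        # A deque with maxlen drops the oldest chunk once it is full, so
        # memory use stays bounded (assumption: the specification only says
        # the buffer is limited, not how overflow is handled).
        self._chunks = deque(maxlen=max_chunks)
        self._active = False

    def start(self):
        """Begin a session after explicit user activation (hotkey press)."""
        self._chunks.clear()
        self._active = True

    def push(self, chunk):
        """Keep a chunk of audio in memory; ignored outside a session."""
        if self._active:
            self._chunks.append(chunk)

    def finish(self):
        """End the session, hand the audio to the recognizer, then clear it."""
        audio = b"".join(self._chunks)
        self._chunks.clear()
        self._active = False
        return audio

    def __len__(self):
        return len(self._chunks)
```

Nothing in this sketch touches the disk or the network: audio exists only between `start()` and `finish()`, and anything captured outside a session is ignored.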
The Myna code and documentation are published in Canonical's repository on GitHub. The project is described as a lightweight speech-to-text application for Ubuntu Desktop and is distributed under the GPL-3.0 license. However, the project is in its early stages: there are no published releases in the repository yet, and the architectural specification is listed as Proposed.
Key features and functions of Myna
Push-to-talk dictation. The user holds down a configurable hotkey, speaks, and the system inserts the recognized text into the selected input field. Dictation ends when the key is released.
Local speech recognition. Recognition is performed on the user's machine via a local inference stack. This reduces cloud dependency and allows for offline operation after model installation.
Private audio processing. The microphone is activated only during a user's dictation session. Audio is not written to disk by default; a limited memory buffer is used, which is cleared after the session ends.
Visual activity indicator. During recording and transcription, the user should see a clear status indicator. The specification mentions states such as Recording, Transcribing, Finalizing, and Error.
Insert only stable text. In the first implementation, intermediate recognition hypotheses should not be directly inserted into the application. Only the confirmed final text is sent to the target field.
Post-processing of text. The raw transcript may undergo normalization, punctuation, capitalization, formatting, and conversion of spoken forms to written forms, such as “twenty two” → “22”.
Selecting the dictation language. The system must support a configurable dictation language, defaulting to the user interface language if a suitable model is available for it.
Model quality profiles. The specification includes different model profiles: a lightweight version with lower resource consumption, a balanced default profile, and a higher-quality but heavier version.
Safe handling of input focus. The target for text insertion is selected at the beginning of the session. If the window focus changes during dictation, the system should not silently send the text to another application.
Blocking in protected fields. Dictation should be blocked in password fields, authentication windows, and other secure areas if the application or toolkit allows this to be determined.
Integration with Wayland/GNOME. The first version is targeted at Wayland and GNOME. IBus is being considered for initial text insertion, and a more native Wayland approach via the input-method/text-input protocols is planned for the future.
User settings. The planned settings interface should include enabling/disabling STT, selecting a hotkey, dictation language, microphone, model profile, post-processing parameters, and an activity indicator.
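Several items in the list above (the status states, the insertion target fixed at the start of the session, and insertion of final text only) can be combined into one rough Python sketch of a push-to-talk session. This is an illustration under stated assumptions, not Myna's implementation: `DictationSession`, the three callbacks, and the `IDLE` state are hypothetical, and treating a focus change as an `Error` state is only one possible reading of the requirement.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    IDLE = "Idle"  # assumption: not named in the specification
    RECORDING = "Recording"
    TRANSCRIBING = "Transcribing"
    FINALIZING = "Finalizing"
    ERROR = "Error"


@dataclass
class DictationSession:
    get_focus: Callable[[], str]              # id of the focused input field
    transcribe: Callable[[bytes], str]        # stand-in for local recognition
    insert_text: Callable[[str, str], None]   # (target id, final text)
    state: State = State.IDLE
    target: Optional[str] = None
    chunks: list = field(default_factory=list)

    def key_pressed(self) -> None:
        # The insertion target is fixed when the session starts.
        self.target = self.get_focus()
        self.chunks.clear()
        self.state = State.RECORDING

    def audio(self, chunk: bytes) -> None:
        # Audio is accepted only while the hotkey is held.
        if self.state is State.RECORDING:
            self.chunks.append(chunk)

    def key_released(self) -> None:
        if self.state is not State.RECORDING:
            return
        self.state = State.TRANSCRIBING
        text = self.transcribe(b"".join(self.chunks))
        self.chunks.clear()  # audio is discarded after recognition
        self.state = State.FINALIZING
        if self.get_focus() != self.target:
            # Focus moved during dictation: do not send text elsewhere.
            self.state = State.ERROR
            return
        # Only the final text reaches the application.
        self.insert_text(self.target, text)
        self.state = State.IDLE
```

A real implementation would also have to cover the blocking of protected fields and run recognition asynchronously. The sketch keeps everything synchronous so that the order of the states is easy to follow.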
The first iteration of the project leaves out wake-word activation, persistent background listening, cloud recognition, a voice assistant, voice commands, desktop control, speech translation, speaker detection, automatic language detection, and dictation history. In other words, Canonical is starting not with an "AI assistant," but with a more down-to-earth feature: local voice input for text in regular Ubuntu apps.
Source: linux.org.ru
