Stable Diffusion machine learning system adapted for music synthesis

The Riffusion project develops a variant of the Stable Diffusion machine learning system adapted to generate music instead of images. Music can be synthesized from a natural-language text description or from a suggested template. The music synthesis components are written in Python using the PyTorch framework and are available under the MIT license. The web interface bindings are implemented in TypeScript and are also distributed under the MIT license. The trained models are released under the Creative ML OpenRAIL-M license, which permits commercial use.

The project is notable in that it keeps using the "text-to-image" and "image-to-image" models for music generation, but manipulates spectrograms as images. In other words, the classic Stable Diffusion is trained not on photographs and pictures, but on spectrogram images that capture how the frequency and amplitude of the sound wave change over time. Accordingly, the output is also a spectrogram, which is then converted into audio.
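To make the idea concrete, here is a minimal sketch of rendering an audio clip as a spectrogram image of the kind such a model could be trained on. This is illustrative rather than Riffusion's exact pipeline; the file name "clip.wav" and the transform parameters are assumptions.

```python
# Sketch: turn an audio clip into a grayscale spectrogram image.
# Not Riffusion's actual pipeline; parameter values are assumptions.
import torch
import torchaudio
import numpy as np
from PIL import Image

waveform, sample_rate = torchaudio.load("clip.wav")  # hypothetical input file
waveform = waveform.mean(dim=0)  # mix down to mono

# Short-time Fourier analysis on a mel scale: frequency (y) vs. time (x).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=2048,
    hop_length=512,
    n_mels=512,
)(waveform)

# Compress dynamic range (log scale) and normalize to 8-bit pixel values.
mel_db = torchaudio.transforms.AmplitudeToDB()(mel)
pixels = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min()) * 255.0
Image.fromarray(pixels.numpy().astype(np.uint8)).save("spectrogram.png")
```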


The method can also be used to modify existing compositions and to synthesize music from samples, similar to image modification in Stable Diffusion. For example, generation can be seeded with sample spectrograms in a reference style; different styles can be combined or smoothly cross-faded from one to another; and existing audio can be altered to solve tasks such as raising the volume of individual instruments, changing the rhythm, or replacing instruments. Templates are also used to generate long-playing compositions made up of a series of passages that resemble one another while varying slightly over time. Separately generated fragments are combined into a continuous stream by interpolating the internal parameters of the model, as in the sketch below.
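One common way to interpolate diffusion-model parameters is spherical linear interpolation (slerp) between the latent noise tensors of two neighboring fragments. The sketch below shows the general technique under that assumption; the function and tensor shapes are illustrative, not Riffusion's API.

```python
# Hedged sketch: bridge two generated fragments by spherically
# interpolating the latent noise tensors fed to the diffusion model.
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Spherical linear interpolation between two latent tensors."""
    a_flat, b_flat = a.flatten(), b.flatten()
    cos_omega = torch.dot(a_flat / a_flat.norm(), b_flat / b_flat.norm())
    omega = torch.acos(torch.clamp(cos_omega, -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1.0 - t) * a + t * b
    return (torch.sin((1.0 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

# Latents for two neighboring passages (shape matches SD's latent space).
latent_a = torch.randn(1, 4, 64, 64)
latent_b = torch.randn(1, 4, 64, 64)

# Intermediate latents yield spectrograms that drift smoothly from A to B.
steps = [slerp(t, latent_a, latent_b) for t in torch.linspace(0, 1, 8).tolist()]
```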


To create a spectrogram from sound, a windowed (short-time) Fourier transform is used. Recreating sound from a spectrogram runs into the problem of determining the phase (a spectrogram holds only frequency and amplitude information); the phase is reconstructed using the Griffin-Lim approximation algorithm.
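The round trip can be illustrated with torchaudio's built-in transforms: a magnitude-only spectrogram is computed, phase is discarded, and Griffin-Lim estimates a consistent phase to recover audio. The parameter choices and the file names are assumptions for this sketch.

```python
# Sketch: magnitude spectrogram -> audio via Griffin-Lim phase recovery.
import torchaudio

n_fft, hop_length = 2048, 512

waveform, sample_rate = torchaudio.load("clip.wav")  # hypothetical input

# Windowed (short-time) Fourier transform; power=2 keeps magnitudes only,
# discarding phase, just as a spectrogram image does.
spec = torchaudio.transforms.Spectrogram(
    n_fft=n_fft, hop_length=hop_length, power=2
)(waveform)

# Griffin-Lim iteratively estimates a phase consistent with the magnitudes.
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=n_fft, hop_length=hop_length, power=2, n_iter=32
)
reconstructed = griffin_lim(spec)
torchaudio.save("reconstructed.wav", reconstructed, sample_rate)
```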



Source: opennet.ru
