Implementation of a machine learning system for synthesizing images from a text description

An open implementation of the DALL-E 2 machine learning system proposed by OpenAI has been published. It can synthesize realistic images from a natural language text description, and also accepts natural language commands for image editing (for example, adding, deleting, or moving objects in an image). The original DALL-E 2 models from OpenAI have not been published, but a paper detailing the method is available. Based on that description, independent researchers have prepared an alternative implementation written in Python, using the PyTorch framework and distributed under the MIT license.


Compared to the previously published implementation of the first-generation DALL-E, the new version matches images to their descriptions more accurately, achieves greater photorealism, and can generate images at higher resolutions. The system requires substantial resources to train the model: for example, training the original version of DALL-E 2 takes 100-200 thousand GPU-hours, i.e. about 2-4 weeks of computation on 256 NVIDIA Tesla V100 GPUs.
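The wall-clock estimate above follows directly from dividing the total GPU-hours by the number of GPUs running in parallel. A quick sketch of the arithmetic (the function name is illustrative, not from the project):

```python
# Rough sanity check of the training-cost figures quoted in the article:
# 100-200 thousand GPU-hours spread across 256 GPUs running in parallel.
GPU_HOURS_LOW, GPU_HOURS_HIGH = 100_000, 200_000
NUM_GPUS = 256  # NVIDIA Tesla V100 cards, as stated above

def wall_clock_weeks(gpu_hours: int, num_gpus: int) -> float:
    """Convert total GPU-hours into wall-clock weeks for a given GPU count."""
    hours_elapsed = gpu_hours / num_gpus
    return hours_elapsed / (24 * 7)

low = wall_clock_weeks(GPU_HOURS_LOW, NUM_GPUS)    # about 2.3 weeks
high = wall_clock_weeks(GPU_HOURS_HIGH, NUM_GPUS)  # about 4.7 weeks
print(f"{low:.1f} to {high:.1f} weeks")
```

This assumes perfect parallel scaling across all 256 GPUs; real distributed training incurs communication overhead, so the article's 2-4 week range is consistent with the lower end of the quoted GPU-hour budget.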


The same author has also started developing an extended version, DALLE2 Video, aimed at synthesizing video from a text description. Also worth noting is the ru-dalle project developed by Sberbank, an open implementation of the first-generation DALL-E adapted to handle descriptions in Russian.

Source: opennet.ru
