Microsoft's latest technology in Azure AI describes images as well as people


Microsoft researchers have created an artificial intelligence system capable of generating image captions that, in many cases, are more accurate than descriptions written by humans. The breakthrough marks a major milestone in Microsoft's commitment to making its products and services inclusive and accessible to all users.

“Image captioning is one of the core capabilities of computer vision, which makes a wide range of services possible,” said Xuedong Huang, a Microsoft technical fellow and CTO of Azure AI Cognitive Services in Redmond, Washington.

The new model is now available to customers through Computer Vision in Azure Cognitive Services, which is part of Azure AI, so developers can use this capability to improve the accessibility of their own services. It is also being incorporated into the Seeing AI app and will roll out later this year in Microsoft Word and Outlook for Windows and Mac, as well as PowerPoint for Windows, Mac and the web.
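For developers, captioning an image boils down to one call to the Computer Vision "describe" REST endpoint. Below is a minimal sketch in Python using only the standard library; the endpoint URL and key are placeholders you would take from your own Azure resource, and the helper names (`describe_image`, `best_caption`) are illustrative, not part of the service.

```python
# Sketch: requesting an auto-generated caption from Azure Computer Vision.
# ENDPOINT and KEY are placeholders for your own Azure resource values.
import json
import urllib.request

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-key>"

def describe_image(image_url):
    """Send a publicly reachable image URL to the /describe operation."""
    req = urllib.request.Request(
        f"{ENDPOINT}/vision/v3.1/describe",
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": KEY,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def best_caption(response):
    """The service returns ranked captions; pick the most confident one."""
    captions = response["description"]["captions"]
    return max(captions, key=lambda c: c["confidence"])["text"]
```

The response also carries tags and confidence scores, so an app can fall back to tags (or to no alt text at all) when the top caption's confidence is low.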

Automatic description helps users access the important content of any image, whether it's a photo returned in search results or an illustration in a presentation.

“Using captions that describe the content of images, known as alternative text or alt text, on web pages and in documents is especially important for people who are blind or have low vision,” said Saqib Shaikh, a software engineering manager with Microsoft's AI Platform group in Redmond.

For example, his team is using the improved image description capability in Seeing AI, an app for people who are blind or have low vision that recognizes what the camera captures and describes it aloud. The app uses the generated captions to describe photos, including in social media apps.

“Ideally, everyone would include alt text for all images in documents, on the web and on social networks, because that allows people who are blind to access the content and join the conversation. But, alas, people don't,” Shaikh said. “There are, however, several apps that use the image description feature to fill in alt text when it's missing.”
  

Lijuan Wang, general manager of research at Microsoft's Redmond lab, led a research team that matched and then surpassed human results. Photo: Dan DeLong.

Describing novel objects

“Image captioning is one of the core tasks of computer vision, requiring an artificial intelligence system to understand and describe the salient content or action presented in an image,” explained Lijuan Wang, general manager of research at Microsoft's Redmond lab.

“You need to understand what is going on, figure out what the relationships are between objects and actions, and then summarize and describe it all in a sentence in human-readable language,” she said.

Wang led the research team that achieved, and then surpassed, human-level results on the nocaps benchmark (novel object captioning at scale). The benchmark evaluates how well AI systems generate descriptions of depicted objects that are not included in the dataset the model was trained on.

Typically, image captioning systems are trained on datasets of images paired with textual descriptions of those images, that is, on sets of captioned images.

“The nocaps test shows how well the system is able to describe new objects not found in the training data,” says Wang.

To solve this problem, the Microsoft team pre-trained a large AI model on a big dataset of images tagged with words, each tag associated with a specific object in the image.

Image sets with word tags instead of full captions are more efficient to create, allowing Wang's team to feed a lot of data into their model. This approach gave the model what the team calls a visual vocabulary.

As Huang explained, this pre-training approach using a visual vocabulary is similar to preparing children to read: first comes a picture book that pairs individual words with images, for example, a photo of an apple with the word "apple" beneath it and a photo of a cat with the word "cat."

“This pre-training with visual vocabulary is, in essence, the initial education needed to train the system. This is how we try to develop a kind of motor memory,” Huang said.

The pre-trained model is then fine-tuned on a dataset of captioned images. At this stage, the model learns how to compose a sentence. When an image containing novel objects appears, the AI system draws on the visual vocabulary to generate an accurate description.
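The two-stage idea can be illustrated with a deliberately simplified toy, not Microsoft's actual model: stage one builds a "visual vocabulary" from cheap word-tagged images, stage two learns sentence structure from fully captioned images, and at inference time an object seen only during stage one can still be named in a caption. All dataset contents and the feature sets below are made up for illustration.

```python
# Toy illustration of the visual-vocabulary idea (not the real architecture).
# Stage 1 data: word-tagged images, cheap to produce at scale.
word_tagged_images = [
    ({"round", "red"}, "apple"),
    ({"furry", "whiskers"}, "cat"),
    ({"long-necked", "spotted"}, "giraffe"),  # never appears in captions below
]

# Stage 2 data: fully captioned images, expensive, so there are fewer of them.
captioned_images = [
    ({"round", "red"}, "a photo of an apple"),
    ({"furry", "whiskers"}, "a photo of a cat"),
]

# Stage 1 "pre-training": map visual features to object words.
visual_vocab = {frozenset(feats): word for feats, word in word_tagged_images}

# Stage 2 "fine-tuning": here, reduced to a single learned sentence template.
template = "a photo of a {}"

def caption(features):
    """Compose a caption, naming the object via the visual vocabulary."""
    word = visual_vocab.get(frozenset(features), "object")
    return template.format(word)

# "giraffe" was never seen with a full caption, yet it can be described.
print(caption({"long-necked", "spotted"}))
```

In the real system both stages train a large neural model, and the vocabulary is implicit in learned representations rather than an explicit lookup table; the toy only shows why novel objects become describable.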

“To handle novel objects at test time, the system integrates what it learned during pre-training with what it learned during the subsequent fine-tuning,” says Wang.

According to the research results, when evaluated on the nocaps benchmark, the AI system produced more meaningful and accurate descriptions than humans did for the same images.

A faster path to production

On another industry benchmark, the new image captioning system also scores twice as high as the model Microsoft has used in its products and services since 2015.

Given the benefits this improvement brings to all users of Microsoft products and services, Huang pushed to accelerate the new model's integration into the Azure production environment.

“We are taking this disruptive AI technology to Azure as a platform to serve a wider range of customers,” he said. “And this is not just a breakthrough in research. The time it took to incorporate this breakthrough into the Azure production environment was also a breakthrough.”

Huang added that achieving human-level results continues a trend already established across Microsoft's cognitive systems.

“Over the past five years, we have achieved human-level results in five major areas: speech recognition, machine translation, conversational question answering, machine reading comprehension, and in 2020, despite COVID-19, image captioning,” Huang said.


Compare the image descriptions the system produced before and after the improvement:


Photo courtesy of Getty Images. Previous description: Close-up of a man preparing a hot dog on a cutting board. New description: A man makes bread.


Photo courtesy of Getty Images. Previous description: A man is sitting at sunset. New description: Bonfire on the beach.


Photo courtesy of Getty Images. Previous description: A man in a blue shirt. New description: Several people wearing surgical masks.


Photo courtesy of Getty Images. Previous description: A man on a skateboard flies up the wall. New description: A baseball player catches a ball.

Source: habr.com
