Trends in computer vision: highlights of ICCV 2019

Neural networks for computer vision are developing rapidly, and many problems are still far from solved. To keep up with trends in the field, it is often enough to follow influencers on Twitter and read the relevant papers on arXiv.org. But we got the opportunity to attend the International Conference on Computer Vision (ICCV) 2019, held this year in South Korea, and we want to share with Habr readers what we saw and learned.

There were many of us from Yandex: self-driving car developers, researchers, and engineers who work on CV tasks in the company's services. But here we present the somewhat subjective point of view of just our team, the Machine Intelligence Laboratory (Yandex MILAB); other colleagues surely saw the conference from their own angle.

What does the laboratory do? We work on experimental projects related to generating images and music for entertainment purposes. We are especially interested in neural networks that let the user modify content (for photos, this task is called image manipulation). An example of our results was shown at the YaC 2019 conference.
There are a lot of scientific conferences, but the top ones, the so-called A* conferences, stand out: papers about the most interesting and important technologies are usually published there. There is no definitive list of A* conferences; here is an approximate and incomplete one: NeurIPS (formerly NIPS), ICML, SIGIR, WWW, WSDM, KDD, ACL, CVPR, ICCV, ECCV. The last three specialize in CV.

ICCV at a glance: posters, tutorials, workshops, booths

1075 papers were accepted to the conference, and there were about 7500 attendees. 103 people came from Russia, with papers by employees of Yandex, Skoltech, Samsung AI Center Moscow, and Samara University. Not that many top researchers visited ICCV this year, but here, for example, is Alexei (Alyosha) Efros, who always draws a crowd.

Statistics

At all such conferences, papers are presented as posters (more about the format), and the best ones are also given as short talks.

Here are some of the works from Russia:

A tutorial lets you immerse yourself in a subject area; it resembles a university lecture, given by one speaker and usually without discussion of specific papers. An example of a great tutorial: Michael Brown, Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision.

Workshops, on the contrary, are about papers. These are usually papers on a narrow topic, talks by lab heads about all their students' latest work, or papers that were not accepted to the main conference.

Sponsor companies come to ICCV with booths. This year Google, Facebook, Amazon, and many other international companies attended, along with a large number of Korean and Chinese startups, many of which specialize in data labeling. There are demos at the booths, and you can pick up merch and ask questions. The sponsors also throw recruiting parties; you can get into one if you convince the recruiters that you are interested and could plausibly pass their interviews. Having published a paper (or, even better, given a talk), or being in or finishing a PhD, is a plus, but sometimes you can arrange an invite right at the booth by asking the company's engineers interesting questions.

Trends

The conference lets you survey the entire field of CV. From the number of posters on a particular topic you can estimate how hot it is, and some conclusions are suggested by the paper keywords alone.

Zero-shot, one-shot, few-shot, self-supervised and semi-supervised: new approaches to long-studied problems

Researchers are learning to use data more efficiently. For example, FUNIT can generate facial expressions of animals that were not in the training set (given several reference images at inference time). The ideas of Deep Image Prior have been developed further, and a GAN can now be trained on a single image (more on this below in the highlights). Self-supervision can be used for pretraining (solving a task for which labeled data can be synthesized, such as predicting the angle by which an image has been rotated), or one can learn from labeled and unlabeled data simultaneously. In that sense, S4L: Self-Supervised Semi-Supervised Learning can be considered the crowning achievement. And it turns out that pretraining on ImageNet does not always help.
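
To make the pretext-task idea concrete, here is a minimal numpy sketch of the rotation task (our own toy illustration, not code from any of the papers): labels come for free from unlabeled images.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Self-supervised pretext task: rotate each image by a random
    multiple of 90 degrees; the rotation index becomes a free label."""
    xs, ys = [], []
    for img in images:
        k = int(rng.integers(0, 4))     # 0, 90, 180 or 270 degrees
        xs.append(np.rot90(img, k))
        ys.append(k)
    return np.stack(xs), np.array(ys)

# Unlabeled "images": random 8x8 grayscale patches.
rng = np.random.default_rng(0)
unlabeled = rng.random((16, 8, 8))
x, y = make_rotation_batch(unlabeled, rng)
print(x.shape, y.shape)  # a classifier is then trained to predict y from x
```

A network pretrained to predict the rotation learns features that transfer to the downstream task, and the scarce real labels are used only afterwards.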

3D and 360°

Tasks that are mostly solved for photos (segmentation, detection) require additional research for 3D models and panoramic videos. We saw many papers on converting RGB and RGB-D to 3D. Some tasks, such as pose estimation, are solved more naturally in 3D. But there is still no consensus on how exactly to represent 3D models: as meshes, point clouds, voxels, or SDFs, and new options keep appearing.

For panoramas, convolutions on a sphere are being actively developed (see, for example, Orientation-aware Semantic Segmentation on Icosahedron Spheres), along with detection of key objects in the frame.

Pose estimation and human motion prediction

2D pose estimation has already seen real progress, so the focus has shifted toward working with multiple cameras and in 3D. You can even, for example, estimate a skeleton through a wall by tracking how a Wi-Fi signal changes as it passes through the human body.

A lot of work has been done on hand keypoint detection. New datasets have appeared, including ones based on videos of dialogues between two people: you can now predict hand gestures from the audio or transcript of a conversation! Similar progress is being made in gaze estimation.

We can also single out a large cluster of works on human motion prediction (for example, Human Motion Prediction via Spatio-Temporal Inpainting or Structured Prediction Helps 3D Human Motion Modeling). The task is important and, judging by conversations with the authors, is most often used to analyze pedestrian behavior for autonomous driving.

Manipulating people in photos and video, virtual fitting rooms

The main trend is editing face images via interpretable parameters. Ideas include deepfakes from a single image, changing expressions based on a face render (PuppetGAN), and feedforward editing of individual parameters (for example, age). Style transfer has moved from being a headline topic to a practical application. A separate story is virtual fitting rooms, which almost always work poorly; here is an example demo.

Sketch/graph generation

The idea "let the network generate something based on its prior experience" has evolved into "let's show the network which option we are interested in".

SC-FEGAN enables guided inpainting: the user draws part of a face in an erased area of the picture and gets a restored image that follows the strokes.

In one of Adobe's 25 ICCV papers, two GANs are combined: one draws the sketch for the user, the other generates a photorealistic picture from that sketch (project page).

Previously, graphs were not used in image generation, but now they have become a container for knowledge about a scene. Specifying Object Attributes and Relations in Interactive Scene Generation received an ICCV Best Paper Honorable Mention. In general, graphs can be used in both directions: generating graphs from images, or images and text from graphs.

Re-identification of people and cars, crowd counting (!)

Many papers are devoted to tracking people and to re-identifying people and cars. But what surprised us was the pile of papers on counting people in crowds, all of them from China.

Posters

Facebook, on the contrary, anonymizes photos, and in an interesting way: it trains a neural network to generate a face without unique details, one that looks similar, but not so similar that face recognition systems would identify the person correctly.

Protection against adversarial attacks

As computer vision is deployed in the real world (in self-driving cars, in face recognition), the question of the reliability of such systems comes up more and more often. To rely on CV fully, you need to be sure a system is robust to adversarial attacks, so there were no fewer papers about defending against attacks than about the attacks themselves. A lot of work also went into explaining network predictions (saliency maps) and measuring confidence in the results.
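
For context, the classic attack that such defenses are evaluated against, FGSM, fits in a few lines. Below is a toy numpy sketch with a logistic classifier standing in for a real network (our own illustration, not code from any of the papers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, w, b, y, eps):
    """Fast Gradient Sign Method: shift x by eps along the sign of the
    loss gradient, pushing the prediction away from the true label y."""
    p = sigmoid(w @ x + b)          # predicted probability of class 1
    grad_x = (p - y) * w            # gradient of cross-entropy w.r.t. x
    return x + eps * np.sign(grad_x)

# Toy logistic "classifier" on a 16-dimensional input.
rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.0
x = rng.normal(size=16)
y = 1.0 if sigmoid(w @ x + b) > 0.5 else 0.0   # take the model's own label
x_adv = fgsm(x, w, b, y, eps=0.1)
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))
```

Even a small per-coordinate budget eps moves the confidence away from the true label; defenses try to keep predictions stable under all such bounded perturbations.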

Combined tasks

In most tasks with a single target, the room for quality improvement is almost exhausted. One new direction for further gains is teaching neural networks to solve several related tasks at once. Examples:
- action prediction + optical flow prediction,
- video representation + language representation (VideoBERT),
- super resolution + HDR.
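
Architecturally, such combinations usually mean one shared backbone with a head per task, so gradients from each task shape the common features. A minimal numpy sketch (sizes and task names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared backbone, two task-specific heads.
W_shared = 0.1 * rng.normal(size=(8, 16))
W_action = 0.1 * rng.normal(size=(4, 8))   # e.g. action-class logits
W_flow = 0.1 * rng.normal(size=(2, 8))     # e.g. mean flow components

def forward(x):
    h = np.maximum(0.0, W_shared @ x)      # shared ReLU features
    return W_action @ h, W_flow @ h

x = rng.normal(size=16)
action_logits, flow = forward(x)
# Training minimizes a weighted sum of the per-task losses, so both
# tasks regularize the shared representation.
print(action_logits.shape, flow.shape)
```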

There are even papers on segmentation, pose estimation, and re-identification of animals!

Highlights

Almost all the papers were known in advance, since their text was available on arXiv.org. So presentations of works like Everybody Dance Now, FUNIT, or Image2StyleGAN felt rather strange: very useful work, but by no means new. It seems the classic process of scientific publication is failing here; science is moving too fast.

It is very hard to pick the best works: there are many of them, on very different topics. Several papers received awards and mentions.

We want to highlight the works that are interesting from the image manipulation angle, since that is our topic. They struck us as fresh and interesting (we do not claim to be objective).

SinGAN (best paper award) and InGAN

SinGAN: project page, arXiv, code.
InGAN: project page, arXiv, code.

Both develop the Deep Image Prior idea of Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instead of training a GAN on a dataset, the networks learn from fragments of a single image, memorizing the statistics within it. The trained network can then edit and animate the photo (SinGAN) or generate new images of arbitrary size from the textures of the original while preserving local structure (InGAN).

SinGAN:

InGAN:

Seeing What a GAN Cannot Generate

Project page.

Image-generating neural networks usually take a random noise vector as input. In a trained network, the input vectors form a space in which small movements lead to small changes in the picture. Through optimization one can solve the inverse problem: given a real-world picture, find a matching input vector. The authors show that an exactly matching picture can almost never be found; some objects in the picture are simply not generated, apparently because of their high variability.

The authors hypothesize that the GAN does not cover the entire image space, only a subset riddled with holes, like cheese. Searching it for real-world photos will always fail, because the GAN still does not generate perfectly real images. The gap between real and generated images can be closed only by changing the network's weights, that is, by fine-tuning it on the specific photo.
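
The inversion step described above is ordinary gradient descent in latent space. Here is a toy numpy sketch where a random linear-plus-tanh map stands in for a trained generator (our own illustration, not the authors' code): an image the "generator" can produce is recovered almost exactly, while an arbitrary image is not.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8))       # toy generator: 8-dim latent -> 32 "pixels"
G = lambda z: np.tanh(W @ z)

def invert(x, steps=4000, lr=0.005):
    """Gradient descent on ||G(z) - x||^2 to find a latent z for image x."""
    z = np.zeros(W.shape[1])
    for _ in range(steps):
        g = G(z)
        grad = W.T @ (2 * (g - x) * (1 - g ** 2))   # chain rule through tanh
        z -= lr * grad
    return z

x_in_range = G(0.5 * rng.normal(size=8))       # an image G CAN produce
x_off_manifold = rng.uniform(-3, 3, size=32)   # an arbitrary "real" image
err_in = np.mean((G(invert(x_in_range)) - x_in_range) ** 2)
err_off = np.mean((G(invert(x_off_manifold)) - x_off_manifold) ** 2)
print(err_in, err_off)   # the off-manifold error stays large
```

This mirrors the paper's observation: for real photos the residual never vanishes, and only changing the generator's weights (here, W) could close the gap.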

Once the network has been fine-tuned on a specific photo, you can try various manipulations on that image. In the example below, a window has been added to the photo, and the network additionally generated reflections on the kitchen cabinets. This means that even after fine-tuning on a single photo, the network has not lost its ability to see relationships between the objects in a scene.

GANalyze: Toward Visual Definitions of Cognitive Image Properties

Project page, arXiv.

Using the approach from this work, you can visualize and analyze what a neural network has learned. The authors propose training a GAN to create images for which an assessor network will produce a given prediction. The paper uses several networks as examples, including MemNet, which predicts how memorable a photo is. It turns out that to be more memorable, the object in a photo should:

  • be closer to the center,
  • have a rounder or squarer shape and a simple structure,
  • be on a uniform background,
  • have expressive eyes (at least in photos of dogs),
  • be brighter, more saturated, and in some cases redder.

Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis

Project page, arXiv, code.

A pipeline for generating images of people from a single photo. The authors show successful examples of transferring one person's motion to another, transferring clothes between people, and generating novel views of a person, all from one photograph. Unlike previous works, the conditioning uses not 2D keypoints (pose) but a 3D body mesh (pose + shape). The authors also devised a way to transfer information from the source image to the generated one (the Liquid Warping Block). The results look decent, but the output resolution is only 256x256. For comparison, vid2vid, which appeared a year ago, can generate at 2048x1024, but it needs a full 10 minutes of video as a dataset.

FSGAN: Subject Agnostic Face Swapping and Reenactment

Project page, arXiv.

At first it seems there is nothing unusual here: a deepfake of more or less normal quality. But the main achievement of the work is swapping faces from a single picture: unlike previous works, it does not require training on many photographs of a particular person. The pipeline is cumbersome (reenactment and segmentation, view interpolation, inpainting, blending) and full of technical hacks, but the result is worth it.

Detecting The Unexpected via Image Resynthesis

arXiv.

How can a self-driving car understand that an object that does not fall into any semantic segmentation class has suddenly appeared in front of it? There are several methods, but the authors propose a new, intuitive algorithm that works better than its predecessors. Semantic segmentation is predicted from the input road image and fed into a GAN (pix2pixHD), which tries to restore the original image from the semantic map alone. Anomalies that do not fall into any of the classes will differ significantly between the input and the generated image. The three images (original, segmentation, and reconstruction) are then fed into another network that predicts anomalies. The dataset for this was generated from the well-known Cityscapes dataset by randomly changing classes in the semantic segmentation. Interestingly, in this setting a dog standing in the middle of the road but correctly segmented (meaning a class exists for it) is not an anomaly, since the system was able to recognize it.
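
The core discrepancy idea can be caricatured in a few lines: wherever the resynthesized image disagrees with the input, something the generator could not explain is present. In the paper a trained network replaces this naive threshold; the sketch below is our own toy numpy illustration.

```python
import numpy as np

def anomaly_map(original, resynthesized, threshold=0.2):
    """Flag pixels where the image resynthesized from the semantic map
    disagrees with the input; such pixels suggest an unknown object."""
    discrepancy = np.abs(original - resynthesized).mean(axis=-1)
    return discrepancy > threshold

# Toy 4x4 RGB "road" scene: the resynthesis reproduces the road but
# cannot reproduce an unexpected object at the top-left pixel.
original = np.full((4, 4, 3), 0.5)
resynthesized = np.full((4, 4, 3), 0.5)
original[0, 0] = (1.0, 0.0, 0.0)        # the unexpected object
mask = anomaly_map(original, resynthesized)
print(mask.astype(int))   # 1 only at the anomalous pixel
```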

Conclusion

Before the conference, it is worth figuring out exactly what your scientific interests are, which talks you want to attend, and whom you want to talk to. Then everything will be far more productive.

ICCV is first and foremost networking. You realize there are top institutes and top scientists, you start to find your bearings in the field, and you meet people. Papers you can read on arXiv, and by the way it is very cool that you no longer have to go anywhere for knowledge.

In addition, at the conference you can dive into topics outside your own and spot trends. And, of course, write up a list of papers to read. If you are a student, it is a chance to meet a potential scientific advisor; if you are from industry, a future employer; and if you are a company, a chance to show yourself.

Subscribe to @loss_function_porn! It is a personal project we run together with karfly. All the works we liked during the conference we posted here: @loss_function_live.

Source: habr.com
