Neural networks in computer vision are developing rapidly, and many tasks are still far from solved. To stay on top of the field, all you really need is to follow influencers on Twitter and read the relevant papers on arXiv.org. But we got the opportunity to attend the International Conference on Computer Vision (ICCV) 2019, held this year in South Korea, and we want to share with Habr readers what we saw and learned.
There were many of us from Yandex: developers of the self-driving car, researchers, and engineers working on CV tasks in our services. But this post presents the slightly subjective view of one team, the Machine Intelligence Laboratory (Yandex MILAB); our colleagues surely saw the conference from their own angles.
What the laboratory does
We work on experimental projects related to generating images and music for entertainment purposes. We are especially interested in neural networks that let users modify content (for photos, this task is called image manipulation).
There are a lot of scientific conferences, but the top ones, the so-called A* conferences, stand out: that is where the most interesting and important results are usually published. There is no definitive list of A* conferences; here is an approximate and incomplete one: NeurIPS (formerly NIPS), ICML, SIGIR, WWW, WSDM, KDD, ACL, CVPR, ICCV, ECCV. The last three specialize in CV.
ICCV at a glance: posters, tutorials, workshops, booths
1075 papers were accepted to the conference, and there were 7500 participants. 103 attendees came from Russia, with papers by employees of Yandex, Skoltech, Samsung AI Center Moscow, and Samara University. Not that many top researchers came to ICCV this year, but here, for example, is Alexey (Alyosha) Efros, who always draws a crowd:
Statistics
At all such conferences, papers are presented in the form of posters.
Here are some of the works from Russia:
Tutorials let you immerse yourself in a subject area; they resemble university lectures. Each is given by a single speaker, usually without discussing specific papers. An example of a great tutorial:
Workshops, on the contrary, are about papers. These are usually works on some narrow topic, talks by lab heads covering all of their students' latest results, or papers that were not accepted to the main conference.
Sponsor companies come to ICCV with booths. This year Google, Facebook, Amazon, and many other international companies were there, along with a large number of Korean and Chinese startups; startups specializing in data labeling were especially numerous. There are talks at the booths, and you can pick up merch and ask questions. Sponsors also throw recruiting parties, which you can get into if you convince recruiters that you are interested and could plausibly pass their interviews. Having published a paper (or, better yet, given a talk), or being in or finishing a PhD, is a plus, but sometimes you can arrange it right at the booth by asking the company's engineers interesting questions.
Trends
The conference lets you survey the entire CV field. The number of posters on a given topic is a rough measure of how hot that topic is. Some conclusions follow from the keywords alone:
Zero-shot, one-shot, few-shot, self-supervised and semi-supervised: new approaches to long-studied problems
People are learning to use data more efficiently.
3D and 360°
Tasks that are mostly solved for photos (segmentation, detection) require additional research for 3D models and panoramic videos. We saw many papers on converting RGB and RGB-D to 3D. Some tasks, such as pose estimation, are solved more naturally by switching to three-dimensional models. But so far there is no consensus on how exactly to represent 3D models: as a mesh, point clouds, voxels, or an SDF. Here's another option:
Spherical convolutions are being actively developed for panoramas.
Pose estimation and human motion prediction
There is already solid progress in 2D pose estimation; the focus has now shifted to multi-camera setups and 3D. You can even estimate a skeleton through a wall by tracking how a Wi-Fi signal changes as it passes through the human body.
Much work has been done on hand keypoint detection. New datasets have appeared, including ones based on videos of dialogues between two people: now you can predict hand gestures from the audio or text of a conversation! Similar progress is happening in gaze estimation.
There is also a large cluster of works on predicting human motion.
Manipulating people in photos and videos; virtual fitting rooms
The main trend is editing face images along interpretable parameters. Ideas include deepfakes from a single image and changing facial expressions based on a face render.
Sketch/graph generation
The idea "let the network generate something based on previous experience" has evolved into "let's show the network which option we are interested in."
In one of Adobe's 25 ICCV papers, two GANs are combined: one draws a sketch for the user, the other generates a photorealistic picture from that sketch.
Previously, graphs were not used in image generation, but now they serve as a container for knowledge about the scene. One of the ICCV Best Paper Honorable Mentions went to a paper in this area.
Re-identification of people and cars, crowd counting (!)
Many papers are devoted to tracking and re-identifying people and cars. What surprised us, though, was the pile of papers on counting people in a crowd, all from China.
Posters
But Facebook, on the contrary, anonymizes photos, and in an interesting way: it trains a neural network to generate a face without unique details, one that is similar, but not similar enough for face recognition systems to identify it correctly.
Protection against adversarial attacks
As computer vision applications spread into the real world (self-driving cars, face recognition), the question of the reliability of such systems comes up more and more often. To use CV fully, you need to be sure the system is robust to adversarial attacks, so there were no fewer papers on defending against attacks than on the attacks themselves. There was also a lot of work on explaining network predictions (saliency maps) and on measuring confidence in the result.
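To make the idea of an adversarial attack concrete, here is a minimal sketch of FGSM, one of the simplest attacks: nudge the input along the sign of the loss gradient. A toy logistic regression stands in for a real image classifier; the weights, input, and epsilon are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_attack(x, y, w, b, eps):
    """FGSM for a logistic-regression 'classifier': step along sign of dLoss/dx."""
    p = sigmoid(x @ w + b)          # predicted probability of class 1
    grad_x = (p - y) * w            # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)

# A toy 'model' that confidently classifies x as class 1.
w = np.array([2.0, -1.0, 0.5])
b = 0.1
x = np.array([1.0, -1.0, 1.0])      # clean input, logit = 3.6
y = 1.0                             # true label

x_adv = fgsm_attack(x, y, w, b, eps=0.9)
print(sigmoid(x @ w + b))           # high confidence on the clean input
print(sigmoid(x_adv @ w + b))       # confidence drops after the attack
```

Even this toy shows why robustness is hard: a perturbation bounded per coordinate can move the decision substantially, which is exactly what defense papers try to prevent.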
Combined tasks
In most single-target tasks, the room for quality improvement is nearly exhausted. One new direction for further gains is teaching neural networks to solve several related tasks at once. Examples:
- action prediction + optical flow prediction,
- video representation + language representation.
There are also papers on segmentation, pose estimation, and re-identification of animals!
Highlights
Almost all papers were known in advance, since the text was already available on arXiv.org. So the presentations of works like Everybody Dance Now, FUNIT, and Image2StyleGAN felt rather strange: very useful works, but by no means new. It seems the classic scientific publication process is failing here; science moves too fast.
It is very difficult to single out the best works: there are many of them and the topics differ. Several papers received awards.
We want to highlight works that are interesting from the image manipulation point of view, since that is our topic. They turned out to be quite fresh and interesting for us (we don't claim to be objective).
SinGAN (best paper award) and InGAN
SinGAN:
InGAN:
A development of the Deep Image Prior idea by Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instead of training a GAN on a dataset, the networks learn from fragments of a single picture in order to memorize its internal statistics. The trained network lets you edit and animate photos (SinGAN) or generate new images of any size from the original image's textures while preserving local structure (InGAN).
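The "dataset from a single image" idea can be sketched in a few lines: every overlapping patch of one picture becomes a training example. `extract_patches` is our own illustrative helper, not code from the papers.

```python
import numpy as np

def extract_patches(img, size, stride=1):
    """Collect all overlapping size x size patches of a 2D image."""
    h, w = img.shape
    patches = [
        img[i:i + size, j:j + size]
        for i in range(0, h - size + 1, stride)
        for j in range(0, w - size + 1, stride)
    ]
    return np.stack(patches)

rng = np.random.default_rng(0)
image = rng.random((32, 32))             # a single toy "image"
patches = extract_patches(image, size=7)

# Hundreds of training examples from one picture: (32-7+1)^2 = 676 patches.
print(patches.shape)  # (676, 7, 7)
```

This internal-statistics trick is why SinGAN can train without any dataset at all: a single image already contains thousands of correlated patch samples.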
Seeing What a GAN Cannot Generate
Neural networks that generate images often take a random noise vector as input. In a trained network, the set of input vectors forms a space in which small movements lead to small changes in the picture. Using optimization, you can solve the inverse problem: given a real-world picture, find a matching input vector. The author shows that it is almost never possible to find an exactly matching picture; some objects in it simply are not generated (apparently because of their high variability).
The author hypothesizes that the GAN does not cover the entire image space, only some subset riddled with holes like cheese. Trying to find real-world photos in it will always fail, because a GAN still generates not-quite-real photos. The differences between real and generated images can only be overcome by changing the network's weights, that is, by fine-tuning it on the specific photo.
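The inversion step can be sketched with a toy linear "generator" in place of a real GAN. With a 4-dimensional latent space trying to hit a 16-dimensional "image", gradient descent finds the closest point in the generator's range but can never match exactly, mirroring the paper's observation. All sizes and learning rates here are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.standard_normal((16, 4))       # toy "generator": 4-dim latent -> 16-dim image
x = rng.standard_normal(16)            # a "real photo" outside the generator's range

z = np.zeros(4)
for _ in range(500):                   # gradient descent on ||G(z) - x||^2
    residual = W @ z - x
    z -= 0.01 * (2 * W.T @ residual)

reconstruction_error = np.linalg.norm(W @ z - x)
print(reconstruction_error)            # stays clearly above zero
```

The leftover error is the distance from the "photo" to the generator's range, which is exactly the gap that, in the paper, only retraining the weights can close.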
Once the network is fine-tuned on a specific photo, you can try various manipulations with that image. In the example below, a window has been added to the photo, and the network also generated reflections on the kitchen cabinets. This means the network, even after fine-tuning on one photo, has not lost its ability to see the connections between objects in the scene.
GANalyze: Toward Visual Definitions of Cognitive Image Properties
Using the approach from this work, you can visualize and analyze what a neural network has learned. The authors propose training a GAN to create pictures for which a target network produces given predictions. The paper uses several networks as examples, including MemNet, which predicts how memorable a photo is. It turned out that for better memorability, the object in the photo should:
- be closer to the center,
- have a rounder or more square shape and a simple structure,
- be on a uniform background,
- contain expressive eyes (at least for photos of dogs),
- be brighter, more saturated, and in some cases redder.
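The mechanism behind these findings can be sketched with linear stand-ins for the GAN and the property scorer: shift the latent code along a direction that increases the scorer's output. `G`, `v`, and the step size below are toy assumptions, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((8, 3))     # toy "generator": latent -> image
v = rng.standard_normal(8)          # toy "memorability" scorer: score = v . image

def score(z):
    """Property score of the image generated from latent z."""
    return v @ (G @ z)

z = rng.standard_normal(3)
direction = G.T @ v                 # gradient of the score w.r.t. z
direction /= np.linalg.norm(direction)

shifted = z + 0.5 * direction       # the step size controls "how much more memorable"
print(score(z), score(shifted))     # the score increases along the direction
```

In GANalyze the direction is learned rather than computed in closed form, but the effect is the same: walking the latent space toward higher predicted memorability, then looking at what visually changes.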
Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis
A pipeline for generating photos of people from a single photo. The authors show successful examples of transferring one person's motion to another, transferring clothes between people, and generating new viewpoints of a person, all from one photograph. Unlike previous works, the conditioning uses not 2D keypoints (pose) but a 3D body mesh (pose + shape). The authors also figured out how to transfer information from the source image to the generated one (the Liquid Warping Block). The results look decent, but the output resolution is only 256x256. For comparison, vid2vid, which appeared a year ago, can generate at 2048x1024, but it needs as much as 10 minutes of video as a dataset.
FSGAN: Subject Agnostic Face Swapping and Reenactment
At first it seems like nothing unusual: a deepfake of more or less normal quality. But the main achievement of the work is face swapping from a single picture: unlike previous works, it does not require training on many photographs of a particular person. The pipeline turned out to be cumbersome (reenactment and segmentation, view interpolation, inpainting, blending), with a lot of technical hacks, but the result is worth it.
Detecting The Unexpected via Image Resynthesis
How can a self-driving car understand that an object that doesn't fall into any semantic segmentation class has suddenly appeared in front of it? There are several methods, but the authors propose a new, intuitive algorithm that works better than its predecessors. Semantic segmentation is predicted from the input road image and fed into a GAN (pix2pixHD), which tries to restore the original image from the semantic map alone. Anomalies that don't fall into any segment will differ significantly between the input and the generated image. Then three images (original, segmentation, reconstruction) go into another network that predicts anomalies. The dataset for this was generated from the well-known Cityscapes dataset by randomly changing classes in the semantic segmentation. Interestingly, in this setting a dog standing in the middle of the road but correctly segmented (meaning there is a class for it) is not an anomaly, since the system was able to recognize it.
Conclusion
Before the conference, it's important to know your scientific interests, which talks you want to attend, and whom you want to talk to. Everything will be much more productive that way.
ICCV is primarily about networking: you realize there are top institutes and top scientists, start to get a feel for the field, and meet people. The papers themselves you can read on arXiv, and it's actually very cool that you don't have to travel anywhere for the knowledge.
In addition, at the conference you can dive into topics outside your own and spot trends. And, of course, compile a list of papers to read. If you're a student, it's a chance to meet a potential advisor; if you're from industry, a new employer; and if you're a company, to show yourself off.
Source: habr.com