With a beard, in dark glasses and in profile: difficult situations for computer vision

The technologies and models behind our computer vision system were created and refined gradually across different projects of our company: Mail, Cloud, and Poisk. They matured like a good cheese or cognac. At some point we realized that our neural networks were showing excellent recognition results, so we combined them into a single b2b product, Vision, which we now use ourselves and offer to you.

Today our computer vision technology runs on the Mail.Ru Cloud Solutions platform and solves very complex practical problems. It is based on a number of neural networks trained on our own data sets, each specializing in an applied task. All services run on our own servers. You can integrate the public Vision API into your applications; all the features of the service are available through it. The API is fast: thanks to server GPUs, the average response time within our network is around 100 ms.

Below the cut is a detailed story and many examples of Vision at work.

An example of a service where we ourselves use these facial recognition technologies is Events. One of its components is the Vision photo stands that we install at various conferences. If you walk up to such a stand, take a picture with the built-in camera and enter your email, the system immediately finds, among the photos taken by the conference's staff photographers, the ones in which you appear and, if you wish, sends them to your email. And we are not talking about staged portrait shots: Vision recognizes you even deep in the background of a crowd of visitors. Of course, the photo stands themselves do not do the recognizing; they are just tablets in attractive enclosures that photograph guests with their built-in cameras and send the data to our servers, where all the recognition magic happens. We have seen more than once how the effectiveness of the technology surprises even image recognition specialists. Some examples are described below.

1. Our Face Recognition Model

1.1. Neural network and processing speed

For recognition we use a modification of the ResNet-101 neural network. The average pooling at the end is replaced by a fully connected layer, similar to how it is done in ArcFace, but the size of the vector representations is 128 rather than 512. Our training set contains about 10 million photos of 273 people.

The model runs very fast thanks to a carefully chosen server configuration and GPU computing. A response from the API takes around 100 ms in our internal networks, which includes face detection (finding faces in the photo), recognition, and returning a PersonID in the API response. With large volumes of incoming photos and video, transferring the data to the service and receiving the response takes considerably longer.

1.2. Evaluation of the effectiveness of the model

Evaluating the effectiveness of neural networks is an ambiguous task in itself. Their quality depends on which data sets the models were trained on and whether they were optimized for working with specific data.

We started evaluating the accuracy of our model with the popular LFW verification test, but it is too small and simple; after reaching 99.8% accuracy it is no longer useful. A good competition for evaluating recognition models is MegaFace, on which we gradually reached 82% rank-1 accuracy. The MegaFace test consists of a million distractor photos, and the model must reliably distinguish several thousand photos of celebrities from the FaceScrub data set from the distractors. However, after cleaning the MegaFace test of labeling errors, we found that on the cleaned version we achieve 98% rank-1 accuracy (photos of celebrities are rather specific anyway). So we created a separate identification test, similar to MegaFace but with photos of "ordinary" people, then kept improving recognition accuracy on our own data sets and moved far ahead. In addition, we use a clustering quality test on a set of several thousand photos; it simulates the tagging of faces in a user's cloud storage. Here clusters are groups of similar faces, one group per recognizable person, and we check the quality of the grouping against the real (ground-truth) groups.
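A common way to score a clustering like this against ground-truth groups is pairwise precision and recall: every pair of photos counts as a positive when both photos carry the same label. Here is a minimal sketch; the photo IDs and labels are made up purely for illustration.

```python
from itertools import combinations

def pairwise_scores(predicted, truth):
    """Compare two clusterings of the same photos.

    predicted, truth: dicts mapping photo id -> cluster label.
    Returns (precision, recall) over all photo pairs, where a pair is
    "positive" when both photos share a cluster label.
    """
    photos = sorted(predicted)
    tp = fp = fn = 0
    for a, b in combinations(photos, 2):
        same_pred = predicted[a] == predicted[b]
        same_true = truth[a] == truth[b]
        if same_pred and same_true:
            tp += 1          # correctly grouped together
        elif same_pred:
            fp += 1          # wrongly merged
        elif same_true:
            fn += 1          # wrongly split
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Toy example: photo p3 of person B was wrongly merged into person A's cluster.
truth     = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
predicted = {"p1": 0,   "p2": 0,   "p3": 0,   "p4": 1}
p, r = pairwise_scores(predicted, truth)
```

A single wrongly merged photo hurts precision on every pair it forms with the absorbing cluster, which is why pairwise metrics punish over-merging quickly.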

Of course, any model makes recognition errors, but such situations are often resolved by tuning the thresholds for specific conditions: we use the same thresholds for all conferences, while for access control systems, for example, we have to raise the thresholds considerably to reduce false positives. The vast majority of conference attendees were recognized correctly by our Vision photo stands. Sometimes someone would look at a cropped preview and say: "Your system is wrong, that's not me." Then we would open the whole photo, and it would turn out that the visitor really was in the shot, just not as the subject: they simply happened to be in the blurred background. Moreover, the neural network often recognizes correctly even when part of the face is not visible, or the person is in profile or even half-turned. The system can recognize a person even when the face falls into an area of optical distortion, say when shooting with a wide-angle lens.
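The threshold tuning mentioned above boils down to comparing embedding vectors and deciding how similar is "similar enough". A minimal sketch of the idea follows; the 4-dimensional vectors stand in for real 128-dimensional embeddings, and the threshold values are illustrative, not the production settings.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def same_person(u, v, threshold):
    """Declare a match when similarity clears the threshold.

    A stricter (higher) threshold yields fewer false positives, which
    is why an access-control deployment would raise it compared with a
    conference photo stand.
    """
    return cosine_similarity(u, v) >= threshold

# Illustrative vectors standing in for 128-d face embeddings.
anchor = [0.9, 0.1, 0.0, 0.4]
candidate = [0.85, 0.15, 0.05, 0.45]

relaxed = same_person(anchor, candidate, threshold=0.6)   # photo-stand setting
strict = same_person(anchor, candidate, threshold=0.999)  # access-control setting
```

The same pair of faces matches under the relaxed threshold and is rejected under the strict one, which is exactly the trade-off described above.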

1.3. Examples of testing in difficult situations

Below are examples of how our neural network works. Photos are submitted as input, and the network must label them with a PersonID, a unique person identifier. If two or more images have the same identifier then, according to the model, they show the same person.

We note right away that during testing various model parameters and thresholds are available to us, and we can adjust them to achieve a particular result. The public API is optimized for maximum accuracy on common cases.

Let's start with the simplest: full-face frontal recognition.

Well, that was too easy. Let's complicate the task and add a beard and a handful of years.

Someone will say that this was also not too difficult, because in both cases the whole face is visible and the algorithm has plenty of information about it. Okay, let's turn Tom Hardy in profile. This task is much harder, and we spent a lot of effort on solving it while keeping the error rate low: we curated the training sample, thought through the neural network architecture, refined the loss functions, and improved photo preprocessing.

Let's put a hat on him:

By the way, this is an example of a particularly difficult situation: the face is heavily covered, and in the lower picture a deep shadow also hides the eyes. In real life people very often change their appearance with dark glasses. Let's do the same with Tom.

Okay, let's try uploading photos from different ages, this time experimenting on another actor. Let's take a much harder example, where the age-related changes are especially pronounced. The situation is not far-fetched: it comes up whenever a passport photo must be compared with the face of the bearer. After all, the first passport photo is taken when the holder is about 20, and by 45 a person can change a great deal:

Do you think the star of Mission: Impossible has not changed much with age? Few people would match the top and bottom photos, so much has the boy changed over the years.

Neural networks encounter changes in appearance far more often. For example, women can sometimes change their look dramatically with cosmetics:

Now let's complicate the task even further: different parts of the face are covered in different photographs. In such cases the algorithm cannot compare the samples in full. Nevertheless, Vision handles situations like this well.

Sometimes there are a great many faces in a photograph; more than 100 people can fit in a wide shot of a hall. This is a difficult situation for neural networks, since many faces may be lit differently and some are out of focus. However, if the photo is taken with sufficient resolution and quality (at least 75 pixels along each side of the square covering the face), Vision will be able to detect and recognize them.
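A client could pre-filter detections below that size before requesting recognition. A small sketch of such a check; the 75-pixel figure comes from the text, and the bounding-box format [x1, y1, x2, y2] mirrors the coord arrays in the API response example.

```python
MIN_FACE_SIDE = 75  # minimum pixels per side of the face square, per the text

def face_large_enough(coord, min_side=MIN_FACE_SIDE):
    """coord is [x1, y1, x2, y2]; require both sides of the box
    to reach the minimum size for reliable recognition."""
    x1, y1, x2, y2 = coord
    return (x2 - x1) >= min_side and (y2 - y1) >= min_side

# One large detected face and one too-small background face.
faces = [[149, 60, 234, 181], [10, 10, 50, 60]]
usable = [c for c in faces if face_large_enough(c)]
```

Only the first box (85 x 121 px) survives the filter; the second (40 x 50 px) is below the quoted minimum.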

A peculiarity of reportage photos and surveillance camera footage is that people are often blurry, because they were out of focus or moving at that moment:

Light intensity can also vary greatly from image to image. This, too, often becomes a stumbling block: many algorithms struggle to process images that are too dark or too light, let alone match them accurately. Let me remind you that achieving such a result requires setting the thresholds in a certain way; this capability is not yet publicly available. For all clients we use the same neural network, with thresholds suitable for most practical tasks.

We recently rolled out a new version of the model that recognizes Asian faces with high accuracy. This used to be a big problem, one that was even dubbed the "racism of machine learning" (or "of neural networks"): European and American neural networks recognized Caucasian faces well but performed much worse on Asian and African faces, while in China the situation was presumably the exact opposite. It all comes down to training data sets that reflect the dominant face types of a particular country. The situation is changing, however, and today the problem is far less acute. Vision has no difficulty with people of different ethnicities.

Facial recognition is just one of many applications of our technology; Vision can be taught to recognize almost anything. For example, license plates, including in conditions difficult for algorithms: sharp angles, dirty or hard-to-read plates.

2. Practical use cases

2.1. Physical access control: when two people walk on the same pass

With Vision you can implement systems that record the arrival and departure of employees. A traditional system based on electronic badges has obvious drawbacks; for example, two people can walk through on one badge. If the access control system (ACS) is supplemented with Vision, it will honestly record who came and left, and when.

2.2. Accounting for working hours

This Vision use case is closely related to the previous one. If we supplement the access system with our face recognition service, it will be able not only to notice access violations, but also to register the actual presence of employees in the building or on site. In other words, Vision will help you honestly record who came to work and left at what time, and who skipped work entirely, even if colleagues covered for them in front of management.

2.3. Video Analytics: People Tracking and Security

By tracking people with Vision you can accurately measure the actual traffic of shopping areas, train stations, crossings, streets and many other public places. Tracking can also help control access to, say, a warehouse or other sensitive premises. And of course, tracking people and faces helps solve security problems. Caught someone stealing from your store? Add the PersonID that Vision returns to a blacklist in your video analytics software, and next time the system will immediately alert security if this person shows up again.

2.4. In trade

Retail and various service businesses are interested in recognizing queues. With Vision you can detect that a group of people is not a random crowd but a queue, and measure its length. The system then informs the responsible staff so they can assess the situation: either there is an influx of visitors and extra workers need to be called in, or someone is neglecting their duties.

Another interesting task is distinguishing company employees on the floor from visitors. Usually the system is trained to separate people wearing certain clothes (a dress code) or some distinguishing feature (a company scarf, a badge on the chest, and so on). This helps estimate attendance more accurately, so that employees do not inflate the visitor statistics by their mere presence.

With facial recognition you can also profile your audience: how loyal your visitors are, that is, how many people return to your establishment and how often, and how many unique visitors come per month. To optimize acquisition and retention costs, you can also track how attendance changes by day of the week and even time of day.

Franchisors and chain companies can order an assessment of the branding quality of different retail outlets from photographs: the presence of logos, signs, posters, banners, and so on.

2.5. On transport

Another example of video analytics for security is detecting abandoned items in airport or train station halls. Vision can be trained to recognize hundreds of object classes: furniture, bags, suitcases, umbrellas, various types of clothing, bottles, and so on. If your video analytics system detects an unattended object and recognizes it with Vision, it sends a signal to the security service. A related task is the automatic detection of unusual situations in public places: someone has fallen ill, someone is smoking in the wrong place, a person has fallen onto the rails, and so on; all these patterns can be recognized by video analytics systems through the Vision API.

2.6. Document flow

Another interesting future application of Vision, which we are currently developing, is document recognition and automatic parsing into databases. Instead of manually typing in endless series and numbers, issue dates, account numbers, bank details, dates and places of birth, and many other formalized fields, you will be able to scan documents and send them over a secure channel through the API to the cloud, where the system recognizes them on the fly, parses them, and returns a response with the data in the required format for automatic entry into a database. Today Vision can already classify documents (including PDFs): it distinguishes passports, SNILS, TIN, birth certificates, marriage certificates, and others.

Of course, the neural network cannot handle all these situations out of the box. For each customer a new model is built: many factors, nuances and requirements are taken into account, data sets are selected, and iterations of training, testing and tuning are carried out.

3. API operation scheme

The "gateway" of Vision for users is the REST API. It can accept photos, video files and broadcasts from network cameras (RTSP streams) as input.

To use Vision, you need to sign up for the Mail.ru Cloud Solutions service and obtain access tokens (client_id + client_secret). User authentication uses the OAuth protocol. Raw data is sent to the API in the bodies of POST requests. In response, the client receives a recognition result from the API as structured JSON containing information about the found objects and their coordinates.

Response Example

{
   "status":200,
   "body":{
      "objects":[
         {
            "status":0,
            "name":"file_0"
         },
         {
            "status":0,
            "name":"file_2",
            "persons":[
               {
                  "tag":"person9",
                  "coord":[149,60,234,181],
                  "confidence":0.9999,
                  "awesomeness":0.45
               },
               {
                  "tag":"person10",
                  "coord":[159,70,224,171],
                  "confidence":0.9998,
                  "awesomeness":0.32
               }
            ]
         },
         {
            "status":0,
            "name":"file_3",
            "persons":[
               {
                  "tag":"person11",
                  "coord":[157,60,232,111],
                  "aliases":["person12", "person13"],
                  "confidence":0.9998,
                  "awesomeness":0.32
               }
            ]
         },
         {
            "status":0,
            "name":"file_4",
            "persons":[
               {
                  "tag":"undefined",
                  "coord":[147,50,222,121],
                  "confidence":0.9997,
                  "awesomeness":0.26
               }
            ]
         }
      ],
      "aliases_changed":false
   },
   "htmlencoded":false,
   "last_modified":0
}
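On the client side, a response like the one above is easy to flatten into a list of detections. A minimal parsing sketch follows; it assumes only the fields shown in the example response, with files that contain no faces simply skipped.

```python
import json

def extract_persons(response_body):
    """Collect (file name, tag, coord) triples from a Vision-style
    recognition response body, skipping files where no faces were found."""
    results = []
    for obj in response_body.get("objects", []):
        for person in obj.get("persons", []):
            results.append((obj["name"], person["tag"], person["coord"]))
    return results

# A trimmed-down response mimicking the example above.
raw = json.dumps({
    "status": 200,
    "body": {
        "objects": [
            {"status": 0, "name": "file_0"},
            {"status": 0, "name": "file_2", "persons": [
                {"tag": "person9", "coord": [149, 60, 234, 181],
                 "confidence": 0.9999, "awesomeness": 0.45}
            ]}
        ],
        "aliases_changed": False
    }
})
parsed = json.loads(raw)
persons = extract_persons(parsed["body"])
```

Using `dict.get` with a default keeps the parser tolerant of entries like `file_0`, which report a status but no detected persons.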

The response includes an interesting awesomeness parameter: the conditional "coolness" of the face in the photo, which we use to pick the best face shot from a sequence. We trained a neural network to predict the probability that a picture will be liked on social networks: the better the picture and the more smiling the face, the higher the awesomeness.
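Picking the best shot from a sequence then reduces to taking the detection with the highest awesomeness score. A tiny sketch, using made-up frame data in the shape of the API's person entries:

```python
def best_shot(frames):
    """Pick the face crop with the highest awesomeness score
    from a sequence of detections of the same person."""
    return max(frames, key=lambda f: f["awesomeness"])

# Hypothetical detections of one person across three video frames.
frames = [
    {"frame": 1, "awesomeness": 0.26},
    {"frame": 2, "awesomeness": 0.45},
    {"frame": 3, "awesomeness": 0.32},
]
chosen = best_shot(frames)
```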

The Vision API uses the concept of a space, a tool for creating different sets of faces. Examples of spaces are blacklists and whitelists, lists of visitors, employees, clients, and so on. Up to 10 spaces can be created per token in Vision, and each space can hold up to 50 thousand PersonIDs, that is, up to 500 thousand per token. The number of tokens per account is unlimited.

Today the API supports the following detection and recognition methods:

  • Recognize / Set - face detection and recognition. Automatically assigns a PersonID to each unique face and returns the PersonIDs and coordinates of the found faces.
  • Delete - deletes a specific PersonID from the person database.
  • Truncate - clears an entire space of PersonIDs; useful if the space was used for testing and the database needs to be reset for production.
  • Detect - detects objects, scenes, license plates, landmarks, queues, etc. Returns the classes of the found objects and their coordinates.
  • Detect for documents - detects specific types of documents of the Russian Federation (distinguishes passports, SNILS, TIN, etc.).
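As a rough illustration of how these methods could be wrapped client-side, here is a sketch that only assembles the request URL and query string; the base URL, paths and parameter names are hypothetical placeholders, and the real ones are in the official API documentation linked below.

```python
from urllib.parse import urlencode

# Hypothetical base URL, for illustration only: the real endpoint
# is listed in the Vision API documentation.
BASE_URL = "https://example.invalid/vision/v1"

def build_request(method, space, oauth_token, **params):
    """Assemble the URL for a Vision-style call.

    `method` is one of the method names listed above (e.g. "recognize",
    "delete", "truncate"); the photo bytes themselves would go in the
    POST body, which this sketch leaves out.
    """
    query = {"oauth_token": oauth_token, "space": space, **params}
    # Sort parameters so the resulting URL is deterministic.
    return f"{BASE_URL}/{method}?{urlencode(sorted(query.items()))}"

url = build_request("recognize", space="visitors", oauth_token="TOKEN")
```

Keeping the space name a parameter makes it easy to point the same code at a test space during development and truncate it before going to production.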

We will also soon finish work on methods for OCR, determining gender, age and emotions, and solving merchandising tasks, that is, automatically monitoring product displays in stores. The full API documentation is available here: https://mcs.mail.ru/help/vision-api

4. Conclusion

Today, through the public API, you can access face recognition in photos and videos; it also supports detecting various objects, license plates, landmarks, documents, and entire scenes. There is a sea of application scenarios. Come test our service and set it the trickiest tasks you can. The first 5000 transactions are free. Perhaps it will be the "missing ingredient" for your projects.

Access to the API is granted immediately upon registering and connecting Vision. All Habr users get a promo code for additional transactions: send us a private message with the email address you used to register your account!

Source: habr.com
