How do we moderate ads?


Any service whose users can create their own content (UGC, user-generated content) has to deal not only with business problems but also with cleaning up that UGC. Poor or low-quality content moderation can make the service less attractive to users, up to the point of killing it entirely.

Today we will tell you about the synergy between Yula and Odnoklassniki that helps us moderate ads in Yula effectively.

In general, synergy is a very useful thing, and in the modern world, when technologies and trends change very quickly, it can turn into a lifesaver. Why waste scarce resources and time on inventing something that has already been invented and perfected before you?

We thought the same when we faced the task of moderating user content: pictures, text, and links. Every day our users upload millions of pieces of content to Yula, and without automatic processing it is simply not feasible to moderate all this data manually.

So we used a ready-made moderation platform, which by that time our colleagues from Odnoklassniki had polished to a state of "near perfection".

Why Odnoklassniki?

Every day, tens of millions of users come to the social network and publish billions of pieces of content, from photos and videos to texts. The Odnoklassniki moderation platform helps check these very large volumes of data and counteract spammers and bots.

The OK moderation team has accumulated a lot of experience: they have been improving their tool for 12 years. Importantly, they could not only share their ready-made solutions, but also adapt the architecture of their platform to our specific tasks.


Hereafter, for brevity, we will refer to the OK moderation platform simply as "the platform".

How it all works

Data exchange between Yula and Odnoklassniki is set up through Apache Kafka.

Why we chose this tool:

  • In Yula, all ads are post-moderated, so a synchronous response was not initially required.
  • If something goes seriously wrong and Yula or Odnoklassniki becomes unavailable, including due to peak loads, the data in Kafka will not go anywhere and can be read later.
  • The platform was already integrated with Kafka, so most of the security issues were resolved.


For each ad created or modified by a user in Yula, a JSON document with its data is generated and put into Kafka for subsequent moderation. Ads are read from Kafka into the platform, where they are checked automatically or manually. Bad ads are blocked with the reason indicated, and those in which the platform found no violations are marked as "good". All decisions are then sent back to Yula and applied in the service.
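A minimal sketch of the producer side (the field names and topic here are assumptions for illustration, not Yula's actual schema):

```python
import json

# Hypothetical field names -- the real Yula message schema is not public.
def build_moderation_message(ad):
    """Serialize an ad into the JSON payload sent to Kafka for moderation."""
    payload = {
        "ad_id": ad["id"],
        "title": ad["title"],
        "description": ad["description"],
        "photos": ad["photo_urls"],
        "category": ad["category"],
        "price": ad["price"],
    }
    return json.dumps(payload, ensure_ascii=False)

msg = build_moderation_message({
    "id": 42, "title": "iPhone 7", "description": "Like new",
    "photo_urls": ["https://example.com/1.jpg"],
    "category": "Cell Phones", "price": 15000,
})

# A real producer call would look roughly like (kafka-python style):
# producer.send("ads-moderation", msg.encode("utf-8"))
```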

As a result, for Yula everything boils down to simple actions: send an ad to the Odnoklassniki platform and get back the verdict "ok", or "not ok" with the reason.

Automatic processing

What happens to an ad once it reaches the platform? Each ad is broken down into several entities:

  • title,
  • description,
  • photos,
  • the category and subcategory chosen by the user,
  • price.


Then, for each entity, the platform performs clustering to find duplicates. Moreover, text and photos are clustered according to different schemes.

Texts are normalized before clustering to throw out special characters, substituted letters, and other garbage. The resulting data is split into N-grams, each of which is hashed, producing a set of unique hashes. The similarity between two texts is computed as the Jaccard similarity between their hash sets. If the similarity is above a threshold, the texts are glued into one cluster. MinHash and locality-sensitive hashing are used to speed up the search for similar clusters.
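A minimal sketch of this text-similarity step, computing exact Jaccard similarity over hashed character trigrams (the real platform additionally uses MinHash and LSH to avoid pairwise comparisons, and the 0.6 threshold here is illustrative):

```python
import re
import hashlib

def normalize(text):
    # Lowercase and drop everything except Latin/Cyrillic letters,
    # digits, and spaces -- special characters and "garbage" go away.
    return re.sub(r"[^a-zа-я0-9 ]", "", text.lower())

def ngram_hashes(text, n=3):
    # Split the normalized text into character n-grams and hash each
    # one, producing a set of unique hashes.
    t = normalize(text)
    return {hashlib.md5(t[i:i + n].encode()).hexdigest()
            for i in range(len(t) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity: |intersection| / |union|.
    return len(a & b) / len(a | b) if a | b else 0.0

h1 = ngram_hashes("Selling an iPhone 7, like new!")
h2 = ngram_hashes("Selling an iPhone 7 -- like new")
similar = jaccard(h1, h2) > 0.6  # above threshold: glue into one cluster
```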

For photos, various ways of gluing images together have been devised, from comparing pHashes to finding duplicates with a neural network.
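The pHash comparison amounts to a Hamming distance between two 64-bit perceptual hashes; the bit threshold below is illustrative:

```python
def hamming(h1: int, h2: int) -> int:
    # Number of differing bits between two 64-bit perceptual hashes.
    return bin(h1 ^ h2).count("1")

def is_duplicate(h1: int, h2: int, max_bits: int = 8) -> bool:
    # Hashes within a few bits of each other are treated as
    # near-duplicate images (threshold chosen for illustration).
    return hamming(h1, h2) <= max_bits

a = 0xF0F0F0F0F0F0F0F0
b = 0xF0F0F0F0F0F0F0F1  # differs from a in exactly one bit
```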

The last approach is the most hardcore. To train the model, triples of images (N, A, P) were selected in which N is not similar to A, and P is similar to A (a near-duplicate). The neural network then learned to bring A and P as close as possible and push A and N as far apart as possible. This yields fewer false positives than simply taking embeddings from a pre-trained network.

When the neural network receives pictures as input, it generates a 128-dimensional vector for each of them, and the proximity of the images is estimated from these vectors. A threshold is then chosen at which close images are considered duplicates.
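A toy sketch of the embedding comparison and the triplet objective (3-dimensional vectors stand in for the real 128-dimensional ones; the distance threshold and margin are illustrative):

```python
import math

def l2(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def near_duplicates(u, v, threshold=0.5):
    # In practice the threshold is tuned on validation data.
    return l2(u, v) < threshold

def triplet_loss(a, p, n, margin=0.2):
    # Training objective: pull A and P together, push A and N apart.
    # Zero loss once d(A,N) exceeds d(A,P) by at least the margin.
    return max(0.0, l2(a, p) - l2(a, n) + margin)

emb_a = [0.1, 0.9, 0.3]    # anchor image A
emb_p = [0.12, 0.88, 0.31] # near-duplicate P, close to A
emb_n = [0.9, 0.1, 0.7]    # unrelated image N, far from A
```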

The model is adept at spotting spammers who purposely take pictures of the same product from different angles in order to bypass the pHash comparison.

An example of spam photos glued together by a neural network as duplicates.

At the final stage, duplicate ads are searched for both text and image at the same time.

If two or more ads are glued into a cluster, the system starts automatic blocking, which uses certain algorithms to choose which duplicates to remove and which to keep. For example, if two users' ads have the same photos, the system blocks the more recent ad.
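The "block the more recent duplicate" rule can be sketched like this (field names are assumed for illustration):

```python
from datetime import datetime

def pick_ad_to_block(duplicates):
    # Among ads glued into one cluster with identical photos,
    # block the one that was created most recently.
    return max(duplicates, key=lambda ad: ad["created_at"])

ads = [
    {"id": 1, "created_at": datetime(2019, 1, 10)},
    {"id": 2, "created_at": datetime(2019, 3, 2)},
]
```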

Once created, all clusters go through a series of automatic filters. Each filter gives the cluster a score: how likely it is to contain the threat that this filter detects.

For example, the system analyzes the description in an ad and selects potential categories for it. It then takes the one with the highest probability and compares it with the category specified by the ad's author. If they don't match, the ad is blocked for the wrong category. And since we are kind and honest, we directly tell the user which category they need to choose for the ad to pass moderation.
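The category check reduces to comparing the classifier's top prediction with the author's choice; a sketch with assumed data shapes:

```python
def check_category(predicted, author_category):
    """predicted: {category: probability} from a text classifier
    run on the ad's description (shape assumed for illustration)."""
    best = max(predicted, key=predicted.get)
    if best != author_category:
        # Blocked, with a hint telling the user the right category.
        return ("blocked", f"Wrong category: please choose '{best}'")
    return ("ok", None)

verdict, reason = check_category(
    {"Cell Phones": 0.91, "Accessories": 0.06}, "Real Estate")
```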

Notification of blocking for the wrong category.

Machine learning is right at home on our platform. For example, we use it to look for goods prohibited in the Russian Federation in titles and descriptions. And neural network models meticulously "look at" the images for URLs, spam text, phone numbers, and the same prohibited goods.

For cases when a prohibited product is being sold disguised as something legal, and there is no text in the title or description, we use image tagging. Each image can have up to 11 different tags that describe what is in the image.

They are trying to sell a hookah disguised as a samovar.

In parallel with the complex filters, there are also simple ones that solve obvious text-related problems:

  • profanity;
  • URL and phone number detection;
  • mentions of messengers and other contacts;
  • low prices;
  • ads that don't actually sell anything, etc.
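Simple filters of this kind are often just regular expressions; an illustrative sketch (the patterns here are deliberately crude, not the production ones):

```python
import re

# Toy patterns for illustration; real filters are more elaborate.
FILTERS = {
    "url": re.compile(r"https?://\S+|\bwww\.\S+"),
    "phone": re.compile(r"(?:\+?\d[\s\-()]?){7,}"),  # 7+ digit runs
    "messenger": re.compile(r"\b(whatsapp|viber|telegram)\b", re.I),
}

def run_simple_filters(text):
    # Return the names of all filters whose pattern fires on the text.
    return [name for name, rx in FILTERS.items() if rx.search(text)]

hits = run_simple_filters("Call +7 912 345-67-89 or write on Telegram")
```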

Today every ad goes through a fine sieve of more than 50 automatic filters that try to find something bad in the ad.

If none of the detectors fire, Yula receives the answer that the ad is "most likely" fine. We apply this verdict on our side, and users subscribed to the seller get a notification about the new product.

Notification that the seller has a new product.

As a result, each ad becomes "overgrown" with metadata: some of it is generated when the ad is created (the author's IP address, user agent, platform, geolocation, etc.), and the rest is the scores given by each filter.

Ad Queues

When an ad enters the platform, the system puts it in one of the queues. Each queue is formed using a mathematical formula that combines ad metadata in such a way as to detect some kind of bad pattern.

For example, you can create a queue of ads in the “Cell Phones” category from Yula users supposedly from St. Petersburg, but at the same time their IP addresses are from Moscow or other cities.
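Such a queue can be sketched as a simple predicate over ad metadata (field names here are illustrative, not the platform's real schema):

```python
def geo_mismatch_queue(ads):
    # Select "Cell Phones" ads whose declared city differs from the
    # city inferred from the author's IP address.
    return [ad for ad in ads
            if ad["category"] == "Cell Phones"
            and ad["declared_city"] == "Saint Petersburg"
            and ad["ip_city"] != "Saint Petersburg"]

queue = geo_mismatch_queue([
    {"id": 1, "category": "Cell Phones",
     "declared_city": "Saint Petersburg", "ip_city": "Moscow"},
    {"id": 2, "category": "Cell Phones",
     "declared_city": "Saint Petersburg", "ip_city": "Saint Petersburg"},
])
```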

An example of ads posted by the same user in different cities.

Or you can form queues based on the points that the neural network assigns to ads, arranging them in descending order.

Each queue, according to its own formula, assigns a final score to the ad. Then you can proceed in different ways:

  • specify a threshold at which an ad will receive a certain type of blocking;
  • send all ads in the queue to moderators for manual verification;
  • or combine the previous options: specify an automatic blocking threshold and send to moderators those ads that have not reached this threshold.
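The combined option boils down to a tiny routing function (the threshold value is illustrative):

```python
def route(ad_score, block_threshold=95):
    # Scores at or above the threshold are blocked automatically;
    # everything below goes to human moderators.
    if ad_score >= block_threshold:
        return "auto_block"
    return "manual_review"
```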


Why are these queues needed? Say a user uploads a photo of a firearm. The neural network gives it a score of 95 to 100, and with 99% accuracy there really is a weapon in the picture. But below a score of 95, the model's accuracy starts to drop (a common property of neural network models).

As a result, a queue is formed based on the model's score, and ads scoring from 95 to 100 are automatically blocked as "prohibited goods". Ads with a score below 95 are sent to moderators for manual processing.

Chocolate Beretta with cartridges. Only for manual moderation! 🙂

Manual moderation

As of early 2019, about 94% of all ads in Yula are moderated automatically.


If the platform cannot reach a decision on an ad, it sends it for manual moderation. Odnoklassniki developed their own tool for this: a moderator's task immediately displays all the information needed to make a quick decision: is the ad fine, or should it be blocked, and for what reason.

To keep the quality of service from suffering during manual moderation, people's work is constantly monitored. For example, the task stream shows moderators "traps": ads for which a decision has already been prepared. If the moderator's decision does not match the prepared one, the moderator is charged with an error.
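The trap mechanism boils down to comparing moderator verdicts against prepared answers; a minimal sketch with assumed data shapes:

```python
def moderator_error_rate(decisions, traps):
    """decisions: {ad_id: verdict} made by the moderator;
    traps: {ad_id: expected verdict} prepared in advance."""
    checked = [ad_id for ad_id in traps if ad_id in decisions]
    if not checked:
        return 0.0
    errors = sum(decisions[a] != traps[a] for a in checked)
    return errors / len(checked)

rate = moderator_error_rate(
    {"a1": "ok", "a2": "block", "a3": "ok"},
    {"a1": "ok", "a3": "block"},  # a3 was a trap the moderator missed
)
```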

On average, a moderator spends 10 seconds checking one ad, and the error rate is no more than 0.5% of all verified ads.

Crowdsourced moderation

Colleagues from Odnoklassniki went even further and enlisted the "help of the audience": they wrote a game application for the social network, Odnoklassniki Moderator (https://ok.ru/app/moderator), in which a large amount of data can be quickly labeled for some bad attribute. A good way to take advantage of the help of OK users who want to make content more pleasant.

A game where users tag photos that have a phone number.

Any ad queue in the platform can be redirected to the Odnoklassniki Moderator game. Everything the game's users label is then sent to internal moderators for verification. This scheme makes it possible to block ads for which filters have not yet been created, and to build training samples at the same time.

Storing moderation results

We save all the decisions made during moderation so that later we do not re-process those ads that have already been decided on.

Every day, ads form millions of clusters. Over time, each cluster gets a "good" or "bad" mark. Each new ad, or an edit of an existing one, that lands in a marked cluster automatically receives that cluster's resolution. There are about 20 thousand such automatic resolutions per day.


If no new ads arrive in a cluster, it is removed from memory, and its hash and decision are written to Apache Cassandra.

When the platform receives a new ad, it first tries to find a similar cluster among those already in memory and take the decision from it. If there is no such cluster, the platform looks in Cassandra. If it finds a match, it applies that decision to the cluster and sends it to Yula. On average there are 70 such "repeat" decisions a day, 8% of the total.
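The two-tier lookup can be sketched with plain dictionaries standing in for the in-memory cluster table and Cassandra:

```python
class ResolutionStore:
    # Sketch: active clusters live in memory; idle ones are evicted to
    # a persistent store (Cassandra in the article; a dict stands in
    # for it here).
    def __init__(self):
        self.active = {}      # cluster_hash -> verdict (in memory)
        self.persistent = {}  # cluster_hash -> verdict ("Cassandra")

    def evict(self, cluster_hash):
        # The cluster leaves memory; its hash and decision persist.
        self.persistent[cluster_hash] = self.active.pop(cluster_hash)

    def lookup(self, cluster_hash):
        # Check memory first, then fall back to the persistent store.
        if cluster_hash in self.active:
            return self.active[cluster_hash]
        return self.persistent.get(cluster_hash)  # None if never seen

store = ResolutionStore()
store.active["h1"] = "bad"
store.evict("h1")
```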

Summing up

We have been using the Odnoklassniki moderation platform for two and a half years, and we are happy with the results:

  • We automatically moderate 94% of all ads every day.
  • The cost of moderating one ad dropped from 2 rubles to 7 kopecks.
  • Thanks to the ready-made tool, we forgot about the problems of managing moderators.
  • We increased the number of manually processed ads 2.5 times with the same number of moderators and the same budget. The quality of manual moderation has also improved thanks to automated control, and hovers around 0.5% errors.
  • We quickly cover new types of spam with filters.
  • We quickly connect new Yula verticals to moderation: since 2017, real estate, jobs, and auto verticals have appeared in Yula.

Source: habr.com
