Building an automated system to combat malicious actors on the site (fraud)

For about the last six months, I have been building a system to combat fraud (fraudulent activity) without any pre-existing infrastructure for it. The ideas we have found and implemented in our system now help us detect and analyze many fraudulent activities. In this article, I would like to describe the principles we followed and what we did to reach the current state of the system, without going deep into the technical details.

Principles of our system

When you hear terms like "automatic" and "fraud", you most likely start thinking about machine learning, Apache Spark, Hadoop, Python, Airflow, and other technologies from the Apache Foundation ecosystem and the data science field. I think there is one aspect of using these tools that is usually not mentioned: they require certain prerequisites to be in place in your enterprise systems before you can start using them. In short, you need an enterprise data platform that includes a data lake and a data warehouse. But what if you don't have such a platform and still need to develop this practice? The principles I describe below have helped us get to the point where we can focus on improving our ideas rather than on finding an idea that works at all. However, the project has not reached a plateau: there are many more things planned from both the technological and product points of view.

Principle 1: Business Value First

We put "business value" at the forefront of all our efforts. In general, any automatic analysis system belongs to the class of complex systems with a high level of automation and technical complexity, and creating a complete solution from scratch takes a lot of time. We decided to put business value first and technological maturity second. In real life, this means that we do not treat advanced technology as a dogma; we choose the technology that works best for us at the moment. Over time, it is likely that we will have to re-implement some modules. This is the compromise we accepted.

Principle 2: Augmented Intelligence

I bet most people who aren't deeply involved in developing machine learning solutions think that replacing humans is the goal. In fact, machine learning solutions are far from perfect, and replacement is possible only in certain areas. We abandoned this idea from the start for several reasons: imbalanced data on fraudulent activity and the inability to provide an exhaustive list of features for machine learning models. Instead, we opted for augmented intelligence. This is an alternative concept of artificial intelligence that focuses on the supporting role of AI, emphasizing that cognitive technologies are designed to enhance human intelligence, not replace it. [1]

With this in mind, developing a complete machine learning solution from the start would have required a huge amount of effort and delayed the creation of value for our business. We decided to build a system whose machine learning aspect grows iteratively under the guidance of our domain experts. The tricky part of developing such a system is that it has to present cases to our analysts in terms that go beyond a simple fraud / not-fraud verdict. In general, any anomaly in customer behavior is a suspicious case that specialists need to investigate and somehow respond to. Only a few of these recorded cases can really be classified as fraud.
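To make this concrete, here is a minimal sketch in Python of how such a triage step could look. It is an illustration under assumptions, not our production code: the anomaly score, the threshold, and the CaseStatus names are all hypothetical; the only point is that the system routes anomalies to analysts instead of declaring fraud on its own.

    from dataclasses import dataclass
    from enum import Enum


    class CaseStatus(Enum):
        NORMAL = "normal"          # nothing to show to analysts
        SUSPICIOUS = "suspicious"  # routed to analysts for investigation
        # Only an analyst can promote a case to "fraud"; the system never does.


    @dataclass
    class CustomerCase:
        customer_id: str
        anomaly_score: float  # produced upstream by rules or a model


    def triage(case: CustomerCase, threshold: float = 0.7) -> CaseStatus:
        """Route anomalous behavior to humans instead of labeling it as fraud."""
        if case.anomaly_score >= threshold:
            return CaseStatus.SUSPICIOUS
        return CaseStatus.NORMAL


    # The system only surfaces the case; an analyst makes the final call.
    print(triage(CustomerCase("c-42", anomaly_score=0.93)))  # CaseStatus.SUSPICIOUS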

Principle 3: Rich Analytics Platform

The most difficult part of our system is end-to-end verification of its workflow. Analysts and developers should be able to easily obtain historical datasets with all the metrics that were used for the analysis. In addition, the data platform should provide an easy way to supplement an existing set of indicators with a new one. The processes that we create, and these are not just software processes, should make it easy to recalculate previous periods, add new metrics, and change the data forecast.

One way to achieve this would be to accumulate all the data that our production system generates. In that case, however, the data would gradually become a hindrance: we would need to store and protect a growing amount of data that we do not use, and it would become more and more irrelevant over time while still requiring effort to manage. Data hoarding did not make sense for us, so we took a different approach: we organize real-time data warehouses around the target entities that we want to classify, and store only the data that allows us to check the most recent, up-to-date periods. The challenge with this effort is that our system is heterogeneous, with multiple data stores and software modules that require careful planning to work in a consistent manner.
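As an illustration of this idea, here is a minimal in-process sketch of an entity-centric store with a retention window. In our real system this role is played by several heterogeneous stores, so the EntityWindowStore class and its parameters below are purely hypothetical.

    import time
    from collections import defaultdict, deque
    from typing import Optional


    class EntityWindowStore:
        """Keeps only recent observations per target entity (e.g. per customer),
        so the latest periods can be rechecked without hoarding historical data."""

        def __init__(self, retention_seconds: float):
            self.retention = retention_seconds
            self._data = defaultdict(deque)  # entity_id -> deque of (timestamp, metrics)

        def add(self, entity_id: str, metrics: dict, ts: Optional[float] = None) -> None:
            ts = time.time() if ts is None else ts
            window = self._data[entity_id]
            window.append((ts, metrics))
            # Evict everything older than the retention period.
            while window and window[0][0] < ts - self.retention:
                window.popleft()

        def recent(self, entity_id: str) -> list:
            return [m for _, m in self._data[entity_id]]


    store = EntityWindowStore(retention_seconds=3600)  # keep roughly one hour per entity
    store.add("customer-1", {"operations": 3})
    print(store.recent("customer-1"))  # [{'operations': 3}]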

Design concepts of our system

We have four main components in our system: an ingestion system, a computational system, BI analysis, and a tracking system. They serve specific, isolated purposes, and we keep them isolated by following certain development approaches.


Contract-based design

First of all, we agreed that components should rely only on certain data structures (contracts) that are passed between them. This makes it easy to integrate components without imposing a specific composition (or order) on them. For example, in some cases this allows us to integrate the ingestion system directly with the alert tracking system; when that happens, the integration follows the agreed alert contract. This means that both components are connected through a contract that any other component can also use: we do not add an extra contract just to push alerts from the ingestion system into the tracking system. This approach keeps the number of contracts to a predetermined minimum and simplifies the system and its communications. Essentially, we take an approach called "Contract First Design" and apply it to streaming contracts. [2]
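To give a feel for what such a contract looks like, here is a minimal sketch of a hypothetical alert contract in Python. The field names are illustrative assumptions rather than our actual schema; the point is that every component that produces or consumes alerts shares the same structure and serialization.

    import json
    from dataclasses import dataclass, asdict


    @dataclass(frozen=True)
    class AlertContract:
        """Hypothetical alert contract shared by every component that emits or
        consumes alerts (ingestion, computation, tracking), so no pair of
        components needs a private, point-to-point format."""
        alert_id: str
        entity_id: str      # the customer / account the alert is about
        kind: str           # e.g. "suspicious_behavior"
        score: float        # anomaly score attached by the producer
        created_at: str     # ISO-8601 timestamp


    def serialize(alert: AlertContract) -> bytes:
        """All components exchange the same JSON representation of the contract."""
        return json.dumps(asdict(alert)).encode("utf-8")


    def deserialize(payload: bytes) -> AlertContract:
        return AlertContract(**json.loads(payload.decode("utf-8")))


    msg = serialize(AlertContract("a-1", "customer-42", "suspicious_behavior",
                                  0.93, "2020-06-01T12:00:00Z"))
    assert deserialize(msg).entity_id == "customer-42"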

Streaming Everywhere

Saving and managing state in the system inevitably leads to complications in its implementation. In general, state must be accessible from any component, it must be consistent and provide the most up-to-date values across all components, and it must be reliable, with correct values. In addition, making calls to persistent storage to get the latest state increases the amount of I/O and the complexity of the algorithms used in our real-time pipelines. Because of this, we decided to remove state storage from our system completely, wherever possible. This approach requires that all necessary data be included in the transmitted data unit (message). For example, if we need to calculate the total number of some observations (the number of operations or cases with certain characteristics), we calculate it in memory and generate a stream of such values. Dependent modules use partitioning and batching to split the stream by entity and operate on the latest values. This approach eliminated the need for persistent disk storage for such data. Our system uses Kafka as a message broker, and Kafka can even be used as a database with KSQL. [3] But doing so would tie our solution tightly to Kafka, so we decided against it. The approach we have chosen allows us to replace Kafka with another message broker without major internal changes to the system.
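A minimal sketch of this idea follows, with a stand-in publish callback instead of a real Kafka producer; the event shape and field names are assumptions made for illustration.

    from collections import Counter
    from typing import Callable, Iterable


    def count_and_stream(events: Iterable[dict],
                         publish: Callable[[str, dict], None]) -> None:
        """Aggregate in memory and emit the latest total with every message,
        so downstream modules never have to query a persistent state store."""
        totals = Counter()
        for event in events:
            key = event["entity_id"]               # partitioning key for the broker
            totals[key] += 1
            publish(key, {
                "entity_id": key,
                "operations_total": totals[key],   # full state travels in the message
            })


    # Stand-in for a real producer (e.g. a Kafka producer keyed by entity_id).
    count_and_stream(
        [{"entity_id": "customer-1"}, {"entity_id": "customer-1"}],
        publish=lambda key, value: print(key, value),
    )

Because the latest total travels inside each message, a downstream consumer only needs the most recent message per partition key to know the current state.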

This concept does not mean that we do not use disk storage and databases. In order to check and analyze the performance of the system, we need to store a significant amount of data on disk, which represents various indicators and states. The important point here is that real-time algorithms do not depend on such data. In most cases, we use the saved data for offline analysis, debugging, and tracking of specific cases and results that the system produces.

Problems in our system

There are problems that we have solved only to a certain degree, and they still require more thoughtful solutions. For now, I would just like to mention them here, because each item is worth its own article.

  • We still need to define processes and policies that help us generate meaningful and relevant data for automated analysis, discovery, and exploration.
  • Incorporating the results of human analysis into the automatic tuning of the system so that it stays up to date with the latest data. This is not only an update of our model, but also an update of our processes and a better understanding of our data.
  • Finding a balance between a deterministic IF-ELSE approach and ML. Someone once said: "ML is a tool for the desperate." This means that you reach for ML when you no longer understand how to optimize and improve your algorithms. On the other hand, a purely deterministic approach cannot detect anomalies that were not foreseen.
  • We need an easy way to test our hypotheses or correlations between metrics in the data.
  • The system must support multiple levels of true positive results. Fraud cases are only a fraction of all cases that can be considered positive for the system. For example, analysts want to receive all suspicious cases for review, and only a small fraction of those turn out to be fraud. The system must effectively provide analysts with all such cases, whether they involve real fraud or merely suspicious behavior.
  • The data platform should be able to retrieve historical datasets with metrics defined and calculated on the fly.
  • Simple and automatic deployment of any of the system components in at least three different environments: production, experimental (beta), and development.
  • And last but not least, we need to create an extensive benchmarking platform on which we can evaluate our models (a minimal sketch of this kind of evaluation follows the list). [4]
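Since reference [4] points to the AUC-ROC curve, here is a minimal sketch of the kind of evaluation such a benchmarking platform would run. It assumes scikit-learn is available, and the labels and scores below are made up for illustration.

    # Assumes scikit-learn is installed; labels and scores are made up.
    from sklearn.metrics import roc_auc_score

    # 1 = case confirmed by analysts, 0 = case dismissed after review.
    labels = [0, 0, 1, 0, 1, 1, 0, 1]
    # Scores produced by a candidate model for the same cases.
    scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9]

    print("ROC AUC:", roc_auc_score(labels, scores))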

References

  1. What is Augmented Intelligence?
  2. Implementing an API-First Design Methodology
  3. Kafka Transforming Into "Event Streaming Database"
  4. Understanding AUC-ROC Curve

Source: habr.com
