Scalable data classification for security and privacy

Content-based data classification is an open problem. Traditional data loss prevention (DLP) systems solve it by fingerprinting the relevant data and monitoring endpoints for those fingerprints. Given the large number of constantly changing data assets at Facebook, this approach not only fails to scale, it is also ill-suited to determining where data actually resides. This article is about an end-to-end system built to detect sensitive semantic types within Facebook at scale and automatically enforce data retention and access controls.

The approach described here is our first end-to-end privacy system that attempts to solve this problem by incorporating data signals, machine learning, and traditional fingerprinting techniques to map and classify all data at Facebook. The system operates in a production environment, achieving an average F2 score of 0.9+ across various privacy classes while processing a large number of data assets in dozens of data stores. What follows is a translation of Facebook's arXiv paper on scalable, machine-learning-based data classification for security and privacy.

Introduction

Organizations today collect and store large amounts of data in a variety of formats and locations [1]. The data is then consumed in many places, sometimes copied or cached multiple times, so that valuable and sensitive business information ends up dispersed across many corporate data stores. When an organization must meet legal or regulatory requirements, for example producing material during civil litigation, it becomes necessary to establish where the required data is located. When a privacy regulation states that an organization must mask all Social Security Numbers (SSNs) before transferring personal information to unauthorized entities, the natural first step is to find all SSNs across the organization's data stores. In such circumstances, data classification becomes critical [1]. A classification system lets organizations automatically enforce privacy and security policies, such as access control and data retention policies. Below we introduce a system built at Facebook that uses multiple data signals, a scalable system architecture, and machine learning to detect sensitive semantic data types.

Data discovery and classification is about finding and labeling data so that relevant information can be retrieved quickly and efficiently when needed. Today this is largely a manual process: it consists of examining the relevant laws or regulations, determining which types of information should be considered sensitive and what the different levels of sensitivity are, and then building classes and classification policies accordingly [1]. Data Loss Prevention (DLP) systems then fingerprint the data and monitor downstream endpoints for those fingerprints. When dealing with a warehouse holding a large number of assets and petabytes of data, this approach simply does not scale.

Our goal is to build a data classification system that scales to both persistent and non-persistent user data, without any additional restrictions on data type or format. This is a bold goal, and, naturally, it is fraught with difficulties. First, a single data entry can be thousands of characters long.

Figure 1. Online and offline prediction flows.

Therefore, we must represent such an entry efficiently with a common set of features that can later be combined and easily moved around. These features must not only support accurate classification but also provide the flexibility and extensibility to add and discover new data types in the future. Second, we need to deal with large standalone tables: persistent data can be stored in tables that are many petabytes in size, which can slow down scanning. Third, we must meet strict classification SLAs for non-persistent data, which forces the system to be highly efficient, fast, and accurate. Finally, we must provide low-latency classification of non-persistent data in order to perform real-time classification, as well as to support web use cases.

This article describes how we dealt with these challenges and presents a fast and scalable classification system that classifies data items of all types, formats, and sources based on a common set of features. We extended the system architecture and built a dedicated machine learning model for fast classification of offline and online data. The article is organized as follows: Section 2 presents the overall design of the system; Section 3 discusses the parts of the machine learning system; Sections 4 and 5 describe related work and outline future directions.

Architecture

To deal with the challenges of persistent and online data at Facebook scale, the classification system has two separate flows, which we discuss in detail below.

Persistent data

Initially, the system must learn about Facebook's many data assets. For each store, some basic information is collected, such as the data center containing the data, the system holding it, and the assets located in that particular data store. This forms a metadata catalog that allows the system to retrieve data efficiently without overloading clients and resources used by other engineers.

This metadata catalog provides a trusted source for all scanned assets and allows the system to track the status of each asset. Using this information together with internal system data, such as when an asset was last successfully scanned, when it was created, and the past memory and CPU requirements for that asset if it was previously scanned, the system prioritizes scan scheduling. Then, for each data asset (as resources become available), a scan job is invoked for the actual asset.

Each job is a compiled binary that performs Bernoulli sampling on the latest data available for each asset. The asset is split into separate columns, and the classification result for each column is processed independently. In addition, the system scans for any rich data within columns: JSON, arrays, encoded structures, URLs, base64-serialized data, and more are all scanned. This can greatly increase scan time, since a single table can contain thousands of nested columns inside a JSON blob.

For each row that is selected in the data asset, the classification system extracts float and text features from the content and links each feature back to the column it was taken from. The result of the feature extraction step is a map of all features for each column found in the data asset.
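As a rough, hypothetical sketch of what such a scan job might look like (the actual scanner, sampling rate, and feature set are not public; `extract_features` and the feature names below are purely illustrative):

```python
import json
import random
import re
from collections import defaultdict

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def bernoulli_sample(rows, p=0.01):
    """Keep each row independently with probability p."""
    return [row for row in rows if random.random() < p]

def extract_features(value: str) -> dict:
    """Map one raw cell value to float features; names are illustrative."""
    return {
        "char_count": float(len(value)),
        "email_count": float(len(EMAIL_RE.findall(value))),
        "digit_ratio": sum(c.isdigit() for c in value) / max(len(value), 1),
    }

def scan_asset(rows, p=0.01):
    """Return a map of column name -> list of per-row feature dicts."""
    features_by_column = defaultdict(list)
    for row in bernoulli_sample(rows, p):
        for column, value in row.items():
            # Recurse into rich data (e.g. JSON blobs) as nested "columns".
            try:
                nested = json.loads(value)
                if isinstance(nested, dict):
                    for k, v in nested.items():
                        features_by_column[f"{column}.{k}"].append(
                            extract_features(str(v)))
                    continue
            except (ValueError, TypeError):
                pass
            features_by_column[column].append(extract_features(str(value)))
    return features_by_column
```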

What are features for?

The concept of features is key. Instead of computing float and text features, we could pass the raw data samples retrieved from each data asset, and machine learning models could be trained directly on those samples rather than on hundreds of feature computations that merely try to approximate them. There are several reasons why we do not do this:

  1. Privacy first: Most importantly, the notion of features allows us to store only the retrieved samples in memory. This ensures that samples are kept for a single purpose and are never logged by the system itself. This is especially important for non-persistent data, because the service must maintain some classification state before it can provide a prediction.
  2. Memory: Some samples can be thousands of characters long. Storing such data and passing it between parts of the system needlessly consumes many extra bytes, and these costs add up over time given that there are many data assets with thousands of columns.
  3. Feature aggregation: Features concisely represent the results of each scan as a set of values, allowing the system to combine the results of previous scans of the same data asset in a convenient way. This is useful for aggregating scan results for the same data asset across multiple runs (see the sketch after this list).
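A minimal sketch of that aggregation step, assuming a simple averaging policy (the real system's merge logic is not described in the paper; a production system might also track maxima, variances, or decayed aggregates):

```python
def aggregate_scans(per_scan_features: list[dict]) -> dict:
    """Merge feature maps from multiple scans of the same column by averaging."""
    merged: dict[str, float] = {}
    counts: dict[str, int] = {}
    for features in per_scan_features:
        for name, value in features.items():
            merged[name] = merged.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: total / counts[name] for name, total in merged.items()}
```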

The features are then sent to a prediction service, where we use both rule-based classification and machine learning to predict the data labels of each column. The service relies on both rule classifiers and machine learning models and selects the best prediction returned by each predictor.

Rule classifiers are manual heuristics that use counts and coefficients to normalize a score into the range 0 to 100. Once such an initial score has been generated for each data type, and the column name associated with the data does not fall into any deny list, the rule classifier selects the data type with the highest normalized score.
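A toy illustration of such a rule classifier; the weights, deny lists, and data type names are all hypothetical, not the production heuristics:

```python
def rule_classify(features: dict, column_name: str,
                  deny_lists: dict[str, set], weights: dict[str, dict]) -> str | None:
    """Weight features into a 0-100 score per data type, drop types whose
    deny list matches the column name, and return the best-scoring type."""
    best_type, best_score = None, 0.0
    for data_type, coeffs in weights.items():
        if column_name in deny_lists.get(data_type, set()):
            continue
        raw = sum(coeffs.get(f, 0.0) * v for f, v in features.items())
        score = max(0.0, min(100.0, raw))   # normalize into [0, 100]
        if score > best_score:
            best_type, best_score = data_type, score
    return best_type
```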

Due to the complexity of classification, relying purely on manual heuristics yields poor accuracy, especially for unstructured data. For this reason we developed a machine learning system to handle the classification of unstructured data such as user-generated content and addresses. Machine learning allowed us to start moving away from manual heuristics and to incorporate additional data signals (e.g., column names and data lineage), greatly improving detection accuracy. We take a deep dive into our machine learning architecture later.

The prediction service stores the results for each column along with metadata about the time and status of the scan. Any consumers and downstream processes that depend on this data can read it either from a daily published dataset that aggregates the results of all scan jobs, or from the real-time data catalog API. Published predictions are the foundation for automatic enforcement of privacy and security policies.

Finally, after the prediction service has written all the data and all predictions are stored, our data catalog API can return all data type predictions for a resource in real time. Each day the system publishes a dataset containing the latest predictions for every asset.

Non-persistent data

While the process above is designed for persistent assets, non-persistent traffic is also considered part of an organization's data and can be important. For this reason, the system provides an online API for generating real-time classification predictions for any non-persistent traffic. The real-time prediction system is widely used in classifying outgoing traffic, incoming traffic to machine learning models, and advertiser data.

Here the API takes two main arguments: a grouping key and the raw data to be classified. The service performs the same feature extraction described above and groups features together under the same key. These features are also persisted in a cache for failover recovery. For each grouping key, the service ensures that it has seen enough samples before calling the prediction service, following the process described above.
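A hedged sketch of what this online flow could look like, assuming a minimum-sample threshold and a simple cache interface (`extract_fn`, `aggregate_fn`, `predict_fn`, `cache.put`, and `MIN_SAMPLES` are placeholders, not the real API):

```python
from collections import defaultdict

MIN_SAMPLES = 20  # assumed threshold; the real value is not published

class OnlineClassifier:
    """Buffer features per grouping key and only call the prediction
    service once enough samples have been seen for that key."""

    def __init__(self, extract_fn, aggregate_fn, predict_fn, cache):
        self.extract_fn = extract_fn      # same feature extraction as offline
        self.aggregate_fn = aggregate_fn  # e.g. the aggregation sketched earlier
        self.predict_fn = predict_fn      # call into the prediction service
        self.cache = cache                # persistent cache used for failover
        self.buffers = defaultdict(list)

    def classify(self, grouping_key: str, raw_value: str):
        features = self.extract_fn(raw_value)
        self.buffers[grouping_key].append(features)
        self.cache.put(grouping_key, self.buffers[grouping_key])  # failover state
        if len(self.buffers[grouping_key]) < MIN_SAMPLES:
            return None                   # not enough evidence for this key yet
        return self.predict_fn(self.aggregate_fn(self.buffers[grouping_key]))
```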

Optimization

To scan some storage systems, we use libraries and techniques for optimized reads from warm storage [2] and ensure there is no disruption to other users accessing the same storage.

For extremely large tables (50+ petabytes), despite all the optimizations and memory efficiency, the system risks running out of memory before the scan and feature computation complete, since a scan is held entirely in memory and is not persisted while in progress. If large tables contain thousands of columns with unstructured blobs of data, the job may fail from insufficient memory when making predictions over the whole table, which reduces coverage. To combat this, we optimized the system to use scan speed as a proxy for how well the system is handling the current load: speed serves as a predictive mechanism to anticipate memory problems and to compute the feature map preemptively, in which case we use less data than usual.
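One way to picture this speed-based back-off, under the assumption that the scanner can reduce its sampling fraction mid-scan (the threshold and batch size below are invented for illustration):

```python
import itertools
import time

TARGET_ROWS_PER_SEC = 50_000  # assumed threshold, not a published number

def adaptive_scan(row_iterator, scan_batch, batch_size=10_000):
    """If scan throughput drops (used here as a proxy for memory pressure),
    shrink the fraction of data fed into feature computation."""
    sample_fraction = 1.0
    while True:
        batch = list(itertools.islice(row_iterator, batch_size))
        if not batch:
            break
        start = time.monotonic()
        scan_batch(batch, sample_fraction)     # caller-supplied scan step
        rate = len(batch) / max(time.monotonic() - start, 1e-6)
        if rate < TARGET_ROWS_PER_SEC:
            sample_fraction = max(0.1, sample_fraction * 0.5)  # back off
```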

Data Signals

A classification system is only as good as its data signals. Here we review all the signals used by the classification system.

  • Content-based: The first and most important signal, of course, is the content itself. Bernoulli sampling is performed on each data asset we scan, and features are extracted from the data content. Many features come from content: any number of float features are possible, representing counts of how many times a particular type of pattern has been seen. For example, we might have a feature counting the number of emails seen in a sample, or a feature counting how many emojis appeared in a sample. These feature counts can be normalized and aggregated across different scans.
  • Data lineage: An important signal that helps when the content has changed relative to a parent table. A common example is hashed data: when data in a child table is hashed, it often comes from a parent table where it remains in the clear. Lineage data helps classify certain types of data when they cannot be read in the clear or have been transformed from an upstream table.
  • Annotations: Another high-quality signal that helps identify unstructured data. In fact, annotations and lineage data can work together to propagate attributes between different data assets. Annotations help identify the source of unstructured data, while lineage data helps track the flow of that data throughout the warehouse.
  • Data injection: A technique in which special, unreadable characters are deliberately injected into known sources with known data types. Then, whenever we scan content containing the same unreadable character sequence, we can infer that the content came from that known data type. This is another high-quality data signal, similar to annotations, except that content-based scanning helps discover the injected data (see the sketch after this list).
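For illustration, data injection detection could be as simple as the following sketch; the marker sequences and type names are made up:

```python
# Hypothetical markers: sequences of zero-width characters tagging a known type.
INJECTION_MARKERS = {
    "\u200b\u200c\u200b": "EMAIL",
    "\u200c\u200b\u200c": "PHONE_NUMBER",
}

def detect_injected_type(value: str) -> str | None:
    """If a known marker sequence appears in the content, infer its data type."""
    for marker, data_type in INJECTION_MARKERS.items():
        if marker in value:
            return data_type
    return None
```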

Metric measurement

An important component is a rigorous methodology for measuring metrics. The main metrics for each classification improvement iteration are per-label precision and recall, with the F2 score being the most important.

Calculating these metrics requires a methodology for labeling data assets that is independent of the system itself but can be compared directly against it. Below we describe how we collect ground truth at Facebook and use it to train our classification system.

Collecting ground truth

We accumulate ground truth from each of the sources listed below into its own table. Each table is responsible for aggregating the latest observed values from that particular source, and each source has data quality checks to ensure that its observed values are of high quality and contain the latest data type labels.

  • Logging platform configurations: Certain fields in Hive tables are populated with data that is known to be of a particular type. Using and propagating this data serves as a reliable source of ground truth.
  • Manual labeling: Developers maintaining the system, as well as external labelers, are trained to label columns. This generally works well for all data types in the warehouse and can be the primary source of ground truth for some unstructured data such as post data or user-generated content.
  • Columns from parent tables can be marked or annotated as containing certain data, and we can track that data into downstream tables.
  • Stream sampling: Certain streams at Facebook carry data of a known type. Using our scanner-as-a-service architecture, we can sample streams with known data types and send them through the system. The system promises not to store this data.
  • Sample tables: Large Hive tables known to contain an entire corpus of a data type can also be used as training data and passed through the scanner-as-a-service. This works well for tables covering the full range of a data type, so that sampling a column at random is equivalent to sampling the whole set of that data type.
  • Synthetic data: We can even use libraries that generate data on the fly. This works well for simple, public data types such as addresses or GPS coordinates.
  • Data stewards: Privacy programs typically rely on data stewards to manually attach policies to pieces of data. This serves as a highly accurate source of ground truth.

We combine each major source of ground truth into a single corpus containing all of that data. The biggest challenge with ground truth is making sure it is representative of the data warehouse; otherwise, classification engines may overfit. To combat this, all of the sources above are used to provide balance when training models or computing metrics. In addition, human labelers uniformly sample different columns across the warehouse and label the data accordingly, so that the ground truth collection remains unbiased.

Continuous Integration

To ensure rapid iteration and improvement, it is important to always measure system performance in real time. Since we can measure every classification improvement against the current system, we can tactically target further improvements with data. Here we look at how the system closes the feedback loop provided by ground truth data.

When the scheduling system encounters an asset that has a label from a trusted source, we schedule two tasks. The first uses our production scanner and thus our production features; the second uses the latest release-candidate scanner with the newest features. Each task writes its output to its own table, tagging the versions along with the classification results.

This is how we compare the classification results of the release candidate and the production model in real time.

While the datasets compare RC and PROD features, many variants of the prediction service's ML classification engine are also logged: the most recently built machine learning model, the current model in production, and any experimental models. The same approach lets us "slice" across different versions of the model (agnostic to our rule classifiers) and compare metrics in real time. This makes it easy to determine when an ML experiment is ready to go into production.

Every night, the RC features computed for that day are sent to the ML training pipeline, where the model is trained on the latest RC features and its performance is evaluated against the ground truth dataset.

Every morning, the model finishes training and is automatically published as an experimental model, where it is included in the experimental list.

Some results

More than 100 different data types are labeled with high accuracy. Well-structured types such as emails and phone numbers are classified with an F2 score greater than 0.95. Free-form data types such as user-generated content and names also perform very well, with F2 scores above 0.85.

A large number of distinct columns of persistent and non-persistent data are classified daily across all data stores. Over 500 terabytes are scanned daily across more than 10 data stores, and coverage of most of these stores exceeds 98%.

Classification has become very efficient over time: classification jobs in the persistent offline flow take an average of 35 seconds from scanning an asset to computing predictions for each column.

Figure 2. Diagram of the continuous integration flow, showing how RC features are generated and sent to the model.

Figure 3. High-level diagram of the machine learning component.

Machine learning system component

In the previous section, we took a deep dive into the architecture of the whole system, highlighting scale, optimization, and the offline and online data flows. In this section, we look at the prediction service and describe the machine learning system that powers it.

With over 100 data types and unstructured content such as post data and user-generated content, relying purely on manual heuristics results in suboptimal classification accuracy, especially for unstructured data. For this reason, we also developed a machine learning system to deal with the complexities of unstructured data. Machine learning lets us move away from manual heuristics and work with features and additional data signals (e.g., column names and data lineage) to improve accuracy.

The model we implemented learns vector representations [3] over dense and sparse features separately. These are then combined to form a vector that passes through a series of batch normalization [4] and nonlinearity steps to produce the final output. The result is a floating point number in [0, 1] for each label, indicating the probability that the instance belongs to that sensitivity type. Using PyTorch for the model allowed us to move faster and made it possible for developers outside the team to quickly make and test changes.
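A minimal PyTorch sketch of this shape; the layer sizes, vocabulary size, and label count are assumptions, not the production architecture:

```python
import torch
import torch.nn as nn

class SensitivityClassifier(nn.Module):
    """Learns embeddings for sparse (text) and dense (float) features
    separately, concatenates them, and applies batch-norm + nonlinearity
    blocks ending in per-label probabilities."""

    def __init__(self, vocab_size=10_000, n_dense=64, n_labels=100, emb_dim=32):
        super().__init__()
        self.sparse_emb = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")
        self.dense_proj = nn.Sequential(nn.Linear(n_dense, emb_dim), nn.ReLU())
        self.trunk = nn.Sequential(
            nn.Linear(2 * emb_dim, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Linear(128, n_labels),
        )

    def forward(self, sparse_ids, dense_feats):
        x = torch.cat([self.sparse_emb(sparse_ids), self.dense_proj(dense_feats)], dim=1)
        return torch.sigmoid(self.trunk(x))   # probability in [0, 1] per label
```

Training a model of this shape against the ground truth corpus would typically use a standard multi-label loss such as binary cross-entropy.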

When designing the architecture, it was important to model sparse (e.g., text) and dense (e.g., numeric) features separately because of their intrinsic differences. For the final architecture, it was also important to perform a parameter sweep to find optimal values for the learning rate, batch size, and other hyperparameters. The choice of optimizer was also an important hyperparameter: we found that the popular Adam optimizer often leads to overfitting, while a model trained with SGD is more stable. There were additional nuances we had to build directly into the model, for example static rules that guarantee the model makes a deterministic prediction when a feature takes a specific value. These static rules are defined by our clients. We found that incorporating them directly into the model resulted in a more self-contained and robust architecture than implementing a post-processing step to handle these special edge cases. Note that these rules are disabled during training so as not to interfere with the gradient descent training process.
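One hypothetical way to wire such static rules into the forward pass, building on the sketch above (the specific rule shown is invented purely to demonstrate the mechanism):

```python
class ClassifierWithRules(SensitivityClassifier):
    """Wraps the model above with client-defined static rules that force a
    deterministic prediction when a feature takes a specific value.
    The rule below (dense feature 0 == 1.0 forces label 3) is illustrative."""

    def forward(self, sparse_ids, dense_feats):
        probs = super().forward(sparse_ids, dense_feats)
        if not self.training:                 # rules are disabled during training
            forced = dense_feats[:, 0] == 1.0
            probs = probs.clone()
            probs[forced] = 0.0
            probs[forced, 3] = 1.0
        return probs
```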

Challenges

One challenge was collecting high-quality ground truth data. The model needs ground truth for each class so that it can learn associations between features and labels. In the previous section, we discussed data collection methods for both system measurement and model training. Analysis showed that data classes such as credit card and bank account numbers are not very common in our warehouse, which makes it difficult to collect large amounts of ground truth for model training. To solve this problem, we developed processes for generating synthetic data for these classes. We generate such data for sensitive types, including SSNs, credit card numbers, and IBAN numbers, which the model previously could not predict. This approach allows sensitive data types to be handled without the privacy risks that come with handling real sensitive data.

In addition to ground truth issues, there are open architectural problems we are working on, such as change isolation and early stopping. Change isolation matters so that when various changes are made to different parts of the network, the impact is isolated to specific classes and does not broadly affect overall prediction performance. Improving the early stopping criteria is also critical, so that we can stop the training process at a point that is stable for all classes, rather than one where some classes overfit while others do not.

Feature Importance

When a new feature is introduced into the model, we want to know its overall impact. We also want to make sure the predictions are interpretable by humans, so that we can understand exactly which features are used for each data type. To this end, we developed and implemented per-class feature importance for the PyTorch model. Note that this differs from the general feature importance that is usually supported, because it tells us which features are important for a particular class. We measure the importance of a feature by calculating the increase in prediction error after permuting that feature's values. A feature is "important" when permuting its values increases the model error, because the model relied on that feature for its prediction. A feature is "not important" when shuffling its values leaves the model error unchanged, because the model ignored it [5].
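A sketch of per-class permutation importance under these definitions, assuming a multi-label model that returns per-class probabilities (`model_predict` and `error_fn` are caller-supplied placeholders):

```python
import numpy as np

def per_class_permutation_importance(model_predict, X, y_true, class_idx,
                                     feature_idx, error_fn, n_repeats=5):
    """Average increase in per-class prediction error after shuffling one feature."""
    baseline = error_fn(y_true[:, class_idx], model_predict(X)[:, class_idx])
    increases = []
    rng = np.random.default_rng(0)
    for _ in range(n_repeats):
        X_perm = X.copy()
        rng.shuffle(X_perm[:, feature_idx])       # permute one feature's values
        err = error_fn(y_true[:, class_idx], model_predict(X_perm)[:, class_idx])
        increases.append(err - baseline)
    return float(np.mean(increases))
```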

Per-class feature importance makes the model interpretable, letting us see what the model pays attention to when predicting a label. For example, when we analyze the ADDR class, we check that an address-related feature such as AddressLinesCount ranks high in that class's feature importance table, so that human intuition aligns well with what the model has learned.

Evaluation

It is important to define a single metric for success. We chose F2, which balances recall and precision with a heavier weight on recall. Recall matters more than precision for the privacy use case, because it is critical for the team not to miss any sensitive data (while still maintaining reasonable precision). The actual F2 performance of our model is beyond the scope of this article, but with careful tuning we achieve a high (0.9+) F2 score for the most important sensitive classes.
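For reference, the F-beta score with beta = 2 is (1 + 2²) · precision · recall / (2² · precision + recall); a quick check with scikit-learn shows how recall dominates the score (the labels below are a toy example, not real data):

```python
from sklearn.metrics import fbeta_score

# Toy labels for one sensitivity class: precision = 0.67, recall = 1.0.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
print(fbeta_score(y_true, y_pred, beta=2))  # ~0.91: high recall outweighs lower precision
```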

Related work

There are many algorithms for automatic classification of unstructured documents using various methods such as pattern matching, document similarity search, and various machine learning methods (Bayesian, decision trees, k-nearest neighbors, and many others) [6]. Any of these can be used as part of the classification. However, the problem is scalability. The classification approach in this article is biased towards flexibility and performance. This allows us to support new classes in the future and keep latency low.

There is also a lot of data fingerprinting work. For example, the authors in [7] described a solution that focuses on the problem of catching confidential data leaks. The underlying assumption is that a data fingerprint can be matched against a set of known sensitive data. The authors in [8] describe a similar privacy leak problem, but their solution is based on the specific Android architecture and is classified only if the user's actions resulted in the sending of personal information or if the underlying application leaked user data. The situation here is somewhat different, as user data can also be highly unstructured. Therefore, we need a more sophisticated technique than fingerprinting.

Finally, to deal with the lack of data for some types of sensitive data, we introduced synthetic data. There is a large body of literature on data augmentation, for example, the authors in [9] investigated the role of noise injection during training and observed positive results in supervised learning. Our approach to privacy is different because introducing noisy data can be counterproductive and instead we focus on high quality synthetic data.

Conclusion

In this article, we presented a system that can classify a piece of data, which allows us to build systems that enforce privacy and security policies. We have shown that scalable infrastructure, continuous integration, machine learning, and high-quality ground truth data are key to the success of many of our privacy initiatives.

There are many directions for future work. These include providing support for raw data (files); classifying not only the data type but also the sensitivity level; and using self-supervised learning during training by generating accurate synthetic examples, which in turn help the model reduce its loss the most. Future work could also focus on the investigation workflow, where we go beyond detection and provide root cause analysis of various privacy violations. This would help in cases such as sensitivity analysis (i.e., whether a data type's privacy sensitivity is high, such as a user IP, or low, such as an internal Facebook IP).

Bibliography

  1. David Ben-David, Tamar Domany, and Abigail Tarem. Enterprise data classification using semantic web technologies. In Peter F. Patel-Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Z. Pan, Ian Horrocks, and Birte Glimm, editors, The Semantic Web - ISWC 2010, pages 66–81, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
  2. Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. f4: Facebook's warm BLOB storage system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 383–398, Broomfield, CO, October 2014. USENIX Association.
  3. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In CJC Burges, L. Bottou, M. Welling, Z. Ghahramani, and KQ Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
  4. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
  5. Leo Breiman. Random forests. Mach. learn., 45(1):5–32, October 2001.
  6. Thair Nu Phyu. Survey of classification techniques in data mining.
  7. X. Shu, D. Yao, and E. Bertino. Privacy-preserving detection of sensitive data exposure. IEEE Transactions on Information Forensics and Security, 10(5):1092–1103, 2015.
  8. Zhemin Yang, Min Yang, Yuan Zhang, Guofei Gu, Peng Ning, and Xiaoyang Wang. Appintent: Analyzing sensitive data transmission in android for privacy leakage detection. pages 1043–1054, 11 2013.
  9. Qizhe Xie, Zihang Dai, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation.

Source: habr.com