How Google's BigQuery democratized data analysis. Part 2

Hey Habr! Enrollment is now open at OTUS for a new stream of the Data Engineer course. In anticipation of the course start, we continue to share useful material with you.

Read the first part


Data governance

Strong Data Governance is the core principle of Twitter Engineering. As we embed BigQuery into our platform, we are focusing on data discovery, access control, security and privacy.

To discover and manage data, we have extended our Data Access Layer (DAL) to provide tools for both on-premises and Google Cloud data, giving our users a single interface and API. As Google Data Catalog moves toward general availability, we will incorporate it into our projects to provide features such as column search.

BigQuery makes it easy to share and access data, but we needed some degree of control over this to prevent data exfiltration. Among other options, we chose two features:

  • Domain restricted sharing: A beta feature that prevents users from sharing BigQuery datasets with users outside of Twitter.
  • VPC service controls: A control that prevents data exfiltration and requires users to access BigQuery from known IP address ranges.

We have implemented the Authentication, Authorization, and Auditing (AAA) requirements for security as follows:

  • Authentication: We used GCP user accounts for ad hoc queries and service accounts for production queries.
  • Authorization: We required each dataset to have an owner service account and a reader group (see the sketch after this list).
  • Auditing: We exported BigQuery Stackdriver logs, which contain detailed query execution information, to a BigQuery dataset for easy analysis.
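To make the authorization bullet concrete, here is a minimal sketch of granting an owner service account and a reader group on a dataset using the google-cloud-bigquery Python client; the project, dataset, and account names are hypothetical, not our actual setup.

```python
# Minimal sketch: granting an owner service account and a reader group on a
# BigQuery dataset with the google-cloud-bigquery Python client.
# Project, dataset, and account names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="example-gcp-project")
dataset = client.get_dataset("example-gcp-project.revenue_analytics")

entries = list(dataset.access_entries)
entries.append(bigquery.AccessEntry(
    role="OWNER",
    entity_type="userByEmail",  # service accounts are referenced by email
    entity_id="revenue-owner@example-gcp-project.iam.gserviceaccount.com",
))
entries.append(bigquery.AccessEntry(
    role="READER",
    entity_type="groupByEmail",
    entity_id="revenue-readers@example.com",
))

dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # persist only the ACL change
```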

To ensure that Twitter users' personal data is handled properly, we must register all BigQuery datasets, annotate personal data, maintain proper retention, and scrub (erase) data that users have deleted.

We evaluated the Google Cloud Data Loss Prevention API, which uses machine learning to classify and redact sensitive data, but chose manual dataset annotation because of its higher accuracy. We plan to use the Data Loss Prevention API to complement the manual annotation.
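For context, this is roughly what a classification call to the Data Loss Prevention API looks like with the google-cloud-dlp Python client; the project ID, info types, and sample text are placeholders, and, as noted above, our actual annotation today is manual.

```python
# Minimal sketch: classifying a text value with the Cloud Data Loss Prevention
# API via the google-cloud-dlp Python client. Project ID, info types, and the
# sample text are placeholders; this is not our production annotation pipeline.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/example-gcp-project",
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
        },
        "item": {"value": "Contact me at jane.doe@example.com or 555-0100."},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood.name)
```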

At Twitter, we have created four privacy categories for datasets in BigQuery, listed here in descending order of sensitivity:

  • Highly sensitive datasets are available on an as-needed basis, following the principle of least privilege. Each dataset has a separate reader group, and we track usage by individual accounts.
  • Medium sensitivity datasets (one-way aliased via salted hashing; see the sketch after this list) contain no Personally Identifiable Information (PII) and are available to a larger group of employees. This is a good balance between privacy and utility: employees can perform analysis tasks, such as counting how many users used a feature, without knowing who those users are.
  • Low sensitivity datasets have all user-identifying information removed. This is a good approach from a privacy standpoint, but such datasets cannot be used for user-level analysis.
  • Public datasets (released outside of Twitter) are available to all Twitter employees.
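To illustrate the medium-sensitivity category, here is a minimal sketch of one-way aliasing with salted (keyed) hashing; the salt value and identifier format are illustrative and do not reflect our production scheme.

```python
# Minimal sketch of one-way aliasing via salted (keyed) hashing, the technique
# behind the medium-sensitivity datasets. The salt here is illustrative; in
# practice it must be stored securely and rotated according to policy.
import hashlib
import hmac

SALT = b"example-secret-salt"  # hypothetical; never hard-code a real salt

def alias(user_id: str) -> str:
    """Return a stable, non-reversible alias for a user identifier."""
    return hmac.new(SALT, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

# The same user always maps to the same alias, so counts and joins still work,
# but the original identifier cannot be recovered from the alias alone.
print(alias("12345"))
```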

As for registration, we use scheduled tasks to enumerate BigQuery datasets and register them in the Data Access Layer (DAL), Twitter's metadata repository. Users annotate datasets with privacy information and specify a retention period. As for scrubbing, we are evaluating the performance and cost of two options: scrubbing datasets in GCS with tools like Scalding and reloading them into BigQuery, or using BigQuery DML statements (see the sketch below). We will likely use a combination of both methods to meet the requirements of different teams and datasets.
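Here is a minimal sketch of the DML-based scrubbing option, a BigQuery DELETE issued through the google-cloud-bigquery Python client; the table, column, and ID values are hypothetical.

```python
# Minimal sketch of scrubbing with a BigQuery DML statement: deleting rows for
# users who erased their data. Table, column, and ID values are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="example-gcp-project")

query = """
    DELETE FROM `example-gcp-project.analytics.engagement_events`
    WHERE user_id IN UNNEST(@erased_user_ids)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("erased_user_ids", "INT64", [12345, 67890]),
    ]
)

job = client.query(query, job_config=job_config)
job.result()  # wait for the DML statement to finish
print(f"Scrubbed {job.num_dml_affected_rows} rows")
```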

Operability

Because BigQuery is a managed service, there was no need to involve Twitter's SRE team in systems management or housekeeping duties. It was easy to provision additional capacity for both storage and compute. We could change slot reservations by filing tickets with Google support. We identified areas that could be improved, such as self-service slot allocation and better monitoring dashboards, and forwarded those requests to Google.

Cost

Our preliminary analysis showed that query costs for BigQuery and Presto were at the same level. We purchased fixed-price slots to have a stable monthly cost instead of paying on demand per TB of data processed. This decision was also based on feedback from users who did not want to think about costs before running each query.
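As a back-of-the-envelope illustration of why flat-rate slots can make sense at high query volume, here is the break-even math; the numbers are illustrative public list prices from around that time, not our contract terms.

```python
# Rough break-even between on-demand and flat-rate (slot) pricing.
# Illustrative list prices only; check current BigQuery pricing.
ON_DEMAND_PER_TB = 5.00           # USD per TB scanned (illustrative)
FLAT_RATE_PER_MONTH = 10_000.00   # USD per month for a 500-slot reservation (illustrative)

break_even_tb = FLAT_RATE_PER_MONTH / ON_DEMAND_PER_TB
print(f"Flat rate pays off above ~{break_even_tb:,.0f} TB scanned per month")
# -> Flat rate pays off above ~2,000 TB scanned per month
```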

Storing data in BigQuery added costs on top of the GCS costs. Tools like Scalding require datasets in GCS, and to access the same data in BigQuery we had to load it into BigQuery's Capacitor format. We are working on a Scalding connector to BigQuery datasets that will remove the need to store datasets in both GCS and BigQuery.
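A minimal sketch of the double-storage pattern described above: a dataset that already lives in GCS for Scalding is also loaded into BigQuery's native storage with the google-cloud-bigquery Python client. Bucket, dataset, and table names are hypothetical.

```python
# Minimal sketch: loading a GCS dataset into BigQuery's native (Capacitor)
# storage, so the data exists in both places. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="example-gcp-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,  # Avro is another common choice
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://example-analytics-bucket/events/2019-06-01/*.parquet",
    "example-gcp-project.analytics.events_20190601",
    job_config=job_config,
)
load_job.result()  # after this, the same bytes are stored in both GCS and BigQuery
```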

For rare use cases that require infrequent queries over tens of petabytes, we decided that storing datasets in BigQuery was not cost effective and instead use Presto to access the datasets in GCS directly. For these cases, we are also looking at BigQuery external data sources.
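For comparison, here is a minimal sketch of the external data source approach, defining a BigQuery external table over Parquet files in GCS so they can be queried in place without a second copy; names are hypothetical, and external tables are typically slower than native storage.

```python
# Minimal sketch: a BigQuery external table over Parquet files in GCS, queried
# in place without loading a second copy into BigQuery. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="example-gcp-project")

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://example-analytics-bucket/archive/*.parquet"]

table = bigquery.Table("example-gcp-project.analytics.archive_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

rows = client.query(
    "SELECT COUNT(*) AS n FROM `example-gcp-project.analytics.archive_external`"
).result()
print(next(iter(rows)).n)
```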

Next Steps

We have seen a lot of interest in BigQuery since the alpha release. We are adding more datasets and onboarding more teams onto BigQuery. We are developing connectors for tools like Scalding to read from and write to BigQuery storage. We are also looking at tools like Looker and Apache Zeppelin for creating enterprise-quality reports and notebooks on top of BigQuery datasets.

Our collaboration with Google has been very productive, and we are excited to continue and grow this partnership. We worked with Google to set up our own Partner Issue Tracker so we can send feature requests to Google directly. Some of them, such as the BigQuery Parquet loader, have already been implemented by Google.

Here are some of our high priority feature requests for Google:

  • Tools for easy data ingestion and support for the LZO-Thrift format.
  • Hourly partitioning
  • Access control improvements such as table, row, and column level permissions.
  • BigQuery External Data Sources with Hive Metastore integration and support for LZO-Thrift format.
  • Improved data catalog integration in the BigQuery user interface
  • Self-service for allocating and monitoring slots.

Conclusion

Democratizing data analysis, visualization, and machine learning in a safe way is a top priority for the Data Platform team. We have identified Google BigQuery and Data Studio as tools that can help achieve this goal, and released BigQuery Alpha for the entire company last year.

We found that queries in BigQuery were simple and efficient. We used Google's tools for ingesting and transforming data for simple pipelines, but for complex pipelines we had to build our own Airflow infrastructure. In the area of data management, BigQuery's services for authentication, authorization, and auditing meet our needs. For metadata management and privacy we needed more flexibility and had to build our own systems. BigQuery, being a managed service, was easy to operate. Query costs were similar to those of existing tools. Storing data in BigQuery incurs costs in addition to the GCS costs.

In general, BigQuery works well for general SQL analysis. We're seeing a lot of interest in BigQuery, and we're working on migrating more datasets, engaging more teams, and building more pipelines with BigQuery. Twitter uses a variety of data that will require a combination of tools such as Scalding, Spark, Presto and Druid. We intend to continue to build on our data analytics tools and provide clear guidance to our users on how to best use our offerings.

Acknowledgments

I'd like to thank my collaborators and teammates, Anju Jha and Will Pascucci, for their great collaboration and hard work on this project. I would also like to thank the engineers and managers from several teams at Twitter and Google who helped us and the BigQuery users on Twitter who provided valuable feedback.

If you are interested in working on these tasks, please see our vacancies in the Data Platform team.

