How Google's BigQuery democratized data analysis. Part 1

Hey Habr! Enrollment is now open at OTUS for a new run of the Data Engineer course. As usual, ahead of the course launch we have prepared a translation of an interesting article for you.

Every day, over one hundred million people visit Twitter to find out what's going on in the world and discuss it. Each tweet and any other user action generates an event that is available for internal data analysis within Twitter. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.

We believe that users with a wide range of technical skills should be able to discover data and have access to reliable SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data and better understand and harness the capabilities of Twitter. This is how we democratize data analysis at Twitter.

As our tools and capabilities for internal data analysis have improved, so has the Twitter service. Still, there is room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica run into performance issues at scale. We also struggle with data being spread across multiple systems without a simple way to access it.

Last year we announced a collaboration with Google under which we are moving parts of our data infrastructure to Google Cloud Platform (GCP). We concluded that Google Cloud's Big Data tools can help us in our initiatives to democratize analysis, visualization and machine learning at Twitter:

  • BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, known for its speed and simplicity, with support for machine learning.
  • Data Studio: a big data visualization tool with Google Docs-style collaboration features.

In this article, you will learn about our experience with these tools: what we have done, what we have learned and what we will do next. For now, we focus on batch and interactive analytics; real-time analytics will be covered in the next article.

The History of Data Warehouses on Twitter

Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, data analysis at Twitter was done in Vertica and Hadoop, and we used Pig to create Hadoop MapReduce jobs. In 2012, we replaced Pig with Scalding, which offered a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for many data analysts and product managers who were more comfortable working with SQL, the learning curve was quite steep. Around 2016, we started using Presto as a SQL front end to Hadoop data. Spark offered a Python interface, which made it a good choice for ad hoc data science and machine learning.

Since 2018, we have used the following tools for data analysis and visualization:

  • Scalding for production pipelines
  • Scalding and Spark for ad hoc data analytics and machine learning
  • Vertica and Presto for ad hoc and interactive SQL analysis
  • Druid for interactive, exploratory and low-latency access to time series metrics
  • Tableau, Zeppelin and Pivot for data visualization

We found that while these tools offer very powerful capabilities, we had difficulty making them available to a wider audience at Twitter. By extending our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.

Google's BigQuery Data Warehouse

Several teams at Twitter have already included BigQuery in some of their production pipelines. Drawing on their experience, we began to evaluate BigQuery's capabilities for all of Twitter's use cases. Our goal was to offer BigQuery to the entire company and to standardize and support it within the Data Platform toolkit. This was difficult for many reasons. We needed to develop an infrastructure to reliably ingest large amounts of data, support company-wide data management, provide proper access controls, and protect user privacy. We also had to create systems for resource allocation, monitoring, and chargebacks so that teams could use BigQuery effectively.

In November 2018, we made an alpha release of BigQuery and Data Studio available to the entire company. We offered Twitter employees some of our most frequently used tables with personal data scrubbed. BigQuery has been used by over 250 users from various teams, including engineering, finance and marketing. Recently they were running about 8,000 queries per month, processing about 100 PB, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.

Here is a diagram of the high-level architecture of our Google BigQuery data warehouse.

We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using the internal Cloud Replicator tool. We then use Apache Airflow to create pipelines that use "bq_load" to load data from GCS into BigQuery. We use Presto to query Parquet or Thrift-LZO datasets in GCS. BQ Blaster is an internal Scalding tool for loading HDFS Vertica and Thrift-LZO datasets into BigQuery.
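
The internal Cloud Replicator and "bq_load" tooling is not publicly available, but the Airflow side of such a pipeline can be sketched with the stock GCSToBigQueryOperator from the Google provider package. The sketch below is a rough approximation under that assumption; the project, bucket, dataset and table names are hypothetical.

```python
# Minimal sketch of an Airflow pipeline that loads data already replicated
# to GCS (e.g. re-encoded as Parquet) into BigQuery. This approximates the
# "bq_load" step described above using the public GCSToBigQueryOperator;
# project, bucket, dataset and table names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bigquery_daily_load",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_tweet_events = GCSToBigQueryOperator(
        task_id="bq_load_tweet_events",
        bucket="example-replicated-data",                  # bucket fed by the replicator
        source_objects=["tweet_events/{{ ds }}/*.parquet"],  # one folder per day
        destination_project_dataset_table=(
            "example-project.analytics.tweet_events_{{ ds_nodash }}"  # daily table shard
        ),
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",                # idempotent daily reload
    )
```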

In the following sections, we will discuss our approach and expertise in ease of use, performance, data management, system health, and cost.

Ease of use

We found that it was easy for users to get started with BigQuery, as it does not require software installation and can be accessed through an intuitive web interface. However, users needed to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed tutorials and guides to help users get started. Once they gained a basic understanding, users found it easy to navigate datasets, view schemas and table data, run simple queries, and visualize results in Data Studio.
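
For users who prefer a client library over the web interface, the same basic workflow (browsing datasets, inspecting a schema, running a simple query) looks roughly like this with the public google-cloud-bigquery Python client; the project and table names below are hypothetical.

```python
# Rough sketch of the basic BigQuery workflow: browse datasets, inspect a
# table schema, run a simple query. "example-project" and
# "analytics.tweet_events" are hypothetical names.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

# Browse the datasets available in the project.
for dataset in client.list_datasets():
    print(dataset.dataset_id)

# Inspect the schema of one table.
table = client.get_table("example-project.analytics.tweet_events")
for field in table.schema:
    print(field.name, field.field_type)

# Run a simple aggregate query and iterate over the results.
query = """
    SELECT country_code, COUNT(*) AS events
    FROM `example-project.analytics.tweet_events`
    WHERE event_date = '2019-04-01'
    GROUP BY country_code
    ORDER BY events DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.country_code, row.events)
```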

Our goal for data ingestion into BigQuery was to enable seamless loading of HDFS or GCS datasets with a single click. We considered Cloud Composer (managed Airflow) but were unable to use it due to our "Domain Restricted Sharing" security model (more on this in the Data Management section below). We experimented with Google Data Transfer Service (DTS) for orchestrating BigQuery load jobs. While DTS was quick to set up, it was not flexible enough for building pipelines with dependencies. For our alpha release, we built our own Apache Airflow environment in GCE and are preparing it for production, with the ability to support more data sources such as Vertica.

To transform data in BigQuery, users create simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer together with Cloud Dataflow.
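
As a rough illustration (not a description of Twitter's internal setup), a simple SQL transformation can be registered as a scheduled query through the public BigQuery Data Transfer Service client; the project, dataset and table names below, as well as the query itself, are placeholders.

```python
# Sketch of registering a simple SQL transformation as a BigQuery scheduled
# query via the Data Transfer Service client. All names and the query itself
# are hypothetical placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# @run_date is supplied automatically for each scheduled run.
query = """
    SELECT event_date, country_code, COUNT(*) AS events
    FROM `example-project.analytics.tweet_events`
    WHERE event_date = @run_date
    GROUP BY event_date, country_code
"""

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics_rollups",
    display_name="daily_country_rollup",
    data_source_id="scheduled_query",
    params={
        "query": query,
        "destination_table_name_template": "country_rollup_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("example-project"),
    transfer_config=transfer_config,
)
print("Created scheduled query:", transfer_config.name)
```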

Performance

BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not intended for the low-latency, high-throughput queries required by a transactional database, or for the low-latency time series analysis implemented by Apache Druid. For interactive analytic queries, our users expect a response time of under one minute, and we had to design our use of BigQuery to meet that expectation. To provide predictable performance for our users, we relied on BigQuery's flat-rate pricing, which allows project owners to reserve a minimum number of slots for their queries. A BigQuery slot is a unit of computational capacity required to execute SQL queries.

We analyzed over 800 queries, each processing about 1 TB of data, and found that the average execution time was 30 seconds. We also learned that performance depends heavily on how our slots are used across different projects and tasks. We had to clearly separate our production and ad hoc slot reservations in order to maintain performance for production use cases and interactive analysis. This greatly influenced our design of slot reservations and the project hierarchy.
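
One way to approximate this kind of analysis with public tooling (not necessarily what Twitter used) is to query BigQuery's INFORMATION_SCHEMA jobs view for runtimes and slot consumption; the region qualifier and project name below are placeholders.

```python
# Sketch of analyzing query runtimes and slot consumption with BigQuery's
# INFORMATION_SCHEMA jobs view. The region qualifier and project are
# hypothetical; this illustrates the idea, not Twitter's internal tooling.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

stats_query = """
    SELECT
      user_email,
      TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_seconds,
      total_bytes_processed,
      total_slot_ms
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE job_type = 'QUERY'
      AND state = 'DONE'
      AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    ORDER BY total_slot_ms DESC
    LIMIT 100
"""

for row in client.query(stats_query).result():
    print(row.user_email, row.runtime_seconds, row.total_slot_ms)
```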

We will cover data management, system health and cost in the second part of this translation in the coming days. In the meantime, we invite everyone to a free live webinar, where you can learn more about the course and ask questions of our expert, Egor Mateshuk (Senior Data Engineer, MaximaTelecom).

Source: habr.com
