File formats in big data: a brief educational program


The Mail.ru Cloud Solutions team has translated an article by Clairvoyant engineer Rahul Bhatia about file formats in big data: what the most common Hadoop formats have in common and which format is best to use.

Why different file formats are needed

A major performance bottleneck in applications built on HDFS, such as MapReduce and Spark, is the time it takes to search for, read, and write data. These problems are exacerbated by the difficulty of managing large datasets when the schema evolves rather than stays fixed, or when there are storage constraints.

Big data processing increases the load on the storage subsystem: Hadoop stores data redundantly to achieve fault tolerance. Besides disks, the processor, network, I/O system, and so on are also loaded. As the volume of data grows, so does the cost of processing and storing it.

The various file formats in Hadoop are designed to solve exactly these problems. Choosing the right file format can provide significant benefits:

  1. Faster read times.
  2. Faster write times.
  3. Splittable files.
  4. Support for schema evolution.
  5. Advanced compression support.

Some file formats are intended for general use, others for more specific uses, and some are designed for specific data characteristics. So the choice is really quite large.

Avro File Format

Avro is a widely used row-based data storage and serialization format in Hadoop. It stores the schema in JSON format, making it easy for any program to read and interpret. The data itself is stored in a compact and efficient binary format.

The Avro serialization system is language-neutral. Files can be processed in different languages, currently C, C++, C#, Java, Python, and Ruby.

A key feature of Avro is robust support for data schemas that change over time, i.e. evolve. Avro handles schema changes such as removing, adding, or modifying fields.

Avro supports a variety of data structures. For example, you can create a record that contains an array, an enumerated type, and a nested record.
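Here is a minimal sketch of such a schema, written as a Python dictionary and validated with the third-party fastavro library; the Employee record, its fields, and the symbols are hypothetical and only serve to show an array, an enum, and a nested record side by side.

```python
from fastavro import parse_schema

# Hypothetical Avro schema: a record with an array, an enum and a nested record
schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        # array field
        {"name": "skills", "type": {"type": "array", "items": "string"}},
        # enumerated type
        {"name": "status",
         "type": {"type": "enum", "name": "Status",
                  "symbols": ["ACTIVE", "ON_LEAVE", "TERMINATED"]}},
        # nested record (sub-record)
        {"name": "address",
         "type": {"type": "record", "name": "Address",
                  "fields": [{"name": "city", "type": "string"},
                             {"name": "zip", "type": "string"}]}},
    ],
}

parsed = parse_schema(schema)  # raises an error if the schema is not valid Avro
```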

This format is ideal for writing to the landing (staging) zone of a data lake (a data lake is a repository that stores data of many types in addition to the original data sources).

So, this format is best suited for writing to the landing zone of a data lake for the following reasons:

  1. Data from this zone is usually read in its entirety for further processing by downstream systems - and a row-based format is more efficient in this case.
  2. Downstream systems can easily retrieve table schemas from the files themselves; there is no need to store the schemas separately in an external metastore (shown in the sketch after this list).
  3. Any change to the original schema is easily handled (schema evolution).
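A minimal sketch of that idea, again assuming the fastavro library: the schema is written into the file header together with the binary records, so a downstream reader can recover it from the file alone (the Event record and its fields are made up for the example).

```python
import io
from fastavro import writer, reader

# Hypothetical schema and records
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "payload", "type": "string"},
    ],
}
records = [{"id": 1, "payload": "a"}, {"id": 2, "payload": "b"}]

buf = io.BytesIO()                 # stands in for a file in the landing zone
writer(buf, schema, records)       # the schema is embedded in the file header

buf.seek(0)
avro_reader = reader(buf)
print(avro_reader.writer_schema)   # schema recovered from the file itself
for rec in avro_reader:            # records decoded using that schema
    print(rec)
```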

Parquet File Format

Parquet is an open source file format for Hadoop that stores nested data structures in flat columnar format.

Compared to the traditional row-oriented approach, Parquet is more efficient in terms of storage and performance.

This is especially useful for queries that read specific columns from a wide (multi-column) table. Due to the file format, only the necessary columns are read, so I/O is kept to a minimum.

A small digression by way of explanation: to better understand the Parquet file format in Hadoop, let's look at what a column-based, i.e. columnar, format is. In such a format, the values of each column, all of the same type, are stored together.

For example, a record includes the ID, Name, and Department fields. In this case, all the values of the ID column will be stored together, as will the values of the Name column, and so on. The table will look something like this:

ID   Name   Department
1    emp1   d1
2    emp2   d2
3    emp3   d3

In a row-based format, the data will be stored as follows:

1 emp1 d1 2 emp2 d2 3 emp3 d3

In a columnar file format, the same data will be stored like this:

1 2 3 emp1 emp2 emp3 d1 d2 d3

The columnar format is more efficient when you need to query only a few columns from a table: it reads just the required columns, because they are stored next to each other, and so I/O operations are minimized.

For example, suppose you only want the Name column. In a row-based format, each record in the dataset has to be loaded, parsed into fields, and only then is the Name data extracted. The columnar format lets you jump directly to the Name column, because all the values for that column are stored together; you don't have to scan entire records.

Therefore, the columnar format improves query performance because it takes less lookup time to get to the required columns and reduces the number of I/Os by reading only the required columns.
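A small sketch of this behaviour with the pyarrow library (the employees.parquet file name and the table contents just mirror the example above):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# the example table: ID, Name, Department
table = pa.table({
    "ID": [1, 2, 3],
    "Name": ["emp1", "emp2", "emp3"],
    "Department": ["d1", "d2", "d3"],
})
pq.write_table(table, "employees.parquet")

# read back only the Name column; the other column chunks are not touched
names = pq.read_table("employees.parquet", columns=["Name"])
print(names.column("Name").to_pylist())   # ['emp1', 'emp2', 'emp3']
```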

One of Parquet's unique features is that it can store data with nested structures. This means that in a Parquet file even nested fields can be read individually, without reading all the fields of the nested structure. Parquet uses the shredding and assembly algorithm to store nested structures.
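As a rough sketch, again with pyarrow: a struct column is written to Parquet, and recent pyarrow versions allow an individual nested field to be requested by a dotted path (the column names and data are made up; older versions may require reading the whole struct column instead).

```python
import pyarrow as pa
import pyarrow.parquet as pq

# a table with a nested struct column (hypothetical data)
table = pa.table({
    "ID": [1, 2],
    "address": [
        {"city": "Berlin", "zip": "10115"},
        {"city": "Paris", "zip": "75001"},
    ],
})
pq.write_table(table, "nested.parquet")

# newer pyarrow versions accept a dotted path to read one nested field;
# if unsupported, fall back to reading the whole "address" column
try:
    cities = pq.read_table("nested.parquet", columns=["address.city"])
except Exception:
    cities = pq.read_table("nested.parquet", columns=["address"])
print(cities.to_pydict())
```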

To understand the Parquet file format in Hadoop, you need to know the following terms:

  1. Row group: a logical horizontal partitioning of the data into rows. A row group consists of a chunk of each column in the dataset.
  2. Column chunk: a chunk of data for a particular column. Column chunks live in a specific row group and are guaranteed to be contiguous in the file.
  3. Page: column chunks are divided into pages written one after another. The pages share a common header, so when reading you can skip the ones you don't need.

The file header contains just the magic number PAR1 (4 bytes), which identifies the file as a Parquet file.

The following is written in the footer:

  1. File metadata, which contains the starting coordinates of each column chunk's metadata. When reading, you must first read the file metadata to find the column chunks of interest; the column chunks can then be read sequentially. The metadata also includes the format version, the schema, and any additional key-value pairs (see the inspection sketch after this list).
  2. The length of the metadata (4 bytes).
  3. The magic number PAR1 (4 bytes).
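These pieces can be examined through pyarrow's metadata API; a small sketch, reusing the hypothetical employees.parquet file from the earlier example:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("employees.parquet")   # file from the earlier sketch
meta = pf.metadata                         # parsed from the footer

print(meta.num_row_groups)   # how many row groups the file holds
print(pf.schema_arrow)       # schema reconstructed from the footer

rg = meta.row_group(0)       # metadata of the first row group
col = rg.column(0)           # first column chunk in that row group
print(col.path_in_schema,          # which column it belongs to
      col.total_compressed_size,   # size of the chunk on disk
      col.statistics)              # min/max statistics, if written
```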

ORC File Format

The Optimized Row Columnar (ORC) file format offers a very efficient way of storing data and was designed to overcome the limitations of other formats. It stores data in an extremely compact form that allows unneeded parts to be skipped, without requiring large, complex, or manually maintained indexes.

Advantages of the ORC format:

  1. A single output file per task, which reduces the load on the NameNode.
  2. Support for Hive data types, including DateTime, decimal and complex data types (struct, list, map and union).
  3. Simultaneous reading of the same file by different RecordReader processes.
  4. Ability to split files without scanning for markers.
  5. Estimation of the maximum possible allocation of heap memory for read/write processes based on the information in the file footer.
  6. The metadata is stored in a binary Protocol Buffers serialization format, which allows fields to be added and removed.

ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format.
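Recent pyarrow versions also ship an ORC reader and writer, so the same kind of column-selective read can be sketched for ORC (the file name and data are the hypothetical employee example again):

```python
import pyarrow as pa
from pyarrow import orc   # available in recent pyarrow releases

table = pa.table({
    "ID": [1, 2, 3],
    "Name": ["emp1", "emp2", "emp3"],
    "Department": ["d1", "d2", "d3"],
})
orc.write_table(table, "employees.orc")

# as with Parquet, only the requested columns are read from the stripes
names = orc.read_table("employees.orc", columns=["Name"])
print(names.to_pydict())   # {'Name': ['emp1', 'emp2', 'emp3']}
```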

An ORC file stores groups of rows, called stripes, along with auxiliary information in the file footer. A postscript at the end of the file holds the compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Due to such large stripes, reading from HDFS is performed more efficiently: in large contiguous blocks.

The file footer contains the list of stripes in the file, the number of rows per stripe, and the data type of each column. Aggregate values for each column are also written there: count, min, max, and sum.

The stripe footer contains a directory of stream locations.

Row data is used when scanning tables.

The index data includes the minimum and maximum values for each column and the row positions within each column. ORC indexes are used only to select stripes and row groups, not to answer queries.
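A small sketch of how this structure can be inspected with pyarrow's ORC API, assuming the employees.orc file from the previous example:

```python
from pyarrow import orc

f = orc.ORCFile("employees.orc")   # file from the previous sketch

print(f.nstripes)   # number of stripes in the file
print(f.nrows)      # total row count, taken from the file footer
print(f.schema)     # column names and types stored in the footer

first = f.read_stripe(0)   # read a single stripe into memory
print(first.num_rows)
```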

Comparison of different file formats

Avro versus Parquet

  1. Avro is a row-based storage format, while Parquet stores data by columns.
  2. Parquet is better suited for analytical queries: reading and querying data is much more efficient than writing.
  3. Write operations in Avro are more efficient than in Parquet.
  4. Avro is more mature when it comes to schema evolution. Parquet only supports schema appends, while Avro supports full-featured evolution, i.e. adding or changing columns.
  5. Parquet is ideal for querying a subset of columns in a multi-column table. Avro is suitable for ETL operations where we query all columns.

ORC versus Parquet

  1. Parquet is better at storing nested data.
  2. ORC is better suited for predicate pushdown.
  3. ORC supports ACID properties.
  4. ORC compresses data better.


Source: habr.com
