How a video codec works. Part 2. What, why, how

First part: Fundamentals of working with video and images


What? A video codec is a piece of software/hardware that compresses and/or decompresses digital video.

For what? Despite certain limitations, both in terms of throughput and in terms of storage space, the market demands ever higher-quality video. Remember how in the last post we calculated the required minimum for 30 frames per second, 24 bits per pixel, at a resolution of 480x240? We got 82.944 Mbps without compression. Compression is currently the only way to stream HD/FullHD/4K to TV screens and over the Internet at all. How is this achieved? Let's take a quick look at the main methods.
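The arithmetic behind that figure is easy to check; here is a quick sketch in Python:

```python
# 480x240, 24 bits per pixel, 30 frames per second, uncompressed.
width, height, bits_per_pixel, fps = 480, 240, 24, 30

bits_per_second = width * height * bits_per_pixel * fps
mbps = bits_per_second / 1_000_000  # megabits per second

assert bits_per_second == 82_944_000
assert abs(mbps - 82.944) < 1e-9
```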


The translation was made with the support of EDISON Software.

We are engaged in integrating video surveillance systems and are developing a microtomograph.

Codec vs Container

A common newbie mistake is to confuse a digital video codec with a digital video container. A container is a wrapper format that holds the video (and possibly audio) along with metadata. The compressed video can be thought of as the container's payload.

Typically, a video file's extension indicates the kind of container. For example, the file video.mp4 is most likely an MPEG-4 Part 14 container, and a file named video.mkv is most likely a Matroska container. To be completely sure of the codec and container format, you can use ffmpeg or MediaInfo.

A bit of history

Before we move on to How?, let's dive into history a bit to understand some older codecs a little better.

The H.261 video codec appeared in 1990 (technically, in 1988) and was designed to work at data rates of 64 Kbps. It already used ideas such as chroma subsampling, macroblocks, and so on. In 1995, the H.263 video codec standard was published; it continued to be developed until 2001.

In 2003, the first version of H.264/AVC was completed. In the same year, TrueMotion released their free lossy video codec, called VP3. In 2008, Google bought the company and released VP8 that same year. In December 2012, Google released VP9, which is supported in about ¾ of the browser market (including mobile devices).

AV1 is a new free and open-source video codec developed by the Alliance for Open Media (AOMedia), which includes well-known companies such as Google, Mozilla, Microsoft, Amazon, Netflix, AMD, ARM, NVidia, Intel and Cisco. The first version of the codec, 0.1.0, was published on April 7, 2016.

Birth of AV1

In early 2015, Google was working on VP10, Xiph (backed by Mozilla) was working on Daala, and Cisco made its own free video codec called Thor.

Then MPEG LA first announced annual caps for HEVC (H.265) and a fee 8 times higher than for H.264, but soon changed the rules again:

  • no annual cap,
  • a content fee (0.5% of revenue), and
  • per-unit fees about 10 times higher than for H.264.

Alliance for Open Media was created by companies from different fields: hardware manufacturers (Intel, AMD, ARM, Nvidia, Cisco), content providers (Google, Netflix, Amazon), browser makers (Google, Mozilla) and others.

The companies had a common goal: a royalty-free video codec. Thus AV1 appeared, with a much simpler patent license. Timothy B. Terriberry gave a stunning presentation that became the source of the current AV1 concept and its license model.

You may be surprised to learn that it is possible to analyze AV1 bitstreams right in the browser (those interested can visit aomanalyzer.org).


Universal codec

Let's analyze the main mechanisms underlying a generic video codec. Most of these concepts are useful and are used in modern codecs such as VP9, AV1 and HEVC. Be warned: many of the things explained here will be simplified. Sometimes real examples (as in the case of H.264) will be used to demonstrate the techniques.

1st step - splitting the image

The first step is to divide the frame into several sections, subsections, and so on.

[image: a frame divided into sections and subsections]

For what? There are many reasons. When we subdivide the picture, we can predict motion vectors more accurately by using small sections for small moving parts, while for a static background we can limit ourselves to larger sections.

Typically, codecs organize these partitions into slices (or tiles), macroblocks (or coding tree units), and multiple sub-partitions. The maximum size of these partitions varies: HEVC uses 64x64 while AVC uses 16x16, and sub-partitions can be split down to 4x4.

Do you remember the frame types from the last article? The same can be applied to blocks, so we can have an I-slice, a B-block, a P-macroblock, and so on.

For those who want to practice, see how the image is divided into sections and subsections. To do this, you can use Intel Video Pro Analyzer, already mentioned in the last article (the program is paid, but there is a free trial limited to the first 10 frames). Here are VP9 sections under analysis:

[screenshot: VP9 sections in Intel Video Pro Analyzer]

2nd step - prediction

Once we have the sections, we can make predictions based on them. For INTER prediction, the motion vectors and the residual must be transferred; for INTRA prediction, the prediction direction and the residual are transferred.

3rd step - transformation

After we get the residual block (the difference between the predicted section and the real one), we can transform it in such a way that we know which pixels can be discarded while maintaining overall quality. There are several transforms that provide exactly this behavior.

Although other methods exist, let's consider the discrete cosine transform (DCT) in more detail. Its main properties:

  • Converts blocks of pixels into equally sized blocks of frequency coefficients.
  • Compacts energy, which helps eliminate spatial redundancy.
  • Is reversible.

On February 2, 2017, Cintra, R.J. and Bayer, F.M. published a paper on a DCT-like transform for image compression that requires only 14 additions.

Don't worry if you don't understand the benefits of each item. We will now see their real value with specific examples.

Let's take this 8x8 block of pixels:

[image: an 8x8 block of pixel values]

This block is rendered into the following 8 by 8 pixel image:

[image: the block rendered as an 8x8 pixel image]

Let's apply DCT to this block of pixels and get a block of coefficients with a size of 8x8:

[image: the 8x8 block of DCT coefficients]

And if we render this block of coefficients, we get the following image:

[image: the coefficient block rendered as an image]

As you can see, it doesn't look like the original image at all. It is also noticeable that the first coefficient is very different from all the others. This first coefficient is known as the DC coefficient; it represents all the samples in the input array, something like an average.
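These properties are easy to verify numerically. Below is a sketch that builds the orthonormal 8x8 DCT-II matrix by hand (the same transform SciPy's dctn with norm='ortho' computes); the pixel values are invented for the demo, since the article's block comes from an image not reproduced here:

```python
import numpy as np

# Orthonormal DCT-II basis matrix for an 8-point transform.
N = 8
k = np.arange(N)
basis = np.sqrt(2 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
basis[0, :] = np.sqrt(1 / N)  # the DC row gets the smaller scale factor

def dct2(block):    # pixels -> frequency coefficients
    return basis @ block @ basis.T

def idct2(coeffs):  # frequency coefficients -> pixels
    return basis.T @ coeffs @ basis

# An arbitrary 8x8 block of 8-bit samples (values invented for the demo).
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(N, N)).astype(float)

coeffs = dct2(block)
assert coeffs.shape == (8, 8)                      # equally sized block
assert np.isclose(coeffs[0, 0], 8 * block.mean())  # DC = scaled average
assert np.allclose(idct2(coeffs), block)           # reversibility
```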

This block of coefficients has an interesting property: it separates the high-frequency components from the low-frequency ones.

[image: the coefficient block with low frequencies in the upper left and high frequencies in the lower right]

In an image, most of the power is concentrated at lower frequencies, so if you convert the image to its frequency components and discard the higher frequency coefficients, you can reduce the amount of data needed to describe the image without sacrificing too much image quality.

Frequency refers to how fast the signal changes.

Let's apply this knowledge to a test case: convert the original image to its frequency representation (a block of coefficients) using the DCT, and then discard some of the least important coefficients.

First, we convert it to the frequency domain.

[image: the block converted to the frequency domain]

Next, we discard a portion (67%) of the coefficients, mainly the lower-right part.

[image: the coefficient block with 67% of the coefficients zeroed out]

Finally, we restore the image from this truncated block of coefficients (remember, the transform must be reversible) and compare it with the original.

[image: the reconstructed image next to the original]

We see that it resembles the original image, but with many differences. We threw out 67.1875% of the coefficients and still got something resembling the source. We could have discarded coefficients more selectively to get an even better image, but that is the next topic.
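The experiment is easy to reproduce. A smooth synthetic gradient stands in for the article's block here (an assumption; the real pixel values are not reproduced in the text), using a hand-built orthonormal DCT-II matrix:

```python
import numpy as np

# Orthonormal DCT-II basis matrix for an 8-point transform.
N = 8
k = np.arange(N)
basis = np.sqrt(2 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
basis[0, :] = np.sqrt(1 / N)

# A smooth 8x8 gradient block stands in for the article's image block.
block = 16.0 * np.add.outer(np.arange(N, dtype=float), np.arange(N, dtype=float))

coeffs = basis @ block @ basis.T

# Zero the 43 highest-frequency coefficients (i + j > 5, the lower-right
# corner): 43/64 = 67.1875% of the coefficients discarded, as in the article.
i, j = np.indices((N, N))
kept = np.where(i + j <= 5, coeffs, 0.0)
assert int((i + j > 5).sum()) == 43

# Reconstruct from the truncated coefficients and measure the damage.
restored = basis.T @ kept @ basis
mse = float(np.mean((restored - block) ** 2))
assert mse < 1.0  # a smooth block survives almost untouched
```

Note that a smooth block concentrates almost all its energy in the low frequencies, which is exactly why the truncation costs so little here.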

Each coefficient is formed using all pixels

Important: each coefficient does not map directly to a single pixel; it is a weighted sum of all the pixels. This amazing graph shows how the first and second coefficients are calculated, using weights unique to each index.

[image: how the first and second coefficients are formed as weighted sums of all pixels]

You can also try to visualize the DCT by looking at simple image formation based on it. For example, here is the letter A formed using each coefficient's weight:

[image: the letter A formed from individual DCT basis weights]

4th step - quantization

When we threw out some of the coefficients in the previous step (the transform), we already performed a form of quantization. In this step, we deliberately lose information. Put simply, we quantize the coefficients to achieve compression.

How can we quantize a block of coefficients? One of the simplest methods is uniform quantization: we take the block, divide it by a single value (say, 10), and round the result.

[image: the coefficient block divided by 10 and rounded]

Can we reverse this block of coefficients? Yes, by multiplying by the same value we divided by.

[image: the block multiplied back by 10]

This approach is not the best, because it ignores the importance of each coefficient. Instead of a single value we could use a matrix of quantizers, and this matrix can exploit the DCT property by quantizing the lower-right coefficients heavily and the upper-left ones lightly.
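A sketch of uniform quantization with a step of 10, applied to an invented coefficient block:

```python
import numpy as np

# A stand-in block of DCT coefficients (values invented for the demo).
coeffs = np.linspace(-200.0, 200.0, 64).reshape(8, 8)

step = 10  # uniform quantization step, as in the article's example

quantized = np.round(coeffs / step)  # lossy: the fractional part is gone
dequantized = quantized * step       # "reverse" by multiplying back

# Reconstruction is close but not exact: the error never exceeds step/2.
assert np.all(np.abs(dequantized - coeffs) <= step / 2 + 1e-9)
assert not np.allclose(dequantized, coeffs)  # information was lost
```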

5th step - entropy coding

After we have quantized the data (image blocks, slices, frames), we can still compress it losslessly. There are many algorithmic ways to compress data. We'll take a brief look at some of them; for a deeper understanding, you can read the book Understanding Compression: Data Compression for Modern Developers.

VLC (variable-length coding)

Let's say we have a stream of characters: a, e, r and t. The probability (from 0 to 1) of each character occurring in the stream is given in this table.

            a    e    r    t
Probability 0.3  0.3  0.2  0.2

We can assign unique binary codes (preferably short ones) to the most likely characters, and longer codes to the less likely ones.

            a    e    r    t
Probability 0.3  0.3  0.2  0.2
Binary code 0    10   110  1110

Let's compress the stream eat. Assuming 8 bits per character, the stream would take 24 bits without compression. If each character is replaced by its code, we save space.

First we encode the character e, which is 10; then we append (not mathematically add) the second character a, giving [10][0]; finally we append the third character t, which makes our final compressed bitstream [10][0][1110], or 1001110. This requires only 7 bits (3.4 times less space than the original).

Note that each code must be prefix-free: no code may be a prefix of another. The Huffman algorithm helps find such codes. Although this method has its flaws, there are video codecs that still offer it as a compression option.

Both the encoder and the decoder must have access to the symbol table with its binary codes, so the table must also be sent along with the compressed data.
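The whole scheme fits in a few lines of Python, using the code table from the article:

```python
# The prefix-free code table from the article: likelier symbols, shorter codes.
code = {'a': '0', 'e': '10', 'r': '110', 't': '1110'}

stream = 'eat'
bits = ''.join(code[ch] for ch in stream)
assert bits == '1001110' and len(bits) == 7  # vs. 24 bits at 8 bits/char

# Decoding works because no codeword is a prefix of another:
# accumulate bits until they match exactly one codeword.
decode = {v: k for k, v in code.items()}
out, buf = [], ''
for b in bits:
    buf += b
    if buf in decode:
        out.append(decode[buf])
        buf = ''
assert ''.join(out) == 'eat'
```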

Arithmetic coding

Let's say we have a stream of characters: a, e, r, s and t, with probabilities given in this table.

            a    e    r     s     t
Probability 0.3  0.3  0.15  0.05  0.2

Using this table, we build ranges covering all possible characters, sorted by descending probability.

[image: the probability ranges laid out on the interval from 0 to 1]

Now let's encode a stream of three characters: eat.

First we pick the first character, e, which lies in the sub-range from 0.3 to 0.6 (not inclusive). We take this sub-range and divide it again in the same proportions as before, but within this new range.

[image: the range 0.3-0.6 subdivided in the same proportions]

Let's continue encoding our stream eat. Now we take the second character, a, which lies in the new sub-range from 0.3 to 0.39, and then our last character, t; repeating the same process again, we get the final sub-range from 0.354 to 0.372.

[image: the final sub-range 0.354-0.372]

We just need to pick a number in the final sub-range from 0.354 to 0.372. Let's choose 0.36 (although any other number in this sub-range would do). With this number alone we can recover our original stream. It's as if we were drawing a line within the ranges to encode our stream.

[image: the encoding drawn as a line through the nested ranges]

The reverse operation (decoding) is just as simple: with our number 0.36 and the original range, we run the same process, but now we use the number to reveal the stream encoded behind it.

Looking at the first range, we notice that our number falls within the sub-range for e; that is our first character. Now we split this sub-range again, following the same process as before. Here 0.36 corresponds to the symbol a, and after repeating the process once more we arrive at the last character, t (recovering our original encoded stream, eat).

Both the encoder and the decoder must have the symbol probability table, so it must be transmitted along with the data as well.

Pretty elegant, right? Whoever came up with this solution was pretty damn smart. Some video codecs use this technique (or at least offer it as an option).
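Here is a minimal floating-point sketch of the process for the article's table and the stream eat (real codecs use integer range coders, not floats):

```python
# Cumulative ranges from the article's table, sorted by descending probability.
probs = [('a', 0.3), ('e', 0.3), ('t', 0.2), ('r', 0.15), ('s', 0.05)]

def ranges(low, high):
    """Split [low, high) proportionally to the symbol probabilities."""
    out, start = {}, low
    for sym, p in probs:
        end = start + (high - low) * p
        out[sym] = (start, end)
        start = end
    return out

def encode(stream):
    low, high = 0.0, 1.0
    for ch in stream:            # narrow the interval symbol by symbol
        low, high = ranges(low, high)[ch]
    return low, high

low, high = encode('eat')
assert abs(low - 0.354) < 1e-9 and abs(high - 0.372) < 1e-9  # as in the article

def decode(number, n):
    out, low, high = [], 0.0, 1.0
    for _ in range(n):           # find which sub-range the number falls into
        for sym, (lo, hi) in ranges(low, high).items():
            if lo <= number < hi:
                out.append(sym)
                low, high = lo, hi
                break
    return ''.join(out)

assert decode(0.36, 3) == 'eat'  # 0.36 alone recovers the stream
```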

The idea is to losslessly compress the quantized bitstream. This article surely omits tons of details, reasons and trade-offs; if you are a developer, it is worth digging deeper. Newer codecs use different entropy coding algorithms, such as ANS.

6th step - bitstream format

After all these steps, it remains to pack the compressed frames together with the context of the decisions that were made. The decoder must be explicitly informed of all the choices made by the encoder: bit depth, color space, resolution, prediction information (motion vectors, INTRA prediction direction), profile, level, frame rate, frame type, frame number, and more.

We'll take a quick look at the H.264 bitstream. Our first step is to create a minimal H.264 bitstream (by default FFmpeg adds extra encoding metadata, such as an SEI NAL; a little further on we'll find out what that is). We can do this using our repository and FFmpeg.

./s/ffmpeg -i /files/i/minimal.png -pix_fmt yuv420p /files/v/minimal_yuv420.h264

This command generates a raw H.264 bitstream with one frame at 64x64 resolution in the YUV420 color space, using the following image as the frame.

[image: the single 64x64 frame used for encoding]

H.264 bit stream

The AVC (H.264) standard specifies that the information is sent in macro frames (in the networking sense) called NAL units (NAL stands for Network Abstraction Layer). The main purpose of the NAL is to provide a "network-friendly" representation of the video: the standard must work both on TVs (stream-based) and on the Internet (packet-based).


A synchronization marker defines the boundaries of NAL units. Each marker has the value 0x00 0x00 0x01, except the first, which is 0x00 0x00 0x00 0x01. If we run hexdump on the generated H.264 bitstream, we can identify at least three NAL patterns at the beginning of the file.
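A minimal start-code scanner can be sketched as follows (the payload bytes in the example stream are invented; only the markers and the 0x67/0x68 header bytes are meaningful):

```python
def split_nal_units(data: bytes):
    """Split an Annex B byte stream on 0x000001 / 0x00000001 sync markers."""
    # Find every position of the 3-byte start code; a preceding zero byte
    # just means it was the 4-byte variant.
    positions = []
    i = data.find(b'\x00\x00\x01')
    while i != -1:
        positions.append(i)
        i = data.find(b'\x00\x00\x01', i + 3)
    units = []
    for n, pos in enumerate(positions):
        start = pos + 3
        end = positions[n + 1] if n + 1 < len(positions) else len(data)
        unit = data[start:end]
        # Strip a trailing zero that belongs to the next 4-byte marker.
        if unit.endswith(b'\x00'):
            unit = unit[:-1]
        units.append(unit)
    return units

# A toy stream: 4-byte marker, fake NAL, 3-byte marker, another fake NAL.
# (Payloads invented; 0x67 is an SPS header byte, 0x68 a PPS header byte.)
stream = b'\x00\x00\x00\x01\x67\xAA\xBB' + b'\x00\x00\x01\x68\xCC'
units = split_nal_units(stream)
assert units == [b'\x67\xAA\xBB', b'\x68\xCC']
```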

[image: hexdump of the generated H.264 bitstream]

As mentioned, the decoder needs to know not only the image data, but also the details of the video, the frame, the colors, the parameters used, and more. The first byte of each NAL defines its category and type.

NAL type identifier Description
0 Unspecified
1 Coded slice of a non-IDR picture
2 Coded slice data partition A
3 Coded slice data partition B
4 Coded slice data partition C
5 Coded slice of an IDR picture
6 Supplemental enhancement information (SEI)
7 Sequence parameter set (SPS)
8 Picture parameter set (PPS)
9 Access unit delimiter
10 End of sequence
11 End of stream
... ...

Typically, the first NAL unit of a bitstream is an SPS. This NAL type informs the decoder of common encoding variables such as profile, level, resolution, and so on.

If we skip the first sync marker, we can decode the first byte to find out which NAL type is first.

For example, the first byte after the sync marker is 01100111. The first bit (0) is the forbidden_zero_bit field; the next 2 bits (11) are the nal_ref_idc field, which indicates whether this NAL is a reference; and the remaining 5 bits (00111) are the nal_unit_type field, in this case 7, an SPS NAL.
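That bit-unpacking logic looks like this:

```python
def parse_nal_header(byte: int):
    """Decode the three fields packed into the first byte of a NAL unit."""
    forbidden_zero_bit = (byte >> 7) & 0b1
    nal_ref_idc = (byte >> 5) & 0b11
    nal_unit_type = byte & 0b11111
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type

# 0b01100111 (0x67) is the SPS header byte discussed above.
f, ref, typ = parse_nal_header(0b01100111)
assert f == 0        # forbidden_zero_bit must be 0
assert ref == 0b11   # a reference NAL
assert typ == 7      # SPS
```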

The second byte of the SPS NAL (binary 01100100, hex 0x64, decimal 100) is the profile_idc field, which identifies the profile the encoder used. In this case it is the constrained high profile (a high profile without support for bidirectional B-slices).


If we read the H.264 bitstream specification for the SPS NAL, we find many fields, each with a parameter name, category and descriptor. For example, let's look at the pic_width_in_mbs_minus_1 and pic_height_in_map_units_minus_1 fields.

Parameter name Category Descriptor
pic_width_in_mbs_minus_1 0 ue(v)
pic_height_in_map_units_minus_1 0 ue(v)

If we do some math with the values of these fields, we get the resolution. A width of 1920 can be represented using pic_width_in_mbs_minus_1 with the value 119, since (119 + 1) * macroblock_size = 120 * 16 = 1920. Again, to save bits, instead of encoding 1920 we encode 119.
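In code:

```python
MACROBLOCK = 16  # AVC macroblock size in luma samples

pic_width_in_mbs_minus_1 = 119
width = (pic_width_in_mbs_minus_1 + 1) * MACROBLOCK
assert width == 1920  # 120 macroblocks, 16 pixels each
```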

If we continue examining our generated video in binary form (for example, xxd -b -c 11 v/minimal_yuv420.h264), we can jump to the last NAL, which is the frame itself.

[image: binary dump of the last NAL unit]

Here are its first 6 bytes: 01100101 10001000 10000100 00000000 00100001 11111111. Knowing that the first byte indicates the NAL type, here nal_unit_type (00101) is 5, a coded slice of an IDR picture, and we can explore it further:

[image: the slice header bit fields]

Using the information from the specification, we can decode the slice type (slice_type) and the frame number (frame_num), among other important fields.

To get the values of some fields (ue(v), me(v), se(v) or te(v)), we need to decode them with a special decoder based on the exponential Golomb code. This method is very efficient at encoding values when small (default) values are the most common.
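A minimal ue(v) decoder can be sketched as follows (operating on a string of bits for clarity; real decoders work on bytes):

```python
def read_ue(bits: str, pos: int = 0):
    """Decode one unsigned Exp-Golomb value ue(v) from a bit string.

    Count leading zeros up to a '1' marker, then read that many suffix
    bits; the value is 2**zeros - 1 + suffix.
    """
    zeros = 0
    while bits[pos + zeros] == '0':
        zeros += 1
    marker = pos + zeros                      # position of the '1' bit
    suffix = bits[marker + 1 : marker + 1 + zeros]
    value = (1 << zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, marker + 1 + zeros          # value and next bit position

# Small values get the shortest codes:
assert read_ue('1')[0] == 0
assert read_ue('010')[0] == 1
assert read_ue('011')[0] == 2
assert read_ue('00100')[0] == 3
# The width value 119 from the SPS example above:
assert read_ue('0000001111000')[0] == 119
```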

The slice_type and frame_num values in this video are 7 (I-slice) and 0 (the first frame).

A bitstream can be thought of as a protocol. If you want to know more about a bitstream, refer to the ITU H.264 specification. Here is a macro diagram showing where the image data (YUV in compressed form) resides.

[image: diagram showing where the compressed YUV data resides in the bitstream]

Other bitstreams can be explored in the same way, such as VP9, H.265 (HEVC), or even our new favorite, AV1. Are they all alike? No, but once you have dealt with one of them, it is much easier to understand the rest.

Do you want to practice? Explore the H.264 bit stream

You can generate a single-frame video and use MediaInfo to examine its H.264 bitstream. In fact, nothing prevents you from looking at the source code that parses the H.264 (AVC) bitstream.


For practice, you can use Intel Video Pro Analyzer (as mentioned, the program is paid, but there is a free trial limited to the first 10 frames).


Review

Note that many modern codecs use the same model we have just studied. Here, take a look at the block diagram of the Thor video codec: it contains all the steps we went through. The whole point of this post is to give you at least a better understanding of the innovations and the documentation in this area.

[image: block diagram of the Thor video codec]

Previously, we calculated that storing one hour of video at 720p and 30 fps would require 139 GB of disk space. If we use the methods discussed in this article (inter and intra prediction, transformation, quantization, entropy coding, etc.), then, assuming we spend 0.031 bits per pixel, we can achieve video of quite satisfactory quality that occupies only 367.82 MB instead of 139 GB.
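The math behind the 367.82 MB figure (taking MB to mean MiB, 2^20 bytes):

```python
# 720p, 30 fps, one hour, at 0.031 bits per pixel after compression.
width, height, fps, seconds = 1280, 720, 30, 3600
bits_per_pixel = 0.031

total_bits = width * height * bits_per_pixel * fps * seconds
megabytes = total_bits / 8 / 2**20  # bytes, then MiB

assert abs(megabytes - 367.82) < 0.01
```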

How does H.265 achieve better compression than H.264?

Now that we know more about how codecs work, it's easier to understand how new codecs are able to deliver higher resolution with fewer bits.

When comparing AVC and HEVC, don't forget that it is almost always a trade-off between CPU load and compression ratio.

HEVC has more (and larger) options for sections and subsections than AVC, more intra-prediction directions, improved entropy coding, and more. All these improvements made H.265 capable of compressing about 50% better than H.264.


First part: Fundamentals of working with video and images

Source: habr.com
