How the JPEG format works

JPEG images are ubiquitous in our digital lives, but behind that familiarity lie algorithms that remove details the human eye can't see. The result is high visual quality at a very small file size - but how exactly does it all work? Let's look at what exactly our eyes don't see!

It's easy to take for granted the ability to send a photo to a friend and not worry about what device, browser, or operating system they're using - but that wasn't always the case. By the early 1980s, computers could store and display digital images, but there were many competing ideas about the best way to do this. You couldn't just send an image from one computer to another and hope it worked.

To solve this problem, a committee of experts from around the world was assembled in 1986 under the name "Joint Photographic Experts Group" (JPEG). It was founded as a joint effort of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), two international standards organizations headquartered in Geneva, Switzerland.

This group created the JPEG digital image compression standard in 1992. Anyone who has used the Internet has probably encountered JPEG-encoded images. It is the most common way to encode, send, and store images. From web pages to email to social media, JPEG is used billions of times a day - almost every time we view or send an image online. Without JPEG, the web would be less colorful, slower, and would probably have fewer cat pictures!

This article is about how to decode a JPEG image: in other words, what it takes to convert compressed data stored on a computer into an image that appears on the screen. This is worth knowing not only because it helps us understand technology we use every day, but also because peeling back the levels of compression teaches us something about perception and vision, and about which details our eyes are most sensitive to.

Besides, playing with images this way is a lot of fun.

Looking inside a JPEG

On a computer, everything is stored as a sequence of binary numbers. Usually these bits, zeros and ones, are grouped in eights, making up bytes. When you open a JPEG image on your computer, something (browser, operating system, whatever) has to decode the bytes, restoring the original image as a list of colors that can be displayed.

If you download this cute photo of a cat and open it in a text editor, you'll see a bunch of jumbled characters.

Here I'm using Notepad++ to inspect the contents of the file, because common text editors such as Windows Notepad corrupt the binary data on saving, so the file no longer conforms to the JPEG format.

Opening an image in a text editor confuses your computer, just as you confuse your brain when you rub your eyes and start seeing colored spots!

These spots you see are known as phosphenes, and are not the result of exposure to a light stimulus or hallucinations generated by the mind. They occur because your brain thinks that any electrical signals in the optic nerves carry information about light. The brain needs to make such assumptions, because there is no way to know whether the signal is a sound, a vision, or something else. All nerves in the body transmit exactly the same electrical impulses. When you apply pressure to your eyes, you send signals that are not visual, but activate receptors in the eye, which your brain interprets—in this case, incorrectly—as something visual. You can literally see the pressure!

It's funny to think about how similar computers are to the brain, but it's also a useful analogy: it illustrates how much the meaning of data, whether carried through the body by nerves or stored in a computer, depends on how it is interpreted. All binary data is made up of 0s and 1s, basic components capable of conveying any kind of information. Your computer often guesses how to interpret them from clues such as file extensions. Here we are forcing it to interpret them as text, because that's what a text editor expects.
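To make the point about interpretation concrete, here is a minimal Python sketch. The four byte values are made up for illustration and are not taken from a real JPEG file; the same bytes are read once as numbers and once as text.

```python
# The same four bytes, read two different ways.
data = bytes([74, 80, 69, 71])

as_numbers = list(data)           # what a byte editor would show, in decimal
as_text = data.decode("latin-1")  # what a text editor would show

print(as_numbers)  # [74, 80, 69, 71]
print(as_text)     # JPEG
```

Nothing about the bytes themselves says which reading is the right one; that decision is made by whatever program opens them.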

To understand how to decode a JPEG, we need to see the original signals themselves: the binary data. This can be done with a hex editor, or directly on the web page of the original article. There, next to an image, a text field shows all of its bytes (except for the header) in decimal form. You can change them, and a script will re-encode the data and produce a new image on the fly.

You can learn a lot just by playing around with this editor. For example, can you tell in what order the pixels are stored?

The strange thing in this example is that changing some numbers does not affect the image at all, while replacing, say, the number 17 with 0 on the first line ruins the photo completely!

Other changes, such as changing the 7 on line 1988 to 254, change the color, but only of the subsequent pixels.

Perhaps the strangest thing is that some numbers change not only the color, but also the shape of the image. Change 70 on line 12 to 2 and look at the top row of the image to see what I mean.

And no matter which JPEG image you use, you'll always find these cryptic checkerboard patterns when editing bytes.

Playing with the editor, it's hard to figure out how a photo is recreated from these bytes, because JPEG compression consists of three different techniques applied one after another, in levels. We will study each of them separately to explain the mysterious behavior we have observed.

Three levels of JPEG compression:

  1. Color subsampling.
  2. Discrete cosine transform and quantization.
  3. Run-length, delta, and Huffman coding.

To give you an idea of the scale of the compression, note that the image above consists of 79,819 numbers, which is about 79 KB. If we were to store it without compression, we would need three numbers for each pixel: one each for the red, green, and blue components. That would amount to 917,700 numbers, or about 917 KB. JPEG compression has made the final file more than 10 times smaller!
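The uncompressed size is easy to check by hand. The sketch below assumes the photo is 700 x 437 pixels (the dimensions this article uses later for the black rectangle) and one byte per color component; the function name is hypothetical.

```python
def raw_rgb_size(width, height, channels=3):
    """Bytes needed to store every pixel uncompressed, one byte per channel."""
    return width * height * channels

raw = raw_rgb_size(700, 437)
print(raw)                  # 917700
print(round(raw / 16_000))  # 57
```

The second line divides by 16,000 bytes, which gives roughly the "57 times" figure quoted for the 16 KB version of the photo.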

In fact, this image can be compressed much further. Below are two images side by side; the one on the right was compressed to 16 KB, which is 57 times smaller than the uncompressed version!

If you look closely, you will see that these images are not identical. Both are JPEG-compressed pictures, but the one on the right is a much smaller file. It also looks a little worse (look at the squares of color in the background). This is why JPEG is called a lossy compression format: during compression the image changes and loses some detail.

1. Color subsampling

Here is an image with only the first level of compression applied.

(Interactive version in the original article.) Removing one number destroys all the colors. However, removing exactly six numbers has little to no effect on the image.

Now the numbers are a little easier to decipher. This is almost a simple list of colors in which each byte changes exactly one pixel, yet it is already half the size of the uncompressed image (which would take about 300 KB at this reduced size). Can you guess why?

You can see that these numbers do not represent the standard red, green, and blue components, because if we replace all the numbers with zeros, we get a green image (not a black one).

This is because these bytes stand for the Y (brightness), Cb (relative blueness), and Cr (relative redness) components of the picture.
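The green result of zeroing everything can be reproduced with a few lines of Python. This is a sketch that assumes the full-range conversion constants commonly used with JPEG (JFIF) files; the function names are made up for the example.

```python
def clamp(value):
    """Round a channel and limit it to the 0..255 range of a byte."""
    return max(0, min(255, round(value)))

def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range YCbCr pixel to RGB (JFIF-style constants)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)

print(ycbcr_to_rgb(0, 0, 0))        # (0, 135, 0)
print(ycbcr_to_rgb(128, 128, 128))  # (128, 128, 128)
```

With Cb and Cr at 0 instead of their neutral value of 128, red and blue are pushed below zero and clipped, while green is pushed up, so an all-zero file decodes to green rather than black.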

Why not use RGB? After all, this is how most modern screens work. Your monitor can display any color by combining red, green, and blue at different intensities for each pixel. White is produced by turning all three on at full brightness, and black by turning them all off.

It is also very similar to how the human eye works. The color receptors in our eyes are called "cones" and come in three types, each more sensitive to red, green, or blue light [S-type cones are sensitive in the violet-blue part of the spectrum (S for short-wavelength), M-type in the green-yellow part (M for medium-wavelength), and L-type in the yellow-red part (L for long-wavelength). The presence of these three types of cones (and of rods, which are sensitive in the emerald-green part of the spectrum) gives humans color vision. Translator's note]. Rods, the other type of photoreceptor in our eyes, can detect changes in brightness but not color, and they are far more sensitive. Our eyes have about 120 million rods and only 6 million cones.

Therefore, our eyes notice changes in brightness much better than changes in color. If you separate the color from the brightness, you can remove a little color, and no one will notice anything. Chroma subsampling is the process of representing the color components of an image at a lower resolution than the luminance components. In the example above, each pixel has exactly one Y component, and each individual group of four pixels has exactly one Cb and one Cr component. Therefore, the image contains four times less color information than the original.
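Here is a small Python sketch of that idea. It assumes that each group of four pixels is a 2x2 square and that the shared color value is the average of the four originals; an encoder is free to choose the value differently, so treat the averaging as an illustration.

```python
def subsample_2x2(plane):
    """Average each 2x2 group of samples into a single value.

    `plane` is a list of rows; width and height are assumed to be even.
    """
    result = []
    for row in range(0, len(plane), 2):
        out_row = []
        for col in range(0, len(plane[row]), 2):
            total = (plane[row][col] + plane[row][col + 1]
                     + plane[row + 1][col] + plane[row + 1][col + 1])
            out_row.append(round(total / 4))
        result.append(out_row)
    return result

cb = [
    [100, 102, 200, 200],
    [98, 100, 200, 200],
    [50, 50, 10, 30],
    [50, 50, 20, 40],
]
print(subsample_2x2(cb))  # [[100, 200], [50, 25]]
```

For this 4x4 example, 16 Y values plus 4 Cb and 4 Cr values make 24 numbers instead of 48, which is where the halving of the file size comes from.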

The YCbCr color space is not only used in JPEGs. It was originally invented in 1938 for television. Not everyone had a color TV, so separating color from brightness allowed everyone to receive the same signal, and TVs without color simply used only the brightness component.

This is why removing one number in the editor completely destroys all the colors. The components are stored in the form YYYY Cb Cr (in fact, not necessarily in this order: the storage order is specified in the file header). Removing the first number causes the first Cb value to be interpreted as Y, Cr as Cb, and so on, producing a domino effect that shifts all the colors in the picture.
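The domino effect can be simulated with a flat list of numbers. This sketch assumes the YYYY Cb Cr order described above; the sample values and the helper name are invented for the example.

```python
def split_groups(stream, size=6):
    """Split a flat list into (Y, Y, Y, Y, Cb, Cr) groups; a short tail is dropped."""
    usable = len(stream) - len(stream) % size
    return [tuple(stream[i:i + size]) for i in range(0, usable, size)]

stream = [10, 11, 12, 13, 90, 160, 20, 21, 22, 23, 91, 161]

print(split_groups(stream))
# [(10, 11, 12, 13, 90, 160), (20, 21, 22, 23, 91, 161)]

# Deleting one number shifts every later value into the wrong role.
print(split_groups(stream[1:]))
# [(11, 12, 13, 90, 160, 20)]

# Deleting six numbers removes one whole group and keeps the rest aligned.
print(split_groups(stream[6:]))
# [(20, 21, 22, 23, 91, 161)]
```

In the second call the Cb value 90 lands in a Y slot and the Cr value 160 lands in the Cb slot, which is exactly the mix-up that scrambles the colors.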

The JPEG specification does not require you to use YCbCr. Most files use it, however, because it gives better-looking images after subsampling than RGB does. But you don't have to take my word for it. See for yourself in the table below what subsampling each individual component looks like in both RGB and YCbCr.

(Interactive version in the original article.)

The removal of blue is not as noticeable as red or green. That's because of the six million cones in your eyes, about 64% are sensitive to red, 32% to green, and 2% to blue.

Subsampling of the Y component (bottom left) is the most visible. Even a small change is noticeable.

Converting an image from RGB to YCbCr does not reduce the file size, but it does make it easier to find less noticeable details that can be removed. Further lossy compression happens at the second level, which is based on the idea of presenting the data in a more compressible form.

2. Discrete cosine transform and quantization

This level of compression is, for the most part, what defines JPEG. After the colors are converted to YCbCr, the components are compressed individually, so from now on we can concentrate on the Y component alone. Here is what the bytes of the Y component look like after this level is applied.

(Interactive version in the original article.) In the interactive version, clicking on a pixel scrolls the editor to the line that represents it. Try removing numbers from the end or adding a few zeros to a certain number.

At first glance, it looks like very bad compression. There are 100,000 pixels in the image, and it takes more than 100,000 numbers to describe their brightness (Y components) - that's worse than not compressing anything at all!

However, note that most of these numbers are zero. Moreover, all the zeros at the ends of the lines can be removed without changing the image. That leaves about 26,000 numbers, which is almost 4 times fewer!

This level holds the secret of the checkerboard patterns. Unlike the other effects we've seen, these patterns are not a glitch: they are the building blocks of the whole image. Each line of the editor contains exactly 64 numbers, the discrete cosine transform (DCT) coefficients, which correspond to the intensities of 64 unique patterns.

These patterns are generated from cosine functions. Here's what some of them look like:

8 of the 64 coefficients

Below is an image showing all 64 patterns.

(Interactive version in the original article.)

These patterns matter because they form a basis for 8x8 images. If you are unfamiliar with linear algebra, this means that any 8x8 image can be built from these 64 patterns. The DCT is the process of breaking an image into 8x8 blocks and converting each block into a combination of these 64 patterns, weighted by 64 coefficients.

The fact that any image can be composed of 64 specific patterns seems like magic. However, it is the same as saying that any place on Earth can be described by two numbers, latitude and longitude [together with an indication of the hemispheres. Translator's note]. We often think of the Earth's surface as two-dimensional, so we only need two numbers. An 8x8 image has 64 dimensions, so we need 64 numbers.

It is not yet clear how this helps us in terms of compression. If we need 64 numbers to represent an 8x8 image, why would this be better than just storing 64 luminance components? We do this for the same reason we turned three RGB numbers into three YCbCr numbers: it allows us to remove subtle details.
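To see that 64 coefficients really do describe an 8x8 block, here is a compact Python sketch of the two-dimensional DCT and its inverse. It uses the orthonormal DCT-II with plain loops and no optimizations, and it leaves out details of real encoders such as shifting pixel values before the transform; the function names are made up for the example.

```python
import math

N = 8

def scale(k):
    """Normalisation factor that makes the transform orthonormal."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def pattern(u, v, x, y):
    """Value of basis pattern (u, v) at pixel (x, y)."""
    return (scale(u) * scale(v)
            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
            * math.cos((2 * y + 1) * v * math.pi / (2 * N)))

def dct(block):
    """64 pixel values -> 64 coefficients (how much of each pattern is present)."""
    return [[sum(block[y][x] * pattern(u, v, x, y)
                 for y in range(N) for x in range(N))
             for u in range(N)] for v in range(N)]

def idct(coeffs):
    """64 coefficients -> 64 pixel values (a weighted sum of the patterns)."""
    return [[sum(coeffs[v][u] * pattern(u, v, x, y)
                 for v in range(N) for u in range(N))
             for x in range(N)] for y in range(N)]

flat = [[100] * N for _ in range(N)]
coeffs = dct(flat)
print(round(coeffs[0][0]))    # 800
restored = idct(coeffs)
print(round(restored[3][5]))  # 100
```

A solid block needs only the first pattern: every other coefficient comes out as zero (up to floating-point error), which hints at why this representation compresses well.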

It's hard to see exactly what details are being removed at this stage because JPEG applies DCT to 8x8 blocks. However, no one forbids us to apply it to the whole picture. Here's what the DCT looks like for the Y component when applied to the whole picture:

More than 60,000 numbers can be removed from the end with virtually no noticeable change in the photo.

Note, however, that if we zero out the first five numbers, the difference will be obvious.

The numbers at the beginning represent low frequency changes in the image, and our eyes pick them up the best. Numbers towards the end indicate high frequency changes that are harder to notice. To "see what the eye can't see" we can isolate these high frequency details by zeroing out the first 5000 numbers.

We see all the areas of the image where the greatest change occurs from pixel to pixel. The cat's eyes, his whiskers, the terry blanket, and the shadows in the lower left corner stand out. You can go further by zeroing out the first 10,000 numbers, then 20,000, 40,000, and 60,000.


These high-frequency details are what JPEG removes during compression. Converting colors to DCT coefficients is lossless. The loss happens at the quantization step, where values that are high-frequency or close to zero are removed. When you lower the quality setting while saving a JPEG, the program raises the threshold for how many values are removed, which reduces the file size but makes the picture more pixelated. That is why the image in the first section, which was 57 times smaller, looked the way it did: each 8x8 block was represented by far fewer DCT coefficients than in the higher-quality version.
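A rough Python sketch of this lossy step follows. It is a simplification: it divides every coefficient by the same step size, whereas real JPEG files store a table with a separate step for each of the 64 coefficients. The coefficient values are invented for the example.

```python
def quantize(coeffs, step):
    """Divide each coefficient by a step size and round; small values become 0."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Approximate reconstruction: the rounding error is lost for good."""
    return [level * step for level in levels]

coeffs = [812, -43, 17, 7, -3, 1, 0, 1]

fine = quantize(coeffs, 4)
coarse = quantize(coeffs, 40)
print(fine)                    # [203, -11, 4, 2, -1, 0, 0, 0]
print(coarse)                  # [20, -1, 0, 0, 0, 0, 0, 0]
print(dequantize(coarse, 40))  # [800, -40, 0, 0, 0, 0, 0, 0]
```

A larger step (lower quality) turns more coefficients into zeros, which is good for the later lossless stage but means the decoder can only recover an approximation of the original values.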

You can also do cool things like streaming images progressively: first display a blurry picture, then make it more and more detailed as more coefficients are downloaded.

Just for fun, you can keep only 24,000 numbers, or even just 5,000: the picture is very blurry but still recognizable!

3. Run-length, delta, and Huffman coding

So far, all stages of compression have been lossy. The last stage, on the contrary, goes without loss. It does not remove information, but significantly reduces the file size.

How can you compress something without discarding information? Imagine how we would describe a simple black 700 x 437 rectangle.

JPEG uses 5000 numbers for this, but much better results can be achieved. Can you imagine an encoding scheme that describes such an image in as few bytes as possible?

The minimal scheme I could come up with uses four numbers: three for the color, and a fourth for how many pixels have that color. The idea of representing repeated values in such a compact way is called run-length coding. It is lossless because we can recover the encoded data in its original form.

A JPEG file of a black rectangle is much larger than 4 bytes. Remember that at the DCT level, compression is applied to blocks of 8x8 pixels, so at a minimum we need one DCT coefficient for every 64 pixels. We need only one because, instead of storing a single DCT coefficient followed by 63 zeros, run-length encoding lets us store a single number and note that "all the others are zeros".
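Here is a generic run-length coder in Python. It is a sketch of the idea only: the scheme actually used inside JPEG files differs in its details, and the coefficient value below is purely illustrative.

```python
def rle_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def rle_decode(runs):
    """Expand [value, count] pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]

block = [-1024] + [0] * 63   # one DCT coefficient followed by 63 zeros
encoded = rle_encode(block)
print(encoded)               # [[-1024, 1], [0, 63]]
print(rle_decode(encoded) == block)  # True
```

Sixty-four numbers shrink to two pairs, and decoding gives back exactly the same list, which is what makes the technique lossless.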

Delta encoding is a technique in which each byte stores the difference from some other value rather than an absolute value. This is why editing certain bytes changes the color of all the pixels that follow. For example, instead of storing

12 13 14 14 14 13 13 14

We could start with 12 and then just write down how much to add or subtract to get the next number. And this sequence in delta coding takes the form:

12 1 1 0 0 -1 0 1

The converted data is not smaller than the original data, but it is easier to compress it. Applying delta encoding before run-length encoding can help a lot while still being lossless compression.
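The example sequence can be checked with a short Python sketch; the function names are made up for illustration.

```python
def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values

original = [12, 13, 14, 14, 14, 13, 13, 14]
encoded = delta_encode(original)
print(encoded)                # [12, 1, 1, 0, 0, -1, 0, 1]
print(delta_decode(encoded))  # [12, 13, 14, 14, 14, 13, 13, 14]

# Changing only the first number shifts every decoded value.
print(delta_decode([20] + encoded[1:]))  # [20, 21, 22, 22, 22, 21, 21, 22]
```

The last line shows the side effect seen in the byte editor: because every value is rebuilt from the one before it, a single edit propagates through everything that follows.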

Delta encoding is one of the few techniques applied across 8x8 blocks rather than within them. Of the 64 DCT coefficients, one corresponds to a constant pattern (a solid color). It represents the average brightness of each block for the luminance component, or the average blueness for the Cb component, and so on. The first value of each DCT block is called the DC value, and each DC value is delta encoded relative to the previous one. This is why changing the brightness of the first block affects all the blocks.

One last mystery remains: how can changing a single number ruin the whole picture? None of the compression levels so far has had that property. The answer lies in the JPEG header. The first 500 bytes contain metadata about the image (width, height, and so on), and so far we have not touched them.

Without the header, decoding a JPEG is almost impossible (well, very difficult). It would be as if I tried to describe a picture to you using words I had invented to convey my impression. The description would probably be very concise, since I could invent words with exactly the meaning I wanted, but they would make no sense to anyone else.

It sounds silly, but that is exactly what happens. Each JPEG image is compressed with codes specific to it. The dictionary of codes is stored in the header. The technique is called Huffman coding, and the dictionary is called a Huffman table. In the header, the table is marked by two bytes: 255 followed by 196. Each color component can have its own table.
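Finding those two marker bytes takes only a simple scan, as this Python sketch shows. The byte sequence is hand-made rather than a real JPEG, and a real parser would walk the header segment by segment using each segment's length field instead of scanning blindly.

```python
def find_marker(data, marker=(255, 196)):
    """Return every offset at which the two marker bytes appear back to back."""
    first, second = marker
    return [i for i in range(len(data) - 1)
            if data[i] == first and data[i + 1] == second]

# Hand-made bytes, not a real file: filler, a marker, more filler, another marker.
sample = bytes([255, 216, 0, 1, 255, 196, 7, 7, 7, 255, 196, 9])
print(find_marker(sample))  # [4, 9]
```

Two hits would correspond to two tables, consistent with the remark that each color component can have its own.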

Table changes will drastically affect any image. A good example is to change 15 to 1 on the 12th line.

This is because the tables specify how individual bits are to be read. So far, we have only worked with binary numbers in decimal form. But this hides from us the fact that if you want to store the number 1 in a byte, then it will look like 00000001, because each byte must have exactly eight bits, even if only one of them is needed.

This is potentially a big waste of space if you have a lot of small numbers. Huffman code is a technique that allows us to relax this requirement that each number must occupy eight bits. This means that if you see two bytes:

234 115

Then, depending on the Huffman table, it can be three numbers. To extract them, you first need to split them into individual bits:

11101010 01110011

Then we turn to the table to understand how to group them. For example, it could be the first six bits (111010), or 58 in decimal, followed by five bits (10011), or 19, and finally the last five bits (10011), again 19.
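The regrouping can be tried out in Python. In this sketch the group widths are simply chosen by hand; in a real file they follow from the Huffman table. Since two bytes hold 16 bits, the widths have to add up to 16, so the example uses 6, 5, and 5.

```python
def to_bits(byte_values):
    """Write each byte as exactly eight binary digits and join them."""
    return "".join(format(b, "08b") for b in byte_values)

def split_bits(bits, widths):
    """Cut the bit string into groups of the given widths and read each as a number."""
    numbers, pos = [], 0
    for width in widths:
        numbers.append(int(bits[pos:pos + width], 2))
        pos += width
    return numbers

bits = to_bits([234, 115])
print(bits)                         # 1110101001110011
print(split_bits(bits, [6, 5, 5]))  # [58, 19, 19]
print(split_bits(bits, [8, 8]))     # [234, 115]
```

The same 16 bits give three numbers with one grouping and the two original bytes with another, which is why the byte values seen in an editor say little about the data they carry.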

This is why the bytes are so hard to make sense of at this stage of compression: they don't represent what they seem to. I won't go into the details of working with the table in this article, but there is plenty of material on the subject online.

One cool trick this knowledge makes possible is to separate the header from the JPEG and store it separately. In effect, only you can then read the file. Facebook does this to make its files even smaller.

Another trick is to change the Huffman table a little. To everyone else it will look like a corrupted picture, and only you will know the magic change that fixes it.

To sum up: what does it take to decode a JPEG? You need to:

  1. Extract the Huffman table(s) from the header and decode the bits.
  2. Extract the discrete cosine transform coefficients of each color and luminance component for every 8x8 block by reversing the run-length and delta encoding.
  3. Combine the cosine patterns according to the coefficients to get the pixel values of each 8x8 block.
  4. Scale the color components if subsampling was performed (this information is in the header).
  5. Convert the resulting YCbCr values of each pixel to RGB.
  6. Display the image on the screen!

Serious work for simply viewing a photo with a cat! However, what I like about it is that it shows how human-centric JPEG technology is. It is based on the features of our perception, which allows us to achieve much better compression than conventional technologies. And now, understanding how JPEG works, you can imagine how these technologies can be transferred to other areas. For example, delta encoding in video can result in a significant reduction in file size, since there are often entire areas that do not change from frame to frame (for example, the background).

The code used in the article is open source and includes instructions for replacing the pictures with your own.

Source: habr.com
