E-books and their formats: DjVu - its history, pros, cons and features

In the early 1970s, American writer Michael Hart got unlimited access to a Xerox Sigma 5 computer installed at the University of Illinois. To make good use of the machine's resources, he decided to create the first e-book by typing in the US Declaration of Independence.

Today, digital literature has become widespread, largely due to the development of portable devices (smartphones, readers, laptops). This has led to the emergence of a large number of e-book formats. Let's look at their features and tell the story of the most popular ones, starting with the DjVu format.

Photo: Lane Pearman / Flickr / CC

The emergence of the format

DjVu was developed in 1996 by AT&T Labs with the sole purpose of giving web developers a tool for distributing high-resolution images over the Internet.

At that time, 90% of all information was still kept on paper, and many important documents contained color images and photographs. To preserve the readability of the text and the quality of the pictures, high-resolution scans had to be made.

The classic web formats (JPEG, GIF, and PNG) could handle such images, but at the cost of file size. In the case of JPEG, for the text to be readable on a monitor, the document had to be scanned at a resolution of 300 dpi. A color magazine page then took up about 500 KB. Downloading files of this size from the Internet was a rather slow process at the time.

An alternative was to digitize paper documents using text recognition, but its accuracy at the time was far from ideal: after processing, the result had to be heavily edited by hand. Graphics and images were also left out. And even when a scanned image could be embedded into a text document, some visual details were lost, such as the color and texture of the paper, which are important components of historical documents.

To solve these problems, AT&T developed DjVu. It made it possible to compress scanned color documents at 300 dpi to 40-60 KB from an original size of 25 MB. Black-and-white pages were reduced to 10-30 KB.
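To put these numbers in perspective, here is a small back-of-the-envelope calculation. The page geometry (US Letter, 24-bit color, no compression) is an assumption for illustration, not something the article states, but under it a 300 dpi scan does come out at roughly 25 MB.

```python
# Assumed page geometry: US Letter, 8.5 x 11 inches (not stated in the article).
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0
DPI = 300
BYTES_PER_PIXEL = 3  # 24-bit RGB, uncompressed


def raw_scan_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Size of an uncompressed scan in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def compression_ratio(raw_bytes, compressed_kb):
    """How many times smaller the compressed file is (1 KB = 1000 bytes here)."""
    return raw_bytes / (compressed_kb * 1000)


raw = raw_scan_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BYTES_PER_PIXEL)
print(f"raw scan: {raw / 1e6:.1f} MB")
for size_kb in (40, 60):
    print(f"{size_kb} KB -> about {compression_ratio(raw, size_kb):.0f}:1")
```

Under these assumptions, the 40-60 KB figure corresponds to a compression ratio of several hundred to one.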

How DjVu Compresses Documents

DjVu can work with both scanned paper documents and other digital formats such as PDF. At the heart of DjVu is a technology that splits an image into three components: a foreground, a background, and a black-and-white (bitonal) mask.

The mask is saved with the resolution of the original file and contains the image of the text and other clear details - thin lines and diagrams - as well as contrasting pictures.

It has a resolution of 300 dpi to keep thin lines and letter outlines sharp, and is compressed using the JB2 algorithm, a variation of AT&T's proposal for the JBIG2 fax standard. A key feature of JB2 is that it looks for repeated characters on the page and stores the image of each one only once. In multi-page documents, every few consecutive pages share a common "dictionary" of such characters.
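The idea of storing each repeated character only once can be sketched as follows. This is a toy illustration with exact bitmap matching, not the real JB2 encoder; the function name and the tiny glyph bitmaps are made up for the example.

```python
def build_symbol_dictionary(glyphs):
    """glyphs: list of (bitmap, x, y); bitmap is a tuple of row strings.

    Returns (dictionary, placements): each distinct bitmap is stored once,
    and every occurrence becomes (dictionary index, x, y).
    """
    dictionary = []
    index_of = {}
    placements = []
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements


# Two hypothetical 3x4 glyphs and a "page" where one of them repeats.
GLYPH_A = ("010", "101", "111", "101")
GLYPH_B = ("110", "101", "110", "111")
page = [(GLYPH_A, 0, 0), (GLYPH_B, 4, 0), (GLYPH_A, 8, 0), (GLYPH_A, 12, 0)]

dictionary, placements = build_symbol_dictionary(page)
print(len(page), "glyphs on the page,", len(dictionary), "bitmaps stored")
```

The more often a character repeats, the more the page shrinks, because each extra occurrence costs only an index and a position.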

The background contains the page texture and illustrations, and its resolution is lower than that of the mask. A resolution of 100 dpi is enough to keep the background without perceptible loss.

The foreground stores the color information for the mask, and its resolution is usually reduced even further, since in most cases the text is black and the color is the same within a single printed character. The foreground and background are compressed with wavelet compression.
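The three-layer split described above can be sketched with a naive brightness threshold. This is only an illustration: the real DjVu segmenter is far more sophisticated, and the threshold and the two downsampling factors (3 for the background, 6 for the foreground) are assumptions chosen for the example.

```python
def luminance(pixel):
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def average_color(pixels):
    """Average color of a list of RGB pixels, or None if the list is empty."""
    if not pixels:
        return None
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) // n for c in range(3))


def downsample(image, mask, factor, keep_masked):
    """Average the masked (or unmasked) pixels of each factor x factor block."""
    height, width = len(image), len(image[0])
    layer = []
    for by in range(0, height, factor):
        row = []
        for bx in range(0, width, factor):
            picked = [
                image[y][x]
                for y in range(by, min(by + factor, height))
                for x in range(bx, min(bx + factor, width))
                if bool(mask[y][x]) == keep_masked
            ]
            row.append(average_color(picked))
        layer.append(row)
    return layer


def split_layers(image, threshold=128, bg_factor=3, fg_factor=6):
    """Split an RGB image (list of rows of (r, g, b)) into three layers."""
    # Mask: full resolution, 1 where the pixel is dark enough to count as ink.
    mask = [[1 if luminance(p) < threshold else 0 for p in row] for row in image]
    # Background: paper and pictures, at reduced resolution.
    background = downsample(image, mask, bg_factor, keep_masked=False)
    # Foreground: ink color only, at an even lower resolution.
    foreground = downsample(image, mask, fg_factor, keep_masked=True)
    return mask, foreground, background


PAPER = (240, 230, 200)  # yellowish paper
INK = (20, 20, 60)       # dark blue ink
image = [[PAPER] * 6 for _ in range(6)]
for y in range(1, 5):
    image[y][2] = INK    # a vertical stroke

mask, foreground, background = split_layers(image)
print(len(mask), "x", len(mask[0]), "mask,",
      len(background), "x", len(background[0]), "background,",
      len(foreground), "x", len(foreground[0]), "foreground")
```

The mask keeps every pixel of the stroke, while the paper color and the ink color each survive as just a handful of values.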

The final step in creating a DjVu document is entropy coding, in which an adaptive arithmetic coder turns the resulting sequences of symbols into a compact binary stream.
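Why adaptive entropy coding pays off can be shown with a simple model. The sketch below is not DjVu's actual coder: it only estimates how many bits an ideal arithmetic coder would need if it were driven by a basic adaptive model that counts the zeros and ones seen so far.

```python
import math


def adaptive_code_length(bits):
    """Ideal size in bits of a 0/1 sequence under an adaptive model.

    The model starts with one imaginary 0 and one imaginary 1, predicts the
    next bit from the counts seen so far, and updates after every bit. An
    arithmetic coder driven by these predictions would spend about
    -log2(p) bits on a symbol that was predicted with probability p.
    """
    counts = [1, 1]
    total = 0.0
    for bit in bits:
        p = counts[bit] / (counts[0] + counts[1])
        total += -math.log2(p)
        counts[bit] += 1
    return total


mostly_white = [0] * 95 + [1] * 5   # a row of a mask with little ink
balanced = [0, 1] * 50              # as many ones as zeros

print(round(adaptive_code_length(mostly_white)), "bits for the skewed 100-bit row")
print(round(adaptive_code_length(balanced)), "bits for the balanced 100-bit row")
```

The skewed row, which resembles a mostly empty page, needs far fewer than 100 bits, while the balanced one gains nothing under this simple model.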

Advantages of the format

DjVu's mission was to preserve the "properties" of a paper document in digital form while allowing even weak computers to work with such documents. For this reason, DjVu viewer software has a "fast rendering" capability: only the part of the DjVu page that needs to be displayed on the screen is loaded into memory.
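The principle of loading only what is on screen can be sketched with a tile grid. Splitting the page into fixed-size tiles is an assumption made for this illustration, not a description of how any particular DjVu viewer works; the function name and the 256-pixel tile size are hypothetical.

```python
def visible_tiles(viewport, page_size, tile_size=256):
    """Return (column, row) indices of tiles that intersect the viewport.

    viewport  : (left, top, width, height) in page pixels
    page_size : (width, height) of the full page in pixels
    """
    left, top, width, height = viewport
    page_w, page_h = page_size
    # Clamp the viewport to the page.
    x0, y0 = max(left, 0), max(top, 0)
    x1, y1 = min(left + width, page_w), min(top + height, page_h)
    if x0 >= x1 or y0 >= y1:
        return []
    first_col, last_col = x0 // tile_size, (x1 - 1) // tile_size
    first_row, last_row = y0 // tile_size, (y1 - 1) // tile_size
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


page = (2550, 3300)                      # a 300 dpi page, in pixels
tiles = visible_tiles((300, 500, 800, 600), page)
total = -(-page[0] // 256) * -(-page[1] // 256)   # ceiling division
print(f"decode {len(tiles)} of {total} tiles")
```

For an 800 x 600 window, only a small fraction of the page has to be decoded and kept in memory.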

It also makes it possible to view partially downloaded files, that is, individual pages of a multi-page DjVu document. In this case, progressive rendering of image details is used: the components gradually "appear" as the file is downloaded (as in JPEG).

When the format was introduced, a page loaded in three stages: first the text component was loaded, then, a couple of seconds later, the first versions of the images and the background. After that, the whole page of the book gradually came into full detail.
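Coarse-to-fine loading of this kind fits naturally with wavelet-style coding. The sketch below uses a single step of the simple Haar transform on one row of pixels; it is a simplified stand-in chosen for illustration, not the wavelet DjVu actually uses, and it assumes the row has an even number of samples.

```python
def haar_split(samples):
    """One Haar step: pairwise averages (coarse) and half-differences (detail)."""
    pairs = list(zip(samples[0::2], samples[1::2]))
    coarse = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return coarse, detail


def haar_merge(coarse, detail):
    """Inverse of haar_split."""
    out = []
    for c, d in zip(coarse, detail):
        out += [c + d, c - d]
    return out


row = [200, 198, 60, 64, 201, 199, 62, 58]   # one row of gray values
coarse, detail = haar_split(row)

# Stage 1: only the coarse half has arrived, details are assumed to be zero.
preview = haar_merge(coarse, [0] * len(detail))
# Stage 2: the details arrive and the row is restored exactly.
final = haar_merge(coarse, detail)

print("preview:", preview)
print("final:  ", final)
```

The preview is already recognizable after half of the data, and the remaining half only sharpens it.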

The layered structure also allows you to search through scanned books, since there is a dedicated text layer. This proved handy for technical literature and reference books, so DjVu became the basis for several libraries of scientific books. For example, in 2002 the Internet Archive chose it as one of the formats (along with TIFF and PDF) for a project to preserve scanned books from open sources.

Format Disadvantages

However, like all technologies, DjVu has its downsides. For example, when book scans are encoded into DjVu, some characters in the document can be replaced with others that look similar. Most often this happens with the letters "i" and "n", which is why it became known as the "yin problem". It does not depend on the language of the text and also affects numbers and other small repeating characters.

It is caused by character classification errors in the JB2 encoder. The encoder splits the scans into groups of 10-20 pages and builds a dictionary of common symbols for each group. The dictionary contains samples of common letters and numbers along with the pages and coordinates where they appear. When you view a DjVu book, the characters from the dictionary are substituted in the right places.

This reduces the size of the DjVu file. However, if the images of two letters are visually similar, the encoder can confuse them or treat them as the same character. Sometimes this results in corrupted formulas in a technical document. To avoid the problem, you can give up this kind of compression, but that increases the size of the digital copy of the book.
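The substitution problem can be reproduced with a toy encoder that reuses a dictionary entry whenever a new glyph differs from it by only a few pixels. This is a simplified sketch, not the real JB2 classifier; the 3 x 5 digit bitmaps and the `max_diff` tolerance are invented for the example, and all bitmaps are assumed to have the same size.

```python
def hamming(a, b):
    """Number of differing pixels between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, max_diff):
    """Map each glyph bitmap to a dictionary index, reusing any entry that
    differs by at most max_diff pixels (max_diff=0 means exact matching)."""
    dictionary, indices = [], []
    for bitmap in glyphs:
        for i, entry in enumerate(dictionary):
            if hamming(entry, bitmap) <= max_diff:
                indices.append(i)
                break
        else:
            # No entry was close enough: store this bitmap as a new symbol.
            indices.append(len(dictionary))
            dictionary.append(bitmap)
    return dictionary, indices


def decode(dictionary, indices):
    """Rebuild the glyph sequence by substituting dictionary entries."""
    return [dictionary[i] for i in indices]


THREE = ("111", "001", "111", "001", "111")
EIGHT = ("111", "101", "111", "101", "111")
scan = [THREE, EIGHT, THREE]           # the scanned text reads "383"

exact = decode(*encode(scan, max_diff=0))
lossy = decode(*encode(scan, max_diff=2))

print("exact matching preserves the text:", exact == scan)
print("tolerant matching preserves the text:", lossy == scan)
```

With the tolerant setting the dictionary is smaller, but the "8" is silently replaced by a "3", which is exactly the kind of error that corrupts formulas and numbers.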

Another disadvantage of the format is that many modern operating systems (including mobile ones) do not support it by default. To work with it, you need to install third-party software, such as DjVuReader, WinDjView, or Evince. Note, however, that some e-readers (for example, ONYX BOOX) support DjVu out of the box, since the necessary applications are preinstalled.

By the way, we covered what else applications for Android-based readers can do in one of our previous articles.

Pictured: the ONYX BOOX Chronos reader

Another problem shows up when working with DjVu documents on the small screens of mobile devices: smartphones, tablets, readers. DjVu files are sometimes scans of a full book spread, and professional literature and working papers are often in A4 format, so you have to pan around the image in search of information.

This problem can also be solved. The easiest way, of course, is to look for the document in a different format. If that is not an option (for example, you need to work with a large amount of technical literature in DjVu), you can use e-readers with a large screen, from 9.7 to 13.3 inches, which are designed specifically for working with such documents.

For example, in the ONYX BOOX line, such devices are the Chronos and the MAX 2 (by the way, we have prepared a review of this reader model and will soon publish it on our blog), as well as the Note, which has a 10.3-inch E Ink Mobius Carta screen with increased resolution. Such devices let you comfortably view all the details of illustrations at their original size and are suitable for those who often read educational or technical literature. DjVu and PDF files are opened with NEO Reader, which lets you adjust the contrast and thickness of digitized fonts.

Despite its shortcomings, DjVu remains one of the most popular formats for "preserving" literary works. This is largely because the format is open, and because modern technologies make it possible to work around some of its technical limitations.

In upcoming articles, we will continue the story of how e-book formats appeared and how they work.




Source: habr.com
