Unreal Features of Real Types, or Be Careful with REAL

After the publication Articles about the peculiarities of typing in PostgreSQL, the first comment was about the difficulties of working with real numbers. I decided to skim through the SQL queries available to me to see how often they use the REAL type. Quite often used, as it turned out, and developers do not always understand the dangers behind it. And this is despite the fact that there are a lot of good articles on the Internet and on Habré about the features of storing real numbers in machine memory and about working with them. Therefore, in this article I will try to apply such features to PostgreSQL, and I will try to “fine-tip” the troubles associated with them, so that it would be easier for SQL query developers to avoid them.

The PostgreSQL documentation contains the succinct phrase: "Managing such errors and their propagation during computations is the subject of a whole branch of mathematics and computer science, and is not considered here" (while prudently referring the reader to the IEEE 754 standard). What are the errors here? Let's discuss them in order, and it will soon become clear why I took up the pen again.

Take, for example, a simple query:

********* ЗАПРОС *********
SELECT 0.1::REAL;
**************************
float4
--------
    0.1
(1 строка)

As a result, we will not see anything special - we will get the expected 0.1. But now let's compare it with 0.1:

********* ЗАПРОС *********
SELECT 0.1::REAL = 0.1;
**************************
?column?
----------
f
(1 строка)

Not equal! What miracles! But further, more. Someone will say, they say, I know that REAL behaves badly with fractions, well, so I will enter integers there, everything will definitely be fine with them. Ok, let's cast 123 to REAL:

********* ЗАПРОС *********
SELECT 123456789::REAL::INT;
**************************
   int4   
-----------
123456792
(1 строка)

And it turned out to be 3 more! Everything, the base has finally forgotten how to count! Or are we misunderstanding something? Let's figure it out.

To begin with, let's remember the materiel. As you know, any decimal number can be expanded in powers of ten. So, the number 123.456 will be equal to 1*102 + 2*101 + 3*100 + 4*10-1 + 5*10-2 + ​​6*10-3. But the computer operates with numbers in binary form, therefore they have to be represented as an expansion in powers of two. Therefore, the number 5.625 in binary is represented as 101.101 and will be equal to 1*22 + 0*21 + 1*20 + 1*2-1 + 0*2-2 + 1*2-3. And if positive powers of two always give whole decimal numbers (1, 2, 4, 8, 16, etc.), then everything is more difficult with negative ones (0.5, 0.25, 0.125, 0,0625, etc.). The problem is that not every decimal fraction can be represented as a final binary fraction. So, our notorious 0.1 in the form of a binary fraction appears as a periodic value of 0.0 (0011). Consequently, the final value of this number in the machine memory will vary depending on the capacity.

Now is the time to remember how real numbers are stored in computer memory. Generally speaking, a real number consists of three main parts - sign, mantissa, and exponent. The sign can be either plus or minus, so one bit is assigned to it. But the number of bits of the mantissa and exponent is determined by the real type. So, for the REAL type, the length of the mantissa is 23 bits (one bit equal to 1 is implicitly added to the beginning of the mantissa, and it turns out 24), and the exponent is 8 bits. The total is 32 bits, or 4 bytes. And for the DOUBLE PRECISION type, the length of the mantissa will already be 52 bits, and the exponent - 11 bits, totaling 64 bits, or 8 bytes. PostgreSQL does not support higher precision for floating point numbers.

Let's pack our number 0.1 in decimal form into both REAL and DOUBLE PRECISION. Since the sign and value of the exponent are the same, let's focus on the mantissa (I deliberately miss the non-obvious features of storing exponent values ​​and zero real values, since they make it harder to understand and distract from the essence of the problem, if interested - see the IEEE 754 standard). What will we get? In the top line, I will give the "mantissa" for the REAL type (taking into account the rounding of the last bit in 1 to the nearest representable number, otherwise it will be 0.099999 ...), and in the bottom line - for the DOUBLE PRECISION type:

0.000110011001100110011001101
0.00011001100110011001100110011001100110011001100110011001

Obviously, these are two completely different numbers! Therefore, when comparing, the first number will be padded with zeros and, therefore, will be greater than the second (taking into account rounding - marked in bold one). This explains the ambiguity in our examples. In the second example, the explicitly specified number 0.1 is cast to the DOUBLE PRECISION type, after which it is compared with a number of the REAL type. Both are cast to the same type, and we have exactly what we see above. Modify the query so that everything falls into place:

********* ЗАПРОС *********
SELECT 0.1::REAL > 0.1::DOUBLE PRECISION;
**************************
?column?
----------
t
(1 строка)

And indeed, by double casting the number 0.1 to REAL and DOUBLE PRECISION, we get the answer to the riddle:

********* ЗАПРОС *********
SELECT 0.1::REAL::DOUBLE PRECISION;
**************************

      float8       
-------------------
0.100000001490116
(1 строка)

This also explains the third example of the above. The number 123 456 789 is simple impossible to fit into 24 bits of the mantissa (23 explicit + 1 implied). The maximum integer that can fit into 24 bits is 224-1 = 16. Therefore, our number 777 is rounded up to the nearest representable 215. By changing the type to DOUBLE PRECISION, we no longer see this scenario:

********* ЗАПРОС *********
SELECT 123456789::DOUBLE PRECISION::INT;
**************************
   int4   
-----------
123456789
(1 строка)

That's all. It turns out there are no miracles. But everything described is a good reason to think about how much you really need the REAL type. Perhaps the biggest plus of its use is the speed of calculations with a known loss of accuracy. But would this be a generic scenario that would justify such frequent use of this type? Don't think.

Source: habr.com

Add a comment