Correct rounding of decimal numbers in binary code

One of the serious problems with decimal numbers represented in binary code is the problem of rounding a binary number to the value of a representable decimal number closest to a correctly rounded decimal number. Below we discuss this problem and give a simple algorithm for proper rounding. The operation of the algorithm is illustrated by a test program in C ++.



Recall that a representable decimal number is a number whose value can be accurately represented by binary code. So, the number 0.125 is exactly equal to the binary number 0.001. It can also be argued that the value of any binary number y is equal to some representable decimal number x . For example, the value of the binary number y = 1.001101 * 2 ^ -3 is equal to the value of the decimal representable number x = 0.150390625.

We say that the binary number y is equivalent to the decimal number x . Conversely, the decimal number x is equivalent to the binary number y .



Let's see what happens with a decimal number when it is entered from the console, or when it appears in the program as a constant.

Any decimal number is converted (converted) by the compiler into the format specified by the programmer. Since binary numbers are processed most quickly in a computer, the input decimal number is converted, most often, either to a fixed-point format (including int), or to one of the floating-point formats - single, double, or to another binary format .



According to the IEEE754 standard, real numbers in the internal format of a machine are represented in normalized binary form. So, a binary positive real number y can be represented as y = b0.b1 ... bp * 2 ^ e (b0 ≠ 0).

The same number can be represented as 0.b0b1 ... bp * 2 ^ (e + 1) (b0 ≠ 0) if e + 1> 0 and 0.b0b1 ... bp * 2 ^ e (b0 ≠ 0) if e <0 .

Further we will adhere to the second view, since in our C ++ test program below, there is a function q = frexp (x, & e), which allows us to determine the value of exponent e in the binary number represented as b0.b1 ... bp * 2 ^ e .



So, the decimal equivalent of any binary normalized number y = 0.b0b1 ... bp * 2 ^ e (b0 ≠ 0) is equal to some normalized decimal number x = 0.d0d1 ... dN ... dn * 10 ^ E (d0 ≠ 0).

For example, the number y = 0.1001101 * 2 ^ -2 is equivalent to the representable decimal number x = 0.150390625 .

To get the number Xr from x equal to the number rounded to N significant digits, X must be multiplied by a factor of 10 ^ k , where k = NE . This is necessary so that the resulting number contains an integer part with N significant digits, which can then be rounded to the nearest integer. The rounded integer must then be reduced to the previous scale by multiplying it by 10 ^ -k . Mathematically, this can be written with the following formula:

X = [x * 10 ^ k + 0.5] * 10 ^ -k = [y * 10 ^ k + 0.5] * 10 ^ -k, where [] is the integer part of the number.



It can be shown that a binary integer B containing m binary digits is equal to the value of the decimal number D , which contains n = ⌊m * log2⌋ decimal digits, provided that B <10 ^ n. And equal to n = n + 1 , provided that B ≥ 10 ^ n . In the calculations, we can take log2≈0.301 .



We determine the value of k based on the information available in the binary representation of y . In the formula for k, the rounding accuracy of N is known, so we need to determine the exponent E.

Exponent E is determined from the relation: E = ⌊e * 0.301⌋ .

It remains to take into account the following circumstance. If x * 10 ^ k = X> 10 ^ N , then you need to multiply it by 10 ^ (- 1) and adjust the coefficient k . We get X = X * 10 ^ (- 1), k = k-1 .

The rounded number will be equal to Xr = X * 10 ^ (- k) .



As a result, we obtain the following simple algorithm for the correct decimal rounding of binary real numbers. The algorithm is suitable for any binary number format with an arbitrarily specified decimal rounding precision N.

At the entrance:

Decimal rounding accuracy N ;

Is a binary number in the format y = 0.b0b1 ... bp * 2 ^ e (b0 ≠ 0) .

Output: Rounded decimal number X = 0.d0d1 ... dN * 10 ^ E.

- 1. Determine the exponent e of the binary number y;

2. E = ⌊e * 0.3⌋;

3. k = NE;

4. X = x * 10 ^ k;

5. If X <10 ^ N, then item 8;

6. X = X * 10 ^ -1;

7. k = k-1;

8. Xr = [X + 0.5] * 10 ^ (- k);

End.

- In the above algorithm, the number Xr is the representable decimal number closest to the number, which is the correct rounding of the number x , which in turn is the decimal equivalent of the number y .

Since we are used to working with decimal numbers, then, as a rule, the input stream is exactly decimal numbers. At the output, we want to get decimal numbers too. For example, as in Excel. But, after converting decimal numbers to binary code, they, as a rule, are irreversibly transformed. As a result of this, rounding converted to binary numbers may not coincide with the correct rounding of numbers printed on the console. This applies primarily to cases where we try to round a decimal number to the maximum possible number of significant digits. For single, this is 7, and for double it is 15.

This can be well investigated on the test program below, which the author wrote in C ++.



In the test program, upon request, the following is entered into the console:

- the accuracy of N decimal rounding of the number X , which is the nearest representable number of the binary equivalent of the number x ;

Is the number x in arbitrary form.



Under the entered decimal number x, the number X is printed, which is the representable number closest to x (see screenshot below).

Rounding is performed on the number X , because the exact value of x is lost in binary conversion.

Returns:

- decimal number in the format Xr = M * 10 ^ (N + e), where M is an integer decimal number with N significant digits;

Is the number xr , which is equal to the representable number closest to the normalized binary equivalent of the number X.

image

In the screenshot:



N = 15 - the number of decimal significant digits to which the input decimal number is rounded.

x = 7.123456789098765321 e-89 is the decimal number that we would like to round to 15 significant digits.

X = 7.12345678909876559 e-089 - a representable decimal number, the value of which is equal to a binary number, which is obtained by converting the number x to the format p = 53.

Xr = x = 712345678909877e-103 - integer mantissa number obtained by rounding the number X.

xr = x = 7.12345678909877e-089 - the number obtained by normalizing the binary equivalent of the decimal number Xr. It is the closest to Xr.



Below is the code of the test program for the correct rounding of decimal numbers represented in binary code in C ++.



#include <iostream> #include <math.h> #include <stdlib.h> #include <iomanip> using namespace std; int main() { double q,x,xr,X; unsigned long long int Xr; int N,p,E,e,k; cout <<"Input a binary precision p="; cin>>p; cout <<"Input a decimal precision N="; cin>>N; cout <<endl<<"Input a number and press ENTER:"<<"\n"<<"x= "; cin>>x; cout<<"X= "<< setprecision(18)<<x << '\n'; q=frexp (x, &e); E =static_cast <int> (e*0.301); k=NE; if (E<0) //for format xr=d0.d1...dN*10^E (d0≠0). k=k+1; X=x*pow(10,k); if (X > pow (10,N)){ X=X/10; k=k-1; } X=X+0.5; Xr=static_cast <unsigned long long int> (X); xr=Xr*pow(10,-k); cout<<endl <<"Xr= "<<Xr<<"e"<<-k<<'\n'; cout<<"xr="<<xr<<'\n'; system("pause"); return 0; }
      
      







The program uses the rather expensive pow (10, k) function. But, since the number of values ​​of k is small, it is quite possible to use a table of pre-calculated values ​​of 10 ^ k , or better 5 ^ k .



All Articles