There have already been several articles on Habré (one, two, two and a half) devoted to the new Posit floating point format, whose authors present it as superior to the standard IEEE 754 float in every respect. The new format has also found critics (one, two) who claim that Posit's shortcomings outweigh its advantages. But what if we really do have a revolutionary new format, and the criticism is simply driven by the envy and incompetence of the critics? Well, the best way to find out is to sit down and run the numbers yourself.
Introduction
The advantages of the new format are usually demonstrated with simple examples: adding or multiplying numbers of similar magnitude, where Posit comes out one or two digits more accurate. In real calculations, however, a gain of plus or minus one bit in a single operation hardly matters, since we compute with limited precision anyway. What matters is the accumulation of error over a sequence of operations, when errors in the low digits eventually begin to corrupt the high ones. This is what we will try to measure.
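As a toy illustration of the effect (my own example, independent of Posit): repeatedly adding a small value in single precision quickly corrupts the leading digits, while double still holds up:

#include <iostream>

// Add 0.1 ten million times; the exact answer is 1,000,000.
// Each float addition rounds the running sum, and these tiny
// errors accumulate until they reach the leading digits.
int main() {
    float  f = 0;
    double d = 0;
    for (long i = 0; i < 10000000; i++) {
        f += 0.1f;
        d += 0.1;
    }
    std::cout << "float:  " << f << std::endl;   // visibly wrong
    std::cout << "double: " << d << std::endl;   // ~1e6 to many digits
    return 0;
}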
Preparation
For testing, I took a Posit implementation with a rather grandiose name from here. To compile it in Visual Studio, I had to add the line #define CLZ(n) __lzcnt(n) to util.h and replace 0.f/0.f in posit.cpp with std::numeric_limits<float>::quiet_NaN(). By the way, this library also turned out to lack implementations of elementary mathematical functions (except for the square root) - another reason to suspect that something is amiss.
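For reference, the two fixes look roughly like this (a sketch; the exact surrounding code in the library may differ):

// util.h - MSVC has no __builtin_clz, so map CLZ to the intrinsic:
#define CLZ(n) __lzcnt(n)

// posit.cpp - MSVC rejects the constant expression 0.f / 0.f,
// so produce the NaN explicitly:
#include <limits>
float nan_value = std::numeric_limits<float>::quiet_NaN();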
Test 1. Multiplication of complex numbers
The Fourier transform, computed with complex arithmetic, is used anywhere and everywhere. At first I wanted to test Posit on the Fourier transform itself; but since its accuracy depends quite strongly on the implementation, a fair test would require examining all the basic algorithms, which is rather time-consuming; so we can start with a simpler operation - multiplication of complex numbers.
If we take some vector and rotate it 360 times by 1°, we should end up with the same original vector. In reality the result will differ slightly due to accumulated error - and the more rotations, the larger the error. So, using this simple code
complex<T> rot(cos(a), sin(a));  // rotation by angle a
complex<T> vec(length, 0);       // the vector to be rotated
for (long i = 0; i < count; i++) {
    vec *= rot;
}
cout << "error: " << stdev(vec.real() - length, vec.imag()) << endl;
we rotate the vector using different data types, taking the error to be the root-mean-square deviation of the resulting vector from the original (which can also be interpreted as the length of the difference vector).
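The stdev helper above is simply the length of that difference vector; a minimal sketch of how it can be defined (the name comes from the snippet above; the original source may define it differently, and it assumes the number type converts to double):

#include <cmath>

// Length of the difference vector (dx, dy) between the rotated
// vector and the original - the error metric used in the test.
template<typename T>
double stdev(T dx, T dy) {
    double x = double(dx), y = double(dy);
    return std::sqrt(x * x + y * y);
}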
First, let's take a unit vector - the case most favorable to Posit:
There is no clear leader here yet - first one format, then the other, is ahead by about a factor of two. Now increase the length of the rotated vector to 1000:
The error values are almost identical. Let's go further - 1,000,000:
Here Posit is already clearly lagging behind, and error starts to creep into double as well. Let's take an even larger length, 10¹⁰, to fully appreciate the advantages of floating point formats:
The most interesting part here is at the very beginning, at 4 iterations: float still gives an error commensurate with double's, while Posit already produces a completely wrong result.
Test 2. Calculation of a rational polynomial
Since the original library had no mathematical functions, we will have to roll our own. Many functions are poorly approximated by a Taylor series expansion and are more conveniently computed via a rational polynomial approximation. Such an approximation can be obtained in various ways, including purely analytically - via the Padé approximant. We will take it for our test, and with coefficients large enough that they themselves get rounded before the calculation even starts.
Using Wolfram Mathematica and the command PadeApproximant[Sin[x], {x, 0, {11, 11}}], we obtain the following rational polynomial approximating the sine, which provides double precision in the range from roughly -2 to 2:
\frac{-\frac{481959816488503\,x^{11}}{363275871831577908403200}+\frac{23704595077729\,x^9}{42339845201815607040}-\frac{2933434889971\,x^7}{33603051747472704}+\frac{3605886663403\,x^5}{617703157122660}-\frac{109061004303\,x^3}{722459832892}+x}{\frac{37291724011\,x^{10}}{11008359752472057830400}+\frac{3924840709\,x^8}{2016183104848362240}+\frac{101555058991\,x^6}{168015258737363520}+\frac{1679739379\,x^4}{13726736824948}+\frac{34046903537\,x^2}{2167379498676}+1}
For the actual computation, Horner's scheme is normally used to save on operations. In our case (obtained with HornerForm) it looks like this:
template<typename T>
T padesin(T x) {
    T xx = x * x;
    return (x * (T(363275871831577908403200.) + xx * (-T(54839355237791393068800.)
         + xx * (T(2120649063015013090560.) + xx * (-T(31712777908498486800.)
         + xx * (T(203385425766914820.) - T(481959816488503.) * xx))))))
         / (T(363275871831577908403200.) + xx * (T(5706623400804924998400.)
         + xx * (T(44454031219351353600.) + xx * (T(219578286347980560.)
         + xx * (T(707177798947620.) + T(1230626892363.) * xx)))));
}
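To gauge how many digits each type retains, one can compare padesin against the standard sine computed in double (my own harness, not part of the article's code; a posit type can be substituted the same way as long as it converts to and from double):

#include <algorithm>
#include <cmath>
#include <cstdio>

// Maximum deviation of padesin<T> from std::sin on [-2, 2];
// roughly the machine epsilon of T when the formula holds up.
template<typename T>
void report(const char* name) {
    double worst = 0;
    for (double x = -2; x <= 2; x += 0.001) {
        double approx = double(padesin(T(x)));
        worst = std::max(worst, std::fabs(approx - std::sin(x)));
    }
    std::printf("%s: max error %.3g\n", name, worst);
}

int main() {
    report<float>("float");
    report<double>("double");
    return 0;
}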
Let's take a look:
As you can see, the situation for Posit here looks deplorable - it barely scrapes together two significant digits.
Conclusion
Unfortunately, no miracle happened, and the revolution is cancelled. The advantage Posit demonstrates on single operations is nothing more than a parlor trick, and its price is a catastrophic loss of accuracy in "heavy" real-world calculations. The only reason to use Posit instead of IEEE 754 float or fixed point is a religious one. Using a magic format whose accuracy is guaranteed by the holy faith of its creators can work many wonders in your programs!
PS
Source code for verification and criticism.
PPS
Continued.