Testing Posit the Grown-Up Way

Several articles have already appeared on Habr (one, two, two and a half) devoted to the new Posit floating-point format, whose authors present it as superior to the standard IEEE 754 float in every respect. The new format has also found critics (one, two), who claim that Posit's shortcomings outweigh its advantages. But what if we really do have a revolutionary new format, and the criticism is driven merely by the envy and incompetence of the critics? Well, the best way to find out is to sit down and run the numbers yourself.



Introduction



The advantages of the new format are usually demonstrated with simple examples of adding or multiplying numbers of similar magnitude, where Posit comes out one or two digits more accurate. In real calculations, however, gaining or losing one bit in a single operation hardly matters, since we compute with limited precision anyway. What matters is how error accumulates over a long sequence of operations, when after a while errors in the low digits begin to corrupt the high ones. That is what we will put to the test.



Training



For testing, I took the Posit implementation with the pretentious name from here. To compile it in Visual Studio, I had to add the line #define CLZ(n) __lzcnt(n) to util.h and replace 0.f / 0.f in posit.cpp with std::numeric_limits<float>::quiet_NaN(). Incidentally, this library provides no implementations of the elementary mathematical functions (except the square root), which is another reason to suspect something is amiss.



Test 1. Multiplication of complex numbers



The Fourier transform, computed with complex arithmetic, is used practically everywhere. I initially wanted to test Posit on the Fourier transform itself, but its accuracy depends heavily on the implementation, so a fair test would require trying all the basic algorithms, which is somewhat time-consuming. Let us therefore start with a simpler operation: multiplication of complex numbers.



If we take some vector and rotate it 360 times by 1°, we should end up with the original vector. In practice the result will differ slightly due to accumulated error, and the greater the number of turns, the greater the error. So, using this simple code



complex<T> rot(cos(a), sin(a));
complex<T> vec(length, 0);
for (long i = 0; i < count; i++) {
    vec *= rot;
}
cout << "error: " << stdev(vec.real() - length, vec.imag()) << endl;





we rotate the vector using different data types, taking as the error the root-mean-square deviation of the resulting vector from the original (which can also be interpreted as the length of the difference vector).



First, take the unit vector, the most favorable case for Posit:

iterations     4     100           1000          10,000        100,000
double         0     0             0             0             0
float          0     0.00000036    0.00001038    0.0001858     0.0001961
posit          0     0.00000073    0.00000534    0.0000411     0.0004468


There is no obvious leader here yet: each format beats the other by about a factor of two at different points. Now increase the length of the rotated vector to 1000:

iterations     4     100        1000       10,000     100,000
double         0     0          0          0          0
float          0     0.00028    0.0103     0.18       0.19
posit          0     0.00213    0.0188     0.16       2.45


The error values are almost equal. Let us go further, to a length of 1,000,000:

iterations     4     100      1000          10,000        100,000
double         0     0        0.00000002    0.00000042    0.0000036
float          0     0.33     12.0          185.8         198.1
posit          0     8.12     71.0          769.2         10706.8


Here Posit is already confidently lagging behind, and errors begin to creep even into double. Let us take an even larger length, 10^10, to fully appreciate the advantages of floating-point formats:

iterations     4             100            1000           10,000         100,000
double         0.00000245    0.00001536     0.0002041      0.0040951      0.03621497
float          0.00000245    6003.8         88111.8        1836254.0      1965083.0
posit          9216.0        1287208.7      14443543.7     202630144.4    1784050328.2


The most interesting part here is the beginning, at 4 iterations: float still gives an error comparable to double, while Posit already produces a completely wrong result.



Test 2. Calculation of a rational polynomial



Since the original library has no mathematical functions, we will have to roll our own. Many functions are poorly approximated by a Taylor series expansion and are more conveniently computed via a rational polynomial approximation. Such an approximation can be obtained in various ways, including purely analytically, via the Padé approximant. We will use it for our test, with coefficients large enough that they themselves get rounded before the calculation even starts.
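As a reminder (my notation, not from the original article): the [m/n] Padé approximant of a function f is the rational function whose Taylor expansion agrees with that of f through order m + n,

```latex
\left[m/n\right]_f(x) = \frac{P_m(x)}{Q_n(x)}, \qquad Q_n(0) = 1, \qquad
f(x) - \frac{P_m(x)}{Q_n(x)} = O\!\left(x^{m+n+1}\right),
```

where P_m and Q_n are polynomials of degrees m and n; the [11/11] approximant used below therefore reproduces the sine extremely accurately near zero.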



Using Wolfram Mathematica and the command PadeApproximant[Sin[x], {x, 0, {11, 11}}],

we obtain the following rational polynomial approximating the sine, which provides double accuracy in the range from about -2 to 2:





\frac{-\frac{481959816488503 x^{11}}{363275871831577908403200} + \frac{23704595077729 x^9}{42339845201815607040} - \frac{2933434889971 x^7}{33603051747472704} + \frac{3605886663403 x^5}{617703157122660} - \frac{109061004303 x^3}{722459832892} + x}{\frac{37291724011 x^{10}}{11008359752472057830400} + \frac{3924840709 x^8}{2016183104848362240} + \frac{101555058991 x^6}{168015258737363520} + \frac{1679739379 x^4}{13726736824948} + \frac{34046903537 x^2}{2167379498676} + 1}







For the actual computation, Horner's scheme is normally used to save on operations. In our case (obtained with HornerForm) it looks like this:



template<typename T>
T padesin(T x) {
    T xx = x * x;
    return (x * (T(363275871831577908403200.) + xx * (-T(54839355237791393068800.)
        + xx * (T(2120649063015013090560.) + xx * (-T(31712777908498486800.)
        + xx * (T(203385425766914820.) - T(481959816488503.) * xx))))))
        / (T(363275871831577908403200.) + xx * (T(5706623400804924998400.)
        + xx * (T(44454031219351353600.) + xx * (T(219578286347980560.)
        + xx * (T(707177798947620.) + T(1230626892363.) * xx)))));
}







Let us see what we get:

           x = 0.5               x = 1                 x = 2
sin(x)     0.479425538604203     0.8414709848078965    0.9092974268256817
double     0.479425538604203     0.8414709848078965    0.9092974268256816
float      0.4794255495071411    0.8414709568023682    0.9092974066734314
posit      0.4788961037993431    0.8424437269568443    0.9110429435968399


           x = 3                 x = 4                  x = 5
sin(x)     0.1411200080598672    -0.7568024953079282    -0.9589242746631385
double     0.1411200080585958    -0.7568024960833886    -0.9589243758030122
float      0.1411200165748596    -0.7568024396896362    -0.9589243531227112
posit      0.1444759201258421    -0.7614213190972805    -0.9691629931330681


As you can see, the situation with Posit here looks deplorable: it barely manages two correct significant digits.



Conclusion



Unfortunately, no miracle happened and the revolution is canceled. The advantage of Posit demonstrated on single operations is nothing more than a parlor trick, whose price is a catastrophic loss of accuracy in "heavy" real-world calculations. The only reason to use Posit instead of IEEE 754 float or fixed point is a religious one. Using a magic format whose accuracy is guaranteed by the holy faith of its creators can bring many wonders to your programs!



P.S. The source code is available for verification and criticism.



P.P.S. Continued.


