Modern processors are superscalar, that is, they can execute several instructions simultaneously: some process four to six instructions per cycle. Moreover, many such processors execute instructions out of order: they can start working on instructions that appear much later in the code.
At the same time, code often contains branches (if-then). Branches are usually implemented as conditional jumps: the processor either jumps to instructions further down in the code or continues along the current path.
Superscalar, out-of-order execution makes branches difficult to handle. That is why processors include sophisticated branch predictors: the processor tries to anticipate the future. When it sees a branch, that is, a jump, it tries to guess which way the program will go.
Very often this works quite well. For example, most loops are implemented with branches: at the end of each iteration, the processor must predict whether another iteration will follow. It is often safest for the processor to predict that the loop will continue (forever); then it mispredicts only one branch per loop.
There are other common examples. When you access an element of an array, many programming languages add bounds checking: a hidden check that the index is valid before the value is accessed. If the index is invalid, an error is raised; otherwise the code continues as usual. Bounds checks are predictable because, in normal operation, every access should be valid. Consequently, most processors predict their outcome almost perfectly.
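As an illustration, a manual bounds check in C++ might look like the sketch below. The function name and setup are mine, not taken from any particular language runtime; the point is that the branch almost always goes the same way ("index is valid"), so the predictor learns it nearly perfectly:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// A checked read, like the hidden check many languages insert automatically.
// In normal operation the branch below is almost never taken, so the
// branch predictor learns it nearly perfectly.
int checked_read(const std::vector<int>& arr, std::size_t index) {
    if (index >= arr.size()) {            // bounds check: almost never taken
        throw std::out_of_range("index out of bounds");
    }
    return arr[index];                    // the common, well-predicted path
}
```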
What happens when a branch is hard to predict?
Inside the processor, all instructions that were executed along the incorrectly predicted path must be cancelled, and the computation must restart from the branch. We should expect each branch misprediction to cost more than 10 cycles, which can significantly increase a program's running time.
Let's look at simple code that writes random integers into an output array:
    while (howmany != 0) {
        out[index] = random();
        index += 1;
        howmany--;
    }
We can generate a suitable random number in about 3 cycles on average. The total latency of the random number generator may be 10 cycles, but our processor is superscalar, so it can work on several random-number computations at once. Therefore we can produce a new random number approximately every 3 cycles.
Let's change the function a bit so that only odd numbers are written to the array:
    while (howmany != 0) {
        val = random();
        if ((val & 1) == 1) {  // val is an odd integer
            out[index] = val;
            index += 1;
        }
        howmany--;
    }
You might naively expect this new function to be faster: after all, on average it writes only one of every two integers. True, there is a branch in the code, but checking the parity of an integer only requires inspecting a single bit.
I benchmarked these two functions in C++ on a Skylake processor:
Write all random numbers | 3.3 cycles per integer |
Write only odd random numbers | 15 cycles per integer |
The second function runs about five times slower!
Can we fix this? Yes: we can simply eliminate the branch. An integer is odd exactly when its bitwise AND with 1 equals 1. The trick is to always write the value but increment the array index only when the random value is odd.
    while (howmany != 0) {
        val = random();
        out[index] = val;
        index += (val & 1);
        howmany--;
    }
In this new version, we always write a random value to the output array, even when it is not needed. At first glance this looks like wasted work. However, it frees us from mispredicted branches. In practice, performance is almost the same as for the original code, and much better than for the branchy version:
Write all random numbers | 3.3 cycles per integer |
Write only odd random numbers | 15 cycles per integer |
With the branch eliminated | 3.8 cycles per integer |
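For concreteness, here is a sketch of the three variants as C++ functions. The xorshift generator standing in for random(), and all the names, are my own assumptions, not the author's benchmark code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A simple xorshift generator standing in for random(); an assumption,
// not the generator from the original benchmark.
static uint64_t rng_state = 88172645463325252ULL;
uint64_t next_random() {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

// Variant 1: write every random value; no data-dependent branch.
std::size_t write_all(std::vector<uint64_t>& out, std::size_t howmany) {
    std::size_t index = 0;
    while (howmany != 0) {
        out[index] = next_random();
        index += 1;
        howmany--;
    }
    return index;
}

// Variant 2: branchy -- write only the odd values.
std::size_t write_odd_branchy(std::vector<uint64_t>& out, std::size_t howmany) {
    std::size_t index = 0;
    while (howmany != 0) {
        uint64_t val = next_random();
        if ((val & 1) == 1) {  // hard to predict: val is random
            out[index] = val;
            index += 1;
        }
        howmany--;
    }
    return index;
}

// Variant 3: branchless -- always write, advance the index only when odd.
std::size_t write_odd_branchless(std::vector<uint64_t>& out, std::size_t howmany) {
    std::size_t index = 0;
    while (howmany != 0) {
        uint64_t val = next_random();
        out[index] = val;      // may be overwritten on the next iteration
        index += (val & 1);
        howmany--;
    }
    return index;
}
```

Both odd-only variants produce identical output; only the branch behavior differs.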
Could the compiler solve this problem on its own? In general, no. Compilers can sometimes eliminate branches entirely even when the source code contains an if-then: for example, a branch can sometimes be replaced with a "conditional move" or other arithmetic tricks. However, such tricks are not always safe for the compiler to apply, so it cannot rely on them in general.
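For example, a compiler will often lower a simple ternary expression to a conditional-move instruction (cmov on x86-64) instead of a branch, though it is not obliged to. A hypothetical sketch, with names of my own choosing:

```cpp
#include <cstdint>

// Two equivalent ways to express "keep the larger value". The ternary
// form is frequently compiled to a conditional move rather than a branch,
// but the transformation is at the compiler's discretion.
int64_t max_branchy(int64_t a, int64_t b) {
    if (a > b) return a;     // may become a real branch
    return b;
}

int64_t max_cmov_friendly(int64_t a, int64_t b) {
    return (a > b) ? a : b;  // often lowered to cmov on x86-64
}
```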
The important takeaway: mispredicted branches are not a minor concern; they can have a major impact.
My source code is on GitHub.
Writing benchmarks is hard: processors learn to predict branches
[Translator's note: this part was a separate article by the author, but I have merged it with the previous one since they share a theme.]
In the previous part, I showed that a large share of a program's running time can be due to branch mispredictions. My benchmark wrote 64 million random integers to an array; when I tried to write only the odd random numbers, performance dropped sharply because of mispredictions.
Why did I use 64 million integers rather than, say, 2000? If you run just one trial, it does not matter. But what happens if we run many trials over the same data? The number of mispredicted branches quickly falls toward zero. The results on an Intel Skylake processor speak for themselves:
Trial | Mispredicted branches (Intel Skylake) |
---|---|
1 | 48% |
2 | 38% |
3 | 28% |
4 | 22% |
5 | 14% |
With more trials, the "training" continues: gradually, the share of mispredicted branches drops to about 2%.
That is, if we keep measuring the time taken by the same task, it gets smaller and smaller, because the processor learns to predict the outcomes better. How well it "trains" depends on the specific processor model, but newer processors can be expected to train better.
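A minimal sketch of such a repeated measurement, assuming std::chrono for timing and an explicit branch on parity (the function names are mine, not the author's):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// The data-dependent branch: count odd values with an explicit if.
// Until the predictor trains on this exact input, the branch is
// mispredicted about half the time.
std::size_t count_odd_branchy(const std::vector<uint64_t>& data) {
    std::size_t count = 0;
    for (uint64_t v : data) {
        if ((v & 1) == 1) {
            count++;
        }
    }
    return count;
}

// Time one pass over the same input. Repeated calls on identical data
// let the branch predictor memorize the pattern, so later trials tend
// to run faster.
long long time_trial_ns(const std::vector<uint64_t>& data) {
    auto start = std::chrono::steady_clock::now();
    volatile std::size_t sink = count_odd_branchy(data);  // keep the work alive
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

Calling time_trial_ns in a loop over the same small array and printing each result reproduces the training effect described above.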
The latest AMD server processors learn to predict the branch almost perfectly (to within 0.1%) in fewer than 10 trials.
Trial | Mispredicted branches (AMD Rome) |
---|---|
1 | 52% |
2 | 18% |
3 | 6% |
4 | 2% |
5 | 1% |
6 | 0.3% |
7 | 0.15% |
8 | 0.15% |
9 | 0.1% |
This near-perfect prediction on AMD Rome disappears when the number of values in the task is increased from 2000 to 10,000: the best misprediction rate rises from 0.1% to 33%.
You should probably avoid benchmarking branchy code on small inputs.
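One way to keep the predictor from memorizing a small benchmark is to regenerate the random input before every trial, so the branch pattern differs each time. A sketch, with a helper name of my own invention:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Refill the input with fresh random values before each trial, so the
// parity pattern changes every time and the branch predictor cannot
// simply memorize the outcomes of a small, fixed input.
void refill(std::vector<uint64_t>& data, std::mt19937_64& gen) {
    for (auto& v : data) {
        v = gen();
    }
}
```

Between trials, call refill on the benchmark's input; the measured misprediction rate then stays close to the untrained figure regardless of how many trials you run.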
My code is on GitHub.
Acknowledgment: the AMD Rome values were provided by Velu Erwan.
Further reading: A case for (partially) TAgged GEometric history length branch prediction (Seznec et al.)