Is Richard Hendricks stupid, or linear search versus binary search







I think there are fans of the Silicon Valley series here on Habr. This week, for the first time in all six seasons, the show displayed a sizable piece of code on screen, and of course I immediately want to discuss it here.







Wanting to humiliate the protagonist, Richard Hendricks, his former boss shows a fragment of Richard's old code at a meeting. In it, a linear search is applied to data that is already sorted: the task gets done, but it looks very inefficient.







Richard himself does not dispute that the code is bad. However, among the show's viewers his decision unexpectedly found defenders, and now I wonder what Habr thinks of their position.







The code snippet in question looks like this:







int index = 0;
while (!element.equals(sortedList.get(index)) && sortedList.size() > ++index);
return index < sortedList.size() ? index : -1;





Here the linear search checks each element of a sorted list in turn until it reaches the one it is looking for. On sorted data, binary search is usually preferred: at each step it splits the set in half and discards the unsuitable half entirely, so as the amount of data grows, the number of iterations in a linear search grows much faster than in a binary one (a binary-search counterpart to Richard's fragment is sketched a little further down). But in the subreddit r/SiliconValleyHBO the following comment appeared:







“I want to nerd out a little and point out that Richard's ‘mistake’ of using linear search instead of binary search on sorted data actually turns out to be more performant in many cases. With giant datasets (I think the threshold is in the millions of elements) binary search is faster. But in general, if your dataset is not gigantic, a linear search is friendlier to the CPU cache, plays better with the branch predictor, and the algorithm can also be vectorized. A linear search takes more iterations, but each of them is insanely faster than a binary search iteration. This is counterintuitive and contradicts everything you were taught at university, but that's how it is.

This talk is very interesting and shows some amazing results from real performance measurements.”

And other members of the thread backed the commenter up: yes, in theory all iterations are equivalent, but on real hardware with real optimizations things look completely different. The show's creator, Mike Judge, worked in the Valley back in the 80s, they say, when all these L1 caches and branch predictors were not such a big factor and CPU behavior was much closer to the idealized model, which is why the series presents an example like this.
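For comparison, a binary-search counterpart to Richard's fragment could look roughly like the sketch below. This is my own illustration, not code from the show: it assumes the list elements implement Comparable, that sortedList is sorted in ascending order, and it leans on the standard java.util.Collections.binarySearch.

import java.util.Collections;
import java.util.List;

// Sketch only: a binary-search counterpart to the fragment above, assuming the
// elements implement Comparable and sortedList is sorted in ascending order.
static <T extends Comparable<? super T>> int indexOf(List<T> sortedList, T element) {
    int index = Collections.binarySearch(sortedList, element);
    // binarySearch returns (-(insertion point) - 1) when the element is absent;
    // normalize that to -1 to match the behavior of the original fragment.
    return index >= 0 ? index : -1;
}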







To me, as the comment itself admits, this all sounds counterintuitive, but I became curious whether Richard could be right. Of course, it does not help that the series never gives the full context: for example, we have no idea how much data was being iterated over. On the one hand, Richard was working at the Internet giant Hooli at the time, where he probably had to deal with millions of records; on the other hand, it was his first day on the job, and he may not have been thrown at millions of records right away. So let's pose the question this way: even if binary search is clearly better for most tasks at Hooli, is it plausible that Richard made the right decision for his specific conditions, and the other characters laugh at him in vain, not knowing the context?







To figure this out, I opened the talk cited on Reddit. As promised, it turned out to be interesting (not surprising, given that it is a talk by Andrei Alexandrescu), but after watching part of it and skimming through the rest, I did not find any comparative measurements of binary versus linear search there.







But I remembered that the same Alexandrescu had also talked about performance at our DotNext conference. I opened the text version of his talk that we had prepared for Habr and searched for the word "linear". It turned out that, among other things, it gives exactly one curious scenario in which linear search is much more effective than binary (finding the matching elements of two sets in the case when the sets are identical), but this is a very specific case from which no general conclusion like "linear search is underrated" follows.
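To make the scenario a bit more concrete, here is my own rough reconstruction of the idea, not code from the talk: intersecting two sorted lists with a single merge-style linear pass. When the two lists are identical (or overlap heavily), this walks each list once, whereas "binary-search every element of one list in the other" pays a logarithmic cost per element for no benefit.

import java.util.ArrayList;
import java.util.List;

// My own sketch of the "matching elements of two identical sets" scenario:
// a linear, merge-style intersection of two sorted lists.
static List<Integer> intersectSorted(List<Integer> a, List<Integer> b) {
    List<Integer> result = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        int cmp = a.get(i).compareTo(b.get(j));
        if (cmp == 0) {        // match found, advance both sides
            result.add(a.get(i));
            i++;
            j++;
        } else if (cmp < 0) {  // a is behind, advance it
            i++;
        } else {               // b is behind, advance it
            j++;
        }
    }
    return result;
}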







I googled what the modern Internet has to say about this, but mostly found Stack Overflow answers that simply say "use binary search, fewer iterations". There were also attempts to benchmark it, but those did not look very convincing to me either.







Here, of course, the obvious option is "benchmark it yourself and see everything with your own eyes on real hardware".
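If someone does take up that challenge, a skeleton of such a measurement in Java could look roughly like the sketch below (JMH; the class name, the sizes, and the choice to always search for the last element are my own assumptions). As the next paragraph explains, getting the methodology right is the hard part, so treat this only as a starting point.

import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// A rough JMH skeleton for the comparison, not a ready-made methodology:
// the sizes and the "search for the last element" choice would all need to
// be justified (and varied) in a real benchmark.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class SearchBenchmark {

    @Param({"16", "256", "4096", "65536", "1048576"})
    int size;

    int[] sorted;
    int key;

    @Setup
    public void setup() {
        sorted = new int[size];
        for (int i = 0; i < size; i++) {
            sorted[i] = i * 2;      // sorted, no duplicates
        }
        key = sorted[size - 1];     // worst case for the linear scan
    }

    @Benchmark
    public int linear() {
        // Same shape as Richard's fragment, but over a plain int array.
        int index = 0;
        while (sorted[index] != key && sorted.length > ++index);
        return index < sorted.length ? index : -1;
    }

    @Benchmark
    public int binary() {
        return Arrays.binarySearch(sorted, key);
    }
}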







But if my visits to DotNext have taught me anything from the two performance-minded Andreis (Alexandrescu and Akinshin), it is a realization of how often people benchmark incorrectly and how much they fail to take into account. So my confidence in random Internet posts with benchmarks is low, and my confidence in myself is even lower.







Fortunately, there are people on Habr who understand this far better than I do (for example, the same Andrey DreamWalker Akinshin, who has written an entire book about benchmarking). So if you know the subject, please tell us in the comments how things really are. Up to what data size can a linear approach still be a good choice? How likely is it that Richard did everything right, even if he himself is not prepared to defend it?







And if no knowledgeable commenters show up, I will have to handcuff Akinshin to a radiator at the next DotNext and get that benchmark out of him.







