How not to make the fastest strlen and find a flaw in Visual Studio 2019 Community

This post was prompted by an article about using the "strange" popcount instruction in modern processors. It is not about counting set bits, though, but about detecting the end-of-string marker of a C string (a null-terminated string).

A null-terminated string is a way of representing strings in programming languages where, instead of introducing a dedicated string type, an array of characters is used, and the end of the string is marked by the first special null character encountered (NUL from the ASCII table, with the value 0).
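A minimal illustration (my own example, not from the original article): the string "cat" occupies four bytes in memory, the last of which is the terminating zero.

#include <cassert>
#include <cstring>

int main()
{
    const char s[] = "cat";      // stored in memory as { 'c', 'a', 't', '\0' }
    assert(sizeof(s) == 4);      // the array includes the terminating NUL
    assert(strlen(s) == 3);      // strlen does not count the terminator
    return 0;
}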


To determine the length of such a string, the standard strlen function is used:



size_t __cdecl strlen(char const* str)





Its algorithm can be described in C as:



size_t strlen_algo(const char* str)
{
    size_t length = 0;
    while (*str++)
        length++;
    return length;
}





Let's see what the MS Visual Studio 2019 Community compiler (Release, x86) turns it into:



08811F7h:  mov     al,byte ptr [ecx]
           inc     ecx
           test    al,al
           jne     main+0D7h (08811F7h)





That is, one byte is loaded from memory and compared with zero. The same code is substituted for strlen calls if you build the project in Release; the algorithm is correct, but its speed, it seems to me, is not sufficient. What happens if you compile code with a call to the standard strlen in Debug? The library strlen function will be called, as expected, but that function was written by hand in assembler.



Code spoiler of the standard strlen function in MS Visual Studio
;strlen.asm - contains strlen() routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       strlen returns the length of a null-terminated string,
;       not including the null byte itself.
;
;*******************************************************************************

        .xlist
        include cruntime.inc
        .list

page
;***
;strlen - return the length of a null-terminated string
;
;Purpose:
;       Finds the length in bytes of the given string, not including
;       the final null character.
;
;       Algorithm:
;       int strlen (const char * str)
;       {
;           int length = 0;
;
;           while( *str++ )
;               ++length;
;
;           return( length );
;       }
;
;Entry:
;       const char * str - string whose length is to be computed
;
;Exit:
;       EAX = length of the string "str", exclusive of the final null byte
;
;Uses:
;       EAX, ECX, EDX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

        public  strlen

strlen  proc \
        buf:ptr byte

        OPTION PROLOGUE:NONE, EPILOGUE:NONE

        .FPO    ( 0, 1, 0, 0, 0, 0 )

string  equ     [esp + 4]

        mov     ecx,string              ; ecx -> string
        test    ecx,3                   ; test if string is aligned on 32 bits
        je      short main_loop

str_misaligned:
        ; simple byte loop until string is aligned
        mov     al,byte ptr [ecx]
        add     ecx,1
        test    al,al
        je      short byte_3
        test    ecx,3
        jne     short str_misaligned

        add     eax,dword ptr 0         ; 5 byte nop to align label below

        align   16                      ; should be redundant

main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes
        mov     edx,7efefeffh
        add     edx,eax
        xor     eax,-1
        xor     eax,edx
        add     ecx,4
        test    eax,81010100h
        je      short main_loop
        ; found zero byte in the loop
        mov     eax,[ecx - 4]
        test    al,al                   ; is it byte 0
        je      short byte_0
        test    ah,ah                   ; is it byte 1
        je      short byte_1
        test    eax,00ff0000h           ; is it byte 2
        je      short byte_2
        test    eax,0ff000000h          ; is it byte 3
        je      short byte_3
        jmp     short main_loop         ; taken if bits 24-30 are clear and bit
                                        ; 31 is set

byte_3:
        lea     eax,[ecx - 1]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_2:
        lea     eax,[ecx - 2]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_1:
        lea     eax,[ecx - 3]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_0:
        lea     eax,[ecx - 4]
        mov     ecx,string
        sub     eax,ecx
        ret

strlen  endp

        end







Table 1 - runtime of the strlen benchmark in seconds (MS VS 2019 Community, C++ cl version 19.22.27905)



Configuration | Big block, 1K | Big block, 1K, * strlen call | Small block, 10 elements | Small block, 10 elements, * strlen call
Debug, x86    | 7.25          | 7.25                         | 3.06                     | 3.06
Release, x86  | 9.0           | 3.9                          | 0.15                     | 0.12
Debug, x64    | 6.0           | 6.0                          | 3.4                      | 3.4
Release, x64  | 8.5           | 2.3                          | 0.15                     | 0.11


* forcing the compiler to call the strlen library function



Thus, we can conclude that the MS VS compiler's substitution of a byte-by-byte comparison is inefficient even on small strings, and on large strings Debug comes out ahead of Release!



Bench Code
#include <iostream>
#include <string>
#include <vector>
#include <chrono>
#include <cstdlib>       // srand, rand (added for completeness)
#include <cstdint>       // intptr_t (added for completeness)
#include <nmmintrin.h>

inline size_t strlen_algo(const char* str)
{
    size_t length = 0;
    while (*str++)
        length++;
    return length;
}

inline size_t strlen_sse4(const char* str)
{
    size_t length = 0;
    int res = 0;
    // align to 16 bytes
    while (((size_t(str + length)) & (sizeof(__m128i) - 1)) != 0)
    {
        if (str[length] == 0)
            return length;
        length++;
    }
    __m128i z128 = _mm_setzero_si128();
    for (; ; length += 16)
    {
        __m128i data = _mm_load_si128((__m128i*)(str + length));
        if ((res = _mm_cmpistri(z128, data, _SIDD_CMP_EQUAL_EACH)) != 16)
            break;
    }
    /*while (str[length])
        length++;*/
    return length + res;
}

#define _DISABLE_ASM_BSF

// https://www.strchr.com/sse2_optimised_strlen
#ifndef WORDS_BIGENDIAN
#if 0
#elif 0
#else
static inline int count_bits_to_0(unsigned int x) // counting trailing zeroes, by Nazo, post: 2009/07/20 03:40
{
    // this is current winner for speed
    static const unsigned char table[256] =
    {
        7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
    };
    if ((unsigned char)x)
        return table[(unsigned char)x];
    return table[x >> 8] + 8; // t[x / 256] + 8
}
#endif
#else
#if 0
static inline int count_bits_to_0(unsigned int x) // counting trailing zeroes
{
    register int i = 0;
    if (!(x & (1 << 15))) i++; else return i;
    if (!(x & (1 << 14))) i++; else return i;
    if (!(x & (1 << 13))) i++; else return i;
    if (!(x & (1 << 12))) i++; else return i;
    if (!(x & (1 << 11))) i++; else return i;
    if (!(x & (1 << 10))) i++; else return i;
    if (!(x & (1 << 9)))  i++; else return i;
    if (!(x & (1 << 8)))  i++; else return i;
    if (!(x & (1 << 7)))  i++; else return i;
    if (!(x & (1 << 6)))  i++; else return i;
    if (!(x & (1 << 5)))  i++; else return i;
    if (!(x & (1 << 4)))  i++; else return i;
    if (!(x & (1 << 3)))  i++; else return i;
    if (!(x & (1 << 2)))  i++; else return i;
    if (!(x & (1 << 1)))  i++; else return i;
    if (!(x & (1 << 0)))  i++;
    return i;
}
#else
static inline int count_bits_to_0(unsigned int x) // counting trailing zeroes
{
    // http://www.hackersdelight.org/: nlz1() shortened for 16-bit mask
    register int n = 0;
    if (x <= 0x000000FFU) { n = n + 8; x = x << 8; }
    if (x <= 0x00000FFFU) { n = n + 4; x = x << 4; }
    if (x <= 0x00003FFFU) { n = n + 2; x = x << 2; }
    if (x <= 0x00007FFFU) { n = n + 1; }
    return n;
}
#endif
#endif

size_t strlen2(const char* str)
{
    register size_t len = 0;
    // align to 16 bytes
    while ((((intptr_t)str) & (sizeof(__m128i) - 1)) != 0)
    {
        if (*str++ == 0)
            return len;
        ++len;
    }
    // search for 0
    __m128i xmm0 = _mm_setzero_si128();
    __m128i xmm1;
    int mask = 0;
    for (;;)
    {
        xmm1 = _mm_load_si128((__m128i*)str);
        xmm1 = _mm_cmpeq_epi8(xmm1, xmm0);
        if ((mask = _mm_movemask_epi8(xmm1)) != 0)
        {
            // got 0 somewhere within 16 bytes in xmm1, or within 16 bits in mask
            // find index of first set bit
#ifndef _DISABLE_ASM_BSF // define it to disable ASM
#if (_MSC_VER >= 1300) // make sure <intrin.h> is included
            unsigned long pos;
            _BitScanForward(&pos, mask);
            len += (size_t)pos;
#elif defined(_MSC_VER) // earlier MSVC's do not have _BitScanForward, use inline asm
            __asm bsf edx, mask     ; edx = bsf(mask)
            __asm add edx, len      ; edx += len
            __asm mov len, edx      ; len = edx
#elif ((__GNUC__ >= 4) || ((__GNUC__ == 3) && (__GNUC_MINOR__ >= 4))) // modern GCC has built-in __builtin_ctz
            len += __builtin_ctz(mask);
#elif defined(__GNUC__) // older GCC shall use inline asm
            unsigned int pos;
            asm("bsf %1, %0" : "=r" (pos) : "rm" (mask));
            len += (size_t)pos;
#else // none of choices exist, use local BSF implementation
            len += count_bits_to_0(mask);
#endif
#else
            len += count_bits_to_0(mask);
#endif
            break;
        }
        str += sizeof(__m128i);
        len += sizeof(__m128i);
    }
    return len;
}

int main()
{
    std::vector<std::string> vstr;
    const int str_num = 1024;
    const int str_size = 1024;
    size_t len_result = 0;

    srand(0);
    for (int i = 0; i < str_num; i++)
    {
        std::string str1;
        for (int j = 0; j < str_size; j++)
        {
            str1.push_back('0' + rand() % 78);
        }
        vstr.push_back(std::move(str1));
    }

    auto strlen_func = strlen;
    //auto strlen_func = strlen_algo;
    //auto strlen_func = strlen_sse4;
    //auto strlen_func = strlen2;

    auto time_std = std::chrono::steady_clock::now();

    for (int k = 0; k < 10 * 1000; k++)
    {
        for (int i = 0; i < str_num; i++)
        {
            const char* str_for_test = vstr[i].c_str();
            len_result += strlen_func(str_for_test);
            //len_result += strlen(str_for_test);
        }
        for (int i = 0; i < str_num; i++)
        {
            const char* str_for_test = vstr[i].c_str();
            len_result -= strlen_func(str_for_test);
            //len_result -= strlen(str_for_test);
        }
    }

    auto finish = std::chrono::steady_clock::now();
    double elapsed_seconds = std::chrono::duration_cast<std::chrono::duration<double>>(finish - time_std).count();

    std::cout << "\n" << len_result;
    std::cout << "\n\nTime: " << elapsed_seconds;
    return 0;
}







The line
 len_result += strlen(str_for_test);
      
      



compiles

in Debug into a call to the strlen library function;

in Release into an inline byte-by-byte comparison.



If we comment this line out and instead write



 len_result += strlen_func(str_for_test);
      
      





Where



 auto strlen_func = strlen;
      
      





we will force the compiler to always call the library function.
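Putting those two lines together, here is a minimal self-contained sketch of the trick (my own consolidation of the benchmark snippets above): in these measurements, the call through a function pointer is what makes MSVC emit a real call to the library strlen instead of its inline expansion.

#include <cstdio>
#include <cstring>

int main()
{
    const char* str_for_test = "an example string";

    // Direct call: in a Release build the compiler is free to replace this
    // with its own inline byte-by-byte expansion.
    size_t direct = strlen(str_for_test);

    // Call through a function pointer: the compiler emits a call
    // to the library strlen.
    size_t (*strlen_func)(const char*) = strlen;
    size_t forced = strlen_func(str_for_test);

    std::printf("%zu %zu\n", direct, forced);
    return 0;
}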



How did the library function achieve its 2.3x speedup over the byte-by-byte comparison (for Release, x86, 1K)?



By comparing not one byte at a time but four at once. All the magic is here:



main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes
        mov     edx,7efefeffh
        add     edx,eax
        xor     eax,-1
        xor     eax,edx
        add     ecx,4
        test    eax,81010100h
        je      short main_loop





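The same trick can be written out in C to see why it works (my reconstruction for illustration; the real code is the assembly above): adding the magic constant 0x7EFEFEFF makes a carry propagate out of every non-zero byte, and the two XORs expose the byte positions that produced no carry. If the result masked with 0x81010100 is non-zero, the word may contain a zero byte; the only possible false positive concerns the highest byte, which is why the assembly re-checks the individual bytes afterwards.

#include <cassert>
#include <cstdint>

// Sketch of the 4-byte scan used by the CRT strlen (reconstruction for
// illustration). Returns true if the 32-bit word *may* contain a zero byte;
// a false positive is possible only for the highest byte, so the caller
// still re-checks the individual bytes, exactly as the assembly does.
static bool may_contain_zero_byte(uint32_t x)
{
    uint32_t y = x + 0x7EFEFEFFu;   // carries run out of every non-zero byte
    uint32_t t = (~x) ^ y;          // byte positions with no carry stay set
    return (t & 0x81010100u) != 0;
}

int main()
{
    assert(!may_contain_zero_byte(0x41424344u));  // "DCBA" - no zero byte
    assert(may_contain_zero_byte(0x41420044u));   // zero byte in the middle
    assert(may_contain_zero_byte(0x00414243u));   // zero in the highest byte
    return 0;
}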
Is it possible to do this faster using the vector instructions of modern processors? Let's try.



Using the Intel Intrinsics Guide, we find the SSE4.2 intrinsic _mm_cmpistri, designed precisely for working with strings. It takes two 128-bit vectors and an operation mode as input. As the mode we use _SIDD_UBYTE_OPS = 0 (the element type) and _SIDD_CMP_EQUAL_EACH = 8 (element-by-element equality comparison), and we compare against a zero vector. The return value is the index of the first element at which the comparison fires, counting from the left; in our case that is the index of the first zero byte, or 16 if there is none (I will be glad if someone confirms this behavior).
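A quick sanity check of that reading of the documentation (my own snippet, not part of the original article; build with SSE4.2 enabled):

#include <cassert>
#include <nmmintrin.h>

int main()
{
    alignas(16) char buf[16] = "abc";          // bytes: 'a' 'b' 'c' '\0' ...
    __m128i zero = _mm_setzero_si128();
    __m128i data = _mm_load_si128((__m128i*)buf);

    int idx = _mm_cmpistri(zero, data, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
    assert(idx == 3);                          // index of the first zero byte

    alignas(16) char full[16] = { 'x','x','x','x','x','x','x','x',
                                  'x','x','x','x','x','x','x','x' };
    data = _mm_load_si128((__m128i*)full);
    idx = _mm_cmpistri(zero, data, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
    assert(idx == 16);                         // no zero byte among the 16
    return 0;
}

With that behavior in hand, the SSE4.2 version of strlen looks like this: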



size_t length = 0;
int res = 0;
// align to 16 bytes
while (((size_t(str + length)) & (sizeof(__m128i) - 1)) != 0)
{
    if (str[length] == 0)
        return length;
    length++;
}
__m128i z128 = _mm_setzero_si128();
for (; ; length += 16)
{
    __m128i data = _mm_load_si128((__m128i*)(str + length));
    if ((res = _mm_cmpistri(z128, data, _SIDD_CMP_EQUAL_EACH)) != 16)
        break;
}
/*while (str[length])
    length++;*/
return length + res;





The code



while (((size_t(str + length)) & (sizeof(__m128i) - 1)) != 0)
{
    if (str[length] == 0)
        return length;
    length++;
}





is used to align the address of the string being loaded: most SSE instructions require addresses that are multiples of 16. For the pcmpistri instruction we use, alignment is not strictly required, and no access exception is thrown on an unaligned operand.

The intrinsic

__m128i data = _mm_load_si128((__m128i*)(str + length));
if ((res = _mm_cmpistri(z128, data, _SIDD_CMP_EQUAL_EACH)) != 16)





is compiled into:

 pcmpistri xmm1,xmmword ptr [eax+edx],8
      
      





However, aligning to 16 is still useful in our case: it gives a small speedup, and it guarantees that a 16-byte read will never stray into a potentially unmapped page (4K memory page).
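A minimal sketch of that last argument (my own illustration): because 4096 is a multiple of 16, a 16-byte-aligned block can never straddle a page boundary, so as long as at least one byte of the block belongs to the string, the whole read touches only mapped memory.

#include <cassert>
#include <cstdint>

// For a 16-byte-aligned address p, the bytes [p, p + 16) all live in the
// same 4096-byte page.
static bool aligned_16_read_stays_in_page(std::uintptr_t p)
{
    assert(p % 16 == 0);
    return (p / 4096) == ((p + 15) / 4096);
}

int main()
{
    for (std::uintptr_t p = 0; p < 3 * 4096; p += 16)
        assert(aligned_16_read_stays_in_page(p));
    return 0;
}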



The loop

 while (str[length]) length++;
      
      





"Gets" the size of the string if its end is detected (since I'm not completely sure about the _mm_cmpistri algorithm of operation).

deleted after picul comment, which gave an increase on lines of small size.



Did we make the fastest strlen? Unfortunately, no: the folks at https://www.strchr.com/sse2_optimised_strlen did it even faster, and without using SSE4.2 (their approach is the strlen2 function in the benchmark code above; its core is sketched below).
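The core of the SSE2 approach, condensed from the strlen2 function in the benchmark code above (trimmed for illustration; the portable bit-scan fallbacks are omitted):

#include <cstdint>
#include <emmintrin.h>   // SSE2 only
#ifdef _MSC_VER
#include <intrin.h>      // _BitScanForward
#endif

size_t strlen_sse2_core(const char* str)
{
    size_t len = 0;
    // align to 16 bytes, checking bytes one by one until then
    while ((((uintptr_t)str) & (sizeof(__m128i) - 1)) != 0)
    {
        if (*str++ == 0)
            return len;
        ++len;
    }
    const __m128i zero = _mm_setzero_si128();
    for (;;)
    {
        __m128i data = _mm_load_si128((const __m128i*)str);
        // one bit per byte: set where the byte equals 0
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(data, zero));
        if (mask != 0)
        {
#ifdef _MSC_VER
            unsigned long pos;
            _BitScanForward(&pos, (unsigned long)mask);
            return len + pos;
#else
            return len + (size_t)__builtin_ctz((unsigned)mask);
#endif
        }
        str += sizeof(__m128i);
        len += sizeof(__m128i);
    }
}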



Table 2 - runtime of the strlen benchmark in seconds (Release)

Number of characters, platform | MS byte comparison | MS strlen | SSE 4.2 | SSE 2
10, x86                        | 0.15               | 0.12      | 0.12    | 0.125
1K, x86                        | 9.0                | 3.9       | 1.65    | 1.42
10, x64                        | 0.15               | 0.11      | 0.08    | 0.1
1K, x64                        | 8.5                | 2.3       | 1.6     | 1.32


Findings:



It seems to me that MSVC should always call the library strlen rather than substituting a byte-by-byte comparison.



UPD.

Added x64 test.

Removed the last loop in strlen_sse4.


