Introduction to deterministic builds in C/C++. Part 2

This translation was prepared specifically for students of the "C++ Developer" course.







→ Read the first part






Build folder information propagates into binary files



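If the same source files are compiled in different folders, the folder information sometimes ends up in the binary files. This happens mainly for two reasons: macros that expand to information about the current file, such as __FILE__, and debug builds, which record where the sources are located.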





Continuing our hello world example on macOS, let's split the sources into two folders to show how location affects the final binaries. The project structure looks like this:



    .
    ├── run_build.sh
    ├── srcA
    │   ├── CMakeLists.txt
    │   ├── hello_world.cpp
    │   ├── hello_world.hpp
    │   └── main.cpp
    └── srcB
        ├── CMakeLists.txt
        ├── hello_world.cpp
        ├── hello_world.hpp
        └── main.cpp
      
      





Let's build our binaries in debug mode.



    cd srcA/build
    cmake -DCMAKE_BUILD_TYPE=Debug ..
    make
    cd .. && cd ..
    cd srcB/build
    cmake -DCMAKE_BUILD_TYPE=Debug ..
    make
    cd .. && cd ..
    md5sum srcA/build/hello
    md5sum srcB/build/hello
    md5sum srcA/build/CMakeFiles/HelloLib.dir/hello_world.cpp.o
    md5sum srcB/build/CMakeFiles/HelloLib.dir/hello_world.cpp.o
    md5sum srcA/build/libHelloLib.a
    md5sum srcB/build/libHelloLib.a

Output:

    3572a95a8699f71803f3e967f92a5040 srcA/build/hello
    7ca693295e62de03a1bba14853efa28c srcB/build/hello
    76e0ae7c4ef79ec3be821ccf5752730f srcA/build/CMakeFiles/HelloLib.dir/hello_world.cpp.o
    5ef044e6dcb73359f46d48f29f566ae5 srcB/build/CMakeFiles/HelloLib.dir/hello_world.cpp.o
    dc941156608b578c91e38f8ecebfef6d srcA/build/libHelloLib.a
    1f9697ef23bf70b41b39ef3469845f76 srcB/build/libHelloLib.a
      
      





The folder information propagates from the object files into the final executables, which makes our builds non-reproducible. We can inspect the differences between the binaries with diffoscope to see where the folder information is embedded.



    > diffoscope helloA helloB
    --- srcA/build/hello
    +++ srcB/build/hello
    @@ -1282,20 +1282,20 @@
    ...
     00005070: 5f77 6f72 6c64 5f64 6562 7567 2f73 7263 _world_debug/src
    -00005080: 412f 006d 6169 6e2e 6370 7000 2f55 7365 A/.main.cpp./Use
    +00005080: 422f 006d 6169 6e2e 6370 7000 2f55 7365 B/.main.cpp./Use
     00005090: 7273 2f63 6172 6c6f 732f 446f 6375 6d65 rs/carlos/Docume
     000050a0: 6e74 732f 6465 7665 6c6f 7065 722f 7265 nts/developer/re
     000050b0: 7072 6f64 7563 6962 6c65 2d62 7569 6c64 producible-build
     000050c0: 732f 7361 6e64 626f 782f 6865 6c6c 6f5f s/sandbox/hello_
    -000050d0: 776f 726c 645f 6465 6275 672f 7372 6341 world_debug/srcA
    +000050d0: 776f 726c 645f 6465 6275 672f 7372 6342 world_debug/srcB
     000050e0: 2f62 7569 6c64 2f43 4d61 6b65 4669 6c65 /build/CMakeFile
     000050f0: 732f 6865 6c6c 6f2e 6469 722f 6d61 696e s/hello.dir/main
     00005100: 2e63 7070 2e6f 005f 6d61 696e 005f 5f5a .cpp.o._main.__Z
    ...
    @@ -1336,15 +1336,15 @@
    ...
     000053c0: 6962 6c65 2d62 7569 6c64 732f 7361 6e64 ible-builds/sand
     000053d0: 626f 782f 6865 6c6c 6f5f 776f 726c 645f box/hello_world_
    -000053e0: 6465 6275 672f 7372 6341 2f62 7569 6c64 debug/srcA/build
    +000053e0: 6465 6275 672f 7372 6342 2f62 7569 6c64 debug/srcB/build
     000053f0: 2f6c 6962 4865 6c6c 6f4c 6962 2e61 2868 /libHelloLib.a(h
     00005400: 656c 6c6f 5f77 6f72 6c64 2e63 7070 2e6f ello_world.cpp.o
     00005410: 2900 5f5f 5a4e 3130 4865 6c6c 6f57 6f72 ).__ZN10HelloWor
    ...
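When diffoscope is not at hand, a rough check for embedded folder information is to search the bytes of a binary for the build path. The following is only a minimal sketch: the function name find_embedded_paths and the sample bytes are illustrative assumptions, not part of the original example.

```python
import re


def find_embedded_paths(data: bytes, build_dir: str) -> list:
    """Return every printable string in `data` that starts with `build_dir`."""
    prefix = re.escape(build_dir.encode("utf-8"))
    # Match the prefix followed by printable ASCII, stopping at the first
    # non-printable byte (for example the NUL that terminates a C string).
    pattern = re.compile(prefix + rb"[\x20-\x7e]*")
    return [m.group(0).decode("utf-8", "replace") for m in pattern.finditer(data)]


# Made-up bytes that imitate a debug section holding two recorded paths.
sample = b"\x00/work/srcA/main.cpp\x00_main\x00/work/srcA/build/libHelloLib.a\x00"
hits = find_embedded_paths(sample, "/work/srcA")
print(hits)
```

For a real binary, the same function could be given the contents of srcA/build/hello read in binary mode and the absolute path of the srcA folder.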
      
      





Possible solutions



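Again, the solution depends on the compiler used. Recent versions of GCC and Clang can remap path prefixes with the -fdebug-prefix-map, -fmacro-prefix-map and -ffile-prefix-map flags.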





The best way to solve this problem is to add flags to the compiler options. When using CMake:



 target_compile_options(target PUBLIC "-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.")
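To see why this removes the difference between srcA and srcB, here is a toy model of an OLD=NEW prefix mapping. It is only a sketch of the effect on recorded paths, not of how any compiler implements the flag; the function remap_prefix and the folder names are assumptions made for illustration.

```python
def remap_prefix(path: str, old: str, new: str) -> str:
    """Replace a leading `old` prefix of `path` with `new`, as an OLD=NEW map would."""
    if path.startswith(old):
        return new + path[len(old):]
    return path


# Hypothetical locations of the two copies of the sources.
source_dir_a = "/home/user/hello_world_debug/srcA"
source_dir_b = "/home/user/hello_world_debug/srcB"

# Each build maps its own source folder to ".", so the recorded path is the same.
recorded_a = remap_prefix(source_dir_a + "/main.cpp", source_dir_a, ".")
recorded_b = remap_prefix(source_dir_b + "/main.cpp", source_dir_b, ".")
print(recorded_a, recorded_b)
```

Paths outside the mapped prefix, such as system headers, are left unchanged in this model.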
      
      





The order of files in the build system



File order can become a problem when directories are read to build the list of files. For example, Unix does not guarantee the order in which readdir() and listdir() return the contents of a directory, so feeding the build system directly from these functions can lead to non-deterministic builds.



The same problem occurs if your build system stores the files for the linker in a container that can return its elements in a non-deterministic order (for example, a Python set, or a regular dictionary on Python versions before 3.7). The files are then linked in a different order each time, and different binaries are produced.
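A common remedy in both cases is to sort the file list explicitly before handing it to the build. The sketch below assumes a build script written in Python; the helper name list_sources and the temporary folder are illustrative only.

```python
import os
import tempfile


def list_sources(directory: str, suffix: str = ".cpp") -> list:
    """Return the source files in `directory` in a stable, sorted order."""
    names = [n for n in os.listdir(directory) if n.endswith(suffix)]
    # os.listdir() makes no promise about ordering, so sort explicitly.
    return sorted(names)


with tempfile.TemporaryDirectory() as tmp:
    # Create the files in an arbitrary order to imitate an unordered listing.
    for name in ("sources1.cpp", "hello_world.cpp", "sources0.cpp", "sources0.hpp"):
        open(os.path.join(tmp, name), "w").close()
    ordered = list_sources(tmp)

print(ordered)
```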



We can simulate this problem by rearranging the files in CMake. If we modify the previous example to have more than one source file for the library:



    .
    ├── CMakeLists.txt
    ├── CMakeListsA.txt
    ├── CMakeListsB.txt
    ├── hello_world.cpp
    ├── hello_world.hpp
    ├── main.cpp
    ├── sources0.cpp
    ├── sources0.hpp
    ├── sources1.cpp
    ├── sources1.hpp
    ├── sources2.cpp
    └── sources2.hpp
      
      





We can see that the build results differ if we change the order of the files in CMakeLists.txt:



    cmake_minimum_required(VERSION 3.0)
    project(HelloWorld)
    set(CMAKE_CXX_STANDARD 11)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    add_library(HelloLib hello_world.cpp
                         sources0.cpp
                         sources1.cpp
                         sources2.cpp)
    add_executable(hello main.cpp)
    target_link_libraries(hello HelloLib)
      
      





If we make two consecutive builds, named A and B, swapping sources0.cpp and sources1.cpp in the list of files, we get the following checksums:



    30ab264d6f8e1784282cd1a415c067f2 helloA
    cdf3c9dd968f7363dc9e8b40918d83af helloB
    707c71bc2a8def6885b96fb67b84d79c hello_worldA.cpp.o
    707c71bc2a8def6885b96fb67b84d79c hello_worldB.cpp.o
    694ff3765b688e6faeebf283052629a3 sources0A.cpp.o
    694ff3765b688e6faeebf283052629a3 sources0B.cpp.o
    0db24dc6a94da1d167c68b96ff319e56 sources1A.cpp.o
    0db24dc6a94da1d167c68b96ff319e56 sources1B.cpp.o
    fd0754d9a4a44b0fcc4e4f3c66ad187c sources2A.cpp.o
    fd0754d9a4a44b0fcc4e4f3c66ad187c sources2B.cpp.o
    baba9709d69c9e5fd51ad985ee328172 libHelloLibA.a
    72641dc6fc4f4db04166255f62803353 libHelloLibB.a
      
      





The .o object files are identical, but the .a libraries and the executables are not. This is because the order in which the objects are inserted into the library depends on the order in which the files are listed.
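The effect can be reproduced with a toy model: identical members, packed in a different order, give a different container checksum. This is only a sketch; toy_archive is a made-up, heavily simplified stand-in and does not follow the real ar archive format.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def toy_archive(members: list) -> bytes:
    """Concatenate (name, content) pairs in the given order."""
    return b"".join(name.encode() + b"\n" + content for name, content in members)


# Two fake object files whose contents never change between builds.
obj0 = ("sources0.cpp.o", b"object code 0")
obj1 = ("sources1.cpp.o", b"object code 1")

archive_a = toy_archive([obj0, obj1])  # build A: sources0 listed first
archive_b = toy_archive([obj1, obj0])  # build B: sources1 listed first

same_members = sorted([obj0, obj1]) == sorted([obj1, obj0])
same_archive = md5_hex(archive_a) == md5_hex(archive_b)
print(same_members, same_archive)
```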



Randomness created by the compiler



This problem occurs, for example, in gcc when Link-Time Optimization (the -flto flag) is enabled. This option injects randomly generated names into the binary files. The only way to avoid the problem is to use the -frandom-seed flag. This option provides a seed that gcc uses instead of random numbers. The seed is used to generate certain symbol names that must be different in every compiled file. It is also used to place unique stamps in coverage data files and in the object files that produce them. The seed should be different for each source file. One option is to derive it from the checksum of the file, so that the chance of a collision is very low. For example, in CMake this can be done with a loop like this:



    set(LIB_SOURCES
        ./src/source1.cpp
        ./src/source2.cpp
        ./src/source3.cpp)

    foreach(_file ${LIB_SOURCES})
        # Use the first 8 hex digits of the file's SHA1 as its seed.
        file(SHA1 ${_file} checksum)
        string(SUBSTRING ${checksum} 0 8 checksum)
        # The leading space keeps the flag separate from any flags already set.
        set_property(SOURCE ${_file} APPEND_STRING PROPERTY COMPILE_FLAGS " -frandom-seed=0x${checksum}")
    endforeach()
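The same derivation can be written outside CMake. The following is a minimal sketch assuming a build script in Python; random_seed_flag is a hypothetical helper that mirrors the steps above (SHA1 of the file, first 8 hex digits).

```python
import hashlib
import os
import tempfile


def random_seed_flag(source_path: str) -> str:
    """Build a -frandom-seed flag from the first 8 hex digits of the file's SHA-1."""
    with open(source_path, "rb") as handle:
        checksum = hashlib.sha1(handle.read()).hexdigest()
    return "-frandom-seed=0x" + checksum[:8]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "source1.cpp")
    with open(path, "wb") as handle:
        handle.write(b"int answer() { return 42; }\n")
    # The flag depends only on the file contents, so it is stable across runs.
    flag_first = random_seed_flag(path)
    flag_second = random_seed_flag(path)

print(flag_first)
```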
      
      





Some tips for using Conan



Conan hooks can help us make our builds reproducible. This feature allows you to customize the behavior of the client at certain points.



One way to use hooks is to set environment variables in the pre_build step. In the example below, the set_environment function is called first, and the environment is then restored in the post_build step using reset_environment.



    def set_environment(self):
        if self._os == "Linux":
            # Remember the previous value so that it can be restored later.
            self._old_source_date_epoch = os.environ.get("SOURCE_DATE_EPOCH")
            timestamp = "1564483496"
            os.environ["SOURCE_DATE_EPOCH"] = timestamp
            self._output.info(
                "set SOURCE_DATE_EPOCH: {}".format(timestamp))
        elif self._os == "Macos":
            os.environ["ZERO_AR_DATE"] = "1"
            self._output.info(
                "set ZERO_AR_DATE: {}".format(os.environ["ZERO_AR_DATE"]))

    def reset_environment(self):
        if self._os == "Linux":
            if self._old_source_date_epoch is None:
                del os.environ["SOURCE_DATE_EPOCH"]
            else:
                os.environ["SOURCE_DATE_EPOCH"] = self._old_source_date_epoch
        elif self._os == "Macos":
            del os.environ["ZERO_AR_DATE"]
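Outside of hooks, where setting and restoring happen in one place, the same set-then-restore pattern can be expressed as a context manager. This is a sketch under that assumption; temporary_env is a hypothetical helper and not part of Conan.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name: str, value: str):
    """Set an environment variable for the duration of a `with` block."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier value, or remove the variable if it was unset.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


os.environ.pop("SOURCE_DATE_EPOCH", None)  # start from a known state for the demo
with temporary_env("SOURCE_DATE_EPOCH", "1564483496"):
    inside = os.environ["SOURCE_DATE_EPOCH"]  # value seen by the build
after = os.environ.get("SOURCE_DATE_EPOCH")   # restored once the block ends
print(inside, after)
```

The try/finally ensures that the variable is restored even if the build step raises an exception.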
      
      





Hooks can also be useful for patching binaries in the post_build step. There are various tools for analyzing and fixing binary files, such as ducible, pefile, pe-parse or strip-nondeterminism. An example hook that patches a PE binary using ducible might look like this:



    class Patcher(object):
    ...
        def patch(self):
            if self._os == "Windows" and self._compiler == "Visual Studio":
                for root, _, filenames in os.walk(self._conanfile.build_folder):
                    for filename in filenames:
                        filename = os.path.join(root, filename)
                        if ".exe" in filename or ".dll" in filename:
                            self._patch_pe(filename)

        def _patch_pe(self, filename):
            patch_tool_location = "C:/ducible/ducible.exe"
            if os.path.isfile(patch_tool_location):
                self._output.info("Patching {} with md5sum: {}".format(filename,md5sum(filename)))
                self._conanfile.run("{} {}".format(patch_tool_location, filename))
                self._output.info("Patched file: {} with md5sum: {}".format(filename,md5sum(filename)))
    ...

    def pre_build(output, conanfile, **kwargs):
        lib_patcher.init(output, conanfile)
        lib_patcher.set_environment()

    def post_build(output, conanfile, **kwargs):
        lib_patcher.patch()
        lib_patcher.reset_environment()
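The hook calls an md5sum(filename) helper that is not shown in the excerpt. One possible implementation, given here purely as an assumption about what such a helper could look like, reads the file in chunks with hashlib:

```python
import hashlib
import os
import tempfile


def md5sum(filename: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(filename, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    empty_file = os.path.join(tmp, "empty.bin")
    open(empty_file, "wb").close()
    empty_digest = md5sum(empty_file)

print(empty_digest)
```

Reading in chunks keeps memory use flat even for large executables or libraries.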
      
      





Conclusions



Deterministic builds are a hard problem that is closely tied to the operating system and toolchain used. This introduction should help you understand the most common causes of non-determinism and how to address them.



