Introduction to deterministic builds in C/C++. Part 1

This translation of the article was prepared specifically for students of the "C++ Developer" course.








What is a deterministic build?



A deterministic build is the process of building the same source code with the same environment and build instructions such that the same binaries are produced every time, even if they are built on different machines, in different directories and with different names. Such builds are also sometimes called reproducible or hermetic builds, if they are guaranteed to produce the same binaries even when compiled from different folders.
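In practice, "the same binaries" means byte-for-byte identical files. The following is a minimal sketch of such a check, assuming both build outputs are available on the local disk. The function names readFile and sameBinary are illustrative and not part of any tool mentioned in the article.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Reads a whole file into memory; returns false if it cannot be opened.
bool readFile(const std::string& path, std::string& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    out.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    return true;
}

// True only when both files can be read and contain exactly the same bytes.
bool sameBinary(const std::string& pathA, const std::string& pathB) {
    assert(!pathA.empty() && !pathB.empty());  // precondition: real paths
    std::string a;
    std::string b;
    return readFile(pathA, a) && readFile(pathB, b) && a == b;
}
```

Comparing checksums, as done later with md5sum, answers the same question without keeping both files side by side.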



Deterministic builds do not happen by themselves. Ordinary projects do not produce them, and the reasons why can differ for each operating system or compiler.



Deterministic builds must be guaranteed for a given build environment. This means that certain variables, such as the operating system, the build system versions and the target architecture, are assumed to remain the same between builds.



In recent years, various organizations and projects, such as Chromium, Reproducible Builds and Yocto, have put great effort into achieving deterministic builds.



The importance of deterministic builds



There are two main reasons why deterministic builds are so important:





Binary files involved in the C/C++ build process



Various types of binaries are created during the C/C++ build process, depending on the operating system.



Microsoft Windows. The most important are the files with the extensions .obj, .lib, .dll and .exe. All of them comply with the Portable Executable (PE) format specification. These files can be analyzed with tools such as dumpbin.

Linux. Files with the extensions .o, .a, .so and files without an extension (executable binaries) follow the Executable and Linkable Format (ELF). The contents of ELF files can be analyzed using readelf.

macOS. Files with the extensions .o, .a, .dylib and files without an extension (executable binaries) comply with the Mach-O format specification. These files can be inspected using the otool application, which is part of the Xcode toolchain on macOS.
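Each of these container formats can usually be recognized from the first bytes of the file. The sketch below is a rough illustration only. It assumes a little-endian 64-bit Mach-O file, treats the "MZ" prefix as a PE image (.exe or .dll), and reports static libraries (.a and .lib) as ar archives, since both start with the same "!<arch>" signature. Plain COFF .obj files have no fixed signature and come out as Unknown. The names BinaryFormat and detectFormat are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class BinaryFormat { Elf, MachO64, PeImage, ArArchive, Unknown };

// True when `data` begins with the `n` bytes stored at `magic`.
static bool startsWith(const std::string& data, const char* magic, std::size_t n) {
    assert(magic != nullptr);
    return data.size() >= n && data.compare(0, n, magic, n) == 0;
}

// Guesses the container format from the leading bytes of a file.
BinaryFormat detectFormat(const std::string& data) {
    // 0x7F 'E' 'L' 'F' (the literal is split so 'E' is not read as a hex digit)
    if (startsWith(data, "\x7f" "ELF", 4)) return BinaryFormat::Elf;
    // 0xFEEDFACF (64-bit Mach-O magic) as stored on a little-endian machine
    if (startsWith(data, "\xcf\xfa\xed\xfe", 4)) return BinaryFormat::MachO64;
    // DOS stub signature that precedes the PE header in .exe and .dll files
    if (startsWith(data, "MZ", 2)) return BinaryFormat::PeImage;
    // Global header shared by .a and .lib static libraries
    if (startsWith(data, "!<arch>\n", 8)) return BinaryFormat::ArArchive;
    return BinaryFormat::Unknown;
}
```

Tools such as dumpbin, readelf and otool go much further than this and decode the headers that follow the signature.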



Sources of variation



Many different factors can make your builds non-deterministic. The factors vary across operating systems and compilers. Each compiler has certain options for fixing the sources of variation. To date, gcc and clang are the compilers with the most options for this. There are some undocumented options for msvc that you can try, but in the end you will probably have to patch the binaries to get deterministic builds.



Timestamps added by the compiler / linker



There are two main reasons why our binaries may contain time information that makes them non-reproducible:





Let's look at an example of where this information ends up when compiling the static library of a basic hello world project on macOS.



    .
    ├── CMakeLists.txt
    ├── hello_world.cpp
    ├── hello_world.hpp
    ├── main.cpp
    └── run_build.sh
      
      





The library displays a message in the terminal:



    #include "hello_world.hpp"
    #include <iostream>

    void HelloWorld::PrintMessage(const std::string & message)
    {
        std::cout << message << std::endl;
    }
      
      





And the application will use this to display the message "Hello World!":



    #include <iostream>
    #include "hello_world.hpp"

    int main(int argc, char** argv)
    {
        HelloWorld hello;
        hello.PrintMessage("Hello World!");
        return 0;
    }
      
      





We will use CMake to build the project:



    cmake_minimum_required(VERSION 3.0)
    project(HelloWorld)
    set(CMAKE_CXX_STANDARD 11)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    add_library(HelloLibA hello_world.cpp)
    add_library(HelloLibB hello_world.cpp)
    add_executable(helloA main.cpp)
    add_executable(helloB main.cpp)
    target_link_libraries(helloA HelloLibA)
    target_link_libraries(helloB HelloLibB)
      
      





We will create two different libraries from the same source code, as well as two executables from the same sources. Build the project and run md5sum to see the checksums of all the binaries:



    mkdir build && cd build
    cmake ..
    make
    md5sum helloA
    md5sum helloB
    md5sum CMakeFiles/HelloLibA.dir/hello_world.cpp.o
    md5sum CMakeFiles/HelloLibB.dir/hello_world.cpp.o
    md5sum libHelloLibA.a
    md5sum libHelloLibB.a
      
      





We get output like this:



    b5dce09c593658ee348fd0f7fae22c94 helloA
    b5dce09c593658ee348fd0f7fae22c94 helloB
    0a4a0de3df8cc7f053f2fcb6d8b75e6d CMakeFiles/HelloLibA.dir/hello_world.cpp.o
    0a4a0de3df8cc7f053f2fcb6d8b75e6d CMakeFiles/HelloLibB.dir/hello_world.cpp.o
    adb80234a61bb66bdc5a3b4b7191eac7 libHelloLibA.a
    5ac3c70d28d9fdd9c6571e077131545e libHelloLibB.a
      
      





This is interesting because the helloA and helloB executables have the same checksums, as do the Mach-O intermediate object files hello_world.cpp.o, but the same cannot be said for the .a files. This is because they store information about the intermediate object files in an archive format. The header of this format includes a field called st_time set by the stat system call. Inspect libHelloLibA.a and libHelloLibB.a using otool to show the headers:



    > otool -a libHelloLibA.a
    Archive : libHelloLibA.a
    0100644 503/20 612 1566927276 #1/20
    0100644 503/20 13036 1566927271 #1/28
    > otool -a libHelloLibB.a
    Archive : libHelloLibB.a
    0100644 503/20 612 1566927277 #1/20
    0100644 503/20 13036 1566927272 #1/28
      
      





We see that the archives contain several time fields that make our build non-deterministic. Note that these fields do not propagate to the final executables, since helloA and helloB have the same checksum. This problem can also occur when building on Windows with Visual Studio, but with PE files instead of Mach-O.
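To make the role of these fields more concrete, here is a hedged sketch that reads the date field from every member header of an archive held in memory. It assumes the common ar layout: an 8-byte "!<arch>" signature followed by 60-byte ASCII headers (name 16 bytes, date 12, uid 6, gid 6, mode 8, size 10, end marker 2), with members aligned to even offsets. The function name archiveTimestamps is illustrative, and malformed numeric fields are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kMagicSize = 8;     // "!<arch>\n"
constexpr std::size_t kHeaderSize = 60;   // one member header
constexpr std::size_t kDateOffset = 16;   // date follows the 16-byte name
constexpr std::size_t kDateSize = 12;
constexpr std::size_t kSizeOffset = 48;   // member size field
constexpr std::size_t kSizeSize = 10;
constexpr std::size_t kEndOffset = 58;    // two-byte end marker "`\n"

// Returns the timestamp stored in each member header, in archive order.
// An empty result means the data does not look like an ar archive.
std::vector<long long> archiveTimestamps(const std::string& data) {
    std::vector<long long> stamps;
    if (data.compare(0, kMagicSize, "!<arch>\n") != 0) return stamps;
    std::size_t pos = kMagicSize;
    while (pos + kHeaderSize <= data.size()) {
        assert(pos % 2 == 0);  // headers always start on an even offset
        if (data.compare(pos + kEndOffset, 2, "`\n") != 0) break;
        // Fields are decimal ASCII padded with spaces on the right.
        stamps.push_back(std::stoll(data.substr(pos + kDateOffset, kDateSize)));
        const std::size_t size = static_cast<std::size_t>(
            std::stoull(data.substr(pos + kSizeOffset, kSizeSize)));
        pos += kHeaderSize + size;  // skip the header and the member body
        pos += pos % 2;             // skip the padding byte, if any
    }
    return stamps;
}
```

For the two libraries above, such a reader would be expected to report values like 1566927276 and 1566927277, which is exactly the difference that changes the checksums.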



At this point, we can try to make things even worse and make our executables non-deterministic as well. Change the main.cpp file so that it uses the __TIME__ macro:



    #include <iostream>
    #include "hello_world.hpp"

    int main(int argc, char** argv)
    {
        HelloWorld hello;
        hello.PrintMessage("Hello World!");
        std::cout << "At time: " << __TIME__ << std::endl;
        return 0;
    }
      
      





Check the checksums of the files again:



    625ecc7296e15d41e292f67b57b04f15 helloA
    20f92d2771a7d2f9866c002de918c4da helloB
    0a4a0de3df8cc7f053f2fcb6d8b75e6d CMakeFiles/HelloLibA.dir/hello_world.cpp.o
    0a4a0de3df8cc7f053f2fcb6d8b75e6d CMakeFiles/HelloLibB.dir/hello_world.cpp.o
    b7801c60d3bc4f83640cadc1183f43b3 libHelloLibA.a
    4ef6cae3657f2a13ed77830953b0aee8 libHelloLibB.a
      
      





We see that we now have different executables. We can analyze them using a tool such as diffoscope, which shows the differences between two binary files:



    > diffoscope helloA helloB
    --- helloA
    +++ helloB
    ├── otool -arch x86_64 -tdvV {}
    │┄ Code for architecture x86_64
    │ @@ -16,15 +16,15 @@
    │  00000001000018da jmp 0x1000018df
    │  00000001000018df leaq -0x30(%rbp), %rdi
    │  00000001000018e3 callq 0x100002d54 ## symbol stub for: __ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev
    │  00000001000018e8 movq 0x1721(%rip), %rdi ## literal pool symbol address: __ZNSt3__14coutE
    │  00000001000018ef leaq 0x162f(%rip), %rsi ## literal pool for: "At time: "
    │  00000001000018f6 callq 0x100002d8a ## symbol stub for: __ZNSt3__1lsINS_11char_traitsIcEEEERNS_13basic_ostreamIcT_EES6_PKc
    │  00000001000018fb movq %rax, %rdi
    │ -00000001000018fe leaq 0x162a(%rip), %rsi ## literal pool for: "19:40:47"
    │ +00000001000018fe leaq 0x162a(%rip), %rsi ## literal pool for: "19:40:48"
    │  0000000100001905 callq 0x100002d8a ## symbol stub for: __ZNSt3__1lsINS_11char_traitsIcEEEERNS_13basic_ostreamIcT_EES6_PKc
    │  000000010000190a movq %rax, %rdi
    │  000000010000190d leaq __ZNSt3__1L4endlIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_(%rip), %rsi #
      
      





It shows that the __TIME__ value was embedded in the binary, which makes it non-deterministic. Let's see what can be done to avoid this.
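One source-level way to sidestep the problem is to stop asking the compiler for the wall-clock time and print a value supplied by the build instead. The sketch below assumes a hypothetical BUILD_STAMP macro that the build system would pass on the command line (for example a version string). The macro and the functions buildStamp and banner are invented for this illustration and do not come from the article.

```cpp
#include <cassert>
#include <string>

// BUILD_STAMP is a hypothetical macro the build system could define, e.g.
// -DBUILD_STAMP="\"1.0.0\"". Without it we fall back to a fixed placeholder.
#ifndef BUILD_STAMP
#define BUILD_STAMP "unversioned"
#endif

// Returns text that depends only on the build inputs, never on the clock.
std::string buildStamp() {
    return BUILD_STAMP;
}

// Builds the line the application would print instead of "At time: ...".
std::string banner(const std::string& message) {
    assert(!message.empty());  // precondition: there is something to print
    return message + " (build: " + buildStamp() + ")";
}
```

Because the string is now an explicit build input, two builds with the same inputs embed the same bytes, which is the property that __TIME__ breaks.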



Possible solutions for Microsoft Visual Studio



Microsoft Visual Studio has a linker flag, /Brepro, that is not documented by Microsoft. This flag sets the timestamps in the Portable Executable format to -1, as seen in the figure below.







To enable this flag with CMake, add the following line when creating the .exe:



 add_link_options("/Brepro")
      
      





or these lines for a .lib:







    set_target_properties(
        TARGET
        PROPERTIES STATIC_LIBRARY_OPTIONS "/Brepro"
    )
      
      





The problem is that this flag makes the final .exe binary reproducible (with respect to the timestamps in the file format), but it does not remove all the timestamps from the .lib (the same problem we saw above with the static libraries on macOS). The TimeDateStamp field of the COFF file header for .lib files will remain. The only way to remove this information from the .lib binary is to patch the .lib, replacing the bytes corresponding to the TimeDateStamp field with any known value.
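As a rough illustration of such patching, the sketch below overwrites TimeDateStamp in a buffer that holds a single COFF object file. It relies on the COFF file header layout, where Machine (2 bytes) and NumberOfSections (2 bytes) come first and the 4-byte little-endian TimeDateStamp follows at offset 4. Locating each object inside a .lib archive is not shown, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTimeDateStampOffset = 4;  // after Machine + NumberOfSections
constexpr std::size_t kCoffHeaderSize = 20;      // size of the COFF file header

// Overwrites TimeDateStamp with `value`, stored as little-endian bytes.
// Returns false when the buffer is too small to contain a COFF header.
bool patchCoffTimestamp(std::vector<std::uint8_t>& object, std::uint32_t value) {
    if (object.size() < kCoffHeaderSize) return false;
    for (std::size_t i = 0; i < 4; ++i) {
        object[kTimeDateStampOffset + i] =
            static_cast<std::uint8_t>((value >> (8 * i)) & 0xFFu);
    }
    return true;
}

// Reads TimeDateStamp back, mainly to verify that a patch took effect.
std::uint32_t readCoffTimestamp(const std::vector<std::uint8_t>& object) {
    assert(object.size() >= kCoffHeaderSize);  // caller must pass a full header
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        value |= static_cast<std::uint32_t>(object[kTimeDateStampOffset + i]) << (8 * i);
    }
    return value;
}
```

Any fixed value works as the replacement, as long as every build uses the same one.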



Possible solutions for GCC and Clang





Let's continue with our sample project on macOS and see what the results are when we set the ZERO_AR_DATE environment variable.



 export ZERO_AR_DATE=1
      
      





Now, if we build our executables and libraries (after removing the __TIME__ macro from the sources), we get:



    b5dce09c593658ee348fd0f7fae22c94 helloA
    b5dce09c593658ee348fd0f7fae22c94 helloB
    0a4a0de3df8cc7f053f2fcb6d8b75e6d CMakeFiles/HelloLibA.dir/hello_world.cpp.o
    0a4a0de3df8cc7f053f2fcb6d8b75e6d CMakeFiles/HelloLibB.dir/hello_world.cpp.o
    9f9a9af4bb3e220e7a22fb58d708e1e5 libHelloLibA.a
    9f9a9af4bb3e220e7a22fb58d708e1e5 libHelloLibB.a
      
      





All checksums are now the same. Let's analyze the headers of the .a files:



    > otool -a libHelloLibA.a
    Archive : libHelloLibA.a
    0100644 503/20 612 0 #1/20
    0100644 503/20 13036 0 #1/28
    > otool -a libHelloLibB.a
    Archive : libHelloLibB.a
    0100644 503/20 612 0 #1/20
    0100644 503/20 13036 0 #1/28
      
      





We can see that the timestamp field of the library header was set to zero.
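The effect can be imitated on an archive that has already been built. The following sketch, which assumes the same ar layout of 60-byte ASCII member headers with a 12-byte date field at offset 16 and a 10-byte size field at offset 48, rewrites every date field to zero in memory. It is not how the toolchain implements ZERO_AR_DATE, only an illustration of the end result, and zeroArchiveDates is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Rewrites the date field of every member header to "0" padded with spaces.
// Returns the number of headers rewritten (0 if this is not an ar archive).
std::size_t zeroArchiveDates(std::string& data) {
    const std::size_t magicSize = 8, headerSize = 60;
    const std::size_t dateOffset = 16, dateSize = 12;
    const std::size_t sizeOffset = 48, sizeSize = 10;
    if (data.compare(0, magicSize, "!<arch>\n") != 0) return 0;

    std::string zeroDate(dateSize, ' ');
    zeroDate[0] = '0';

    std::size_t count = 0;
    std::size_t pos = magicSize;
    while (pos + headerSize <= data.size()) {
        assert(pos % 2 == 0);  // headers always start on an even offset
        data.replace(pos + dateOffset, dateSize, zeroDate);
        const std::size_t size = static_cast<std::size_t>(
            std::stoull(data.substr(pos + sizeOffset, sizeSize)));
        ++count;
        pos += headerSize + size;  // skip the header and the member body
        pos += pos % 2;            // skip the padding byte, if any
    }
    return count;
}
```

Two archives that differ only in their date fields become byte-identical after this rewrite, which matches the equal checksums shown above.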



This brings us to the end of the first part of the article. The continuation of the material can be read here.


