This translation was prepared specifically for students of the "C++ Developer" course.
What is a deterministic build?
A deterministic build is the process of building the same source code with the same environment and build instructions so that it produces exactly the same binaries every time, even when the builds are made on different machines, in different directories and with different names. Such builds are also sometimes called reproducible or hermetic builds if they are guaranteed to produce the same binaries even when compiling from different folders.
Deterministic builds do not happen by themselves. Ordinary projects do not produce them, and the reasons why may differ for each operating system or compiler.
Deterministic builds must be guaranteed for a given build environment. This means that certain variables, such as the operating system, build system versions, and target architecture, are assumed to remain the same across builds.
In recent years, various organizations, such as Chromium, Reproducible builds, and Yocto, have made great efforts to achieve deterministic builds.
The importance of deterministic builds
There are two main reasons why deterministic builds are so important:
- Security. Modifying binaries instead of the source code can make the changes invisible to the original authors. This can be fatal in safety-critical environments such as medicine, aviation and space. Guaranteeing identical results for given inputs allows third parties to reach consensus on the correct result.
- Traceability and binary management. If you want a repository for storing your binaries, you most likely do not want binaries with random checksums to be generated from sources at the same revision. This can lead the repository system to store different binaries as different versions when they should be the same. For example, if you work on Windows or macOS, a library contains fields with the creation/modification time of the object files included in it, which will lead to differences in the binary files.
Binary files involved in the build process in C/C++
Different types of binaries are created during the C/C++ build process, depending on the operating system.
Microsoft Windows: the most important files are those with the extensions .obj, .lib, .dll and .exe. All of them follow the Portable Executable (PE) format specification. These files can be analyzed with tools like dumpbin.
Linux: files with the extensions .o, .a, .so and without an extension (for executable binaries) follow the Executable and Linkable Format (ELF). The contents of ELF files can be analyzed using readelf.
macOS: files with the extensions .o, .a, .dylib and without an extension (for executable binaries) follow the Mach-O format specification. These files can be inspected using the otool application, which is part of the Xcode toolchain on macOS.
Sources of Variations
Many different factors can make your builds non-deterministic. The factors vary between operating systems and compilers. Each compiler has certain options to fix the sources of variation. To date, gcc and clang are the compilers that offer the most options for this. There are some undocumented options for msvc that you can try, but in the end you will probably have to patch the binaries to get deterministic builds.
Timestamps Added by Compiler / Linker
There are two main reasons why our binaries may contain time information that makes them non-reproducible:
- Using the __DATE__ or __TIME__ macros in the sources.
- The file format itself forcing time information to be stored in the object files. This is the case for the Portable Executable format on Windows and Mach-O on macOS. On Linux, ELF files do not encode any timestamps.
Let's look at an example of where this information ends up when compiling a static library of a basic hello world project on macOS.
    .
    ├── CMakeLists.txt
    ├── hello_world.cpp
    ├── hello_world.hpp
    ├── main.cpp
    └── run_build.sh
The library displays a message in the terminal:
    #include "hello_world.hpp"
    #include <iostream>

    void HelloWorld::PrintMessage(const std::string & message)
    {
        std::cout << message << std::endl;
    }
And the application will use this to display the message "Hello World!":
    #include <iostream>
    #include "hello_world.hpp"

    int main(int argc, char** argv)
    {
        HelloWorld hello;
        hello.PrintMessage("Hello World!");
        return 0;
    }
We will use CMake to build the project:
    cmake_minimum_required(VERSION 3.0)
    project(HelloWorld)
    set(CMAKE_CXX_STANDARD 11)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    add_library(HelloLibA hello_world.cpp)
    add_library(HelloLibB hello_world.cpp)
    add_executable(helloA main.cpp)
    add_executable(helloB main.cpp)
    target_link_libraries(helloA HelloLibA)
    target_link_libraries(helloB HelloLibB)
We will create two different libraries from the same source code, as well as two executables from the same sources. Build the project and run md5sum to see the checksums of all the binaries:
    mkdir build && cd build
    cmake ..
    make
    md5sum helloA
    md5sum helloB
    md5sum CMakeFiles/HelloLibA.dir/hello_world.cpp.o
    md5sum CMakeFiles/HelloLibB.dir/hello_world.cpp.o
    md5sum libHelloLibA.a
    md5sum libHelloLibB.a
We get output like this:
    b5dce09c593658ee348fd0f7fae22c94  helloA
    b5dce09c593658ee348fd0f7fae22c94  helloB
    0a4a0de3df8cc7f053f2fcb6d8b75e6d  CMakeFiles/HelloLibA.dir/hello_world.cpp.o
    0a4a0de3df8cc7f053f2fcb6d8b75e6d  CMakeFiles/HelloLibB.dir/hello_world.cpp.o
    adb80234a61bb66bdc5a3b4b7191eac7  libHelloLibA.a
    5ac3c70d28d9fdd9c6571e077131545e  libHelloLibB.a
This is interesting because the helloA and helloB executables have the same checksums, as do the intermediate Mach-O object files hello_world.cpp.o, but the same cannot be said of the .a files. This is because they store information about the intermediate object files in an archive format. The header of this format includes a field called st_time, set by the stat system call. Inspect libHelloLibA.a and libHelloLibB.a with otool to show the headers:
    > otool -a libHelloLibA.a
    Archive : libHelloLibA.a
    0100644 503/20 612 1566927276 #1/20
    0100644 503/20 13036 1566927271 #1/28
    > otool -a libHelloLibB.a
    Archive : libHelloLibB.a
    0100644 503/20 612 1566927277 #1/20
    0100644 503/20 13036 1566927272 #1/28
We see that the file contains several time fields that make our build non-deterministic. Note that these fields do not propagate to the final executables, since those have the same checksum. This problem can also occur when building on Windows with Visual Studio, but with the PE format instead of Mach-O.
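To see where those time fields live, here is a minimal sketch that walks the member headers of an ar archive held in memory. It assumes the common ar layout: an 8-byte magic string followed by 60-byte member headers, in which the name occupies bytes 0-15, the date bytes 16-27 and the size bytes 48-57. The names ArMember and list_ar_members are hypothetical, and BSD-style long names (shown as #1/20 by otool) are reported as the raw header field rather than resolved.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// One archive member as described by its 60-byte header (hypothetical type).
struct ArMember {
    std::string name;  // raw name field with the space padding removed
    long long date;    // decimal timestamp field, -1 if the field is blank
    long long size;    // size of the member data in bytes
};

// Removes the trailing spaces that pad fixed-width header fields.
static std::string trim_right(const std::string& field) {
    const std::size_t end = field.find_last_not_of(' ');
    return end == std::string::npos ? std::string() : field.substr(0, end + 1);
}

// Parses a space-padded decimal field; blank fields are reported as -1.
static long long parse_field(const std::string& field) {
    const std::string digits = trim_right(field);
    return digits.empty() ? -1 : std::stoll(digits);
}

// Lists the members of an ar archive stored in `archive`.
std::vector<ArMember> list_ar_members(const std::string& archive) {
    const std::string magic = "!<arch>\n";
    if (archive.compare(0, magic.size(), magic) != 0)
        throw std::runtime_error("not an ar archive");

    std::vector<ArMember> members;
    std::size_t pos = magic.size();
    while (pos + 60 <= archive.size()) {
        // Every member header ends with the two bytes "`\n".
        if (archive.compare(pos + 58, 2, "`\n") != 0)
            throw std::runtime_error("bad member header");

        ArMember member;
        member.name = trim_right(archive.substr(pos, 16));
        member.date = parse_field(archive.substr(pos + 16, 12));
        member.size = parse_field(archive.substr(pos + 48, 10));
        if (member.size < 0)
            throw std::runtime_error("bad member size");
        members.push_back(member);

        pos += 60 + static_cast<std::size_t>(member.size);
        if (pos % 2 != 0) ++pos;  // member data is padded to an even offset
    }
    return members;
}
```

Comparing the date values returned for two archives built from the same sources shows directly which header fields differ, without touching the member contents.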
At this point, we can try to make things even worse and make our executables non-deterministic as well. Change the main.cpp file so that it includes the __TIME__ macro:
    #include <iostream>
    #include "hello_world.hpp"

    int main(int argc, char** argv)
    {
        HelloWorld hello;
        hello.PrintMessage("Hello World!");
        std::cout << "At time: " << __TIME__ << std::endl;
        return 0;
    }
Check the checksums of the files again:
    625ecc7296e15d41e292f67b57b04f15  helloA
    20f92d2771a7d2f9866c002de918c4da  helloB
    0a4a0de3df8cc7f053f2fcb6d8b75e6d  CMakeFiles/HelloLibA.dir/hello_world.cpp.o
    0a4a0de3df8cc7f053f2fcb6d8b75e6d  CMakeFiles/HelloLibB.dir/hello_world.cpp.o
    b7801c60d3bc4f83640cadc1183f43b3  libHelloLibA.a
    4ef6cae3657f2a13ed77830953b0aee8  libHelloLibB.a
We see that now we have different binaries. We could analyze the executable using a tool like diffoscope, which shows the difference between two binary files:
    > diffoscope helloA helloB
    --- helloA
    +++ helloB
    ├── otool -arch x86_64 -tdvV {}
    │┄ Code for architecture x86_64
    │ @@ -16,15 +16,15 @@
    │  00000001000018da jmp 0x1000018df
    │  00000001000018df leaq -0x30(%rbp), %rdi
    │  00000001000018e3 callq 0x100002d54 ## symbol stub for: __ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev
    │  00000001000018e8 movq 0x1721(%rip), %rdi ## literal pool symbol address: __ZNSt3__14coutE
    │  00000001000018ef leaq 0x162f(%rip), %rsi ## literal pool for: "At time: "
    │  00000001000018f6 callq 0x100002d8a ## symbol stub for: __ZNSt3__1lsINS_11char_traitsIcEEEERNS_13basic_ostreamIcT_EES6_PKc
    │  00000001000018fb movq %rax, %rdi
    │ -00000001000018fe leaq 0x162a(%rip), %rsi ## literal pool for: "19:40:47"
    │ +00000001000018fe leaq 0x162a(%rip), %rsi ## literal pool for: "19:40:48"
    │  0000000100001905 callq 0x100002d8a ## symbol stub for: __ZNSt3__1lsINS_11char_traitsIcEEEERNS_13basic_ostreamIcT_EES6_PKc
    │  000000010000190a movq %rax, %rdi
    │  000000010000190d leaq __ZNSt3__1L4endlIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_(%rip), %rsi #
It shows that the __TIME__ information was embedded in the binary, which makes it non-deterministic. Let's see what can be done to avoid this.
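When a checksum mismatch is all we have, a rough first step before reaching for a full diffing tool is to find where two artifacts start to diverge. The following is a minimal sketch of that idea. The helper names read_binary and first_difference are hypothetical, and the whole file is loaded into memory, which is only reasonable for small binaries such as the ones in this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads a whole file into memory without any newline translation.
std::string read_binary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Returns the offset of the first differing byte, or -1 if both buffers are
// identical. If one buffer is a strict prefix of the other, the result is the
// length of the shorter buffer.
long long first_difference(const std::string& a, const std::string& b) {
    const std::size_t common = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (a[i] != b[i]) return static_cast<long long>(i);
    }
    return a.size() == b.size() ? -1 : static_cast<long long>(common);
}
```

Calling first_difference(read_binary("helloA"), read_binary("helloB")) would report an offset that can then be looked up in a disassembly or hex dump. It says nothing about why the bytes differ, which is what a tool like diffoscope adds.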
Possible Solutions for Microsoft Visual Studio
Microsoft Visual Studio has a linker flag, /Brepro, that is not documented by Microsoft. This flag sets the timestamps of the Portable Executable format to -1, as seen in the figure below.
To activate this flag with CMake, we must add the following line when creating an .exe:
    add_link_options("/Brepro")
or these lines for a .lib:
    set_target_properties(
        TARGET
        PROPERTIES STATIC_LIBRARY_OPTIONS "/Brepro"
    )
The problem is that this flag makes the binaries reproducible (with regard to the timestamps in the file format) in our final .exe binary, but it does not remove all the timestamps from the .lib (the same problem we discussed above for Mach-O). The TimeDateStamp field of the COFF file header in .lib files will remain. The only way to remove this information from the .lib binary is to patch the .lib, replacing the bytes corresponding to the TimeDateStamp field with any known value.
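The patch described above can be sketched for a single COFF object held in memory. This is a minimal illustration under a stated assumption: the buffer starts with the standard 20-byte COFF file header, in which TimeDateStamp is the 4-byte little-endian field at byte offset 4 (after the 2-byte Machine and 2-byte NumberOfSections fields). A real .lib is an archive of such members, and walking the archive, as well as handling members with other header layouts, is left out. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace {
const std::size_t kCoffHeaderSize = 20;      // size of the COFF file header
const std::size_t kTimeDateStampOffset = 4;  // Machine (2) + NumberOfSections (2)
}

// Reads the little-endian TimeDateStamp from a buffer that is assumed to
// start with a COFF file header.
std::uint32_t get_coff_timestamp(const std::vector<std::uint8_t>& coff) {
    if (coff.size() < kCoffHeaderSize)
        throw std::runtime_error("buffer too small for a COFF file header");
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i)
        value |= static_cast<std::uint32_t>(coff[kTimeDateStampOffset + i]) << (8 * i);
    return value;
}

// Overwrites TimeDateStamp with a fixed, known value so that repeated builds
// produce the same bytes in this field.
void set_coff_timestamp(std::vector<std::uint8_t>& coff, std::uint32_t value) {
    if (coff.size() < kCoffHeaderSize)
        throw std::runtime_error("buffer too small for a COFF file header");
    for (std::size_t i = 0; i < 4; ++i)
        coff[kTimeDateStampOffset + i] = static_cast<std::uint8_t>((value >> (8 * i)) & 0xFF);
}
```

Which fixed value to write is a project decision. What matters for determinism is that every build writes the same one.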
Possible Solutions for GCC and Clang
- gcc detects the existence of the SOURCE_DATE_EPOCH environment variable. If this variable is set, its value specifies a UNIX timestamp that replaces the current date and time in the __DATE__ and __TIME__ macros, so that the embedded timestamps become reproducible. The value can be set to a known timestamp, such as the time of the last modification of the source files or package.
- clang makes use of ZERO_AR_DATE, which, if set, resets the timestamps stored in archive files, setting them to 0. Note that this will not fix the __DATE__ or __TIME__ macros. If we want to fix the effect of these macros, we must either patch the binaries or fake the system time.
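The SOURCE_DATE_EPOCH convention described above can also be followed by your own build-time tools, for example a code generator that stamps its output. The following is a minimal sketch, not how gcc implements it: the function names are hypothetical, and silently falling back to the current time on a malformed value is a design choice made here for brevity (a stricter tool might report an error instead).

```cpp
#include <cerrno>
#include <cstdlib>
#include <ctime>

// Chooses the timestamp to embed: the value of SOURCE_DATE_EPOCH when it is a
// valid non-negative decimal integer, otherwise the supplied fallback.
std::time_t resolve_build_time(const char* source_date_epoch, std::time_t fallback) {
    if (source_date_epoch == nullptr || *source_date_epoch == '\0')
        return fallback;

    errno = 0;
    char* end = nullptr;
    const long long value = std::strtoll(source_date_epoch, &end, 10);

    // Reject overflow, trailing garbage and negative values.
    if (errno != 0 || *end != '\0' || value < 0)
        return fallback;

    return static_cast<std::time_t>(value);
}

// Convenience wrapper that reads the environment and falls back to "now".
std::time_t build_time() {
    return resolve_build_time(std::getenv("SOURCE_DATE_EPOCH"), std::time(nullptr));
}
```

Keeping the parsing separate from the environment lookup makes the rule easy to test with plain strings, and build_time() is the only place that depends on the machine the build runs on.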
Let's continue with our sample project for macOS and see what the results are when setting the ZERO_AR_DATE environment variable.
export ZERO_AR_DATE=1
Now, if we build our executables and libraries (after removing the __TIME__ macro from the sources), we get:
    b5dce09c593658ee348fd0f7fae22c94  helloA
    b5dce09c593658ee348fd0f7fae22c94  helloB
    0a4a0de3df8cc7f053f2fcb6d8b75e6d  CMakeFiles/HelloLibA.dir/hello_world.cpp.o
    0a4a0de3df8cc7f053f2fcb6d8b75e6d  CMakeFiles/HelloLibB.dir/hello_world.cpp.o
    9f9a9af4bb3e220e7a22fb58d708e1e5  libHelloLibA.a
    9f9a9af4bb3e220e7a22fb58d708e1e5  libHelloLibB.a
All checksums are now the same. Let's analyze the headers of the .a files:
    > otool -a libHelloLibA.a
    Archive : libHelloLibA.a
    0100644 503/20 612 0 #1/20
    0100644 503/20 13036 0 #1/28
    > otool -a libHelloLibB.a
    Archive : libHelloLibB.a
    0100644 503/20 612 0 #1/20
    0100644 503/20 13036 0 #1/28
We can see that the timestamp field of the library header was set to zero.
This brings us to the end of the first part of the article. The material continues in the second part.