This translation of the article was prepared specifically for students of the "C++ Developer" course.
Read the first part
Build folder information propagated to binaries
If the same source files are compiled in different folders, sometimes the folder information is transferred to binary files. This can happen mainly for two reasons:
- Using macros that contain information about the current file, such as the __FILE__ macro.
- Creating debug binaries that store information about where the sources are located.
Continuing our hello world example on macOS, let's split the sources so that we can show the effect of their location on the final binaries. The project structure is shown below.
.
├── run_build.sh
├── srcA
│   ├── CMakeLists.txt
│   ├── hello_world.cpp
│   ├── hello_world.hpp
│   └── main.cpp
└── srcB
    ├── CMakeLists.txt
    ├── hello_world.cpp
    ├── hello_world.hpp
    └── main.cpp
Let's build our binaries in debug mode.
cd srcA/build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make
cd .. && cd ..
cd srcB/build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make
cd .. && cd ..
md5sum srcA/build/hello
md5sum srcB/build/hello
md5sum srcA/build/CMakeFiles/HelloLib.dir/hello_world.cpp.o
md5sum srcB/build/CMakeFiles/HelloLib.dir/hello_world.cpp.o
md5sum srcA/build/libHelloLib.a
md5sum srcB/build/libHelloLib.a

3572a95a8699f71803f3e967f92a5040 srcA/build/hello
7ca693295e62de03a1bba14853efa28c srcB/build/hello
76e0ae7c4ef79ec3be821ccf5752730f srcA/build/CMakeFiles/HelloLib.dir/hello_world.cpp.o
5ef044e6dcb73359f46d48f29f566ae5 srcB/build/CMakeFiles/HelloLib.dir/hello_world.cpp.o
dc941156608b578c91e38f8ecebfef6d srcA/build/libHelloLib.a
1f9697ef23bf70b41b39ef3469845f76 srcB/build/libHelloLib.a
The folder information propagates from the object files to the final executables, which makes our builds non-reproducible. We can inspect the differences between the binaries with diffoscope to see where the folder information is embedded.
> diffoscope helloA helloB
--- srcA/build/hello
+++ srcB/build/hello
@@ -1282,20 +1282,20 @@
...
 00005070: 5f77 6f72 6c64 5f64 6562 7567 2f73 7263  _world_debug/src
-00005080: 412f 006d 6169 6e2e 6370 7000 2f55 7365  A/.main.cpp./Use
+00005080: 422f 006d 6169 6e2e 6370 7000 2f55 7365  B/.main.cpp./Use
 00005090: 7273 2f63 6172 6c6f 732f 446f 6375 6d65  rs/carlos/Docume
 000050a0: 6e74 732f 6465 7665 6c6f 7065 722f 7265  nts/developer/re
 000050b0: 7072 6f64 7563 6962 6c65 2d62 7569 6c64  producible-build
 000050c0: 732f 7361 6e64 626f 782f 6865 6c6c 6f5f  s/sandbox/hello_
-000050d0: 776f 726c 645f 6465 6275 672f 7372 6341  world_debug/srcA
+000050d0: 776f 726c 645f 6465 6275 672f 7372 6342  world_debug/srcB
 000050e0: 2f62 7569 6c64 2f43 4d61 6b65 4669 6c65  /build/CMakeFile
 000050f0: 732f 6865 6c6c 6f2e 6469 722f 6d61 696e  s/hello.dir/main
 00005100: 2e63 7070 2e6f 005f 6d61 696e 005f 5f5a  .cpp.o._main.__Z
...
@@ -1336,15 +1336,15 @@
...
 000053c0: 6962 6c65 2d62 7569 6c64 732f 7361 6e64  ible-builds/sand
 000053d0: 626f 782f 6865 6c6c 6f5f 776f 726c 645f  box/hello_world_
-000053e0: 6465 6275 672f 7372 6341 2f62 7569 6c64  debug/srcA/build
+000053e0: 6465 6275 672f 7372 6342 2f62 7569 6c64  debug/srcB/build
 000053f0: 2f6c 6962 4865 6c6c 6f4c 6962 2e61 2868  /libHelloLib.a(h
 00005400: 656c 6c6f 5f77 6f72 6c64 2e63 7070 2e6f  ello_world.cpp.o
 00005410: 2900 5f5f 5a4e 3130 4865 6c6c 6f57 6f72  ).__ZN10HelloWor
...
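When diffoscope is not at hand, a rough first check is to search a binary for the build directory string directly. The following is only a minimal sketch: the helper names find_embedded and scan_binary are hypothetical, and it assumes the path is stored as plain bytes, as in the dump above.

```python
from __future__ import annotations

from pathlib import Path


def find_embedded(data: bytes, needle: bytes) -> list[int]:
    """Return every byte offset at which `needle` occurs in `data`."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        # Continue searching one byte past the previous hit.
        start = data.find(needle, start + 1)
    return offsets


def scan_binary(binary: Path, build_dir: str) -> list[int]:
    """Offsets in `binary` at which the `build_dir` string is embedded."""
    return find_embedded(binary.read_bytes(), build_dir.encode())
```

For example, scan_binary(Path("srcA/build/hello"), "/srcA") would list the offsets where the srcA folder name leaked into the executable. An empty list suggests that no such path was embedded in plain form.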
Possible solutions
Again, the solution depends on the compiler used:
- msvc has no option to keep this information out of the binaries. The only way to get reproducible binaries is, again, to use a patching tool to strip this information during the build step. Note that because we are patching the binaries, the folders used for the different builds must have the same length in characters.
- gcc has three compiler flags to work around this problem:
  - -fdebug-prefix-map=OLD=NEW strips directory prefixes from debug information.
  - -fmacro-prefix-map=OLD=NEW is available since gcc 8 and addresses the irreproducibility caused by the use of the __FILE__ macro.
  - -ffile-prefix-map=OLD=NEW is available since gcc 8 and is the union of -fdebug-prefix-map and -fmacro-prefix-map.
- clang has supported -fdebug-prefix-map=OLD=NEW since version 3.8, and support for the other two flags was in progress for future versions at the time of writing.
The best way to solve this problem is to add flags to the compiler options. When using CMake:
target_compile_options(target PUBLIC "-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.")
The order of files in the build system
File order can become a problem if directories are read to build the list of their files. For example, Unix does not guarantee a deterministic order in which readdir() and listdir() return the contents of a directory, so relying on these functions to feed the build system can lead to non-deterministic builds.
The same problem occurs if your build system stores the files for the linker in a container that can return its elements in a non-deterministic order (for example, a Python set, or a regular dictionary in Python versions before 3.7). The files will then be linked in a different order each time, producing different binaries.
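A common remedy is to sort the file list explicitly before handing it to the build. The sketch below is illustrative only: list_sources is a hypothetical helper, not part of any build system mentioned here.

```python
import os


def list_sources(directory, suffix=".cpp"):
    """Return the matching file names in a stable, sorted order."""
    # os.listdir() returns entries in arbitrary order, so never use it as is.
    names = [name for name in os.listdir(directory) if name.endswith(suffix)]
    # sorted() on str compares code points, so it does not depend on the locale.
    return sorted(names)
```

Sorting at the point where the list is produced means that every later step (compiling, archiving, linking) sees the files in the same order on every machine.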
We can simulate this problem by rearranging the files in CMake. If we modify the previous example to have more than one source file for the library:
.
├── CMakeLists.txt
├── CMakeListsA.txt
├── CMakeListsB.txt
├── hello_world.cpp
├── hello_world.hpp
├── main.cpp
├── sources0.cpp
├── sources0.hpp
├── sources1.cpp
├── sources1.hpp
├── sources2.cpp
└── sources2.hpp
We can see that the compilation results differ if we change the order of the files in CMakeLists.txt:
cmake_minimum_required(VERSION 3.0)
project(HelloWorld)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
add_library(HelloLib hello_world.cpp
                     sources0.cpp
                     sources1.cpp
                     sources2.cpp)
add_executable(hello main.cpp)
target_link_libraries(hello HelloLib)
If we run two consecutive builds named A and B, swapping sources0.cpp and sources1.cpp in the list of files, we get the following checksums:
30ab264d6f8e1784282cd1a415c067f2 helloA
cdf3c9dd968f7363dc9e8b40918d83af helloB
707c71bc2a8def6885b96fb67b84d79c hello_worldA.cpp.o
707c71bc2a8def6885b96fb67b84d79c hello_worldB.cpp.o
694ff3765b688e6faeebf283052629a3 sources0A.cpp.o
694ff3765b688e6faeebf283052629a3 sources0B.cpp.o
0db24dc6a94da1d167c68b96ff319e56 sources1A.cpp.o
0db24dc6a94da1d167c68b96ff319e56 sources1B.cpp.o
fd0754d9a4a44b0fcc4e4f3c66ad187c sources2A.cpp.o
fd0754d9a4a44b0fcc4e4f3c66ad187c sources2B.cpp.o
baba9709d69c9e5fd51ad985ee328172 libHelloLibA.a
72641dc6fc4f4db04166255f62803353 libHelloLibB.a
The .o object files are identical, but the .a libraries and the executables are not. This is because the order of insertion into the library depends on the order in which the files are listed.
Compiler created randomness
This problem occurs, for example, in gcc when Link-Time Optimization is enabled (with the -flto flag). This option injects randomly generated names into the binaries. The only way to avoid the problem is to use the -frandom-seed flag. It provides a seed that gcc uses where it would otherwise use random numbers. The seed is used to generate certain symbol names, which must be different in each compiled file. It is also used to place unique stamps in coverage data files and in the object files that produce them. The value should be different for each source file. One option is to set it to the checksum of the file, so that the chance of a collision is very low. For example, in CMake this can be done as follows:
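set(LIB_SOURCES
    ./src/source1.cpp
    ./src/source2.cpp
    ./src/source3.cpp)

# Seed gcc per source file with the first 8 hex digits of the file's SHA1.
foreach(_file ${LIB_SOURCES})
  file(SHA1 ${_file} checksum)
  string(SUBSTRING ${checksum} 0 8 checksum)
  # The leading space keeps the flag separate from any flags already set.
  set_property(SOURCE ${_file} APPEND_STRING PROPERTY
               COMPILE_FLAGS " -frandom-seed=0x${checksum}")
endforeach()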
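For build systems driven from a script rather than CMake, the same seed can be derived in a few lines. This is only a sketch of the idea above: random_seed_flag is a hypothetical helper, and it assumes the seed is taken from the SHA-1 of the file contents, as file(SHA1 ...) does.

```python
import hashlib


def random_seed_flag(source_bytes: bytes) -> str:
    """Build a -frandom-seed flag from the first 8 hex digits of the SHA-1."""
    checksum = hashlib.sha1(source_bytes).hexdigest()[:8]
    return "-frandom-seed=0x" + checksum
```

Because the seed depends only on the file contents, it stays stable between builds and changes only when the source itself changes.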
Some tips for using Conan
Conan hooks can help us make our builds reproducible. This feature allows you to customize the behavior of the client at certain points.
One way to use hooks is to set environment variables at the pre_build stage. In the example below, the set_environment function is called there, and the environment is then restored in the post_build step using reset_environment.
import os

# Both functions are methods of the hook's helper class (note the self parameter).
def set_environment(self):
    if self._os == "Linux":
        # Remember the previous value so that it can be restored later.
        self._old_source_date_epoch = os.environ.get("SOURCE_DATE_EPOCH")
        timestamp = "1564483496"
        os.environ["SOURCE_DATE_EPOCH"] = timestamp
        self._output.info("set SOURCE_DATE_EPOCH: {}".format(timestamp))
    elif self._os == "Macos":
        os.environ["ZERO_AR_DATE"] = "1"
        # timestamp is not defined in this branch, so log the value actually set.
        self._output.info("set ZERO_AR_DATE: {}".format(os.environ["ZERO_AR_DATE"]))

def reset_environment(self):
    if self._os == "Linux":
        if self._old_source_date_epoch is None:
            del os.environ["SOURCE_DATE_EPOCH"]
        else:
            os.environ["SOURCE_DATE_EPOCH"] = self._old_source_date_epoch
    elif self._os == "Macos":
        del os.environ["ZERO_AR_DATE"]
Hooks can also be useful for patching binaries at the post_build stage. There are various tools for analyzing and patching binary files, such as ducible, pefile, pe-parse or strip-nondeterminism. An example hook for patching a PE binary using ducible might be:
class Patcher(object):
    ...

    def patch(self):
        if self._os == "Windows" and self._compiler == "Visual Studio":
            for root, _, filenames in os.walk(self._conanfile.build_folder):
                for filename in filenames:
                    filename = os.path.join(root, filename)
                    if ".exe" in filename or ".dll" in filename:
                        self._patch_pe(filename)

    def _patch_pe(self, filename):
        patch_tool_location = "C:/ducible/ducible.exe"
        if os.path.isfile(patch_tool_location):
            self._output.info("Patching {} with md5sum: {}".format(filename, md5sum(filename)))
            self._conanfile.run("{} {}".format(patch_tool_location, filename))
            self._output.info("Patched file: {} with md5sum: {}".format(filename, md5sum(filename)))

...

def pre_build(output, conanfile, **kwargs):
    lib_patcher.init(output, conanfile)
    lib_patcher.set_environment()

def post_build(output, conanfile, **kwargs):
    lib_patcher.patch()
    lib_patcher.reset_environment()
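The hook above calls an md5sum helper whose definition is elided in the listing. One plausible way to write it is sketched below. This is an assumption about what such a helper might look like, not the original implementation, and the chunk_size parameter is hypothetical.

```python
import hashlib


def md5sum(filename, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(filename, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat even for large executables and libraries. Logging the digest before and after patching, as the hook does, makes it easy to confirm that the tool actually changed the file.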
Conclusions
Deterministic builds are a hard problem that is closely tied to the operating system and toolchain used. This introduction should help you understand the most common causes of non-determinism and how to address them.