This tool makes it easy to capture the steps taken during the build process of a software project. This can be very useful for:
- Understanding build processes
- Debugging obscure build problems
- Applying static analysis and verification tools
In the first two cases, this tool gives a low-level view of exactly the set of files accessed by the build process (e.g., fully resolving all file includes and relative paths) in a way that is difficult to achieve by merely reading and understanding a build system. In a sense, it identifies the bill of materials for software.
In the third case, assurance tools generally require rebuilding programs in special modes or with alternative compilers (e.g., into LLVM bitcode for analysis or instrumentation). Doing so is typically labor intensive, as it requires extensive work to understand an existing build system, and more work still to modify it. This tool provides a way to apply analysis tools in a build system agnostic way.
This tool is primarily designed to help tame the myriad build systems of the C/C++ ecosystem, but it applies to any software project with a build step.
This tool wraps your normal build command and builds LLVM bitcode when possible. It arranges things such that any binary artifacts produced by your build system (e.g., object files, archives, shared libraries, or binaries) have their LLVM bitcode attached to them and accessible. The workflow of this tool proceeds in two phases:
- Building your software (using the generate-bitcode command wrapper)
- Extracting your bitcode (using the extract-bitcode command)
The tool also supports some auxiliary commands for generating traces of builds for visualization and understanding.
An example use of the tool is shown below for a make-based build:
$ build-bom generate-bitcode -- make
$ build-bom extract-bitcode /path/to/binary --output=/tmp/output.bc
In the first step, the tool acts as a wrapper around the real build system. It runs the build system and, if it observes any compilation commands, it runs an extra build of the source file using clang to generate bitcode. It attaches bitcode to object files, and then resumes the build.
In the next step, the tool extracts all of the accumulated bitcode.
The generate-bitcode command has a number of options that may be useful in various contexts.
- --bc-out: Directory to place LLVM bitcode (bc) output data
- --clang: Specify the full name (or path if desired) of a clang binary to use
- --objcopy: Specify the full name (or path if desired) of the objcopy binary to use
- --suppress-automatic-debug: By default, build-bom automatically adds the -g flag when building bitcode to generate debug information; this flag inhibits this behavior
- --strict: Generate strictly adhering bitcode: leaves in compilation arguments (e.g., optimization, arch-specific flags, etc.) that are normally removed because they might be problematic for clang.
- --inject-argument=STRING: Directs build-bom to inject an additional argument (STRING) into the command line for the command used to build bitcode (e.g., to configure the optimization level or level of debug information); can be specified multiple times
- --remove-argument=REGEX: Directs build-bom to remove any argument matching the regular expression from the argument list when generating bitcode; can be specified multiple times
- --preproc-native: Performs the preprocessing step using the compiler native to the build, and then compiles the result with clang. This can be very useful for cross-compilers whose header files are incompatible with clang: the pre-processing step performs all of the #include and #define operations in the context of the cross-compiler, and the result should be C code that can then be turned into LLVM IR bitcode by clang.
Note that --suppress-automatic-debug could be useful in cases where the generated bitcode is disruptively large due to the presence of unneeded debug information. Since debug information is useful in most cases, however, it is generated by default.
The --remove-argument option can be used to remove arguments that inhibit analysis (e.g., -O3 may apply optimizations that are annoying for a static analysis, so it could be removed). Note that build-bom does not add any anchors to the beginning (e.g., ^) or end (e.g., $) of the regular expression it is given, so users will likely want to specify them manually as needed. The regular expressions are matched against each argument as seen by execve, so conjoined single-argument flags like --foo=bar count as a single flag that could be matched against, while --foo bar appears as two separate entries in the argument list seen by execve. Without explicit regex anchors, build-bom allows the specified regex to match anywhere in each argument.
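The effect of anchoring can be sketched in a few lines of Python. This is only an illustration of the matching rule described above, not build-bom's implementation: the helper name remove_matching and the sample argument list are hypothetical, and Python's re module stands in for whatever regex engine build-bom uses (the two may differ in syntax details).

```python
import re

def remove_matching(args, patterns):
    """Drop every argument that any pattern matches anywhere (unanchored search)."""
    compiled = [re.compile(p) for p in patterns]
    return [a for a in args if not any(c.search(a) for c in compiled)]

# Arguments as execve would see them: "--foo=bar" is one entry, "--foo bar" is two.
args = ["-O3", "-c", "--foo=bar", "--foo", "bar", "-DMODE=-O3", "main.c"]

# Unanchored: the pattern also matches inside "-DMODE=-O3", so both are removed.
loose = remove_matching(args, ["-O3"])

# Anchored: only the argument that is exactly "-O3" is removed.
strict = remove_matching(args, ["^-O3$"])

# Anchored on "--foo": removes the standalone flag, but neither its separate
# value "bar" nor the conjoined "--foo=bar".
flag_only = remove_matching(args, ["^--foo$"])

print(loose)
print(strict)
print(flag_only)
```

Note that removing a flag passed as two entries (--foo bar) would need a second pattern for the value, since each argument is matched independently.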
The extract-bitcode command also provides options:
- --llvm-link: Specify a name or path to the llvm-link binary; this is useful if LLVM commands are versioned on your system
- --objcopy: Specify a name or path for the objcopy binary.
The tool uses low-level operating system services to observe builds and record their actions. On Linux, it uses ptrace to observe every system call. When a source compilation command is observed, the tool generates the corresponding bitcode file using clang. It attaches the bitcode to the object file via a separate ELF section, allowing bitcode to be accumulated as a side effect of the build. At every stage, bitcode remains attached to build artifacts to ensure it is not lost.
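As a rough sketch of what this extra step amounts to for one compilation, the helper below only constructs the two command lines (it does not run them). The function name, the ".bc" output naming, and the section name .llvm_bitcode are assumptions made for illustration; the exact commands, flags, and section name build-bom uses may differ. The clang flags -emit-llvm -c and the objcopy option --add-section are standard options of those tools.

```python
def bitcode_commands(source, obj, clang="clang", objcopy="objcopy",
                     section=".llvm_bitcode"):
    """Build (but do not run) the commands for one observed compilation.

    `section` is a hypothetical ELF section name used only for this sketch.
    """
    bitcode = obj + ".bc"
    # Compile the same source a second time, emitting LLVM bitcode.
    compile_cmd = [clang, "-emit-llvm", "-c", source, "-o", bitcode]
    # Attach the bitcode file to the object file as an extra ELF section.
    attach_cmd = [objcopy, "--add-section", f"{section}={bitcode}", obj]
    return compile_cmd, attach_cmd

compile_cmd, attach_cmd = bitcode_commands("foo.c", "foo.o")
print(" ".join(compile_cmd))
print(" ".join(attach_cmd))
```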
There are four key observations enabling this approach to bitcode collection:
- Whenever we see the original build system compile a C/C++ file, we know we need to make the corresponding bitcode file
- We can attach arbitrary extra data (e.g., bitcode) to object files in extra ELF sections
- ELF sections containing data without special meaning are concatenated by the linker
- Standard tar files can be concatenated to produce a valid tar file that is the union of their contents
We wrap our generated bitcode in singleton tar files and allow the linker to accumulate them for us. When we want to collect aggregated bitcode for executable artifacts, we simply extract the tar file from their special LLVM bitcode ELF sections, extract the collected bitcode, and link it together with llvm-link.
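The tar concatenation property can be checked with Python's standard tarfile module. This is a minimal sketch, not build-bom's code: the member names and payloads are made up, and plain byte concatenation stands in for what the linker does with the ELF sections. One detail worth noting is that each tar file ends with zero-filled blocks, so the reader has to be told to skip them (ignore_zeros=True here, or -i with GNU tar) in order to see members past the first archive.

```python
import io
import tarfile

def singleton_tar(name, payload):
    """Return the bytes of a tar archive containing exactly one member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def members(blob):
    """Read every member of a (possibly concatenated) tar blob."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:",
                      ignore_zeros=True) as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}

# Stand-in for the linker concatenating two sections' contents.
combined = (singleton_tar("a.bc", b"bitcode for a")
            + singleton_tar("b.bc", b"bitcode for b"))
contents = members(combined)
print(sorted(contents))
```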
Observe as well that the build-bom process is useful for selective rebuilds: rebuilding only a portion of the sources will still have access to the LLVM bitcode ELF sections in object files from previous builds. The use of build-bom also has graceful degradation properties: object files that do not have LLVM bitcode sections in their ELF (i.e., built separately without using build-bom) will simply not contribute to the ELF section/tar file accumulation of bitcode; the final llvm-link step during extraction does not need to be total and is tolerant of unresolved symbols.
The bitcode extracted will be representative of the binary code contained in the specified file. It will not necessarily be identical to that code due to strictness flags, differences between clang and the native build compiler, and a different linking step.
- Executable: bitcode for the entirety of the executable, including any static libraries the executable was linked with, but not including any shared libraries (even if they themselves were built with build-bom) or components built outside of a build-bom process.
- Shared library: bitcode for the entirety of the shared library will be extracted, excluding any components of the library built outside of a build-bom process.
- Static library: bitcode will only be available for the last element in the library. This is due to build-bom’s use of objcopy to extract the ELF sections: all LLVM bitcode sections from each member of the static library will be extracted, but they will successively overwrite each other, leaving only the bitcode from the last entry in the library. This is also noted in #42.
This tool is also able to record all relevant system calls into a log. The tracing is designed to capture all of the information necessary to replay a build. It currently doesn’t capture everything (especially file move and directory operations), but will be extended as needed. Beyond system calls, it also captures the environment and working directory of each executed command.
The tool currently supports only Linux, but it is designed to be modular enough to allow separate tracing implementations for macOS and Windows while sharing the rest of the code.
There are a number of tools in the space of build interposition for the purpose of instrumentation, build modification, or bitcode generation. Most are based on acting as wrappers around standard compilers either through explicit modification of the build system or by placing themselves earlier in the PATH as aliases to real build tools.
- Tools like wllvm and gllvm solve the problem of wrapping compiler commands to generate LLVM bitcode, but require manual modifications to the build system in order to invoke them.
- Tools like Bear and blight provide general mechanisms for interposing on build commands by pretending to be a normal compiler earlier in your PATH. Bear additionally provides another mode based on using LD_PRELOAD to hook calls to execve.
- Other tools record builds and replay them
These tools can be very effective, but have some issues with more complex build systems:
- Scripts that wrap compiler commands can have difficulty successfully getting through complex configure scripts that, e.g., do aggressive version sniffing
- While configure script difficulties can sometimes be avoided by configuring with the real compiler and replacing or interposing the real build commands after the fact, it doesn’t always work
- Build systems that record absolute paths at configure time are difficult to modify completely
- Some build systems run additional configure scripts as part of the build process, which are again difficult to pass using interposition
- Using LD_PRELOAD to hook execve can be very effective, but difficult, as some build systems rely on failed execve calls to perform PATH searches; it is difficult to know which commands succeed, as execve never returns in those contexts
- The LD_PRELOAD approach does not work for statically-linked compilers (so Bear has a fallback to wrapper scripts)
- Some types of multi-stage build require that all intermediate results actually be built and be executable (e.g., if a build creates a code generator and uses it for later build stages)
- Replaying builds based solely on compiler commands works for simple builds, but fails when build systems create and delete directories during the build (or make other interesting environmental changes) that make consistent replay very difficult
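To make the PATH search issue from the list above concrete, the following sketch simulates the sequence of execve attempts that a PATH lookup produces. It does not call execve: the helper name and the exists predicate (standing in for "this execve call would succeed") are hypothetical, and the lookup is simplified relative to a real C library. The point is that a hook on execve sees every attempt, including the ones that fail, and a successful call never returns to report its success.

```python
import os

def execvp_attempts(command, path, exists):
    """Simulate a PATH search; return (paths tried in order, path that ran).

    `exists` is a stand-in predicate for "execve on this path succeeds".
    """
    if "/" in command:
        # Commands containing a slash are not searched for in PATH.
        candidates = [command]
    else:
        # An empty PATH entry conventionally means the current directory.
        candidates = [os.path.join(d or ".", command) for d in path.split(":")]
    attempts = []
    for candidate in candidates:
        attempts.append(candidate)  # an execve hook would observe this call
        if exists(candidate):
            return attempts, candidate
    return attempts, None

# Hypothetical filesystem in which only /usr/bin/gcc exists.
fake_fs = {"/usr/bin/gcc"}
attempts, resolved = execvp_attempts(
    "gcc", "/usr/local/bin:/usr/bin:/bin", fake_fs.__contains__)
print(attempts, resolved)
```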
As a whole, these tools tend to require significant effort in build system understanding and modification to work on more complex codebases. The build-bom tool is designed to eliminate any need for build system modification to achieve its goals (primarily LLVM bitcode generation, but potentially arbitrary build modifications). In contrast to the other tools in this space, it monitors and interposes on the build system at the level of ptrace.
- By working at the level of execve, it can observe when real build tools are called, no matter what names the build system thinks they actually have (e.g., if the build system itself uses build tool wrappers)
- By working directly at the syscall level (rather than LD_PRELOAD), it works on both static and dynamically-linked build tools
- By working at the level of execve, build-bom never needs to implement any shell lexing logic, as the shell has already lexed all of the arguments
- By working at the ptrace level, build-bom is able to determine which calls to execve actually succeed
- Moreover, it can delay action until after build steps succeed (since it can observe when execed processes terminate, not just when they are about to start)
- The build-bom tool is able to maintain persistent state for an entire build without external storage, as a single process is able to view all build steps
- Configure scripts are never a problem (at any stage of the build) because the real build always runs
- Multi-stage builds always work because intermediate tools are built and are executable
- It is not possible to take advantage of parallel builds while using this tool, as all system calls in the entire build tree are serialized through a single tracing process
- Build steps that rely on input or output redirection through pipes are very difficult to replicate, since their targets are not observable without modeling the calling process's file descriptor connection logic
Here is a full example on a real codebase:
wget https://ftp.gnu.org/gnu/tar/tar-1.32.tar.gz
tar xf tar-1.32.tar.gz
cd tar-1.32
./configure
# Run the build under the bitcode generator
build-bom generate-bitcode -- make
# Use a suffix on LLVM tools because they are version-suffixed on Ubuntu
build-bom extract-bitcode src/tar --output=../tar.bc --llvm-link=llvm-link-9
- Serious polish required
- Build step dependency analysis for in-order replay
- Add more thorough support for Linux system calls
- Add a 32 bit x86 syscall table
- Add ARM syscall tables
- Explore automated processing of system call argument lists
- Additional tools
- Dependency graph analyzer and visualizer
- A command to list all targets (or all library targets or all executable targets)
- A command to rebuild a target binary with libfuzzer, Address Sanitizer, or Thread Sanitizer
- Add a command to randomly test for potential missing dependencies in build systems
- Automated granular filename tracking (to precisely model renames)
- Fix parallel builds
- Full handling of environment variables
- Additional normalization policies
- Ignore trivial dependencies like ld.so
- Add ability to ignore dynamically loaded library dependencies
- Easier scripting
- macOS backend based on DTrace
- Windows backend
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.