When a compiler is used to compile a piece of software, how do we verify that the compiler can be trusted? Is it known who compiled the compiler itself? Compilers are usually not built from source, and even when they are, they are seeded from a binary that is itself opaque and difficult to verify. So how does one check that the supply chain integrity of the compiler is intact before we even get to building software with it?
Compiler supply chains are obscured and at many points seeded from binaries, which makes it nearly impossible to verify their integrity. In 1984, Ken Thompson wrote ["Reflections on Trusting Trust"](https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf) and illustrated that a compiler can modify software during compilation in order to compromise it. Put simply, this means that reviewing the source code is not enough. We need to be sure that the compiler itself isn't compromised, as it could be used to modify the intended behavior of the software.
What about the software that's built using the compiler? Has the source code been modified during compilation? Has the resulting binary been tampered with, perhaps in a CI/CD runner whose OS has a vulnerability in one of its sub-dependencies, or on a server host that has been compromised by attackers who have gained control of the infrastructure? These are difficult software supply chain security issues which are often swept under the rug, or completely overlooked due to lack of understanding. The bottom line is that in order to eliminate this attack surface, we need good answers to these questions, and more importantly we need tooling and practical methods which can help close these gaps in the supply chain.
This line of questioning becomes especially concerning in the context of widely used software such as images pulled from DockerHub, package managers, and Linux distributions. Software procured via these channels is pervasive in almost all software stacks, and as such presents a severe attack vector. If the maintainer of a widely used DockerHub image has their machine compromised, or is coerced or forced under duress to insert malicious code into the binaries they are responsible for, in most cases there is no effective measure in place to detect this, and millions of downstream consumers can be impacted. Imagine what would happen if the maintainer of a default DockerHub image of a widely used language was compromised, and the binary they released had a backdoor in it. The implications would be extremely far-reaching and disastrous.
There are two distinct problems at hand which share a solution:
1. How do we ensure that we can trust the toolchain used to build software?
2. How do we ensure that we can trust software built with the toolchain?
The answer to both questions is the same: verifiability and determinism. To be clear, we are not trying to solve the problem of the source code itself being compromised. If the source code is compromised, determinism does not help prevent that. But if the code has been reviewed and verified as secure, then determinism and multiple independent reproductions of the software add a set of excellent guarantees.
Deterministically built software is simply software that always compiles to the same bit-for-bit identical binary. This makes it trivial to check the integrity of the binary: if the binary is always the same, we can use hashing to ensure that nothing about it has changed. Most software is non-deterministic because minor differences, such as timestamps, are introduced during the build process. By pinning all aspects of the environment the software is built in, and removing any changing factors such as time, we can force the output to always be bit-for-bit the same.

Now imagine a developer who is not compiling their software deterministically. Each time they build, they have no easy way to verify whether the binary changed in a meaningful way compared to the previous one without low-level inspection. With determinism, it's as easy as hashing one binary, repeating the compilation, hashing the second result, and comparing the two.

This is great, but it's still not enough to ensure that the binary can be trusted, because there may be malware which always modifies the binary in the same manner. To mitigate this, we can build the software on multiple different machines, ideally by different entities, using different operating systems and even different hardware, as it's much less likely that multiple diverse stacks and individuals are compromised by the same malware or attacker. This makes it far less likely that a modification during compilation goes undetected. To establish that the hashes produced by different entities can themselves be trusted, we can use cryptographic signing, as is customary for many software releases.
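To make the timestamp problem and the hash comparison concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `build_artifact` is a hypothetical stand-in for a real build step, and gzip is used simply because its header embeds a modification time, much like many real build outputs do.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a build artifact as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_artifact(source: bytes, timestamp: int) -> bytes:
    """Hypothetical stand-in for a build step.

    gzip writes `timestamp` into its header, so the output depends on
    the time of the build as well as on the source.
    """
    return gzip.compress(source, mtime=timestamp)


source = b'fn main() { println!("hello"); }\n'

# Two builds of identical source at different times: the embedded
# timestamp differs, so the artifacts and their hashes differ.
first = build_artifact(source, timestamp=1_700_000_000)
second = build_artifact(source, timestamp=1_700_000_060)
print(sha256_hex(first) == sha256_hex(second))  # False

# Pinning the timestamp removes that source of variation, so a rebuild
# can be checked against the original with a single hash comparison.
pinned_a = build_artifact(source, timestamp=0)
pinned_b = build_artifact(source, timestamp=0)
print(sha256_hex(pinned_a) == sha256_hex(pinned_b))  # True
```

Real builds have many more sources of variation than a single timestamp (file ordering, locale, build paths, and so on), but the verification step stays the same: hash, rebuild, hash again, compare.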
To assess how far existing package managers and distributions have gone to mitigate the risks described above, we took a hard look at the usual suspects.
Alpine is the most popular Linux distribution (distro) in container-land. It has made great strides in providing a minimal `musl`-based distribution with reasonable security defaults, and it is suitable for a lot of use cases. However, in the interest of developer productivity and low friction for contributors, none of it is cryptographically signed.
Debian (and derivatives like Ubuntu) is one of the most popular options for servers, is largely reproducible, and signs all packages. However, being `glibc`-based with a focus on compatibility and desktop use cases, it pulls in a huge number of dependencies for almost any software run on it, enacts partial code freezes for long periods between releases, and often has very stale packages as various compatibility goals block updates. This overhead creates a lot of surface area for malicious code to hide in. Unfortunately, due to its design, building software deterministically on Debian requires each and every repo to keep costly snapshots of all dependencies in order to reproduce build containers, because Debian packages are archived and retired after some time to servers with extremely low bandwidth. This creates a lot of friction for teams, who often have to archive hundreds of .deb files for every project. Debian also ships very old versions of things like Rust, a common requirement, which can be quite problematic for teams who want access to the latest language features. Even with all this work, Debian does not have truly reproducible Rust (more on that in a bit), and packages are signed only by single maintainers, whom we have to fully trust not to have released a compromised binary.
Fedora (and RedHat-based distros) also sign all packages, but otherwise suffer from similar one-size-fits-all bloat problems as Debian with a different coat of paint. Additionally, their reliance on centralized builds has been used as justification for not pursuing reproducibility at all, which makes them a non-starter for security-focused use cases.
Arch has very fast updates as a rolling release distro, and its package definitions are signed and often reproducible. However, they change from one minute to the next, which still leaves software that is built on Arch and requires determinism with the challenge of pinning and archiving sets of dependencies that work well together.
Nix is almost entirely reproducible by design and allows for lean and minimal output artifacts. It is also a big leap forward in having good separation of concerns between privileged immutable and unprivileged mutable spaces. However, like Alpine, it has no maintainer-level signing, in order to reduce friction for hobbyists who want to contribute.
Guix is reproducible by design as well, borrowing a lot from Nix. It also does maintainer-level signing like Debian. It comes the closest to the solution we need, but it only provides single-signed package contributions, and it has a `glibc` base with a large dependency tree and a significant footprint of tooling to review and understand before one can form confidence in it. This is more overhead than we want or need for use cases like container builds of software, lean embedded operating systems, or any sensitive system where we want the utmost level of supply chain security assurance.
For those whose goal is to build their own software packages deterministically with high portability, maintainability, and maximally easy supply chain auditability, none of these solutions hit the mark.
Reflecting on these issues, we concluded we want the `musl`-based container-ideal minimalism of Alpine, the obsessive determinism and full-source supply chain goals of Guix, and a step beyond the single-sig signed packages of Debian, Fedora, and Arch. We also concluded that we want a fully verifiable bootstrapped toolchain, consisting of a compiler and the accompanying libraries required for building most modern software.
You know where this is going. Here is where we made the totally reasonable and not-at-all-crazy choice to effectively create…
## Yet *Another* Linux Distribution
Let’s take a look at some of the features we care about most, to make it clearer why nothing else hit the mark for us.
A comparison of `stagex` to other distros in some of the areas we care about:
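| Distro | Containers | Signatures | Libc | Bootstrapped | Reproducible | Rust deps |
|--------|------------|------------|------|--------------|--------------|-----------|
| Debian | Adapted | 1 Human | Glibc | No | Partial | 232 |
| Arch | Adapted | 1 Human | Glibc | No | Partial | 262 |
| Fedora | Adapted | 1 Bot | Glibc | No | No | 166 |
| Alpine | Adapted | None | Musl | No | No | 32 |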
We are leaving out hundreds of distros here, but at the risk of starting a holy war, we felt it was useful to compare a few popular options against the goals of the minimal, container-first, security-first, deterministic distro we put together.
We are not the first to go down this particular road. In fact, the TalosOS project built their own tiny containerized toolchain from gcc to golang as the base for their own minimal immutable k8s distro.
Getting all the way to bootstrapping Rust, however, is a much bigger chunk of pain, as we learned…
## The Oxidation Problem - Bootstrapping Rust
Getting from gcc all the way to golang was mostly pain-free, thanks to Google documenting this path well and providing all the tooling to do it. One only needs 3 versions of golang to get all the way back to GCC.
Bootstrapping Rust, however, is a bit of an ordeal. People love Rust for its memory safety and strictness, but we have to admit supply chain integrity is not an area where it excels. This is mostly because Rust changes so much from one release to the next that a given version of Rust can only ever be built with its immediate predecessor.
If one follows the chicken-and-egg problem far enough, the realization dawns that in most distros the chicken comes first. Most include a non-reproducible “seed” Rust binary, presumably compiled by some member of the Rust team, use that to build the next version, and carry on from there. This means even some of the distros that -say- their Rust builds are reproducible have a pretty big asterisk. We won’t call anyone out - you know who you are.
Granted, even if you were to build all the way up from the OCaml roots of Rust (if you can find that code and then get it to build), you still have to have a trusted OCaml compiler. Software supply chains are hard, and we always end up back at the famous Trusting Trust Problem.
There have been some amazing efforts by the Guix team to bootstrap GCC, and the entire package chain after it, from a tiny human-auditable blob of x86 assembly via the GNU Mes project. That is probably in the cards for our stack as well, but for the short term we wanted to at least go as low in the stack as GCC, like we do with Go, which is already a sizable effort. Thankfully John Hodge (mutabah), a brilliant (crazy?) member of the open source community, created “mrustc”, which implements a minimal semi-modern Rust 1.54 compiler in C++, largely from transpiled Rust code. It is missing a lot of critical features, which makes it unsuitable for direct use, but it -does- support enough features to compile the official Rust 1.55 sources, which can compile Rust 1.56, and so on. This is the path Guix and Nix both went down, and we are following their lead here.
Mrustc, however, lacked support for musl libc, which threw a wrench in things. After a fair bit of experimentation we were able to patch in musl support and get it upstream.
The result is that we now have the first deterministic musl-based Rust compiler bootstrapped all the way back to 256 bytes of assembly, and you can reproduce our builds right now from any OS that can run Docker.
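The version-by-version chain described above can be sketched in a few lines of Python. This is only an illustration of the ordering, not our build tooling: the function name `bootstrap_chain` and the example target version are hypothetical, and the sketch assumes the rule stated earlier that each Rust release is built by its immediate predecessor, starting from mrustc building the 1.55 sources.

```python
from typing import List


def bootstrap_chain(target_minor: int, first_minor: int = 55) -> List[str]:
    """Return the ordered compilers needed to reach Rust 1.<target_minor>.

    Assumes mrustc builds the official 1.<first_minor> sources, and that
    every later release is built by the release immediately before it.
    """
    if target_minor < first_minor:
        raise ValueError("target is older than the first release mrustc can build")
    return ["mrustc"] + [f"rustc 1.{minor}" for minor in range(first_minor, target_minor + 1)]


# Hypothetical target: each entry is built by the entry before it.
for builder, built in zip(bootstrap_chain(58), bootstrap_chain(58)[1:]):
    print(f"{builder} -> {built}")
```

The length of the list grows by one full compiler build for every Rust release, which is why this path is so much more expensive than the three-step Go bootstrap.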
To demonstrate in practical terms how determinism can be used to prevent real-world attacks, let's consider a major breach which could have been prevented.
SolarWinds experienced a major security breach in which Russian threat actors were able to compromise their infrastructure and piggyback on their software in order to distribute malware to their entire client base. The attackers achieved this by injecting malicious code into SolarWinds products such as the Orion Platform, which was then downloaded by end users. This seems like a very difficult thing to protect against, but there is a surprisingly simple solution. If SolarWinds had leveraged deterministic builds of their software, they would have been able to detect that the binaries they were delivering to their clients had been tampered with. There are a few ways they could have gone about this, but without getting too deep into implementation details, it would have sufficed to have multiple runners in different isolated environments, or even on different cloud platforms, reproduce the deterministic build and compare the resulting hashes in order to verify that the binaries had not been tampered with. If any of the systems built the software and got a different hash, that would have been a clear signal that further investigation was needed, which would likely have led to the detection of the intruder.

Without this approach, SolarWinds was completely unaware of their systems being infiltrated for months, and during this period large quantities of end user data were exfiltrated, along with their tooling. Considering that SolarWinds is a cybersecurity software and services provider, the tools stolen from them were then likely used to further develop the attacker's capabilities to avoid detection, and even weaponized.
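The cross-runner check described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any real pipeline: the runner names, the `safe_to_release` and `group_by_hash` helpers, and the threshold of three runners are all hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_by_hash(reports: Dict[str, str]) -> Dict[str, List[str]]:
    """Group runner names by the artifact hash each one reported."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for runner, digest in sorted(reports.items()):
        groups[digest].append(runner)
    return dict(groups)


def safe_to_release(reports: Dict[str, str], min_runners: int = 3) -> bool:
    """Release only if enough independent runners reported the exact same hash."""
    return len(reports) >= min_runners and len(set(reports.values())) == 1


# Stand-ins for the hashes of a clean build and a tampered build.
clean = hashlib.sha256(b"release artifact").hexdigest()
tampered = hashlib.sha256(b"release artifact + implant").hexdigest()

# Hypothetical runners in isolated environments reproducing the same build.
reports = {
    "runner-cloud-a": clean,
    "runner-cloud-b": clean,
    "runner-on-prem": tampered,
}

print(safe_to_release(reports))  # False: at least one runner disagrees
for digest, runners in group_by_hash(reports).items():
    print(digest[:12], runners)
```

Note that a mismatch does not say which runner is compromised, only that at least one of them produced something different. That alone is the signal to stop the release and investigate.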
These initial efforts were predominantly sponsored with financial and engineering time contributions from Distrust, Mysten Labs, and Turnkey, who all share the threat models and container-driven workflows Stagex is designed to support.
While we all have a vested interest in helping maintain it, we all felt it important that this project stand on its own and belong to the community. We are immensely appreciative of the volunteers who have very quickly dived in and started making significant contributions and improvements.
As of writing this, Stagex has 100+ packages covering some of the core software you may be using regularly, all built using the deterministically built toolchain, and of course the software itself also built deterministically. Some of the packages include `rust`, `go`, `nodejs`, `python3.8`, `curl`, `bash`, `git`, `tofu` and many more.
We would like to support building with `buildah` and `podman` for build-tooling diversity. We would also love help from the open source community to see GCC bootstrapped all the way down to x86 assembly via Mes. This may require using multiple seed distro containers working in parallel to ensure we don’t have a single provenance source for even that layer.