When a compiler is used to compile some piece of software, how do we verify
that the compiler can be trusted? Is it well known who compiled the compiler
itself? Usually compilers are not built from source, and even when they are,
they are seeded from a binary that itself is opaque and difficult to verify.
How does one check if the supply chain integrity of the compiler itself is
intact, even before we get to building software with it?
Compiler supply chains are obscured and at many points seeded from binaries,
making it nearly impossible to verify their integrity. In 1984, Ken Thompson
wrote ["Reflections on Trusting Trust"](https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf)
and illustrated that a compiler can
modify software during the compilation process, compromising the software. Put
simply, this means that reviewing the source code is not enough. We need to be
sure that the compiler itself isn't compromised, as it could be used to modify
the intended behavior of the software.
What about the software that's built using the compiler? Has the source code
been modified during compilation? Has the resulting binary of the software been
tampered with, perhaps in the CI/CD runner which runs an OS with a
vulnerability in one of its sub dependencies? Or perhaps the server host has
been compromised and attackers have gained control of the infrastructure?
These are difficult software supply chain security issues which are often swept
under the rug or completely overlooked due to lack of understanding. To
eliminate this surface area of attack, we need a good answer to these
questions, and more importantly we need tooling and practical methods which can
help close these gaps in the supply chain.
This line of questioning becomes especially concerning in the context of widely
used software, such as images pulled from DockerHub, package managers, and
Linux distributions. Software procured via these channels is used widely and
is pervasive in almost all software, and as such presents a severe attack vector.
If the maintainer of a widely used DockerHub image has their machine
compromised, or is coerced or even forced under duress to insert malicious
code into the binaries they are responsible for, there is in most cases no
effective measure in place to detect and catch this, which can result in
millions of downstream
consumers being impacted. Imagine what would happen if the maintainer of a
default DockerHub image of a widely used language was compromised, and the
binary they released had a backdoor in it. The implications are extremely far
reaching, and would be disastrous.
There are two distinct problems at hand which share a solution:
1. How do we ensure that we can trust the toolchain used to build software?
2. How do we ensure that we can trust software built with the toolchain?
The answer to both questions is the same. We achieve it via verifiability and
determinism. To be clear, we are not trying to solve the problem of the code
itself being compromised in the source. If the source code is compromised,
determinism does not help prevent that. If the code is reviewed and verified as
being secure, then determinism and multiple reproductions of the software
add a set of excellent guarantees.
Deterministically built software is any software which always compiles to the
same bit-for-bit exact binary. This is useful because it makes it trivial to
check the integrity of the binary. If the binary is always the same, we can use
hashing to ensure that nothing about the binary has changed. In practice, minor
differences introduced during the build process, such as timestamps, mean that
most software does not build deterministically by default. By pinning all
aspects of the environment the software is built in and removing any changing
factors such as time and user or machine IDs, we can force the software to
always be bit-for-bit the same.
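
As a minimal sketch of why pinning matters, the snippet below packs the same files into a tar archive twice. The file names, timestamps, and owner names are made up for illustration, and a tar archive stands in for a real compiler output; the point is only that metadata such as build time and user name changes the hash even when the content is identical.

```python
import hashlib
import io
import tarfile


def build_archive(files, mtime, owner):
    """Pack files (name -> bytes) into an uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        # Sort names so the archive layout does not depend on insertion order.
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = mtime  # build time leaks into the artifact here
            info.uname = owner  # and the builder's identity leaks in here
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


# Hypothetical build inputs, identical for every builder.
files = {"bin/app": b"\x7fELF...", "README": b"hello\n"}

# Unpinned: two builders at different times produce different artifacts.
alice = sha256_hex(build_archive(files, mtime=1700000000, owner="alice"))
bob = sha256_hex(build_archive(files, mtime=1700000600, owner="bob"))

# Pinned: time and owner are fixed, so every build is byte-identical.
pinned_a = sha256_hex(build_archive(files, mtime=0, owner=""))
pinned_b = sha256_hex(build_archive(files, mtime=0, owner=""))

print("unpinned builds match:", alice == bob)
print("pinned builds match:  ", pinned_a == pinned_b)
```

The unpinned digests differ while the pinned ones agree. Real build systems need the same treatment for many more inputs (compiler versions, locale, file ordering, and so on), which this sketch does not attempt to cover.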
Now, imagine a scenario where a developer is compiling software, and they are
not doing it deterministically. Any time they build the software, they have no
way to easily verify if the binary changed in a meaningful way compared to the
previous one without doing low level inspection. With determinism, it's as
simple as hashing one binary, repeating the compilation, hashing the second
result, and comparing it with the original. This is great, but it's still not
enough to ensure that the binary can be trusted, as there may be malware which
always modifies the binary in the same manner. To mitigate this, we can build
the software on multiple different machines, ideally by different maintainers,
using different operating systems and even different hardware, as it's much
less likely that multiple diverse stacks and individuals are compromised by the
same malware or attacker. Following this process, we can greatly reduce the
risk of modification during compilation going undetected. To add assurance that
the published hashes themselves are authentic, we can use cryptographic
signing, as is customary for many software releases.
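
The comparison step described above can be sketched as follows. This is an illustration under stated assumptions, not how any particular project implements it: the builder names, the `required` threshold, and the `ReproductionError` type are all hypothetical, and signature verification of each reported digest is left out.

```python
import hashlib
from collections import Counter


class ReproductionError(Exception):
    """Raised when independent builds cannot be shown to agree."""


def digest(binary):
    return hashlib.sha256(binary).hexdigest()


def verify_reproductions(reported, required=2):
    """Return the agreed digest, or raise if the builders disagree.

    reported maps a builder name to the digest that builder produced.
    """
    if len(reported) < required:
        raise ReproductionError(
            f"need {required} independent builds, got {len(reported)}"
        )
    counts = Counter(reported.values())
    if len(counts) > 1:
        # Any disagreement is a red flag. The most common digest is not
        # automatically the honest one, so a human has to investigate.
        common, _ = counts.most_common(1)[0]
        outliers = sorted(b for b, d in reported.items() if d != common)
        raise ReproductionError(f"digest mismatch, investigate: {outliers}")
    return next(iter(counts))


# Hypothetical results from three independently operated builders.
artifact = b"deterministic build output"
reported = {
    "builder-a": digest(artifact),
    "builder-b": digest(artifact),
    "builder-c": digest(artifact),
}
print("agreed digest:", verify_reproductions(reported, required=3))
```

Treating every mismatch as a failure, rather than taking a majority vote, is a deliberate choice here: with deterministic builds, a single differing hash already means something unexpected happened on at least one machine.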
To assess the current state of software package managers and Linux
distributions, and how far they have gone to mitigate these risks, we analyzed
several popular projects:
Alpine is the most popular Linux distribution (distro) in the container
ecosystem. It has made great strides in providing a minimal `musl`-based
distribution with reasonable security defaults and is suitable for many use
cases. However, in the interest of developer productivity and low friction for
contributors, none of it is cryptographically signed.
Debian (and derivatives like Ubuntu) is one of the most popular options for
servers. It is largely reproducible and also signs all packages. However, being
`glibc` based with a focus on compatibility and desktop use cases, it results
in a huge number of dependencies for almost any software run on it, enacts
partial code freezes for long periods of time between releases, and often has
very stale packages as various compatibility goals block updates. This overhead
introduces a lot of surface area for malicious code to hide in. Unfortunately,
due to its design, when building software deterministically on this OS, each
and every repo needs to keep costly snapshots of all dependencies to reproduce
build containers, as Debian packages are archived and retired after some time
to servers with low bandwidth. This creates a lot of friction for teams who, as
a result, often have to archive hundreds of .deb files for every project.
Debian also ships very old versions of software such as Rust, which is a common
requirement. This can be quite problematic for teams who want access to the
latest language features. Even with all this work, Debian does not have truly
reproducible Rust (which will be discussed later in this post), and packages
are signed only by single maintainers, whom we have to fully trust not to have
released a compromised binary.
Fedora (and RedHat based distros) also sign all packages, but otherwise suffer
from similar one-size-fits-all bloat problems as Debian with a different coat
of paint. Additionally, their reliance on centralized builds has been used as
justification for not pursuing reproducibility at all, which makes them a
non-starter for security-focused use cases.
Arch has very fast updates as a rolling release distro, and package definitions
are signed and often reproducible, but they change from one minute to the next,
still resulting in the challenge of having to come up with a solution to pin
and archive sets of dependencies that work well together for software that's
built using it and requires determinism.
Nix is almost entirely reproducible by design and allows for lean and minimal
output artifacts. It is also a big leap forward in having good separation of
concerns between privileged immutable and unprivileged mutable spaces. However,
like Alpine, there is no maintainer-level signing, in order to reduce the
friction for hobbyists who want to contribute.
Guix is reproducible by design as well, borrowing a lot from Nix. It also does
maintainer-level signing like Debian. It comes the closest to the solution we
need, but it only provides single signed package contributions, and a `glibc`
base with a large dependency tree, with a significant footprint of tooling to
review and understand to form confidence in it. This is still more overhead
than we want or need for use cases like container builds of software,
lean embedded operating systems, or any sensitive system where we want the
utmost level of supply chain security assurance.
For those whose goal is to build their own software packages deterministically
with high portability, maintainability, and maximally easy supply chain
auditability, none of these solutions hit the mark.
On reflecting on these issues, we concluded we want the `musl`-based
container-ideal minimalism of Alpine, the obsessive determinism and full-source
supply chain goals of Guix, and a step beyond the single-signature packages of
Debian, Fedora, and Arch. We also concluded that we want a fully verifiable
bootstrapped toolchain, consisting of a compiler and accompanying libraries
required for building most modern software.
You may know where this is going. Here is where we made the totally reasonable
and not-at-all-crazy choice to effectively create…
## Yet *Another* Linux Distribution
Let’s compare some of the features we care about most to make it clearer why
nothing else hit the mark for us.
A comparison of `stagex` to other distros in some of the areas we care about:
| Fedora | Adapted | 1 Bot | Glibc | No | No | 166 |
| Alpine | Adapted | None | Musl | No | No | 32 |
We are leaving out hundreds of distros here, but at the risk of starting a holy
war, we felt it was useful to compare a few popular options for contrast to the
goals of the minimal container-first, security-first, deterministic distro we
put together.
We are not the first to go down this particular road. The Talos Linux
project built their own tiny containerized toolchain from gcc to golang as the
base to build their own minimal immutable k8s distro.
Getting all the way to bootstrapping Rust, however, is a much bigger chunk of
pain, as we learned…
## The Oxidation Problem - Bootstrapping Rust
Getting from gcc all the way to golang was mostly pain-free, thanks to Google
documenting this path well and providing all the tooling to do it. One only
needs 3 versions of golang to get all the way back to GCC.
Bootstrapping Rust is a bit of an ordeal. People love Rust for its memory
safety and strictness; however, we have noticed supply chain integrity is not
an area where it excels. This is mostly because Rust changes so much from one
release to the next that a given version of Rust can only ever be built with
its immediate predecessor.
If one follows the chicken-and-egg problem far enough, the realization dawns
that in most distros the chicken comes first. Most include a non-reproducible
“seed” Rust binary presumably compiled by some member of the Rust team, then
use that to build the next version, and carry on from there. This means
even some of the distros that _say_ their Rust builds are reproducible have a
pretty big asterisk. We won’t call anyone out - you know who you are.
Granted, even if you were to build all the way up from the OCaml roots of Rust
(if you can find that code and then get it to build), you would still require a
trusted OCaml compiler. Software supply chains are hard, and we always end up
back at the famous Trusting Trust Problem.
There have been some amazing efforts by the Guix team to bootstrap GCC and the
entire package chain after it with a tiny human-auditable blob of x86 assembly
via the GNU Mes project. That is probably in the cards for our stack as well;
however, for the short term we wanted to at least go as low in the stack as
GCC, as we do with Go, which is already a sizable effort. Thankfully,
John Hodge (mutabah), a brilliant (crazy?) member of the open source community,
created “mrustc”, which implements a minimal semi-modern Rust 1.54 compiler in
C++ largely from transpiled Rust code. It is missing a lot of critical features
that make it unsuitable for direct use, but it _does_ support enough features
to compile official Rust 1.55 sources, which can compile Rust 1.56 and so on.
This is the path Guix and Nix both went down, and we are taking their lead
here.
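
To give a feel for how long this chain gets, here is a small sketch that lists the build steps from mrustc up to a chosen release. It assumes, as described above, that mrustc compiles the 1.55 sources and that every later release is built by its immediate predecessor with no versions skipped; the function name and the `1.x` numbering shortcut are ours, not part of any real tool.

```python
def bootstrap_chain(target_minor, seed_minor=54):
    """List (builder, built) steps from mrustc up to rustc 1.<target_minor>.

    seed_minor is the Rust version mrustc is compatible with (1.54 here).
    """
    if target_minor <= seed_minor:
        raise ValueError("target must be newer than the mrustc seed version")
    # First step: mrustc (built with a C++ compiler) compiles the next release.
    chain = [("mrustc", f"rustc 1.{seed_minor + 1}")]
    # Every later release is compiled by the one immediately before it.
    for minor in range(seed_minor + 2, target_minor + 1):
        chain.append((f"rustc 1.{minor - 1}", f"rustc 1.{minor}"))
    return chain


for builder, built in bootstrap_chain(58):
    print(f"{builder} builds {built}")
```

Each step is a full compiler build, so the cost of the chain grows with every new Rust release that has to be reached.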
At the time, mrustc lacked support for musl libc, which threw a wrench in
things, but after a fair bit of experimentation we were able to patch in musl
support and get it upstream.
The result is that we now have the first deterministic `musl`-based Rust
compiler bootstrapped from 256 bytes of assembly, and you can reproduce our
builds right now from any OS that can run Docker 26.
## Determinism and Real World Applications
To demonstrate in practical terms how determinism can be used to prevent real
world attacks, let's consider a major breach which could have been prevented.
SolarWinds experienced a major security breach in which Russian threat actors
were able to compromise their infrastructure and piggyback on their software to
distribute malware to their entire client base. The attackers achieved this by
injecting malicious code into SolarWinds products, such as the Orion Platform,
which was then downloaded by the end users. This seems like a very difficult
thing to protect against, but there is a surprisingly simple solution. If
SolarWinds had leveraged deterministic builds of their software, they would
have been able to detect that the binaries they were delivering to their
clients had been tampered with.
There are a few ways they could have gone about this, but without getting too
deep into implementation details, it would have sufficed to have multiple
runners in different isolated environments, or even on different cloud
platforms, reproduce the deterministic build and compare the resulting hashes
to verify the binaries had not been tampered with. If any of the systems built
the software and got a different hash, that would have been a clear signal that
further investigation was needed, which would likely have led to the detection
of the intruder. Without this approach, SolarWinds was completely unaware of
their systems being infiltrated for months, and during this period large
quantities of end user data were exfiltrated, along with their
tooling. Considering SolarWinds is a cybersecurity software and services
provider, the tools stolen from them were then likely used to further develop
and weaponize the attacker's capabilities.
## Future Work
These initial efforts were predominately sponsored with financial and
engineering time contributions from Distrust, Mysten Labs, and Turnkey, who all
share threat models and container-driven workflows Stagex is designed to
support.
While we all have a vested interest to help maintain it, we all felt it
important this project stand on its own and belong to the community and are
immensely appreciative to a number of volunteers that have very quickly dived
in and started making significant contributions and improvements.
As of writing this, Stagex has 100+ packages covering some of the core software
you may be using regularly, all built using the deterministically built
toolchain, and of course the software itself also built deterministically. Some
of the packages include `rust`, `go`, `nodejs`, `python3.8`, `curl`, `bash`,
`git`, `tofu` and many more.
We would like to support building with `buildah` and `podman` for build-tooling
diversity. We would also love help from the open source community to see GCC
bootstrapped all the way down to x86 assembly via Mes. This may require using
multiple seed distro containers to work in parallel to ensure we don’t have a
single provenance source for that layer.
We are also actively working on, and have made some progress towards, the
addition of core packages required to use this distribution as a minimal Linux
OS.
If you need high trust in your own build system, please reach out. We would
love to find a way to collaborate.