feat: add blog #7

Open
anton wants to merge 15 commits from feat/blog into main
1 changed files with 90 additions and 9 deletions
Showing only changes of commit fa751f1d2c - Show all commits


@ -1,6 +1,6 @@
---
layout: post
title: Distrust - Trust But Verify
title: Adventures In Supply Chain Integrity
date: 2024-03-28
cover_image: "../assets/images/whale_shark.jpg"
authors:
@ -8,20 +8,101 @@ authors:
bio: Professional bonker / twerker.
twitter: le twitter
- name: Anton Livaja
bio: Professional .
bio: Professional banana juggler.
twitter: antonlivaja
- name: Lance R. Vick
bio: Dolphin trainer
twitter: no.
---
Bacon ipsum dolor amet porchetta brisket pork loin, cupim pork belly frankfurter landjaeger andouille ground round hamburger corned beef tri-tip short loin. Ribeye andouille bacon pork leberkas doner. Meatloaf capicola brisket hamburger tongue chuck. Tail ham prosciutto, beef ribs beef frankfurter flank strip steak tenderloin.
TODO: explain the mental trap of naive threat modelling versus completely eliminating certain attack vectors
TODO: ## Examples of Real World Attacks
TODO xz library backdoor
TODO SolarWinds backdoor
Chislic flank fatback, tri-tip short loin tenderloin ground round boudin venison bacon porchetta short ribs jowl doner bresaola. Doner frankfurter chislic, t-bone tongue leberkas cupim burgdoggen salami ribeye. Ham hock ham flank filet mignon beef ribs andouille. Pork loin tongue leberkas cupim short loin bacon cow.
---
Kielbasa ham hock ground round pig meatloaf chuck porchetta. Meatball boudin drumstick hamburger. Beef ribs capicola frankfurter, t-bone pork beef chuck ham hock tail bresaola kevin pig. Kevin spare ribs porchetta beef pig landjaeger pork shankle. Pork loin turkey strip steak kielbasa porchetta meatball turducken hamburger pork ball tip tri-tip chislic sausage.
When a compiler is used to build a piece of software, how do we verify that the compiler itself can be trusted? Is it known who compiled the compiler? Compilers are usually not built from source, and even when they are, they are seeded from a binary that is itself opaque and difficult to verify. So how does one check that the supply chain integrity of the compiler is intact before we even get to building software with it? Compiler supply chains are obscured and seeded from binaries at many points, which makes it nearly impossible to verify their integrity. In 1984, Ken Thompson wrote ["Reflections on Trusting Trust"](https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf) and illustrated that a compiler can modify software during compilation in order to compromise it. Put simply, reviewing the source code is not enough. We also need to be sure that the compiler isn't compromised, as it could be used to alter the intended behavior of the software.
![whale shark](../assets/images/whale_shark.jpg)
What about the software that's built using the compiler? Has the code been modified during compilation? Has the resulting binary been tampered with, perhaps in a CI/CD runner whose OS has a vulnerability in one of its subdependencies, or because the server host was compromised and attackers gained control of the infrastructure? These are difficult software supply chain security issues which are often swept under the rug or overlooked entirely due to lack of understanding. The bottom line is that in order to eliminate this attack surface we need good answers to these questions and, more importantly, tooling and practical methods which can help close these gaps in the supply chain.
## Second Title
This line of questioning becomes especially concerning in the context of widely used software such as images pulled from DockerHub, package managers, and Linux distributions. Software procured via these channels is pervasive in almost all software, and as such is a severe attack vector. If the maintainer of a widely used DockerHub image has their machine compromised, or is coerced or forced under duress to insert malicious code into the binaries they are responsible for, in most cases there is no effective measure in place to detect it, and millions of downstream consumers can be impacted. Imagine what would happen if the maintainer of the default DockerHub image of a widely used language was compromised, and the binary they released had a backdoor in it. The implications would be far reaching and disastrous.
Sirloin hamburger leberkas pig. Kielbasa doner picanha kevin. Meatball tenderloin ham hock spare ribs strip steak picanha tail drumstick t-bone pork loin venison flank rump. Turkey drumstick picanha t-bone, filet mignon fatback pork belly tail venison boudin shankle ribeye pancetta. Bacon meatball pig, pork loin t-bone ball tip meatloaf fatback cupim tenderloin.
There are two distinct problems at hand which share a solution:
1. How do we ensure that we can trust the toolchain used to build software?
2. How do we ensure that we can trust software built with the toolchain?
Doner pork pastrami, frankfurter t-bone kevin chislic chuck. Meatball short loin meatloaf spare ribs prosciutto brisket. Biltong kielbasa boudin pig jowl shankle swine frankfurter turducken pancetta buffalo kevin chislic pork chop flank. Cow alcatra meatball, fatback tenderloin porchetta tri-tip prosciutto chislic turducken biltong pig. Sirloin strip steak t-bone, swine sausage turducken alcatra filet mignon landjaeger burgdoggen capicola salami pork loin short ribs jerky. Pork chop hamburger strip steak, meatloaf sirloin picanha ground round pancetta andouille shoulder tenderloin bresaola. Jowl venison pork burgdoggen, ball tip swine doner pig frankfurter tongue ribeye meatball drumstick.
The answer to both questions is the same: verifiability and determinism. To be clear, we are not trying to solve the problem of the source code itself being compromised. If the source code is compromised, determinism does not help prevent that. If the code has been reviewed and verified as secure, however, then determinism and multiple reproductions of the software add a set of excellent guarantees.
Deterministically built software is simply software which always compiles to the same bit-for-bit exact binary. This is useful because it makes it trivial to check the integrity of the binary: if the binary is always the same, we can use hashing to ensure that nothing about it has changed. Most software is non-deterministic because of minor differences introduced during the build process, such as timestamps. By pinning all aspects of the environment the software is built in, and removing any changing factors such as time, we can force the output to always be bit-for-bit the same.
Now imagine a scenario where a developer is compiling software without determinism. Any time they build the software, they have no easy way to verify whether the binary changed in a meaningful way compared to the previous one without doing low-level inspection. With determinism, it's as easy as hashing one binary, repeating the compilation, hashing the second result, and comparing the two hashes.
This is great, but it's still not enough to ensure that the binary can be trusted, because there may be malware which always modifies the binary in the same manner. To mitigate this we can build the software on multiple different machines, ideally by different entities, using different operating systems and even different hardware, as it's much less likely that multiple diverse stacks and individuals are compromised by the same malware or attacker. In this manner, we can eliminate the risk of modification during compilation going undetected. To add a layer of trust in the hashes produced by the different entities, we can use cryptographic signing, as is customary for many software releases.
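As a rough illustration of the idea above, here is a minimal Python sketch. It is not how any particular build system works: a tar archive stands in for a compiled binary, and `build_archive`, `digest`, and `builders_agree` are hypothetical names made up for this example. It shows that a leaked timestamp changes the hash, and that pinning it makes independent builds comparable by hash alone.

```python
import hashlib
import io
import tarfile


def build_archive(files: dict, mtime: int) -> bytes:
    """Pack files into an uncompressed tar archive with fixed metadata.

    The archive is a stand-in for a build artifact; mtime is the only
    input that is allowed to vary in this sketch.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # fixed ordering, independent of dict order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = mtime
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def digest(artifact: bytes) -> str:
    """SHA-256 hex digest of an artifact."""
    return hashlib.sha256(artifact).hexdigest()


def builders_agree(digests: dict) -> bool:
    """True when every independent builder reported the same digest."""
    return len(set(digests.values())) == 1


source = {"main.c": b"int main(void) { return 0; }\n", "README": b"demo\n"}

# Wall-clock timestamps leak into the artifact, so the hashes differ.
unpinned_a = digest(build_archive(source, mtime=1_700_000_000))
unpinned_b = digest(build_archive(source, mtime=1_700_000_060))

# Pinning the timestamp makes both builds bit-for-bit identical.
pinned_a = digest(build_archive(source, mtime=0))
pinned_b = digest(build_archive(source, mtime=0))

print(builders_agree({"alice": unpinned_a, "bob": unpinned_b}))
print(builders_agree({"alice": pinned_a, "bob": pinned_b}))
```

In a real setting the two digests would come from different machines and operators rather than from two calls in the same process; the comparison itself stays this simple.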
To assess the current state of affairs when it comes to what's available regarding software package managers and distributions and how far they have gone to mitigate the risks we expound on above, we took a hard look at the usual suspects.
Alpine is the most popular Linux distribution (distro) in container-land. It has made great strides in providing a minimal `musl`-based distribution with reasonable security defaults and is suitable for a lot of use cases. However, in the interest of developer productivity and low friction for contributors, none of it is cryptographically signed by maintainers.
Debian (and derivatives like Ubuntu) is one of the most popular options for servers. It is largely reproducible and signs all packages. However, being `glibc`-based with a focus on compatibility and desktop use cases, it pulls in a huge number of dependencies for almost any software run on it, enacts partial code freezes for long periods between releases, and often has very stale packages because various compatibility goals block updates. This overhead creates a lot of surface area for malicious code to hide in. Unfortunately, due to its design, building software deterministically on Debian means every repo needs to keep costly snapshots of all dependencies in order to reproduce build containers, because Debian packages are archived and retired after some time to servers with extremely low bandwidth. This creates a lot of friction for teams, who often have to archive hundreds of .deb files for every project. Debian also ships very old versions of common requirements like Rust, which is a problem for teams who want access to the latest language features. Even with all this work, Debian does not have truly reproducible Rust (more on that in a bit), and packages are signed only by single maintainers, whom we have to fully trust not to have released a compromised binary.
Fedora (and Red Hat-based distros) also sign all packages, but otherwise suffer from similar one-size-fits-all bloat problems as Debian with a different coat of paint. Additionally, their reliance on centralized builds has been used as justification for not pursuing reproducibility at all, which makes them a non-starter for security-focused use cases.
Arch has very fast updates as a rolling release distro, and its package definitions are signed and often reproducible. However, they change from one minute to the next, so software that is built on Arch and requires determinism still needs a solution to pin and archive sets of dependencies that work well together.
Nix is almost entirely reproducible by design and allows for lean and minimal output artifacts. It is also a big leap forward in having good separation of concerns between privileged immutable and unprivileged mutable spaces. However, like Alpine, it has no maintainer-level signing, in order to reduce friction for hobbyists who want to contribute.
Guix is reproducible by design as well, borrowing a lot from Nix. It also does maintainer-level signing like Debian. It comes the closest to the solution we need, but it only provides single-signed package contributions and a `glibc` base with a large dependency tree, with a significant footprint of tooling to review and understand before one can form confidence in it. This is still overhead we simply don't want or need for use cases like container builds of software, lean embedded operating systems, or any sensitive system where we want the utmost level of supply chain security assurance.
For those whose goal is to build their own software packages deterministically with high portability, maintainability, and maximally easy supply chain auditability, none of these solutions hit the mark.
Reflecting on these issues, we concluded we want the `musl`-based container-ideal minimalism of Alpine, the obsessive determinism and full-source supply chain goals of Guix, and a step beyond the single-sig signed packages of Debian, Fedora, and Arch. We also concluded that we want a fully verifiable bootstrapped toolchain, consisting of a compiler and the accompanying libraries required for building most modern software.
You know where this is going. Here is where we made the totally reasonable and not-at-all-crazy choice to effectively create…
## Yet *Another* Linux Distribution
Let's take a look at some of the features we care about most, to make it clearer why nothing else hit the mark for us.
A comparison of `stagex` to other distros in some of the areas we care about:
| Distro | Containerized | Signatures | Libc | Bootstrapped | Reproducible | Rust Deps |
|--------|---------------|------------|-------|--------------|--------------|-----------|
| Stagex | Native | 2+ Human | Musl | Yes | Yes | 4 |
| Guix | No | 1 Human | Glibc | Yes | Yes | 4 |
| Nix | No | 1 Bot | Glibc | Partial | Mostly | 4 |
| Debian | Adapted | 1 Human | Glibc | No | Partial | 232 |
| Arch | Adapted | 1 Human | Glibc | No | Partial | 262 |
| Fedora | Adapted | 1 Bot | Glibc | No | No | 166 |
| Alpine | Adapted | None | Musl | No | No | 32 |
We are leaving out hundreds of distros here, but at the risk of starting a holy war, we felt it was useful to compare a few popular options for contrast to the goals of the minimal container-first, security-first, deterministic, distro we put together.
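To make the "2+ Human" entry in the Signatures column more concrete, here is a minimal Python sketch of a quorum check: an artifact is accepted only when at least two distinct trusted signers have attested to the same digest. This is an illustration under stated assumptions, not a description of the Stagex tooling. `Attestation`, `has_quorum`, and the demo keys are hypothetical names, and the HMAC-based `demo_verify` is only a stand-in so the example runs with the standard library; a real setup would verify public-key signatures instead.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass(frozen=True)
class Attestation:
    signer: str       # identity of the person who reproduced the build
    digest: str       # digest of the artifact they obtained
    signature: bytes  # signature over the digest


def has_quorum(
    artifact_digest: str,
    attestations: Iterable[Attestation],
    trusted_signers: Set[str],
    verify: Callable[[str, bytes, bytes], bool],
    threshold: int = 2,
) -> bool:
    """Require `threshold` distinct trusted signers for the same digest."""
    valid_signers = {
        a.signer
        for a in attestations
        if a.signer in trusted_signers
        and a.digest == artifact_digest
        and verify(a.signer, a.digest.encode(), a.signature)
    }
    return len(valid_signers) >= threshold


# Stand-in signing scheme for the demo only (assumption: shared demo keys).
DEMO_KEYS = {"alice": b"alice-demo-key", "bob": b"bob-demo-key"}


def demo_sign(signer: str, message: bytes) -> bytes:
    return hmac.new(DEMO_KEYS[signer], message, hashlib.sha256).digest()


def demo_verify(signer: str, message: bytes, signature: bytes) -> bool:
    if signer not in DEMO_KEYS:
        return False
    return hmac.compare_digest(demo_sign(signer, message), signature)


artifact_digest = hashlib.sha256(b"example artifact").hexdigest()
by_alice = Attestation("alice", artifact_digest, demo_sign("alice", artifact_digest.encode()))
by_bob = Attestation("bob", artifact_digest, demo_sign("bob", artifact_digest.encode()))
trusted = {"alice", "bob"}

two_signers = has_quorum(artifact_digest, [by_alice, by_bob], trusted, demo_verify)
one_signer_twice = has_quorum(artifact_digest, [by_alice, by_alice], trusted, demo_verify)
```

Counting distinct signers rather than attestations is the important design choice here: the same person signing twice must not satisfy the threshold.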
We are not the first to go down this particular road. In fact, the TalosOS project built their own tiny containerized toolchain from gcc to golang as the base for their own minimal immutable k8s distro.
Getting all the way to bootstrapping Rust, however, is a much bigger chunk of pain, as we learned…
## The Oxidation Problem - Bootstrapping Rust
Getting from gcc all the way to golang was mostly pain-free, thanks to Google documenting this path well and providing all the tooling to do it. One only needs 3 versions of golang to get all the way back to GCC.
Bootstrapping Rust however is a bit of an ordeal. People love Rust for its memory safety and strictness, however we have to admit supply chain integrity is not an area where it excels. This is mostly because Rust changes so much from one release to the next, that a given version of Rust can only ever be built with its immediate predecessor.
If one follows the chicken-and-egg problem far enough, the realization dawns that in most distros the chicken comes first. Most include a non-reproducible “seed” Rust binary, presumably compiled by some member of the Rust team, use that to build the next version, and carry on from there. This means even some of the distros that -say- their Rust builds are reproducible have a pretty big asterisk. We won't call anyone out - you know who you are.
Granted, even if you were to build all the way up from the OCaml roots of Rust (if you can find that code and then get it to build), you still have to have a trusted OCaml compiler. Software supply chains are hard, and we always end up back at the famous Trusting Trust Problem.
There have been some amazing efforts by the Guix team to bootstrap GCC, and the entire package chain after it, from a tiny human-auditable blob of x86 assembly via the GNU Mes project. That is probably in the cards for our stack as well, but for the short term we wanted to at least go as low in the stack as GCC, as we do with Go, which is already a sizable effort. Thankfully John Hodge (mutabah), a brilliant (crazy?) member of the open source community, created “mrustc”, which implements a minimal semi-modern Rust 1.54 compiler in C++, largely from transpiled Rust code. It is missing a lot of critical features, which makes it unsuitable for direct use, but it -does- support enough features to compile official Rust 1.55 sources, which can compile Rust 1.56, and so on. This is the path Guix and Nix both went down, and we are following their lead here.
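Because each Rust release is built by its immediate predecessor, the length of the chain grows with every release. The following Python sketch simply enumerates the steps described above, starting with mrustc compiling Rust 1.55. The function name `bootstrap_steps` and the target of 1.58 are assumptions chosen only to keep the example short, not the version any distro actually targets.

```python
def bootstrap_steps(first_built_minor: int, target_minor: int, seed: str = "mrustc") -> list:
    """List (builder, built) pairs from the seed compiler up to the target.

    Each rustc 1.N is assumed to be built by rustc 1.(N-1), except the
    first one, which is built by the seed compiler.
    """
    if target_minor < first_built_minor:
        raise ValueError("target must not be older than the first built version")
    steps = []
    builder = seed
    for minor in range(first_built_minor, target_minor + 1):
        built = f"rustc 1.{minor}"
        steps.append((builder, built))
        builder = built  # the freshly built compiler builds the next release
    return steps


steps = bootstrap_steps(55, 58)
for builder, built in steps:
    print(f"{builder} builds {built}")
# mrustc builds rustc 1.55
# rustc 1.55 builds rustc 1.56
# rustc 1.56 builds rustc 1.57
# rustc 1.57 builds rustc 1.58
```

Every step in such a chain is a full compiler build, which is why the bootstrap is expensive and why it gets longer with each new Rust release.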
Mrustc, however, lacked support for musl libc, which threw a wrench in things, but after a fair bit of experimentation we were able to patch in musl support and get it upstream.
The result is that we now have the first deterministic musl-based Rust compiler bootstrapped all the way back to GCC, and you can reproduce our builds right now from any OS that can run Docker.
## Future Work
As of writing this, Stagex has 100+ packages covering some of the core software you may be using regularly, all built using the deterministically built toolchain, and of course the software itself also built deterministically. Some of the packages include `rust`, `go`, `nodejs`, `python3.8`, `curl`, `bash`, `git`, `tofu` and many more.
We would like to support building with `buildah` and `podman` for build-tooling diversity. We would also love help from the open source community to see GCC bootstrapped all the way down to x86 assembly via Mes. This may require using multiple seed distro containers working in parallel to ensure we don't have a single provenance source for even that layer.
We are also actively working on, and have made some progress towards, the addition of core packages required to use this distro as a minimal Linux OS.
If you have need for high trust in your own build system, please reach out and we would love to find a way to collaborate.
## References
* [Bootstrapping Rust](https://guix.gnu.org/en/blog/2018/bootstrapping-rust/)
* [Full source bootstrapping](https://guix.gnu.org/en/blog/2023/the-full-source-bootstrap-building-from-source-all-the-way-down/)
* [Running the "Reflections on Trusting Trust" Compiler](https://research.swtch.com/nih)
* [Reflections on Trusting Trust](https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf)