blog refinements

Ksenia Lesko 2025-04-01 13:36:14 -04:00 committed by Anton Livaja
parent 15593cafdc
commit 1825b7152f
Signed by: anton
GPG Key ID: 44A86CFF1FDF0E85
1 changed file with 36 additions and 9 deletions


@@ -46,29 +46,56 @@ These findings highlight both endpoint compromise and weak controls around cloud
## Introduction
The compromise occurred due to several key factors, already documented in other reports. This report focuses on how the incident **could have been prevented** through a stronger, first-principles approach to infrastructure design.
Many security teams reach for quick wins, such as access token rotation, stricter IAM policies, or improved monitoring, but these are often reactive measures. They may help, yet they're the equivalent of "plugging holes on a sinking ship" rather than rebuilding the hull from stronger material.
For example, improving access control to the S3 bucket used to serve JavaScript resources and adding better monitoring are good steps. But they rely on trust placed in individuals or cloud platforms, which remain vulnerable to compromise.
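To make one such "quick win" concrete, here is a minimal sketch, using Python and boto3, of restricting writes on a front-end asset bucket to a single CI role. The bucket name, account ID, and role ARN are hypothetical placeholders, not details from the actual incident.

```python
import json

import boto3  # AWS SDK for Python

# Hypothetical names: substitute your own bucket and CI deploy role.
BUCKET = "example-frontend-assets"
CI_ROLE_ARN = "arn:aws:iam::123456789012:role/frontend-deploy-ci"

# Deny object writes and deletes from every principal except the CI role.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyWritesExceptCI",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"ArnNotEquals": {"aws:PrincipalArn": CI_ROLE_ARN}},
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```

Note the limitation this post is driving at: any administrator who can call `put_bucket_policy` can also weaken or remove this control again, so the safeguard itself still hangs off a single point of failure.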
> At the core of this breach lies a recurring theme: single points of failure.
To explore this from first principles, consider the deployment pipeline. In most companies, one individual (an admin or a developer) has the ability to modify critical infrastructure or code. That person becomes a single point of failure.
Even if the pipeline is hardened, the risk shifts rather than disappears. There is always one super-admin who has full access. Most cloud platforms encourage this pattern, and the industry has come to accept it.
But this isn't about distrusting your team; it's about designing systems where **trust is distributed**. In the blockchain space, this is already accepted practice. So the question becomes:
> *Does it make sense for a single individual to hold the integrity of an entire system in their hands?*
Those who've worked with decentralized systems would say: absolutely not.
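As a minimal sketch of what distributed trust can look like in a deployment pipeline, the gate below refuses to release an artifact unless a threshold of reviewers has signed its digest. This is illustrative only, not a description of any particular product: it assumes Ed25519 signatures via the PyNaCl library, and the reviewer names are hypothetical.

```python
# Sketch: a 2-of-3 release gate. No single reviewer (or stolen key)
# can authorize a deployment alone.
import hashlib

from nacl.exceptions import BadSignatureError
from nacl.signing import SigningKey

THRESHOLD = 2  # require 2-of-3 reviewer approvals

# For the sketch we generate keys in-process; in reality each reviewer
# holds their own signing key and only public keys live in the pipeline.
reviewers = {name: SigningKey.generate() for name in ("alice", "bob", "carol")}
verify_keys = {name: key.verify_key for name, key in reviewers.items()}

def artifact_digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def release_allowed(data: bytes, signatures: dict[str, bytes]) -> bool:
    """Count valid reviewer signatures over the artifact digest."""
    digest = artifact_digest(data)
    valid = 0
    for name, sig in signatures.items():
        key = verify_keys.get(name)
        if key is None:
            continue  # unknown signer: ignore
        try:
            key.verify(digest, sig)
            valid += 1
        except BadSignatureError:
            pass  # invalid signature: does not count toward the threshold
    return valid >= THRESHOLD

artifact = b"bundle.js contents"
sigs = {
    "alice": reviewers["alice"].sign(artifact_digest(artifact)).signature,
    "bob": reviewers["bob"].sign(artifact_digest(artifact)).signature,
}
assert release_allowed(artifact, sigs)  # 2-of-3 met: deploy may proceed
```

The design choice worth noting is that compromising any one machine or person no longer suffices; an attacker must subvert a quorum, which is exactly the property a lone super-admin account lacks.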
#### Mitigation Principles
To adequately defend against the risks outlined in the Distrust threat model, it is critical to distinguish between **cold** and **hot** wallets. The following principles are drawn from practical experience building secure systems at BitGo, Unit410, and Turnkey, as well as from due diligence work conducted across leading custodial and vaulting solutions.
* A **cold cryptographic key management system** is one where all components can be built, operated, and verified offline. If any part of the system requires trusting a networked component, it becomes a **hot** system by definition. For example, if a wallet relies on internet-connected components, it should be considered a hot wallet, regardless of how it's marketed. While some systems make trade-offs for user experience, these often come at the cost of real security guarantees.
* Cold cryptographic key management systems that leverage truly random entropy sources are **not susceptible to remote attacks**, and are only exposed to localized threats such as physical access or side-channel attacks.
* A common misconception is that simply keeping a key offline makes a system cold and secure. But an attacker doesn't always need to steal the key; they just need to achieve an outcome where the key performs an operation on data of their choosing, on their behalf.
* **All software in the stack must be open source**, built deterministically (to support reproduction), and compiled using a fully bootstrapped toolchain. Otherwise, the system remains exposed to single points of failure, especially via supply chain compromise (see the verification sketch after this list).
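Here is a sketch of the verification step that deterministic builds make possible: several independent parties build the same source, and an artifact is trusted only if every reported digest matches the one computed locally. The builder names, digest values, and artifact path are hypothetical placeholders.

```python
# Sketch: cross-checking a release artifact against digests reported by
# independent builders. Deterministic builds make byte-identical output
# possible; without them this comparison is meaningless.
import hashlib
import sys

def sha256_file(path: str) -> str:
    """Stream a file and return its hex SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Digests independently reproduced by separate builders (hypothetical values).
independent_builds = {
    "builder-a": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    "builder-b": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

local = sha256_file("release/signer.bin")  # hypothetical artifact path
if any(digest != local for digest in independent_builds.values()):
    sys.exit("digest mismatch: refusing to trust this artifact")
print("all builders agree:", local)
```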
#### Mitigations and Reference Designs
We propose two high-level design strategies that can eliminate the types of vulnerabilities exploited in the Safe{Wallet}/Bybit attack. Both approaches offer similar levels of security assurance, but differ significantly in implementation complexity and effort.
In our view, **when billions of dollars are at stake**, it is worth investing in proven low-level mitigations, even if they are operationally harder to deploy. The accounting is simple: **invest in securing your system up front**, rather than gambling on the assumption that you won't be targeted.
State-funded actors are highly motivated, and when digital assets are involved, it's simple game theory: the cost of compromising a weak system is often far less than the potential gain.
We've seen this playbook used in previous incidents, including Axie Infinity, and we will see it again. Attackers are increasingly exploiting both human and technical single points of failure, while defenders often under-invest in securing this surface area.
#### Strategy 1 - Run Everything Locally
This strategy can be implemented without major adjustments to the existing system. The goal is to move the component that currently introduces risk (effectively making the wallet "hot") into an offline component, upgrading the system to a fully cold solution.
The idea centers on extracting the **signing** component from the application (which currently operates in the UI) and converting it into an offline application. A practical example of this approach would be using a tool like **Electrum**.
However, simply making a component offline does not eliminate all single points of failure. Real security guarantees require that the individual build the application themselves from source, using a fully bootstrapped compiler and a **deterministic build process**.
We've developed open-source tooling for this under **[StageX](https://codeberg.org/stagex/stagex)**. To learn more about the importance of reproducible builds, check out [this video](https://antonlivaja.com/videos/2024-incyber-stagex-talk.mp4), where one of our co-founders explains how the SolarWinds incident unfolded and how it could have been prevented.
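To make the online/offline split concrete, the sketch below separates preparation and broadcasting (online) from inspection and signing (offline). This is illustrative only: it is not Electrum's or Safe{Wallet}'s actual flow, Ed25519 via PyNaCl stands in for the real transaction scheme, and the transport across the air gap (QR codes, USB) is left abstract.

```python
# Sketch of a cold signing split: the online host only prepares and
# broadcasts; the offline host inspects and signs.
import json

from nacl.signing import SigningKey

# --- online, untrusted machine --------------------------------------------
def prepare_unsigned_tx(to: str, amount: int) -> bytes:
    return json.dumps({"to": to, "amount": amount}).encode()

# --- offline, air-gapped machine ------------------------------------------
OFFLINE_KEY = SigningKey.generate()  # in reality generated and kept offline

def sign_offline(unsigned: bytes) -> bytes:
    tx = json.loads(unsigned)
    # The operator verifies the decoded payload on the offline screen before
    # approving; this is what blocks the "key signs attacker-chosen data"
    # class of attack described above.
    print("about to sign:", tx)
    return OFFLINE_KEY.sign(unsigned).signature

# --- back on the online machine -------------------------------------------
unsigned = prepare_unsigned_tx("addr-example", 42)
signature = sign_offline(unsigned)  # carried across the air gap
broadcast = {"tx": unsigned.decode(), "sig": signature.hex()}
print("ready to broadcast:", broadcast)
```

The key property is that the signing key never touches a networked machine, and the payload is verified on a display the online host cannot manipulate.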
##### Reference Design