---
layout: post
title: The Safe{Wallet}/Bybit incident report: Root cause analysis and mitigation strategies
date: 2025-03-20
---
The Safe{Wallet}/Bybit incident is an example of a nation-state actor executing a series of sophisticated, multi-layered attacks on high-value targets. Where the potential gain is significant, attackers can justify investing in multiple 0-day vulnerabilities and chaining them into elaborate exploit sequences. These campaigns often span multiple layers of the tech stack and involve precision-targeted social engineering, insider compromise, or even physical infiltration.
As such, the threat model required to defend against this level of adversary must be extreme. It demands that defenders adopt a much more rigorous set of assumptions about attacker capabilities and invest time in implementing controls that typical organizations may not need. When protecting high-value assets, the game changes.
### Threat model assumptions
At Distrust, we operate under the assumption that nation-state actors are persistent, highly resourced, and capable of compromising nearly any layer of the system. Accordingly, our threat model assumes:
* All screens are visible to the adversary
* All keyboard input is being logged by the adversary
* Any firmware or bootloader not verified on every boot is considered compromised
* Any host OS with network access is compromised
* Any guest OS used for non-production purposes is compromised
* At least one member of the Production Team is compromised
* At least one maintainer of third-party code used in the system is compromised
* At least one member of a third-party system used in production is compromised
* Physical attacks are viable and likely
* Side-channel attacks are viable and likely
These assumptions drive the design strategies and tooling outlined in this report. The controls we've developed are built specifically to address this elevated threat model. Many of the tools are ready to use today, some are reference designs, and others require further development. If you care about these issues and want to help us push this work forward, [talk to us](https://distrust.co/contact.html).
### Summary
This report identifies critical single points of failure—cases where trust is placed in a single individual or computer—creating opportunities for compromise. In contrast, blockchains offer stronger security properties through cryptography and decentralized trust models.
Traditional infrastructure has historically lacked mechanisms to distribute trust, but this limitation can be addressed. By applying targeted design strategies, it's possible to distribute trust across systems and reduce the risk of a single compromised actor undermining the integrity of the entire system.
---
## Root cause analysis and mitigation strategies
In our opinion, the primary causes of this incident stem from two key issues identified in the [Sygnia report](https://www.sygnia.co/blog/sygnia-investigation-bybit-hack/):
* > ... a developers Mac OS workstation was compromised, likely through social engineering.
* > ... the modification of JavaScript resources directly on the S3 bucket serving the domain app.safe[.]global.
These findings highlight both endpoint compromise and weak controls around cloud infrastructure. The following sections focus on how such risks could be mitigated through architectural decisions and more rigorous threat modeling.
## Introduction
The compromise occurred due to several key factors, already documented in other reports. This report focuses on how the incident **could have been prevented** through a stronger, first-principles approach to infrastructure design.
While many security teams reach for quick wins such as access token rotation, stricter IAM policies, or improved monitoring, these are often reactive measures. They may help, but they're the equivalent of "plugging holes on a sinking ship" rather than rebuilding the hull from stronger material.
For example, improving access control to the S3 bucket used to serve JavaScript resources, or adding better monitoring, are good steps. But they rely on trust placed in individuals or cloud platforms, which remain vulnerable to compromise.
> At the core of this breach lies a recurring theme: single points of failure.
To explore this from first principles, consider the deployment pipeline. In most companies, one individual (an admin or developer) often has the ability to modify critical infrastructure or code. That person becomes a single point of failure.
Even if the pipeline is hardened, the risk shifts rather than disappears. There's always one super-admin who has full access. Most cloud platforms encourage this pattern, and the industry has come to accept it.
But this isn't about distrusting your team—it's about designing systems where **trust is distributed**. In the blockchain space, this is already accepted practice. So the question becomes:
> *Does it make sense for a single individual to hold the integrity of an entire system in their hands?*
Those who've worked with decentralized systems would say: absolutely not.
#### Mitigation principles
To adequately defend against the risks outlined in the Distrust threat model, it is critical to distinguish between **cold** and **hot** wallets. The following principles are drawn from practical experience building secure systems at BitGo, Unit410, and Turnkey, as well as from diligence work conducted across leading custodial and vaulting solutions.
* A **cold cryptographic key management system** is one where all components can be built, operated, and verified offline. If any part of the system requires trusting a networked component, it becomes a **hot** system by definition. For example, if a wallet relies on internet-connected components, it should be considered a hot wallet, regardless of how it's marketed. While some systems make trade-offs for user experience, these often come at the cost of real security guarantees.
* Cold cryptographic key management systems that leverage truly random entropy sources are **not susceptible to remote attacks**, and are only exposed to localized threats such as physical access or side-channel attacks.
* A common misconception is that simply keeping a key offline makes a system cold and secure. But an attacker doesn't always need to steal the key; they only need to get the key to perform an operation on data of their choosing, on their behalf.
* **All software in the stack must be open source**, built deterministically (to support reproduction), and compiled using a fully bootstrapped toolchain. Otherwise, the system remains exposed to single points of failure, especially via supply chain compromise.
#### Mitigations and reference designs
We propose two high-level design strategies that can eliminate the types of vulnerabilities exploited in the Safe{Wallet}/Bybit attack. Both approaches offer similar levels of security assurance—but differ significantly in implementation complexity and effort.
In our view, **when billions of dollars are at stake**, it is worth investing in proven low-level mitigations, even if they are operationally harder to deploy. The accounting is simple: **invest in securing your system up front**, rather than gambling on the assumption that you won't be targeted.
State-funded actors are highly motivated, and when digital assets are involved, it's game theory at work. The cost of compromising a weak system is often far less than the potential gain.
We've seen this playbook used in previous incidents, including Axie Infinity, and we will see it again. Attackers are increasingly exploiting both human and technical single points of failure, while defenders often under-invest in securing this surface area.
#### Strategy 1 - Run everything locally
This strategy can be implemented without major adjustments to the existing system. The goal is to move the component currently introducing risk (the one that effectively makes the wallet "hot") into an offline component, upgrading the system to a fully cold solution.
The idea centers on extracting the **signing** component from the application (which currently operates in the UI) and converting it into an offline application. A practical example of this approach would be using a tool like **Electrum**.
However, simply making a component offline does not eliminate all single points of failure. Security also requires that individuals build the application themselves from source, using a fully bootstrapped compiler and a **deterministic build process**.
We've developed open-source tooling for this under **[StageX](https://codeberg.org/stagex/stagex)**. To learn more about the importance of reproducible builds, check out [this video](https://antonlivaja.com/videos/2024-incyber-stagex-talk.mp4), where one of our co-founders explains how the SolarWinds incident unfolded—and how it could have been prevented.
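The value of a deterministic build is that independent builders can compare results byte for byte. The following is a minimal sketch of that comparison, not part of StageX: the function names, the builder names, and the `required` threshold are all hypothetical, and it assumes each builder has already produced the artifact on their own machine.

```python
import hashlib
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of an artifact as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def agreed_digest(reported: Dict[str, str], required: int) -> Optional[str]:
    """Return the digest all builders agree on, or None.

    `reported` maps a (hypothetical) builder name to the hex digest that
    builder obtained from an independent build. Any disagreement fails the
    check, because a deterministic build should yield identical bytes for
    everyone. At least `required` builders must have reported.
    """
    if len(reported) < required:
        return None
    digests = set(reported.values())
    if len(digests) != 1:
        return None
    return digests.pop()


if __name__ == "__main__":
    artifact = b"example release bytes"
    reports = {"alice": sha256_hex(artifact), "bob": sha256_hex(artifact)}
    print(agreed_digest(reports, required=2))
```

A single mismatching digest is treated as a hard failure rather than being outvoted, since under this threat model a disagreement means either the build is not deterministic or one builder is compromised.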
##### Reference design
This reference design was developed for the Safe{Wallet} team, but it can be applied to any team seeking to build an offline component with minimal single points of failure.
1. **System administrators use dedicated offline laptops**
* All radio hardware (Bluetooth, Wi-Fi) is physically removed
* Machines are air-gapped and have never been connected to the internet
2. **Engineers provision and manage their own personal signing keys (PGP)**
* Smart cards like NitroKey or YubiKey are used
* Signing operations are performed exclusively on the engineer's offline system
* Distrust has developed open-source tooling to support secure key provisioning: **[Trove](https://trove.distrust.co/generated-documents/all-levels/pgp-key-provisioning.html)**
3. **Offline signing applications are deterministically compiled, verified, and signed by multiple engineers**
* Includes a full set of tools needed to secure offline key operations
* Distrust also created **[AirgapOS](https://git.distrust.co/public/airgap)**, a custom Linux distribution designed specifically for offline secret management. It has been independently audited and is in production with several major digital asset organizations.
4. **All sensitive operations are fully verified offline before any cryptographic action is taken**
This design drastically reduces exposure to remote attacks and central points of trust, aligning closely with Distrust's first-principles security model.
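Step 4 of the reference design can be pictured as a policy check that runs on the air-gapped machine before the key is ever used. The sketch below is illustrative only: the `Request` fields, the `Policy` rules, and the `"call"`/`"delegatecall"` labels are hypothetical and do not describe the actual Safe{Wallet} transaction format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    """A hypothetical, already-decoded signing request."""
    to: str      # destination address
    value: int   # amount in the smallest unit
    kind: str    # illustrative label, e.g. "call" or "delegatecall"


@dataclass(frozen=True)
class Policy:
    """Hypothetical rules agreed on by the team ahead of time."""
    allowed_destinations: frozenset  # lowercase addresses
    max_value: int
    allowed_kinds: frozenset = frozenset({"call"})


def violations(req: Request, policy: Policy) -> List[str]:
    """Return every reason the request must not be signed (empty means OK)."""
    problems = []
    if req.to.lower() not in policy.allowed_destinations:
        problems.append("destination is not on the allowlist")
    if req.value < 0 or req.value > policy.max_value:
        problems.append("value is outside the permitted range")
    if req.kind not in policy.allowed_kinds:
        problems.append("operation kind is not permitted")
    return problems


if __name__ == "__main__":
    policy = Policy(allowed_destinations=frozenset({"0xabc"}), max_value=100)
    print(violations(Request("0xdef", 500, "delegatecall"), policy))
```

Returning the full list of violations, rather than stopping at the first, gives the human reviewer the complete picture of what was rejected and why.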
#### Strategy 2 - Use a remotely verified service
This strategy maintains a user experience nearly identical to the current system, while introducing verifiability at critical points in the architecture. It requires significantly more engineering effort and operational discipline, and the tooling needed to support this model is still under active development.
##### Reference design
This design relies on **secure enclaves** to host servers that are immutable, deterministic, and capable of cryptographically attesting to the software they are running. While this brings us closer to a cold setup, some residual attack surface—such as browser exploits, host OS compromise, or 0-day attacks—will always remain.
The core implementation steps include:
1. **Rewrite the application to run entirely within a secure enclave**
* TLS termination occurs **inside** the enclave
* The web interface is served **from within** the enclave
* Nothing outside the enclave is trusted
2. **Create a deterministic OS image with remote attestation (e.g., TPM2, Nitro Enclave or similar)**
* The entire stack is built using a fully bootstrapped compiler in a reproducible manner
3. **One engineer deploys a new enclave** with the updated application code
4. **A second engineer independently verifies** that the deployed code matches the version in the source repository
5. **Clients are issued a service worker** on first load that pins attestation keys for all future remote verification
* Users can optionally verify and download the application locally for offline operations
* Users are also encouraged to self-build and match the published signed hash
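The verification and pinning steps above can be sketched as two small checks. This is a simplified illustration under stated assumptions, not a real attestation verifier: it assumes the attestation document's signature has already been validated with the platform's own tooling, and the `PinStore` class, the `measurement_matches` function, and the hex-string inputs are hypothetical. A production client would implement the pinning in a service worker rather than in Python.

```python
import hmac
from typing import Optional


class PinStore:
    """Trust-on-first-use store for an attestation key fingerprint."""

    def __init__(self) -> None:
        self._pinned: Optional[str] = None

    def check(self, fingerprint: str) -> bool:
        """Pin the first fingerprint seen; afterwards accept only that one."""
        if self._pinned is None:
            self._pinned = fingerprint
            return True
        return hmac.compare_digest(self._pinned, fingerprint)


def measurement_matches(attested: str, reproduced: str) -> bool:
    """Compare the enclave's attested measurement (hex) with the digest a
    verifier obtained by reproducing the build from source."""
    return hmac.compare_digest(attested.lower(), reproduced.lower())


if __name__ == "__main__":
    store = PinStore()
    print(store.check("aa11"))   # first load: pins the fingerprint
    print(store.check("bb22"))   # later load with a different key
    print(measurement_matches("ABCD", "abcd"))
```

Trust-on-first-use means the first load is the weakest point, which is why the design also encourages users to self-build and compare against the published signed hash.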
## Implementation considerations
Implementing these strategies can be technically demanding. They represent two ends of the trust minimization spectrum: one favoring offline, air-gapped assurance; the other introducing verifiability within connected systems. Both approaches significantly reduce risk but vary in complexity, tooling requirements, and rollout timelines.
This high-level overview is meant to illustrate the kinds of problems we focus on at Distrust. Depending on the chosen strategy and organizational context, implementation can take anywhere from a few weeks to several years, especially as tooling continues to mature.
## About Distrust
The Distrust team has helped build and secure some of the highest-risk systems in the world. This includes vaulting infrastructure at BitGo, Unit410, and Turnkey, as well as security work with electrical grid operators, industrial control systems, and other mission-critical environments.
We've conducted deep security due diligence across most major custodians. Through our experience with organizations that operate under constant threat, where **every class of attack is viable**, we've developed a methodology and set of open-source tools designed to defend against even the most sophisticated adversaries.
Today, we're taking the hard-earned lessons from that work and sharing them with the broader community. Our goal is to help others strengthen their security posture by making what we've learned—and the tools we've built—available to everyone.
**Looking for help analyzing and mitigating security risks in your own organization? [Talk to us](https://distrust.co/contact.html)**.