---
layout: post
title: Safe{Wallet}/Bybit incident report and mitigating controls
date: 2025-03-20
---
The Safe{Wallet}/Bybit incident is an example of a nation-state actor executing a series of sophisticated, multi-layered attacks on high-value targets. In cases where the potential gain is significant, it may be justified for attackers to invest in multiple 0-day vulnerabilities and chain them into elaborate exploit sequences. These campaigns often span multiple layers of the tech stack and can involve precision-targeted social engineering, insider compromise, or even physical infiltration.
As such, the threat model required to defend against this level of adversary must be extreme. It demands that defenders adopt a much more rigorous set of assumptions about attacker capabilities and invest time in implementing controls that typical organizations may not need. When protecting high-value assets, the game changes.
### Threat Model Assumptions
At Distrust, we operate under the assumption that nation-state actors are persistent, highly resourced, and capable of compromising nearly any layer of the system. Accordingly, our threat model assumes:
* All screens are visible to the adversary
* All keyboard input is being logged by the adversary
* Any firmware or bootloader not verified on every boot is considered compromised
* Any host OS with network access is compromised
* Any guest OS used for non-production purposes is compromised
* At least one member of the Production Team is compromised
* At least one maintainer of third party code used in the system is compromised
* At least one member of a third party system used in production is compromised
* Physical attacks are viable and likely
* Side-channel attacks are viable and likely
These assumptions drive the design strategies and tooling outlined in this report. The controls we've developed are built specifically to address this elevated threat model. Many of the tools are ready to use today, some are reference designs, and others require further development. If you care about these issues and want to help us push this work forward, [talk to us](https://distrust.co/contact.html).
### Summary
This report highlights the major single points of failure, which rely on a single individual and/or computer, thus creating an opportunity for compromise. Blockchains benefit from security of the network via strong cryptography and decentralization. More "traditional" parts of the infrastructure historically have not had the ability to distribute trust, but there are tactics that can be leveraged to achieve distribution of trust which help reduce risk from a single individual or computer undermining the integrity of a system.
---
## Root Cause Analysis and Mitigating Controls
In our opinion, the main reasons this hack occurred are these two points found in the [Sygnia report](https://www.sygnia.co/blog/sygnia-investigation-bybit-hack/):
* > ... a developers Mac OS workstation was compromised, likely through social engineering.
* > ... the modification of JavaScript resources directly on the S3 bucket serving the domain app.safe[.]global.
## Introduction
The compromise occurred due to several key factors which have been summarized in other reports. This report focuses primarily on how this incident could have been prevented. It is important to note that the naive mitigating controls, while helpful, are not enough to mitigate the risk adequately. The naive controls we often see recommended are better safeguarding of access tokens, tighter access controls on cloud resources (such as the storage holding the JavaScript that serves the web application front end), and monitoring. Monitoring is the quintessential reactive control rather than a preventative one, and we strongly believe it's always better to prevent wherever possible. While these improvements are important, they are more of a "plugging holes on a sinking ship" exercise than upgrading the hull to titanium. Even if improved controls are introduced around token and cloud platform management, many single points of failure remain in the system.
To quickly dive into the first principles we can apply here to reason about how risk shifts around, let's take the example of how the deployment pipeline is secured. Most companies have a system admin or developer who can individually modify a server (or the software it runs) that is part of the build pipeline used to build or serve the application, or who can even modify the application code itself. This is considered to be a single point of failure in the Distrust threat model. The risk is difficult to mitigate, because even if the effort is taken to set up a hardened deployment pipeline to address it, the risk is simply shifted elsewhere. Even assuming the infrastructure is methodically hardened throughout the whole software development lifecycle, in the end there is always 1 super admin who has access to the cloud platform, which gives that individual complete control over the infrastructure. This is a problem most cloud platforms simply ignore, as the industry has gotten used to accepting this risk. It is worth noting that this is not about "not trusting" one's team, it's about distributing trust (Dis-trust, get it?). The question then is "does it make sense for a single individual to be able to undermine the integrity of the whole system"? Those involved in the blockchain space would likely agree it does not.
#### Mitigation Principles
To adequately address the risks according to the Distrust threat model, some additional assumptions have to be made about the distinction between cold and hot wallets. These statements are conclusions drawn from experience building systems at BitGo, Unit410 and Turnkey, as well as conducting security due diligence probes on most major custodian/vaulting solutions available on the market.
* A *cold* cryptographic key management system is one where all involved components can be built, operated and verified offline. If any component requires trusting anything over the internet, then the entire system is a *hot* system by definition. In this case, a system advertised as cold relied on internet-controlled components, which makes it a hot wallet according to our definition. We believe this was done for good UX reasons at the expense of a true offline quarantine that would have prevented this class of attack.
* *Cold* cryptographic key management systems with a truly random entropy source are *not* susceptible to remote attacks and are only exposed to close-proximity threats such as physical access or side-channel attacks.
* It is a common belief that keeping a key offline will protect any system that uses that key, and that this constitutes a cold system. The problem is that an attacker does not necessarily care about stealing the key - it is sufficient for them to succeed in getting the key to perform an operation on data of their choosing.
* All software used in the stack has to be open source and built deterministically (to allow for reproduction) using a full-source bootstrapped compiler, or the system will still be exposed to single points of failure via supply chain attack vectors.
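
As a rough illustration of why deterministic builds matter for distributing trust, the sketch below compares artifact digests reported by several independent builders and accepts a digest only when a majority threshold agrees on it. This is a minimal sketch and not StageX or any Distrust tooling: the function names, builder names and artifact bytes are all hypothetical, and a real process would also authenticate each report (for example with the builders' signing keys).

```python
import hashlib
from collections import Counter
from typing import Dict, Optional


def artifact_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a build artifact."""
    return hashlib.sha256(data).hexdigest()


def reproduced_digest(reports: Dict[str, str], threshold: int) -> Optional[str]:
    """Return the digest that at least `threshold` builders agree on, else None.

    `reports` maps a (hypothetical) builder name to the digest that builder
    obtained from an independent build of the same source.
    """
    if not reports:
        return None
    # Require a strict majority so two different digests can never both pass.
    if threshold * 2 <= len(reports):
        raise ValueError("threshold must be a strict majority of builders")
    digest, count = Counter(reports.values()).most_common(1)[0]
    return digest if count >= threshold else None


# Hypothetical example: three engineers build the same source independently,
# and one of them builds on a compromised machine.
good = artifact_digest(b"binary built from reviewed source")
tampered = artifact_digest(b"binary built on a compromised machine")
reports = {"alice": good, "bob": good, "carol": tampered}

agreed = reproduced_digest(reports, threshold=2)
unanimous = reproduced_digest(reports, threshold=3)
```

With a threshold of 2, the two matching builds outvote the tampered one. With a threshold of 3, no digest is accepted, which is the desired outcome when unanimity is required and one builder disagrees. None of this works unless the build is deterministic, since otherwise honest builders would never produce matching digests.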
#### Mitigations and Reference Designs
There are two general design approaches which can remove the attack surface area that resulted in the incident. Both achieve a similar set of security guarantees, but they vary significantly in approach and in the effort required to implement them. Of course, our opinion is that when a company is protecting billions of dollars of value, it's worthwhile to go through the effort of implementing all known tactics which can mitigate risk. The accounting is simple: invest in protecting yourself rather than betting on the likelihood that you will not be targeted. State-funded actors are more motivated than ever to attack companies involved in digital assets - it's simple game theory - and spending a lot to compromise these companies makes economic sense for such advanced threat actors. We've seen this type of attack before with Axie Infinity, and we will see it again. There is a large shift towards exploiting supply chains and single points of failure, as blue teams are currently largely neglecting defense / mitigation of this attack surface area.
#### Strategy 1 - Run Everything Locally
This strategy can be implemented without major adjustments to the existing system. The design focuses on moving the component which currently turns the wallet into a hot one into an offline component, thus upgrading the system to a fully cold solution. The idea consists of taking the "signing" component of the application, which currently happens in the UI, and creating an offline application, for example by using Electrum. It is important to note that even if something is offline, not all single points of failure are eliminated unless the individual can build the application themselves from source, with a bootstrapped compiler, and in a deterministic manner, something we created open source tooling for with [StageX](https://codeberg.org/stagex/stagex). To learn more about why reproducible builds matter, see [this video](https://antonlivaja.com/videos/2024-incyber-stagex-talk.mp4) of one of our co-founders explaining the SolarWinds incident and how it could have been prevented.
##### Reference Design
This reference design focuses on the Safe{Wallet} team, but applies to any team trying to build an offline component which has minimized single points of failure.
1. All system administrators are provided with dedicated offline laptops
    * Radio cards are removed (Bluetooth, Wi-Fi)
    * Each machine has never been connected to the internet
2. All engineers provision and distribute their own personal signing keys (PGP)
    * Use smart cards such as NitroKey or YubiKey
    * Only do signing operations with these keys on the personal offline system
    * Distrust has created open source tooling that simplifies secure provisioning: [Trove](https://trove.distrust.co/generated-documents/all-levels/pgp-key-provisioning.html)
3. An offline signing application is deterministically compiled, verified and signed by multiple engineers
    * Includes all necessary tools to carry out offline key operations
    * Distrust also developed [AirgapOS](https://git.distrust.co/public/airgap), which is a custom Linux OS meant for managing secret material offline. It has been audited by a third party and is being used in production by several major digital asset companies.
4. All sensitive operations are fully verified offline before any cryptographic operations take place
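
To make step 4 more concrete, the following sketch shows one possible shape of an offline pre-signing check: the request is evaluated against a locally stored policy, and signing proceeds only if no violations are found. It is an illustrative assumption rather than the Safe{Wallet} or Distrust implementation. The `TxRequest` and `Policy` types, their fields, and the example addresses are hypothetical and deliberately simplified.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class TxRequest:
    """Hypothetical, simplified transaction request; not a real wallet format."""
    to: str
    value: int
    operation: str  # e.g. "call" or "delegatecall"
    data: bytes


@dataclass(frozen=True)
class Policy:
    """Hypothetical signing policy kept on the offline machine."""
    allowed_destinations: FrozenSet[str]  # lowercase addresses
    allowed_operations: FrozenSet[str]
    max_value: int


def violations(tx: TxRequest, policy: Policy) -> List[str]:
    """Return human-readable reasons to refuse signing; empty means OK."""
    problems = []
    if tx.to.lower() not in policy.allowed_destinations:
        problems.append(f"destination {tx.to} is not allowlisted")
    if tx.operation not in policy.allowed_operations:
        problems.append(f"operation {tx.operation!r} is not permitted")
    if tx.value > policy.max_value:
        problems.append(f"value {tx.value} exceeds limit {policy.max_value}")
    return problems


# Made-up addresses for illustration only.
KNOWN_VAULT = "0x" + "aa" * 20
UNKNOWN_CONTRACT = "0x" + "bb" * 20

policy = Policy(
    allowed_destinations=frozenset({KNOWN_VAULT}),
    allowed_operations=frozenset({"call"}),
    max_value=1_000,
)

routine = TxRequest(to=KNOWN_VAULT, value=500, operation="call", data=b"")
suspicious = TxRequest(to=UNKNOWN_CONTRACT, value=0, operation="delegatecall", data=b"\x01")

routine_problems = violations(routine, policy)
suspicious_problems = violations(suspicious, policy)
```

The point of the design is that the check runs on the offline machine against the exact data that will be signed, so a compromised web front end cannot change what the signer approves.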
#### Strategy 2 - Use Remotely Verified Service
This strategy preserves a user experience nearly identical to the present one, albeit with significantly more engineering effort to add verifiability at key points of the system. The tooling to execute this design easily is not yet fully built (but we are working on it).
##### Reference Design
This design focuses on leveraging secure enclaves to create servers which are immutable, deterministic and can cryptographically attest to the software they are running. While this design gets close to the fully cold design from the previous strategy, it will inevitably remain exposed to the attack surface area of browsers, such as 0-day exploits, browser extensions, host operating system compromise etc.
1. Rewrite application to run in secure enclave
    * TLS termination inside of the enclave
    * Web interface served from inside of enclave
    * Nothing outside of the enclave is trusted
2. Create deterministic OS image with remote attestation (TPM2, Nitro Enclave or similar)
    * The whole stack is built using a full-source bootstrapped compiler and in a reproducible manner
3. One engineer deploys a new enclave with new code
4. A different engineer proves remote code matches reviewed code in the VCS repository
5. Clients are issued a service worker on first load that pins keys, allowing remote attestation verification on all subsequent loads
    * User has option to verify and download application locally for full offline operations
    * User is also encouraged to build themselves and match published signed hash
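
The pin-then-verify logic in steps 4 and 5 can be sketched roughly as follows. This is a simplified model written in Python for readability, not a service worker and not a real attestation verifier: the `AttestationPin` class, the key fingerprints and the image bytes are hypothetical, and verifying the attestation signature against the hardware vendor's root of trust is assumed to happen elsewhere.

```python
import hashlib
from typing import Optional


def measurement(image: bytes) -> str:
    """Digest of a deterministic OS image, standing in for an attested measurement."""
    return hashlib.sha256(image).hexdigest()


class AttestationPin:
    """Trust-on-first-use pin of the key that signs attestation documents."""

    def __init__(self) -> None:
        self.pinned_key: Optional[str] = None

    def check(self, attested: str, attestation_key: str, expected: str) -> bool:
        """Accept a load only if the measurement and the pinned key both match."""
        # The attested measurement must equal the one reproduced from reviewed code.
        if attested != expected:
            return False
        if self.pinned_key is None:
            # First load: remember the key so later loads can be compared to it.
            self.pinned_key = attestation_key
            return True
        return self.pinned_key == attestation_key


# Hypothetical values: `expected` is what a reviewer reproduced from the repository.
expected = measurement(b"reviewed enclave image")
pin = AttestationPin()

first_load = pin.check(expected, "key-fingerprint-A", expected)
same_service = pin.check(expected, "key-fingerprint-A", expected)
swapped_code = pin.check(measurement(b"modified image"), "key-fingerprint-A", expected)
swapped_key = pin.check(expected, "key-fingerprint-B", expected)
```

As with any trust-on-first-use scheme, the first load remains the weakest point, which is why the design also gives users the option to download, build and verify the application locally.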
Implementing these strategies can be challenging, and this is a high level overview of the type of problems we work on. Depending on the chosen approach, context and available resources, implementation can take anywhere from a few weeks to a few years.
## About Distrust
The Distrust team has helped build and secure some of the highest-risk systems in the world, such as the vaulting systems at BitGo, Unit410, and Turnkey, as well as helping electrical grid operators, industrial control system operators and others secure their mission critical systems. Distrust has also conducted security due diligence probes on most major custodians. Through working with companies that are exposed to the most sophisticated known attackers, where all attacks are viable, Distrust developed a methodology and open source tooling to help mitigate this level of threat. We are now using our hard-learned lessons to help everyone improve their security posture by sharing what we learned and creating open source tooling everyone can benefit from.