Safe{Wallet}/Bybit incident report and mitigating controls

2025-03-20

The Safe{Wallet}/Bybit incident is an example of a nation state actor using a series of sophisticated attacks to compromise high value targets. When the value at stake justifies buying 0-day vulnerabilities (in some cases several of them), combining them into elaborate exploit chains, attacking multiple layers of the tech stack, highly targeted social engineering, compromise of individuals, planting of moles or even physical attacks, the threat model needed to adequately address the risk has to be extreme. Defending against this type of attacker requires more rigorous assumptions about the attacker's capabilities, and in turn taking the time to implement additional controls that most companies don't need.

Threat Model Assumptions

The assumptions we make about nation state actors at Distrust:

  • All screens are visible to the adversary
  • All keyboards are logging to the adversary
  • Any firmware/boot-loaders not verified on every boot are compromised
  • Any host OS with network access is compromised
  • Any guest OS used for any purpose other than production access is compromised
  • At least one member of the Production Team is always compromised
  • At least one maintainer of third party code used in the system is compromised
  • At least one member of every third party system used is compromised
  • Physical attacks are viable and likely
  • Side-channel attacks are viable and likely

The suggested mitigating controls which follow in this report consist of design strategies and tooling which we developed to address exactly this type of threat model. The good news is the reference designs and concepts are available to you today, but some of the tooling needs more work - so if you care about these issues and want to help us complete the work on the missing pieces, please talk to us.

Summary

This report highlights the major single points of failure, each of which relies on a single individual and/or computer and thus creates an opportunity for compromise. Blockchains benefit from security of the network via strong cryptography and decentralization. More "traditional" parts of the infrastructure historically have not had the ability to distribute trust, but there are tactics that can be leveraged to achieve distribution of trust, which reduces the risk of a single individual or computer undermining the integrity of a system.


Root Cause Analysis and Mitigating Controls

In our opinion, the main reasons this hack occurred are these two points found in the Sygnia report:

  • ... a developers Mac OS workstation was compromised, likely through social engineering.

  • ... the modification of JavaScript resources directly on the S3 bucket serving the domain app.safe[.]global.

Introduction

The compromise occurred due to several key factors which have been nicely summarized in other reports, so this report will focus primarily on how this incident could have been prevented. It is important to point out that the naive mitigating controls, while helpful, are not enough to mitigate the risk adequately. The naive security controls we often see recommended are better safeguarding of access tokens, tighter access controls on cloud resources (such as the storage holding the JavaScript that serves the web application front-end), and monitoring. Monitoring is the quintessential reactive control rather than a preventative one, and we strongly believe it's always better to prevent wherever possible. While these improvements are important, they are more of a "plugging holes on a sinking ship" exercise than upgrading the hull to titanium. Even if improved controls are introduced around token and cloud platform management, there are still many different single points of failure in the system.

To quickly dive into the first principles we can apply here to reason about how risk shifts around, let's take the example of how the deployment pipeline is secured. Most companies have a system admin or developer who individually can modify the server (or the software it runs) that is part of the build pipeline used to build or serve the application, or can even modify the application code itself. This is considered to be a single point of failure in the Distrust threat model. The risk is difficult to mitigate, because even if the effort is taken to set up a hardened deployment pipeline to address the risk we describe here, the risk is simply shifted elsewhere. Even assuming the whole infrastructure is methodically hardened throughout the whole software development lifecycle, in the end there is always 1 super admin who has access to the cloud platform, which gives that individual complete control over the infrastructure. This is a problem most cloud platforms simply ignore, as the industry has gotten used to accepting this risk. It is worth noting that this is not about "not trusting" one's team, it's about distributing trust (Dis-trust, get it?). The question then is "does it make sense for a single individual to be able to undermine the integrity of the whole system"? Those involved in the blockchain space would likely agree it does not.

Mitigation Principles

To adequately address the risks according to the Distrust threat model, some additional assumptions have to be made about the distinction between cold and hot wallets. These statements are conclusions drawn from experience building systems at BitGo, Unit410 and Turnkey, as well as conducting security due diligence probes on most major custodian/vaulting solutions available on the market.

  • A cold cryptographic key management system is one where all involved components can be built, operated and verified offline. If any component requires trusting anything over the internet, then the entire system is a hot system by definition. In this case a system was advertised as cold that relied on internet controlled components which means it's a hot wallet according to our definition. We believe this was done for good UX reasons at the expense of a true offline quarantine that would have prevented this class of attack.

  • Cold cryptographic key management systems with a truly random entropy source are not susceptible to remote attacks and are only exposed to close proximity threats such as physical access or side-channel attacks.

  • It is a common belief that keeping a key offline will protect any system that uses that key, and that this constitutes a cold system. The problem is that an attacker does not necessarily care about stealing the key - it is sufficient for them to succeed in getting the key to perform an operation on data of their choosing.

  • All software used in the stack has to be open source, built deterministically (to allow for reproduction), and using a full source bootstrapped compiler, or the system will still be exposed to single points of failure via supply chain attack vectors.
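The reproducibility requirement in the last point can be illustrated with a minimal sketch. It assumes several engineers each build the same source independently and then compare artifact hashes; the function names and the `min_builders` parameter are hypothetical and not part of any Distrust tooling.

```python
import hashlib


def artifact_digest(artifact: bytes) -> str:
    """Return the SHA-256 hex digest of a build artifact."""
    return hashlib.sha256(artifact).hexdigest()


def verify_reproduction(builds: dict[str, bytes], min_builders: int = 2) -> str:
    """Return the agreed digest if enough independent builders produced
    bit-for-bit identical artifacts; raise ValueError otherwise.

    `builds` maps a builder name to the artifact that builder produced.
    """
    if len(builds) < min_builders:
        raise ValueError(
            f"need at least {min_builders} independent builds, got {len(builds)}"
        )
    digests = {name: artifact_digest(data) for name, data in builds.items()}
    if len(set(digests.values())) != 1:
        # Any difference means at least one build environment or input diverged.
        raise ValueError(f"builds differ: {digests}")
    return next(iter(digests.values()))
```

A check like this only means something if the build is deterministic; otherwise honest builders will disagree and the comparison cannot distinguish noise from tampering.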

Mitigations and Reference Designs

There are two general design approaches which can remove the attack surface area which resulted in the incident. Both achieve a similar set of security guarantees, but they vary significantly in approach and in the effort required to implement them. Of course, our opinion is that when a company is protecting billions of dollars of value, it's worthwhile to go through the effort of implementing all known tactics which can mitigate risk. The accounting is simple: invest in protecting yourself rather than betting that you will not be targeted. State funded actors are more motivated than ever to attack companies which are involved in digital assets - it's simple game theory - and spending a lot to compromise such companies makes economic sense for these advanced threat actors. We've seen this type of attack before with Axie Infinity, and we will see it again. There is a large shift towards exploiting supply chains and single points of failure, as blue teams are currently largely neglecting defense / mitigation of this attack surface area.

Strategy 1 - Run Everything Locally

This strategy is something that can be implemented without major adjustments to the existing system. The design is focused on moving the component which currently turns the wallet into a hot one into an offline component, thus upgrading the system to a fully cold solution. The idea consists of taking the "signing" component of the application, which currently happens in the UI, and creating an offline application, for example by using Electrum. It is important to note that even if something is offline, not all single points of failure are eliminated unless the individual can build the application themselves from source, with a bootstrapped compiler, and in a deterministic manner (something we created open source tooling for with StageX). To learn more about why reproducible builds matter, you can refer to the video of one of our co-founders explaining the SolarWinds incident and how it could have been prevented.

Reference Design

This reference design focuses on the Safe{Wallet} team, but applies to any team trying to build an offline component which has minimized single points of failure.

  1. All system administrators are provided with dedicated offline laptops

    • Radio cards are removed (bluetooth, wifi)

    • Machine that has never been connected to the internet

  2. All engineers provision and distribute their own personal signing keys (PGP)

    • Use smart cards such as NitroKey or YubiKey

    • Only do signing operations with these keys on the personal offline system

    • Distrust has created open source tooling that simplifies secure provisioning: Trove

  3. An offline signing application is deterministically compiled, verified and signed by multiple engineers

    • Includes all necessary tools to carry out offline key operations

    • Distrust also developed AirgapOS, which is a custom Linux OS meant for managing secret material offline. It has been audited by a third party and is being used in production by several major digital asset companies.

  4. All sensitive operations are fully verified offline before any cryptographic operations take place
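Step 4 above can be sketched as a comparison between what the operator intends to authorize and what the payload presented for signing actually decodes to. This is a minimal illustration, not the Safe{Wallet} data model: the `TxIntent` fields and the integer `operation` code are assumptions chosen for the example, and decoding the raw payload into these fields is assumed to happen elsewhere on the offline machine.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TxIntent:
    """Fields an operator expects to authorize (illustrative subset)."""
    to: str
    value: int
    data: bytes
    operation: int  # hypothetical operation code, for illustration only


def mismatches(expected: TxIntent, decoded: TxIntent) -> list[str]:
    """List every field where the decoded payload differs from the intent."""
    return [
        f.name
        for f in fields(TxIntent)
        if getattr(expected, f.name) != getattr(decoded, f.name)
    ]


def approve_for_signing(expected: TxIntent, decoded: TxIntent) -> None:
    """Raise ValueError unless the payload matches the intent exactly."""
    diff = mismatches(expected, decoded)
    if diff:
        raise ValueError(f"refusing to sign, fields differ: {diff}")
```

The point of the design is that the comparison runs on the offline, independently built machine, so a compromised online front-end cannot both alter the payload and alter what the verifier displays.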

Strategy 2 - Use Remotely Verified Service

This strategy re-establishes a user experience nearly identical to the present one, albeit with significantly more engineering effort to add verifiability at key points of the system. The tooling to execute on this design easily is not yet fully built (but we are working on it).

Reference Design

This design focuses on leveraging secure enclaves to create servers which are immutable, deterministic and can cryptographically attest to the software they are running. While this design gets close to the fully cold design from the previous strategy, it will inevitably remain exposed to the attack surface area of browsers, such as 0-day exploits, browser extensions, host operating system compromise etc.

  1. Rewrite application to run in secure enclave

    • TLS termination inside of the enclave

    • Web interface served from inside of enclave

    • Nothing outside of the enclave is trusted

  2. Create deterministic OS image with remote attestation (TPM2, Nitro Enclave or similar)

    • The whole stack is built using a full source bootstrapped compiler and in a reproducible manner

  3. One engineer deploys a new enclave with new code

  4. Different engineer proves remote code matches reviewed code in vcs repository

  5. Clients are issued a service worker on first load that pins keys allowing remote attestation verification on all subsequent loads

    • User has option to verify and download application locally for full offline operations

    • User is also encouraged to build themselves and match published signed hash
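The client-side logic in steps 4 and 5 can be modeled roughly as follows. This is a sketch of the decision logic only, written in Python for readability; in a browser it would live in a service worker. `PinStore` and `measurement_is_expected` are hypothetical names, and validating the signature chain of a real attestation document (TPM2, Nitro Enclave or similar) is deliberately left out.

```python
import hashlib
import hmac


class PinStore:
    """Trust-on-first-use store for an attestation key fingerprint."""

    def __init__(self) -> None:
        self._pins: dict[str, str] = {}

    def check(self, origin: str, attestation_key: bytes) -> bool:
        """Pin the key on first sight; afterwards accept only the pinned key."""
        fingerprint = hashlib.sha256(attestation_key).hexdigest()
        pinned = self._pins.get(origin)
        if pinned is None:
            # First load: nothing to compare against, so record the pin.
            self._pins[origin] = fingerprint
            return True
        return hmac.compare_digest(pinned, fingerprint)


def measurement_is_expected(attested: str, reviewed: set[str]) -> bool:
    """True if the attested image measurement is one that reviewers
    independently reproduced from the reviewed source."""
    return attested in reviewed
```

Pinning on first load still trusts that first load, which is one reason the list above also gives users the option to download, build and verify the application themselves.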

Implementing these strategies can be challenging, and this is only a high level overview of the type of problems we work on. Depending on the chosen approach, context and available resources, implementation can take anywhere from a few weeks to a few years.

About Distrust

The Distrust team has helped build and secure some of the highest risk systems in the world, such as the vaulting systems at BitGo, Unit410, and Turnkey, as well as helping electrical grid operators, industrial control system operators and others secure their mission critical systems. Distrust has also conducted security due diligence probes on most major custodians. Through working with companies that are exposed to the most sophisticated known attackers, where all attacks are viable, Distrust developed a methodology and open source tooling to help mitigate this level of threat. We are now using our hard learned lessons to help everyone improve their security posture by sharing what we learned and creating open source tooling everyone can benefit from.