update blog content

Anton Livaja 2025-03-28 09:59:36 -07:00
parent 4b50b71600
commit a7705b5217
Signed by: anton
GPG Key ID: 44A86CFF1FDF0E85
1 changed files with 70 additions and 46 deletions


@@ -1,16 +1,15 @@
---
layout: post
title: Bybit incident report and mitigating controls
title: Safe{Wallet}/Bybit incident report and mitigating controls
date: 2025-03-20
cover_image: "/assets/images/whale_shark.jpg"
authors:
- name: Anton Livaja
---
The Bybit incident is an example of a nation state actor using a series of sophisticated attacks to compromise high value targets. When the value at stake is such that it justifies spending funds on buying 0-days, in some cases multiples, and combining them into elaborate exploit chains, attacking multiple different layers of the tech stack, highly targeted social engineering, compromise of individuals, planting of moles or even physical attacks, the threat model which needs to be assumed to adequately address risks needs to be extreme.
The Safe{Wallet}/Bybit incident is an example of a nation state actor using a series of sophisticated attacks to compromise high value targets. When the value at stake is such that it justifies spending funds on buying 0-day vulnerabilities, in some cases multiple 0-day vulnerabilities, and combining them into elaborate exploit chains, attacking multiple different layers of the tech stack, highly targeted social engineering, compromise of individuals, planting of moles or even physical attacks, the threat model which needs to be assumed to adequately address risks needs to be extreme. This type of attacker requires the defender to adopt a more rigorous set of assumptions about the attacker's capabilities, and in turn to take the time to implement additional controls which most companies don't need.
### Threat Model Assumptions
The assumptions we make about nation state actors at Distrust:
* All screens are visible to the adversary
* All keyboards are logging to the adversary
* Any firmware/boot-loaders not verified on every boot are compromised
@@ -22,84 +21,109 @@ The assumptions we make about nation state actors at Distrust:
* Physical attacks are viable and likely
* Side-channel attacks are viable and likely
The suggested mitigating controls following in this report consist of tools which we developed to address exactly this type of threat model and are at varying levels of maturity. The good news is the reference designs and concepts are available to you today, but some of the tooling needs more work - so if you care about these issues and want to help us complete the work on the missing pieces, please talk to us.
The suggested mitigating controls which follow in this report consist of design strategies and tooling which we developed to address exactly this type of threat model. The good news is the reference designs and concepts are available to you today, but some of the tooling needs more work - so if you care about these issues and want to help us complete the work on the missing pieces, please [talk to us](https://distrust.co/contact.html).
### The Method
### Summary
This report highlights the major single points of failure, which rely on a single individual and/or computer, thus creating an opportunity for compromise. Blockchains benefit from security of the network via strong cryptography and decentralization. More "traditional" parts of the infrastructure historically have not had the ability to distribute trust, but with some clever tactics we can achieve a decentralization of trust which helps us ensure that no single individual or computer can compromise a system.
This report highlights the major single points of failure, which rely on a single individual and/or computer, thus creating an opportunity for compromise. Blockchains benefit from security of the network via strong cryptography and decentralization. More "traditional" parts of the infrastructure historically have not had the ability to distribute trust, but there are tactics that can be leveraged to achieve a distribution of trust, which helps us reduce the risk of a single individual or computer undermining the integrity of a system.
---
## Root Cause Analysis and Mitigating Controls
### I. Developer Workstation Compromise
> Earliest known malicious activity was identified, when a developer's Mac OS workstation was compromised, likely through social engineering. ([Sygnia report](https://www.sygnia.co/blog/sygnia-investigation-bybit-hack/))
In our opinion, the main reasons this hack occurred are these two points found in the [Sygnia report](https://www.sygnia.co/blog/sygnia-investigation-bybit-hack/):
#### Primary Mitigation
* > ... a developer's Mac OS workstation was compromised, likely through social engineering.
Day-to-day work machines should not be used for production access / managing tokens for production access. This is an operational security shortcoming, as any interaction with production systems, whether via an API token, or web interface should be done via a dedicated computer or highly isolated environment (hardware-based virtualization like QubesOS/Xen preferred) with minimal dependencies only used for carrying out production tasks. Any interactions outside of production related tasks create opportunities for the system to be compromised - downloading and opening files, downloading and running software libraries (such as Docker which was the source of malware in this case), visiting websites (yes, the browser sandbox can be broken) etc.
* > ... the modification of JavaScript resources directly on the S3 bucket serving the domain app.safe[.]global.
#### Advanced Mitigation
## Introduction
Another way to mitigate this risk is to use a hardened server, such as a secure enclave, which is immutable and can remotely attest to the code it's running. Setting up that server to only deploy code that's signed by x trusted PGP keys (or keys for another signing scheme) can achieve a state where no single individual has the ability to modify the infrastructure.
The compromise occurred due to several key factors which have been nicely summarized in other reports, so this report will focus primarily on expounding on how this incident could have been prevented. It is important to note that the naive mitigating controls, while helpful, are not enough to mitigate the risk adequately. The naive security controls which we often observe as recommendations are improved safeguarding of access tokens, tighter access controls on cloud resources such as the storage used for the JavaScript which serves the web-application front-end, and monitoring (the quintessential reactive control, rather than a preventative one, and we strongly believe it's always better to prevent wherever possible). While these improvements are important, they are more of a "plugging holes on a sinking ship" exercise, rather than upgrading the hull to titanium. Even if improved controls are introduced around the token and cloud platform management, there are still many different single points of failure in the system.
- Use [EnclaveOS](https://git.distrust.co/public/enclaveos) - a minimal and immutable operating system for running security critical software with high accountability on secure enclaves. EnclaveOS can also be extended to support multi-party management of secrets such that no person can control them alone. This can be used to set up a secure enclave which acts as the deployment system. EnclaveOS is a reference implementation, but we are happy to help invest energy into making this tool easier to use for everyone.
To quickly dive into the first principles we can apply here to reason about how risk shifts around, let's take the example of how the deployment pipeline is secured. Most companies have a system admin or developer who has the individual ability to modify the server (or software it runs) that's part of the build pipeline used to build or serve the application, or can even modify the application code itself. This is considered to be a single point of failure in the Distrust threat model. The risk is difficult to mitigate, because even if the effort is taken to set up a hardened deployment pipeline to try to address the risk we elucidate here, it is simply shifted elsewhere. Assuming that the whole infrastructure is methodically hardened throughout the whole software development lifecycle, in the end there is always one super admin who has access to the cloud platform - which gives the individual complete control over the infrastructure. This is a problem most cloud platforms simply ignore as the industry has gotten used to accepting this risk. It is worth noting that this is not about "not trusting" one's team, it's about distributing trust (Dis-trust, get it?). The question then is "does it make sense for a single individual to be able to undermine the integrity of the whole system"? Those involved in the blockchain space would likely agree it does not.
- Use [Bootproof](https://git.distrust.co/bootproof) alongside EnclaveOS to prove which software booted on a given system by leveraging platform hardware or firmware remote attestation technologies. This tool is designed but not yet in development. Currently EnclaveOS can be used with Nitro VMs on AWS with some work to achieve remote attestation - and several Distrust clients are using this setup in production today. Our team would be happy to invest energy to develop this tooling if anyone is willing to help fund it. It would unlock use of general hardware like TPMs and other remote attestation technologies to allow deploying remote attestation setups to different cloud platforms for more security via diversity.
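The rule that a deployment only proceeds when x trusted keys have signed the same artifact can be illustrated with a minimal sketch. It assumes the signatures themselves have already been verified by a real tool (for example `gpg --verify`), so each attestation is reduced to a (signer fingerprint, artifact digest) pair. The function names, the fingerprints and the threshold are hypothetical and are not part of EnclaveOS or Bootproof.

```python
import hashlib


def artifact_digest(artifact: bytes) -> str:
    """Return the SHA-256 hex digest of the exact bytes to be deployed."""
    return hashlib.sha256(artifact).hexdigest()


def quorum_met(artifact, attestations, trusted, threshold):
    """Check that enough distinct trusted signers approved this exact artifact.

    attestations: iterable of (signer_fingerprint, signed_digest) pairs whose
    signatures are ASSUMED to have been verified elsewhere.
    trusted: set of fingerprints allowed to approve deployments.
    threshold: number of distinct trusted signers required (the "x" above).
    """
    digest = artifact_digest(artifact)
    approvers = {
        signer
        for signer, signed_digest in attestations
        if signer in trusted and signed_digest == digest
    }
    # A set is used so one signer attesting twice still counts once.
    return len(approvers) >= threshold
```

Counting distinct signers rather than attestations is the point of the design: a single compromised key cannot satisfy the threshold on its own, no matter how many times it signs.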
#### Mitigation Principles
#### Additional notes
To adequately address the risks according to the Distrust threat model, some additional assumptions have to be made about the distinction between cold and hot wallets. These statements are conclusions drawn from experience stemming from building systems at BitGo, Unit410 and Turnkey, as well as conducting security due diligence probes on most major custodian/vaulting solutions available on the market.
* This isn't the first time an attack like this has happened. Those who have been around for a while will remember the [Axie Infinity Hack](https://www.bleepingcomputer.com/news/security/hackers-stole-620-million-from-axie-infinity-via-fake-job-interviews/), which also happened due to the compromise of a developer who used their day to day machine for managing cryptographic material and accessing production systems.
* A *cold* cryptographic key management system is one where all involved components can be built, operated and verified offline. If any component requires trusting anything over the internet, then the entire system is a *hot* system by definition. In this case a system was advertised as cold that relied on internet controlled components which means it's a hot wallet according to our definition. We believe this was done for good UX reasons at the expense of a true offline quarantine that would have prevented this class of attack.
* The use of tools like Mobile Device Management on systems for production access is not recommended, as they create a single point of failure. Most MDM solutions mean that a third party has complete access to the fleet of computers it's "protecting", and even a self hosted MDM creates a large single point of failure which is challenging to mitigate to a reasonable degree. Instead, the approach should rely on making the surface area for attack so minimal that introducing anything else introduces more risk than benefit. For illustrative purposes, imagine a hardware-based virtual machine which only has a minimal operating system, the CLI tool for the preferred cloud platform, and a network interface which has a firewall configuration permitting only connections to a specific production asset. If this system is only used for accessing that specific asset, the introduction of anything additional, including an MDM or anti-malware/anti-virus software, actually increases the surface area for attack. Of course, this is a stepping stone to improve controls around accessing production systems until better mitigating controls can be put in place, making it impossible to directly interact with and change production systems as an individual.
* *Cold* cryptographic key management systems with a truly random entropy source are *not* susceptible to remote attacks and are only exposed to close proximity threats such as physical access or side-channel attacks.
* Additional resiliency can be achieved by running the deployment system across multiple accounts with different ownership or even different cloud platforms. This is out of scope for this report, which focuses on mitigating controls where most companies should start their journey to improve their supply chain security.
* It is a common belief that keeping a key offline will protect any system that uses that key, and that this constitutes a cold system. The problem is that an attacker does not necessarily care about stealing the key - it is sufficient for them to succeed in achieving an outcome where the key performs an operation on the desired data.
* It is also worth noting that it appears a Docker container with network connectivity was used to compromise a developer's machine initially. This points to an often overlooked issue, which is that Docker is not a secure containerization technology, as it makes it fairly trivial to move files across the container boundary, as part of its design. This is useful for some use cases but not for strict isolation - which should instead rely on hardware-based virtualization.
* All software used in the stack has to be open source, built deterministically (to allow for reproduction), and using a full source bootstrapped compiler, or the system will still be exposed to single points of failure via supply chain attack vectors.
### II. JavaScript Code Tampering
#### Mitigations and Reference Designs
> Preliminary incident reports by both Sygnia and Verichains were shared by Bybit's CEO, Ben Zhou in his X post. Both reports highlighted the same attack vector - the modification of JavaScript resources directly on the S3 bucket serving the domain app.safe[.]global. ([Sygnia report](https://www.sygnia.co/blog/sygnia-investigation-bybit-hack/))
There are two general design approaches which can remove the attack surface area which resulted in the incident. These two approaches both achieve a similar set of security guarantees, but vary significantly in the approach and the effort required to implement them. Of course, our opinion is that when a company is protecting billions of dollars of value, it's worthwhile to go through the effort of implementing all known tactics which can mitigate risk. The accounting is simple: invest in protecting yourself rather than betting that you will not be targeted. State funded actors are more motivated than ever to attack companies which are involved in digital assets - it's simple game theory - and spending a lot to compromise the aforementioned companies makes economic sense for these advanced threat actors. We've seen this type of attack before with Axie Infinity, and we will see it again. There is a large shift towards exploiting supply chains and single points of failure as blue teams are currently largely neglecting defense / mitigation of this attack surface area.
#### Primary Mitigation
#### Strategy 1 - Run Everything Locally
Ensure that the bucket / server serving the website can not be modified by a single individual. Set up immutable infrastructure by deploying software using a hardened server—such as an enclave—that only serves software reproduced across multiple systems and signed by a set of trusted parties. The software is then deployed to an immutable server or bucket for secure delivery to clients. The main risk to mitigate here is the "root" access account controlling the infrastructure. However, secure enclaves and remote attestation can effectively reduce this risk ([EnclaveOS](https://git.distrust.co/enclaveos) + [Bootproof](https://git.distrust.co/bootproof)).
This strategy is something that can be implemented without major adjustments to the existing system. The design is focused on moving the component which is currently turning the wallet into a hot one into an offline component, thus upgrading the system to a fully cold solution. The idea consists of taking the "signing" component of the application which currently happens in the UI, and creating an offline application, for example by using Electrum. It is important to note that even if something is offline, not all single points of failure are eliminated, unless the individual can build the application themselves from source, with a bootstrapped compiler, and in a deterministic manner (something we created open source tooling for with [StageX](https://codeberg.org/stagex/stagex)). To learn more about why reproducible builds matter, see [this video](https://antonlivaja.com/videos/2024-incyber-stagex-talk.mp4) of one of our co-founders explaining the SolarWinds incident and how it could have been prevented.
#### Advanced Mitigation
* Leverage bit-for-bit reproducibility to ensure that the software being delivered has not been tampered with. In the case of JS code, which is not compiled but interpreted, the source code can be reviewed and hashed to provide a way of checking the integrity of the code. This process of hashing should be done in trusted isolated environments, and ideally on multiple machines to ensure that no single computer has the ability to tamper with the code.
* [This video](https://antonlivaja.com/videos/2024-incyber-stagex-talk.mp4) (4:30-6:30) explains how reproducibility helps protect the integrity of software. For those new to reproduction and determinism, it's advised to watch the whole video.
##### Reference Design
* This attack vector actually extends to all underlying software used in the build environment, such as the different libraries, as well as the compiler. To maximally mitigate this risk, a bootstrapped compiler should be used, and all software including the compiler itself should be built deterministically to close off tampering attack vectors across the whole foundation of software used in build environments. This allows one to reproduce the identical bit-for-bit binary in diverse environments (different OS, different chipset, different cloud platform, different access etc.), and ensure that the output is still exactly the same - proving there has been no tampering.
This reference design focuses on the Safe{Wallet} team, but applies to any team trying to build an offline component which has minimized single points of failure.
* Use [StageX](https://codeberg.org/stagex/stagex) to reproduce your software and close off compiler and environment risks. StageX is a minimalism and security first repository of reproducible and multi-signed OCI images of common open source software toolchains full-source bootstrapped from Stage 0 all the way up. It's currently actively being used by [Talos Linux](https://github.com/siderolabs/talos/releases/tag/v1.10.0-alpha.2), [Mysten Labs (SUI)](https://github.com/MystenLabs/sui/tree/jnaulty/stagex-update) and [Turnkey](https://whitepaper.turnkey.com/foundations) to name a few of the widely known projects.
* Use [ReprOS](https://codeberg.org/stagex/repros) to help with reproduction. It's a bare-bones immutable OS designed for securely reproducing and signing software. Each build is executed in a one-time use environment, eliminating persistent risks. This project is currently in beta.
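The idea of hashing the reviewed source on several machines and comparing the results can be sketched as follows. This is only an illustration under stated assumptions: the files are passed in as an in-memory mapping, and the names `tree_digest` and `builders_agree` are hypothetical, not part of StageX or ReprOS.

```python
import hashlib


def tree_digest(files: dict[str, bytes]) -> str:
    """Hash a set of files deterministically.

    Paths are processed in sorted order and every field is length-prefixed,
    so the digest does not depend on iteration order or on how path and
    content bytes happen to concatenate.
    """
    h = hashlib.sha256()
    for path in sorted(files):
        encoded_path = path.encode("utf-8")
        content = files[path]
        h.update(len(encoded_path).to_bytes(8, "big"))
        h.update(encoded_path)
        h.update(len(content).to_bytes(8, "big"))
        h.update(content)
    return h.hexdigest()


def builders_agree(reported: dict[str, str]) -> bool:
    """True only if at least two machines reported and all digests match."""
    return len(reported) >= 2 and len(set(reported.values())) == 1
```

A release would only be accepted when independently operated machines report the same digest; a single differing digest signals that either the source or one of the machines has been tampered with.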
1. All system administrators are provided with dedicated offline laptops
#### Additional Notes
All third party code should be manually reviewed. Currently most companies rely on Static Application Security Testing tools. This is not enough, as SAST tools are unable to detect novel exploits. The cost of using open source code, at a minimum, should be to review every line of code manually. If companies are so stringent about having developers review their first party code, why do companies choose to not apply the same principles to third party code? It is burdensome, but necessary for high risk targets. If you're unfamiliar with what's possible with supply chain attacks, a good example is the [xz backdoor](https://en.wikipedia.org/wiki/XZ_Utils_backdoor).
* Radio cards are removed (bluetooth, wifi)
- Distrust's answer to this is [SigRev](https://git.distrust.co/public/sigrev), which helps harness the power of nerds to create a repository of signed reports for reviews of open source software. The idea is that companies can come together to fund review of common open source software, to save money, and simultaneously help secure open source software. SigRev has been designed, but is not yet in development and is seeking funding.
* Machine that has never been connected to the internet
### III. Compromise of WebUI
> Bybit initiated a transaction from the targeted cold wallet using Safe{Wallet}'s web interface. The transaction was manipulated, and the attackers siphoned the funds from the cold wallets. ([Sygnia report](https://www.sygnia.co/blog/sygnia-investigation-bybit-hack/))
2. All engineers provision and distribute their own personal signing keys (PGP)
#### Primary Mitigation
Initializing transactions from a WebUI leaves a lot of surface area for attack, as browsers are known for being difficult to protect. This is due to the nature of what a browser is - a window into the open internet. Additionally, the V8 engine, which is the backbone of most browsers, is an immensely complex and difficult surface area to defend, resulting in frequent 0-day vulnerabilities, as well as supply chain issues.
* Use smart cards such as NitroKey or YubiKey
* Do not sign transactions involving large sums in a browser.
* Only do signing operations with these keys on the personal offline system
* Use offline trusted environments for signing, to protect key material, and mitigate the risk of a compromised UI displaying incorrect information. In the case of the Bybit hack in particular, preventing the JS tampering would have mitigated this risk, but other supply chain attack vectors which can achieve the same outcome remain (extensions, V8 engine 0-day exploits etc.). By using a minimal set of CLI tools to sign transactions offline, the WebUI compromise would have been avoided.
* Distrust has created open source tooling that simplifies secure provisioning: [Trove](https://trove.distrust.co/generated-documents/all-levels/pgp-key-provisioning.html)
* Use [AirgapOS](https://git.distrust.co/public/airgap), which is an immutable, diskless OS used for offline secret management and operations. It is a Swiss Army knife which essentially turns a laptop into a hardware wallet. Some modifications to the laptop are required, such as removing its radio cards. Inside of it are [keyfork](https://git.distrust.co/public/keyfork) and [Icepick](https://git.distrust.co/public/icepick), which are tools for generating and managing entropy from which keys can be derived for different cryptographic algorithms, as well as for cryptographic signing operations. Keyfork and Icepick are both extremely minimal, written in Rust, and currently support Solana, Pyth, Cosmos, Kyve and Seda because we received funding to implement those, but they can be extended to support other chains. We are currently working on Bitcoin and would be happy to add support for Ethereum as well; this is not a political decision, we simply had individual sponsors fund support for those blockchains first. These three tools are all being used in production today by multiple clients, and have been audited by several security firms whose reports can be found in the respective repositories.
3. An offline signing application is deterministically compiled, verified and signed by multiple engineers
## Extras
* Includes all necessary tools to carry out offline key operations
We have noticed that many companies still neglect basic security hygiene practices that apply to everyone and could meaningfully improve the security of systems with relatively little effort.
* Distrust also developed [AirgapOS](https://git.distrust.co/public/airgap), which is a custom Linux OS that is meant for managing secret material offline. It has been audited by a third party and is being used in production by several major digital asset companies.
1. Adopt FIDO2 as MFA wherever possible and avoid using SMS, TOTP, Yubico OTP, email codes and push notifications. If your provider doesn't offer FIDO2, you should ask them why as it's objectively the best type of MFA currently available.
4. All sensitive operations are fully verified offline before any cryptographic operations take place
2. Use smart cards for FIDO2, and for managing PGP keys which can be used for *signing commits* and *ssh* access. We built [tooling and guides](https://qvs.distrust.co/generated-documents/all-levels/pgp-key-provisioning.html) which make it easy to provision PGP keys and load them onto smart cards. Signing commits is helpful as it can help protect against modification of code via attacks like commit spoofing, and keeping the SSH key securely inside of a smart card is akin to keeping seed phrases safely stored in HSMs.
#### Strategy 2 - Use Remotely Verified Service
This strategy re-establishes a user experience nearly identical to the present one, albeit with significantly more engineering effort to add verifiability at key points of the system. The tooling to execute on this design easily is not yet fully built (but we are working on it).
##### Reference Design
This design focuses on leveraging secure enclaves to create servers which are immutable, deterministic and can cryptographically attest to the software they are running. While this design gets close to the fully cold design from the previous step, it will inevitably remain exposed to the attack surface area of browsers, such as 0-day exploits, browser extensions, host operating system compromise etc.
1. Rewrite application to run in secure enclave
* TLS termination inside of the enclave
* Web interface served from inside of enclave
* Nothing outside of the enclave is trusted
2. Create deterministic OS image with remote attestation (TPM2, Nitro Enclave or similar)
* The whole stack is built using full source bootstrapped compiler and in a reproducible manner
3. One engineer deploys a new enclave with new code
4. Different engineer proves remote code matches reviewed code in vcs repository
5. Clients are issued a service worker on first load that pins keys allowing remote attestation verification on all subsequent loads
* User has option to verify and download application locally for full offline operations
* User is also encouraged to build themselves and match published signed hash
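The pinning in step 5 is a form of trust-on-first-use, which can be sketched in a few lines. This is a simplified illustration written in Python rather than as an actual browser service worker, and it does not verify attestation documents; it only shows how a key remembered on first load lets every later load be checked against it. The `PinStore` class and the example origins are hypothetical.

```python
import hashlib
import hmac


class PinStore:
    """Remember the first key seen for an origin and reject any later change."""

    def __init__(self):
        self._pins = {}  # origin -> SHA-256 fingerprint of the pinned key

    def check(self, origin: str, public_key: bytes) -> bool:
        fingerprint = hashlib.sha256(public_key).hexdigest()
        pinned = self._pins.get(origin)
        if pinned is None:
            # First load: trust the presented key and remember it.
            self._pins[origin] = fingerprint
            return True
        # Subsequent loads: the presented key must match the pinned one.
        return hmac.compare_digest(pinned, fingerprint)
```

The weak point of any trust-on-first-use scheme is the first load itself, which is why the design also gives users the option to download, build and verify the application locally.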
Implementing these strategies can be challenging, and this is a high level overview of the type of problems we work on. Depending on the chosen approach, context and available resources, implementation can take anywhere from a few weeks to a few years.
## Summary
The Distrust team has helped build and secure some of the highest-risk systems in the world, such as the vaulting systems at BitGo, Unit410, and Turnkey, as well as helping electrical grid operators, industrial control system operators and others. Through working with companies that are exposed to the most sophisticated known attackers where all attacks are viable, Distrust developed a methodology to help mitigate this level of threat. We are now using our hard learned lessons to help everyone improve their security posture by open sourcing all our learnings and creating open source tooling everyone can benefit from.
## About Distrust
The Distrust team has helped build and secure some of the highest-risk systems in the world, such as the vaulting systems at BitGo, Unit410, and Turnkey, as well as helping electrical grid operators, industrial control system operators and others secure their mission critical systems. Distrust has also conducted security due diligence probes on most major custodians. Through working with companies that are exposed to the most sophisticated known attackers where all attacks are viable, Distrust developed a methodology and open source tooling to help mitigate this level of threat. We are now using our hard learned lessons to help everyone improve their security posture by sharing what we learned and creating open source tooling everyone can benefit from.