Add blog post no.8
parent
41400598e8
commit
89b703d898
|
@ -0,0 +1,108 @@
|
||||||
|
---
|
||||||
|
layout: post
|
||||||
|
title: "Update #8 - Custom Rust Library Optimizations"
|
||||||
|
author: ["Christian Reitter"]
|
||||||
|
date: 2024-04-29 10:00:00 +0000
|
||||||
|
---
|
||||||
|
|
||||||
|
This is a developer-focused research update on code optimizations for `secp256k1` libraries.
|
||||||
|
|
||||||
|
<div id="toc-container" markdown="1">
|
||||||
|
<h2 class="no_toc">Table of Contents</h2>
|
||||||
|
* placeholder
|
||||||
|
{:toc}
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<br/>
|
||||||
|
|
||||||
|
## Speeding up Research Tasks
|
||||||
|
|
||||||
|
For our research into weak wallet usage, we wrote Rust programs that crunch through potential wallets of different types and investigate their usage on the blockchain. Notably, we're still doing the necessary calculations purely on the CPU, as opposed to offloading some computations to a graphics card (GPU) to speed them up.
|
||||||
|
|
||||||
|
Because of this design decision, we had to invest significant effort into selecting fast libraries and applying various optimizations to maintain acceptable performance while searching through large key ranges. The requirement for fast code is a challenge and complicates development, but learning how to make good use of the available CPU time is a skill that applies well beyond this specific research work, so the effort is not wasted.
|
||||||
|
|
||||||
|
Checking a key candidate for some on-chain usage with a certain coin involves a lot of costly steps, but there are two particular speed bottlenecks:
|
||||||
|
1. Performing many HMAC-SHA512 hashing steps for BIP39
|
||||||
|
- relevant when going from the original weak random entropy input to the master private key of a given wallet
|
||||||
|
2. Calculating a lot of `secp256k1` key derivations for BIP32
|
||||||
|
- relevant for determining public keys (-> addresses) for commonly used derivation paths
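
To make the cost of the second bottleneck more concrete, here is a minimal sketch of how a derivation path such as `m/44'/0'/0'/0/0` breaks down into individual BIP32 child indexes. Each index corresponds to one child key derivation step. The function name `parse_path` and the error strings are hypothetical and only serve this illustration; the hardened offset of `2^31` comes from the BIP32 specification.

```rust
/// BIP32 marks hardened child indexes by setting the highest bit (2^31).
const HARDENED_BIT: u32 = 0x8000_0000;

/// Parse a path like "m/44'/0'/0'/0/0" into raw BIP32 child indexes.
/// Hypothetical helper for illustration; not taken from the `bitcoin` crate.
fn parse_path(path: &str) -> Result<Vec<u32>, String> {
    let mut parts = path.split('/');
    if parts.next() != Some("m") {
        return Err("path must start with 'm'".to_string());
    }
    parts
        .map(|part| {
            // A trailing ' or h marks a hardened derivation step.
            let (digits, hardened) = match part
                .strip_suffix('\'')
                .or_else(|| part.strip_suffix('h'))
            {
                Some(d) => (d, true),
                None => (part, false),
            };
            let index: u32 = digits
                .parse()
                .map_err(|e| format!("bad index {:?}: {}", part, e))?;
            if index >= HARDENED_BIT {
                return Err(format!("index {} out of range", index));
            }
            Ok(if hardened { index | HARDENED_BIT } else { index })
        })
        .collect()
}
```

For the path above, this yields five child indexes, three of them hardened. Every wallet candidate therefore needs five derivation steps before a single address can be checked.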
|
||||||
|
|
||||||
|
This research update is specifically about speeding up the `secp256k1` derivations.
|
||||||
|
|
||||||
|
## Faster `secp256k1` Derivations in Rust
|
||||||
|
### Generic Improvements
|
||||||
|
|
||||||
|
Getting good performance out of a given computer setup usually involves a lot of "generic" optimization decisions, such as
|
||||||
|
- Building on top of a programming language that has good runtime performance
|
||||||
|
- Picking libraries that are already optimized for the task
|
||||||
|
- Using all available resources and features of the local machine
|
||||||
|
- Determining and enabling relevant compile-time & runtime optimization flags
|
||||||
|
|
||||||
|
In our case, this meant relying on the [bitcoin](https://lib.rs/crates/bitcoin) and [secp256k1](https://lib.rs/crates/secp256k1) Rust crates, which use the well-known [libsecp256k1](https://github.com/bitcoin-core/secp256k1) C library under the hood for the heavy lifting. We distribute wallet calculation tasks in parallel to all available logical CPU cores using [rayon](https://github.com/rayon-rs/rayon).
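
The following sketch illustrates the idea of spreading a key range over all logical cores. It deliberately uses only the standard library (`std::thread::scope`) with static chunks as a stand-in, so it is not how rayon works internally and not our actual code. `check_candidate`, `logical_cores` and `count_hits` are hypothetical names, and the predicate is a toy placeholder for the real derivation and lookup work.

```rust
use std::thread;

/// Hypothetical placeholder for the expensive per-wallet work
/// (key derivation plus address lookup). Here: a cheap toy predicate.
fn check_candidate(seed: u32) -> bool {
    seed.wrapping_mul(2_654_435_761) % 1_000_003 == 0
}

/// Number of logical CPU cores, falling back to 1 if it cannot be determined.
fn logical_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Count matching candidates in `start..end`, split into one chunk per worker.
fn count_hits(start: u32, end: u32, workers: usize) -> u64 {
    let workers = workers.max(1) as u64;
    let total = u64::from(end.saturating_sub(start));
    // Ceiling division, so the chunks cover the whole range.
    let chunk = (total + workers - 1) / workers;
    thread::scope(|scope| {
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                let lo = u64::from(start) + w * chunk;
                let hi = (lo + chunk).min(u64::from(end));
                // Each worker scans its own sub-range independently.
                scope.spawn(move || {
                    (lo..hi).filter(|&c| check_candidate(c as u32)).count() as u64
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}
```

A call such as `count_hits(0, 1_000_000, logical_cores())` keeps every core busy. With rayon, the same loop is typically written as a parallel iterator over the range, which additionally balances uneven work between threads.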
|
||||||
|
|
||||||
|
A number of Rust optimization settings, most of them set in `Cargo.toml`, can help to push the performance beyond the normal `--release` baseline, for example:
|
||||||
|
- Using faster `panic` abort behavior
|
||||||
|
- Link time optimization (`lto`)
|
||||||
|
- Using fewer `codegen-units`
|
||||||
|
- `RUSTFLAGS="-C target-cpu=native"`
|
||||||
|
- stripping symbols
|
||||||
|
|
||||||
|
This moves the needle towards a slightly faster runtime, at the cost of slower compilation and fewer debugging options for the release target.
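
As a rough sketch, the settings listed above could look like the following configuration. The concrete values are assumptions for illustration and not necessarily the exact ones we used. The first fragment belongs in `Cargo.toml`, the second one in `.cargo/config.toml` as an alternative to exporting `RUSTFLAGS` manually.

```toml
# Cargo.toml (illustrative values, tune for your own workload)
[profile.release]
panic = "abort"      # abort on panic instead of unwinding
lto = "fat"          # link time optimization across all crates
codegen-units = 1    # fewer codegen units: slower build, better optimization
strip = "symbols"    # strip symbols from the final binary

# .cargo/config.toml (equivalent to RUSTFLAGS="-C target-cpu=native")
[build]
rustflags = ["-C", "target-cpu=native"]
```

Note that binaries built with `target-cpu=native` may not run on machines with older CPUs.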
|
||||||
|
|
||||||
|
Ultimately, though, these optimizations don't do very much for this particular workload. Even with all generic Rust toolchain optimizations in place and all CPU cores at nearly 100% utilization during runtime, we felt that things were still not going as fast as they could. Therefore, we started taking a closer look at what work was being done behind the scenes, profiling it, and checking if we actually needed all of it to happen: "work smarter, not harder".
|
||||||
|
|
||||||
|
### Doing the same Work with Fewer Steps
|
||||||
|
During our search for optimizations, we discovered the old [secp256k1_fast_unsafe](https://github.com/llamasoft/secp256k1_fast_unsafe) fork from 2016, which showcased a number of ways the cryptographic code could be reworked to obtain significant speedups.
|
||||||
|
|
||||||
|
In the end, the target-specific optimizations we applied mostly fell into three categories:
|
||||||
|
|
||||||
|
- Improve C code generation (the good)
|
||||||
|
- `libsecp256k1` pre-generates certain data tables. Switch to much larger but faster tables via increased `ECMULT_GEN_PREC_BITS`. This requires raising internal hardcoded limits.
|
||||||
|
- Pick the fastest local C compiler and optimization level.
|
||||||
|
- Ensure machine-specific C compiler optimizations are active via `-march=native`.
|
||||||
|
|
||||||
|
- Speedups through insecure code (the bad)
|
||||||
|
- Cryptographic libraries use special, slower computations designed to be constant-time to avoid leaking sensitive information. All the keys we handle are already weak enough to be compromised, so we switch to the faster variable-time function versions wherever possible.
|
||||||
|
- Skip zeroing out sensitive memory.
|
||||||
|
- Allow the compiler to "short circuit" some evaluations to speed them up.
|
||||||
|
|
||||||
|
- Breaking API changes to skip work (the ugly)
|
||||||
|
- The `bitcoin` crate always computes a special fingerprint after each derivation step, which we don't end up using. Skipping this calculation is a breaking change, but this isn't a problem for our use case.
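
The table trade-off from the first category can be illustrated with a toy example. The sketch below is not `libsecp256k1` code and does not use elliptic curves. It assumes a small multiplicative group modulo a prime as a stand-in and shows the general fixed-base windowing idea: a wider window means a much larger precomputed table, but fewer group operations per computation. All names (`FixedBaseTable`, `pow_naive`, the constants `P` and `G`) are hypothetical.

```rust
/// Toy group: integers modulo a prime, with a fixed base G (stand-in for the curve generator).
const P: u64 = 1_000_000_007;
const G: u64 = 5;

fn mul(a: u64, b: u64) -> u64 {
    a * b % P // both factors are below P, so the product fits into u64
}

/// Baseline without precomputation: square-and-multiply over all 32 exponent bits.
fn pow_naive(mut e: u32) -> u64 {
    let (mut base, mut acc) = (G, 1u64);
    while e > 0 {
        if e & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        e >>= 1;
    }
    acc
}

/// Precomputed table: rows[i][j] = G^(j * 2^(window_bits * i)) mod P.
struct FixedBaseTable {
    window_bits: u32,
    rows: Vec<Vec<u64>>,
}

impl FixedBaseTable {
    fn new(window_bits: u32) -> Self {
        assert!((1..=16).contains(&window_bits));
        let windows = (32 + window_bits - 1) / window_bits;
        let mut rows = Vec::with_capacity(windows as usize);
        let mut row_base = G;
        for _ in 0..windows {
            let mut row = Vec::with_capacity(1usize << window_bits);
            let mut acc = 1u64;
            for _ in 0..(1u32 << window_bits) {
                row.push(acc);
                acc = mul(acc, row_base);
            }
            // acc is now row_base^(2^window_bits), the base of the next row.
            row_base = acc;
            rows.push(row);
        }
        Self { window_bits, rows }
    }

    /// One table lookup and one multiplication per window, no squarings.
    fn pow(&self, e: u32) -> u64 {
        let mask = (1u32 << self.window_bits) - 1;
        let mut acc = 1u64;
        for (i, row) in self.rows.iter().enumerate() {
            let digit = (e >> (i as u32 * self.window_bits)) & mask;
            acc = mul(acc, row[digit as usize]);
        }
        acc
    }
}
```

In this toy setup, a 4-bit window needs 8 rows of 16 entries and 8 multiplications per result, while an 8-bit window needs 4 rows of 256 entries and only 4 multiplications. The same principle is why larger pre-generated tables speed up the generator multiplication, at the price of more memory.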
|
||||||
|
|
||||||
|
Additionally, `secp256k1_fast_unsafe` has the concept of [Batched Key Serialization](https://github.com/llamasoft/secp256k1_fast_unsafe?tab=readme-ov-file#batched-key-serialization), which we haven't used so far.
|
||||||
|
|
||||||
|
### Open Source the Improvements
|
||||||
|
We've now published these optimizations as custom forks 📦🎉:
|
||||||
|
- [rust-secp256k1-unsafe-fast](https://git.distrust.co/milksad/rust-secp256k1-unsafe-fast)
|
||||||
|
- [rust-bitcoin-unsafe-fast](https://git.distrust.co/milksad/rust-bitcoin-unsafe-fast)
|
||||||
|
|
||||||
|
Please note that the changes **deliberately break important security mechanisms and functionality**, making this code completely unsuitable for any type of production use. We will also not provide support, maintenance or warranties of any kind. This code is meant to be a helpful point-in-time reference for other security researchers, and we don't recommend any other usage or depending on it to work right. You've been warned 😉
|
||||||
|
|
||||||
|
### Optimizations
|
||||||
|
|
||||||
|
Fair and representative benchmarking is an art form, and we're too short on time to do the matter justice.
|
||||||
|
Therefore we'll just stick to some rough ballpark numbers for a real workload:
|
||||||
|
- AMD Ryzen 7950X3D based machine running Linux
|
||||||
|
- calculating a Bitcoin address with derivation path `m/44'/0'/0'/0/0` for each wallet
|
||||||
|
- crunching through `2^32` = 4.29 billion different wallets
|
||||||
|
- perform address lookups against a [bloom filter]({% link _posts/2023-12-06-research-update-3.md %}) in memory
|
||||||
|
- do some other work and I/O during program startup
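
For readers unfamiliar with the lookup step, here is a minimal bloom filter sketch using only the Rust standard library. It is an illustration under simplifying assumptions and not the implementation used in our tooling. The type `BloomFilter`, its parameters and the double-hashing scheme based on `DefaultHasher` are choices made for this example only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal in-memory bloom filter (illustrative sketch).
struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BloomFilter {
    fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes: num_hashes.max(1) }
    }

    /// Two base hashes; all further hash values are derived as a + i * b.
    fn hash_pair(item: &[u8]) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        item.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (item, 0x9e37_79b9_u32).hash(&mut h2);
        (h1.finish(), h2.finish() | 1)
    }

    /// Bit positions that belong to an item.
    fn positions(&self, item: &[u8]) -> impl Iterator<Item = u64> + '_ {
        let (a, b) = Self::hash_pair(item);
        (0..u64::from(self.num_hashes))
            .map(move |i| a.wrapping_add(i.wrapping_mul(b)) % self.num_bits)
    }

    fn insert(&mut self, item: &[u8]) {
        let positions: Vec<u64> = self.positions(item).collect();
        for p in positions {
            self.bits[(p / 64) as usize] |= 1 << (p % 64);
        }
    }

    /// false: definitely not present. true: possibly present (false positives happen).
    fn might_contain(&self, item: &[u8]) -> bool {
        self.positions(item)
            .all(|p| self.bits[(p / 64) as usize] & (1 << (p % 64)) != 0)
    }
}
```

The filter never reports a false negative, so every candidate address that was actually used is kept. The rare false positives can be weeded out afterwards with an exact lookup.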
|
||||||
|
|
||||||
|
In this scenario, replacing the standard libraries with the optimized library variants results in a speedup of over **6x** and brings down the total runtime from about **540 minutes** (9 hours) to about **90 minutes**, which is a great improvement!
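
To put these ballpark numbers into perspective, the overall throughput can be derived with simple arithmetic. The helper below is a hypothetical illustration and only restates the rounded figures from this section.

```rust
/// Average throughput for a run that processes `wallets` candidates in `minutes`.
fn wallets_per_second(wallets: u64, minutes: u64) -> f64 {
    wallets as f64 / (minutes * 60) as f64
}

/// Speedup factor between two runtimes for the same workload.
fn speedup(before_minutes: u64, after_minutes: u64) -> f64 {
    before_minutes as f64 / after_minutes as f64
}
```

With `2^32` wallets, this works out to roughly 133 thousand wallets per second for the 540 minute run and roughly 795 thousand wallets per second for the 90 minute run, across all cores and including the bloom filter lookups.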
|
||||||
|
|
||||||
|
### Potential Future Work
|
||||||
|
|
||||||
|
In the world of performance optimizations, there's always more that can be done, although often with diminishing returns.
|
||||||
|
For example, we've noticed that the [hashing functions](https://github.com/rust-bitcoin/rust-bitcoin/tree/master/hashes) in `rust-bitcoin` are implemented as pure Rust code without the use of optimized, CPU-architecture-specific assembly code, unlike [RustCrypto hashes](https://github.com/rustcrypto/hashes). This leaves room for improvements.
|
||||||
|
|
||||||
|
On the side of the "generic" optimizations, we haven't made use of [Profile-guided optimization](https://en.wikipedia.org/wiki/Profile-guided_optimization). This feature allows the C and Rust compilers to create more fine-tuned code that is often a few percent faster, but requires additional effort to provide suitable profile data.
|
||||||
|
|
||||||
|
Perhaps we'll revisit this area in the future.
|
||||||
|
|
||||||
|
## Summary & Outlook
|
||||||
|
|
||||||
|
In this research update, we discussed speeding up wallet private key & public key derivations through the deliberate removal of security mechanisms and non-essential functionality in two popular Rust cryptocurrency libraries. Although this approach is a bad idea in most other contexts, these unsafe optimizations may help other security researchers to work more easily with many wallets or derivations.
|
||||||
|
|
||||||
|
If you would like to see more articles with developer-level details of our work, [let us know]({% link index.md %}#contact)!
|
||||||
|
|
||||||
|
<br/>
|