---
layout: post
title: "Update #3 - Bloom Filter, Dataset, Canaries"
author: ["Christian Reitter"]
date: 2023-12-06 11:10:00 +0000
---

This research update has some information on the Bloom filter mechanism and public blockchain address data we used to find weak Bitcoin wallets. Using this technique, we were able to check several billion potential wallets for actual usage on the blockchain without running a Bitcoin full node or flooding other Bitcoin servers and APIs with excessive network requests.

We also describe some artificially created wallets that we've placed to track the real-world theft behavior in one of the weak ranges.

<div id="toc-container" markdown="1">
<h2 class="no_toc">Table of Contents</h2>
* placeholder
{:toc}
</div>

## Bloom Filter Explanation and Address Data Source

When searching through billions of algorithm-generated data chunks that could reveal a few interesting private keys, efficient filtering becomes very important. In our original publication, [we briefly described]({% link disclosure.md %}#searching-for-wallets---implementation) this as follows:

> We used a publicly available list of all Bitcoin addresses historically seen by the Bitcoin network and constructed a bloom filter with a very low false positive rate on the data set. Using this filter, we were able to do quick address lookups to query and discard many unused wallet candidates, for which the relevant derived accounts were never seen by the network, without doing costly lookups to a Bitcoin full node.

A [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) is a probabilistic data structure that provides quick membership checks against previously added elements. Unlike a [hash table](https://en.wikipedia.org/wiki/Hash_table) or other common lossless, read-optimized structures, the Bloom filter deliberately trades off some lookup accuracy for space efficiency. This makes in-RAM lookups possible for datasets that would otherwise be too large. Depending on the settings used when creating the filter and the number of inserted items, a lookup will falsely report an item as being in the original set - a false positive - for a certain percentage of queries. The opposite error cannot occur: an item that was added is never reported as missing. In return for the false positives, only a fraction of the original data footprint has to be kept in memory, which made this structure very attractive for our use case.
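To make the trade-off concrete, here is a minimal Python sketch of the data structure. It is an illustration only and not the Rust code used for the research: the class name `BloomFilter`, the SHA-256-based double hashing, the placeholder item names, and the tiny parameters are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k bit positions per item."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k positions from two 64-bit halves of one SHA-256 digest
        # (double hashing). This is an illustrative choice, not the scheme
        # of any particular library.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        # Set every bit position that belongs to the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # The item may be present only if all of its bits are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter(num_bits=1 << 16, num_hashes=7)
seen.add(b"address-seen-on-chain")  # placeholder item, not a real address

# An inserted item is always reported as present (no false negatives).
print(b"address-seen-on-chain" in seen)
# An absent item is reported as absent, except for rare false positives.
print(b"address-never-seen" in seen)
```

The filter never stores the items themselves, only the bits they map to. This is why it stays small, and also why a positive answer needs to be confirmed against the real data in a later step.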

In the first days of our research, we experimented with a Python proof of concept to test out this data structure for our tasks. After converging on Rust as the main language for our tooling, the [bloomfilter](https://github.com/jedisct1/rust-bloom-filter) crate became our tool of choice. This library is very fast, but fairly minimal, and doesn't have a built-in mechanism to export pre-generated Bloom filters to disk and import them again. For this reason, we wrote some serialization code to do this for us, as seen in the [published code](https://git.distrust.co/milksad/lookup/src/branch/main/bloom-filter-generator) for the `bloom-filter-generator` and its [use](https://git.distrust.co/milksad/lookup/src/branch/main/mnemonic-hash-checker/src/bloom.rs) in the lookup server process. For the research code, we're using the [Rayon](https://github.com/rayon-rs/rayon) library to parallelize our worker threads. All workers share a single Bloom filter object, which avoids memory duplication and matters when running dozens of threads.
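The two ideas in this paragraph, saving a pre-built filter to disk and letting many workers query one shared copy, can be sketched in Python as follows. Everything here is a hypothetical stand-in: the file layout (`HEADER`), the helper names, and the hashing scheme are assumptions and do not mirror the published Rust serialization code. Python threads also do not give the CPU parallelism that Rayon provides, so the thread pool only illustrates sharing a single read-only object.

```python
import hashlib
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical file layout: bit count (u64) and hash count (u32), then raw bits.
HEADER = struct.Struct("<QI")


def positions(item: bytes, num_bits: int, num_hashes: int) -> list:
    # Illustrative double hashing based on one SHA-256 digest.
    digest = hashlib.sha256(item).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]


def build(items, num_bits: int, num_hashes: int) -> bytearray:
    bits = bytearray((num_bits + 7) // 8)
    for item in items:
        for pos in positions(item, num_bits, num_hashes):
            bits[pos // 8] |= 1 << (pos % 8)
    return bits


def save(path: str, bits: bytes, num_bits: int, num_hashes: int) -> None:
    # Store the parameters with the bits so the filter can be rebuilt later.
    with open(path, "wb") as fh:
        fh.write(HEADER.pack(num_bits, num_hashes))
        fh.write(bits)


def load(path: str):
    with open(path, "rb") as fh:
        num_bits, num_hashes = HEADER.unpack(fh.read(HEADER.size))
        return fh.read(), num_bits, num_hashes


def contains(bits: bytes, num_bits: int, num_hashes: int, item: bytes) -> bool:
    return all(
        bits[p // 8] & (1 << (p % 8)) for p in positions(item, num_bits, num_hashes)
    )


# Build once, write to disk, then load the immutable copy for lookups.
original = build([b"addr-a", b"addr-b"], num_bits=1 << 16, num_hashes=7)
path = os.path.join(tempfile.mkdtemp(), "filter.bin")
save(path, original, 1 << 16, 7)
bits, num_bits, num_hashes = load(path)

# All workers read the same `bits` object; nothing is copied per thread.
candidates = [b"addr-a", b"addr-b", b"addr-c"]
with ThreadPoolExecutor(max_workers=4) as pool:
    hits = list(pool.map(lambda c: contains(bits, num_bits, num_hashes, c), candidates))
print(hits)
```

Because lookups never modify the filter, sharing it between workers needs no locking. The expensive build step can also run once, with later runs only loading the file.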

To check if the wallet addresses we derived from the generated weak private keys were previously used, we need a collection of addresses that were used on-chain, ideally covering every address ever seen publicly. For a blockchain like Bitcoin, which has a long history and frequent changes of receive/change addresses, this is a lot of data. We considered using only addresses seen after a certain date (such as the first `bx` code commit with the vulnerable mechanism). However, we had some resources to spare and decided against this additional restriction, to ensure we wouldn't miss other wallet keys that were older than expected or came from other generation sources.

The most comprehensive and up-to-date public collection of Bitcoin Mainnet addresses that we could find to build our filter is from `blockchair.com`, via [https://blockchair.com/dumps](https://blockchair.com/dumps). Due to download speed limits and the split nature of the data, we did not use this download source directly. Instead, we went with a derivative of this data.

User `LoyceV` from the `bitcointalk.org` forum distributes regularly updated data sets assembled from the individual `blockchair.com` data dump snippets via [http://alladdresses.loyce.club/](http://alladdresses.loyce.club/), as far as we've understood from [public forum posts](https://bitcointalk.org/index.php?topic=5254914.0). This was just what we needed for Bitcoin, and a valuable resource to kickstart our research, so we're thankful it's publicly hosted without any barriers 👍.

Our `all_Bitcoin_addresses_ever_used_sorted.txt.gz` list snapshot from ca. 2023-08-01, which we used for our initial searches, comes in at ca. 42 Gigabytes in uncompressed form and has ca. 1.19 billion individual Bitcoin addresses. The corresponding Bloom filter that we built from it reduced this to ca. 7.3 Gigabytes in size (with a false positive rate of 0.00000000001 for searches), which is far less data to keep in RAM. These numbers should explain why we are interested in a fast lookup mechanism with a reduced memory footprint compared to the original data. Since false positives are still annoying to deal with in later processing stages, we've further reduced the false positive rate in our later research by `100x`, which has worked out quite well.
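As a rough cross-check of these numbers, the textbook sizing formulas for a Bloom filter can be evaluated directly. The sketch below assumes the standard optimum (`m = -n * ln(p) / ln(2)^2` bits and `k = m / n * ln(2)` hash functions) and a hypothetical helper name. Whether the crate rounds its parameters in exactly this way is not something this sketch claims.

```python
import math


def optimal_bloom_parameters(num_items: int, false_positive_rate: float):
    """Textbook optimum for bit count m and hash count k (rounding is an assumption)."""
    bits = math.ceil(
        -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes


# Roughly 1.19 billion addresses at a false positive rate of 1e-11.
bits, hashes = optimal_bloom_parameters(1_190_000_000, 1e-11)
print(f"{bits / 8 / 2**30:.1f} GiB, {hashes} hash functions")
# prints "7.3 GiB, 37 hash functions"
```

Under these assumptions, each address costs about 53 bits, which matches the ca. 7.3 Gigabytes figure when it is read in binary units (GiB). The same formula suggests that a `100x` lower false positive rate costs only about 10 additional bits per address, or roughly 8.6 GiB in total.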

Going forward, we would like to extend our search to some other selected coins, but are still looking for recently updated, comprehensive data collections that are publicly available.
If you're aware of public and well-maintained address/pubkey/pubkey-hash collections for Ethereum and other popular coins, we would love to hear from you [directly]({% link index.md %}#contact)!

## Canary Wallet Observations

Very early into the `bx` vulnerability discovery, one of our team members deliberately moved small amounts of Bitcoin onto wallets whose private keys were generated with the known-vulnerable `bx seed -b 256 | bx mnemonic-new` command. At this point, we already understood the main weakness and could deliberately generate specific weak keys, but did not yet have custom tooling to search through the vulnerable range. Setting up some "canary" wallets with a few dollars in Bitcoin each was therefore a cheap and simple way to gather data on the behavior of attackers.

One of our questions was: are attackers now actively watching the vulnerable range for _new deposits_, and quickly acting upon them?
At least for the `bx` BIP39 range with 24 mnemonic words and the derivation paths we used, this was not the case initially. By the time of publication of this blog post, however, all four sub-wallets have been emptied:

| PRNG ID | derivation path | address | original deposit | theft transaction | theft date |
| -- | -- | -- | -- | -- | -- |
| `0x000001f4` | `m/44'/0'/0'/0/0` | {{ "13KqxkrmsPKy8gyYwochCQTuPHC7Lp8bFU" | BtcLinkAddressUrlFull }}  | $5 | {{ "ff8c6822846d835e5a476bf268ab4ddba396d476f0f1b5301eea62c6acfa9c3a" | BtcLinkTxUrlSliced }} | 2023-08-23 01:23 |
| `0x000001f4` | `m/0/0` | {{ "1NxkqwmsQMTqv4SrggPv4vGHDzJKR52S2f" | BtcLinkAddressUrlFull }} | $5 |  {{ "256b6b987af466b4239048272534167a0e7d197f0c3fa716c1ba24fee3f3a851" | BtcLinkTxUrlSliced }} | 2023-08-27 12:20 |
| `0xffffffff` | `m/44'/0'/0'/0/0` | {{ "1HQR3nKaDahAFrPHMoDVdWiMNFGFb7cHA5" | BtcLinkAddressUrlFull }} | $5 | {{ "48354a8bee5cb71eccb725b501f43e6351823a1d4d6dcdd1033214335b18a3d5" | BtcLinkTxUrlSliced }} | 2023-09-30 09:07 |
| `0xffffffff` | `m/0/0` | {{ "16pQhPkBa5puwEzudZVyKtsrugLtA87cy" | BtcLinkAddressUrlFull }} | $1 | {{ "8d09a736a442f87f7f31c691c068a8e526f67093250720de83b028c4ed1f03cd" | BtcLinkTxUrlSliced }} | 2023-10-01 22:16 |

Considering that the deposits were made after the main 2023-07-12 theft, the low per-wallet funds, and the theft dates, the thieves sweeping the funds are likely not related to the main attacker. It's still interesting to see that even a weak wallet with as little as $1 in BTC gets emptied sooner or later. The sharks are clearly in the water now 🦈.

Note that the `m/0/0` derivation path we used is an older pattern, and rare - we haven't found other `bx`-generated Bitcoin wallets in this range. Attackers may have looked into some of these unusual paths more exhaustively just for these particular wallet PRNG IDs after discovering some usage via the more common M44 P2PKH standard path pattern.
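The structural difference between the two paths in the table is easier to see once they are expanded into BIP32 child indices. The helper below is a hypothetical illustration (the function name is an assumption). It relies only on the BIP32 convention that a trailing `'` marks a hardened index, encoded by adding `0x80000000`.

```python
HARDENED_OFFSET = 0x80000000  # BIP32: indices >= 2**31 are hardened


def parse_derivation_path(path: str) -> list[int]:
    """Expand a path such as m/44'/0'/0'/0/0 into numeric child indices."""
    parts = path.split("/")
    if parts[0] != "m":
        raise ValueError("path must start with 'm'")
    indices = []
    for part in parts[1:]:
        hardened = part.endswith("'")
        index = int(part.rstrip("'"))
        indices.append(index + HARDENED_OFFSET if hardened else index)
    return indices


print(parse_derivation_path("m/44'/0'/0'/0/0"))
# prints [2147483692, 2147483648, 2147483648, 0, 0]
print(parse_derivation_path("m/0/0"))
# prints [0, 0]
```

The short `m/0/0` path takes only two non-hardened derivation steps from the master key, while the `m/44'/0'/0'/0/0` path takes five, three of them hardened. An attacker who wants to cover both has to derive and check each path separately for every candidate seed.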

## Summary & Outlook
In this post, we introduced a combination of data structure and data set that we successfully used to look up large numbers of addresses.
Additionally, we listed some previously internal information about deliberately created weak wallets and related theft patterns.

We still have a long backlog of research topics to present here. We'll try to get the next post ready before the holidays 🎁

Check out our [RSS]({% link feed.xml %}) feed if you want to get notified by your favorite reader application.

<br/>