Bitcoin Address Clustering

This project extracts information from the Bitcoin blockchain to create transaction graphs and perform chain analysis through Bitcoin address clustering. The main goal is to group Bitcoin addresses that likely belong to the same entity using various heuristic methods, enabling better understanding of Bitcoin transaction flows and entity behaviors.

Features

Blockchain Data Collection: Automated extraction of Bitcoin blockchain data from blocks 0 to 115,000
Transaction Graph Generation: Creates directed graphs with transactions as nodes and UTXOs as edges
Multiple Heuristics: Implements 10+ different clustering heuristics
Spark Integration: Uses Apache Spark for distributed data processing
Interactive Visualization: Web-based interface for exploring address clusters
Chain Analysis: Tools for analyzing transaction flows and entity movements
Comprehensive Results: Clustering reduces the entity count by about 45%, or by about 78% under the small cluster assumption

Installation

Prerequisites

Python 3.7+
Apache Spark
Google Colab (for the provided notebook) or a local Jupyter environment
Sufficient storage space for blockchain data

Dependencies

Install the required packages:

pip install pyspark
pip install PyDrive
pip install wget
pip install pyvis
pip install streamlit
pip install networkx
pip install matplotlib

Setup

Clone the repository:

git clone https://github.com/VincenzoImp/Bitcoin-Address-Clustering.git
cd Bitcoin-Address-Clustering

Configure Spark context with appropriate memory settings:

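from pyspark import SparkConf

# Memory and core settings sized for the full dataset; lower them on smaller machines
conf = (
    SparkConf()
    .set("spark.executor.memory", "50G")
    .set("spark.driver.memory", "50G")
    .set("spark.driver.maxResultSize", "50G")
    .set("spark.driver.cores", "10")
    .set("spark.sql.analyzer.maxIterations", "100000")
)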

Usage

Basic Usage

Set Global Constants:

start_block = 0
end_block = 115000

Download Dataset:

v_path, e_path, a_path, d_path = download_dataset(start_block, end_block, DATA_DIR, spark, True)

Generate Transaction Graph:

nx_graph = generate_nx_graph(v_df, e_df, graph_path, start_block, end_block, True)

Apply Clustering Algorithm:

clustered_addresses = address_clustering(nx_graph, a_df, known_tx_df, spark, start_block, debug=True)

Web Application

Launch the interactive web interface, passing the start and end block as arguments (app.py is in the app/ directory):

streamlit run app.py 0 115000
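
Streamlit forwards positional arguments that follow the script name to the script's sys.argv. A minimal sketch of how the two block numbers could be read, assuming they arrive as the first two positional arguments; parse_block_range is a hypothetical helper, not a function from this repository:

```python
def parse_block_range(argv):
    """Return (start_block, end_block) from a list such as ['app.py', '0', '115000']."""
    if len(argv) != 3:
        raise SystemExit("usage: streamlit run app.py <start_block> <end_block>")
    start_block, end_block = int(argv[1]), int(argv[2])
    if start_block < 0 or end_block < start_block:
        raise SystemExit("expected 0 <= start_block <= end_block")
    return start_block, end_block

# Inside the app this would be called with sys.argv; a literal list is used here
print(parse_block_range(["app.py", "0", "115000"]))
```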

Project Structure

Bitcoin-Address-Clustering/
├── dataset/
│   └── blocks-0-115000/
│       ├── vertices-0-115000/
│       ├── edges-0-115000/
│       └── addresses-0-115000/
├── app/
│   ├── app.py
│   ├── Bitcoin.png
│   └── bitcoin-img.svg
├── Bitcoin_Address_Clustering.ipynb
└── README.md
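
The dataset folders encode the block range in their names. As a sketch, the paths for a given range could be derived as follows; dataset_paths is a hypothetical helper that only mirrors the layout shown above, and the repository may build these paths differently:

```python
from pathlib import Path

def dataset_paths(data_dir, start_block, end_block):
    """Map each dataset name to its folder for the given block range."""
    tag = f"{start_block}-{end_block}"
    root = Path(data_dir) / f"blocks-{tag}"
    return {name: root / f"{name}-{tag}" for name in ("vertices", "edges", "addresses")}

paths = dataset_paths("dataset", 0, 115000)
print(paths["edges"].as_posix())
```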

Methodology

The project follows a systematic approach:

Data Collection: Extract transaction data from Bitcoin blockchain via Blockchain.info API
Graph Construction: Build directed graphs with transactions as nodes and UTXOs as edges
Heuristic Application: Apply multiple clustering heuristics in sequence
Result Analysis: Evaluate clustering effectiveness and entity reduction
Visualization: Generate interactive graphs for cluster exploration
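
The graph construction step can be illustrated on toy data: each transaction is a node, and each spent output becomes a directed edge from the transaction that created it to the transaction that spends it. This is a minimal sketch using plain dictionaries; the input format and build_tx_graph are assumptions for illustration, not the notebook's generate_nx_graph:

```python
from collections import defaultdict

# Hypothetical toy data: each transaction lists the outputs it spends
# as (creating_txid, output_index) pairs
transactions = {
    "tx_a": {"inputs": []},                          # coinbase: spends nothing
    "tx_b": {"inputs": [("tx_a", 0)]},
    "tx_c": {"inputs": [("tx_a", 1), ("tx_b", 0)]},
}

def build_tx_graph(transactions):
    """Return {creating_tx: [(spending_tx, output_index), ...]}."""
    edges = defaultdict(list)
    for txid, tx in transactions.items():
        for prev_txid, out_index in tx["inputs"]:
            edges[prev_txid].append((txid, out_index))
    return dict(edges)

graph = build_tx_graph(transactions)
print(graph)
```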

Heuristics Implemented

Simple Heuristics

Satoshi Heuristic: Groups addresses from early coinbase transactions (blocks < 19,500) as likely belonging to Satoshi Nakamoto
Coinbase Transaction Mining Address Clustering: Assumes all output addresses from coinbase transactions belong to the same miner
Common-Input-Ownership: Groups all input addresses in multi-input transactions as belonging to the same entity
Single Input/Output: Treats single-input, single-output transactions as transfers between addresses of the same entity
Consolidation Transaction: Groups addresses in transactions with multiple inputs and a single output
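
The common-input-ownership rule amounts to merging address sets that share a transaction, which a union-find (disjoint-set) structure handles well. The following is a minimal sketch on toy data; the UnionFind class, cluster_by_common_inputs, and the input format are assumptions for illustration, not the repository's implementation:

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)  # register unseen items as their own root
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def cluster_by_common_inputs(tx_inputs):
    """tx_inputs: one list of input addresses per transaction."""
    uf = UnionFind()
    for addresses in tx_inputs:
        for addr in addresses:
            uf.find(addr)
        for addr in addresses[1:]:
            uf.union(addresses[0], addr)  # all inputs join the first input's cluster
    clusters = {}
    for addr in list(uf.parent):
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values())

# A and B are spent together, then B and C, so A, B and C end up in one cluster
clusters = cluster_by_common_inputs([["A", "B"], ["B", "C"], ["D"]])
print(sorted(sorted(c) for c in clusters))
```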

Advanced Heuristics

Payment Transaction Analysis: Identifies payment transactions with change addresses
Change Address Detection: Uses multiple sub-heuristics:
- Same address in input and output
- Address reuse patterns
- Unnecessary input analysis
- New address identification
- Round number detection
Mixed Transaction Recognition: Identifies and handles CoinJoin transactions using taint analysis
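
Three of the change address sub-heuristics listed above can be sketched for two-output transactions. The function names, the order in which the rules are tried, and the 0.01 BTC roundness threshold are all assumptions made for this example; the notebook may combine the sub-heuristics differently:

```python
def is_round(value_satoshi, unit=1_000_000):
    """Treat multiples of 0.01 BTC (1,000,000 satoshi) as round; the threshold is arbitrary."""
    return value_satoshi % unit == 0

def guess_change_address(input_addresses, outputs, seen_addresses):
    """outputs: list of (address, value_satoshi). Return the likely change address or None."""
    if len(outputs) != 2:
        return None
    # Same address in input and output: the sender returned the change to itself
    for address, _ in outputs:
        if address in input_addresses:
            return address
    # New address identification: exactly one output address has never appeared before
    fresh = [a for a, _ in outputs if a not in seen_addresses]
    if len(fresh) == 1:
        return fresh[0]
    # Round number detection: exactly one output is non-round, so the round one is the payment
    non_round = [a for a, v in outputs if not is_round(v)]
    if len(non_round) == 1:
        return non_round[0]
    return None

outputs = [("B", 500_000_000), ("C", 123_456)]
print(guess_change_address({"A"}, outputs, seen_addresses={"A", "B"}))
```

Returning None when no rule singles out one output keeps the sketch conservative, since a wrong change address would merge two unrelated entities.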

Results

Clustering Effectiveness

Initial Addresses: ~1,000,000 unique addresses
After Clustering: ~550,000 entities (45% reduction)
With Small Cluster Assumption: ~220,000 entities (78% reduction)
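
The reduction percentages follow directly from the approximate counts above, as this short check shows (reduction_percent is a helper defined only for this example):

```python
def reduction_percent(initial, remaining):
    """Percentage by which the entity count dropped relative to the initial count."""
    return 100 * (initial - remaining) / initial

print(reduction_percent(1_000_000, 550_000))  # 45.0
print(reduction_percent(1_000_000, 220_000))  # 78.0
```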

Entity Distribution

The clustering reveals a power-law distribution of entity sizes:

Most entities contain 1-2 addresses
Few large entities contain hundreds of addresses
Largest clusters likely represent exchanges or major services
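
A size distribution like this can be computed from an address-to-entity mapping. The mapping format, the toy data, and cluster_size_histogram are assumptions for illustration rather than the notebook's own output format:

```python
from collections import Counter

def cluster_size_histogram(address_to_entity):
    """Return a Counter mapping cluster size to the number of entities of that size."""
    entity_sizes = Counter(address_to_entity.values())  # entity -> number of addresses
    return Counter(entity_sizes.values())               # size -> number of entities

# Toy mapping: entity 1 owns three addresses, entity 3 owns two, entities 2 and 4 own one each
address_to_entity = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 3, "F": 3, "G": 4}
hist = cluster_size_histogram(address_to_entity)
print(sorted(hist.items()))
```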

Web Application

The project includes a Streamlit-based web interface that allows users to:

Input Bitcoin addresses for clustering analysis
Visualize transaction graphs with cluster highlighting
Explore entity relationships and transaction flows
Download clustering results and statistics

Features:

Interactive network visualization using PyVis
Real-time address clustering
Detailed transaction information
Export capabilities

Use Cases

1. Entity Movement Visualization

Track how funds move between addresses belonging to the same entity:

address = '115uADbwcLhfKeWJzy7EHjSWjn3dpHK1vZ'
cluster_graph = visualize_entity_movements(address)

2. Chain Analysis Queries

Answer specific questions about blockchain activity:

"How many unique miners were active before 2011?"
"What's the largest entity by address count?"
"Which entities show mixing behavior?"
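
Two of these questions reduce to simple aggregations once every address has an entity label. This sketch assumes a plain dictionary from address to entity id and a list of coinbase output addresses; both helpers and the toy data are hypothetical:

```python
from collections import Counter

def largest_entity(address_to_entity):
    """Return (entity_id, address_count) for the entity with the most addresses."""
    return Counter(address_to_entity.values()).most_common(1)[0]

def count_unique_miners(coinbase_addresses, address_to_entity):
    """Count distinct entities among coinbase output addresses.

    Addresses missing from the mapping are counted as their own entity.
    """
    return len({address_to_entity.get(a, a) for a in coinbase_addresses})

address_to_entity = {"A": "e1", "B": "e1", "C": "e1", "D": "e2", "E": "e3"}
print(largest_entity(address_to_entity))
print(count_unique_miners(["A", "B", "D", "Z"], address_to_entity))
```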

3. Research Applications

Academic research on Bitcoin privacy
Compliance and AML investigations
Cryptocurrency forensics
Network analysis studies

Data Sources

Blockchain Data: Blockchain.info API
Block Range: Genesis block (0) to block 115,000
Time Period: January 2009 to February 2011
Transactions: ~400,000 transactions analyzed

Performance Considerations

Memory Requirements: 50GB+ RAM recommended for full dataset
Processing Time: Several hours for complete clustering
Storage: ~10GB for preprocessed datasets
Scalability: Designed for distributed processing with Spark

Limitations

Privacy Techniques: Advanced privacy methods (CoinJoin, mixers) can reduce clustering effectiveness
False Positives: Heuristics may incorrectly group unrelated addresses
Temporal Scope: Analysis limited to early Bitcoin history (2009-2011)
Data Availability: Depends on external API availability

Contributing

Contributions are welcome! Please feel free to submit pull requests, report bugs, or suggest new features.

Development Guidelines

Follow PEP 8 style guidelines
Add comprehensive docstrings
Include unit tests for new heuristics
Update documentation for new features

Future Enhancements

Extended Block Range: Support for more recent blockchain data
Advanced Clustering: Integration of machine learning approaches
Real-time Analysis: Live blockchain monitoring capabilities
Privacy Metrics: Quantitative privacy assessment tools

References

Nakamoto, "Bitcoin: A Peer-to-Peer Electronic Cash System"
Meiklejohn et al., "A Fistful of Bitcoins: Characterizing Payments Among Men with No Names"
Reid & Harrigan, "An Analysis of Anonymity in the Bitcoin System"
Blockchain.info API Documentation

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Bitcoin Core developers
Apache Spark community
Blockchain.info for API access
Academic research community for heuristic development

Note: This tool is intended for research and educational purposes. Users should comply with applicable laws and regulations when analyzing blockchain data.
