This project extracts information from the Bitcoin blockchain to create transaction graphs and perform chain analysis through Bitcoin address clustering. The main goal is to group Bitcoin addresses that likely belong to the same entity using various heuristic methods, enabling better understanding of Bitcoin transaction flows and entity behaviors.
- Features
- Installation
- Usage
- Project Structure
- Methodology
- Heuristics Implemented
- Results
- Web Application
- Use Cases
- Contributing
- License
- Blockchain Data Collection: Automated extraction of Bitcoin blockchain data from blocks 0 to 115,000
- Transaction Graph Generation: Creates directed graphs with transactions as nodes and UTXOs as edges
- Multiple Heuristics: Implements 10+ different clustering heuristics
- Spark Integration: Uses Apache Spark for distributed data processing
- Interactive Visualization: Web-based interface for exploring address clusters
- Chain Analysis: Tools for analyzing transaction flows and entity movements
- Comprehensive Results: Reduces the entity count by roughly 45% after clustering, and by roughly 78% with the small cluster assumption
- Python 3.7+
- Apache Spark
- Google Colab (for the provided notebook) or local Jupyter environment
- Sufficient storage space for blockchain data
Install the required packages:
pip install pyspark
pip install PyDrive
pip install wget
pip install pyvis
pip install streamlit
pip install networkx
pip install matplotlib
- Clone the repository:
git clone https://github.com/VincenzoImp/Bitcoin-Address-Clustering.git
cd Bitcoin-Address-Clustering
- Configure Spark context with appropriate memory settings:
from pyspark import SparkConf
conf = SparkConf()\
    .set('spark.executor.memory', '50G')\
    .set('spark.driver.memory', '50G')\
    .set('spark.driver.maxResultSize', '50G')\
    .set('spark.driver.cores', '10')\
    .set('spark.sql.analyzer.maxIterations', '100000')
- Set Global Constants:
start_block = 0
end_block = 115000
- Download Dataset:
v_path, e_path, a_path, d_path = download_dataset(start_block, end_block, DATA_DIR, spark, True)
- Generate Transaction Graph:
nx_graph = generate_nx_graph(v_df, e_df, graph_path, start_block, end_block, True)
- Apply Clustering Algorithm:
clustered_addresses = address_clustering(nx_graph, a_df, known_tx_df, spark, start_block, debug=True)
Launch the interactive web interface from the app/ directory, passing the start and end blocks as arguments:
streamlit run app.py 0 115000
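The two trailing values in the command above match the block range used elsewhere in this README. As a minimal sketch, assuming the app reads them from `sys.argv` (the helper names `parse_block_range` and `dataset_dir` are hypothetical and not taken from `app.py`), the range could be validated and mapped to the dataset folder like this:

```python
def parse_block_range(argv):
    """Return (start_block, end_block) from a list such as ['app.py', '0', '115000'].

    Hypothetical helper: the real app.py may read its arguments differently.
    """
    if len(argv) != 3:
        raise ValueError("expected: app.py <start_block> <end_block>")
    start_block, end_block = int(argv[1]), int(argv[2])
    if start_block < 0 or end_block < start_block:
        raise ValueError("block range must satisfy 0 <= start_block <= end_block")
    return start_block, end_block


def dataset_dir(start_block, end_block):
    """Build the dataset folder name, following the 'blocks-0-115000' layout."""
    return f"dataset/blocks-{start_block}-{end_block}"


start_block, end_block = parse_block_range(["app.py", "0", "115000"])
print(dataset_dir(start_block, end_block))
```

Validating the range up front means a mistyped argument fails before any dataset is loaded.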
Bitcoin-Address-Clustering/
├── dataset/
│ └── blocks-0-115000/
│ ├── vertices-0-115000/
│ ├── edges-0-115000/
│ └── addresses-0-115000/
├── app/
│ ├── app.py
│ ├── Bitcoin.png
│ └── bitcoin-img.svg
├── Bitcoin_Address_Clustering.ipynb
└── README.md
The project follows a systematic approach:
- Data Collection: Extract transaction data from Bitcoin blockchain via Blockchain.info API
- Graph Construction: Build directed graphs with transactions as nodes and UTXOs as edges
- Heuristic Application: Apply multiple clustering heuristics in sequence
- Result Analysis: Evaluate clustering effectiveness and entity reduction
- Visualization: Generate interactive graphs for cluster exploration
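The graph construction step can be illustrated with a small sketch. It uses only the standard library and a plain adjacency dictionary as a stand-in for the NetworkX graph the notebook builds. The transaction field names (`txid`, `inputs`, `outputs`) and the toy data are assumptions made for illustration, not the project's actual schema.

```python
# Toy transactions. Each input references (previous txid, output index).
transactions = [
    {"txid": "cb1", "inputs": [], "outputs": [("addrA", 50.0)]},
    {"txid": "tx2", "inputs": [("cb1", 0)],
     "outputs": [("addrB", 30.0), ("addrC", 20.0)]},
    {"txid": "tx3", "inputs": [("tx2", 0), ("tx2", 1)],
     "outputs": [("addrD", 50.0)]},
]


def build_tx_graph(transactions):
    """Return {txid: [edge, ...]} with one edge per spent output.

    Nodes are transactions. An edge goes from the transaction that created
    an output to the transaction that spends it, and carries the output's
    address and value.
    """
    # Index every output by (txid, output index) so inputs can look it up.
    outputs = {}
    for tx in transactions:
        for index, (address, value) in enumerate(tx["outputs"]):
            outputs[(tx["txid"], index)] = (address, value)

    graph = {tx["txid"]: [] for tx in transactions}
    for tx in transactions:
        for prev_txid, index in tx["inputs"]:
            address, value = outputs[(prev_txid, index)]
            graph[prev_txid].append(
                {"to": tx["txid"], "address": address, "value": value}
            )
    return graph


graph = build_tx_graph(transactions)
for txid, edges in graph.items():
    print(txid, "->", [edge["to"] for edge in edges])
```

In this sketch, outputs that are never spent produce no edge, so `tx3` ends up as a node with no outgoing edges.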
- Satoshi Heuristic: Groups addresses from early coinbase transactions (blocks < 19,500) as likely belonging to Satoshi Nakamoto
- Coinbase Transaction Mining Address Clustering: Assumes all output addresses from coinbase transactions belong to the same miner
- Common-Input-Ownership: Groups all input addresses in multi-input transactions as belonging to the same entity
- Single Input/Output: Treats single input, single output transactions as address movements within the same entity
- Consolidation Transaction: Groups addresses in transactions with multiple inputs and single output
- Payment Transaction Analysis: Identifies payment transactions with change addresses
- Change Address Detection: Uses multiple sub-heuristics:
  - Same address in input and output
  - Address reuse patterns
  - Unnecessary input analysis
  - New address identification
  - Round number detection
- Mixed Transaction Recognition: Identifies and handles CoinJoin transactions using taint analysis
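To make the common-input-ownership heuristic concrete, here is a minimal sketch that merges the input addresses of each transaction with a union-find structure. It illustrates the general technique and is not the notebook's Spark-based implementation. The `UnionFind` class, the function name, and the toy input lists are all assumptions.

```python
class UnionFind:
    """Disjoint-set structure keyed by address strings."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        # Walk up to the root of the set.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_common_inputs(tx_input_addresses):
    """Group addresses that appear together as inputs of one transaction."""
    uf = UnionFind()
    for inputs in tx_input_addresses:
        for address in inputs:
            uf.find(address)  # register the address even if it stays alone
        for other in inputs[1:]:
            uf.union(inputs[0], other)

    clusters = {}
    for address in list(uf.parent):
        clusters.setdefault(uf.find(address), set()).add(address)
    return list(clusters.values())


# Each inner list holds the input addresses of one transaction (toy data).
tx_inputs = [["a1", "a2"], ["a2", "a3"], ["b1"], ["c1", "c2"]]
clusters = cluster_common_inputs(tx_inputs)
print(sorted(sorted(cluster) for cluster in clusters))
```

Because `a2` appears in two transactions, the sketch merges `a1`, `a2`, and `a3` into one entity even though `a1` and `a3` never appear as inputs of the same transaction.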
- Initial Addresses: ~1,000,000 unique addresses
- After Clustering: ~550,000 entities (45% reduction)
- With Small Cluster Assumption: ~220,000 entities (78% reduction)
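The percentages above follow directly from the approximate counts. A short check, using the rounded figures from this README (the function name is hypothetical):

```python
def entity_reduction_percent(initial_addresses, entities):
    """Percentage by which clustering reduces the number of distinct entities."""
    return (initial_addresses - entities) / initial_addresses * 100


initial = 1_000_000  # approximate unique addresses before clustering
print(round(entity_reduction_percent(initial, 550_000)))  # after clustering
print(round(entity_reduction_percent(initial, 220_000)))  # small cluster assumption
```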
The clustering reveals a power-law distribution of entity sizes:
- Most entities contain 1-2 addresses
- Few large entities contain hundreds of addresses
- Largest clusters likely represent exchanges or major services
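One way to inspect such a distribution is to count how many entities exist at each cluster size. The sketch below assumes the clustering result is available as a plain address-to-entity mapping. That format, the function name, and the toy data are assumptions for illustration.

```python
from collections import Counter


def cluster_size_histogram(address_to_entity):
    """Return a Counter mapping cluster size -> number of entities of that size."""
    entity_sizes = Counter(address_to_entity.values())  # entity -> address count
    return Counter(entity_sizes.values())               # size -> entity count


# Toy clustering result: e1 has 3 addresses, e3 has 2, e2 and e4 have 1 each.
address_to_entity = {
    "a1": "e1", "a2": "e1", "a3": "e1",
    "b1": "e2",
    "c1": "e3", "c2": "e3",
    "d1": "e4",
}

histogram = cluster_size_histogram(address_to_entity)
for size in sorted(histogram):
    print(f"{histogram[size]} entities with {size} address(es)")
```

Plotting the histogram on log-log axes is a common way to eyeball whether the sizes roughly follow a power law.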
The project includes a Streamlit-based web interface that allows users to:
- Input Bitcoin addresses for clustering analysis
- Visualize transaction graphs with cluster highlighting
- Explore entity relationships and transaction flows
- Download clustering results and statistics
- Interactive network visualization using PyVis
- Real-time address clustering
- Detailed transaction information
- Export capabilities
Track how funds move between addresses belonging to the same entity:
address = '115uADbwcLhfKeWJzy7EHjSWjn3dpHK1vZ'
cluster_graph = visualize_entity_movements(address)
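The idea behind this kind of view can be sketched without the project's code: keep only the transfers whose sender and receiver fall in the same entity as the queried address. This is an illustration under assumed data shapes (a list of `(from_address, to_address, value)` tuples and an address-to-entity mapping). It does not reproduce what `visualize_entity_movements` does internally.

```python
def entity_movements(address, transfers, address_to_entity):
    """Return the transfers that stay inside the entity owning `address`."""
    entity = address_to_entity[address]
    return [
        (sender, receiver, value)
        for sender, receiver, value in transfers
        if address_to_entity.get(sender) == entity
        and address_to_entity.get(receiver) == entity
    ]


# Toy data: e1 owns a1..a3, e2 owns b1 and b2.
address_to_entity = {"a1": "e1", "a2": "e1", "a3": "e1", "b1": "e2", "b2": "e2"}
transfers = [
    ("a1", "a2", 5.0),  # internal to e1
    ("a2", "b1", 3.0),  # crosses from e1 to e2
    ("a3", "a1", 1.0),  # internal to e1
    ("b1", "b2", 2.0),  # internal to e2
]

print(entity_movements("a1", transfers, address_to_entity))
```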
Answer specific questions about blockchain activity:
- "How many unique miners were active before 2011?"
- "What's the largest entity by address count?"
- "Which entities show mixing behavior?"
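As a sketch of how the first two questions could be answered once clustering is done, the example below works on an address-to-entity mapping and a list of coinbase outputs with timestamps. Both data shapes, the function names, and the toy values are assumptions, and "miner" here means an entity that received at least one coinbase output.

```python
from collections import Counter
from datetime import datetime


def largest_entity(address_to_entity):
    """Return (entity, address_count) for the entity with the most addresses."""
    return Counter(address_to_entity.values()).most_common(1)[0]


def count_miner_entities(coinbase_outputs, address_to_entity, before):
    """Count distinct entities that received a coinbase output before `before`."""
    return len({
        address_to_entity[address]
        for timestamp, address in coinbase_outputs
        if timestamp < before
    })


# Toy data: m1 and m2 belong to the same miner entity.
address_to_entity = {
    "m1": "e1", "m2": "e1",
    "m3": "e2", "x1": "e2", "x2": "e2",
    "m4": "e3",
}
coinbase_outputs = [
    (datetime(2009, 1, 9), "m1"),
    (datetime(2010, 6, 1), "m2"),
    (datetime(2010, 7, 1), "m3"),
    (datetime(2011, 1, 15), "m4"),
]

print(largest_entity(address_to_entity))
print(count_miner_entities(coinbase_outputs, address_to_entity, datetime(2011, 1, 1)))
```

Counting entities rather than addresses matters here: three coinbase addresses were active before 2011 in the toy data, but they collapse into two miner entities.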
- Academic research on Bitcoin privacy
- Compliance and AML investigations
- Cryptocurrency forensics
- Network analysis studies
- Blockchain Data: Blockchain.info API
- Block Range: Genesis block (0) to block 115,000
- Time Period: January 2009 to February 2011
- Transactions: ~400,000 transactions analyzed
- Memory Requirements: 50GB+ RAM recommended for full dataset
- Processing Time: Several hours for complete clustering
- Storage: ~10GB for preprocessed datasets
- Scalability: Designed for distributed processing with Spark
- Privacy Techniques: Advanced privacy methods (CoinJoin, mixers) can reduce clustering effectiveness
- False Positives: Heuristics may incorrectly group unrelated addresses
- Temporal Scope: Analysis limited to early Bitcoin history (2009-2011)
- Data Availability: Depends on external API availability
Contributions are welcome! Please feel free to submit pull requests, report bugs, or suggest new features.
- Follow PEP 8 style guidelines
- Add comprehensive docstrings
- Include unit tests for new heuristics
- Update documentation for new features
- Extended Block Range: Support for more recent blockchain data
- Advanced Clustering: Integration of machine learning approaches
- Real-time Analysis: Live blockchain monitoring capabilities
- Privacy Metrics: Quantitative privacy assessment tools
- Satoshi Nakamoto's Bitcoin Whitepaper
- "A Fistful of Bitcoins" - Meiklejohn et al.
- "An Analysis of Anonymity in the Bitcoin System" - Reid & Harrigan
- Blockchain.info API Documentation
This project is licensed under the MIT License. See the LICENSE file for details.
- Bitcoin Core developers
- Apache Spark community
- Blockchain.info for API access
- Academic research community for heuristic development
Note: This tool is intended for research and educational purposes. Users should comply with applicable laws and regulations when analyzing blockchain data.