A modular and reproducible pipeline for processing Whole Slide Images (WSIs), integrating HistoGPT with DVC for efficient data versioning and experiment tracking.
This repository provides a structured pipeline for WSI data preparation and embedding extraction using HistoGPT. It emphasizes:
- 📦 Reproducibility via DVC
- ⚙️ Custom configuration overrides for HistoGPT
- 🧪 Notebook-based experimentation
- 🗂️ Clear separation of data, code, and configs
wsi_data_pipeline_dev/
├── assets/                   # Visualization outputs
├── config/                   # YAML-based configuration files
├── data/                     # Raw/processed data (DVC-tracked)
├── histogpt_install_setup/   # Customized HistoGPT overrides
├── notebooks/                # Jupyter notebooks for experiments
├── src/                      # Core scripts (e.g., preprocessing)
├── .dvc/                     # DVC metadata
├── .gitignore
├── dvc.yaml                  # DVC pipeline definition
├── dvc.lock                  # DVC version lock file
├── requirements.txt          # Python dependencies
└── README.md
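
After cloning, a quick sanity check of the layout can catch an incomplete checkout early. The sketch below is not part of the repository: the function name `missing_entries` and the choice of which entries to check are assumptions, taken from the tree listed above.

```python
from pathlib import Path

# Entries copied from the layout listed above (an assumption about what
# must exist; adjust the list if the repository changes).
EXPECTED_ENTRIES = [
    "config",
    "data",
    "histogpt_install_setup",
    "notebooks",
    "src",
    "dvc.yaml",
    "requirements.txt",
]


def missing_entries(repo_root):
    """Return the expected top-level entries that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries(".")` from the repository root should return an empty list when every listed directory and file is present.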
git clone https://github.com/xinghao302001/wsi_data_pipeline_dev.git
cd wsi_data_pipeline_dev
pip install -r requirements.txt
git clone https://github.com/marrlab/HistoGPT.git
cp -r histogpt_install_setup/* HistoGPT/ # Overwrite configs
dvc init # Only if .dvc/ is not already present in the clone
dvc pull # If using remote storage
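
The override step (`cp -r histogpt_install_setup/* HistoGPT/`) can also be done from Python, which is convenient inside a notebook or on systems without `cp`. This is a minimal sketch, not a script shipped with the repository: `apply_overrides` is a hypothetical helper name. Unlike the shell glob, it also copies top-level dotfiles.

```python
import shutil
from pathlib import Path


def apply_overrides(overrides_dir, target_dir):
    """Copy everything under overrides_dir into target_dir.

    Files with the same relative path are overwritten; other files in
    target_dir are left untouched. Returns the relative paths copied.
    """
    overrides = Path(overrides_dir)
    target = Path(target_dir)
    if not overrides.is_dir():
        raise FileNotFoundError(f"Overrides directory not found: {overrides}")
    if not target.is_dir():
        raise FileNotFoundError(f"Target directory not found: {target}")
    # dirs_exist_ok=True (Python 3.8+) merges into the existing tree
    # instead of failing because target_dir already exists.
    shutil.copytree(overrides, target, dirs_exist_ok=True)
    return sorted(
        p.relative_to(overrides).as_posix()
        for p in overrides.rglob("*")
        if p.is_file()
    )
```

From the repository root, after cloning HistoGPT, this would be called as `apply_overrides("histogpt_install_setup", "HistoGPT")`. The returned list shows which files were replaced or added.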
Explore notebooks/ for examples on:
- Extracting embeddings using HistoGPT
- Visualising patch results
- Testing pipeline steps with sample slides
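
When testing pipeline steps on a sample slide, it helps to know how many patches a given tiling will produce before running anything expensive. The sketch below is a generic illustration only: `patch_grid`, its parameters, and the example sizes are assumptions, and it does not reproduce the tiling that HistoGPT or this repository actually uses.

```python
def patch_grid(slide_width, slide_height, patch_size, stride=None):
    """Return (x, y) top-left corners of all full patches inside the slide.

    Patches that would extend past the right or bottom edge are skipped.
    With stride equal to patch_size (the default) patches do not overlap.
    """
    if stride is None:
        stride = patch_size
    if patch_size <= 0 or stride <= 0:
        raise ValueError("patch_size and stride must be positive")
    xs = range(0, slide_width - patch_size + 1, stride)
    ys = range(0, slide_height - patch_size + 1, stride)
    # Row-major order: left to right, then top to bottom.
    return [(x, y) for y in ys for x in xs]
```

For a hypothetical 1000 x 600 pixel region and 256-pixel patches, `len(patch_grid(1000, 600, 256))` gives 6 patches (3 columns by 2 rows). A region smaller than one patch yields an empty list.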
Issues and PRs are welcome. If you have improvements or questions, feel free to contribute!