HorizonML: A Hybrid Model Parallelism Framework for Distributed Training on Edge Devices

HorizonML enables efficient training of machine learning models across heterogeneous edge devices using distributed model parallelism, optimizing computation, communication, and resource allocation.
This repository contains implementations of two distributed training approaches:
- Data Parallelism: Distributes data across multiple workers, with each worker having a complete copy of the model.
- Model Parallelism: Splits the model across multiple workers, with each worker processing the same data but different parts of the model.
Both implementations use PyTorch's distributed communication primitives and are designed to work on CPU devices.
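Before any collective communication can happen, each worker joins a process group over the "gloo" backend. The sketch below is illustrative only (the function names and the localhost rendezvous are assumptions, not the actual helpers in the scripts):

```python
import os
import torch.distributed as dist

def init_worker(rank: int, world_size: int, port: str = "29500"):
    # Rendezvous over localhost; the "gloo" backend works on CPU-only machines.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = port
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

def shutdown_worker():
    # Release the process group once training is finished.
    dist.destroy_process_group()
```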
- `data_parallel_train.py`: Implementation of data parallel training
- `layer_model_parallel_train.py`: Implementation of model parallel training (layer-wise)
- `main.py`: Benchmarking script to compare both approaches
- `analyze_results.py`: Additional analysis tools for the benchmark results
- Python 3.7+
- PyTorch 1.8+
- torchvision
- pandas
- matplotlib
- seaborn
- psutil
To run data parallel training independently, use the `data_parallel_train.py` script with the following parameters:
- `--world_size`: Number of processes/workers to use (default: 5)
- `--epochs`: Number of training epochs (default: 5)
- `--sample_size`: Number of samples to use from CIFAR-10 (default: 1000)

Results will be saved in the `data_parallel_logs` directory.
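For example, a short smoke-test run on a small subset (the values below are arbitrary, not the defaults) could be launched as:

python data_parallel_train.py --world_size 3 --epochs 2 --sample_size 500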
To run model parallel training independently, use the `layer_model_parallel_train.py` script with the following parameters:
- `--world_size`: Number of processes/workers to use (default: 5)
- `--epochs`: Number of training epochs (default: 5)
- `--sample_size`: Number of samples to use from CIFAR-10 (default: 1000)

Results will be saved in the `model_parallel_logs` directory.
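For example, to split the model across 4 workers for a short 2-epoch run (arbitrary values, not the defaults):

python layer_model_parallel_train.py --world_size 4 --epochs 2 --sample_size 500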
To run benchmarks comparing both approaches, use the `main.py` script with these parameters:
- `--sample_sizes`: List of sample sizes to benchmark (default: 1000 10000 50000)
- `--world_size`: Number of processes/workers to use (default: 5)
- `--epochs`: Number of training epochs (default: 5)
- `--output_dir`: Directory to save benchmark results (default: benchmark_results)

This will run both training approaches with the specified sample sizes and generate comparison graphs in the `benchmark_results` directory.
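For instance, a reduced benchmark run (values chosen arbitrarily here) could look like:

python main.py --sample_sizes 500 1000 --world_size 3 --epochs 2 --output_dir quick_benchmark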
The benchmarking compares the following metrics:
- Accuracy: Model accuracy on the training data
- Loss: Training loss
- Training Time: Time per epoch
- Computation vs. Communication Time: Breakdown of time spent on computation versus communication
- CPU Utilization: Average CPU usage
- Memory Usage: Average memory consumption
- Worker Idle Time: Time workers spend waiting
- Gradient Divergence: For data parallel training
- Bandwidth Usage: For model parallel training
- Overall Performance: Radar chart comparing all metrics
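As a rough illustration of how the resource metrics can be sampled, `psutil` (listed in the requirements) reports CPU utilization and per-process memory. The helper below is a hypothetical sketch, not the repository's actual logging code:

```python
import psutil

def sample_resource_usage(interval: float = 1.0) -> dict:
    # Hypothetical helper: samples system-wide CPU utilization over `interval`
    # seconds and the current process's resident memory in megabytes.
    cpu_percent = psutil.cpu_percent(interval=interval)
    mem_mb = psutil.Process().memory_info().rss / (1024 * 1024)
    return {"cpu_percent": cpu_percent, "memory_mb": mem_mb}
```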
Data parallel implementation:
- Uses PyTorch's DistributedDataParallel (DDP)
- Each worker has a complete copy of the model
- Data is split among workers using DistributedSampler
- Gradients are synchronized across workers during backpropagation
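A minimal sketch of this data-parallel setup, assuming the process group has already been initialized and omitting CIFAR-10 loading, argument parsing, and logging (all names below are illustrative, not the script's actual functions):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision.models import resnet18

def train_data_parallel(rank, world_size, dataset, epochs=5):
    # Every worker holds a full copy of the model; DDP averages gradients
    # across workers during the backward pass.
    model = DDP(resnet18(num_classes=10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()   # gradient all-reduce happens here
            optimizer.step()
```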
Model parallel implementation:
- Custom implementation that splits ResNet18 into segments
- Each worker processes a different segment of the model
- Data flows through the model segments in a pipeline fashion
- Only the last worker computes the loss and performs backpropagation
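A simplified sketch of the layer-wise forward hand-off between neighbouring workers is shown below; activation shapes are assumed to be known, and the backward pass is omitted (the function and argument names are illustrative, not the script's actual API):

```python
import torch
import torch.distributed as dist

def forward_segment(segment, rank, world_size, batch, activation_shape):
    # The first worker feeds the raw batch; later workers receive activations
    # from the previous segment over the gloo backend.
    if rank == 0:
        x = batch
    else:
        x = torch.empty(activation_shape)
        dist.recv(tensor=x, src=rank - 1)

    out = segment(x)

    # Intermediate workers pass their activations to the next segment;
    # only the last worker keeps the output to compute the loss.
    if rank < world_size - 1:
        dist.send(tensor=out, dst=rank + 1)
        return None
    return out
```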
- Both implementations are designed for CPU training
- The code uses the "gloo" backend for distributed communication
- For larger sample sizes, expect longer training times
- The model used is ResNet18 from torchvision
- If you encounter port conflicts, the code will automatically find a free port
- If processes don't terminate properly, you might need to manually kill them
- For any "address already in use" errors, wait a few moments before retrying
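The port-conflict note above refers to picking an unused TCP port for the rendezvous. A common way to do this (a sketch, not necessarily the exact code used in the scripts) is to bind to port 0 and let the OS choose:

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any unused port; read it back
    # and release the socket so the training processes can reuse it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```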
Example commands:
python data_parallel_train.py --world_size 5 --epochs 5 --sample_size 1000
python layer_model_parallel_train.py --world_size 5 --epochs 5 --sample_size 1000
python tensor_parallel_train.py --world_size 5 --epochs 5 --sample_size 1000
python main.py --sample_sizes 1000 10000 50000 --world_size 5 --epochs 5 --output_dir benchmark_results