
Multi-GPU Pretraining is slower than single GPU training? #220


Description

@leannmlindsey

Hello, I was able to pretrain a full model on a single A100 GPU with this config file:

composer main.py yamls/main/flex-bert-rope-base.yaml

This took about 4 days for 1 epoch.

When I try to run the same command on 4 A100 GPUs, it seems to hang indefinitely and never makes progress. I also checked the GPU load with nvidia-smi and there is no activity on any of the GPUs (a minimal distributed sanity check is sketched after the nvidia-smi output below).

(base) [lindseylm@cn0096 ~]$ nvidia-smi
Mon Mar 31 13:27:33 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 30C P0 66W / 400W | 3894MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:46:00.0 Off | 0 |
| N/A 31C P0 68W / 400W | 2700MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:85:00.0 Off | 0 |
| N/A 32C P0 70W / 400W | 2700MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:C7:00.0 Off | 0 |
| N/A 32C P0 68W / 400W | 2628MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1081346 C ...eylm/conda/envs/bert24_2/bin/python 2618MiB |
| 0 N/A N/A 1081347 C ...eylm/conda/envs/bert24_2/bin/python 416MiB |
| 0 N/A N/A 1081348 C ...eylm/conda/envs/bert24_2/bin/python 416MiB |
| 0 N/A N/A 1081349 C ...eylm/conda/envs/bert24_2/bin/python 416MiB |
| 1 N/A N/A 1081347 C ...eylm/conda/envs/bert24_2/bin/python 2690MiB |
| 2 N/A N/A 1081348 C ...eylm/conda/envs/bert24_2/bin/python 2690MiB |
| 3 N/A N/A 1081349 C ...eylm/conda/envs/bert24_2/bin/python 2618MiB |
+---------------------------------------------------------------------------------------+
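
For reference, this is the kind of minimal standalone check I am planning to run to see whether the NCCL process group even initializes outside of Composer. The file name and the torchrun launch line are just my own sketch, not something from this repo:

# nccl_check.py -- a minimal distributed sanity check (my own sketch, not part of this repo).
# Launch with something like:  torchrun --nproc_per_node=4 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all_reduce across all ranks; if this hangs the way the 4-GPU run
    # does, the problem is in NCCL / interconnect setup rather than in the
    # training code itself.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce gave {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launching with NCCL_DEBUG=INFO set should also show whether the ranks ever complete the NCCL handshake.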

This is the exact same YAML file that runs to completion with no issues on 1 A100 GPU.

I also tried a config file from the pretrain_documentation branch

composer main.py yaml/pretrain/modernbert-base-pretrain.yaml

logfile.txt

This does seem to run properly with no errors, and I can see that it is utilizing all of the GPUs, but it makes very little forward progress and the rate is extremely slow compared to the single-GPU training (a rough comparison of what the two logs imply follows the multi-GPU log below).

Single GPU training
Train time/batch: 3
Train time/sample: 12288
Train time/batch_in_epoch: 3
Train time/sample_in_epoch: 12288
Train time/token: 6291456
Train time/token_in_epoch: 6291456
Train trainer/device_train_microbatch_size: 128
Train loss/train/total: 8.4557
Train metrics/train/LanguageCrossEntropy: 8.4557
Train metrics/train/MaskedAccuracy: 0.0003
Train time/train: 0.0071
Train time/val: 0.0000
Train time/total: 0.0071
Train lr-DecoupledAdamW/group0: 0.0000
Train lr-DecoupledAdamW/group1: 0.0000

Multi-GPU training
Train trainer/packing_efficiency: 0.9994
Train time/batch: 299
Train time/sample: 593403
Train time/batch_in_epoch: 299
Train time/sample_in_epoch: 593403
Train time/token: 117504980
Train time/token_in_epoch: 117504980
Train trainer/device_train_microbatch_size: 96
Train loss/train/total: 6.9307
Train throughput/batches_per_sec: 0.4220
Train throughput/samples_per_sec: 837.5386
Train throughput/device/batches_per_sec: 0.1055
Train throughput/device/samples_per_sec: 209.3846
Train throughput/tokens_per_sec: 165843.3564
Train throughput/device/tokens_per_sec: 41460.8391
Train time/train: 0.2375
Train time/val: 0.0000
Train time/total: 0.2375
Train lr-StableAdamW/group0: 0.0000
Train lr-StableAdamW/group1: 0.0000
Train gradient_norms/l1_norm: 2351.9883
Train gradient_norms/l2_norm: 0.7196
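
To make the comparison a bit more concrete, here is the back-of-the-envelope arithmetic I did from the counters in the two logs above (my own numbers, so treat them as rough):

# Rough numbers implied by the cumulative counters in the two logs above.

# Single-GPU run (flex-bert-rope-base.yaml), at batch 3:
print(12288 / 3)           # 4096 samples per batch
print(6291456 / 3)         # 2097152 tokens per batch
print(6291456 / 12288)     # 512 tokens per sample

# Multi-GPU run (modernbert-base-pretrain.yaml), at batch 299:
print(593403 / 299)        # ~1984 samples per batch
print(117504980 / 299)     # ~393000 tokens per batch
print(117504980 / 593403)  # ~198 tokens per sample (the log reports packing_efficiency, so packing is on)

# The reported per-device throughput adds up to the total,
# so all four GPUs are doing work:
print(4 * 41460.8391)      # ~165843 tokens/sec, matching throughput/tokens_per_sec

If I am reading this correctly, the two runs use quite different batch compositions, so the rates are not directly comparable, but the original flex-bert YAML hanging on 4 GPUs still looks like a separate problem.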

I would really like to be able to use multiple GPUs for training, as that is the main reason I chose your model. I am running this on Biowulf at the NIH.
