
Multi-GPU Pretraining is slower than single GPU training? #220


Description

@leannmlindsey

Hello, I was able to pretrain a full model on a single A100 GPU with this config file:

composer main.py yamls/main/flex-bert-rope-base.yaml

This took about 4 days for 1 epoch.

When I try to run the same command on 4 A100 GPUs, it seems to hang indefinitely and never makes progress. I also checked the GPU load with nvidia-smi and there is no activity on any of the GPUs (a minimal distributed sanity check is sketched after the nvidia-smi output below).

(base) [lindseylm@cn0096 ~]$ nvidia-smi
Mon Mar 31 13:27:33 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 30C P0 66W / 400W | 3894MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:46:00.0 Off | 0 |
| N/A 31C P0 68W / 400W | 2700MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:85:00.0 Off | 0 |
| N/A 32C P0 70W / 400W | 2700MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:C7:00.0 Off | 0 |
| N/A 32C P0 68W / 400W | 2628MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1081346 C ...eylm/conda/envs/bert24_2/bin/python 2618MiB |
| 0 N/A N/A 1081347 C ...eylm/conda/envs/bert24_2/bin/python 416MiB |
| 0 N/A N/A 1081348 C ...eylm/conda/envs/bert24_2/bin/python 416MiB |
| 0 N/A N/A 1081349 C ...eylm/conda/envs/bert24_2/bin/python 416MiB |
| 1 N/A N/A 1081347 C ...eylm/conda/envs/bert24_2/bin/python 2690MiB |
| 2 N/A N/A 1081348 C ...eylm/conda/envs/bert24_2/bin/python 2690MiB |
| 3 N/A N/A 1081349 C ...eylm/conda/envs/bert24_2/bin/python 2618MiB |
+---------------------------------------------------------------------------------------+
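
For reference, this is the kind of minimal standalone check I am planning to run to see whether the NCCL process group even initializes outside of Composer. The file name and the torchrun launch line are just my own sketch, not something from this repo:

# nccl_check.py -- a minimal distributed sanity check (my own sketch, not part of this repo).
# Launch with something like:  torchrun --nproc_per_node=4 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all_reduce across all ranks; if this hangs the way the 4-GPU run
    # does, the problem is in NCCL / interconnect setup rather than in the
    # training code itself.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce gave {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launching with NCCL_DEBUG=INFO set should also show whether the ranks ever complete the NCCL handshake.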

This is the exact same YAML file that runs to completion with no issues on 1 A100 GPU.

I also tried a config file from the pretrain_documentation branch

composer main.py yaml/pretrain/modernbert-base-pretrain.yaml

logfile.txt

This does seem to run properly with no errors, and I can see that it is utilizing all of the GPUs, but it makes very little forward progress and the rate is extremely slow compared to the single-GPU training (a rough comparison of what the two logs imply follows the multi-GPU log below).

Single GPU training
Train time/batch: 3
Train time/sample: 12288
Train time/batch_in_epoch: 3
Train time/sample_in_epoch: 12288
Train time/token: 6291456
Train time/token_in_epoch: 6291456
Train trainer/device_train_microbatch_size: 128
Train loss/train/total: 8.4557
Train metrics/train/LanguageCrossEntropy: 8.4557
Train metrics/train/MaskedAccuracy: 0.0003
Train time/train: 0.0071
Train time/val: 0.0000
Train time/total: 0.0071
Train lr-DecoupledAdamW/group0: 0.0000
Train lr-DecoupledAdamW/group1: 0.0000

Multi-GPU training
Train trainer/packing_efficiency: 0.9994
Train time/batch: 299
Train time/sample: 593403
Train time/batch_in_epoch: 299
Train time/sample_in_epoch: 593403
Train time/token: 117504980
Train time/token_in_epoch: 117504980
Train trainer/device_train_microbatch_size: 96
Train loss/train/total: 6.9307
Train throughput/batches_per_sec: 0.4220
Train throughput/samples_per_sec: 837.5386
Train throughput/device/batches_per_sec: 0.1055
Train throughput/device/samples_per_sec: 209.3846
Train throughput/tokens_per_sec: 165843.3564
Train throughput/device/tokens_per_sec: 41460.8391
Train time/train: 0.2375
Train time/val: 0.0000
Train time/total: 0.2375
Train lr-StableAdamW/group0: 0.0000
Train lr-StableAdamW/group1: 0.0000
Train gradient_norms/l1_norm: 2351.9883
Train gradient_norms/l2_norm: 0.7196
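
To make the comparison a bit more concrete, here is the back-of-the-envelope arithmetic I did from the counters in the two logs above (my own numbers, so treat them as rough):

# Rough numbers implied by the cumulative counters in the two logs above.

# Single-GPU run (flex-bert-rope-base.yaml), at batch 3:
print(12288 / 3)           # 4096 samples per batch
print(6291456 / 3)         # 2097152 tokens per batch
print(6291456 / 12288)     # 512 tokens per sample

# Multi-GPU run (modernbert-base-pretrain.yaml), at batch 299:
print(593403 / 299)        # ~1984 samples per batch
print(117504980 / 299)     # ~393000 tokens per batch
print(117504980 / 593403)  # ~198 tokens per sample (the log reports packing_efficiency, so packing is on)

# The reported per-device throughput adds up to the total,
# so all four GPUs are doing work:
print(4 * 41460.8391)      # ~165843 tokens/sec, matching throughput/tokens_per_sec

If I am reading this correctly, the two runs use quite different batch compositions, so the rates are not directly comparable, but the original flex-bert YAML hanging on 4 GPUs still looks like a separate problem.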

I would really like to be able to use multiple GPUs for training, as that is the main reason I chose your model. I am running this on Biowulf at the NIH.
