DeePMD Training in Parallel on LSF system #1571

rajnichahal · 2022-03-13T18:41:58Z

Hello,
I was trying to run DeePMD-kit training in parallel using Horovod on an LSF system:
singularity exec /share/pkg/deePMD-kit/2.0.3/deepmd-kit_2.0.3_gpu_horovod.simg horovodrun -np 4 dp train --mpi-log=workers input.json
I followed the instructions at the link below to run the training job in parallel:
https://github.com/deepmodeling/deepmd-kit/blob/master/doc/train/parallel-training.md
However, the job did not run in parallel. When I contacted the HPC staff, they reported that this is a bug in Horovod on LSF systems (horovod/horovod#3166).

The HPC staff suggested an alternative that uses mpirun instead:
singularity exec /share/pkg/deePMD-kit/2.0.3/deepmd-kit_2.0.3_gpu_horovod.simg mpirun -l -launcher=lsf -np 4 dp train --mpi-log=workers input.json
With this command, I get the following error:
[0] [proxy:0:0@gpu14] [0] HYDU_create_process (utils/launch/launch.c:74): [0] execvp error on file train (No such file or directory)
[1] [proxy:0:0@gpu14] [1] HYDU_create_process (utils/launch/launch.c:74): [1] execvp error on file train (No such file or directory)
/bin/sh: /usr/local/bin/hydra_pmi_proxy: No such file or directory

The HPC staff recommended taking this issue to the DeePMD developers. Please let me know if you can help resolve it. Thanks!

wanghan-iapcm · 2022-03-14T01:52:54Z

Maintainer

@shishaochen Could you please take a look? Thanks!

0 replies

njzjz · 2022-03-18T06:22:32Z

Maintainer

It looks like the command was parsed incorrectly. Try wrapping the training command in sh -c so that it is passed as a single string:
singularity exec /share/pkg/deePMD-kit/2.0.3/deepmd-kit_2.0.3_gpu_horovod.simg mpirun -l -launcher=lsf -np 4 sh -c "dp train --mpi-log=workers input.json"
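
As a rough illustration of what the sh -c wrapper changes, the sketch below uses a hypothetical stand-in function, count_args, in place of a real launcher. It only reports how many arguments it received and what the first one is; it does not reproduce how mpirun actually parses its options, which is an assumption left untested here.

```shell
# count_args is a hypothetical stand-in for a launcher such as mpirun.
# It only reports the number of arguments and the first argument it sees.
count_args() {
    echo "argc=$# first=$1"
}

# Unwrapped: the launcher receives each token of the training command separately,
# so it has a chance to interpret any of them as one of its own options.
unwrapped=$(count_args dp train --mpi-log=workers input.json)

# Wrapped: the launcher receives `sh`, `-c`, and one quoted command string.
# The tokens of the training command are only split later, by `sh` itself.
wrapped=$(count_args sh -c "dp train --mpi-log=workers input.json")

echo "$unwrapped"
echo "$wrapped"
```

In the wrapped form the program to launch is unambiguously sh, and the dp arguments never reach the launcher's own option parser as separate tokens.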

1 reply

rajnichahal Mar 19, 2022
Author

Thanks for your comments! I used the following command this time:
singularity exec /share/pkg/deePMD-kit/2.0.3/deepmd-kit_2.0.3_gpu_horovod.simg mpirun -l -launcher=lsf -np 4 sh -c "dp train --mpi-log=workers input.json"

I now get the following error message:
/bin/sh: /usr/local/bin/hydra_pmi_proxy: No such file or directory
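
The message says that the shell could not find a file at the named path. One way to narrow this down, sketched here with a hypothetical helper check_path, is to test whether a given path exists and is executable in the environment where the command runs. Running the same check on the host and again inside the container (through singularity exec) would show which side is missing the file; that comparison is a suggestion, not something confirmed in this thread.

```shell
# check_path is a hypothetical helper: it reports whether a path
# exists and is executable in the current environment.
check_path() {
    if [ -x "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# The path below is the one named in the error message above.
check_path /usr/local/bin/hydra_pmi_proxy

# A path that normally exists, shown for comparison.
check_path /bin/sh
```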
