DeePMD Training in Parallel on LSF system #1571
rajnichahal
started this conversation in
General
-
@shishaochen Could you please take a look? Thanks!
-
It looks like the command was parsed incorrectly. Try
-
Hello,
I am trying to run DeePMD-kit training in parallel using Horovod on an LSF system:
singularity exec /share/pkg/deePMD-kit/2.0.3/deepmd-kit_2.0.3_gpu_horovod.simg horovodrun -np 4 dp train --mpi-log=workers input.json
I followed the instructions at the link below to run the training job in parallel:
https://github.com/deepmodeling/deepmd-kit/blob/master/doc/train/parallel-training.md
However, the job did not run in parallel. When I contacted the HPC staff, they reported that this is a bug in Horovod on LSF systems (horovod/horovod#3166).
The HPC staff suggested an alternative that uses mpirun, as follows:
singularity exec /share/pkg/deePMD-kit/2.0.3/deepmd-kit_2.0.3_gpu_horovod.simg mpirun -l -launcher=lsf -np 4 dp train --mpi-log=workers input.json
With this command, I get the following error:
[0] [proxy:0:0@gpu14] [0] HYDU_create_process (utils/launch/launch.c:74): [0] execvp error on file train (No such file or directory)
[1] [proxy:0:0@gpu14] [1] HYDU_create_process (utils/launch/launch.c:74): [1] execvp error on file train (No such file or directory)
/bin/sh: /usr/local/bin/hydra_pmi_proxy: No such file or directory
The HPC staff recommended taking this issue to the DeePMD-kit developers. Please let me know if you can help resolve it. Thanks!
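The log above shows two distinct failures: mpirun tried to execute `train` rather than `dp`, and `/usr/local/bin/hydra_pmi_proxy` could not be found. One way to narrow this down is to check which of the relevant executables resolve inside the container and which resolve on the host. The following is only a diagnostic sketch, not a fix: `check_tools` is a hypothetical helper name, and it assumes a POSIX shell is available both on the host and inside the image.

```shell
#!/bin/sh
# Hypothetical helper (not part of DeePMD-kit, Horovod, or MPICH):
# report whether each named executable resolves on PATH.
#   $1      optional wrapper command, e.g. "singularity exec IMAGE";
#           pass "" to check the host environment directly
#   $2...   executable names to look up
check_tools() {
  wrapper="$1"
  shift
  for tool in "$@"; do
    # $wrapper is intentionally unquoted so that it splits into
    # separate words; this assumes the image path contains no spaces.
    if $wrapper sh -c 'command -v "$1"' _ "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: MISSING"
    fi
  done
}

# Example usage with the image path from this post (uncomment to run):
# IMG=/share/pkg/deePMD-kit/2.0.3/deepmd-kit_2.0.3_gpu_horovod.simg
# check_tools "singularity exec $IMG" dp mpirun hydra_pmi_proxy
# check_tools "" mpirun hydra_pmi_proxy
```

If `hydra_pmi_proxy` is found inside the image but missing on the host, that would be consistent with the proxy being started outside the container, although the log alone does not confirm this.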