Training fails with more than 1 GPU and/or more than 1 epoch #39

Description

@hpckurt

I get `ValueError: ctypes objects containing pointers cannot be pickled` when trying to train with more than a single GPU:

```
(env) kstine@n02:/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch$ python train.py --threads 4 --num-workers 4 --gpus 2 --max_epochs 100 ../janggi-d8-1b.bin ../janggi-d8-1b.bin
Feature set: HalfKAv2^
Num real features: 6336
Num virtual features: 768
Num features: 7104
Training with ../janggi-d8-1b.bin validating with ../janggi-d8-1b.bin
[rank: 0] Global seed set to 42
Seed 42
Using batch size 16384
Smart fen skipping: True
Random fen skipping: 3
limiting torch to 4 threads.
Using log dir logs/
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:478: LightningDeprecationWarning: Setting `Trainer(gpus=2)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=2)` instead.
  rank_zero_deprecation(
/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python train.py --threads 4 --num-workers 4 --gpus 2 --max_ ...
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Using c++ data loader
Traceback (most recent call last):
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/train.py", line 96, in <module>
    main()
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/train.py", line 93, in main
    trainer.fit(nnue, train, val)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 280, in start_processes
    idx, process, tf_name = start_process(i)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 275, in start_process
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled
[W605 15:04:04.904764584 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

Using only a single GPU works fine.
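For what it's worth, the error itself reproduces with nothing but the standard library. Here is a minimal sketch of what I assume is the failure mode (my guess, not verified against the loader code): the traceback shows the multiprocessing spawn launcher pickling the process object, and any object holding a ctypes pointer, like the handles the C++ data loader presumably keeps, refuses to be pickled:

```python
import ctypes
import pickle

# A bare ctypes pointer stands in for the handles the C++ data loader
# presumably keeps (an assumption; the real loader wraps a shared library).
handle = ctypes.POINTER(ctypes.c_int)()

try:
    pickle.dumps(handle)
except ValueError as e:
    # Prints: ctypes objects containing pointers cannot be pickled
    print(e)
```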
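In case it helps anyone hitting the same wall: switching the launcher so that nothing has to be pickled might sidestep this. A sketch, assuming the Lightning version from the log above; the `strategy="ddp"` value is my suggestion, not something the repo documents. The `ddp` strategy re-executes the script once per device instead of spawning subprocesses via pickle the way the default spawn-based launcher in the traceback does:

```python
import pytorch_lightning as pl

# Hypothetical trainer construction mirroring the flags from the command
# line above; "ddp" re-launches the script per GPU, so the unpicklable
# ctypes handles never have to cross a process boundary.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    max_epochs=100,
    strategy="ddp",
)
```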
