Training fails with more than 1 GPU and/or more than 1 epoch #39

Description

@hpckurt

I get `ValueError: ctypes objects containing pointers cannot be pickled` when trying to train with more than a single GPU:

```
(env) kstine@n02:/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch$ python train.py --threads 4 --num-workers 4 --gpus 2 --max_epochs 100 ../janggi-d8-1b.bin ../janggi-d8-1b.bin
Feature set: HalfKAv2^
Num real features: 6336
Num virtual features: 768
Num features: 7104
Training with ../janggi-d8-1b.bin validating with ../janggi-d8-1b.bin
[rank: 0] Global seed set to 42
Seed 42
Using batch size 16384
Smart fen skipping: True
Random fen skipping: 3
limiting torch to 4 threads.
Using log dir logs/
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:478: LightningDeprecationWarning: Setting `Trainer(gpus=2)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=2)` instead.
  rank_zero_deprecation(
/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python train.py --threads 4 --num-workers 4 --gpus 2 --max_ ...
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Using c++ data loader
Traceback (most recent call last):
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/train.py", line 96, in <module>
    main()
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/train.py", line 93, in main
    trainer.fit(nnue, train, val)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 280, in start_processes
    idx, process, tf_name = start_process(i)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 275, in start_process
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled
[W605 15:04:04.904764584 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

Using only a single GPU works fine.
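For what it's worth, the error itself reproduces with nothing but the standard library. Here is a minimal sketch of what I assume is the failure mode (my guess, not verified against the loader code): the traceback shows the multiprocessing spawn launcher pickling the process object, and any object holding a ctypes pointer, like the handles the C++ data loader presumably keeps, refuses to be pickled:

```python
import ctypes
import pickle

# A bare ctypes pointer stands in for the handles the C++ data loader
# presumably keeps (an assumption; the real loader wraps a shared library).
handle = ctypes.POINTER(ctypes.c_int)()

try:
    pickle.dumps(handle)
except ValueError as e:
    # Prints: ctypes objects containing pointers cannot be pickled
    print(e)
```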
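In case it helps anyone hitting the same wall: switching the launcher so that nothing has to be pickled might sidestep this. A sketch, assuming the Lightning version from the log above; the `strategy="ddp"` value is my suggestion, not something the repo documents. The `ddp` strategy re-executes the script once per device instead of spawning subprocesses via pickle the way the default spawn-based launcher in the traceback does:

```python
import pytorch_lightning as pl

# Hypothetical trainer construction mirroring the flags from the command
# line above; "ddp" re-launches the script per GPU, so the unpicklable
# ctypes handles never have to cross a process boundary.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    max_epochs=100,
    strategy="ddp",
)
```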
