Description
I get `ValueError: ctypes objects containing pointers cannot be pickled` when trying to train with more than one GPU:
(env) kstine@n02:/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch$ python train.py --threads 4 --num-workers 4 --gpus 2 --max_epochs 100 ../janggi-d8-1b.bin ../janggi-d8-1b.bin
Feature set: HalfKAv2^
Num real features: 6336
Num virtual features: 768
Num features: 7104
Training with ../janggi-d8-1b.bin validating with ../janggi-d8-1b.bin
[rank: 0] Global seed set to 42
Seed 42
Using batch size 16384
Smart fen skipping: True
Random fen skipping: 3
limiting torch to 4 threads.
Using log dir logs/
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:478: LightningDeprecationWarning: Setting `Trainer(gpus=2)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=2)` instead.
rank_zero_deprecation(
/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python train.py --threads 4 --num-workers 4 --gpus 2 --max_ ...
rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Using c++ data loader
Traceback (most recent call last):
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/train.py", line 96, in <module>
    main()
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/train.py", line 93, in main
    trainer.fit(nnue, train, val)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 280, in start_processes
    idx, process, tf_name = start_process(i)
  File "/scratch/m000001/kstine/nnuedata/variant-nnue-pytorch/env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 275, in start_process
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled
[W605 15:04:04.904764584 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Using only a single GPU works fine.
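For anyone trying to narrow this down: the traceback ends in `popen_spawn_posix.py`, so the multi-GPU launcher is starting workers with the `spawn` method, which pickles everything passed to the child process. A plausible (unverified) reading is that the C++ data loader objects passed to `trainer.fit` hold ctypes pointers at that moment. The sketch below is not the repository's code; `EagerStream` and `LazyStream` are hypothetical stand-ins, and the `c_void_p` value fakes a native handle. It only reproduces the same `ValueError` in isolation and shows one common workaround: drop the pointer in `__getstate__` and recreate it lazily in whichever process uses it.

```python
import ctypes
import pickle


class EagerStream:
    """Hypothetical loader wrapper that creates its native handle up front."""

    def __init__(self, filename):
        self.filename = filename
        # Stand-in for a pointer returned by a C++ library call (assumption).
        self._handle = ctypes.c_void_p(0)


class LazyStream:
    """Hypothetical wrapper that keeps only picklable state until first use."""

    def __init__(self, filename):
        self.filename = filename
        self._handle = None

    def open(self):
        # Create the native handle on demand, in the process that needs it.
        if self._handle is None:
            self._handle = ctypes.c_void_p(0)  # stand-in for the real call
        return self._handle

    def __getstate__(self):
        # Never ship the pointer to another process; the child reopens it.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state


def can_pickle(obj):
    """Return True if obj survives pickling, False on the ctypes ValueError."""
    try:
        pickle.dumps(obj)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    print("eager:", can_pickle(EagerStream("data.bin")))

    lazy = LazyStream("data.bin")
    lazy.open()  # even with a live handle, __getstate__ strips it
    print("lazy:", can_pickle(lazy))
```

If restructuring the loader is not an option, Lightning's `strategy="ddp"` starts the extra ranks by re-running the script rather than pickling the trainer arguments, so it may avoid this code path; I have not tested that with this repository.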