Fix multiprocessing pickling error for multi-GPU training #40

Copilot · 2025-08-08T21:30:33Z

Training with multiple GPUs or multiple epochs failed with "ValueError: ctypes objects containing pointers cannot be pickled" when PyTorch Lightning attempted to spawn worker processes for distributed training.

The root cause was a set of module-level ctypes objects in nnue_dataset.py. They hold C pointers, which cannot be pickled and sent to worker processes:

# Before: Module-level ctypes objects (unpicklable)
dll = ctypes.cdll.LoadLibrary(dllpath)
create_sparse_batch_stream = dll.create_sparse_batch_stream
# ... other ctypes function pointers

class TrainingDataProvider:
    def __init__(self, ...):
        self.stream = self.create_stream(...)  # Contains C pointers
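
The failure mode can be reproduced without the native library. The following is a minimal sketch, not code from this repository: `Holder` is a hypothetical class, and a plain ctypes pointer stands in for the stream handle that the real DLL would return.

```python
import ctypes
import pickle


class Holder:
    """Hypothetical object that keeps a native handle as an attribute."""

    def __init__(self):
        # Assumption: a ctypes pointer stands in for the real stream handle.
        self.handle = ctypes.pointer(ctypes.c_int(0))


try:
    # Pickling the instance also pickles its attributes, including the pointer.
    pickle.dumps(Holder())
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Spawned worker processes receive their arguments through pickle, so any object that holds such a handle hits the same error.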

Solution:

- Moved ctypes dll loading from module-level to per-instance in TrainingDataProvider
- Added custom __getstate__/__setstate__ methods to handle pickling safely by excluding ctypes objects
- Implemented lazy initialization so each worker process creates its own ctypes objects instead of inheriting serialized ones

# After: Per-instance ctypes loading (picklable)
class TrainingDataProvider:
    def _setup_dll(self):
        dll = _load_dll()  # Load fresh in each process
        self._create_stream_func, ... = _setup_dll_functions(dll)
        self.stream = self._create_stream_func(...)
    
    def __getstate__(self):
        # Exclude unpicklable ctypes objects during serialization
        state = self.__dict__.copy()
        state['stream'] = None
        return state
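
The combination of `__getstate__`/`__setstate__` and lazy initialization can be sketched in a self-contained form. This is an illustration under stated assumptions, not the code from this PR: `LazyStreamProvider`, `_ensure_stream`, and the file name are hypothetical, and a ctypes pointer again stands in for the handle created by the DLL.

```python
import ctypes
import pickle


class LazyStreamProvider:
    """Hypothetical stand-in for TrainingDataProvider."""

    def __init__(self, filename, batch_size):
        # Only plain, picklable configuration is stored eagerly.
        self.filename = filename
        self.batch_size = batch_size
        self._stream = None  # native handle, created on first use

    def _ensure_stream(self):
        # Lazy initialization: create the handle in the current process.
        if self._stream is None:
            # Assumption: a ctypes pointer stands in for the DLL call.
            self._stream = ctypes.pointer(ctypes.c_int(self.batch_size))
        return self._stream

    def __getstate__(self):
        # Drop the unpicklable handle; keep the configuration.
        state = self.__dict__.copy()
        state["_stream"] = None
        return state

    def __setstate__(self, state):
        # Restore configuration; the handle is recreated on demand.
        self.__dict__.update(state)


provider = LazyStreamProvider("data.bin", 8192)
provider._ensure_stream()  # the parent process now owns a handle

# The round trip succeeds because the handle is excluded from the state.
clone = pickle.loads(pickle.dumps(provider))
```

After the round trip, `clone` carries the same configuration but no handle. Its first call to `_ensure_stream()` creates a fresh one, which is what each worker process would do in this design.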

Testing:

- Verified multiprocessing with spawn method (PyTorch Lightning's approach) works correctly
- Confirmed backward compatibility: all existing APIs continue to work
- Validated performance: no regressions in dataset creation or pickling speed
- Tested train.py startup with multi-GPU parameters successfully

The original failing command should now work:

python train.py --threads 4 --num-workers 4 --gpus 2 --max_epochs 100 ../janggi-d8-1b.bin ../janggi-d8-1b.bin

Fixes #39.



Co-authored-by: ianfab <12969303+ianfab@users.noreply.github.com>

Initial plan

2dc5ff4

Copilot AI assigned Copilot and ianfab Aug 8, 2025

Copilot started work on behalf of ianfab August 8, 2025 21:30

Copilot AI and others added 2 commits August 8, 2025 21:38

Initial exploration and analysis of multiprocessing pickling issue

339bb02

Co-authored-by: ianfab <12969303+ianfab@users.noreply.github.com>

Fix multiprocessing pickling issue for multi-GPU training

e2bbfe2

Co-authored-by: ianfab <12969303+ianfab@users.noreply.github.com>

Copilot AI changed the title from "[WIP] Training fails with more than 1 GPU and/or more than 1 epoch" to "Fix multiprocessing pickling error for multi-GPU training" Aug 8, 2025

Copilot AI requested a review from ianfab August 8, 2025 21:47

Copilot finished work on behalf of ianfab August 8, 2025 21:47
