@Copilot Copilot AI commented Aug 8, 2025

Training with multiple GPUs or multiple epochs failed with ValueError: ctypes objects containing pointers cannot be pickled. The error was raised when PyTorch Lightning attempted to spawn worker processes for distributed training.

The root cause was module-level ctypes objects in nnue_dataset.py. They hold C pointers, which cannot be serialized for multiprocessing:

# Before: Module-level ctypes objects (unpicklable)
import ctypes

dll = ctypes.cdll.LoadLibrary(dllpath)
create_sparse_batch_stream = dll.create_sparse_batch_stream
# ... other ctypes function pointers

class TrainingDataProvider:
    def __init__(self, ...):
        self.stream = self.create_stream(...)  # Contains C pointers
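
The failure mode can be reproduced without the project's DLL. The sketch below is a minimal illustration, not code from this PR: the helper name pickling_error and the address 0x1234 are made up, and a bare ctypes.c_void_p stands in for the native stream handle.

```python
import ctypes
import pickle


def pickling_error(obj):
    """Return the message of the ValueError raised by pickling obj, or None."""
    try:
        pickle.dumps(obj)
    except ValueError as exc:
        return str(exc)
    return None


# A raw pointer value, standing in for a handle returned by a C library.
# It is never dereferenced, so the arbitrary address is harmless.
handle = ctypes.c_void_p(0x1234)

print(pickling_error(handle))                # the ctypes pickling error message
print(pickling_error({"batch_size": 8192}))  # plain Python data pickles fine
```

Any object that keeps such a handle in its attributes hits the same error once multiprocessing tries to pickle it for a worker process.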

Solution:

  • Moved ctypes dll loading from module-level to per-instance in TrainingDataProvider
  • Added custom __getstate__/__setstate__ methods to handle pickling safely by excluding ctypes objects
  • Implemented lazy initialization so each worker process creates its own ctypes objects instead of inheriting serialized ones

# After: Per-instance ctypes loading (picklable)
class TrainingDataProvider:
    def _setup_dll(self):
        dll = _load_dll()  # Load fresh in each process
        self._create_stream_func, ... = _setup_dll_functions(dll)
        self.stream = self._create_stream_func(...)
    
    def __getstate__(self):
        # Exclude unpicklable ctypes objects during serialization
        state = self.__dict__.copy()
        state['stream'] = None
        return state
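
The excerpt above shows only __getstate__. A self-contained sketch of how it could pair with __setstate__ and lazy re-creation follows. Everything in it is an assumption made for illustration: LazyProvider, _open_stream_stub and the _stream attribute are hypothetical names, and the stub returns a ctypes pointer instead of calling the real library.

```python
import ctypes
import os
import pickle


def _open_stream_stub(filename, batch_size):
    # Hypothetical stand-in for the native stream constructor. It returns a
    # ctypes pointer, which pickle refuses to serialize.
    return ctypes.pointer(ctypes.c_int(os.getpid()))


class LazyProvider:
    """Illustrative only; not the PR's TrainingDataProvider."""

    def __init__(self, filename, batch_size):
        self.filename = filename
        self.batch_size = batch_size
        self._stream = None  # created on first use, never pickled

    def stream(self):
        # Lazy initialization: build the native handle in whichever
        # process first asks for it.
        if self._stream is None:
            self._stream = _open_stream_stub(self.filename, self.batch_size)
        return self._stream

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable handle.
        state = self.__dict__.copy()
        state["_stream"] = None
        return state

    def __setstate__(self, state):
        # Restore plain attributes; the handle is rebuilt lazily later.
        self.__dict__.update(state)


provider = LazyProvider("data.bin", 8192)
provider.stream()                              # handle exists in this process
clone = pickle.loads(pickle.dumps(provider))   # round trip succeeds
print(clone._stream is None)                   # handle was not carried over
print(clone.stream() is not None)              # and is re-created on demand
```

Because __getstate__ works on a copy of the instance dict, the original object keeps its live handle while the pickled copy carries only plain configuration.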

Testing:

  • Verified multiprocessing with spawn method (PyTorch Lightning's approach) works correctly
  • Confirmed backward compatibility - all existing APIs continue to work
  • Validated performance - no regressions in dataset creation or pickling speed
  • Tested train.py startup with multi-GPU parameters successfully
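
A check along the lines of the first bullet might look like the sketch below. It is not the test that was run for this PR. It relies on the fact that multiprocessing serializes objects with ForkingPickler, so a round trip through that pickler approximates handing an object to a spawned worker without starting one. The helper names and the SimpleNamespace stand-ins are hypothetical.

```python
import ctypes
import pickle
from multiprocessing.reduction import ForkingPickler
from types import SimpleNamespace


def spawn_roundtrip(obj):
    """Serialize obj the way multiprocessing does, then restore it."""
    payload = ForkingPickler.dumps(obj)
    return pickle.loads(payload)


def is_spawn_safe(obj):
    """Return True if obj survives the round trip, False if pickling fails."""
    try:
        spawn_roundtrip(obj)
    except (ValueError, TypeError, AttributeError, pickle.PicklingError):
        return False
    return True


# Stand-ins for a provider with and without a live native handle.
safe = SimpleNamespace(filename="data.bin", stream=None)
unsafe = SimpleNamespace(filename="data.bin", stream=ctypes.c_void_p(1))

print(is_spawn_safe(safe))
print(is_spawn_safe(unsafe))
```

This exercises only the serialization step. A full check would still need to start real worker processes, as the train.py run in the last bullet does.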

The original failing command should now work:

python train.py --threads 4 --num-workers 4 --gpus 2 --max_epochs 100 ../janggi-d8-1b.bin ../janggi-d8-1b.bin

Fixes #39.


@Copilot Copilot AI assigned Copilot and ianfab Aug 8, 2025
Copilot AI and others added 2 commits August 8, 2025 21:38
Co-authored-by: ianfab <12969303+ianfab@users.noreply.github.com>
@Copilot Copilot AI changed the title [WIP] Training fails with more than 1 GPU and/or more than 1 epoch Fix multiprocessing pickling error for multi-GPU training Aug 8, 2025
@Copilot Copilot AI requested a review from ianfab August 8, 2025 21:47
Copilot finished work on behalf of ianfab August 8, 2025 21:47
Successfully merging this pull request may close these issues.

Training fails with more than 1 GPU and/or more than 1 epoch