Description
Hi, I'm running with:
- CUDA 12.0
- torch 2.6.0+cu124
- Python 3.10.14
- transformers 4.50.1
When I run:
```
python train.py \
  --model_name meta-llama/Llama-3.2-1B \
  --gradient_accumulation_steps 2 \
  --batch_size 8 \
  --context_length 512 \
  --num_epochs 1 \
  --train_type qlora \
  --use_gradient_checkpointing False \
  --use_cpu_offload False \
  --log_to wandb \
  --dataset alpaca \
  --verbose false \
  --save_model true \
  --output_dir ~/models/qlora_alpaca
```
I get:
```
  File "$HOME/anaconda3/envs/py10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
    _init_param_handle_from_module(
  File "$HOME/anaconda3/envs/py10/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 629, in _init_param_handle_from_module
    _sync_module_params_and_buffers(
  File "$HOME/anaconda3/envs/py10/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 1126, in _sync_module_params_and_buffers
    _sync_params_and_buffers(
  File "$HOME/anaconda3/envs/py10/lib/python3.10/site-packages/torch/distributed/utils.py", line 334, in _sync_params_and_buffers
    dist._broadcast_coalesced(
NotImplementedError: c10d::broadcast: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered. You may have run into this message while using an operator with PT2 compilation APIs (torch.compile/torch.export); in order to use this operator with those APIs you'll need to add a fake impl. Please see the following for next steps: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
$HOME/anaconda3/envs/py10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
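For reference, my understanding of the error (just an illustration I put together, not code from this repo): the failure happens during FSDP's `sync_module_states` broadcast shown in the traceback, and `c10d::broadcast` cannot run because some parameters are still "meta" tensors, which carry shape/dtype metadata but no actual storage:

```python
import torch

# A "meta" tensor only records shape/dtype/device metadata; it allocates no memory,
# so there is no data for a collective like c10d::broadcast to send.
meta_weight = torch.empty(4096, 4096, device="meta")
print(meta_weight.is_meta)                   # True
print(meta_weight.shape, meta_weight.dtype)  # torch.Size([4096, 4096]) torch.float32

# FSDP(..., sync_module_states=True) broadcasts rank 0's parameters to the other
# ranks via dist._broadcast_coalesced (the call in the traceback above); any
# parameter still left on the meta device at that point raises this NotImplementedError.
```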