Description
Hi, I'm running with:
- CUDA 12.0
- torch 2.6.0+cu124
- Python 3.10.14
- transformers 4.50.1
When I run:
```
python train.py \
  --model_name meta-llama/Llama-3.2-1B \
  --gradient_accumulation_steps 2 \
  --batch_size 8 \
  --context_length 512 \
  --num_epochs 1 \
  --train_type qlora \
  --use_gradient_checkpointing False \
  --use_cpu_offload False \
  --log_to wandb \
  --dataset alpaca \
  --verbose false \
  --save_model true \
  --output_dir ~/models/qlora_alpaca
```
I get:
```
  File "$HOME/anaconda3/envs/py10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
    _init_param_handle_from_module(
  File "$HOME/anaconda3/envs/py10/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 629, in _init_param_handle_from_module
    _sync_module_params_and_buffers(
  File "$HOME/anaconda3/envs/py10/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 1126, in _sync_module_params_and_buffers
    _sync_params_and_buffers(
  File "$HOME/anaconda3/envs/py10/lib/python3.10/site-packages/torch/distributed/utils.py", line 334, in _sync_params_and_buffers
    dist._broadcast_coalesced(
NotImplementedError: c10d::broadcast: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered. You may have run into this message while using an operator with PT2 compilation APIs (torch.compile/torch.export); in order to use this operator with those APIs you'll need to add a fake impl. Please see the following for next steps: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
$HOME/anaconda3/envs/py10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
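For reference, my understanding of the error (just an illustration I put together, not code from this repo): the failure happens during FSDP's `sync_module_states` broadcast shown in the traceback, and `c10d::broadcast` cannot run because some parameters are still "meta" tensors, which carry shape/dtype metadata but no actual storage:

```python
import torch

# A "meta" tensor only records shape/dtype/device metadata; it allocates no memory,
# so there is no data for a collective like c10d::broadcast to send.
meta_weight = torch.empty(4096, 4096, device="meta")
print(meta_weight.is_meta)                   # True
print(meta_weight.shape, meta_weight.dtype)  # torch.Size([4096, 4096]) torch.float32

# FSDP(..., sync_module_states=True) broadcasts rank 0's parameters to the other
# ranks via dist._broadcast_coalesced (the call in the traceback above); any
# parameter still left on the meta device at that point raises this NotImplementedError.
```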