Description
1. Current status: AWQ quantization of DeepSeek-R1 fails with an error when `clip_version: v2` is used.
2. The `awq_w_only.yml` parameters are as follows:
```yaml
base:
    seed: &seed 42
model:
    type: DeepseekV3
    path: /mnt/DeepSeek-R1
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /home/llmc/data/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: pileval_awq
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /home/llmc/data/wikitext2
    seq_len: 2048
    # For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False".
    # For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True".
    bs: 20
    inference_per_block: True
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_group
        group_size: 128
        calib_algo: learnable
    special:
        trans: True
        # The options for "trans_version" include "v1" and "v2".
        # But their results don't differ significantly.
        trans_version: v2
        weight_clip: True
        clip_version: v2
        # For 2-bit quantization, setting "clip_sym: False" will yield better results.
        clip_sym: True
        save_scale: True
        scale_path: /home/llmc/scale_data
        save_clip: True
        clip_path: /home/llmc/clip_data
save:
    save_trans: False
    save_fake: False
    save_path: /home/llmc/deepseek_quat
```
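For context, `granularity: per_group` with `group_size: 128` means each run of 128 weights along the input dimension shares one scale, so the scale tensor holds one entry per group rather than per element. A minimal sketch of that layout (a generic per-group scheme with illustrative names and shapes, not LLMC's exact code):

```python
import torch

out_features, in_features, group_size = 8, 256, 128
w = torch.randn(out_features, in_features)

# Split each row into groups of group_size and compute one symmetric
# int4 scale per group: resulting shape is (out_features, n_groups).
w_grouped = w.reshape(out_features, -1, group_size)
scales = w_grouped.abs().amax(dim=-1) / 7  # qmax = 7 for symmetric 4-bit
print(scales.shape)  # torch.Size([8, 2])
```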
3. Error traceback:
```
[rank0]: File "/home/llmc/llmc/compression/quantization/base_blockwise_quantization.py", line 453, in run
[rank0]: self.block_transform(block, input_feat, self.input['kwargs'])
[rank0]: File "/home/wangke/miniconda3/envs/llmc/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/llmc/llmc/compression/quantization/awq.py", line 294, in block_transform
[rank0]: self.auto_clipper.run(
[rank0]: File "/home/wangke/miniconda3/envs/llmc/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/llmc/llmc/compression/quantization/auto_clip.py", line 68, in run
[rank0]: max_val, min_val = self.auto_clip_layer(
[rank0]: File "/home/miniconda3/envs/llmc/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/llmc/llmc/compression/quantization/auto_clip.py", line 161, in auto_clip_layer
[rank0]: q_w = self.fake_quantize_weight(
[rank0]: File "/home/llmc/llmc/compression/quantization/auto_clip.py", line 271, in fake_quantize_weight
[rank0]: q_w = self.wquantizer.fake_quant_weight_static(w, args)
[rank0]: File "/home/llmc/llmc/compression/quantization/quant.py", line 814, in fake_quant_weight_static
[rank0]: q_weight = self.quant_dequant(
[rank0]: File "/home/llmc/llmc/compression/quantization/quant.py", line 715, in quant_dequant
[rank0]: tensor = self.quant(tensor, scales, zeros, qmax, qmin)
[rank0]: File "/home/llmc/llmc/compression/quantization/quant.py", line 701, in quant
[rank0]: tensor = torch.clamp(self.round_func(tensor / scales) + zeros, qmin, qmax)
[rank0]: RuntimeError: The size of tensor a (3584) must match the size of tensor b (56) at non-singleton dimension 2
```
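The final clamp line fails because PyTorch broadcasting requires trailing dimensions to match (or be 1), and here `tensor` has 3584 elements at dim 2 while `scales` has only 56 (one per group). A hypothetical repro with shapes copied from the error message, independent of LLMC internals:

```python
import torch

tensor = torch.randn(2, 4, 3584)      # dim 2 = 3584, as in the error
scales = torch.rand(2, 4, 56) + 1e-6  # dim 2 = 56: one scale per group

try:
    # Mirrors quant.py line 701: round, shift by zeros, then clamp.
    torch.clamp(torch.round(tensor / scales) + 0, -8, 7)
except RuntimeError as e:
    print(e)  # The size of tensor a (3584) must match the size of tensor b (56) ...
```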
4. Proposed fix:
Debugging shows this is a shape mismatch between `tensor` and `scales`. Adding `scales.reshape(-1, 1)` in the `fake_quant_weight_static` method of `quant.py` resolves it, as sketched below. Would a PR for this be welcome?
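A minimal sketch of why the reshape restores broadcasting. The shapes here are illustrative (56 groups taken from the error message), not LLMC's actual internals: reshaping the flat per-group scales to `(-1, 1)` lets each scale broadcast across all elements of its group.

```python
import torch

group_size = 64                       # illustrative group width
w = torch.randn(56, group_size)       # one row per group
scales = torch.rand(56) + 1e-6        # flat per-group scales, shape (56,)

# reshape(-1, 1) -> shape (56, 1), which broadcasts against (56, 64)
q_w = torch.clamp(torch.round(w / scales.reshape(-1, 1)), -8, 7)
print(q_w.shape)  # torch.Size([56, 64])
```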