Forked from the BitDistiller paper repo https://github.com/DD-DuDa/BitDistiller.git. Please cite the original repo if you find this work interesting.
```
@misc{du2024bitdistiller,
      title={BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation},
      author={Dayou Du and Yijia Zhang and Shijie Cao and Jiaqi Guo and Ting Cao and Xiaowen Chu and Ningyi Xu},
      year={2024},
      eprint={2402.10631},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
This is a student project that explores unanswered questions from the BitDistiller paper, such as:
- How does the approach perform on smaller models (e.g. TinyLlama 1.1B)?
- Does the approach work for 1/1.58-bit quantisation?
- How does the choice of teacher model affect performance?
The results of our experiments can be found in `results.md`. In summary, the answers to the three questions above were:
- Yes, though not as well for 1B as for 3B or 7B: the model degrades slightly more relative to its full-precision counterpart.
- No. The model performed no better than a random baseline on the same multiple-choice QA benchmarks as the original BitDistiller paper.
- Unclear. We found no statistically significant improvement or degradation at 1B, and conflicting data at 3B.
- Create a new branch and clone the repo on a cloud GPU instance.
- Run Setup
- Run Pre-Training if applicable
- Run Training
- Upload the model to Hugging Face.
- Run Eval to generate metrics.
- Delete instance!!
If you haven't already done so on your local machine, follow the steps below so that you can clone, pull, push, etc. locally.
eval "$(ssh-agent -s)" # start ssh agent, not automatic on vast
ssh-keygen -t ed25519
ssh-add; ssh-add -l
echo "public key:"
cat ~/.ssh/id_ed25519.pub
Press Enter when prompted for a file name/passphrase to use the defaults. Copy the entire public key (including `ssh-ed25519` at the start and your email at the end) and add it to GitHub under Settings > SSH and GPG keys.
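To confirm GitHub accepted the key, you can test the connection (a standard GitHub check, nothing repo-specific):
```bash
# Should greet you with your GitHub username; GitHub refusing shell access is expected.
ssh -T git@github.com
```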
Add your local SSH key to your cloud GPU platform (e.g. Lambda Labs or Vast.ai) and create an instance with CUDA version 12.4. Log in via VS Code's Remote-SSH extension using
ssh -i ~/.ssh/id_ed25519 -p port user@address # (+optional port forwarding with -L)
e.g. on Vast.ai:
ssh -i ~/.ssh/id_ed25519 -p 30077 root@185.150.27.254 -L 8080:localhost:8080
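Optionally, an entry in `~/.ssh/config` saves retyping the port and key each time. A minimal sketch, using the illustrative host alias `vast-gpu` and the example address/port above (substitute your instance's details):
```bash
# Append an SSH config entry, then connect with just `ssh vast-gpu`.
cat >> ~/.ssh/config <<'EOF'
Host vast-gpu
    HostName 185.150.27.254
    Port 30077
    User root
    IdentityFile ~/.ssh/id_ed25519
    LocalForward 8080 localhost:8080
EOF
ssh vast-gpu
```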
Repeat the steps in Generate an ssh key on your remote instance and clone the repo.
git clone git@github.com:BrownianNotion/BitDistiller.git
Run `./setup.sh` to set up the environment and install packages. Activate the venv with
source BitDistillerVenv/bin/activate
Note that on Vast.ai, your repo will be under `/workspace/BitDistiller`.
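A quick sanity check that the GPU is visible from inside the venv (assuming `setup.sh` installs PyTorch, which the training code needs):
```bash
python -c "import torch; print('torch', torch.__version__); print('CUDA available:', torch.cuda.is_available())"
```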
For all steps, change the output paths (e.g. for clipped weights and checkpoints) to match the name of your experiment.
Clips/quantises the teacher model (e.g. `TinyLlama_v1.1` below) to get the initial weights for the quantised student model. This shouldn't need to be rerun unless you are using a new teacher or quantisation method. The initial weights are stored at the path given by the `--dump_clip` argument.
cd quantization
CUDA_VISIBLE_DEVICES=0 python autoclip.py --model_path ../models/TinyLlama_v1.1 --calib_dataset pile --quant_type int --w_bit 2 --q_group_size 128 --run_clip --dump_clip ./clip_cache/TinyLlama_v1.1/int2-g128.pt
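Once it finishes, the clip cache should exist at the `--dump_clip` path. A minimal sanity check, assuming the cache is an ordinary `torch.save` object (adjust the path to your experiment):
```bash
ls -lh ./clip_cache/TinyLlama_v1.1/int2-g128.pt
python - <<'EOF'
import torch
# Load on CPU just to confirm the file is readable and see what it contains.
obj = torch.load("./clip_cache/TinyLlama_v1.1/int2-g128.pt", map_location="cpu")
print(type(obj))
if isinstance(obj, dict):
    print("keys:", list(obj)[:10])
EOF
```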
Generates the data for (distillation) training. This shouldn't need to be rerun unless you are using a new teacher. The main file we will use for training is `data/datasets/tinyllama_v1.1/mix_wiki_alpaca_8000.json`.
cd data/generation
bash generate.sh ../../models/TinyLlama_v1.1 wikitext ../datasets/tinyllama_v1.1/ 16 3000
bash generate.sh ../../models/TinyLlama_v1.1 alpaca ../datasets/tinyllama_v1.1/ 16 5000
# first edit the dataset paths inside mix_data.py to match the output directory above
python mix_data.py
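To confirm the mixed dataset was written, a quick size check (assuming `mix_data.py` writes the mixed file into the same datasets directory and that it is a top-level JSON array; adjust if the format differs):
```bash
python -c "import json; data = json.load(open('../datasets/tinyllama_v1.1/mix_wiki_alpaca_8000.json')); print(len(data), 'samples')"
```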
The model is trained by default on the dataset `mix_wiki_alpaca_8000.json`. Make sure to change `bits`, `quant_type`, the `--clip` path (initial clipped weights), and any other training parameters needed in `train.sh`. If doing a dry run, change the parameters in `train_dry_run.sh` instead.
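As a rough illustration of the kind of edit meant here (the variable names below are hypothetical; open `train.sh` and edit whatever it actually uses):
```bash
# Hypothetical excerpt -- not the literal contents of train.sh.
bits=2                                                        # quantisation bit-width
quant_type=int                                                # quantisation type
clip=../quantization/clip_cache/TinyLlama_v1.1/int2-g128.pt   # --clip: initial clipped weights
```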
- Commit all changes made by your experiment to a branch for reproducibility. This includes changes to `train.sh` and other configs (other than the dry-run config).
- Rerun clipping/data generation if needed (see Pre-Training).
- In `train/`, change `train_dry_run.sh` if needed and run it to check that your code works. This does a single step on a small dataset of 64 samples.
- (Skip if on Vast.ai) If the dry run succeeds, create a new tmux session:
tmux new -s session_name
If your SSH connection ever drops, training will keep running inside tmux. You may need to reattach your session:
tmux attach -t session_name
- Run the training command below. Once the model starts training, see Monitoring below.
cd train
bash train.sh ../data/datasets/tinyllama_v1.1/mix_wiki_alpaca_8000.json ./ckpts/tinyllama_v1.1/int2-g128/ ./ckpts/tinyllama_v1.1/int2-g128/runs/ 4
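Optionally, tee the output to a log file so it survives a closed terminal (plain shell redirection, independent of `train.sh` itself):
```bash
bash train.sh ../data/datasets/tinyllama_v1.1/mix_wiki_alpaca_8000.json ./ckpts/tinyllama_v1.1/int2-g128/ ./ckpts/tinyllama_v1.1/int2-g128/runs/ 4 2>&1 | tee train.log
```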
Run these commands in new terminals once actual training has started (i.e. you see two progress bars).
source BitDistillerVenv/bin/activate
cd train
# Nice dashboard of train/validation loss and other metrics. Eval metrics won't appear
# until an eval step has happened - this may take a while.
tensorboard --logdir=ckpts/tinyllama_v1.1/int2-g128/runs/ --port=8008
# (In new terminal)
# Shows GPU and GPU memory usage. This should be close to 100%/36.5GB for training.
nvtop
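If `nvtop` isn't available on the instance, `nvidia-smi` (standard NVIDIA tooling) shows the same utilisation and memory figures:
```bash
# Refresh GPU utilisation/memory every 2 seconds.
watch -n 2 nvidia-smi
```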
Signs your training has gone wrong (to be expanded):
- The loss curve isn't going down after a few steps
As eval takes time, begin uploading the model as soon as training has finished if the loss curves and validation metrics look good.
Log in to Hugging Face with your access token (generate one if you don't have one) with
huggingface-cli login
Check your login succeeded with
huggingface-cli whoami
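On a headless instance it can be easier to pass the token non-interactively; recent versions of `huggingface_hub` accept a `--token` flag (otherwise just paste the token at the interactive prompt):
```bash
# Assumes your access token is exported as HF_TOKEN.
huggingface-cli login --token "$HF_TOKEN"
```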
Make sure your TensorBoard logs (`events.out.tfevents.{...}`) are inside your `<model_path>` folder (Hugging Face will auto-generate a Metrics tab to display the loss curves).
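For example, from the repo root, to copy the event files from the run directory used during training into the checkpoint being uploaded (paths follow the earlier training example; adjust to your experiment):
```bash
find train/ckpts/tinyllama_v1.1/int2-g128/runs/ -name 'events.out.tfevents.*' \
     -exec cp {} train/ckpts/tinyllama_v1.1/int2-g128/checkpoint-100/ \;
```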
Run `upload_model.py`, specifying the args `<model_path>`, `<bits>`, and optionally `--quant_type`, `--extra_changes`, `--base_model`, `--overwrite`. Run `upload_model.py -h` for help on the options. For `<model_path>`, we want the best model checkpoint, which can be found in the `best_model_checkpoint` field of `trainer_state.json`.
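For example, to read that field from the repo root (the Hugging Face Trainer writes `trainer_state.json` inside each `checkpoint-*` directory; adjust the path to your run):
```bash
python -c "import json; print(json.load(open('train/ckpts/tinyllama_v1.1/int2-g128/checkpoint-100/trainer_state.json'))['best_model_checkpoint'])"
```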
This uploads the model to the Hugging Face repo `your_username/model_name`. The model name follows the convention `{base_model}_{num}bit_{quantisation method}(_{extra changes})`.
Example Usage
python upload_model.py train/ckpts/tinyllama_v1.1/int2-g128/checkpoint-100 2 --quant_type int --extra_changes ce_loss
Make sure you're logged into Hugging Face first; see Uploading the model.
To run all evals, use `generate_metrics.sh` with the model path, quant type, and bits. This generates `metrics.json` in the model path. For example:
cd test/general
bash generate_metrics.sh ../../train/ckpts/tinyllama_v1.1/int2-g128/checkpoint-100 int 2
Then run `upload_metrics.py` to automatically upload the metrics to Hugging Face, specifying the path to `metrics.json` and the Hugging Face model name without your username.
python upload_metrics.py --metrics_json ../../train/ckpts/tinyllama_v1.1/int2-g128/checkpoint-100/metrics.json --model_id 2-bit-baseline
Note: this does not run MMLU by default as it is expensive.
Our main benchmarks are perplexity (PPL), QA datasets (arc_easy, arc_challenge, winogrande, hellaswag, piqa), and MMLU. For consistency, do not change `num_fewshot`. These benchmarks can be run individually as follows:
cd test/general
# PPL
python wiki_ppl.py --model ../../train/ckpts/tinyllama_v1.1/int2-g128/checkpoint-12/ --quant_type int --bits 2 --group_size 128
# QA
CUDA_VISIBLE_DEVICES=0 python llm_eval.py --model ../../train/ckpts/tinyllama_v1.1/int2-g128/checkpoint-12/ --eval_tasks arc_easy,arc_challenge,winogrande,hellaswag,piqa --test_set --bits 2 --group_size 128 --quant_type int --num_fewshot 0
# MMLU
CUDA_VISIBLE_DEVICES=0 python llm_eval.py --model ../../train/ckpts/tinyllama_v1.1/int2-g128/checkpoint-12/ --eval_tasks hendrycksTest-* --test_set --bits 2 --group_size 128 --quant_type int --num_fewshot 5