Mac Turner, Michael Ngo, Eric Hu, Neeraj Parihar
This repo is a re-implementation of "Prefix-Tuning: Optimizing Continuous Prompts for Generation" by Xiang Lisa Li and Percy Liang [1]. The paper introduces a lightweight alternative to full fine-tuning that tunes only about 0.1% of a model's parameters while achieving comparable, and sometimes better, performance.
We re-implement prefix-tuning and full fine-tuning for GPT-2 Medium on the E2E NLG dataset [2] and show that prefix-tuning achieves performance comparable to full fine-tuning while being faster to train and more memory-efficient to store. This corresponds to the upper-left quadrant of Table 1 in [1]. We also perform an ablation study on prefix length, corresponding to Figure 4 in [1].
.
├── data
│   └── e2e_data
├── e2e-metrics                  # imported submodule from official eval scripts
│   └── measure_scores.py
├── environment.yaml
├── evals                        # stores model outputs
├── models                       # store models
├── poster
│   └── poster.pdf
├── report
│   └── prefix_tuning_2page_report.png
└── src
    ├── configs
    │   ├── hyperparameters.yaml # training
    │   └── prefix_tuning.yaml   # initialization
    ├── data_gen.py              # create small debugging datasets
    ├── data_load.py             # put data into dataloaders
    ├── output_gen.py            # test-time inference
    ├── prefix_tuner.py          # define prefix tuning wrapper
    ├── train.py                 # training loop
    └── train_small.py
We deviate from the specified structure in that our src folder takes the place of a code folder, and evals takes the place of results. Tables are produced by running the E2E evaluation script.
We import pretrained GPT-2 models from HuggingFace and wrap them in PyTorch. We follow the same training procedure as [1]: 5 epochs, prefix length 5, batch size 10, learning rate 8e-5 (5e-5 for fine-tuning), a linear learning rate schedule, and beam search with beam size 5. We evaluate GPT-2 Small and Medium on the E2E NLG dataset using the official metrics scripts, which report BLEU, NIST, METEOR, ROUGE-L, and CIDEr. See the figure below and [1] for more details.
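For reference, the core idea behind our wrapper (src/prefix_tuner.py) is the one from [1]: freeze the pretrained LM and train only per-layer prefix key/value activations that are passed to GPT-2 as past_key_values. The sketch below is illustrative rather than a copy of our code; the class and parameter names are made up for the example, and the exact cache format can differ across transformers versions.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class PrefixTuningSketch(nn.Module):
    """Freeze GPT-2 and train only a short prefix of key/value activations."""

    def __init__(self, model_name="gpt2-medium", prefix_len=5, mlp_hidden=512):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained(model_name)
        for p in self.gpt2.parameters():  # the pretrained LM stays frozen
            p.requires_grad = False
        cfg = self.gpt2.config
        self.prefix_len = prefix_len
        self.n_layer, self.n_head = cfg.n_layer, cfg.n_head
        self.head_dim = cfg.n_embd // cfg.n_head
        # Reparameterize the prefix through a small MLP for training stability, as in [1].
        self.prefix_embedding = nn.Parameter(torch.randn(prefix_len, cfg.n_embd))
        self.prefix_mlp = nn.Sequential(
            nn.Linear(cfg.n_embd, mlp_hidden),
            nn.Tanh(),
            nn.Linear(mlp_hidden, 2 * cfg.n_layer * cfg.n_embd),
        )

    def _prefix_past(self, batch_size):
        # (prefix_len, n_embd) -> (B, prefix_len, 2 * n_layer * n_embd)
        prefix = self.prefix_embedding.unsqueeze(0).expand(batch_size, -1, -1)
        past = self.prefix_mlp(prefix)
        past = past.view(batch_size, self.prefix_len, 2 * self.n_layer,
                         self.n_head, self.head_dim)
        # -> n_layer pairs of (key, value), each (B, n_head, prefix_len, head_dim)
        past = past.permute(2, 0, 3, 1, 4).split(2)
        return tuple((kv[0], kv[1]) for kv in past)

    def forward(self, input_ids, attention_mask, labels=None):
        bsz = input_ids.size(0)
        past_key_values = self._prefix_past(bsz)
        # The attention mask must also cover the prefix positions.
        prefix_mask = torch.ones(bsz, self.prefix_len,
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        attention_mask = torch.cat([prefix_mask, attention_mask], dim=1)
        return self.gpt2(input_ids=input_ids,
                         attention_mask=attention_mask,
                         labels=labels,
                         past_key_values=past_key_values)
```

Only the prefix embedding and MLP receive gradients, which is why the stored checkpoint is a small fraction of the full model.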
Install the conda environment from environment.yaml
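For example, assuming conda is available
conda env create -f environment.yaml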
To use the E2E evaluation scripts, you must have Java 1.8 or higher and Perl 5.8.8 or higher with the XML::Twig CPAN module.
First, set the hyperparameters you want to train with in src/configs/hyperparameters.yaml. To choose between prefix-tuning and full fine-tuning, set the tuner hyperparameter to "prefix" or "fine".
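As an illustration, a config along these lines would select prefix-tuning with the settings described above. Apart from tuner, the key names below are placeholders; check src/configs/hyperparameters.yaml for the exact fields.

```yaml
# Illustrative sketch only -- see src/configs/hyperparameters.yaml for the real keys
tuner: "prefix"         # "prefix" or "fine"
model: "gpt2-medium"
epochs: 5
batch_size: 10
learning_rate: 8.0e-5   # 5.0e-5 for full fine-tuning
prefix_length: 5
```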
Then, you can train the model. (If you get an error about a missing models folder, create it first.)
python src/train.py
Then, run the model on the E2E dataset to generate an output file using
python src/output_gen.py
Finally, to run the evaluation metrics on the generated output, run
./e2e-metrics/measure_scores.py src/target.txt src/model-output.txt
We trained on a single GPU with 16GB of VRAM.
Our results show that fine-tuning and prefix-tuning perform comparably, which makes our prefix-tuning slightly worse relative to fine-tuning than what is reported in [1]. Further, the BLEU score for table-to-text generation increases as the prefix length increases.
The original paper found fine-tuning to take about 1.5 times as long as prefix-tuning to train. This ratio was around 1.8 for us.
We were able to get performance comparable to fine-tuning with prefix-tuning, and it trained faster. We also learned to iterate early and often rather than getting caught up trying to figure out what is 100% right.
[1]: Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021. https://aclanthology.org/2021.acl-long.353/
[2]: Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E Dataset: New Challenges for End-to-End Generation. SIGDIAL 2017. https://aclanthology.org/W17-5525/
This re-implementation was completed as a final project for CS 4782: Introduction to Deep Learning in Spring 2025.