This is a Rust implementation of Byte Pair Encoding (BPE), an industry-standard subword tokenization algorithm used by NLP models such as GPT-2 and RoBERTa (BERT uses the closely related WordPiece algorithm). The implementation in `main.rs` builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens in a text corpus, enabling efficient text encoding and decoding.
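To make the core loop concrete, here is a rough sketch of a single merge step (illustrative only, not the crate's exact code): count every adjacent pair, then fuse all occurrences of the most frequent one.

```rust
use std::collections::HashMap;

// Illustrative sketch of one BPE merge step, not the crate's exact code:
// count adjacent token pairs, then fuse every occurrence of the most
// frequent pair into a single token.
fn merge_most_frequent(tokens: &[String]) -> Vec<String> {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
    }
    // Pick the most frequent adjacent pair (ties broken arbitrarily).
    let Some(((a, b), _)) = counts.into_iter().max_by_key(|(_, c)| *c) else {
        return tokens.to_vec();
    };
    let mut merged = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && tokens[i] == a && tokens[i + 1] == b {
            merged.push(format!("{a}{b}")); // fuse the winning pair
            i += 2;
        } else {
            merged.push(tokens[i].clone());
            i += 1;
        }
    }
    merged
}

fn main() {
    let tokens: Vec<String> = "aaab".chars().map(String::from).collect();
    // ("a", "a") occurs twice, so it is merged first: ["aa", "a", "b"]
    println!("{:?}", merge_most_frequent(&tokens));
}
```

Repeating this step until the vocabulary reaches the target size yields the learned merge rules.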
- ✅ Trains a BPE model with a specified vocabulary size
- 🔒 Supports special tokens that are not merged
- 🔁 Encodes text into token IDs and decodes back to text
- 💾 Saves and loads models to/from files
- 🧠 Handles unknown characters gracefully (see the sketch after this list)
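For the last point, one common approach (a minimal sketch with hypothetical names `vocab` and `UNK_ID`, not the crate's actual API) is to fall back to a reserved unknown-token ID for characters never seen during training:

```rust
use std::collections::HashMap;

// Illustrative sketch only: map each character to its ID, falling back to a
// reserved unknown-token ID for characters absent from the vocabulary.
// `vocab` and `UNK_ID` are hypothetical names, not the crate's actual API.
const UNK_ID: u32 = 0;

fn encode_chars(text: &str, vocab: &HashMap<String, u32>) -> Vec<u32> {
    text.chars()
        .map(|c| *vocab.get(&c.to_string()).unwrap_or(&UNK_ID))
        .collect()
}

fn main() {
    let mut vocab = HashMap::new();
    vocab.insert("h".to_string(), 1);
    vocab.insert("i".to_string(), 2);
    // '!' was never seen during training, so it maps to UNK_ID.
    assert_eq!(encode_chars("hi!", &vocab), vec![1, 2, 0]);
    println!("ok");
}
```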
To try it out, clone the repository and run it with Cargo:

```bash
git clone https://github.com/happybear-21/bpe-rs.git
cd bpe-rs
cargo run
```
A minimal usage example:

```rust
use std::io;

// Assumes the BPE type defined in main.rs is in scope (e.g. via
// `mod bpe; use bpe::BPE;` if it is split into its own module).
fn main() -> io::Result<()> {
    let mut bpe = BPE::new();

    // Train on a small corpus with a target vocabulary size and a
    // special token that is never merged.
    let training_text = "hello world hello there hello everyone";
    let vocab_size = 100;
    let special_tokens = vec!["<|endoftext|>".to_string()];
    bpe.train(training_text, vocab_size, special_tokens)?;

    // Round-trip a string through the trained model.
    let encoded = bpe.encode("hello world");
    let decoded = bpe.decode(&encoded);
    println!("Encoded: {:?}", encoded);
    println!("Decoded: {}", decoded);

    // Persist the trained model for later reuse.
    bpe.save("bpe_model.txt")?;
    Ok(())
}
```
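A saved model can be restored later; assuming a loading constructor that mirrors `save` (the exact name and signature in `main.rs` may differ), continuing the example above might look like:

```rust
// Hypothetical reload sketch; the crate's actual loading API may differ.
let loaded = BPE::load("bpe_model.txt")?;
// Re-encoding with the restored model reproduces the original IDs.
assert_eq!(loaded.encode("hello world"), encoded);
```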
Running `cargo run` on the bundled program (which uses its own sample text and file names) will produce output similar to:
```text
Training BPE model with text: Jack embraced beauty through art and life.
Text to encode: hello world
Encoded token IDs: [15, 121, 52, 52, 65, 22, 29, 65, 54, 52, 114]
Decoded text: hello world
Vocabulary size: 168
Learned merges: 38
Model saved to: vocab.json and bpe_merges.json
Model loaded from: vocab.json and bpe_merges.json
Re-encoded token IDs (after loading): [15, 121, 52, 52, 65, 22, 29, 65, 54, 52, 114]
Re-decoded text (after loading): hello world
```
- Includes error handling for file operations and unknown characters.
- For production use, consider:
  - Adding more tests (a minimal round-trip test is sketched below)
- Optimizing for large datasets
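On the testing point, a round-trip check is a natural starting place. A minimal sketch, assuming the `BPE` API from the usage example above:

```rust
// Minimal round-trip test sketch; assumes the BPE type from main.rs.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn encode_decode_round_trip() {
        let mut bpe = BPE::new();
        bpe.train("hello world hello there", 50, vec![]).unwrap();
        let ids = bpe.encode("hello world");
        // Decoding the encoded IDs must reproduce the original text.
        assert_eq!(bpe.decode(&ids), "hello world");
    }
}
```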