# 🌍 Swahili NLP Model Trainer & Simulation Pipeline
A comprehensive project designed to simulate, train, and deploy Swahili language models for natural language understanding, generation, and speech integration. Built for researchers, AI practitioners, and edtech developers focusing on African languages.
## 📘 Overview

This repository contains all components required to build, fine-tune, and simulate Swahili NLP models, including:
- Custom data ingestion pipelines
- Preprocessing workflows
- Model training using transformer architectures
- Evaluation tools and inference interfaces
- Voice/text simulation modules (optional)
## ✨ Why Swahili?

Swahili is spoken by over 100 million people across East and Central Africa, yet it remains vastly underrepresented in AI. This project helps close that gap by providing open tools to train, deploy, and simulate models tailored to African linguistics and culture.
## 🧩 Features

- 🧠 Custom NLP model training (BERT, GPT-2, mT5, LLaMA variants)
- 📚 Multi-source Swahili data ingestion (news, YouTube, chat)
- 🪄 Preprocessing with Swahili-aware tokenization (see the tokenizer sketch after this list)
- 🎯 Evaluation & visualization dashboard
- 🔊 Speech-to-text (ASR) + text-to-speech (TTS) modules (optional)
- ⚡ Deployable via HuggingFace, TorchScript, or ONNX
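The Swahili-aware tokenization is built on SentencePiece (see `src/tokenizer.py` in the project structure below). As a rough illustration, training a subword model could look like this sketch; the corpus path, vocabulary size, and model type are assumptions, not the repository's exact settings.

```python
# Sketch: train a SentencePiece subword tokenizer on a Swahili corpus.
# Paths and hyperparameters are illustrative placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/cleaned/swahili_corpus.txt",  # one sentence per line
    model_prefix="models/sw_tokenizer",       # writes sw_tokenizer.model / .vocab
    vocab_size=32000,
    model_type="unigram",                     # subword units cope well with Swahili's rich morphology
    character_coverage=1.0,                   # Swahili uses Latin script; keep every character
)

sp = spm.SentencePieceProcessor(model_file="models/sw_tokenizer.model")
print(sp.encode("Habari ya asubuhi, karibu!", out_type=str))
```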
## 🏗️ Project Structure
```text
swahili-nlp-simulator/
├── data/
│   ├── raw/                 # Unprocessed Swahili data
│   └── cleaned/             # Tokenized and filtered datasets
├── src/
│   ├── data_pipeline.py     # Ingest, clean, and normalize text
│   ├── tokenizer.py         # Swahili-specific tokenizer (SentencePiece)
│   ├── train.py             # Model training loop (transformer-based)
│   ├── evaluate.py          # BLEU, F1, perplexity metrics
│   └── simulate.py          # Interactive chat + simulation environment
├── models/
│   └── checkpoints/         # Trained models & weights
├── notebooks/
│   └── analysis.ipynb       # EDA, token stats, sample generation
├── README.md
└── requirements.txt
```
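For a sense of what the ingestion and cleaning stage involves, here is a minimal sketch in the spirit of `src/data_pipeline.py`; the regex rules, file paths, and length threshold are illustrative assumptions, not the pipeline's actual logic.

```python
# Sketch of a cleaning pass in the spirit of src/data_pipeline.py.
# Filtering rules below are assumptions, not the repository's actual behaviour.
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MULTISPACE_RE = re.compile(r"\s+")

def clean_line(text: str) -> str | None:
    """Normalize unicode, drop URLs and extra whitespace, skip very short lines."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    text = MULTISPACE_RE.sub(" ", text).strip()
    return text if len(text.split()) >= 3 else None

if __name__ == "__main__":
    with open("data/raw/news.txt", encoding="utf-8") as fin, \
         open("data/cleaned/news.txt", "w", encoding="utf-8") as fout:
        for line in fin:
            cleaned = clean_line(line)
            if cleaned:
                fout.write(cleaned + "\n")
```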
## 🚀 Getting Started
1. **Clone the Repository**

   ```bash
   git clone https://github.com/yourusername/swahili-nlp-simulator.git
   cd swahili-nlp-simulator
   ```

2. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Prepare Dataset**

   ```bash
   python src/data_pipeline.py --source news,youtube --lang sw
   ```

4. **Train the Model**

   ```bash
   python src/train.py --model bert --epochs 5 --batch_size 32
   ```

5. **Simulate Swahili Chat**

   ```bash
   python src/simulate.py
   ```

## 🧪 Model Training Flow

1. **Ingest Data** → from Swahili corpora, YouTube captions, and social text
2. **Clean & Tokenize** → normalize language, filter noise, segment words
3. **Train Model** → fine-tune a transformer using HuggingFace Transformers (see the sketch after this list)
4. **Evaluate** → test for coherence, generation fluency, and comprehension
5. **Simulate & Export** → interactive chat + deployment artifacts
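As a rough sketch of step 3, fine-tuning a multilingual BERT checkpoint on Swahili text with masked-language modeling might look like the following; the checkpoint, dataset path, and hyperparameters are assumptions, and `src/train.py` may be structured differently.

```python
# Sketch: masked-language-model fine-tuning on Swahili text with HuggingFace Transformers.
# Checkpoint, paths, and hyperparameters are assumptions; src/train.py may differ.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Plain-text corpus, one sentence per line (see data/cleaned/).
dataset = load_dataset("text", data_files={"train": "data/cleaned/news.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="models/checkpoints",
                           num_train_epochs=5,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```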
## 📊 Evaluation Metrics

- **BLEU Score** — translation-like quality check
- **F1 Score** — correctness on classification and understanding tasks
- **Perplexity** — model fluency/performance metric
- **Tokenization Quality** — word splits, slang support, subword mapping
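A minimal sketch of how the metrics above can be computed with common libraries (`sacrebleu`, scikit-learn, and the exponential of the evaluation loss for perplexity); `src/evaluate.py` may use different tooling.

```python
# Sketch: computing BLEU, F1, and perplexity with common libraries.
# sacrebleu and scikit-learn are assumed dependencies; evaluate.py may differ.
import math
import sacrebleu
from sklearn.metrics import f1_score

# BLEU: generated text against reference text
hypotheses = ["Habari ya leo ni njema"]
references = [["Habari za leo ni njema"]]  # one reference stream, aligned with hypotheses
print(f"BLEU: {sacrebleu.corpus_bleu(hypotheses, references).score:.2f}")

# F1: classification predictions against gold labels
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(f"F1:   {f1_score(y_true, y_pred):.2f}")

# Perplexity: exp of the average cross-entropy loss from evaluation
eval_loss = 2.31  # e.g. trainer.evaluate()["eval_loss"]
print(f"PPL:  {math.exp(eval_loss):.2f}")
```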
## 🗣️ Voice Integration (Optional Modules)
```bash
pip install speechrecognition gtts
```

- 🔊 **Speech-to-Text (ASR)** — recognize spoken Swahili phrases
- 🗣️ **Text-to-Speech (TTS)** — generate natural-sounding Swahili responses
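A minimal round-trip sketch using the two packages above. It assumes a working microphone (PyAudio installed) and uses the `sw-KE` / `sw` language codes, which you may need to adjust for your locale.

```python
# Sketch: Swahili speech-to-text and text-to-speech with SpeechRecognition and gTTS.
# Microphone setup and language codes are assumptions about your environment.
import speech_recognition as sr
from gtts import gTTS

recognizer = sr.Recognizer()

# Speech-to-text: capture one utterance from the default microphone
with sr.Microphone() as source:
    print("Sema sasa (speak now)...")
    audio = recognizer.listen(source)
text = recognizer.recognize_google(audio, language="sw-KE")
print("Heard:", text)

# Text-to-speech: synthesize a Swahili reply to an MP3 file
gTTS(text="Karibu! Nimefurahi kukusikia.", lang="sw").save("reply.mp3")
```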
## 🔐 Model Export & Deployment

- Export the trained model as `.pt`, `.onnx`, or `.pkl`
- Upload to the HuggingFace Model Hub or serve via FastAPI
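For a sense of what an ONNX export involves under the hood, here is a sketch using `torch.onnx.export` on a HuggingFace-format checkpoint; the checkpoint, fixed sequence length, and opset are assumptions, and `src/export.py` (invoked via the command below) may be implemented differently.

```python
# Sketch: export a (fine-tuned) BERT-style checkpoint to ONNX.
# The checkpoint name, fixed sequence length, and opset are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"  # or a saved directory under models/checkpoints/
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
model.config.return_dict = False             # return plain tuples, which trace/export cleanly
model.eval()

sample = tokenizer("Habari ya dunia", return_tensors="pt",
                   padding="max_length", max_length=128)

torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "models/checkpoints/model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
    opset_version=14,
)
```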
```bash
python src/export.py --format onnx
```

## 📈 Diagrams & Simulations

Architecture and flow diagrams are available in `docs/diagrams/`.
Simulations include:
- Token coverage visualization
- Real-time response simulation
- Confidence/attention scores (via the HuggingFace Trainer; see the sketch below)
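A small sketch of how confidence/attention scores can be pulled directly from model outputs (rather than through the `Trainer`); the checkpoint and sentence are placeholders.

```python
# Sketch: inspect attention scores for one Swahili sentence.
# The checkpoint is a placeholder; point it at a trained model from models/checkpoints/.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, output_attentions=True)

inputs = tokenizer("Elimu ni ufunguo wa maisha", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer
last_layer = outputs.attentions[-1][0]   # drop the batch dimension
avg_attention = last_layer.mean(dim=0)   # average over heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, row in zip(tokens, avg_attention):
    print(f"{token:>12}  attends most to  {tokens[int(row.argmax())]}")
```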
## 📄 License

MIT License — see `LICENSE`.
## 🙋‍♀️ Contributing

We welcome contributions that add support for more African languages, ASR/TTS integrations, and data pipelines.
## 📬 Contact

- GitHub: https://github.com/teambits009
- Email: brandonopere6@gmail.com