🚦 Multi-Label Toxic Comment Classifier

A modular, extensible end-to-end NLP pipeline for detecting and categorizing various types of toxicity in online comments. Built for the Jigsaw Toxic Comment Classification Challenge, this project provides robust data preprocessing, diverse model architectures, advanced evaluation, and experiment tracking—all in one place.

🚀 Features

Multi-label classification: toxic, severe toxic, obscene, threat, insult, identity hate
Flexible pipelines for classical models (vectorizers with shallow classifiers), RNNs, and Transformer-based models (BERT, GPT, LoRA, etc.)
Comprehensive preprocessing: advanced cleaning, normalization, tokenization, feature extraction
Class balancing: oversampling, adaptive focal loss, stratified splitting
Experimentation framework: modular scripts and Jupyter notebooks for ablation, benchmarking, visualization
Extendable utilities: add new models or plug in new data with ease

📁 Project Structure

The repository is organized for clarity and modularity, allowing easy extension and reproducibility:

Utils/:
Core Python modules and subfolders for all main tasks:
- preprocessing/: Text cleaning, tokenization, normalization, and feature extraction.
- rnn_models/: Architectures and training scripts for RNN-based models (GRU, LSTM, BiLSTM+Attention).
- transformer_models/: Code for loading, fine-tuning, and evaluating Transformer models (BERT, GPT, etc.).
- Vectorizers/: Classical text representation (TF-IDF, CountVectorizer) and shallow model utilities.
- comment_generator/: Tools for generating synthetic or adversarial comments.
- config/: Configuration files for paths, hyperparameters, and settings.
Playground/:
Jupyter notebooks for interactive exploration, EDA, model training/ablation, and prototyping.
- See Playground directory for all files and scripts.
pipeline/:
Project diagrams, including:
- flowchart.drawio: Editable pipeline flowchart for draw.io.
- flowchart.png: High-res static image of the pipeline.
NLP Project Poster _ Multi-Class Toxic Comment Classification/:
Scientific poster, LaTeX sources, and supplementary materials.

🛠️ Pipeline Flowchart

A visual overview of the full pipeline is provided in pipeline/flowchart.png (editable source: pipeline/flowchart.drawio).

📊 Dataset

Source: Wikipedia Comments Dataset
Description: 150,000+ comments, each annotated with one or more toxicity labels; highly imbalanced classes.
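To make the imbalance concrete, the positive rate of each label can be computed directly from the label columns. The sketch below assumes the six label column names used in the Jigsaw CSV release (`toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`) and runs on a tiny hand-made sample rather than the real data, so the printed rates are illustrative only. The `row` helper is hypothetical.

```python
from typing import Dict, List

# Label columns, assumed to match the header of the Jigsaw train.csv file.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def row(*positive: str) -> Dict[str, int]:
    """Hypothetical helper: build one label row with the given labels set to 1."""
    return {label: int(label in positive) for label in LABELS}


def label_prevalence(rows: List[Dict[str, int]]) -> Dict[str, float]:
    """Return the fraction of rows that are positive for each label."""
    if not rows:
        return {label: 0.0 for label in LABELS}
    return {label: sum(r[label] for r in rows) / len(rows) for label in LABELS}


# Toy rows for illustration only (not real dataset statistics).
sample = [
    row("toxic", "obscene", "insult"),
    row(),
    row(),
    row("toxic", "threat"),
]

for label, rate in label_prevalence(sample).items():
    print(f"{label:14s} {rate:.2%}")
```

On the real data the same function can be fed rows read with `csv.DictReader`, after converting the label fields to integers.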

🧹 Preprocessing

Advanced normalization:
Uses Ekphrasis to normalize URLs, emails, numbers, hashtags, all-caps words, and elongated, repeated, or censored words. HTML correction and Twitter-specific word segmentation and spelling correction are also applied.
Robust cleaning:
Truncates ultra-long comments and overly long words, and collapses extreme character repetitions (e.g., "soooooooo" → "sooo"). Falls back to simple tokenization for edge cases.
Semantic annotation:
Converts Ekphrasis tags to meaningful tokens, e.g.:
- <user> → PERSON
- <url> → WEBSITE
- <allcaps> → CAPS
- <elongated> → EMPHASIS
- <repeated> → INTENSE
- <hashtag> → TOPIC
Configurable:
All file paths and data locations are set via a YAML config.

For technical details, see Utils/preprocessing/preprocessor.py.
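The cleaning and annotation steps described above can be sketched with the standard library alone. This is a minimal illustration, not the code in preprocessor.py: the function names and the `MAX_CHARS` / `MAX_WORD_LEN` limits are hypothetical, and only the six opening tags listed above are mapped (closing tags emitted by Ekphrasis are left untouched).

```python
import re

# Tag-to-token mapping taken from the list in this section.
TAG_TO_TOKEN = {
    "<user>": "PERSON",
    "<url>": "WEBSITE",
    "<allcaps>": "CAPS",
    "<elongated>": "EMPHASIS",
    "<repeated>": "INTENSE",
    "<hashtag>": "TOPIC",
}

# Hypothetical limits; the real thresholds are defined by the project itself.
MAX_CHARS = 5000
MAX_WORD_LEN = 50


def collapse_repeats(text: str, keep: int = 3) -> str:
    """Shorten any run of more than `keep` identical characters to `keep`."""
    return re.sub(r"(.)\1{%d,}" % keep, lambda m: m.group(1) * keep, text)


def truncate(text: str) -> str:
    """Cut ultra-long comments, then cut overly long words."""
    words = [word[:MAX_WORD_LEN] for word in text[:MAX_CHARS].split()]
    return " ".join(words)


def annotate(text: str) -> str:
    """Replace Ekphrasis-style tags with the semantic tokens listed above."""
    for tag, token in TAG_TO_TOKEN.items():
        text = text.replace(tag, token)
    return text


print(annotate(collapse_repeats("<user> this is soooooooo bad <url>")))
# prints: PERSON this is sooo bad WEBSITE
```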

🤖 Model Architectures

Classical models & vectorizers:
TF-IDF, CountVectorizer, Logistic Regression, Random Forest, SVM, batch runners.
RNN-based models:
GRU, LSTM, BiLSTM (+attention), GloVe embeddings, adaptive focal loss.
Transformer-based & LLMs:
BERT, RoBERTa, GPT-1/2, FLAN-T5, LoRA fine-tuning, Few-shot prompting.
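This README does not spell out how the focal loss is made adaptive, so the sketch below shows only the standard binary focal loss, applied independently to each of the six labels. It is written in plain Python to keep the arithmetic visible. The `gamma` and `alpha` defaults are the values commonly quoted from the focal loss paper, not settings taken from this repository.

```python
import math
from typing import Sequence


def focal_loss(probs: Sequence[float], targets: Sequence[int],
               gamma: float = 2.0, alpha: float = 0.25,
               eps: float = 1e-7) -> float:
    """Mean binary focal loss over the labels of one comment."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p              # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class weighting term
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confident correct prediction contributes far less than a confident mistake.
easy = focal_loss([0.95], [1])
hard = focal_loss([0.05], [1])
print(f"easy={easy:.6f} hard={hard:.6f}")
```

With `gamma=0` the modulating factor disappears and the loss reduces to class-weighted binary cross-entropy, which is a convenient sanity check when tuning.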

📒 Playground Notebooks

EDA.ipynb
Exploratory Data Analysis: Visualizes class distributions, label imbalance, text length stats, word clouds, and key examples to understand the dataset’s structure and challenge.
Preprocessing.ipynb
Step-by-step demonstration of the text preprocessing pipeline, showing cleaning, normalization, tokenization, and annotation on real data samples.
Basic_model_trainer.ipynb
Quick-start template for shallow models (Logistic Regression, SVM, Random Forest) using vectorized features; includes metrics and simple cross-validation.
Vectorizers_model.ipynb
In-depth experiments with TF-IDF, CountVectorizer, and other classical feature extraction techniques, comparing their impact on different classifiers.
RNN_models.ipynb
Trains and evaluates GRU, LSTM, and BiLSTM models (optionally with attention). Includes embeddings setup (e.g., GloVe), class imbalance handling, and analysis of sequence learning.
Transformer_based_models.ipynb
Fine-tunes and benchmarks BERT, RoBERTa, and similar transformer models on the toxic comment dataset, tracking validation metrics and class-wise performance.
finetuning-gpt-1-full-training.ipynb
Shows end-to-end fine-tuning of a GPT-1 language model for the multi-label toxicity classification task, including data formatting and evaluation.
finetuning-flant5-base.ipynb
Demonstrates fine-tuning FLAN-T5 for the multi-label toxicity classification task and evaluating the fine-tuned model.
flan-t5-base-prompting.ipynb
Uses specialized few-shot prompts with a FLAN-T5-Base model to run inference on the test dataset and evaluate the predictions.
flan_t5_xl_prompting.ipynb
Uses specialized few-shot prompts with a FLAN-T5-XL model to run inference on the test dataset and evaluate the predictions.
GPT_Transformer_Toxic_Classifier.ipynb
Specialized experiments with GPT-2 based architectures for toxicity classification, including model adaptation and results.
Comment_generator.ipynb
Utility notebook for generating synthetic, adversarial, or “hard” toxic comments to augment the training set.
Comment_context_generator.ipynb
Generates contextualized comment examples for advanced augmentation or adversarial testing.
comparison.py
Script for aggregating, comparing, and ranking results across all models and approaches; useful for ablation studies and leaderboard creation.

For more notebooks, result scripts, and data generators, see the full Playground directory.
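As a rough illustration of the few-shot prompting approach used in the FLAN-T5 prompting notebooks, the sketch below assembles a prompt from labeled demonstrations and parses a text answer back into a label vector. The prompt wording, the `build_prompt` and `parse_labels` names, and the answer format are all assumptions made for this example; the notebooks may use a different template.

```python
from typing import List, Sequence, Tuple

LABELS = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]


def build_prompt(examples: Sequence[Tuple[str, List[str]]], comment: str) -> str:
    """Assemble an instruction, labeled demonstrations, and the query comment."""
    lines = [
        "Classify the comment. Answer with a comma-separated list drawn from: "
        + ", ".join(LABELS) + ". Answer 'none' if no label applies.",
        "",
    ]
    for text, labels in examples:
        lines.append(f"Comment: {text}")
        lines.append(f"Labels: {', '.join(labels) if labels else 'none'}")
        lines.append("")
    lines.append(f"Comment: {comment}")
    lines.append("Labels:")
    return "\n".join(lines)


def parse_labels(generated: str) -> List[int]:
    """Map the model's text answer back to a 0/1 vector in LABELS order."""
    predicted = {part.strip().lower() for part in generated.split(",")}
    return [int(label in predicted) for label in LABELS]


demos = [
    ("Thanks for fixing the article!", []),
    ("You are a complete idiot.", ["toxic", "insult"]),
]
prompt = build_prompt(demos, "hello")
print(prompt)
```

Matching whole comma-separated parts, rather than substrings, keeps an answer of "severe toxic" from also switching on "toxic".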

⚡ Installation

git clone https://github.com/SuneshSundarasami/Multi_Label_Toxic_Comment_Classifier.git
cd Multi_Label_Toxic_Comment_Classifier

# Recommended: Create conda environment from environment.yml
conda env create -f environment.yml
conda activate toxic-comment-classification

For RNN models:
Download glove.840B.300d.zip (2GB) and place the extracted .txt in your data/ directory.
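Once extracted, the GloVe file is plain text with one token per line followed by its vector components. The sketch below is one way to load only the vectors needed for a given vocabulary; the `load_glove` name and its parameters are hypothetical and not part of this repository. Splitting from the right is a precaution, since a few tokens in the 840B file are reported to contain spaces.

```python
import io
from typing import Dict, Iterable, List, Set


def load_glove(lines: Iterable[str], dim: int, vocab: Set[str]) -> Dict[str, List[float]]:
    """Parse GloVe text lines, keeping only the words that occur in `vocab`."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        # The last `dim` fields are the vector; everything before is the token.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        word = parts[0]
        if word in vocab:
            vectors[word] = [float(x) for x in parts[1:]]
    return vectors


# Tiny 3-dimensional stand-in for the real 300-dimensional file.
fake_file = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n. . . 0.7 0.8 0.9\n")
embeddings = load_glove(fake_file, dim=3, vocab={"hello", ". . ."})
print(sorted(embeddings))
```

With the real embeddings, pass an open file handle for the extracted .txt in data/ and `dim=300`. Restricting to the training vocabulary keeps memory use well below that of loading the full file.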

▶️ Usage

Preprocessing:
Use the script in Utils/preprocessing/preprocessor.py:

python Utils/preprocessing/preprocessor.py

Or import functions:

from Utils.preprocessing.preprocessor import preprocess_text, parallel_preprocess

EDA:
Playground/EDA.ipynb
Classical models:
Playground/Vectorizers_model.ipynb or vectorizer_runner.py
Deep learning:
Playground/RNN_models.ipynb or Transformer_based_models.ipynb
LLM fine-tuning:
finetuning-gpt-1-full-training.ipynb or finetuning-flant5-base.ipynb
Evaluation & comparison:
comparison.py
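The internals of comparison.py are not described here. As one example of the class-wise metrics such a comparison can rest on, the sketch below computes per-label F1 and its macro average from 0/1 label matrices in plain Python. The function name and the toy predictions are illustrative assumptions.

```python
from typing import List, Sequence


def per_label_f1(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> List[float]:
    """F1 score for each label column; 0.0 when a label never occurs or is never predicted."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy data: columns follow the order toxic, severe toxic, obscene, threat, insult, identity hate.
y_true = [[1, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]]
y_pred = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0]]

scores = per_label_f1(y_true, y_pred)
macro_f1 = sum(scores) / len(scores)
print(scores, round(macro_f1, 4))
```

Macro averaging weights every label equally, so rare classes such as threat or identity hate influence the score as much as the frequent toxic label.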

📈 Results

Transformers (BERT, FLAN-T5, GPT) outperform RNNs and classical models, especially on minority classes.
Class balancing combined with advanced preprocessing yields more robust detection.
See the scientific poster and EDA notebook for detailed results.

🙏 Acknowledgements

Thanks to Prof. Dr. Jörn Hees and Tim Metzler, M.Sc., for their guidance and support.

👥 Contributions

Sunesh Praveen Raja Sundarasami

Developed and implemented all classical, RNN-based, and transformer-based models (including BERT, GPT-2, and more)
Designed and executed all data preprocessing, EDA, class balancing, and ablation/evaluation studies
Created all experiment notebooks and scripts, and co-created the scientific poster
Contributed to project structure, code integration, and reproducibility

Aaron Cuthinho

Focused on transformer model training with LoRA (Low-Rank Adaptation)
Co-created the scientific poster

📬 Contact

For questions, suggestions, or collaboration, open an issue or visit:
GitHub: https://github.com/SuneshSundarasami/Multi_Label_Toxic_Comment_Classifier
