A modular, extensible end-to-end NLP pipeline for detecting and categorizing various types of toxicity in online comments. Built for the Jigsaw Toxic Comment Classification Challenge, this project provides robust data preprocessing, diverse model architectures, advanced evaluation, and experiment tracking—all in one place.
- Features
- Project Structure
- Pipeline Flowchart
- Dataset
- Preprocessing
- Model Architectures
- Playground Notebooks
- Installation
- Usage
- Results
- References
- Acknowledgements
- Contributions
- Contact
- Multi-label classification: toxic, severe toxic, obscene, threat, insult, identity hate
- Flexible pipelines for classical (RNN, vectorizers) & deep learning (BERT, GPT, LoRA, etc.)
- Comprehensive preprocessing: advanced cleaning, tokenization, class balancing, feature extraction
- Class balancing: oversampling, adaptive focal loss, stratified splitting
- Experimentation framework: modular scripts and Jupyter notebooks for ablation, benchmarking, visualization
- Extendable utilities: add new models or plug in new data with ease
The repository is organized for clarity and modularity, allowing easy extension and reproducibility:
-
Utils/
:
Core Python modules and subfolders for all main tasks:preprocessing/
: Text cleaning, tokenization, normalization, and feature extraction.rnn_models/
: Architectures and training scripts for RNN-based models (GRU, LSTM, BiLSTM+Attention).transformer_models/
: Code for loading, fine-tuning, and evaluating Transformer models (BERT, GPT, etc.).Vectorizers/
: Classical text representation (TF-IDF, CountVectorizer) and shallow model utilities.comment_generator/
: Tools for generating synthetic or adversarial comments.config/
: Configuration files for paths, hyperparameters, and settings.
-
Playground/
:
Jupyter notebooks for interactive exploration, EDA, model training/ablation, and prototyping.- See Playground directory for all files and scripts.
-
pipeline/
:
Project diagrams, including:flowchart.drawio
: Editable pipeline flowchart for draw.io.flowchart.png
: High-res static image of the pipeline (included below).
-
NLP Project Poster _ Multi-Class Toxic Comment Classification/
:
Scientific poster, LaTeX sources, and supplementary materials.
A visual overview of the full pipeline:
- Source: Wikipedia Comments Dataset
- Description: 150,000+ comments, each annotated with one or more toxicity labels; highly imbalanced classes.
-
Advanced normalization:
Utilizes Ekphrasis to normalize URLs, emails, numbers, hashtags, allcaps, elongated/repeated/censored words, and more. HTML correction and Twitter-specific segmentation/correction included. -
Robust cleaning:
Truncates ultra-long comments and overly long words, and collapses extreme character repetitions (e.g.,"soooooooo"
→"sooo"
). Falls back to simple tokenization for edge cases. -
Semantic annotation:
Converts Ekphrasis tags to meaningful tokens, e.g.:<user>
→PERSON
<url>
→WEBSITE
<allcaps>
→CAPS
<elongated>
→EMPHASIS
<repeated>
→INTENSE
<hashtag>
→TOPIC
-
Configurable:
All file paths and data locations are set via a YAML config.
For technical details, see preprocessor.py
.
-
Classical models & vectorizers:
TF-IDF, CountVectorizer, Logistic Regression, Random Forest, SVM, batch runners. -
RNN-based models:
GRU, LSTM, BiLSTM (+attention), GloVe embeddings, adaptive focal loss. -
Transformer-based & LLMs:
BERT, RoBERTa, GPT-1/2, FLAN-T5, LoRA fine-tuning, Few-shot prompting.
-
EDA.ipynb
Exploratory Data Analysis: Visualizes class distributions, label imbalance, text length stats, word clouds, and key examples to understand the dataset’s structure and challenge. -
Preprocessing.ipynb
Step-by-step demonstration of the text preprocessing pipeline, showing cleaning, normalization, tokenization, and annotation on real data samples. -
Basic_model_trainer.ipynb
Quick-start template for shallow models (Logistic Regression, SVM, Random Forest) using vectorized features; includes metrics and simple cross-validation. -
Vectorizers_model.ipynb
In-depth experiments with TF-IDF, CountVectorizer, and other classical feature extraction techniques, comparing their impact on different classifiers. -
RNN_models.ipynb
Trains and evaluates GRU, LSTM, and BiLSTM models (optionally with attention). Includes embeddings setup (e.g., GloVe), class imbalance handling, and analysis of sequence learning. -
Transformer_based_models.ipynb
Fine-tunes and benchmarks BERT, RoBERTa, and similar transformer models on the toxic comment dataset, tracking validation metrics and class-wise performance. -
finetuning-gpt-1-full-training.ipynb
Shows end-to-end fine-tuning of a GPT-1 language model for multi-label toxicity classification task, including data formatting and evaluation. -
finetuning-flant5-base.ipynb
Demonstrates fine-tuning of FLAN-T5 for the multi-label toxicity classification task and evaluation of the same. -
flan-t5-base-prompting.ipynb Using specialized few-shot prompts on a FLan-T5-Base model to perform inference on the test dataset and evaluation of the same.
-
flan_t5_xl_prompting.ipynb Using specialized few-shot prompts on a FLan-T5-XL model to perform inference on the test dataset and evaluation of the same.
-
GPT_Transformer_Toxic_Classifier.ipynb
Specialized experiments with GPT-2 based architectures for toxicity classification, including model adaptation and results. -
Comment_generator.ipynb
Utility notebook for generating synthetic, adversarial, or “hard” toxic comments to augment the training set. -
Comment_context_generator.ipynb
Generates contextualized comment examples for advanced augmentation or adversarial testing. -
comparison.py
Script for aggregating, comparing, and ranking results across all models and approaches; useful for ablation studies and leaderboard creation.
For more notebooks, result scripts, and data generators, see the full Playground directory.
git clone https://github.com/SuneshSundarasami/Multi_Label_Toxic_Comment_Classifier.git
cd Multi_Label_Toxic_Comment_Classifier
# Recommended: Create conda environment from environment.yml
conda env create -f environment.yml
conda activate toxic-comment-classification
For RNN models:
Download glove.840B.300d.zip (2GB) and place the extracted .txt
in your data/
directory.
-
Preprocessing:
Use the script inUtils/preprocessing/preprocessor.py
:python Utils/preprocessing/preprocessor.py
Or import functions:
from Utils.preprocessing.preprocessor import preprocess_text, parallel_preprocess
-
EDA:
Playground/EDA.ipynb
-
Classical models:
Playground/Vectorizers_model.ipynb
orvectorizer_runner.py
-
Deep learning:
Playground/RNN_models.ipynb
orTransformer_based_models.ipynb
-
LLM fine-tuning:
finetuning-gpt-1-full-training.ipynb
orfinetuning-flant5-base.ipynb
-
Evaluation & comparison:
comparison.py
- Transformers (BERT, FLAN-T5, GPT) outperform RNNs and classical models, especially on minority classes.
- Class balancing + advanced preprocessing = robust detection.
- See the scientific poster and EDA notebook for detailed results.
Thanks to Prof. Dr. Jörn Hees and Tim Metzler, M.Sc., for their guidance and support.
Sunesh Praveen Raja Sundarasami
- Developed and implemented all classical, RNN-based, and transformer-based models (including BERT, GPT-2, and more)
- Designed and executed all data preprocessing, EDA, class balancing, and ablation/evaluation studies
- Created all experiment notebooks, scripts, and co-created the scientific poster
- Contributed to project structure, code integration, and reproducibility
Aaron Cuthinho
- Focused on transformer model training with LoRA (Low-Rank Adaptation)
- Co-created the scientific poster
For questions, suggestions, or collaboration, open an issue or visit:
GitHub: Multi_Label_Toxic_Comment_Classifier