
🚦 Multi-Label Toxic Comment Classifier

A modular, extensible end-to-end NLP pipeline for detecting and categorizing various types of toxicity in online comments. Built for the Jigsaw Toxic Comment Classification Challenge, this project provides robust data preprocessing, diverse model architectures, advanced evaluation, and experiment tracking—all in one place.



🚀 Features

  • Multi-label classification: toxic, severe toxic, obscene, threat, insult, identity hate
  • Flexible pipelines for classical models (vectorizers + shallow classifiers), RNNs, and Transformers/LLMs (BERT, GPT, LoRA, etc.)
  • Comprehensive preprocessing: advanced cleaning, tokenization, class balancing, feature extraction
  • Class balancing: oversampling, adaptive focal loss (sketched after this list), stratified splitting
  • Experimentation framework: modular scripts and Jupyter notebooks for ablation, benchmarking, visualization
  • Extendable utilities: add new models or plug in new data with ease
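
The adaptive focal loss above is only named, not spelled out; as a rough reference, here is a minimal sketch of a standard multi-label (sigmoid) focal loss in PyTorch. The function name `focal_bce_loss`, the `gamma` default, and the optional `pos_weight` term are illustrative assumptions and may differ from the repository's adaptive variant.

```python
# Minimal multi-label focal loss sketch (PyTorch). The repository's "adaptive"
# focal loss may differ; gamma/pos_weight handling here is an assumption.
import torch
import torch.nn.functional as F

def focal_bce_loss(logits, targets, gamma=2.0, pos_weight=None):
    """logits, targets: (batch, 6) tensors; targets are 0/1 label indicators."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    p_t = targets * probs + (1 - targets) * (1 - probs)   # probability of the true class
    loss = (1 - p_t) ** gamma * bce                        # down-weight easy examples
    if pos_weight is not None:                             # optional per-label up-weighting
        loss = loss * (targets * pos_weight + (1 - targets))
    return loss.mean()
```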

📁 Project Structure

The repository is organized for clarity and modularity, allowing easy extension and reproducibility:

  • Utils/:
    Core Python modules and subfolders for all main tasks:

    • preprocessing/: Text cleaning, tokenization, normalization, and feature extraction.
    • rnn_models/: Architectures and training scripts for RNN-based models (GRU, LSTM, BiLSTM+Attention).
    • transformer_models/: Code for loading, fine-tuning, and evaluating Transformer models (BERT, GPT, etc.).
    • Vectorizers/: Classical text representation (TF-IDF, CountVectorizer) and shallow model utilities.
    • comment_generator/: Tools for generating synthetic or adversarial comments.
    • config/: Configuration files for paths, hyperparameters, and settings.
  • Playground/:
    Jupyter notebooks for interactive exploration, EDA, model training/ablation, and prototyping.

  • pipeline/:
    Project diagrams, including:

    • flowchart.drawio: Editable pipeline flowchart for draw.io.
    • flowchart.png: High-res static image of the pipeline (included below).
  • NLP Project Poster _ Multi-Class Toxic Comment Classification/:
    Scientific poster, LaTeX sources, and supplementary materials.


🛠️ Pipeline Flowchart

A visual overview of the full pipeline is available at pipeline/flowchart.png (editable source: pipeline/flowchart.drawio).


📊 Dataset

  • Source: Wikipedia Comments Dataset
  • Description: 150,000+ comments annotated across six toxicity labels (a comment may carry several labels or none); classes are highly imbalanced. A quick label-count sketch follows.
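
For orientation, a short pandas sketch of the label distribution, assuming the standard Jigsaw train.csv layout (a comment_text column plus six 0/1 label columns); the data/train.csv path is illustrative.

```python
# Inspect label counts and imbalance; file path and column names assume the
# standard Jigsaw train.csv layout.
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

df = pd.read_csv("data/train.csv")
print(f"{len(df)} comments")
print(df[LABELS].sum().sort_values(ascending=False))    # positives per label
print(f"{(df[LABELS].sum(axis=1) == 0).mean():.1%} of comments carry no toxicity label")
```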

🧹 Preprocessing

  • Advanced normalization:
    Utilizes Ekphrasis to normalize URLs, emails, numbers, hashtags, allcaps, elongated/repeated/censored words, and more. HTML correction and Twitter-specific segmentation/correction included.

  • Robust cleaning:
    Truncates ultra-long comments and overly long words, and collapses extreme character repetitions (e.g., "soooooooo" → "sooo"). Falls back to simple tokenization for edge cases.

  • Semantic annotation:
    Converts Ekphrasis tags to meaningful tokens, e.g.:

    • <user> → PERSON
    • <url> → WEBSITE
    • <allcaps> → CAPS
    • <elongated> → EMPHASIS
    • <repeated> → INTENSE
    • <hashtag> → TOPIC
  • Configurable:
    All file paths and data locations are set via a YAML config.

For technical details, see preprocessor.py.
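
As a rough illustration of the steps above, here is a minimal Ekphrasis-based normalizer plus the semantic tag mapping. It is a sketch, not the code in Utils/preprocessing/preprocessor.py; the exact normalize/annotate sets, truncation limits, and fallback logic in the repository may differ.

```python
# Sketch of Ekphrasis normalization + semantic tag mapping (not the repo's exact code).
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

text_processor = TextPreProcessor(
    normalize=["url", "email", "user", "number"],           # replace with <url>, <email>, ...
    annotate={"hashtag", "allcaps", "elongated", "repeated"},
    fix_html=True,                                           # fix HTML tokens
    segmenter="twitter",                                     # Twitter-based word segmentation
    corrector="twitter",                                     # Twitter-based spell correction
    unpack_hashtags=True,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

# Convert Ekphrasis tags into meaningful tokens (mapping from the list above).
TAG_MAP = {"<user>": "PERSON", "<url>": "WEBSITE", "<allcaps>": "CAPS",
           "<elongated>": "EMPHASIS", "<repeated>": "INTENSE", "<hashtag>": "TOPIC"}

def preprocess(comment: str) -> str:
    tokens = text_processor.pre_process_doc(comment)
    tokens = [TAG_MAP.get(t, t) for t in tokens if not t.startswith("</")]  # drop closing tags
    return " ".join(tokens)
```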


🤖 Model Architectures

  • Classical models & vectorizers:
    TF-IDF, CountVectorizer, Logistic Regression, Random Forest, SVM, and batch runners (a baseline sketch follows this list).

  • RNN-based models:
    GRU, LSTM, BiLSTM (+attention), GloVe embeddings, adaptive focal loss.

  • Transformer-based & LLMs:
    BERT, RoBERTa, GPT-1/2, FLAN-T5, LoRA fine-tuning, Few-shot prompting.
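
As referenced in the classical-models item above, a minimal scikit-learn baseline sketch. The CSV path, split, and hyperparameters are assumptions; the repository's batch runners under Utils/Vectorizers/ are more elaborate.

```python
# Minimal classical baseline: TF-IDF features + one-vs-rest logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

df = pd.read_csv("data/train.csv")                     # illustrative path; see Dataset section
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

baseline = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True),
    OneVsRestClassifier(LogisticRegression(max_iter=1000, class_weight="balanced")),
)
baseline.fit(train_df["comment_text"], train_df[LABELS])
val_scores = baseline.predict_proba(val_df["comment_text"])   # (n_samples, 6) per-label scores
```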


📒 Playground Notebooks

  • EDA.ipynb
    Exploratory Data Analysis: Visualizes class distributions, label imbalance, text length stats, word clouds, and key examples to understand the dataset’s structure and challenge.

  • Preprocessing.ipynb
    Step-by-step demonstration of the text preprocessing pipeline, showing cleaning, normalization, tokenization, and annotation on real data samples.

  • Basic_model_trainer.ipynb
    Quick-start template for shallow models (Logistic Regression, SVM, Random Forest) using vectorized features; includes metrics and simple cross-validation.

  • Vectorizers_model.ipynb
    In-depth experiments with TF-IDF, CountVectorizer, and other classical feature extraction techniques, comparing their impact on different classifiers.

  • RNN_models.ipynb
    Trains and evaluates GRU, LSTM, and BiLSTM models (optionally with attention). Includes embeddings setup (e.g., GloVe), class imbalance handling, and analysis of sequence learning.

  • Transformer_based_models.ipynb
    Fine-tunes and benchmarks BERT, RoBERTa, and similar transformer models on the toxic comment dataset, tracking validation metrics and class-wise performance (a minimal multi-label setup sketch follows this list).

  • finetuning-gpt-1-full-training.ipynb
    Shows end-to-end fine-tuning of a GPT-1 language model for the multi-label toxicity classification task, including data formatting and evaluation.

  • finetuning-flant5-base.ipynb
    Demonstrates fine-tuning FLAN-T5-Base for the multi-label toxicity classification task, including evaluation.

  • flan-t5-base-prompting.ipynb
    Runs inference on the test set with specialized few-shot prompts on FLAN-T5-Base and evaluates the results.

  • flan_t5_xl_prompting.ipynb
    Applies the same few-shot prompting setup to FLAN-T5-XL and evaluates the results.

  • GPT_Transformer_Toxic_Classifier.ipynb
    Specialized experiments with GPT-2 based architectures for toxicity classification, including model adaptation and results.

  • Comment_generator.ipynb
    Utility notebook for generating synthetic, adversarial, or “hard” toxic comments to augment the training set.

  • Comment_context_generator.ipynb
    Generates contextualized comment examples for advanced augmentation or adversarial testing.

  • comparison.py
    Script for aggregating, comparing, and ranking results across all models and approaches; useful for ablation studies and leaderboard creation.

For more notebooks, result scripts, and data generators, see the full Playground directory.
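
For readers new to the Transformer notebooks, here is a minimal Hugging Face sketch of how a multi-label classification head can be set up; the model choice, example input, and inference call are illustrative and not the notebooks' exact configuration.

```python
# Minimal multi-label setup with Hugging Face Transformers:
# problem_type="multi_label_classification" gives a sigmoid head + BCE loss during training.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",
)

enc = tokenizer(["example comment to score"], truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**enc).logits)    # (1, 6) per-label probabilities (untrained head here)
print(dict(zip(LABELS, probs.squeeze().tolist())))
```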


⚡ Installation

git clone https://github.com/SuneshSundarasami/Multi_Label_Toxic_Comment_Classifier.git
cd Multi_Label_Toxic_Comment_Classifier

# Recommended: Create conda environment from environment.yml
conda env create -f environment.yml
conda activate toxic-comment-classification

For RNN models:
Download glove.840B.300d.zip (2GB) and place the extracted .txt in your data/ directory.
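
A small sketch of turning the extracted GloVe file into an embedding matrix for the RNN models; the vocabulary format (word → row index, index 0 reserved for padding) and the random initialization of out-of-vocabulary words are assumptions.

```python
# Build an embedding matrix from the extracted glove.840B.300d.txt.
# Assumes `vocab` maps word -> row index, with index 0 reserved for padding.
import numpy as np

def build_embedding_matrix(vocab, path="data/glove.840B.300d.txt", dim=300):
    matrix = np.random.normal(scale=0.1, size=(len(vocab) + 1, dim)).astype("float32")
    matrix[0] = 0.0                                   # padding row
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])             # 840B keys can contain spaces
            if word in vocab:
                matrix[vocab[word]] = np.asarray(parts[-dim:], dtype="float32")
    return matrix
```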


▶️ Usage

  • Preprocessing:
    Use the script in Utils/preprocessing/preprocessor.py:

    python Utils/preprocessing/preprocessor.py

    Or import functions:

    from Utils.preprocessing.preprocessor import preprocess_text, parallel_preprocess
  • EDA:
    Playground/EDA.ipynb

  • Classical models:
    Playground/Vectorizers_model.ipynb or vectorizer_runner.py

  • Deep learning:
    Playground/RNN_models.ipynb or Transformer_based_models.ipynb

  • LLM fine-tuning:
    finetuning-gpt-1-full-training.ipynb or finetuning-flant5-base.ipynb

  • Evaluation & comparison:
    comparison.py
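
comparison.py aggregates results across models; the sketch below shows the kind of per-label metrics such a comparison works with. Array shapes, the 0.5 threshold, and the metric choices are illustrative assumptions.

```python
# Per-label evaluation sketch: F1 and ROC-AUC for each of the six labels.
# y_true: (n, 6) 0/1 ground truth; y_prob: (n, 6) predicted probabilities.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def per_label_report(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    for i, label in enumerate(LABELS):
        print(f"{label:>14}  F1={f1_score(y_true[:, i], y_pred[:, i]):.3f}  "
              f"AUC={roc_auc_score(y_true[:, i], y_prob[:, i]):.3f}")
    print(f"{'macro F1':>14}  {f1_score(y_true, y_pred, average='macro'):.3f}")
```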


📈 Results

  • Transformers (BERT, FLAN-T5, GPT) outperform RNNs and classical models, especially on minority classes.
  • Class balancing combined with advanced preprocessing leads to noticeably more robust detection.
  • See the scientific poster and EDA notebook for detailed results.


🙏 Acknowledgements

Thanks to Prof. Dr. Jörn Hees and Tim Metzler, M.Sc., for their guidance and support.


👥 Contributions

Sunesh Praveen Raja Sundarasami

  • Developed and implemented all classical, RNN-based, and transformer-based models (including BERT, GPT-2, and more)
  • Designed and executed all data preprocessing, EDA, class balancing, and ablation/evaluation studies
  • Created all experiment notebooks, scripts, and co-created the scientific poster
  • Contributed to project structure, code integration, and reproducibility

Aaron Cuthinho

  • Focused on transformer model training with LoRA (Low-Rank Adaptation)
  • Co-created the scientific poster

📬 Contact

For questions, suggestions, or collaboration, open an issue or visit:
GitHub: Multi_Label_Toxic_Comment_Classifier
