Bangla NLP datasets: NER, POS tagging, text summarization, stop words, translation, sentiment analysis, Wiki articles, root words, and more.

Foysal87/Bangla-NLP-Dataset

🇧🇩 Bangla NLP Dataset

A comprehensive collection of Bangla NLP datasets for researchers and developers


⚠️ IMPORTANT NOTICES ⚠️

πŸ”„ OUR DATASET IS IN LFS MODE! SO YOU HAVE TO CLONE IT FOR GETTING DATA!

πŸš€ WE WILL SOON UPLOAD ALL DEEP LEARNING BASED DATASETS!
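
Because the data files live in Git LFS, a checkout made without LFS support contains small text pointer files instead of the real data. The sketch below is one way to spot them after cloning. It relies on the fact that an LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`; the helper names `is_lfs_pointer` and `find_lfs_pointers` are made up for this example and are not part of this repository.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like an un-fetched Git LFS pointer."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_SIGNATURE))
    return head == LFS_SIGNATURE


def find_lfs_pointers(root):
    """List files under root (skipping .git) that are still LFS pointers."""
    root = Path(root)
    pointers = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if ".git" in path.relative_to(root).parts:
            continue
        if is_lfs_pointer(path):
            pointers.append(str(path))
    return sorted(pointers)


if __name__ == "__main__":
    # Scan the current directory, assumed to be the cloned repository.
    for name in find_lfs_pointers("."):
        print("still a pointer:", name)
```

If any files are reported, running `git lfs pull` inside the clone should replace the pointers with the actual data, assuming Git LFS is installed.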

📖 About

This repository contains the sbnltk datasets, which were used to build sbnltk, a Bangla NLP toolkit.

This repository also serves as a comprehensive collection of existing Bangla NLP datasets created by the amazing Bangla NLP research community.

🎯 sbnltk Dataset List (DUMP & HUMAN Evaluated) (sbnltk Dataset)

| Dataset | Description | Link |
| --- | --- | --- |
| Number List | Bangla Number List | 📥 Download |
| Root Word List | Bangla root word List | 📥 Download |
| Word List | Bangla Word List (highest to lowest occurrence) | 📥 Download |
| Wiki Dump | Bangla Wiki Dump word | 📥 Download |
| POS Tag Static | Bangla POStag static dataset (single word) | 📥 Download |
| NER Static | Bangla NER Static Dataset (single word) | 📥 Download |
| Stop Words | Bangla Stop word list | 📥 Download |
| Dump POS Tag | Bangla Dump Pos tag | 📥 Download |
| Question Classification | Bangla Dump question Classification Dataset | 📥 Download |
| Sentiment Analysis | Bangla Dump Sentiment Analysis | 📥 Download |
| Translation Dataset | Google Translation Dataset | 📥 Download |
| NER Enhanced | NER Existing Dataset (Modified + adding Date entity) | 📥 Download |
| News Articles | News Article Dataset | 📥 Download |
| POS Converted | POS tag converted Data | 📥 Download |
| POS Human Evaluated | POS tag human evaluated Data | 📥 Download |
| NER Dump (Both) | DUMP NER data (active and passive both) | 📥 Download |
| NER Dump (Active) | DUMP NER data (active only) | 📥 Download |
| Extractive Summarization | Extractive Text Summarization | 🔗 GitHub |
| Abstractive Summarization | Abstractive Text Summarization (newspaper) | 📥 Drive, 📊 Kaggle |
| Text Classification | News Article Classification (text Classification) | 📥 Drive, 📊 Kaggle |
| Keywords Classification | Topic Keywords classification (keywords generator) | 📥 Drive, 📊 Kaggle |
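
As a rough illustration of how a frequency-ordered word list and a stop-word list like the ones above are typically used together, the sketch below ranks tokens from highest to lowest occurrence after dropping stop words. The file formats of the actual downloads are not documented here, so the example works on in-memory data; the function name `frequency_ranked_words`, the sample sentence, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def frequency_ranked_words(tokens, stop_words=frozenset()):
    """Return distinct tokens ordered from most to least frequent,
    skipping anything listed in stop_words."""
    counts = Counter(token for token in tokens if token not in stop_words)
    # most_common() sorts by count; ties keep first-seen order.
    return [word for word, _ in counts.most_common()]


# Hypothetical sample text, split on whitespace for simplicity.
sample = "আমি ভাত খাই এবং আমি মাছ খাই এবং আমি".split()
sample_stop_words = {"এবং"}

print(frequency_ranked_words(sample, sample_stop_words))
```

Real Bangla text usually needs a proper tokenizer and Unicode normalization before counting; whitespace splitting is only a stand-in.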

🤖 Pre-trained Language Models

BERT-based Models

| Model | Description | Parameters | Link |
| --- | --- | --- | --- |
| BanglaBERT | ELECTRA-based model, state-of-the-art Bangla NLU | 110M | 🤗 HuggingFace |
| BanglishBERT | Bilingual (Bangla+English) BERT | 110M | 🤗 HuggingFace |
| BanglaBERT (Small) | Lightweight version for resource-constrained environments | 13M | 🤗 HuggingFace |
| BanglaBERT (Large) | Large variant with enhanced performance | 335M | 🤗 HuggingFace |
| Bangla BERT Base | Another popular BERT implementation | 110M | 🤗 HuggingFace |
| Bangla Electra | ELECTRA-based model for Bangla | 13.5M | 🤗 HuggingFace |

Generative Models (T5/GPT-style)

| Model | Description | Parameters | Link |
| --- | --- | --- | --- |
| BanglaT5 | T5-based sequence-to-sequence model | 247M | 🤗 HuggingFace |
| BanglaByT5 | Byte-level T5 model for Bangla | Small | 📄 Research Paper |
| TituLLMs | Family of Bangla LLMs (1B & 3B) | 1B/3B | 📄 Research Paper |
| TigerLLM | Bangla Large Language Models family | Various | 📄 Research Paper |
| GPT2-Bangla | GPT-2 adapted for Bangla text generation | 117M | 🤗 HuggingFace |
| BanglaNLG | Natural language generation for Bangla | Various | 🤗 HuggingFace |

Speech Models

| Model | Description | Performance | Link |
| --- | --- | --- | --- |
| Wav2Vec2-Bangla-300M | Self-supervised speech recognition | 17.8% WER | 🤗 HuggingFace |
| Whisper-Bangla | OpenAI Whisper fine-tuned for Bangla | Various sizes | 🤗 HuggingFace |
| BanglaASR | Fine-tuned ASR model | 14.73% WER | 🔗 GitHub |
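
The WER figures above are word error rates: the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below is a minimal implementation of that standard definition, not the evaluation script used by any of the listed models; text normalization (punctuation, Unicode forms) is left out and can change the score noticeably.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("আমি আজ স্কুলে যাব", "আমি কাল স্কুলে"))
```

Character error rate (CER) follows the same recipe with characters in place of words.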

Multilingual Models with Strong Bangla Support

| Model | Description | Languages | Link |
| --- | --- | --- | --- |
| MuRIL | Google's multilingual model with Bangla support | 17 Indian | 🤗 HuggingFace |
| IndicBERT | BERT for Indian languages including Bangla | 12 Indian | 🤗 HuggingFace |
| sahajBERT | ALBERT-based model for Bangla | 18M | 🤗 HuggingFace |

📄 Research Papers

Latest Research (2024-2025)

🧠 Knowledge Graphs and Semantic Analysis

  • BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering - 📖 LREC-COLING 2024 | 💻 Code
    • First framework for automatic Bangla KG construction using multilingual LLMs
    • GNN-based semantic filtering for improved accuracy
  • Bangladesh Agricultural Knowledge Graph: Enabling Semantic Integration and Data-driven Analysis - 📖 IEEE Access 2024
    • FAIR-compliant agricultural knowledge graph for sustainable farming

🗣️ Speech and Multimodal Processing

  • BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization - 📖 arXiv 2024
    • First end-to-end pipeline for Bangla dialect standardization
    • Achieved 0.8% CER and 1.5% WER for Noakhali dialect
  • Wav2Vec2-Bangla (300M) - 🤗 HuggingFace
    • Self-supervised speech model with 17.8% WER
    • Trained on OpenSLR Bangla dataset

🌐 Large Language Models and Generation

πŸ“Š Evaluation and Benchmarking

  • BLUB: A Comprehensive Evaluation Benchmark for Bangla Language Understanding - 📖 Research
    • First comprehensive Bangla NLP benchmark with 15+ tasks
  • BanglaBook: Large-scale Bangla Dataset for Sentiment Analysis - 📖 ACL 2023
    • 158K+ book reviews for sentiment analysis
  • Cross-lingual Transfer Learning for Bangla: What Works and What Doesn't - 📖 Findings of ACL 2024

Foundational Papers

🏗️ Language Models and Pretraining

📊 Task-Specific Research

🗣️ Speech and Multimodal

🌐 Cross-lingual and Multilingual Studies

🔧 Modern NLP Tools and Libraries

Python Libraries

| Library | Description | Features | Link |
| --- | --- | --- | --- |
| BNLP | Bengali Natural Language Processing Toolkit | Tokenization, Embedding, POS, NER | 🔗 GitHub |
| BNLTK | Bangla Natural Language Processing Toolkit | Tokenization, Stemming, POS Tagging | 🔗 GitHub |
| sbnltk | Bangla NLP toolkit (this repository's toolkit) | Comprehensive NLP suite | 🔗 GitHub |
| bnunicode | Unicode normalization for Bangla text | Bijoy to Unicode, normalization | 🔗 GitHub |
| pyBanglaKit | Comprehensive Bangla text processing | Tokenization, spell checking, sentiment | 🔗 GitHub |
| Indic NLP Library | Multi-Indic language processing | Script conversion, transliteration | 🔗 GitHub |
| BanglaTextProcessor | Advanced text processing pipeline | Dependency parsing, coreference | 🔗 GitHub |

OCR and Vision Tools

| Tool | Description | Features | Link |
| --- | --- | --- | --- |
| BanglaOCR | Comprehensive OCR system for Bangla | Print & handwriting recognition | 🔗 GitHub |
| EasyOCR-Bangla | Ready-to-use OCR solution | Simple Python API | 🔗 GitHub |
| TesseractBN | Tesseract with Bangla support | Command-line & API access | 🔗 GitHub |
| BanglaHWR | Handwriting recognition system | Real-time recognition | 🔗 GitHub |

Speech Processing Tools

| Tool | Description | Features | Link |
| --- | --- | --- | --- |
| BanglaVoice | Neural TTS system | Natural speech synthesis | 🔗 GitHub |
| FastSpeech-Bangla | Fast and robust TTS | Real-time synthesis | 🔗 GitHub |
| BanglaPhoneme | Phoneme analysis toolkit | IPA transcription support | 🔗 GitHub |

Installation Examples

# BNLP installation
pip install bnlp_toolkit

# BNLTK installation  
pip install bnltk
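
After installing, a quick way to confirm that the packages are visible to your interpreter is to look them up by import name. The sketch below assumes the import names are `bnlp` (for `bnlp_toolkit`) and `bnltk`; check each library's own documentation if the lookup fails. The helper `installed` is a name invented for this example.

```python
import importlib.util


def installed(module_names):
    """Map each top-level module name to True if it can be imported."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Assumed import names for the two packages installed above.
for name, available in installed(["bnlp", "bnltk"]).items():
    print(f"{name}: {'found' if available else 'missing'}")
```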

📊 Benchmarking and Evaluation

Bangla Language Understanding Benchmark (BLUB)

| Task | Dataset | Metric | Best Model | Score |
| --- | --- | --- | --- | --- |
| Sentiment Classification | SentNoB | Macro-F1 | BanglaBERT | 72.89 |
| Natural Language Inference | BNLI | Accuracy | BanglaBERT (Large) | 83.41 |
| Named Entity Recognition | MultiCoNER | Micro-F1 | BanglaBERT (Large) | 79.20 |
| Question Answering | BQA/TyDiQA | EM/F1 | BanglaBERT (Large) | 76.10/81.50 |
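
For readers unfamiliar with the Macro-F1 metric in the table, it is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below shows that textbook definition on made-up labels; it is not the benchmark's official scorer, and the function names are chosen for this example only.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    false_pos = sum(1 for t, p in pairs if t != label and p == label)
    false_neg = sum(1 for t, p in pairs if t == label and p != label)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical gold and predicted sentiment labels.
gold = ["pos", "pos", "neg", "neu"]
predicted = ["pos", "neg", "neg", "neu"]
print(round(macro_f1(gold, predicted), 4))
```

Micro-F1, used for the NER row, instead pools true positives, false positives, and false negatives across classes before computing a single F1.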

Recent Datasets for Benchmarking

| Dataset | Task | Size | Description | Link |
| --- | --- | --- | --- | --- |
| BanglaBook | Sentiment Analysis | 158,065 samples | Book reviews sentiment analysis | 🔗 GitHub |
| SentMix-3L | Code-Mixed Sentiment | 1,007 samples | Bangla-English-Hindi code-mixed | 🔗 GitHub |
| Awesome Bangla Datasets | Various | Multiple | Comprehensive collection | 🔗 GitHub |

🌟 Existing Datasets

πŸ“ Note: I am not the owner of these following datasets. It's just a collection to find amazing peoples and their works.
πŸ™ Please give them support! Your support will encourage them to do more amazing things.

πŸ”— Awesome Dataset Sources

πŸ“° News Articles and Documents

| Dataset | Description | Link |
| --- | --- | --- |
| Wiki Articles | Wikipedia Articles in Bangla | 📊 Kaggle |
| Bangladesh Protidin | News from Bangladesh Protidin | 📊 Kaggle |
| 40k News Articles | 40k Bangla Newspaper Articles | 📊 Kaggle |
| Largest News Dataset | Bangla Largest Newspaper Dataset | 📊 Kaggle |
| Wikipedia Dumps | All types of Wikipedia Articles | 🔗 Wiki Dumps |
| bdNews24 Corpus | bdNews24 largest dataset | 📊 Kaggle |

🎀 Speech to Text / Text to Speech

| Dataset | Description | Size | Link |
| --- | --- | --- | --- |
| OpenSLR Bangla | Large-scale speech corpus | 250+ hours, 2000+ speakers | 🔗 OpenSLR |
| Common Voice Bangla | Crowdsourced speech data | 500+ hours (growing) | 🔗 Mozilla |
| FLEURS Bangla | Cross-lingual speech corpus | 12 hours | 🤗 HuggingFace |
| BanglaASR Dataset | Fine-tuned ASR corpus | 23.8 hours | 🔗 GitHub |
| Text to Speech | Bengali Text to Speech Dataset | Studio quality | 🔗 Bengali.ai |
| Speech Recognition | Bengali Automatic Speech Recognition Dataset | Various speakers | 🔗 Bengali.ai |
| Regional Dialect ASR | Dialect-specific speech recognition | 100+ hours, 8 dialects | 🔗 GitHub |
| Multi-Speaker TTS | Multiple speaker TTS corpus | 20 hours, 10 speakers | 🔗 GitHub |
| Expressive TTS Dataset | Emotional speech synthesis | 15 hours, 8 emotions | 🔗 GitHub |
| Handwritten Digits | Numta Handwritten Bengali Digits | Visual recognition | 🔗 Bengali.ai |

😊 Sentiment Analysis / Sentence Classification

| Dataset | Description | Link |
| --- | --- | --- |
| BanglaBook | Large-scale book reviews (158K samples) | 🔗 GitHub |
| SentMix-3L | Code-mixed sentiment (Bangla-English-Hindi) | 🔗 GitHub |
| Social Media Comments | Bangla Text Dataset from Social Media | 🔗 GitHub |
| Sentiment Analysis | Bengali Sentiment Text | 📊 Kaggle |
| News Classification | Classification Bengali News Articles | 📊 Kaggle |
| Drama Review | Bangla Drama Review Dataset | 📊 Figshare |
| News Comments | Bengali News Comments Sentiment | 📊 Kaggle |
| News Headlines | News Headline Classification | 📊 Kaggle |
| Big News Classification | Bangla Newspaper Article Classification (Large) | 📊 Kaggle |
| YouTube Sentiment | Bangla YouTube Sentiment/Emotion Dataset | 📊 Kaggle |
| Multilingual Sentiment | Sentiment Lexicons for 81 Languages | 📊 Kaggle |
| Twitter Dataset | Twitter Sentiment Analysis Dataset | 🔗 GitHub |
| EmoNoBa | Emotion analysis on noisy Bangla texts | 🔗 GitHub |
| SentiGOLD | Multi-domain sentiment analysis | 🔗 GitHub |
| Bangla Emotion Corpus | Comprehensive emotion detection | 🔗 GitHub |
| Social Media Sentiment | Social media specific sentiment | 🔗 GitHub |
| Bangla Fake News Detection | Misinformation detection dataset | 📊 Kaggle |
| BanglaSarc | Sarcasm detection dataset | 🔗 GitHub |
| Complaint Classification | Customer complaint categorization | 🔗 GitHub |

🔄 Bangla Machine Translation Dataset

| Dataset | Description | Link |
| --- | --- | --- |
| 2.5M Pairs | 2.5M pair sentences - NOT low resource anymore | 🔗 GitHub |
| WMT24 Seed Dataset | High-quality manual translations | 📖 Paper |
| TED Dataset | TED dataset (small) | 📥 Download |
| Bangla Dictionary | Bengali Dictionary | 🔗 GitHub |
| SUPara Dataset | SUPara0.8M Balanced English-Bangla Parallel Corpus | 📊 IEEE DataPort |
| Samanantar | Large-scale parallel corpus | 🔗 AI4Bharat |
| OPUS Collections | Multiple parallel corpora | 🔗 OPUS |
| Indic-Indic Translation | Inter-Indic language translation | 🔗 GitHub |
| BanglaDialectTranslation | Regional dialect to standard Bangla | 🔗 GitHub |
| Vashantor | Multi-regional dialect corpus | 🔗 GitHub |
| Legal Translation Corpus | Legal document translation | 🔗 GitHub |
| Medical Translation Dataset | Healthcare translation | 🔗 GitHub |

🏷️ Bangla POS Tag Dataset

| Dataset | Description | Link |
| --- | --- | --- |
| 3k Sentences | 3k POS tag sentences | 🔗 GitHub |
| 100k+ Words | Single word tagging 100k+ | 📊 Kaggle |

🏷️ Bangla NER Dataset

| Dataset | Description | Link |
| --- | --- | --- |
| 70k Sentences | 70k sentences with 5 types of NER | 🔗 GitHub |
| 400k+ Words | Word-level NER 400k+ | 📊 Kaggle |
| B-NER | Comprehensive Bangla NER dataset | 🔗 GitHub |
| BanglaPersonNER | Person name extraction | 🔗 GitHub |
| Complex NER Dataset | Multi-type entity recognition | 🔗 GitHub |
| Medical NER Dataset | Healthcare entity recognition | 🔗 GitHub |
| Financial NER Corpus | Finance domain entities | 🔗 GitHub |
| Legal Entity Recognition | Legal document entity extraction | 🔗 GitHub |
| Bangladesh Geographic NER | Location entity recognition | 🔗 GitHub |

❓ Question Answering Dataset

| Dataset | Description | Link |
| --- | --- | --- |
| Squad 2.0 Style | Question Answering Squad 2.0 in Bangla | 📊 Kaggle |
| BanglaRQA | Reading comprehension dataset | 🔗 GitHub |
| SQuAD-BN | Bangla version of SQuAD | 🔗 GitHub |
| Contextual QA Dataset | Multi-context question answering | 🔗 GitHub |
| Medical QA Bangla | Healthcare question answering | 🔗 GitHub |
| Legal QA Dataset | Legal question answering | 🔗 GitHub |
| Educational QA Corpus | Academic question answering | 🔗 GitHub |
| Bangla Conversational QA | Multi-turn question answering | 🔗 GitHub |

📝 Bangla Text Summarization

| Dataset | Description | Link |
| --- | --- | --- |
| Article Summarization | Articles Summarization (extractive & abstractive) | 📊 Kaggle |
| BANSData | Dataset for Bengali Abstractive News Summarization | 📊 Kaggle |
| 3 Human Evaluated | Articles with 3 human evaluated summaries | 🔗 BNLPC |
| BenSum | Bangla news summarization | 🔗 GitHub |
| BanglaNewsSummarization | Extended news corpus | 🔗 GitHub |
| BUSUM | Multi-document summarization | 🔗 GitHub |
| Academic Paper Summarization | Research paper summarization | 🔗 GitHub |
| Book Chapter Summarization | Literature summarization | 🔗 GitHub |

🕵️ Bangla Fake News Detection

| Dataset | Description | Link |
| --- | --- | --- |
| 50k Fake News | 50k Bangla fake news dataset | 📊 Kaggle |

🖊️ Handwriting Recognition / OCR

| Dataset | Description | Link |
| --- | --- | --- |
| Ekush | Bangla Handwritten Characters | 🔗 Website |
| Bayanno | Multi-purpose handwritten dataset | 📊 Mendeley |
| BN-HTRd | Document Level Offline Bangla HTR (108k words) | 📊 Mendeley |
| Bongabdo | Bangla handwritten script dataset | 📄 Research Paper |
| BanglaOCR Dataset | Comprehensive OCR training data | 🔗 GitHub |
| BanglaHWR Dataset | Handwriting recognition corpus | 🔗 GitHub |
| Document Layout Analysis | Document understanding dataset | 🔗 GitHub |

🌐 Knowledge Graphs and Information Extraction

| Dataset | Description | Link |
| --- | --- | --- |
| BanglaAutoKG | Automatic knowledge graph construction | 🔗 GitHub |
| Bangladesh Agricultural KG | Agricultural data integration | 📄 IEEE Access |
| Bangla Wikipedia Knowledge Graph | Structured Wikipedia knowledge | 🔗 GitHub |
| Bangla Event Extraction | News event extraction | 🔗 GitHub |
| Social Media Event Detection | Real-time event detection | 🔗 GitHub |
| Bangla Relation Extraction | Entity relationship extraction | 🔗 GitHub |
| Knowledge Base Relations | Structured knowledge extraction | 🔗 GitHub |
| Aspect-Based Opinion Mining | Detailed opinion analysis | 🔗 GitHub |
| Bangla Semantic Textual Similarity | Sentence similarity dataset | 🔗 GitHub |
| Concept Mapping Dataset | Conceptual relationship mapping | 🔗 GitHub |
| Bangla WordNet | Lexical semantic network | 🔗 GitHub |

📚 Corpus and Language Modeling

| Dataset | Description | Size | Link |
| --- | --- | --- | --- |
| BanglaLM | Large language modeling corpus | 27.5 GB | 🔗 GitHub |
| Indic Corpus | Multi-lingual Indic corpus | 6.5 GB Bangla | 🔗 AI4Bharat |
| CC-100 Bangla | CommonCrawl Bangla subset | 8.3 GB | 🔗 StatMT |
| OSCAR Bangla | Web-crawled multilingual corpus | 12 GB | 🔗 OSCAR |
| Bangla Poetry Corpus | Classical and modern poetry | 25,000+ poems | 🔗 GitHub |
| Literary Text Collection | Bangla literature corpus | 10,000+ books | 🔗 GitHub |
| Academic Text Corpus | Scholarly text collection | 50,000+ papers | 🔗 GitHub |
| Bangla Morphological Analyzer | Morphological analysis dataset | 100,000+ word-morpheme pairs | 🔗 GitHub |
| Phonetic Transcription Corpus | IPA transcription dataset | 50,000+ word-pronunciation pairs | 🔗 GitHub |

🖼️ Multimodal Datasets

| Dataset | Description | Size | Link |
| --- | --- | --- | --- |
| Bangla Image Captioning | Image description generation | 50,000+ image-caption pairs | 🔗 GitHub |
| Visual Question Answering Bangla | Visual reasoning dataset | 25,000+ image-question-answer | 🔗 GitHub |
| Bangla Video Captioning | Video description dataset | 5,000+ video-caption pairs | 🔗 GitHub |
| Sign Language Recognition | Bangla sign language dataset | 10,000+ sign videos | 🔗 GitHub |
| Music-Text Alignment | Song lyrics alignment | 2,000+ song-lyric pairs | 🔗 GitHub |

🔧 Miscellaneous

| Dataset | Description | Link |
| --- | --- | --- |
| Numbers with Words | Bengali numbers with words | 📊 Kaggle |
| Image to Text | Bangla Natural Language Image to Text (BnLiT) | 📊 Kaggle |

💡 Motivation

Coming soon...

🤝 Usage and Contribution

Documentation for usage and contribution guidelines coming soon...

How to Get Started

  1. For Pre-trained Models: Visit HuggingFace model hub links above
  2. For Tools: Install Python libraries like BNLP or BNLTK
  3. For Datasets: Follow the individual dataset links and instructions
  4. For Research: Check out the latest papers and benchmarks

Contributing Guidelines

  • 📝 Submit new datasets through pull requests
  • 🐛 Report issues or broken links
  • 💡 Suggest improvements to the documentation
  • 🔬 Share your research findings

⭐ If you find this repository helpful, please give it a star! ⭐

🤝 Contributions are welcome! Feel free to submit issues and pull requests.

📬 Questions? Open an issue or contact the maintainers.

🌟 Special thanks to all the researchers and developers who contributed to Bangla NLP!


☕ Support This Project

If this repository has been helpful to you, consider supporting the project:

Buy Me A Coffee

Your support helps maintain and improve this resource for the Bangla NLP community! 💚
