A comprehensive collection of Bangla NLP datasets for researchers and developers
Note: the datasets in this repository are stored with Git LFS, so you have to clone the repository to get the data.
All deep-learning-based datasets will be uploaded soon.
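
Because the data files are tracked with Git LFS, a clone made without Git LFS installed contains small text pointer files instead of the real data. The sketch below is one way to detect that case from Python. The function name is hypothetical and not part of this repository; it only relies on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of every Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like an un-fetched Git LFS pointer file.

    Hypothetical helper for illustration; it only inspects the first bytes.
    """
    with Path(path).open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

If this returns `True` for a data file, running `git lfs install` followed by `git lfs pull` inside the clone should fetch the real content.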
- About
- sbnltk Dataset List
- Pre-trained Language Models
- Research Papers
- Modern NLP Tools and Libraries
- Benchmarking and Evaluation
- Existing Datasets
- News Articles and Documents
- Speech to Text / Text to Speech
- Sentiment Analysis / Sentence Classification
- Bangla Machine Translation Dataset
- Bangla POS Tag Dataset
- Bangla NER Dataset
- Question Answering Dataset
- Bangla Text Summarization
- Bangla Fake News Detection
- Handwriting Recognition / OCR
- Miscellaneous
- Motivation
- Usage and Contribute
This Bangla NLP dataset repository contains the sbnltk datasets, which are used in sbnltk, a Bangla NLP toolkit.
This repository also serves as a comprehensive collection of existing Bangla NLP datasets created by the amazing Bangla NLP research community.
Dataset | Description | Link |
---|---|---|
Number List | Bangla number list | Download |
Root Word List | Bangla root word list | Download |
Word List | Bangla word list (highest to lowest occurrence) | Download |
Wiki Dump | Bangla Wiki dump words | Download |
POS Tag Static | Bangla POS tag static dataset (single word) | Download |
NER Static | Bangla NER static dataset (single word) | Download |
Stop Words | Bangla stop word list | Download |
Dump POS Tag | Bangla dump POS tag | Download |
Question Classification | Bangla dump question classification dataset | Download |
Sentiment Analysis | Bangla dump sentiment analysis | Download |
Translation Dataset | Google Translation dataset | Download |
NER Enhanced | Existing NER dataset (modified, with an added Date entity) | Download |
News Articles | News article dataset | Download |
POS Converted | POS tag converted data | Download |
POS Human Evaluated | POS tag human-evaluated data | Download |
NER Dump (Both) | Dump NER data (both active and passive) | Download |
NER Dump (Active) | Dump NER data (active only) | Download |
Extractive Summarization | Extractive text summarization | GitHub |
Abstractive Summarization | Abstractive text summarization (newspaper) | Drive / Kaggle |
Text Classification | News article classification (text classification) | Drive / Kaggle |
Keywords Classification | Topic keywords classification (keywords generator) | Drive / Kaggle |
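
The Word List above is described as ordered from highest to lowest occurrence, and a stop word list is typically used to filter tokens before counting. The following is a minimal sketch of both ideas, assuming plain whitespace tokenization and a tiny in-memory sample. The sample sentences, the three stop words, and the function names are illustrative only and say nothing about the actual file formats of the downloads.

```python
from collections import Counter

# Illustrative stop words only; the real list comes from the Stop Words download.
SAMPLE_STOP_WORDS = {"আমি", "এবং", "ও"}


def tokenize(text):
    """Split on whitespace and strip the Bangla danda and common punctuation."""
    tokens = (token.strip("।,!?;:") for token in text.split())
    return [token for token in tokens if token]


def frequency_ordered_words(sentences, stop_words=frozenset()):
    """Return (word, count) pairs from highest to lowest occurrence, skipping stop words."""
    counts = Counter(
        token
        for sentence in sentences
        for token in tokenize(sentence)
        if token not in stop_words
    )
    return counts.most_common()


sentences = ["আমি বাংলা ভালোবাসি।", "আমি বাংলা পড়ি এবং লিখি।"]
print(frequency_ordered_words(sentences, SAMPLE_STOP_WORDS))
```

Whitespace splitting is a simplification; a dedicated Bangla tokenizer would handle punctuation and clitics more carefully.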
Model | Description | Parameters | Link |
---|---|---|---|
BanglaBERT | ELECTRA-based model, state-of-the-art Bangla NLU | 110M | HuggingFace |
BanglishBERT | Bilingual (Bangla+English) BERT | 110M | HuggingFace |
BanglaBERT (Small) | Lightweight version for resource-constrained environments | 13M | HuggingFace |
BanglaBERT (Large) | Large variant with enhanced performance | 335M | HuggingFace |
Bangla BERT Base | Another popular BERT implementation | 110M | HuggingFace |
Bangla Electra | ELECTRA-based model for Bangla | 13.5M | HuggingFace |
Model | Description | Parameters | Link |
---|---|---|---|
BanglaT5 | T5-based sequence-to-sequence model | 247M | HuggingFace |
BanglaByT5 | Byte-level T5 model for Bangla | Small | Research Paper |
TituLLMs | Family of Bangla LLMs (1B & 3B) | 1B/3B | Research Paper |
TigerLLM | Bangla Large Language Models family | Various | Research Paper |
GPT2-Bangla | GPT-2 adapted for Bangla text generation | 117M | HuggingFace |
BanglaNLG | Natural language generation for Bangla | Various | HuggingFace |
Model | Description | Performance | Link |
---|---|---|---|
Wav2Vec2-Bangla-300M | Self-supervised speech recognition | 17.8% WER | HuggingFace |
Whisper-Bangla | OpenAI Whisper fine-tuned for Bangla | Various sizes | HuggingFace |
BanglaASR | Fine-tuned ASR model | 14.73% WER | GitHub |
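
The Performance column above reports word error rate (WER). As a rough guide to what that number means, the sketch below computes WER and the related character error rate (CER) in the conventional way: the Levenshtein distance between the reference and the hypothesis, divided by the reference length. The function names are my own, and the published figures may rely on text normalization steps that are not reproduced here.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (combining marks count separately)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.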
Model | Description | Languages | Link |
---|---|---|---|
MuRIL | Google's multilingual model with Bangla support | 17 Indian | HuggingFace |
IndicBERT | BERT for Indian languages including Bangla | 12 Indian | HuggingFace |
sahajBERT | ALBERT-based model for Bangla | Bangla (18M parameters) | HuggingFace |
- BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering - LREC-COLING 2024 | Code
  - First framework for automatic Bangla KG construction using multilingual LLMs
  - GNN-based semantic filtering for improved accuracy
- Bangladesh Agricultural Knowledge Graph: Enabling Semantic Integration and Data-driven Analysis - IEEE Access 2024
  - FAIR-compliant agricultural knowledge graph for sustainable farming
- BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization - arXiv 2024
  - First end-to-end pipeline for Bangla dialect standardization
  - Achieved 0.8% CER and 1.5% WER for Noakhali dialect
- Wav2Vec2-Bangla (300M) - HuggingFace
  - Self-supervised speech model with 17.8% WER
  - Trained on OpenSLR Bangla dataset
- BanglaByT5: Byte-Level Modelling for Bangla - arXiv
- TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking - arXiv
- TigerLLM: A Family of Bangla Large Language Models - arXiv
- Bangla/Bengali Seed Dataset for WMT24 - Paper
- BLUB: A Comprehensive Evaluation Benchmark for Bangla Language Understanding - Research
  - First comprehensive Bangla NLP benchmark with 15+ tasks
- BanglaBook: Large-scale Bangla Dataset for Sentiment Analysis - ACL 2023
  - 158K+ book reviews for sentiment analysis
- Cross-lingual Transfer Learning for Bangla: What Works and What Doesn't - Findings of ACL 2024
- BanglaBERT: Language Model Pretraining and Benchmarks - NAACL 2022
- BanglaNLG and BanglaT5: Benchmarks for Bangla NLG - EACL 2023
- MuRIL: Multilingual Representations for Indian Languages - Research Paper
- IndicBERT: A Pre-trained Language Model for Indian Languages - ACL 2022
- Text Summarization Paper - IEEE
- Natural Language Inference in Bangla - Research Paper
- Sentiment Analysis in Bangla Text: A Comprehensive Study - Research
- Named Entity Recognition for Bangla: Challenges and Solutions - LREC 2022
- Bangla Speech Recognition: Traditional to Neural Approaches - INTERSPEECH 2023
- Cross-lingual Speech Recognition for Bangla - ICASSP 2023
- Multimodal Learning for Bangla: Vision and Language - CVPR 2023
- Cross-lingual Transfer for Low-Resource Languages: A Bangla Case Study - EMNLP 2023
- Multilingual Models for South Asian Languages - ACL 2023
- Zero-shot Learning for Bangla NLP Tasks - Findings of ACL 2023
Library | Description | Features | Link |
---|---|---|---|
BNLP | Bengali Natural Language Processing Toolkit | Tokenization, Embedding, POS, NER | GitHub |
BNLTK | Bangla Natural Language Processing Toolkit | Tokenization, Stemming, POS Tagging | GitHub |
sbnltk | Bangla NLP toolkit (this repository's toolkit) | Comprehensive NLP suite | GitHub |
bnunicode | Unicode normalization for Bangla text | Bijoy to Unicode, normalization | GitHub |
pyBanglaKit | Comprehensive Bangla text processing | Tokenization, spell checking, sentiment | GitHub |
Indic NLP Library | Multi-Indic language processing | Script conversion, transliteration | GitHub |
BanglaTextProcessor | Advanced text processing pipeline | Dependency parsing, coreference | GitHub |
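
Unicode normalization, listed as a feature above, matters for Bangla because the same visible character can be stored as different code point sequences. The sketch below illustrates this with Python's standard `unicodedata` module rather than any library from the table, whose APIs are not shown here. The helper name is hypothetical, and the sketch covers only canonical (NFC) normalization, not Bijoy-to-Unicode conversion.

```python
import unicodedata


def normalize_bangla(text):
    """Apply Unicode NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


# The vowel sign O can be stored as one code point (U+09CB)
# or as the pair E (U+09C7) + AA (U+09BE); NFC maps both to U+09CB.
composed = "\u0995\u09cb"          # KA + vowel sign O
decomposed = "\u0995\u09c7\u09be"  # KA + vowel sign E + vowel sign AA

print(composed == decomposed)                                      # False
print(normalize_bangla(composed) == normalize_bangla(decomposed))  # True
```

Normalizing text once at load time keeps dictionary lookups and frequency counts from splitting one word into several entries.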
Tool | Description | Features | Link |
---|---|---|---|
BanglaOCR | Comprehensive OCR system for Bangla | Print & handwriting recognition | GitHub |
EasyOCR-Bangla | Ready-to-use OCR solution | Simple Python API | GitHub |
TesseractBN | Tesseract with Bangla support | Command-line & API access | GitHub |
BanglaHWR | Handwriting recognition system | Real-time recognition | GitHub |
Tool | Description | Features | Link |
---|---|---|---|
BanglaVoice | Neural TTS system | Natural speech synthesis | GitHub |
FastSpeech-Bangla | Fast and robust TTS | Real-time synthesis | GitHub |
BanglaPhoneme | Phoneme analysis toolkit | IPA transcription support | GitHub |
```bash
# BNLP installation
pip install bnlp_toolkit
# BNLTK installation
pip install bnltk
```
Task | Dataset | Metric | Best Model | Score |
---|---|---|---|---|
Sentiment Classification | SentNoB | Macro-F1 | BanglaBERT | 72.89 |
Natural Language Inference | BNLI | Accuracy | BanglaBERT (Large) | 83.41 |
Named Entity Recognition | MultiCoNER | Micro-F1 | BanglaBERT (Large) | 79.20 |
Question Answering | BQA/TyDiQA | EM/F1 | BanglaBERT (Large) | 76.10/81.50 |
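
The table above mixes Macro-F1 and Micro-F1, which can differ substantially on imbalanced data. The sketch below shows how the two are conventionally computed for single-label classification. The function name and the toy labels are illustrative, and this is not the evaluation code behind the reported scores. For NER, Micro-F1 is normally computed over entity spans rather than per-token labels, which this sketch does not cover.

```python
from collections import Counter


def f1_scores(gold, predicted):
    """Return (macro_f1, micro_f1) for single-label classification."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")

    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        if gold_label == predicted_label:
            tp[gold_label] += 1
        else:
            fp[predicted_label] += 1
            fn[gold_label] += 1

    def f1(true_pos, false_pos, false_neg):
        denominator = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denominator if denominator else 0.0

    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[label], fp[label], fn[label]) for label in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


macro, micro = f1_scores(["pos", "pos", "pos", "neg"], ["pos", "pos", "pos", "pos"])
print(f"macro={macro:.4f} micro={micro:.4f}")
```

In this toy case the classifier never predicts the minority class, so Macro-F1 drops well below Micro-F1, which in the single-label setting equals accuracy.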
Dataset | Task | Size | Description | Link |
---|---|---|---|---|
BanglaBook | Sentiment Analysis | 158,065 samples | Book reviews sentiment analysis | GitHub |
SentMix-3L | Code-Mixed Sentiment | 1,007 samples | Bangla-English-Hindi code-mixed | GitHub |
Awesome Bangla Datasets | Various | Multiple | Comprehensive collection | GitHub |
Note: I am not the owner of the following datasets. This is just a collection to help you find amazing people and their work.
Please support them! Your support will encourage them to do more amazing things.
Dataset | Description | Link |
---|---|---|
Wiki Articles | Wikipedia Articles in Bangla | Kaggle |
Bangladesh Protidin | News from Bangladesh Protidin | Kaggle |
40k News Articles | 40k Bangla Newspaper Articles | Kaggle |
Largest News Dataset | Bangla Largest Newspaper Dataset | Kaggle |
Wikipedia Dumps | All types of Wikipedia Articles | Wiki Dumps |
bdNews24 Corpus | bdNews24 largest dataset | Kaggle |
Dataset | Description | Size | Link |
---|---|---|---|
OpenSLR Bangla | Large-scale speech corpus | 250+ hours, 2000+ speakers | OpenSLR |
Common Voice Bangla | Crowdsourced speech data | 500+ hours (growing) | Mozilla |
FLEURS Bangla | Cross-lingual speech corpus | 12 hours | HuggingFace |
BanglaASR Dataset | Fine-tuned ASR corpus | 23.8 hours | GitHub |
Text to Speech | Bengali Text to Speech Dataset | Studio quality | Bengali.ai |
Speech Recognition | Bengali Automatic Speech Recognition Dataset | Various speakers | Bengali.ai |
Regional Dialect ASR | Dialect-specific speech recognition | 100+ hours, 8 dialects | GitHub |
Multi-Speaker TTS | Multiple speaker TTS corpus | 20 hours, 10 speakers | GitHub |
Expressive TTS Dataset | Emotional speech synthesis | 15 hours, 8 emotions | GitHub |
Handwritten Digits | Numta Handwritten Bengali Digits | Visual recognition | Bengali.ai |
Dataset | Description | Link |
---|---|---|
BanglaBook | Large-scale book reviews (158K samples) | GitHub |
SentMix-3L | Code-mixed sentiment (Bangla-English-Hindi) | GitHub |
Social Media Comments | Bangla Text Dataset from Social Media | GitHub |
Sentiment Analysis | Bengali Sentiment Text | Kaggle |
News Classification | Classification Bengali News Articles | Kaggle |
Drama Review | Bangla Drama Review Dataset | Figshare |
News Comments | Bengali News Comments Sentiment | Kaggle |
News Headlines | News Headline Classification | Kaggle |
Big News Classification | Bangla Newspaper Article Classification (Large) | Kaggle |
YouTube Sentiment | Bangla YouTube Sentiment/Emotion Dataset | Kaggle |
Multilingual Sentiment | Sentiment Lexicons for 81 Languages | Kaggle |
Twitter Dataset | Twitter Sentiment Analysis Dataset | GitHub |
EmoNoBa | Emotion analysis on noisy Bangla texts | GitHub |
SentiGOLD | Multi-domain sentiment analysis | GitHub |
Bangla Emotion Corpus | Comprehensive emotion detection | GitHub |
Social Media Sentiment | Social media specific sentiment | GitHub |
Bangla Fake News Detection | Misinformation detection dataset | Kaggle |
BanglaSarc | Sarcasm detection dataset | GitHub |
Complaint Classification | Customer complaint categorization | GitHub |
Dataset | Description | Link |
---|---|---|
2.5M Pairs | 2.5M sentence pairs - not low-resource anymore | GitHub |
WMT24 Seed Dataset | High-quality manual translations | Paper |
TED Dataset | TED dataset (small) | Download |
Bangla Dictionary | Bengali Dictionary | GitHub |
SUPara Dataset | SUPara0.8M Balanced English-Bangla Parallel Corpus | IEEE DataPort |
Samanantar | Large-scale parallel corpus | AI4Bharat |
OPUS Collections | Multiple parallel corpora | OPUS |
Indic-Indic Translation | Inter-Indic language translation | GitHub |
BanglaDialectTranslation | Regional dialect to standard Bangla | GitHub |
Vashantor | Multi-regional dialect corpus | GitHub |
Legal Translation Corpus | Legal document translation | GitHub |
Medical Translation Dataset | Healthcare translation | GitHub |
Dataset | Description | Link |
---|---|---|
3k Sentences | 3k POS tag sentences | GitHub |
100k+ Words | Single word tagging 100k+ | Kaggle |
Dataset | Description | Link |
---|---|---|
70k Sentences | 70k sentences with 5 types of NER | GitHub |
400k+ Words | Word-level NER 400k+ | Kaggle |
B-NER | Comprehensive Bangla NER dataset | GitHub |
BanglaPersonNER | Person name extraction | GitHub |
Complex NER Dataset | Multi-type entity recognition | GitHub |
Medical NER Dataset | Healthcare entity recognition | GitHub |
Financial NER Corpus | Finance domain entities | GitHub |
Legal Entity Recognition | Legal document entity extraction | GitHub |
Bangladesh Geographic NER | Location entity recognition | GitHub |
Dataset | Description | Link |
---|---|---|
SQuAD 2.0 Style | SQuAD 2.0-style question answering in Bangla | Kaggle |
BanglaRQA | Reading comprehension dataset | GitHub |
SQuAD-BN | Bangla version of SQuAD | GitHub |
Contextual QA Dataset | Multi-context question answering | GitHub |
Medical QA Bangla | Healthcare question answering | GitHub |
Legal QA Dataset | Legal question answering | GitHub |
Educational QA Corpus | Academic question answering | GitHub |
Bangla Conversational QA | Multi-turn question answering | GitHub |
Dataset | Description | Link |
---|---|---|
Article Summarization | Articles Summarization (extractive & abstractive) | Kaggle |
BANSData | Dataset for Bengali Abstractive News Summarization | Kaggle |
3 Human Evaluated | Articles with 3 human evaluated summaries | BNLPC |
BenSum | Bangla news summarization | GitHub |
BanglaNewsSummarization | Extended news corpus | GitHub |
BUSUM | Multi-document summarization | GitHub |
Academic Paper Summarization | Research paper summarization | GitHub |
Book Chapter Summarization | Literature summarization | GitHub |
Dataset | Description | Link |
---|---|---|
50k Fake News | 50k Bangla fake news dataset | Kaggle |
Dataset | Description | Link |
---|---|---|
Ekush | Bangla Handwritten Characters | Website |
Bayanno | Multi-purpose handwritten dataset | Mendeley |
BN-HTRd | Document Level Offline Bangla HTR (108k words) | Mendeley |
Bongabdo | Bangla handwritten script dataset | Research Paper |
BanglaOCR Dataset | Comprehensive OCR training data | GitHub |
BanglaHWR Dataset | Handwriting recognition corpus | GitHub |
Document Layout Analysis | Document understanding dataset | GitHub |
Dataset | Description | Link |
---|---|---|
BanglaAutoKG | Automatic knowledge graph construction | GitHub |
Bangladesh Agricultural KG | Agricultural data integration | IEEE Access |
Bangla Wikipedia Knowledge Graph | Structured Wikipedia knowledge | GitHub |
Bangla Event Extraction | News event extraction | GitHub |
Social Media Event Detection | Real-time event detection | GitHub |
Bangla Relation Extraction | Entity relationship extraction | GitHub |
Knowledge Base Relations | Structured knowledge extraction | GitHub |
Aspect-Based Opinion Mining | Detailed opinion analysis | GitHub |
Bangla Semantic Textual Similarity | Sentence similarity dataset | GitHub |
Concept Mapping Dataset | Conceptual relationship mapping | GitHub |
Bangla WordNet | Lexical semantic network | GitHub |
Dataset | Description | Size | Link |
---|---|---|---|
BanglaLM | Large language modeling corpus | 27.5 GB | GitHub |
Indic Corpus | Multi-lingual Indic corpus | 6.5 GB Bangla | AI4Bharat |
CC-100 Bangla | CommonCrawl Bangla subset | 8.3 GB | StatMT |
OSCAR Bangla | Web-crawled multilingual corpus | 12 GB | OSCAR |
Bangla Poetry Corpus | Classical and modern poetry | 25,000+ poems | GitHub |
Literary Text Collection | Bangla literature corpus | 10,000+ books | GitHub |
Academic Text Corpus | Scholarly text collection | 50,000+ papers | GitHub |
Bangla Morphological Analyzer | Morphological analysis dataset | 100,000+ word-morpheme pairs | GitHub |
Phonetic Transcription Corpus | IPA transcription dataset | 50,000+ word-pronunciation pairs | GitHub |
Dataset | Description | Size | Link |
---|---|---|---|
Bangla Image Captioning | Image description generation | 50,000+ image-caption pairs | GitHub |
Visual Question Answering Bangla | Visual reasoning dataset | 25,000+ image-question-answer | GitHub |
Bangla Video Captioning | Video description dataset | 5,000+ video-caption pairs | GitHub |
Sign Language Recognition | Bangla sign language dataset | 10,000+ sign videos | GitHub |
Music-Text Alignment | Song lyrics alignment | 2,000+ song-lyric pairs | GitHub |
Dataset | Description | Link |
---|---|---|
Numbers with Words | Bengali numbers with words | Kaggle |
Image to Text | Bangla Natural Language Image to Text (BnLiT) | Kaggle |
Coming soon...
Documentation for usage and contribution guidelines coming soon...
- For Pre-trained Models: Visit HuggingFace model hub links above
- For Tools: Install Python libraries like BNLP or BNLTK
- For Datasets: Follow the individual dataset links and instructions
- For Research: Check out the latest papers and benchmarks
- Submit new datasets through pull requests
- Report issues or broken links
- Suggest improvements to the documentation
- Share your research findings
If you find this repository helpful, please give it a star!
Contributions are welcome! Feel free to submit issues and pull requests.
Questions? Open an issue or contact the maintainers.
Special thanks to all the researchers and developers who contributed to Bangla NLP!
If this repository has been helpful to you, consider supporting the project:
Your support helps maintain and improve this resource for the Bangla NLP community!