This repository accompanies our ACL 2025 Findings paper: "Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval"
✨ We provide a reproducible benchmark suite for Amharic information retrieval, including:
- BM25 sparse baseline
- Dense embedding models (RoBERTa / BERT variants fine-tuned for Amharic)
- ColBERT-AM (late-interaction retriever)
- Pretrained Amharic retrieval models: RoBERTa-Base-Amharic-Embd, RoBERTa-Medium-Amharic-Embd, BERT-Medium-Amharic-Embd, and ColBERT-AM
- Hugging Face model & dataset links for easy access
- Training, evaluation, and inference scripts for reproducibility
- Benchmarks for BM25 (sparse retrieval), bi-encoder dense retrieval, and ColBERT (late-interaction retrieval) in Amharic
- MS MARCO-style dataset conversion script & direct dataset links
```
amharic-ir-benchmarks/
├── baselines/                  # BM25, ColBERT, and dense Amharic retrievers
│   ├── bm25_retriever/
│   ├── ColBERT_AM/
│   ├── colbert-amharic-pylate/
│   └── embedding_models/
├── data/                       # Scripts to download, preprocess, and prepare datasets
├── scripts/                    # Shell scripts for training, indexing, evaluation
├── utils/                      # Utility functions
├── amharic_environment.yml     # Conda environment
├── requirements.txt
└── README.md
```
Create and activate the Conda environment:

```bash
conda env create -f amharic_environment.yml
conda activate amharic_ir
```

Or install the dependencies with pip:

```bash
pip install -r requirements.txt
```
We use two publicly available Amharic datasets:
Dataset | Description | Link |
---|---|---|
2AIRTC | Ad-hoc IR test collection | 2AIRTC Website |
Amharic News | Headline–body classification corpus | Hugging Face |
Scripts for downloading and preprocessing the datasets are in the `data/` folder.
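For orientation, the snippet below sketches what an MS MARCO-style conversion produces (tab-separated `collection.tsv`, `queries.tsv`, and `qrels.tsv`). It is an illustrative sketch only, with hypothetical field names and a headline-as-query pairing; see the actual scripts in `data/` for the exact schema used in the paper.

```python
# Illustrative sketch: write news records into MS MARCO-style TSV files.
# Field names ("headline", "article") are hypothetical placeholders,
# not the schema of the repository's data/ scripts.
import csv

records = [
    {"headline": "የኢኮኖሚ ዕድገት ሪፖርት ተገለጸ", "article": "የመጀመሪያው ዜና ሙሉ ጽሑፍ ..."},
    {"headline": "የእግር ኳስ ውድድር ተጠናቀቀ", "article": "የሁለተኛው ዜና ሙሉ ጽሑፍ ..."},
]

with open("collection.tsv", "w", newline="", encoding="utf-8") as col_f, \
     open("queries.tsv", "w", newline="", encoding="utf-8") as qry_f, \
     open("qrels.tsv", "w", newline="", encoding="utf-8") as qrel_f:
    col_w = csv.writer(col_f, delimiter="\t")    # pid \t passage
    qry_w = csv.writer(qry_f, delimiter="\t")    # qid \t query
    qrel_w = csv.writer(qrel_f, delimiter="\t")  # qid \t 0 \t pid \t relevance
    for i, rec in enumerate(records):
        col_w.writerow([i, rec["article"]])
        qry_w.writerow([i, rec["headline"]])
        qrel_w.writerow([i, 0, i, 1])
```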
Our Amharic text embedding and ColBERT models are available in our Hugging Face collection.
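As a quick-start sketch (not an excerpt from this repository), the snippet below loads one of the released bi-encoders with Sentence Transformers and ranks two toy passages for an Amharic query. The model ID is a placeholder; substitute the exact checkpoint name from the collection, and check the model card in case query/passage prefixes are required.

```python
# Minimal dense-retrieval sketch with Sentence Transformers.
# The model ID below is a placeholder, not a verified checkpoint name.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("<hf-username>/RoBERTa-Base-Amharic-Embed")

query = "የኢትዮጵያ ኢኮኖሚ ዕድገት"
passages = [
    "የኢትዮጵያ ኢኮኖሚ በዚህ ዓመት ዕድገት አሳይቷል ...",
    "የእግር ኳስ ውድድሩ ትናንት ተጠናቋል ...",
]

# Encode query and passages, then rank passages by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
passage_emb = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_emb, passage_emb)[0]

for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}\t{passage}")
```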
Train ColBERT-AM:

```bash
bash scripts/train_colbert.sh
```

Run the BM25 baseline notebook (for example, via Jupyter):

```bash
jupyter nbconvert --to notebook --execute baselines/bm25_retriever/run_bm25.ipynb
```

Index the collection and run retrieval with ColBERT:

```bash
bash scripts/index_colbert.sh
bash scripts/retrieve_colbert.sh
```
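The repository also contains a PyLate port (`colbert-amharic-pylate/`). The sketch below follows PyLate's generic index-and-retrieve pattern with a placeholder model ID and toy documents; it is an assumption-based illustration rather than the repository's scripts, and argument names may vary across PyLate versions.

```python
# Illustrative PyLate late-interaction retrieval sketch (not the repo's scripts).
# The checkpoint name is a placeholder standing in for the released ColBERT-AM model.
from pylate import indexes, models, retrieve

model = models.ColBERT(model_name_or_path="<hf-username>/ColBERT-AM")

index = indexes.Voyager(index_folder="colbert-am-index", index_name="amharic", override=True)
retriever = retrieve.ColBERT(index=index)

doc_ids = ["d1", "d2"]
docs = ["የመጀመሪያው የአማርኛ ሰነድ ...", "ሁለተኛው የአማርኛ ሰነድ ..."]

# Encode and index the documents with per-token (late-interaction) embeddings.
doc_embeddings = model.encode(docs, is_query=False, show_progress_bar=False)
index.add_documents(documents_ids=doc_ids, documents_embeddings=doc_embeddings)

# Encode a query and retrieve the top-k documents via MaxSim scoring.
query_embeddings = model.encode(["የአማርኛ ሰነድ ፍለጋ"], is_query=True, show_progress_bar=False)
results = retriever.retrieve(queries_embeddings=query_embeddings, k=5)
print(results)
```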
This table presents the performance of Amharic-optimized vs. multilingual dense retrieval models on the Amharic News dataset, using a bi-encoder architecture. We report MRR@10, NDCG@10, and Recall@10/50/100. Best scores are in bold, and † indicates statistically significant improvements (p < 0.05) over the strongest multilingual baseline.
Model | Params | MRR@10 | NDCG@10 | Recall@10 | Recall@50 | Recall@100 |
---|---|---|---|---|---|---|
*Multilingual models* | | | | | | |
gte-modernbert-base | 149M | 0.019 | 0.023 | 0.033 | 0.051 | 0.067 |
gte-multilingual-base | 305M | 0.600 | 0.638 | 0.760 | 0.851 | 0.882 |
multilingual-e5-large-instruct | 560M | 0.672 | 0.709 | 0.825 | 0.911 | 0.931 |
snowflake-arctic-embed-l-v2.0 | 568M | 0.659 | 0.701 | 0.831 | 0.922 | 0.942 |
*Ours (Amharic-optimized models)* | | | | | | |
BERT-Medium-Amharic-embed | 40M | 0.682 | 0.720 | 0.843 | 0.931 | 0.954 |
RoBERTa-Medium-Amharic-embed | 42M | 0.735 | 0.771 | 0.884 | 0.955 | 0.971 |
RoBERTa-Base-Amharic-embed | 110M | **0.775**† | **0.808**† | **0.913**† | **0.964**† | **0.979**† |
For further details on the baselines, see Wang et al., 2024 (Multilingual-E5) and Yu et al., 2024 (Snowflake Arctic Embed 2.0).
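For reference, the reported metrics can be reproduced for a single query with the short sketch below (binary relevance; corpus-level scores average over queries). This is illustrative only, not the evaluation code behind the tables.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant passages retrieved within the top k."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance labels."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, pid in enumerate(ranked_ids[:k], start=1)
              if pid in relevant_ids)
    idcg = sum(1.0 / math.log2(rank + 1)
               for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query: the single relevant passage "p3" is ranked second.
ranked, relevant = ["p7", "p3", "p9"], {"p3"}
print(mrr_at_k(ranked, relevant), recall_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
# Corpus-level MRR@10 / NDCG@10 / Recall@k are the means of these values over all queries.
```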
The table below compares sparse and dense retrieval approaches on the Amharic passage retrieval dataset. While BM25 relies on traditional lexical matching, the RoBERTa and ColBERT models leverage semantic embeddings optimized for Amharic. All models were trained and evaluated on the same data splits.
ColBERT-RoBERTa-Base-Amharic, leveraging late interaction with a RoBERTa backbone, delivers the highest retrieval quality across most metrics. Statistically significant gains are marked with † (p < 0.05).
Type | Model | MRR@10 | NDCG@10 | Recall@10 | Recall@50 | Recall@100 |
---|---|---|---|---|---|---|
Sparse retrieval | BM25-AM | 0.657 | 0.682 | 0.774 | 0.847 | 0.871 |
Dense retrieval | RoBERTa-Base-Amharic-embed | 0.755 | 0.808 | 0.913 | 0.964 | 0.979 |
Dense retrieval | ColBERT-RoBERTa-Base-Amharic | 0.843† | 0.866† | 0.939† | 0.973† | 0.979 |
**Note**
- ColBERT-RoBERTa-Base-Amharic significantly outperforms RoBERTa-Base-Amharic-embed on all ranking metrics except Recall@100, where the two models converge. Significance was assessed using a paired t-test (see the sketch after this list).
- For additional experiments on the 2AIRTC dataset, see the appendix of our ACL 2025 Findings paper.
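As noted above, significance is assessed with a paired t-test over per-query scores. The minimal SciPy sketch below illustrates the test; the score arrays are toy placeholders, not the paper's actual per-query results.

```python
# Hedged sketch of a paired significance test over per-query scores.
from scipy.stats import ttest_rel

per_query_colbert = [0.9, 1.0, 0.5, 0.8, 1.0]  # e.g., per-query MRR@10 of ColBERT-AM (toy values)
per_query_dense   = [0.8, 1.0, 0.3, 0.7, 0.9]  # e.g., per-query MRR@10 of the bi-encoder (toy values)

t_stat, p_value = ttest_rel(per_query_colbert, per_query_dense)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # mark the gain with † if p < 0.05
```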
If you use this repository, please cite our ACL 2025 Findings paper:
```bibtex
@inproceedings{mekonnen-etal-2025-optimized,
title = "Optimized Text Embedding Models and Benchmarks for {A}mharic Passage Retrieval",
author = "Mekonnen, Kidist Amde and
Alemneh, Yosef Worku and
de Rijke, Maarten",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.543/",
pages = "10428--10445",
ISBN = "979-8-89176-256-5",
abstract = "Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6{\%} relative improvement in MRR@10 and a 9.86{\%} gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13$\times$ smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks."
}
```
Please open an issue for questions, feedback, or suggestions.
This project is licensed under the Apache 2.0 License.
This project builds on the ColBERT repository by Stanford FutureData Lab. We sincerely thank the authors for open-sourcing their work, which served as a strong foundation for our Amharic ColBERT implementation and experiments.