kasunw22/sinhala-word-embedding-alignment

Sinhala-English word-embedding Alignment

This repository contains the resources related to our research on English-Sinhala word embedding alignment.

  • alignment_matrices/ contains the alignment matrices obtained with each alignment technique, in both directions (Si --> En and En --> Si).
  • all_data/ contains all the datasets we used for supervised alignment. They were created from the large datasets provided in this repository.
  • muse_content/ contains the scripts used for iterative Procrustes alignment, adapted from this repository by facebookresearch.
  • rcsls_content/ contains the scripts used for RCSLS alignment, adapted from the FastText repository by facebookresearch.
  • vecmap_content/ contains the scripts used for VecMap alignment, adapted from the VecMap repository.
  • contrastive_bli_content/ contains the scripts used for ContrastiveBLI alignment, adapted from the ContrastiveBLI repository.
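
As a rough illustration of what an alignment matrix does, the sketch below solves a single orthogonal Procrustes step with NumPy and then uses the resulting matrix for nearest-neighbour word retrieval. It is not taken from the scripts in this repository: the function names `procrustes` and `translate` are hypothetical, the embeddings are assumed to be already loaded as in-memory arrays, and the on-disk format of the files in alignment_matrices/ is not assumed here.

```python
import numpy as np


def procrustes(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W that minimises ||src @ W.T - tgt||_F.

    src, tgt: (n_pairs, dim) arrays; row i of src is the embedding of a
    source word and row i of tgt is the embedding of its dictionary translation.
    """
    # Closed-form solution: W = U V^T, where U S V^T is the SVD of tgt^T src.
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return u @ vt


def translate(query: np.ndarray, w: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Map source vectors with W and return, for each, the index of the
    nearest row of tgt_emb under cosine similarity."""
    mapped = query @ w.T
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt_unit = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return np.argmax(mapped @ tgt_unit.T, axis=1)


if __name__ == "__main__":
    # Synthetic check: the "target" space is an exact rotation of the source space.
    rng = np.random.default_rng(0)
    rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
    si = rng.normal(size=(50, 8))   # stand-in for Sinhala vectors
    en = si @ rotation.T            # stand-in for English vectors
    w_si_en = procrustes(si, en)
    print(np.allclose(w_si_en, rotation))
```

For an orthogonal matrix, the reverse direction (En --> Si) is simply the transpose of W. Methods such as RCSLS do not constrain the matrix to be orthogonal, which is one reason matrices may be stored separately for each direction.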

Results from the Papers

Alignment results obtained for Sinhala-English alignment according to the paper:

[Results table image not reproduced]

Comparison of Sinhala-English alignment with other language pairs according to the paper:

[Results table image not reproduced]

Next Work

Alignment results obtained for Sinhala-English alignment from further studies (publication is under review):

[Results table image not reproduced]

Studies on BLI (under review):

Impact of Language Inflection on BLI

[Results table images not reproduced]

Impact of multilinguality on BLI

[Results table images not reproduced]

Publications

If you use this work, please cite the following papers.

@INPROCEEDINGS{10253560,
  author={Wickramasinghe, Kasun and De Silva, Nisansa},
  booktitle={2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS)},
  title={Sinhala-English Parallel Word Dictionary Dataset},
  year={2023},
  pages={61-66},
  keywords={Dictionaries;Annotations;Pipelines;Machine translation;Task analysis;Information systems;parallel corpus;alignment;English-Sinhala dictionary;word embedding alignment;lexicon induction},
  doi={10.1109/ICIIS58898.2023.10253560}
}
@inproceedings{wickramasinghe-de-silva-2023-sinhala,
    title = "{S}inhala-{E}nglish Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language",
    author = "Wickramasinghe, Kasun  and
      de Silva, Nisansa",
    editor = "Huang, Chu-Ren  and
      Harada, Yasunari  and
      Kim, Jong-Bok  and
      Chen, Si  and
      Hsu, Yu-Yin  and
      Chersoni, Emmanuele  and
      A, Pranav  and
      Zeng, Winnie Huiheng  and
      Peng, Bo  and
      Li, Yuxi  and
      Li, Junlin",
    booktitle = "Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation",
    month = dec,
    year = "2023",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.paclic-1.42",
    pages = "424--435",
}
A further publication is under review.