Sentence splitting leads to translation artifacts #12

**Current treatment of multi-sentence prompts:** Prompts/generations are split into single sentences for translation and then recombined, [here](https://github.com/for-ai/instruct-multilingual/blob/503b9d7a16e434364f95bce814c3d43a5f8282d2/instructmultilingual/translate_datasets.py#L134), using the [`sentence_splitter`](https://github.com/mediacloud/sentence-splitter) library.

**Problem:** ⚠ The sentence splitter does not deal well with multiple consecutive separators, which occur e.g. in enumerations: `The deference between them\n\na. deliverable is a significant event in the project` (example from aya human annotated) gets split into `['The deference between them', '', 'a. deliverable is a significant event in the project']`

**Effect:**
- As a result, we send an empty string for translation, and it is not filtered out anywhere.
- In every language, this empty string gets translated into some arbitrary term or word. E.g. for Portuguese it is `- Não`: `A deferência entre eles - Não. a. O produto a entregar é um evento significativo no projeto`
- This hurts data quality, since any model trained on data like this will pick up these systematic patterns.

**Solution:** Ideally, sentence splitting would be whitespace preserving, as suggested in [this PR to the sentence splitting library](https://github.com/mediacloud/sentence-splitter/pull/8). If that is not feasible, we should at least filter out empty strings before sending them to the translation model.
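
The fallback (filtering out empty strings) could look roughly like the sketch below. It is not the repository's code: `translate_nonempty` and the `translate_batch` callable are hypothetical names standing in for whatever wraps the translation model. Blank segments are skipped for translation and put back unchanged at their original positions, so the recombination step still sees the same number of segments.

```python
from typing import Callable, List


def translate_nonempty(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
) -> List[str]:
    """Translate only non-blank segments; pass blank ones through untouched.

    `translate_batch` is a hypothetical stand-in for the model call: it takes
    a list of strings and returns one translation per input, in order.
    """
    # Positions of segments that contain something other than whitespace.
    keep = [i for i, s in enumerate(sentences) if s.strip()]

    # Never call the model with an empty batch or with empty strings.
    translated = translate_batch([sentences[i] for i in keep]) if keep else []
    if len(translated) != len(keep):
        raise ValueError("translate_batch must return one output per input")

    # Start from the original list so blank segments keep their positions.
    result = list(sentences)
    for i, text in zip(keep, translated):
        result[i] = text
    return result
```

With the split from the example above, only the two non-empty segments would reach the model, and the `''` in the middle would be returned as-is. Whether the blank segment should be kept or dropped entirely depends on how the segments are joined afterwards.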
