Description
Current treatment of multi-sentence prompts: Prompts/generations are split into single sentences for translation and then recombined, here with the sentence_splitter library.
Problem: ⚠ The sentence splitter does not deal well with multiple consecutive separators, as e.g. in enumerations: The deference between them\n\na. deliverable is a significant event in the project (example from aya human annotated) gets split into ['The deference between them', '', 'a. deliverable is a significant event in the project'].
Effect:
- This results in us sending an empty string for translation, which is not filtered out anywhere.
- This empty string gets translated in every language into some random term or word. E.g. for Portuguese it becomes "Não": A deferência entre eles - Não. a. O produto a entregar é um evento significativo no projeto
- This has repercussions on data quality, as any model trained on data like this will pick up these systematic patterns.
Solution: Ideally, sentence splitting would be whitespace preserving, as suggested in this PR to the sentence splitting library. If that is not feasible, we should at least filter out empty strings before sending them to the translation model.
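As a rough illustration of the fallback option, here is a minimal sketch that skips empty (or whitespace-only) segments instead of sending them to the translation model. The names `translate_segments` and `fake_translate` are hypothetical and not part of this repository or of sentence_splitter; `fake_translate` only stands in for the real translation call. The sketch passes empty segments through unchanged rather than dropping them, on the assumption that the recombination step relies on segment positions.

```python
from typing import Callable, List


def translate_segments(
    segments: List[str], translate: Callable[[str], str]
) -> List[str]:
    """Translate only segments that contain non-whitespace text.

    Empty or whitespace-only segments are returned unchanged, so they
    never reach the translation model but keep their position in the list.
    """
    return [translate(seg) if seg.strip() else seg for seg in segments]


# Hypothetical stand-in for the real translation call; it records what it
# receives so we can check that no empty string is ever sent.
sent_to_model: List[str] = []


def fake_translate(text: str) -> str:
    sent_to_model.append(text)
    return f"<{text}>"


# Segments as reported in the issue for the enumeration example.
segments = [
    "The deference between them",
    "",
    "a. deliverable is a significant event in the project",
]
translated = translate_segments(segments, fake_translate)
```

With this in place, only the two non-empty segments are passed to the translator, and the empty entry survives as an empty string that can be recombined as a blank line.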