Sentence splitting leads to translation artifacts #12

@julia-cohere

Description

Current treatment of multi-sentence prompts: Prompts/generations are split into single sentences for translation and then recombined, here with the sentence_splitter library.

Problem: ⚠ The sentence splitter does not handle multiple consecutive separators well, e.g. in enumerations. The text "The deference between them\n\na. deliverable is a significant event in the project" (example from aya human annotated) gets split into ['The deference between them', '', 'a. deliverable is a significant event in the project'], with an empty string in the middle.

Effect:

  • This results in us sending an empty string for translation, which is not filtered out anywhere.
  • This empty string gets translated into some arbitrary term or word in every language. For Portuguese, for example, it becomes "- Não", so the recombined output reads: A deferência entre eles - Não. a. O produto a entregar é um evento significativo no projeto
  • This has repercussions on data quality, as any model trained on data like this will pick up these systematic patterns.

Solution: Ideally, sentence splitting would be whitespace preserving, as suggested in this PR to the sentence splitting library. If that is not feasible, we should at least filter out empty strings before sending them to the translation model.
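The fallback could look roughly like the sketch below. It is a minimal illustration, not the pipeline's actual code: `translate_segments` and `fake_translate` are hypothetical names, and the translation model is assumed to be a callable that takes a list of strings and returns one output per input. Empty segments are never sent to the model but keep their position, so recombining still reproduces the blank line from the source.

```python
from typing import Callable, List


def translate_segments(
    segments: List[str],
    translate: Callable[[List[str]], List[str]],
) -> List[str]:
    """Translate non-empty segments; pass blank ones through untouched."""
    # Remember the positions of segments that actually carry text.
    keep = [i for i, seg in enumerate(segments) if seg.strip()]
    translated = translate([segments[i] for i in keep]) if keep else []
    if len(translated) != len(keep):
        raise ValueError("translate() must return one output per input")
    # Start from the original list so blank segments stay where they were.
    result = list(segments)
    for i, out in zip(keep, translated):
        result[i] = out
    return result


# Hypothetical stand-in for the real translation call: upper-cases its input
# and fails loudly if an empty string ever reaches it.
def fake_translate(batch: List[str]) -> List[str]:
    assert all(s.strip() for s in batch), "empty string reached the model"
    return [s.upper() for s in batch]


segments = [
    "The deference between them",
    "",
    "a. deliverable is a significant event in the project",
]
out = translate_segments(segments, fake_translate)
print("\n".join(out))
```

Keeping the blank entries in place, rather than dropping them from the list, is a design choice: it avoids the spurious translation while leaving the recombination step unchanged.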
