Segmenter removes space of English words in code-mixed sentence #43

@shivanraptor

Describe the bug
The segmenter removes the space between English words in a code-mixed sentence. For example, take this sentence:

這是Career Centre

To reproduce
Here is the code:

import pycantonese
from pycantonese.word_segmentation import Segmenter

# Default segmenter with no custom allow/disallow lists.
segmenter = Segmenter()
pyseg = pycantonese.segment("這是Career Centre", cls=segmenter)
# Print one segmented word per line.
for word in pyseg:
    print(word)

The output is:

這是
CareerCentre

Expected behavior
The expected output is:

這是
Career Centre

or

這是
Career
Centre
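
A possible workaround until this is fixed, offered as a rough sketch rather than a proposed patch: pull the Latin-script words out with a regular expression first and pass only the remaining chunks to the Cantonese segmenter. The helper name `segment_mixed`, its `segment_cjk` parameter, and the regular expression are all hypothetical, not part of PyCantonese, and the pattern only covers plain ASCII words.

```python
import re
from typing import Callable, List

# Hypothetical pattern: ASCII letters/digits, allowing internal apostrophes or hyphens.
_LATIN_WORD = re.compile(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*")


def segment_mixed(text: str, segment_cjk: Callable[[str], List[str]]) -> List[str]:
    """Segment code-mixed text, keeping each Latin-script word as its own token.

    `segment_cjk` is any callable that maps a string to a list of words;
    it only ever receives the chunks that sit between Latin-script words.
    """
    words: List[str] = []
    pos = 0
    for match in _LATIN_WORD.finditer(text):
        # Everything before this Latin word goes to the Cantonese segmenter.
        chunk = text[pos:match.start()].strip()
        if chunk:
            words.extend(segment_cjk(chunk))
        words.append(match.group())
        pos = match.end()
    # Whatever follows the last Latin word is segmented as well.
    tail = text[pos:].strip()
    if tail:
        words.extend(segment_cjk(tail))
    return words
```

With PyCantonese installed, `segment_mixed("這是Career Centre", pycantonese.segment)` would hand only `這是` to the segmenter and keep `Career` and `Centre` as separate words, which matches the second expected output above. Punctuation between words is still passed through to the segmenter as-is.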

System:

  • Operating System: macOS Sonoma 14.0 (23A344)
  • PyCantonese version: 3.4.0
