tokenize by script boundaries - only #327

mediabuff · 2024-03-08T09:59:30Z

I am trying to tokenize multilingual (or rather, multi-script) strings into components where each component contains only one script (as defined by Unicode). I tried using -segment_alphabet_change, but this also breaks at spaces.
For example, the following string:

the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति

should break into 4 tokens:

"the root" "कृ " "in the sense of frequency; e.g." "चर्करीति, चर्कर्ति, बोभवीति बोभोति"
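
To make the desired behavior concrete, here is a minimal Python sketch of what I mean. It is not this project's implementation. It is only an illustration under some assumptions: the script is approximated from the first word of the Unicode character name (e.g. `LATIN`, `DEVANAGARI`), because the standard library does not expose the Unicode Script property directly, and spaces, punctuation, and digits are treated as script-neutral and attached to the preceding run. The helper names `script_of` and `split_by_script` are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Approximate the script of a character, or None if script-neutral.

    Heuristic (an assumption, not the real Unicode Script property):
    use the first word of the character name, e.g. "LATIN" or "DEVANAGARI".
    """
    # Only letters (L*) and marks (M*) are treated as carrying a script.
    if unicodedata.category(ch)[0] not in ("L", "M"):
        return None
    name = unicodedata.name(ch, "")
    if not name or name.startswith("COMBINING"):
        # Generic combining marks inherit the script of their base character.
        return None
    return name.split(" ")[0]


def split_by_script(text):
    """Split text into runs, breaking only where the script changes."""
    runs = []
    current = []
    current_script = None
    for ch in text:
        script = script_of(ch)
        # Break only when a scripted character differs from the run's script.
        if script is not None and current_script is not None and script != current_script:
            runs.append("".join(current))
            current = []
        if script is not None:
            current_script = script
        # Neutral characters (spaces, punctuation) stay in the current run.
        current.append(ch)
    if current:
        runs.append("".join(current))
    return runs


sample = "the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति"
for run in split_by_script(sample):
    print(repr(run))
```

On the string above this yields four runs. Because neutral characters stick to the preceding run, the third run keeps its trailing space. A more accurate version would use the real Script property, for example via a library that exposes it, instead of the character-name heuristic.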
