tokenize by script boundaries - only #327

Open
mediabuff opened this issue Mar 8, 2024 · 0 comments


I am trying to tokenize multilingual (or rather, multi-script) strings into components, where each component contains only one script (as defined by Unicode). I tried using -segment_alphabet_change, but this also breaks at spaces.
The following string

the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति

should break into 4 tokens:

"the root" "कृ " "in the sense of frequency; e.g." "चर्करीति, चर्कर्ति, बोभवीति बोभोति"
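
To make the desired behaviour concrete, here is a minimal standalone Python sketch of script-only segmentation. It is not part of the tokenizer and does not use its API. The function names `script_of` and `split_by_script` are hypothetical. Python's standard library does not expose the Unicode Script property, so as an assumption the script is approximated by the first word of each letter's Unicode character name. Spaces, punctuation, digits, and combining marks are treated as neutral and stay with the run that precedes them.

```python
import unicodedata


def script_of(ch):
    """Heuristic script label for a character.

    Assumption: the first word of the Unicode name (e.g. "LATIN",
    "DEVANAGARI") stands in for the real Script property. Only letters
    get a label; everything else is neutral (None).
    """
    if unicodedata.category(ch).startswith("L"):
        name = unicodedata.name(ch, "")
        return name.split(" ", 1)[0] if name else None
    return None


def split_by_script(text):
    """Split text into runs, breaking only where the letter script changes."""
    runs = []
    current = []
    current_script = None
    for ch in text:
        script = script_of(ch)
        # Start a new run only when a letter of a different script appears.
        if script is not None and current_script is not None and script != current_script:
            runs.append("".join(current))
            current = []
        if script is not None:
            current_script = script
        # Neutral characters (spaces, punctuation, marks) join the current run.
        current.append(ch)
    if current:
        runs.append("".join(current))
    return runs


sample = "the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति"
for run in split_by_script(sample):
    print(repr(run))
```

Because neutral characters attach to the preceding run, whitespace at a script boundary ends up at the end of the earlier token (for example, the space after "e.g." stays in the third run). Stripping each run, or attaching neutral characters to the following run instead, would be an equally reasonable choice.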
