Normalization and word tokenization have issues. #145

Open
mirfan899 opened this issue Aug 15, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@mirfan899

I ran the normalization example from the documentation, and the output does not match what the documentation shows.

normalize("پی ایس ایل میں 69 مقامی اور کرس گیل، ڈیرن سیمی، کیون پیٹرسن اور شین واٹسن سمیت29 غیر ملکی کھلاڑی شامل ہیں۔")
'پی ایس ایل میں 69 مقامی اور کرس گیل، ڈیرن سیمی، کیون پیٹرسن اور شین واٹسن سمیت29 غیر ملکی کھلاڑی شامل ہیں۔'

The input is returned unchanged: سمیت29 is not normalized, and the word and the number are still joined.
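As a stopgap for the joined word and number, something like the following could be applied before or after normalize. This is only a minimal sketch and not urduhack's implementation: `insert_digit_spaces` is a hypothetical helper, and the character ranges are an assumption meant to cover common Urdu letters and ASCII or Extended Arabic-Indic digits.

```python
import re

# Assumed character classes (not taken from urduhack):
# letters in the Arabic block, excluding punctuation such as "،" and "۔".
_LETTERS = "\u0621-\u064A\u0671-\u06D3\u06D5"
# ASCII digits plus Extended Arabic-Indic digits (۰-۹).
_DIGITS = "0-9\u06F0-\u06F9"

# Zero-width match at every boundary where a letter touches a digit
# or a digit touches a letter.
_BOUNDARY = re.compile(
    f"(?<=[{_LETTERS}])(?=[{_DIGITS}])|(?<=[{_DIGITS}])(?=[{_LETTERS}])"
)


def insert_digit_spaces(text: str) -> str:
    """Insert a single space wherever an Urdu letter is directly adjacent to a digit."""
    return _BOUNDARY.sub(" ", text)


if __name__ == "__main__":
    print(insert_digit_spaces("شین واٹسن سمیت29 غیر ملکی کھلاڑی"))
```

Numbers that are already separated by spaces, such as the 69 in the example sentence, are left untouched, so running the helper twice gives the same result as running it once.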

Similarly, I tried the word tokenizer, and its output is also incorrect.

word_tokenizer("پی سی بی چیئرمین کے مطابق نوجوان کھلاڑیوں کو انٹرنیشنل کھلاڑیوں کے ساتھ کھیلنے سے فائدہ ہوگا۔")
['پی', 'سی', 'بی', 'چیئر', 'مین', 'کے', 'مطابق', 'نوجو', 'ان', 'کھلاڑیوں', 'کو', 'انٹرنیشنل', 'کھلاڑیوں', 'کے', 'ساتھ', 'کھیلنے', 'سے', 'فائدہ', 'ہو', 'گا۔']

چیئرمین and نوجوان are each split into two tokens instead of being kept as single words.
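To make this easier to reproduce as a regression check, a small helper along these lines could report which expected words were split and into which pieces. It is a sketch under assumptions: `find_split_words` is a hypothetical name, it does not call urduhack, and it only works on a token list such as the one pasted above.

```python
from typing import Dict, List


def find_split_words(tokens: List[str], expected_words: List[str]) -> Dict[str, List[str]]:
    """Map each expected word that is missing from `tokens` to the run of
    consecutive tokens that joins back into it (if such a run exists)."""
    splits: Dict[str, List[str]] = {}
    for word in expected_words:
        if word in tokens:
            continue  # the word survived tokenization intact
        for start in range(len(tokens)):
            joined = ""
            for end in range(start, len(tokens)):
                joined += tokens[end]
                if joined == word:
                    splits[word] = tokens[start:end + 1]
                    break
                if not word.startswith(joined):
                    break  # this run can no longer rebuild the word
            if word in splits:
                break
    return splits


# Token list copied from the output reported in this issue.
tokens = ['پی', 'سی', 'بی', 'چیئر', 'مین', 'کے', 'مطابق', 'نوجو', 'ان',
          'کھلاڑیوں', 'کو', 'انٹرنیشنل', 'کھلاڑیوں', 'کے', 'ساتھ',
          'کھیلنے', 'سے', 'فائدہ', 'ہو', 'گا۔']

if __name__ == "__main__":
    print(find_split_words(tokens, ['چیئرمین', 'نوجوان', 'مطابق']))
```

Words that are present as a single token are skipped, so only the broken ones show up in the result.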

Your Environment

  • Operating System: Ubuntu 20
  • Python Version Used: 3.8
  • Urduhack Version Used: latest (1.1.1, per the package list below)
  • Environment Information:
absl-py==0.12.0
astunparse==1.6.3
attrs==21.2.0
beautifulsoup4==4.9.3
bs4==0.0.1
cachetools==4.2.2
certifi==2021.5.30
charset-normalizer==2.0.4
clang==5.0
click==7.1.2
dill==0.3.4
flatbuffers==1.12
future==0.18.2
gast==0.4.0
google-auth==1.34.0
google-auth-oauthlib==0.4.5
google-pasta==0.2.0
googleapis-common-protos==1.53.0
grpcio==1.39.0
h5py==3.1.0
idna==3.2
keras==2.6.0
Keras-Preprocessing==1.1.2
Markdown==3.3.4
numpy==1.19.5
oauthlib==3.1.1
opt-einsum==3.3.0
promise==2.3
protobuf==3.17.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
regex==2021.8.3
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
six==1.15.0
soupsieve==2.2.1
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.6.0
tensorflow-addons==0.13.0
tensorflow-datasets==3.2.1
tensorflow-estimator==2.6.0
tensorflow-gpu==2.6.0
tensorflow-metadata==1.2.0
termcolor==1.1.0
tf2crf==0.1.32
tqdm==4.62.1
typeguard==2.12.1
typing-extensions==3.7.4.3
urduhack==1.1.1
urllib3==1.26.6
Werkzeug==2.0.1
wrapt==1.12.1
@mirfan899 changed the title from "Normalization and word tokenzation has issues." to "Normalization and word tokenization have issues." on Aug 17, 2021
@akkefa self-assigned this on Aug 18, 2021
@akkefa added the bug label on Aug 18, 2021
@akkefa
Member

akkefa commented Aug 18, 2021

@mirfan899 I will look into this issue.
