Normalization and word tokenization have issues. #145

mirfan899 · 2021-08-15T09:46:00Z

I ran the normalization example from the documentation, and the result does not match the documented output.

normalize("پی ایس ایل میں 69 مقامی اور کرس گیل، ڈیرن سیمی، کیون پیٹرسن اور شین واٹسن سمیت29 غیر ملکی کھلاڑی شامل ہیں۔")
'پی ایس ایل میں 69 مقامی اور کرس گیل، ڈیرن سیمی، کیون پیٹرسن اور شین واٹسن سمیت29 غیر ملکی کھلاڑی شامل ہیں۔'

It does not normalize سمیت29: the word and the number stay joined.
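The spacing expected for سمیت29 can be sketched with plain regular expressions. This is only an illustration under stated assumptions: `space_digits` is a hypothetical helper, not part of urduhack, and the character ranges are a rough approximation of Arabic-script letters and of ASCII, Arabic-Indic, and Extended Arabic-Indic digits.

```python
import re

# Assumption: rough ranges for Arabic-script letters (punctuation such as
# U+060C and U+06D4 is deliberately left out) and for the three digit sets.
_LETTER = r"[\u0621-\u064A\u066E-\u06D3\u06D5\u06FA-\u06FF]"
_DIGIT = r"[0-9\u0660-\u0669\u06F0-\u06F9]"


def space_digits(text: str) -> str:
    """Hypothetical helper: insert a space wherever a letter touches a digit."""
    # Letter immediately followed by a digit, e.g. the end of a word before "29".
    text = re.sub(rf"({_LETTER})({_DIGIT})", r"\1 \2", text)
    # Digit immediately followed by a letter.
    text = re.sub(rf"({_DIGIT})({_LETTER})", r"\1 \2", text)
    return text


if __name__ == "__main__":
    print(space_digits("شین واٹسن سمیت29 غیر ملکی کھلاڑی"))
```

Text that is already spaced, such as "69 مقامی", passes through unchanged, so the helper could be applied after `normalize` as a stopgap.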

Similarly, I used the word tokenizer, and the results are poor.

word_tokenizer("پی سی بی چیئرمین کے مطابق نوجوان کھلاڑیوں کو انٹرنیشنل کھلاڑیوں کے ساتھ کھیلنے سے فائدہ ہوگا۔")
['پی', 'سی', 'بی', 'چیئر', 'مین', 'کے', 'مطابق', 'نوجو', 'ان', 'کھلاڑیوں', 'کو', 'انٹرنیشنل', 'کھلاڑیوں', 'کے', 'ساتھ', 'کھیلنے', 'سے', 'فائدہ', 'ہو', 'گا۔']

چیئرمین and نوجوان are broken into multiple words.
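A crude way to spot such splits is to compare the tokenizer output against a plain whitespace split of the same sentence. The sketch below assumes the input is already space-delimited; `missing_words` is a hypothetical helper, not an urduhack function, and it will also flag words that a tokenizer splits legitimately.

```python
from typing import List


def missing_words(sentence: str, tokens: List[str]) -> List[str]:
    """Hypothetical check: whitespace-delimited words absent from the token list."""
    token_set = set(tokens)
    # Any whitespace-delimited word not returned as a single token was
    # split, merged, or altered by the tokenizer.
    return [word for word in sentence.split() if word not in token_set]


if __name__ == "__main__":
    sentence = "پی سی بی چیئرمین کے مطابق نوجوان کھلاڑیوں کو"
    # Tokens copied from the output reported above, truncated to match.
    tokens = ["پی", "سی", "بی", "چیئر", "مین", "کے", "مطابق",
              "نوجو", "ان", "کھلاڑیوں", "کو"]
    print(missing_words(sentence, tokens))
```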

Your Environment

Operating System: Ubuntu 20
Python Version Used: 3.8
Urduhack Version Used: latest (1.1.1 per the package list below)
Environment Information:

absl-py==0.12.0
astunparse==1.6.3
attrs==21.2.0
beautifulsoup4==4.9.3
bs4==0.0.1
cachetools==4.2.2
certifi==2021.5.30
charset-normalizer==2.0.4
clang==5.0
click==7.1.2
dill==0.3.4
flatbuffers==1.12
future==0.18.2
gast==0.4.0
google-auth==1.34.0
google-auth-oauthlib==0.4.5
google-pasta==0.2.0
googleapis-common-protos==1.53.0
grpcio==1.39.0
h5py==3.1.0
idna==3.2
keras==2.6.0
Keras-Preprocessing==1.1.2
Markdown==3.3.4
numpy==1.19.5
oauthlib==3.1.1
opt-einsum==3.3.0
promise==2.3
protobuf==3.17.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
regex==2021.8.3
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
six==1.15.0
soupsieve==2.2.1
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.6.0
tensorflow-addons==0.13.0
tensorflow-datasets==3.2.1
tensorflow-estimator==2.6.0
tensorflow-gpu==2.6.0
tensorflow-metadata==1.2.0
termcolor==1.1.0
tf2crf==0.1.32
tqdm==4.62.1
typeguard==2.12.1
typing-extensions==3.7.4.3
urduhack==1.1.1
urllib3==1.26.6
Werkzeug==2.0.1
wrapt==1.12.1


akkefa · 2021-08-18T16:46:59Z

@mirfan899 I will look into this issue.

mirfan899 changed the title ~~Normalization and word tokenzation has issues.~~ Normalization and word tokenization have issues. Aug 17, 2021

akkefa self-assigned this Aug 18, 2021

akkefa added the bug label Aug 18, 2021
