Apostrophes in English #74
Comments
Seems to be the behavior of default moses too =(
But that's because the
And also when using
In short, you should try to normalize the input, then detokenize it before tokenizing it again, and finally detokenize. That being said, it seems the normalizer itself mangles the apostrophe:

>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday ’s reception"
>>> mpn.normalize(text)
'yesterday "s reception'
>>> mt.tokenize(mpn.normalize(text))
['yesterday', '"', 's', 'reception']
>>> md.detokenize(mt.tokenize(mpn.normalize(text)))
'yesterday "s reception'

Which also happens in Moses' perl script:
The normalization bug in sacremoses happens here:
After the #78 fix, your cleaning workflow for your input would be something like:
And if necessary:
>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday ’s reception"
>>> md.detokenize(mt.tokenize(md.detokenize(mpn.normalize(text).split())))
"yesterday's reception"
So, to get the detokenized version of my text, and not the detokenized version of the normalized text, I would need to perform double detokenization, find out which spaces got removed by the detokenizer, and remove those spaces from my original text. Otherwise, I would end up with something different from the original text.
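The space-diffing step described above can be sketched in plain Python. This is a hypothetical helper (not part of sacremoses, and `removed_space_offsets` is a made-up name): it compares the original space-separated text with the detokenizer's output and reports which spaces disappeared, assuming the detokenizer only deletes spaces and changes nothing else.

```python
# Hypothetical helper (not part of sacremoses): given the original
# space-separated text and the detokenizer's output, report the
# character offsets (in the original) of the spaces that were removed.
# Assumes the detokenizer only deletes spaces and changes nothing else.
def removed_space_offsets(original, detokenized):
    tokens = original.split(" ")
    offsets = []
    pos_orig = 0   # cursor in the original text
    pos_det = 0    # cursor in the detokenized text
    for i, tok in enumerate(tokens):
        pos_det = detokenized.index(tok, pos_det) + len(tok)
        pos_orig += len(tok)
        if i < len(tokens) - 1:
            if pos_det < len(detokenized) and detokenized[pos_det] == " ":
                pos_det += 1              # the space survived
            else:
                offsets.append(pos_orig)  # the space was deleted here
            pos_orig += 1
    return offsets

print(removed_space_offsets("yesterday 's reception", "yesterday's reception"))  # [9]
```

Offset 9 is the space before "'s"; applying the same deletions to the untouched original text would then preserve the non-standard apostrophe.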
Yes, the example I gave is one of the typical pipelines that people use to clean data for machine translation. What's the expected output in your example? Do you want to detokenize, tokenize, or normalize?
I want to detokenize, without any changes to the text.
Ah, do you mean something like:

>>> from sacremoses import MosesDetokenizer
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday 's reception"
>>> md.detokenize(text.split())
"yesterday's reception"

But with the non-standard apostrophe:
Actually, this part about adding a new apostrophe to the detokenization process isn't simple, see https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L678. Because:
I'd suggest bearing with the normalization of the apostrophe instead:

from sacremoses import MosesPunctNormalizer
from sacremoses import MosesTokenizer, MosesDetokenizer
mpn = MosesPunctNormalizer()
md = MosesDetokenizer(lang='en')
text = "yesterday ’s reception"
md.detokenize(mpn.normalize(text).split())

[out]:
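Why the ASCII apostrophe detokenizes while the curly one doesn't can be illustrated with a toy version of the English-specific rule. This is a simplified sketch, not the actual regex in sacremoses/tokenize.py: the point is only that such rules match the ASCII apostrophe literally, so U+2019 slips through.

```python
import re

# Simplified sketch of an English contraction rule (NOT the actual
# regex in sacremoses/tokenize.py): it matches the ASCII apostrophe
# literally, so the curly apostrophe (U+2019) is never re-attached.
CONTRACTION = re.compile(r" '(s|m|d|ll|re|ve|t)\b", re.IGNORECASE)

def toy_detok(text):
    # re-attach " 's", " 'll", etc. to the preceding token
    return CONTRACTION.sub(r"'\1", text)

print(toy_detok("yesterday 's reception"))   # yesterday's reception
print(toy_detok("yesterday ’s reception"))   # unchanged: U+2019 is not matched
```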
Sorry, this is not an option. I think I'll either try to embed the detokenizer, so that it returns an abstract representation of removed spaces that I can apply to the original text, or maybe there's some software out there that can do detokenization without touching the text. Are there many cases in which this function destroys the text (apart from apostrophes and probably also quotation marks)?
Maybe try https://sjmielke.com/papers/tokenize/ or spacy for your use-case. I can take a look at this again without changing detokenization behavior, but no promises, because supporting non-normalized text opens a can of worms: we'd have to support all other non-normalized forms =(
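For reference, the reversible-tokenization idea behind such tools can be sketched without any dependency. This is an illustrative sketch, not any particular library's API: keep each token paired with the whitespace that follows it, so joining is lossless and the curly apostrophe survives untouched.

```python
import re

def tokenize_with_ws(text):
    # pair each whitespace-delimited token with its trailing whitespace
    # so the original string can be rebuilt exactly
    return [(m.group(1), m.group(2)) for m in re.finditer(r"(\S+)(\s*)", text)]

def join_with_ws(pairs):
    return "".join(tok + ws for tok, ws in pairs)

text = "yesterday ’s reception"
pairs = tokenize_with_ws(text)
assert join_with_ws(pairs) == text  # round-trip is lossless, apostrophe intact
```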
I had seen that, but it's overly simplistic and language-agnostic; it can't possibly get the job done in all languages. An enhancement that would let me use the sacremoses detokenizer would be to have it return the length and offset of each part of the string that would be removed, instead of a string with those parts already removed.
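The offsets-based enhancement proposed here could be consumed like this. This is a hypothetical sketch of the proposal, not an existing sacremoses feature, and `apply_removals` is a made-up name: given the character offsets the detokenizer would delete, strip exactly those characters from the untouched original.

```python
def apply_removals(text, offsets):
    # delete only the characters at the given offsets (the spaces the
    # detokenizer would have removed); everything else, including a
    # non-standard apostrophe, is left intact
    drop = set(offsets)
    return "".join(ch for i, ch in enumerate(text) if i not in drop)

print(apply_removals("yesterday ’s reception", [9]))  # yesterday’s reception
```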
I just reported the same issue to the mosestokenizer package: luismsgomes/mosestokenizer#1
The problem is that detokenization fails to handle apostrophes correctly:
prints
yesterday ’s reception