
Apostrophes in English #74

Open
j0hannes opened this issue Oct 30, 2019 · 12 comments
Labels
bug (Something isn't working), wontfix (This will not be worked on)

Comments

j0hannes commented Oct 30, 2019

I just reported the same issue to the mosestokenizer package: luismsgomes/mosestokenizer#1

The problem is that detokenization fails to handle apostrophes correctly:

import sacremoses                                                                                                                                     
tokens = 'yesterday ’s reception'.split(' ')                                                                                                          
print(sacremoses.MosesDetokenizer('en').detokenize(tokens))  

prints yesterday ’s reception; the space before ’s is not removed.
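For comparison, here's a quick check of my own with the same API: with the ASCII apostrophe the detokenizer de-spaces as expected, so the bug seems specific to ’ (U+2019).

import sacremoses

# Same call as above, but with the ASCII apostrophe.
tokens = "yesterday 's reception".split(' ')
print(sacremoses.MosesDetokenizer('en').detokenize(tokens))
# prints: yesterday's reception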

alvations added the bug label on Nov 22, 2019
alvations (Contributor) commented Nov 22, 2019

Seems to be the behavior of the default Moses scripts too =(

$ echo "yesterday ’s reception" | perl tokenizer.perl -l en | perl detokenizer.perl 
Detokenizer Version $Revision: 4134 $
Language: en
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday ’ s reception


$ echo "yesterday 's reception" | perl tokenizer.perl -l en | perl detokenizer.perl 
Detokenizer Version $Revision: 4134 $
Language: en
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday 's reception

But that's because the 's is usually escaped to &apos;s during tokenization, and detokenization only recognizes the &apos;s form for the de-spacing.

$ echo "yesterday's reception" | perl tokenizer.perl -l en 
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday 's reception
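The same escaping can be seen from the Python API (a hedged illustration of my own; escape=True and unescape=True are sacremoses' defaults):

from sacremoses import MosesTokenizer, MosesDetokenizer

mt = MosesTokenizer(lang='en')
md = MosesDetokenizer(lang='en')

# The ASCII apostrophe is escaped to &apos; during tokenization,
# which is the form the detokenizer recognizes for de-spacing.
tokens = mt.tokenize("yesterday's reception")
print(tokens)                 # ['yesterday', '&apos;s', 'reception']
print(md.detokenize(tokens))  # yesterday's reception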

Also, when using ’s instead of 's, the apostrophe didn't get escaped to &apos;s, so the detokenization didn't work.

$ echo "yesterday ’s reception" | perl tokenizer.perl -l en 
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday ’ s reception

In short, you should normalize the input and then detokenize it, before tokenizing it again and finally detokenizing.

That being said, it seems like the ’ is not mapping to the right apostrophe in Sacremoses =(

>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer

>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')

>>> text = "yesterday ’s reception"

>>> mpn.normalize(text)
'yesterday "s reception'

>>> mt.tokenize(mpn.normalize(text))
['yesterday', '"', 's', 'reception']

>>> md.detokenize(mt.tokenize(mpn.normalize(text)))
'yesterday "s reception'

Which also happens in Moses' perl script:

$ echo "yesterday ’s reception" | perl normalize-punctuation.perl 
yesterday "s reception

alvations (Contributor) commented
Thanks @j0hannes for catching this. #78 should fix it, but it should be rechecked against the Moses decoder repo too.

alvations (Contributor) commented Nov 22, 2019

After the #78 fix, your cleaning workflow for your input would be something like:

  1. First normalize your input
  2. Then detokenize it (that's assuming you know the original input is tokenized)

And if necessary:

  3. Then tokenize it
  4. Finally detokenize it again

>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer

>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')

>>> text = "yesterday ’s reception"
>>> md.detokenize(mt.tokenize(md.detokenize(mpn.normalize(text).split())))
"yesterday's reception"

j0hannes (Author) commented

So, to get the detokenized version of my text, and not the detokenized version of the normalized text, I would need to perform double detokenization, find out which spaces got removed by the detokenizer, and remove those spaces from my original text. Otherwise, I would end up with something different from the original text.
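To illustrate the concern with a minimal example of my own:

from sacremoses import MosesPunctNormalizer, MosesDetokenizer

mpn = MosesPunctNormalizer()
md = MosesDetokenizer(lang='en')

# The recommended pipeline returns a detokenization of the *normalized*
# text, not of the original text, so the original apostrophe is lost.
text = "yesterday ’s reception"
cleaned = md.detokenize(mpn.normalize(text).split())
print(cleaned)              # yesterday's reception (after the #78 fix)
print("\u2019" in cleaned)  # False: U+2019 was rewritten away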

alvations (Contributor) commented
Yes, the example I gave is one of the typical pipelines that people use to clean data for machine translation.

What's the expected output in your example? Do you want to detokenize, tokenize, or normalize?

j0hannes (Author) commented

I want to detokenize, without any changes to the text.

alvations (Contributor) commented

Ah, do you mean something like:

>>> from sacremoses import MosesDetokenizer

>>> md = MosesDetokenizer(lang='en')

>>> text = "yesterday 's reception"
>>> md.detokenize(text.split())
"yesterday's reception"

But with the non-standard apostrophe:

>>> text = "yesterday ’s reception"
>>> md.detokenize(text.split())
'yesterday ’s reception'

alvations (Contributor) commented Nov 25, 2019

Actually, adding a new apostrophe character to the detokenization process isn't simple; see https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L678

Because:

  • There's some smart quote counting happening
  • And the de-spacing of apostrophes might be language-dependent (see the simplified sketch below)
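For illustration only (a simplified sketch of the idea, not the actual sacremoses source), the English contraction de-spacing keys on the ASCII apostrophe, so a ’s token never matches:

import re

# Simplified sketch: de-spacing only fires for tokens that begin with
# the ASCII apostrophe, so "’s" (U+2019) is never merged leftwards.
EN_CONTRACTION = re.compile(r"^'[a-zA-Z]")  # 's, 'll, 've, ...

def merges_left(token):
    return bool(EN_CONTRACTION.match(token))

print(merges_left("'s"))       # True
print(merges_left("\u2019s"))  # False, so "yesterday ’s" keeps its space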

I'd suggest bearing with the normalization of the apostrophe instead:

from sacremoses import MosesPunctNormalizer
from sacremoses import MosesDetokenizer

mpn = MosesPunctNormalizer()
md = MosesDetokenizer(lang='en')

# Normalize first, then detokenize the normalized tokens.
text = "yesterday ’s reception"
print(md.detokenize(mpn.normalize(text).split()))

[out]:

yesterday's reception

j0hannes (Author) commented

Sorry, this is not an option. I think I'll either try to embed the detokenizer so that it returns an abstract representation of the removed spaces, which I can then apply to the original text, or maybe there's some software out there that can do detokenization without touching the text. Are there many cases in which this function destroys the text (apart from apostrophes, and probably also quotation marks)?

alvations (Contributor) commented

Maybe try https://sjmielke.com/papers/tokenize/ or spacy for your use-case.

I can take a look at this again without changing the detokenization behavior, but no promises: supporting non-normalized text opens a can of worms of having to support all other non-normalized forms =(
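For what it's worth, here's a hedged sketch of why spaCy may fit the lossless use-case: it keeps each token's character offset and trailing whitespace, so the original text is always recoverable exactly.

import spacy

nlp = spacy.blank('en')  # tokenizer-only pipeline
doc = nlp("yesterday ’s reception")
# Every token remembers its offset and trailing whitespace,
# so joining the tokens reproduces the input exactly.
assert ''.join(t.text_with_ws for t in doc) == doc.text
offsets = [(t.idx, len(t.text)) for t in doc]  # spans into the original
print(offsets)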

j0hannes (Author) commented

I had seen that, but it's overly simplistic and language-agnostic; it can't possibly get the job done in all languages. An enhancement that would allow me to use the sacremoses detokenizer would be to have it return the length and offset of each part of the string that would be removed, instead of a string with those parts already removed.
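A rough sketch of that interface (a hypothetical helper of mine, not a sacremoses API; it also assumes the detokenizer only deletes characters rather than rewriting them, which is exactly what the apostrophe handling violates today):

import difflib
from sacremoses import MosesDetokenizer

def removed_spans(tokens, lang='en'):
    """Hypothetical helper: (offset, length) spans that detokenization
    deletes from the space-joined token string."""
    joined = ' '.join(tokens)
    detok = MosesDetokenizer(lang=lang).detokenize(tokens)
    spans = []
    sm = difflib.SequenceMatcher(a=joined, b=detok, autojunk=False)
    for op, i1, i2, _, _ in sm.get_opcodes():
        if op == 'delete':
            spans.append((i1, i2 - i1))
    return spans

print(removed_spans("yesterday 's reception".split(' ')))  # [(9, 1)]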
