Apostrophes in English #74
Comments
Seems to be the behavior of default moses too =(
But that's because the
And also when using
In short, you should try to normalize the input, then detokenize it before tokenizing it again, and finally detokenize. That being said, it seems the normalizer itself mangles the apostrophe:

>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday ’s reception"
>>> mpn.normalize(text)
'yesterday "s reception'
>>> mt.tokenize(mpn.normalize(text))
['yesterday', '"', 's', 'reception']
>>> md.detokenize(mt.tokenize(mpn.normalize(text)))
'yesterday "s reception'

Which also happens in Moses' perl script:
The normalization bug in sacremoses happens here:
After the #78 fix, your cleaning workflow for your input would be something like:
And if necessary:
>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday ’s reception"
>>> md.detokenize(mt.tokenize(md.detokenize(mpn.normalize(text).split())))
"yesterday's reception"
So, to get the detokenized version of my text, and not the detokenized version of the normalized text, I would need to perform double detokenization, find out which spaces got removed by the detokenizer, and remove those spaces from my original text. Otherwise, I would end up with something different from the original text.
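The space-diffing step described above can be sketched in plain Python. This is a hypothetical helper (not part of sacremoses, and `removed_space_offsets` is a made-up name): it compares the original space-separated text with the detokenizer's output and reports which spaces disappeared, assuming the detokenizer only deletes spaces and changes nothing else.

```python
# Hypothetical helper (not part of sacremoses): given the original
# space-separated text and the detokenizer's output, report the
# character offsets (in the original) of the spaces that were removed.
# Assumes the detokenizer only deletes spaces and changes nothing else.
def removed_space_offsets(original, detokenized):
    tokens = original.split(" ")
    offsets = []
    pos_orig = 0   # cursor in the original text
    pos_det = 0    # cursor in the detokenized text
    for i, tok in enumerate(tokens):
        pos_det = detokenized.index(tok, pos_det) + len(tok)
        pos_orig += len(tok)
        if i < len(tokens) - 1:
            if pos_det < len(detokenized) and detokenized[pos_det] == " ":
                pos_det += 1              # the space survived
            else:
                offsets.append(pos_orig)  # the space was deleted here
            pos_orig += 1
    return offsets

print(removed_space_offsets("yesterday 's reception", "yesterday's reception"))  # [9]
```

Offset 9 is the space before "'s"; applying the same deletions to the untouched original text would then preserve the non-standard apostrophe.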
Yes, the example I gave is one of the typical pipelines that people use to clean data for machine translation. What's the expected output in your example? Do you want to detokenize, tokenize, or normalize?
I want to detokenize, without any changes to the text.
Ah, do you mean something like:

>>> from sacremoses import MosesDetokenizer
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday 's reception"
>>> md.detokenize(text.split())
"yesterday's reception"

But with the non-standard apostrophe:
Actually, this part about adding a new apostrophe to the detokenization process isn't simple, see https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L678. Because:
I'd suggest bearing with the normalization of the apostrophe instead:

from sacremoses import MosesPunctNormalizer
from sacremoses import MosesTokenizer, MosesDetokenizer
mpn = MosesPunctNormalizer()
md = MosesDetokenizer(lang='en')
text = "yesterday ’s reception"
md.detokenize(mpn.normalize(text).split())

[out]:
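Why the ASCII apostrophe detokenizes while the curly one doesn't can be illustrated with a toy version of the English-specific rule. This is a simplified sketch, not the actual regex in sacremoses/tokenize.py: the point is only that such rules match the ASCII apostrophe literally, so U+2019 slips through.

```python
import re

# Simplified sketch of an English contraction rule (NOT the actual
# regex in sacremoses/tokenize.py): it matches the ASCII apostrophe
# literally, so the curly apostrophe (U+2019) is never re-attached.
CONTRACTION = re.compile(r" '(s|m|d|ll|re|ve|t)\b", re.IGNORECASE)

def toy_detok(text):
    # re-attach " 's", " 'll", etc. to the preceding token
    return CONTRACTION.sub(r"'\1", text)

print(toy_detok("yesterday 's reception"))   # yesterday's reception
print(toy_detok("yesterday ’s reception"))   # unchanged: U+2019 is not matched
```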
Sorry, this is not an option. I think I'll either try to embed the detokenizer, so that it returns an abstract representation of removed spaces that I can apply to the original text, or maybe there's some software out there that can do detokenization without touching the text. Are there many cases in which this function destroys the text (apart from apostrophes and probably also quotation marks)?
Maybe try https://sjmielke.com/papers/tokenize/ or spacy for your use-case. I can take a look at this again without changing detokenization behavior, but no promises, because supporting non-normalized text opens a can of worms: we'd have to support all other non-normalized forms =(
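For reference, the reversible-tokenization idea behind such tools can be sketched without any dependency. This is an illustrative sketch, not any particular library's API: keep each token paired with the whitespace that follows it, so joining is lossless and the curly apostrophe survives untouched.

```python
import re

def tokenize_with_ws(text):
    # pair each whitespace-delimited token with its trailing whitespace
    # so the original string can be rebuilt exactly
    return [(m.group(1), m.group(2)) for m in re.finditer(r"(\S+)(\s*)", text)]

def join_with_ws(pairs):
    return "".join(tok + ws for tok, ws in pairs)

text = "yesterday ’s reception"
pairs = tokenize_with_ws(text)
assert join_with_ws(pairs) == text  # round-trip is lossless, apostrophe intact
```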
I had seen that, but it's overly simplistic and language-agnostic; it can't possibly get the job done in all languages. An enhancement that would let me use the sacremoses detokenizer would be to have it return the length and offset of each part of the string that would be removed, instead of a string with those parts already removed.
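The offsets-based enhancement proposed here could be consumed like this. This is a hypothetical sketch of the proposal, not an existing sacremoses feature, and `apply_removals` is a made-up name: given the character offsets the detokenizer would delete, strip exactly those characters from the untouched original.

```python
def apply_removals(text, offsets):
    # delete only the characters at the given offsets (the spaces the
    # detokenizer would have removed); everything else, including a
    # non-standard apostrophe, is left intact
    drop = set(offsets)
    return "".join(ch for i, ch in enumerate(text) if i not in drop)

print(apply_removals("yesterday ’s reception", [9]))  # yesterday’s reception
```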
I just reported the same issue to the mosestokenizer package: luismsgomes/mosestokenizer#1
The problem is that detokenization fails to handle apostrophes correctly:
prints
yesterday ’s reception