Double [CLS] token in the first doc chunk #25
Issue: I noticed that when we tokenize, we set `add_special_tokens` to True here: https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L72, which adds a [CLS] token to the beginning of the doc tokens. But when we embed the chunks with BERT, we also add a [CLS] token to the beginning of each chunk: https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L240. So some chunks (the first chunk of every doc) will have two [CLS] tokens at the beginning. I think the solution here is just to turn off `add_special_tokens` when going from text -> chunks. Is that correct?

Reply: @mitchellgordon95 Hey Mitchell, yes indeed, you spotted a problem I knew about but did not address. However, my take is that multiple [CLS] tokens shouldn't harm things too much (I could be totally wrong about that, though). Yes, you are correct that …