Need clarification for pre-training #13
Comments
The thing that the paper refers to is happening inside of So the input text file should be actual sentences, although feel free to add some noise if you want to make things more robust for fine-tuning (e.g., if your sentence segmenter always splits on For our sentence segmenter I just used some Google-internal library I found, but anything off the shelf like [SpaCy](https://spacy.io/usage/spacy-101) should work.
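For anyone preparing their own corpus, here is a minimal sketch of turning raw documents into sentence-per-line text. It assumes the layout the README describes (one sentence per line, with a blank line between documents). The regex splitter is only a naive stand-in for a real sentence segmenter such as SpaCy, and the names `split_sentences` and `to_pretraining_text` are hypothetical helpers, not part of this repository.

```python
import re

# Naive heuristic (an assumption, not the segmenter used for BERT):
# break after ., !, ? or ; whenever whitespace follows.
_SENTENCE_END = re.compile(r"(?<=[.!?;])\s+")


def split_sentences(text):
    """Split one document into rough sentences."""
    parts = _SENTENCE_END.split(text.strip())
    return [p.strip() for p in parts if p.strip()]


def to_pretraining_text(documents):
    """Write one sentence per line, with a blank line between documents."""
    blocks = ["\n".join(split_sentences(doc)) for doc in documents]
    return "\n\n".join(b for b in blocks if b) + "\n"


if __name__ == "__main__":
    docs = [
        "This is the first sentence. Here is a second one; it has two clauses.",
        "A new document starts here. It also has two sentences.",
    ]
    print(to_pretraining_text(docs), end="")
```

A heuristic like this will mis-split on abbreviations and decimals, so an off-the-shelf segmenter is likely the better choice for real data; the sketch is only meant to show the target file layout.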
I see, especially the explanation in
I added a paragraph in the README about this, thanks.
Hello @jacobdevlin-google
@xgk are you using Chinese or some other language where one token corresponds to one character rather than one word? That would explain this size increase.
and additionally inclues Thai and Mongolian. -> and additionally includes Thai and Mongolian. FIX google-research#13
In the README.md, it says for the pre-training:
and the example `sample_text.txt` does have each line ending with either `.` or `;`.
Whereas in the BERT paper, it says
So it is unclear whether this implementation expects one actual sentence per line, or whether documents can simply be broken into multiple lines arbitrarily.