
Hello author, a question: the tokenizer vocabulary size does not match the model's embedding layer #39

Open
zhangzai666 opened this issue Apr 1, 2023 · 2 comments

Comments

@zhangzai666

Hello author, and thank you for sharing the model. I asked you earlier about how to do pretraining.
I found that after loading the model, the embedding layer has 32128 rows, but the loaded tokenizer's vocabulary size is 32228. The difference comes from the extra_0 to extra_100 sentinel tokens, which are exactly what pretraining needs. So how can I run pretraining on the model you shared, whose embedding only has 32128 rows?
[screenshots: tokenizer vocab size and model embedding size]
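For reference, a minimal way to reproduce the mismatch being described (the checkpoint name below is only an illustrative assumption; substitute the one you actually downloaded):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "ClueAI/ChatYuan-large-v2"  # hypothetical path; use the checkpoint you actually have
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

print("tokenizer vocab size:", len(tokenizer))                               # e.g. 32228 when the extra_id tokens are kept
print("embedding rows:", model.get_input_embeddings().weight.shape[0])       # e.g. 32128
```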

@joytianya
Collaborator

This has been fixed; please reload the model and tokenizer.

@zhangzai666
Author

> This has been fixed; please reload the model and tokenizer.

Hello, and thank you for your reply.
I just tried loading ChatYuan V2. It looks like you set the number of extra_id tokens to 0 when loading the vocabulary, so the tokenizer's vocab_size dropped by 100. But the T5 model still needs extra_0 to extra_100 during pretraining, doesn't it? Shouldn't the model's embedding layer instead be enlarged to 32228 so that it covers those 100 extra_0 to extra_100 mask (sentinel) tokens?
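If the goal is to keep the extra_0 to extra_100 sentinels for T5-style span-corruption pretraining, one option is to grow the embedding layer to match the tokenizer rather than shrinking the tokenizer. Below is a minimal sketch using the transformers resize_token_embeddings API, assuming the tokenizer is loaded with its sentinel tokens still present; the checkpoint name is only an example:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "ClueAI/ChatYuan-large-v2"  # hypothetical example; use the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)   # assumed to still contain the extra_id_* sentinels
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Grow the input embedding (and the tied lm_head) so every tokenizer id has a row.
model.resize_token_embeddings(len(tokenizer))           # e.g. 32128 -> 32228

# The newly added rows are randomly initialized; they would be learned during the
# span-corruption pretraining in which the extra_id sentinel tokens are actually used.
```

This is only one way to reconcile the two sizes; dropping the sentinels from the tokenizer, as in the current fix, is the other.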
