How can I run unsupervised pretraining based on the YUAN model? #19
Comments
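Please refer to the pretraining code in the README.
Thank you very much. Could you share a simple example of the dataset used for ChatYuan unsupervised training? Which tokens are used to mark the masked spans?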
I followed the Hugging Face unsupervised training code and ran a quick test. The code is as follows:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
tokenizer = T5Tokenizer.from_pretrained("premodel/ChatYuan-large-v1")
model = T5ForConditionalGeneration.from_pretrained("premodel/ChatYuan-large-v1")
input_ids = tokenizer("一只<extra_id_0>走在<extra_id_1>大街上", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0>可爱的<extra_id_1>宽敞的<extra_id_2>", return_tensors="pt").input_ids
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits
It fails with this error:
IndexError: index out of range in self
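This looks like an out-of-range index in the embedding layer. I checked the model's vocabulary and it does not contain the <extra_id_0> and <extra_id_1> tokens, yet the tokenizer did not raise an error when encoding them.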
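One way to confirm that suspicion is to compare the token ids against the size of the embedding table before calling the model. The helper below is only a minimal sketch: `out_of_range_ids` is a hypothetical name, and the ids and table size in the demo are made-up numbers, not values taken from ChatYuan.

```python
def out_of_range_ids(token_ids, num_embeddings):
    """Return the ids that an embedding table with `num_embeddings` rows cannot look up."""
    return [i for i in token_ids if i < 0 or i >= num_embeddings]


# Made-up numbers for illustration: a table with 10 rows accepts ids 0..9 only.
bad = out_of_range_ids([3, 9, 10, 12], num_embeddings=10)
print(bad)
```

With the real objects, the ids would come from `input_ids[0].tolist()` and `labels[0].tolist()`, and the table size from `model.get_input_embeddings().num_embeddings`. If the sentinel ids do fall outside the table, `model.resize_token_embeddings(len(tokenizer))` is the usual Transformers call for growing it, though the new rows start untrained and this has not been checked against ChatYuan here.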
How can I run unsupervised pretraining based on the YUAN model, and what data format does unsupervised pretraining expect? Many thanks.
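For reference, the input/label pair in the snippet above follows the T5 span-corruption format from the Hugging Face documentation: each masked span is replaced by a sentinel in the input, and the labels list each sentinel followed by the text it hides, closed by one final sentinel. Below is a minimal sketch of building such a pair from pre-segmented text. `span_corrupt` is a hypothetical helper, and whether ChatYuan was pretrained with this format is not confirmed anywhere in this thread.

```python
def span_corrupt(tokens, spans, sentinel="<extra_id_{}>"):
    """Mask (start, end) token spans, end exclusive, in T5 span-corruption style.

    Returns (input_text, label_text). Spans must not overlap.
    """
    inputs, labels = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        inputs.extend(tokens[cursor:start])   # keep the text before the span
        inputs.append(sentinel.format(i))     # replace the span by a sentinel
        labels.append(sentinel.format(i))     # the label names the sentinel...
        labels.extend(tokens[start:end])      # ...followed by the hidden text
        cursor = end
    inputs.extend(tokens[cursor:])            # keep the text after the last span
    labels.append(sentinel.format(len(spans)))  # closing sentinel
    return "".join(inputs), "".join(labels)


tokens = ["一只", "可爱的", "走在", "宽敞的", "大街上"]
input_text, label_text = span_corrupt(tokens, [(1, 2), (3, 4)])
print(input_text)
print(label_text)
```

The two strings reproduce the pair used in the snippet above. They can only be trained on once every sentinel id maps to a valid row of the model's embedding table.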