
How to do unsupervised pre-training based on the YUAN model? #19

Open
zhangzai666 opened this issue Mar 12, 2023 · 3 comments

Comments

@zhangzai666

I referred to Hugging Face's unsupervised training example and ran a quick test. The code is as follows:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("premodel/ChatYuan-large-v1")
model = T5ForConditionalGeneration.from_pretrained("premodel/ChatYuan-large-v1")

# T5-style span corruption: sentinel tokens mark the masked spans in the
# input, and the labels list each sentinel followed by the span it replaced.
input_ids = tokenizer("一只<extra_id_0>走在<extra_id_1>大街上", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0>可爱的<extra_id_1>宽敞的<extra_id_2>", return_tensors="pt").input_ids

outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits
This raised the error:

IndexError: index out of range in self

It looks like an out-of-range index in the embedding layer. I checked the model's vocabulary and the <extra_id_0>/<extra_id_1> sentinel tokens are not in it, yet the tokenizer raised no error.
How can I do unsupervised pre-training based on the YUAN model, and what data format does unsupervised pre-training use? Many thanks.
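(A minimal diagnostic sketch, not from the thread: it checks whether the ids the tokenizer produces fit inside the model's embedding table, and, as one possible fix, registers the missing sentinel tokens and resizes the embeddings. The checkpoint path is taken from the snippet above; whether resizing matches how this checkpoint was trained is an assumption.)

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("premodel/ChatYuan-large-v1")
model = T5ForConditionalGeneration.from_pretrained("premodel/ChatYuan-large-v1")

# Any id >= the embedding table size reproduces "index out of range in self".
ids = tokenizer("一只<extra_id_0>走在<extra_id_1>大街上").input_ids
vocab_size = model.get_input_embeddings().num_embeddings
print(max(ids), vocab_size)

# Possible fix (an assumption, not the maintainers' recipe): add the missing
# sentinels as special tokens and resize the embeddings before training.
sentinels = [f"<extra_id_{i}>" for i in range(100)]
missing = [t for t in sentinels if t not in tokenizer.get_vocab()]
if missing:
    tokenizer.add_special_tokens({"additional_special_tokens": missing})
    model.resize_token_embeddings(len(tokenizer))
```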

@joytianya
Collaborator

Please refer to the pre-training code in the README.

@zhangzai666
Author

Many thanks. Could you share a simple example of the dataset used for ChatYuan's unsupervised training, and what is used to mark the masked spans?

@joytianya
Collaborator

For the details, please refer to T5's data-construction rules: https://github.com/google-research/text-to-text-transfer-transformer
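(A minimal sketch of the T5 span-corruption rule the link describes, not code from this repo: random spans of the input are replaced with sentinel tokens, and the target lists each sentinel followed by the tokens it hid. The 15% corruption rate and mean span length of 3 follow the T5 paper's defaults; tokenizing by character here is only for illustration.)

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Build a (source, target) pair in T5's span-corruption format."""
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = random.randrange(len(tokens))
        length = max(1, round(random.expovariate(1 / mean_span_len)))
        masked.update(range(start, min(start + length, len(tokens))))

    source, target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            # One sentinel per contiguous masked span, in both sequences.
            source.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            source.append(tokens[i])
            i += 1
    target.append(f"<extra_id_{sentinel}>")  # closing sentinel
    return "".join(source), "".join(target)

# e.g. source: 一只<extra_id_0>猫走在<extra_id_1>大街上
#      target: <extra_id_0>可爱的<extra_id_1>宽敞的<extra_id_2>
print(span_corrupt(list("一只可爱的猫走在宽敞的大街上")))
```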
