How to get 'sbert_x.pt' and other files for a new dataset? #19

naskk1 opened this issue Oct 3, 2024 · 5 comments

naskk1 commented Oct 3, 2024

If I want to train the model on another dataset, how can I obtain files such as 'sbert_x.pt' for that dataset? There seems to be no code for this.

@ChenRunjin (Collaborator)

Hi, you can find the code to generate the SBERT embeddings in the get_sbert_embedding() function in utils/data_process.py.
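For reference, here is a minimal sketch of that step using the standard sentence-transformers API. The model name, batching, and output filename below are assumptions rather than the repo's exact settings, so check get_sbert_embedding() in utils/data_process.py for what is actually used:

```python
# Minimal sketch (assumed settings, not the repo's exact code): encode each
# node's raw text with Sentence-BERT and save the embedding matrix as sbert_x.pt.
import torch
from sentence_transformers import SentenceTransformer

def build_sbert_embeddings(raw_texts, out_path="sbert_x.pt",
                           model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)  # model choice is an assumption
    # encode() returns a (num_nodes, hidden_dim) tensor when convert_to_tensor=True
    emb = model.encode(raw_texts, convert_to_tensor=True,
                       batch_size=64, show_progress_bar=True)
    torch.save(emb.cpu(), out_path)
    return emb
```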


naskk1 commented Oct 4, 2024

Thank you for your reply, but what is the input text template for the embeddings? I think it differs from the Q&A paradigm. I have tried a few templates, but they don't work well.

@ManuelSerna

Hi,

May I ask how one would obtain the .jsonl files, as well as the processed_data.pt file, for a new dataset?

Thank you for your time.

@ChenRunjin (Collaborator)

Hi, due to variations in the raw data formats across different datasets, we don't have a single unified function for generating processed_data.pt. To create processed_data.pt for a new dataset, you only need to build a Data instance in PyG format and make sure edge_index is included in it. Also make sure data.label_texts contains all label names, and data.raw_texts contains the node text features if you want to train on the node description task.
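A minimal sketch of what that looks like (the attribute names edge_index, raw_texts, and label_texts follow the description above; the optional data.y field and the function itself are assumptions for illustration):

```python
# Minimal sketch: build a PyG Data instance for a new dataset and save it as
# processed_data.pt. Only edge_index, raw_texts, and label_texts are described
# above; data.y is an assumed extra field for label supervision.
import torch
from torch_geometric.data import Data

def build_processed_data(edge_index, raw_texts, label_texts, labels=None,
                         out_path="processed_data.pt"):
    data = Data(edge_index=edge_index)   # long tensor of shape [2, num_edges]
    data.raw_texts = raw_texts           # list[str], one text per node
    data.label_texts = label_texts       # list[str], all label names
    if labels is not None:
        data.y = labels                  # per-node integer labels (assumed field)
    torch.save(data, out_path)
    return data
```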

In general, we follow the guidelines from this repo to generate the edge_index.

To create the *.jsonl files, the main task is to generate the node sequence that represents the structure surrounding each node, following the template used in the existing *.jsonl files. You can use our get_fix_shape_subgraph_sequence_fast function in utils/data_process.py to generate the node sequence; a rough sketch of the writing step is below.
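The record keys in this sketch are placeholders, not the repo's exact schema; match them to the fields in the released *.jsonl files, and obtain each node sequence from get_fix_shape_subgraph_sequence_fast:

```python
# Rough sketch: write one JSON record per node to a *.jsonl file. The key names
# ("id", "graph", "conversations") are assumptions -- mirror the released files.
import json

def write_jsonl(records, out_path="sampled_2_10_train.jsonl"):
    # records: iterable of dicts, e.g.
    #   {"id": node_id, "graph": node_sequence, "conversations": [...]}
    # where node_sequence comes from get_fix_shape_subgraph_sequence_fast.
    with open(out_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```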


honey0219 commented Oct 16, 2024

After obtaining the processed_data.pt, sampled_2_10_test.jsonl, sampled_2_10_train.jsonl, and sampled_2_10_val.jsonl files for a new dataset, could you please let me know what additional steps I should take to run experiments in the "single focus" setting for the node classification task on the new dataset?
Thank you for your time!
