
Inquiry Regarding Vocabulary Size in CodonBERT PyTorch Model #6

Open
hjqqqq-bot opened this issue Jul 13, 2024 · 5 comments

Comments

@hjqqqq-bot

I would like to express my gratitude for your excellent work on CodonBERT. I have been thoroughly impressed by your research and the accompanying code.

However, I have encountered a discrepancy that I would like to clarify. In your paper and code, the vocabulary size is given as 5 × 5 × 5 + 5 = 130, based on the characters 'A', 'U', 'G', 'C', and 'N'. Yet, in the CodonBERT PyTorch model you provided, the vocabulary size is set to 69.

Could you please explain the rationale behind this difference in vocabulary size? Understanding this would greatly help me in comprehending and utilizing your model more effectively.

Thank you in advance for your assistance.

@a253324

a253324 commented Jul 15, 2024

I also noticed this problem. Is this a mistake in the model the author provided? Because of it, to run the finetune code successfully I had to modify the model code: I changed the 69 to 130, kept the original model's weights, and filled the remaining rows with zeros or random values. However, this operation may affect the model's performance.
If this is a mistake, I hope the author could provide the correct CodonBERT model code. I would appreciate it.
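For reference, here is a minimal sketch of the kind of weight-resizing described above, assuming a plain PyTorch state dict; the file name and the embedding key (`bert.embeddings.word_embeddings.weight`) are guesses and may differ in the released checkpoint:

```python
import torch

# Hypothetical file/key names -- adjust to the released checkpoint's actual layout.
state = torch.load("pytorch_model.bin", map_location="cpu")
key = "bert.embeddings.word_embeddings.weight"

old = state[key]                      # shape: (69, hidden_size)
new = torch.zeros(130, old.size(1))   # zero-init the extra rows (or use torch.randn)
new[: old.size(0)] = old              # keep the original 69 rows unchanged
state[key] = new
# Note: any other vocab-sized tensor (e.g. an MLM output layer) would need the same fix.

torch.save(state, "pytorch_model_vocab130.bin")
```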

@whql251

whql251 commented Jul 16, 2024

I noticed that there are 64 different codons and 5 special tokens, adding up to a vocabulary size of 69. However, the order of the codons in the vocabulary table is still unclear.
Could you kindly provide the correct vocabulary table or clarify the order of the codons? This information would be very helpful.

@a253324

a253324 commented Jul 16, 2024

> I noticed that there are 64 different codons and 5 special tokens, adding up to a vocabulary size of 69. However, the order of the codons in the vocabulary table is still unclear. Could you kindly provide the correct vocabulary table or clarify the order of the codons? This information would be very helpful.

In pretrain.py and finetune.py, after data processing, the variable dic_voc stores the vocabulary table, and its contents are:
{'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4, 'AAA': 5, 'AAU': 6, 'AAG': 7, 'AAC': 8, 'AAN': 9, 'AUA': 10, 'AUU': 11, 'AUG': 12, 'AUC': 13, 'AUN': 14, 'AGA': 15, 'AGU': 16, 'AGG': 17, 'AGC': 18, 'AGN': 19, 'ACA': 20, 'ACU': 21, 'ACG': 22, 'ACC': 23, 'ACN': 24, 'ANA': 25, 'ANU': 26, 'ANG': 27, 'ANC': 28, 'ANN': 29, 'UAA': 30, 'UAU': 31, 'UAG': 32, 'UAC': 33, 'UAN': 34, 'UUA': 35, 'UUU': 36, 'UUG': 37, 'UUC': 38, 'UUN': 39, 'UGA': 40, 'UGU': 41, 'UGG': 42, 'UGC': 43, 'UGN': 44, 'UCA': 45, 'UCU': 46, 'UCG': 47, 'UCC': 48, 'UCN': 49, 'UNA': 50, 'UNU': 51, 'UNG': 52, 'UNC': 53, 'UNN': 54, 'GAA': 55, 'GAU': 56, 'GAG': 57, 'GAC': 58, 'GAN': 59, 'GUA': 60, 'GUU': 61, 'GUG': 62, 'GUC': 63, 'GUN': 64, 'GGA': 65, 'GGU': 66, 'GGG': 67, 'GGC': 68, 'GGN': 69, 'GCA': 70, 'GCU': 71, 'GCG': 72, 'GCC': 73, 'GCN': 74, 'GNA': 75, 'GNU': 76, 'GNG': 77, 'GNC': 78, 'GNN': 79, 'CAA': 80, 'CAU': 81, 'CAG': 82, 'CAC': 83, 'CAN': 84, 'CUA': 85, 'CUU': 86, 'CUG': 87, 'CUC': 88, 'CUN': 89, 'CGA': 90, 'CGU': 91, 'CGG': 92, 'CGC': 93, 'CGN': 94, 'CCA': 95, 'CCU': 96, 'CCG': 97, 'CCC': 98, 'CCN': 99, 'CNA': 100, 'CNU': 101, 'CNG': 102, 'CNC': 103, 'CNN': 104, 'NAA': 105, 'NAU': 106, 'NAG': 107, 'NAC': 108, 'NAN': 109, 'NUA': 110, 'NUU': 111, 'NUG': 112, 'NUC': 113, 'NUN': 114, 'NGA': 115, 'NGU': 116, 'NGG': 117, 'NGC': 118, 'NGN': 119, 'NCA': 120, 'NCU': 121, 'NCG': 122, 'NCC': 123, 'NCN': 124, 'NNA': 125, 'NNU': 126, 'NNG': 127, 'NNC': 128, 'NNN': 129}

In total there are 130 entries (0 to 129), but the model's config specifies 69.
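For what it's worth, both sizes can be reproduced by enumerating all 3-mers over the two alphabets plus the five special tokens; a minimal sketch (not the authors' exact code) that also reproduces the ordering listed above:

```python
from itertools import product

SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(alphabet):
    # All 3-mers over the alphabet, in itertools.product order,
    # appended after the five special tokens.
    codons = ["".join(p) for p in product(alphabet, repeat=3)]
    return {tok: i for i, tok in enumerate(SPECIALS + codons)}

print(len(build_vocab("AUGCN")))  # 5 + 5**3 = 130
print(len(build_vocab("AUGC")))   # 5 + 4**3 = 69
```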

@whql251

whql251 commented Jul 16, 2024

> In pretrain.py and finetune.py, after data processing, the variable dic_voc stores the vocabulary table. [...] In total there are 130 entries (0 to 129), but the model's config specifies 69.

To my knowledge, there are only 64 valid codons. The codons and their corresponding amino acids are listed below:
[['CCT', 'P'], ['TCC', 'S'], ['GGA', 'G'], ['TAA', 'X'], ['GAG', 'E'], ['TGG', 'W'], ['AAG', 'K'], ['CTG', 'L'], ['AGA', 'R'], ['TTA', 'L'], ['GTG', 'V'], ['CAG', 'Q'], ['ATG', 'M'], ['CGA', 'R'], ['ACT', 'T'], ['GCT', 'A'], ['TCT', 'S'], ['CCC', 'P'], ['CGG', 'R'], ['ATA', 'I'], ['CAA', 'Q'], ['GTA', 'V'], ['TTG', 'L'], ['AGG', 'R'], ['CTA', 'L'], ['AAA', 'K'], ['TGA', 'X'], ['GAA', 'E'], ['TAG', 'X'], ['GGG', 'G'], ['GCC', 'A'], ['ACC', 'T'], ['ACA', 'T'], ['GCA', 'A'], ['TCG', 'S'], ['GAC', 'D'], ['TGC', 'C'], ['TTT', 'F'], ['CGT', 'R'], ['AAC', 'N'], ['CTC', 'L'], ['GTC', 'V'], ['GGT', 'G'], ['TAT', 'Y'], ['AGT', 'S'], ['CAC', 'H'], ['ATC', 'I'], ['CCA', 'P'], ['CCG', 'P'], ['CTT', 'L'], ['AAT', 'N'], ['CGC', 'R'], ['TTC', 'F'], ['TGT', 'C'], ['GAT', 'D'], ['ATT', 'I'], ['CAT', 'H'], ['AGC', 'S'], ['TAC', 'Y'], ['GGC', 'G'], ['GTT', 'V'], ['TCA', 'S'], ['GCG', 'A'], ['ACG', 'T']]
This may explain the vocabulary size (69) in the model config.

@a253324

a253324 commented Jul 16, 2024

> To my knowledge, there are only 64 valid codons. [...] This may explain the vocabulary size (69) in the model config.

Yeah, that's correct, thank you for the reply. After my last reply, I immediately checked the code again, and I think I see why the difference appears. In finetune.py and pretrain.py, the variable lst_ele (line 70 in pretrain.py) is list('AUGCN'), so after processing there are 130 vocabulary entries. When I change the list to list('AUGC'), there are 69 entries, which is consistent with the model's config. This raises another question: since the config the authors provided says 69, does it mean they pretrained the model with list('AUGC') rather than list('AUGCN')? If so, the released model would not be consistent with the paper, which states that pretraining used list('AUGCN').
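One way to settle this would be to inspect the released checkpoint directly; a rough sketch, assuming a standard .bin state dict (the file name and key names are guesses):

```python
import torch

state = torch.load("pytorch_model.bin", map_location="cpu")

# Print every tensor whose leading dimension could be the vocabulary size.
for name, tensor in state.items():
    if tensor.ndim >= 1 and tensor.shape[0] in (69, 130):
        print(name, tuple(tensor.shape))
# If the word-embedding rows number 69, the checkpoint was most likely trained
# with list('AUGC'); 130 would match list('AUGCN') as described in the paper.
```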
