
No EOS token appended #10

Open
ZhiyuanChen opened this issue Jun 12, 2024 · 2 comments

Comments


ZhiyuanChen commented Jun 12, 2024

Hi,

Thank you for this wonderful work!

When I was trying to reproduce your results, I ran into a problem while putting together a minimal working example.

import __main__

import torch

from model import MSATransformer
from utils.tokenization import Vocab
from msm.data import Alphabet


# evil hack: stub out the config classes that the checkpoint apparently
# pickles under __main__, so torch.load can unpickle it
__main__.Config = dict()
__main__.OptimizerConfig = dict()
__main__.MSATransformerModelConfig = dict()
__main__.DataConfig = dict()
__main__.TrainConfig = dict()
__main__.LoggingConfig = dict()

pretrained = "RNA_MSM_pretrained.ckpt"

alphabet = Alphabet.from_architecture("rna language")
vocab = Vocab.from_esm_alphabet(alphabet)
tokenizer = vocab.encode
model = MSATransformer(vocab, num_layers=10)
model.load_state_dict(torch.load(pretrained, map_location='cpu')['state_dict'])
model.eval()

sequence = "UAGCNUAUCAGACUGAUGUUGA"
inputs = torch.tensor(tokenizer(sequence))[None, None, :]

o = model(inputs, need_head_weights=True, repr_layers=list(range(13)))

The length of the sequence is 22, so inputs should contain 24 tokens (with <cls> and <eos>),
but it only has 23 tokens.

The encoded inputs tensor is:

tensor([[[0, 7, 4, 5, 6, 9, 7, 4, 7, 6, 4, 5, 4, 6, 7, 5, 4, 7, 5, 7, 7, 5, 4]]])

Since vocab is Vocab({'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, 'A': 4, 'G': 5, 'C': 6, 'U': 7, 'X': 8, 'N': 9, '-': 10, '<mask>': 11}),
it appears the <eos> token (id 2) is not appended during encoding.
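
A quick check against the Vocab dump above (a sketch, assuming tokenizer returns a flat sequence of token ids as in the snippet) confirms that only <cls> is added:

tokens = tokenizer(sequence)
print(len(sequence), len(tokens))  # 22, 23 -> only one special token was added
assert tokens[0] == 0              # <cls> (id 0) is prepended
assert tokens[-1] != 2             # ...but <eos> (id 2) is not appended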

@tBai1994

Hi, has your problem been solved? I ran into a similar problem as well.

yikunpku (Owner) commented Jul 22, 2024

Hi @ZhiyuanChen @tBai1994, this happens because we do not append the <eos> token, which is consistent with MSA Transformer. If you would like to append this special token, you can do so by setting append_eos = True at the following location:

RNA-MSM/msm/data.py

Lines 166 to 172 in 43d3d93

elif name in ("rna language"):
    standard_toks = rnaseq_toks["toks"]
    prepend_toks = ("<cls>", "<pad>", "<eos>", "<unk>")
    append_toks = ("<mask>",)
    prepend_bos = True
    append_eos = False
    use_msa = True
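
For illustration, the same branch with only that flag flipped (an untested sketch; everything else stays exactly as in the file above):

elif name in ("rna language"):
    standard_toks = rnaseq_toks["toks"]
    prepend_toks = ("<cls>", "<pad>", "<eos>", "<unk>")
    append_toks = ("<mask>",)
    prepend_bos = True
    append_eos = True  # encoded sequences now end with <eos> (id 2)
    use_msa = True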
