
No EOS token appended #10

Open
ZhiyuanChen opened this issue Jun 12, 2024 · 2 comments

Comments


ZhiyuanChen commented Jun 12, 2024

Hi,

Thank you for this wonderful work!

When I was trying to reproduce your results, I ran into a problem while putting together a minimal working example.

import __main__

import torch

from model import MSATransformer
from utils.tokenization import Vocab
from msm.data import Alphabet


# evil hack: stub out the config classes that the checkpoint apparently
# pickles under __main__, so torch.load can unpickle it
__main__.Config = dict()
__main__.OptimizerConfig = dict()
__main__.MSATransformerModelConfig = dict()
__main__.DataConfig = dict()
__main__.TrainConfig = dict()
__main__.LoggingConfig = dict()

pretrained = "RNA_MSM_pretrained.ckpt"

alphabet = Alphabet.from_architecture("rna language")
vocab = Vocab.from_esm_alphabet(alphabet)
tokenizer = vocab.encode
model = MSATransformer(vocab, num_layers=10)
model.load_state_dict(torch.load(pretrained, map_location='cpu')['state_dict'])
model.eval()

sequence = "UAGCNUAUCAGACUGAUGUUGA"
inputs = torch.tensor(tokenizer(sequence))[None, None, :]

o = model(inputs, need_head_weights=True, repr_layers=list(range(13)))

The length of the sequence is 22, so inputs should contain 24 tokens (with <cls> and <eos>),
but it only has 23 tokens.

The encoded inputs tensor is:

tensor([[[0, 7, 4, 5, 6, 9, 7, 4, 7, 6, 4, 5, 4, 6, 7, 5, 4, 7, 5, 7, 7, 5, 4]]])

Since vocab is Vocab({'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, 'A': 4, 'G': 5, 'C': 6, 'U': 7, 'X': 8, 'N': 9, '-': 10, '<mask>': 11}),
it appears the <eos> token (id 2) is not appended during encoding.
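
A quick check against the Vocab dump above (a sketch, assuming tokenizer returns a flat sequence of token ids as in the snippet) confirms that only <cls> is added:

tokens = tokenizer(sequence)
print(len(sequence), len(tokens))  # 22, 23 -> only one special token was added
assert tokens[0] == 0              # <cls> (id 0) is prepended
assert tokens[-1] != 2             # ...but <eos> (id 2) is not appended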

@tBai1994

Hi, has your problem been solved? I ran into a similar problem as well.

yikunpku (Owner) commented Jul 22, 2024

Hi @ZhiyuanChen @tBai1994, this happens because we do not append the <eos> token, which is consistent with MSA Transformer. If you would like to append this special token, you can do so by setting append_eos = True at the following location:

RNA-MSM/msm/data.py

Lines 166 to 172 in 43d3d93

elif name in ("rna language"):
    standard_toks = rnaseq_toks["toks"]
    prepend_toks = ("<cls>", "<pad>", "<eos>", "<unk>")
    append_toks = ("<mask>",)
    prepend_bos = True
    append_eos = False
    use_msa = True
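
For illustration, the same branch with only that flag flipped (an untested sketch; everything else stays exactly as in the file above):

elif name in ("rna language"):
    standard_toks = rnaseq_toks["toks"]
    prepend_toks = ("<cls>", "<pad>", "<eos>", "<unk>")
    append_toks = ("<mask>",)
    prepend_bos = True
    append_eos = True  # encoded sequences now end with <eos> (id 2)
    use_msa = True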
