DeepPavlov version:
The latest docker container deeppavlov/deeppavlov, published last month
Python version:
3.10
Operating system:
The latest docker container deeppavlov/deeppavlov, published last month.
Docker is running on CentOS/AlmaLinux.
Issue:
I’m looking to understand how to prevent this crash from happening: the input sequence after BERT tokenization isn’t allowed to exceed 512 tokens.
I’m using the REST API, so I’m calling ner_bert_base like this:
```json
{
  "x": [
    "A huge text. Blah blah blah... No line breaks. I'm a 28 year-old person called John Smith, etc..."
  ]
}
```
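For reference, this is roughly how I send the request, sketched in Python purely for illustration (my real client is not Python); the host, port, and /model endpoint are my assumptions about the default Docker setup:

```python
# Illustration only -- my actual client is not Python.
# Assuming the container exposes the default REST endpoint on port 5000.
import requests

payload = {
    "x": [
        "A huge text. Blah blah blah... No line breaks. I'm a 28 year-old person called John Smith, etc..."
    ]
}
resp = requests.post("http://localhost:5000/model", json=payload)
print(resp.json())
```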
While researching this error, I found #839 (comment), which says:
Sorry, but the BERT model has positional embeddings only for first 512 subtokens. So, the model can’t work with longer sequences. It is a deliberate architecture restriction. Subtokens are produced by WordPiece tokenizer (BPE). 512 subtokens correspond approximately to 300-350 regular tokens for multilingual model. Make sure that you performed sentence tokenization before dumping the data. Every sentence in the dumped data should be separated by an empty line.
But I don’t fully understand what I need to do to resolve the problem.
What does “Make sure that you performed sentence tokenization before dumping the data” mean? Is it some function I need to call first that returns a list of tokens? Is it something I can call through the REST API from my application/code?
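For example, is the expectation something like the sketch below, where the long text is split into sentences first and every sentence goes in as its own element of "x"? (I'm only guessing here; nltk is just one possible splitter.)

```python
# A guess at what "sentence tokenization before dumping the data" might mean:
# split the long text into sentences and send each one as a separate batch item.
import nltk

nltk.download("punkt", quiet=True)  # newer nltk versions may need "punkt_tab" instead

long_text = "A huge text. Blah blah blah... I'm a 28 year-old person called John Smith, etc..."
sentences = nltk.sent_tokenize(long_text)

payload = {"x": sentences}  # each sentence becomes its own element of "x"
```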
I was also looking to see if I could have my application (the caller) somehow tokenize the words and punctuation and then only send the first 512, but it’s hard to preserve the original spacing, and even if I send only 512 word-level tokens, the model still exceeds its limit and crashes anyway.
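The closest I got was truncating at the subtoken level instead of the word level, along these lines (sketched in Python with the HuggingFace tokenizer; the multilingual cased checkpoint is an assumption on my part, since I don’t know which one ner_bert_base actually wraps):

```python
# Sketch: truncate *subtokens*, not words, since one word can expand into
# several WordPiece subtokens. The checkpoint name is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def truncate_to_subtokens(text: str, max_subtokens: int = 510) -> str:
    # 510 leaves room for the [CLS] and [SEP] special tokens.
    enc = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)
    if len(enc["input_ids"]) <= max_subtokens:
        return text
    # Cut the original string at the character offset of the last kept
    # subtoken, so the spacing of everything before the cut stays intact.
    cut_char = enc["offset_mapping"][max_subtokens - 1][1]
    return text[:cut_char]
```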
I feel like I’m trying to reinvent the wheel.
Can’t we have the API and/or the model just silently (or by setting a flag/parameter in the input) truncate the text input past 512 tokens?
(Note that my application is not made in Python)
Thank you very much!