Add decode stream #64
Conversation
lib/tokenizers/decode_stream.ex
Outdated
@doc """
Steps through the decode stream with the given tokenizer and token ID.

Returns `{:ok, string}` if there's a decoded string, or `{:ok, nil}` if there's nothing more to decode.
I would likely do `:done` for when there is nothing more to decode.
There is a problem here. While it looks like a stream process, it's actually not: it is never exhausted. One can call `step(12345678)` and then call `step(0)` - that's valid from the original library's perspective. Or one can call `step(n)` multiple times (it's not even idempotent).
https://docs.rs/tokenizers/0.21.1/tokenizers/tokenizer/struct.DecodeStream.html
use tokenizers::Tokenizer;
let tokenizer = Tokenizer::from_file("data/roberta.json").unwrap();
let mut decode_stream = tokenizer.decode_stream(false);
assert_eq!(decode_stream.step(713).unwrap(), Some("This".to_string()));
assert_eq!(decode_stream.step(16).unwrap(), Some(" is".to_string()));
assert_eq!(decode_stream.step(41).unwrap(), Some(" an".to_string()));
assert_eq!(
decode_stream.step(1246).unwrap(),
Some(" example".to_string())
);
From this perspective, `:done` looks inappropriate: it implies "last action" semantics, which is not the case.
Maybe `:out_of_range`?
Well, `Enum.at/2` returns `nil` for out of range. Erlang's libs (like `:array`) raise `ArgumentError`. I've never seen `:out_of_range` anywhere across the ecosystem, so it may not work well.
`Enum.at` does not wrap it in a tuple though. `Enum.fetch`, which returns `{:ok, _}` or `:error`, is a better example, but it has only two states. My understanding is that this code has three states:
- In range
- Out of range
- Error when decoding

So you need three states accordingly. I don't mind what we call them, but I'd say the three states should be distinct.
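For illustration, a caller handling all three outcomes could look roughly like this. This is a sketch only: the exact tuple shapes and the `Tokenizers.DecodeStream.step/3` name are assumptions, not the final API.

```elixir
# Hypothetical sketch of handling the three distinct outcomes of a step.
case Tokenizers.DecodeStream.step(decode_stream, tokenizer, id) do
  {:ok, string} when is_binary(string) ->
    # In range: a decoded fragment is available
    IO.write(string)

  {:ok, :out_of_range} ->
    # Out of range: the id is not in the vocabulary
    :noop

  {:error, reason} ->
    # Error when decoding
    raise "decoding failed: #{inspect(reason)}"
end
```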
Switched to `:out_of_range`.
lib/tokenizers/decode_stream.ex
Outdated
case Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id) do
  {:ok, result} -> {:ok, result}
  {:error, reason} -> {:error, reason}
end
Suggested change:
-case Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id) do
-  {:ok, result} -> {:ok, result}
-  {:error, reason} -> {:error, reason}
-end
+Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id)
Is it actually efficient going step after step with specific indexes? Or does the upstream code work better by calling something like `.next`? Is this something we should benchmark first?
In this PR I translated the API as is. Some abstraction can definitely be written on top of it using `Stream`, but that seems to be a separate effort.
Decoding large chunks this way looks better for the BEAM, because control is given back from the NIF between steps. It would be possible to write the decoding without a dirty scheduler, but I'm not sure anybody will invest in that.
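For reference, such a `Stream`-based abstraction over the step API could be sketched as below. The `Tokenizers.DecodeStream.new/1` and `step/3` names and return shapes are assumptions for illustration, not the actual API in this PR.

```elixir
# Sketch only: wrap the step-based API in a lazy stream of decoded fragments.
defmodule DecodeHelpers do
  def decode_lazily(ids, tokenizer) do
    # One decode-stream resource is threaded through as the accumulator,
    # so intermediate state never needs to be exposed to the caller.
    Stream.transform(ids, Tokenizers.DecodeStream.new(tokenizer), fn id, ds ->
      case Tokenizers.DecodeStream.step(ds, tokenizer, id) do
        # A fragment was decoded; emit it and keep the same stream state
        {:ok, string} when is_binary(string) -> {[string], ds}
        # Nothing to emit yet (e.g. a partial byte sequence)
        {:ok, nil} -> {[], ds}
        {:error, reason} -> raise "decoding failed: #{inspect(reason)}"
      end
    end)
  end
end
```

A caller could then pipe token ids straight into `IO.write/1` without materializing the whole output.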
Got it! Thank you!
Thank you, I have dropped some comments. :)
Co-authored-by: José Valim <[email protected]>
@jonatanklosko Can you review please?
let tk = tokenizer.resource.0.clone();
let mut ds = decode_stream.resource.inner.write().unwrap();
If each step involves cloning, it may result in bad performance, even compared to decoding everything at once.
This may actually be a good reason to make the public API a more high-level stream. For example, `DecodeStream.new(...)` can return a `%DecodeStream{}` that implements the `Enumerable` protocol, where the step NIF is called internally and we don't need to clone multiple times, since intermediate resources are not exposed to the user.
Thoughts?
Gah, the ids need to be an input, so it's not so much `Enumerable`, but more like `Stream.transform` on top of a stream of ids. But I can see such an API being cumbersome when the user doesn't want to block :&lt;
@Virviil Actually, does `ds.step` mutate the tokenizer itself? Maybe there's no need for that particular clone in the first place.
@jonatanklosko It was effectively possible to remove cloning.
It was effectively possible to remove cloning.
@Virviil great!
Bump the version of tokenizers.
Brings new functionality, mostly ported from https://github.com/huggingface/tokenizers/pull/1678/files.