Add decode stream #64


Merged
merged 12 commits into elixir-nx:main on Apr 29, 2025

Conversation

Virviil
Contributor

@Virviil Virviil commented Apr 18, 2025

Bump the version of tokenizers.

Brings new functionality; mostly a port of https://github.com/huggingface/tokenizers/pull/1678/files.

@doc """
Steps through the decode stream with the given tokenizer and token ID.

Returns `{:ok, string}` if there's a decoded string, or `{:ok, nil}` if there's nothing more to decode.
Contributor


I would likely return :done when there is nothing more to decode.

Contributor Author

@Virviil Virviil Apr 19, 2025


There is a problem here. While it looks like a stream process, it actually isn't: the stream is never exhausted. One can call
step(12345678) and then call step(0) - that's valid from the original library's perspective. One can also call step(n) multiple times (and this is not even idempotent).

https://docs.rs/tokenizers/0.21.1/tokenizers/tokenizer/struct.DecodeStream.html

use tokenizers::Tokenizer;
let tokenizer = Tokenizer::from_file("data/roberta.json").unwrap();

let mut decode_stream = tokenizer.decode_stream(false);
assert_eq!(decode_stream.step(713).unwrap(), Some("This".to_string()));
assert_eq!(decode_stream.step(16).unwrap(), Some(" is".to_string()));
assert_eq!(decode_stream.step(41).unwrap(), Some(" an".to_string()));
assert_eq!(
    decode_stream.step(1246).unwrap(),
    Some(" example".to_string())
);

From this perspective, :done looks inappropriate: it implies "last action" semantics, which this is not.

Contributor


Maybe :out_of_range?

Contributor Author


Well:

Enum.at/2 returns nil for out-of-range indices.
Erlang's libraries (like :array) raise ArgumentError.

I've never seen :out_of_range anywhere across the ecosystem, so it may not fit in well.

Contributor


Enum.at does not wrap it in a tuple, though. Enum.fetch, which returns {:ok, _} or :error, is a better example, but it has only two states. My understanding is that this code has three states:

  1. In range
  2. Out of range
  3. Error when decoding

So you need three states accordingly. I don't mind what we call it but I'd say the three states should be distinct.
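To make the three distinct states concrete, here is a minimal Rust sketch (all names here are hypothetical, not the actual NIF types) modeling the step outcome as an enum, with a toy vocabulary lookup standing in for real decoding:

```rust
// Hypothetical model of the three step outcomes discussed above.
#[derive(Debug, PartialEq)]
enum StepResult {
    Decoded(String), // 1. in range: a chunk of text was produced
    OutOfRange,      // 2. out of range: token id not in the vocabulary
    Error(String),   // 3. decoding failed for some other reason
}

// Toy stand-in for a real decode step: look the id up in a flat vocabulary.
fn step(vocab: &[&str], id: usize) -> StepResult {
    match vocab.get(id) {
        Some(piece) => StepResult::Decoded((*piece).to_string()),
        None => StepResult::OutOfRange,
    }
}

fn main() {
    let vocab = ["This", " is", " an", " example"];
    assert_eq!(step(&vocab, 0), StepResult::Decoded("This".to_string()));
    assert_eq!(step(&vocab, 99), StepResult::OutOfRange);
    println!("ok");
}
```

Keeping OutOfRange separate from Error lets the caller distinguish "nothing to emit for this id" from a genuine failure, which is the distinction argued for above.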

Contributor Author


Switched to :out_of_range

Comment on lines 31 to 34
case Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id) do
{:ok, result} -> {:ok, result}
{:error, reason} -> {:error, reason}
end
Contributor


Suggested change
case Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id) do
{:ok, result} -> {:ok, result}
{:error, reason} -> {:error, reason}
end
Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id)

Contributor


Is it actually efficient to go step by step with specific indexes? Or does the upstream code work better by calling something like .next? Is this something we should benchmark first?

Contributor Author


In this PR I translated the API as is. Some abstraction can definitely be written on top of it using Stream, but that seems to be part of a separate effort.

Decoding large chunks this way looks better for the BEAM because control is given back from the NIF. It would be possible to write the decoding without a dirty scheduler, but I'm not sure anybody will invest in that.

Contributor


Got it! Thank you!

Contributor

@josevalim josevalim left a comment


Thank you, I have dropped some comments. :)

@Virviil Virviil requested a review from josevalim April 19, 2025 08:43
@Virviil
Contributor Author

Virviil commented Apr 21, 2025

@jonatanklosko Can you review please?

Comment on lines 97 to 98
let tk = tokenizer.resource.0.clone();
let mut ds = decode_stream.resource.inner.write().unwrap();
Member


If each step involves cloning, it may result in bad performance, even compared to decoding at once.

This may actually be a good reason to make the public API be a more high-level stream. For example, DecodeStream.new(...) can return %DecodeStream{} and it would implement Enumerable API, where the step NIF is called internally and we don't need to clone multiple times, since intermediate resources are not exposed to the user.

Thoughts?

Member


Gah, the ids need to be an input, so it's not so much Enumerable as Stream.transform on top of a stream of ids. But I can see such an API being cumbersome when the user doesn't want to block :<
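The shape being described - a stream of ids threaded through mutable decoder state, like Stream.transform - can be sketched in Rust as a stateful iterator adapter. Everything below is a hypothetical illustration with a toy vocabulary, not the actual crate API:

```rust
// Hypothetical sketch: thread mutable decoder state through a stream of ids,
// analogous to Elixir's Stream.transform over an id stream.
struct DecodeStream<'a, I: Iterator<Item = usize>> {
    ids: I,                // the incoming stream of token ids
    vocab: &'a [&'a str],  // toy vocabulary standing in for a tokenizer
    buffer: String,        // accumulated text, mimicking stateful decoding
}

impl<'a, I: Iterator<Item = usize>> Iterator for DecodeStream<'a, I> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Pull the next id; the stream ends when the ids end (or an id is unknown).
        let id = self.ids.next()?;
        let piece = self.vocab.get(id)?;
        self.buffer.push_str(piece);
        Some((*piece).to_string())
    }
}

fn main() {
    let vocab = ["This", " is", " an", " example"];
    let ds = DecodeStream {
        ids: [0usize, 1, 2, 3].into_iter(),
        vocab: &vocab,
        buffer: String::new(),
    };
    let pieces: Vec<String> = ds.collect();
    assert_eq!(pieces.join(""), "This is an example");
    println!("ok");
}
```

Because the state lives inside the adapter and is never handed to the caller between steps, nothing needs to be cloned per step - which is the same property a higher-level Elixir stream wrapper would get.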

Member


@Virviil Actually, does ds.step mutate the tokenizer itself? Maybe there's no need for that particular clone in the first place.

@Virviil
Contributor Author

Virviil commented Apr 28, 2025

@jonatanklosko It was indeed possible to remove the cloning.
Check 703f0e7 for the diff.
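The idea of dropping the per-step clone can be sketched as follows. This is a hypothetical illustration with stand-in types, not the code from the commit: the step borrows the tokenizer through a shared reference and mutates only the stream state behind the lock, so nothing is cloned per call:

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for the real resource types.
struct Tokenizer {
    vocab: Vec<String>,
}
struct StreamState {
    decoded: String,
}

// Borrow the tokenizer instead of cloning it on every step.
fn step(tokenizer: &Tokenizer, state: &mut StreamState, id: usize) -> Option<String> {
    let piece = tokenizer.vocab.get(id)?.clone();
    state.decoded.push_str(&piece);
    Some(piece)
}

fn main() {
    let tokenizer = Arc::new(Tokenizer {
        vocab: vec!["This".into(), " is".into(), " an".into(), " example".into()],
    });
    let state = Arc::new(RwLock::new(StreamState { decoded: String::new() }));

    // Take the write guard on the mutable stream state; the tokenizer is
    // only ever read, so a shared borrow of the Arc's contents suffices.
    let mut guard = state.write().unwrap();
    assert_eq!(step(&tokenizer, &mut guard, 0).as_deref(), Some("This"));
    assert_eq!(step(&tokenizer, &mut guard, 1).as_deref(), Some(" is"));
    assert_eq!(guard.decoded, "This is");
    println!("ok");
}
```

Only the small piece produced by each step is allocated; the tokenizer itself is never copied, which addresses the performance concern raised above.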

Member

@jonatanklosko jonatanklosko left a comment


It was indeed possible to remove the cloning.

@Virviil great!

@jonatanklosko jonatanklosko merged commit d812344 into elixir-nx:main Apr 29, 2025
3 of 5 checks passed