Add decode stream #64


Merged
merged 12 commits into elixir-nx:main on Apr 29, 2025

Conversation

Virviil
Contributor

@Virviil Virviil commented Apr 18, 2025

Bump the version of tokenizers.

Brings new functionality; mostly a port of https://github.com/huggingface/tokenizers/pull/1678/files.

@doc """
Steps through the decode stream with the given tokenizer and token ID.

Returns `{:ok, string}` if there's a decoded string, or `{:ok, nil}` if there's nothing more to decode.
Contributor


I would likely return :done when there is nothing more to decode.

Contributor Author

@Virviil Virviil Apr 19, 2025


There is a problem here. While it looks like a stream process, it actually isn't: the stream is never exhausted. One can call
step(12345678) and then call step(0) - that's valid from the original library's perspective. One can also call step(n) multiple times (and this is not even idempotent).

https://docs.rs/tokenizers/0.21.1/tokenizers/tokenizer/struct.DecodeStream.html

use tokenizers::Tokenizer;
let tokenizer = Tokenizer::from_file("data/roberta.json").unwrap();

let mut decode_stream = tokenizer.decode_stream(false);
assert_eq!(decode_stream.step(713).unwrap(), Some("This".to_string()));
assert_eq!(decode_stream.step(16).unwrap(), Some(" is".to_string()));
assert_eq!(decode_stream.step(41).unwrap(), Some(" an".to_string()));
assert_eq!(
    decode_stream.step(1246).unwrap(),
    Some(" example".to_string())
);

From this perspective, :done looks inappropriate: it implies "last action" semantics, which this is not.

Contributor


Maybe :out_of_range?

Contributor Author


Well:

Enum.at/2 returns nil for out-of-range indices.
Erlang's libraries (like :array) raise ArgumentError.

I've never seen :out_of_range anywhere across the ecosystem, so it may not fit in well.

Contributor


Enum.at does not wrap it in a tuple, though. Enum.fetch, which returns {:ok, _} or :error, is a better example, but it has only two states. My understanding is that this code has three states:

  1. In range
  2. Out of range
  3. Error when decoding

So you need three states accordingly. I don't mind what we call it but I'd say the three states should be distinct.
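To make the three distinct states concrete, here is a minimal Rust sketch (all names here are hypothetical, not the actual NIF types) modeling the step outcome as an enum, with a toy vocabulary lookup standing in for real decoding:

```rust
// Hypothetical model of the three step outcomes discussed above.
#[derive(Debug, PartialEq)]
enum StepResult {
    Decoded(String), // 1. in range: a chunk of text was produced
    OutOfRange,      // 2. out of range: token id not in the vocabulary
    Error(String),   // 3. decoding failed for some other reason
}

// Toy stand-in for a real decode step: look the id up in a flat vocabulary.
fn step(vocab: &[&str], id: usize) -> StepResult {
    match vocab.get(id) {
        Some(piece) => StepResult::Decoded((*piece).to_string()),
        None => StepResult::OutOfRange,
    }
}

fn main() {
    let vocab = ["This", " is", " an", " example"];
    assert_eq!(step(&vocab, 0), StepResult::Decoded("This".to_string()));
    assert_eq!(step(&vocab, 99), StepResult::OutOfRange);
    println!("ok");
}
```

Keeping OutOfRange separate from Error lets the caller distinguish "nothing to emit for this id" from a genuine failure, which is the distinction argued for above.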

Contributor Author


Switched to :out_of_range

Comment on lines 31 to 34
case Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id) do
{:ok, result} -> {:ok, result}
{:error, reason} -> {:error, reason}
end
Contributor


Suggested change
case Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id) do
{:ok, result} -> {:ok, result}
{:error, reason} -> {:error, reason}
end
Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id)

Contributor


Is it actually efficient to go step by step with specific indexes? Or does the upstream code work better by calling something like .next? Is this something we should benchmark first?

Contributor Author


In this PR I translated the API as is. Some abstraction can definitely be written on top of it using Stream, but that seems to be part of a separate effort.

Decoding large chunks this way looks better for the BEAM because control is given back from the NIF. It would be possible to write the decoding without a dirty scheduler, but I'm not sure anybody will invest in that.

Contributor


Got it! Thank you!

Contributor

@josevalim josevalim left a comment


Thank you, I have dropped some comments. :)

@Virviil Virviil requested a review from josevalim April 19, 2025 08:43
@Virviil
Contributor Author

Virviil commented Apr 21, 2025

@jonatanklosko Can you review please?

Comment on lines 97 to 98
let tk = tokenizer.resource.0.clone();
let mut ds = decode_stream.resource.inner.write().unwrap();
Member


If each step involves cloning, it may result in bad performance, even compared to decoding at once.

This may actually be a good reason to make the public API be a more high-level stream. For example, DecodeStream.new(...) can return %DecodeStream{} and it would implement Enumerable API, where the step NIF is called internally and we don't need to clone multiple times, since intermediate resources are not exposed to the user.

Thoughts?

Member


Gah, the ids need to be an input, so it's not so much Enumerable as Stream.transform on top of a stream of ids. But I can see such an API being cumbersome when the user doesn't want to block :<
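The shape being described - a stream of ids threaded through mutable decoder state, like Stream.transform - can be sketched in Rust as a stateful iterator adapter. Everything below is a hypothetical illustration with a toy vocabulary, not the actual crate API:

```rust
// Hypothetical sketch: thread mutable decoder state through a stream of ids,
// analogous to Elixir's Stream.transform over an id stream.
struct DecodeStream<'a, I: Iterator<Item = usize>> {
    ids: I,                // the incoming stream of token ids
    vocab: &'a [&'a str],  // toy vocabulary standing in for a tokenizer
    buffer: String,        // accumulated text, mimicking stateful decoding
}

impl<'a, I: Iterator<Item = usize>> Iterator for DecodeStream<'a, I> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Pull the next id; the stream ends when the ids end (or an id is unknown).
        let id = self.ids.next()?;
        let piece = self.vocab.get(id)?;
        self.buffer.push_str(piece);
        Some((*piece).to_string())
    }
}

fn main() {
    let vocab = ["This", " is", " an", " example"];
    let ds = DecodeStream {
        ids: [0usize, 1, 2, 3].into_iter(),
        vocab: &vocab,
        buffer: String::new(),
    };
    let pieces: Vec<String> = ds.collect();
    assert_eq!(pieces.join(""), "This is an example");
    println!("ok");
}
```

Because the state lives inside the adapter and is never handed to the caller between steps, nothing needs to be cloned per step - which is the same property a higher-level Elixir stream wrapper would get.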

Member


@Virviil Actually, does ds.step mutate the tokenizer itself? Maybe there's no need for that particular clone in the first place.

@Virviil
Contributor Author

Virviil commented Apr 28, 2025

@jonatanklosko It was indeed possible to remove the cloning.
Check 703f0e7 for the diff.
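The idea of dropping the per-step clone can be sketched as follows. This is a hypothetical illustration with stand-in types, not the code from the commit: the step borrows the tokenizer through a shared reference and mutates only the stream state behind the lock, so nothing is cloned per call:

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for the real resource types.
struct Tokenizer {
    vocab: Vec<String>,
}
struct StreamState {
    decoded: String,
}

// Borrow the tokenizer instead of cloning it on every step.
fn step(tokenizer: &Tokenizer, state: &mut StreamState, id: usize) -> Option<String> {
    let piece = tokenizer.vocab.get(id)?.clone();
    state.decoded.push_str(&piece);
    Some(piece)
}

fn main() {
    let tokenizer = Arc::new(Tokenizer {
        vocab: vec!["This".into(), " is".into(), " an".into(), " example".into()],
    });
    let state = Arc::new(RwLock::new(StreamState { decoded: String::new() }));

    // Take the write guard on the mutable stream state; the tokenizer is
    // only ever read, so a shared borrow of the Arc's contents suffices.
    let mut guard = state.write().unwrap();
    assert_eq!(step(&tokenizer, &mut guard, 0).as_deref(), Some("This"));
    assert_eq!(step(&tokenizer, &mut guard, 1).as_deref(), Some(" is"));
    assert_eq!(guard.decoded, "This is");
    println!("ok");
}
```

Only the small piece produced by each step is allocated; the tokenizer itself is never copied, which addresses the performance concern raised above.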

Member

@jonatanklosko jonatanklosko left a comment


It was indeed possible to remove the cloning.

@Virviil great!

@jonatanklosko jonatanklosko merged commit d812344 into elixir-nx:main Apr 29, 2025
3 of 5 checks passed