Issue
Currently we make a few assumptions in JetStream that hinder its generalization (i.e., supporting a wider variety of models).
Examples:
- Using torch.Tensor to hold the data, which is not jax-pytreeable (see the sketch below).
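A minimal sketch of why that matters (not JetStream code; the toy decode-state dict is hypothetical): a structure whose leaves are torch.Tensor cannot be passed through JAX transforms such as jax.jit, while the same structure built from jax.numpy arrays can.

```python
import jax
import jax.numpy as jnp
import torch


@jax.jit
def bump(state):
    # jit flattens `state` as a pytree; every leaf must be a JAX-compatible array.
    return jax.tree_util.tree_map(lambda x: x + 1, state)


jax_state = {"tokens": jnp.zeros((4,), dtype=jnp.int32)}
bump(jax_state)  # fine: jnp arrays are valid leaves

torch_state = {"tokens": torch.zeros(4, dtype=torch.int32)}
bump(torch_state)  # fails: torch.Tensor is not a valid JAX type
```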
Proposal:
- EngineAPI.get_tokenizer, which returns the tokenizer, should be allowed to return any object that implements the expected tokenizer interface, and uses of the tokenizer should be restricted to only those interface methods. In particular, encode should do both encoding and padding, so JetStream doesn't do any padding itself; the engine can choose how to pad (or not to pad) by returning a custom tokenizer object whose encode also does the padding (a sketch follows this list).
- ResultTokens: same as Prefix and DecodeState. Implementations of the Engine can choose the implementation of ResultTokens; JetStream should interact with it only through its 3 public methods (https://github.com/google/JetStream/blob/main/jetstream/engine/engine_api.py#L83) and is not allowed to access its fields directly (a second sketch follows this list).
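To make the tokenizer point concrete, here is a minimal sketch of an engine-provided tokenizer whose encode also pads. The class name, constructor arguments, and the (padded_ids, true_length) return shape are illustrative assumptions, not the actual JetStream tokenizer interface:

```python
from typing import Sequence

import numpy as np


class PaddingTokenizer:
    """Hypothetical engine-side tokenizer: padding happens inside encode()."""

    def __init__(self, vocab, pad_id: int = 0, max_length: int = 1024):
        self._vocab = vocab          # underlying tokenizer (e.g. a SentencePiece model)
        self._pad_id = pad_id
        self._max_length = max_length

    def encode(self, text: str) -> tuple[np.ndarray, int]:
        """Encodes and pads, so JetStream never has to pad on its own."""
        ids = self._vocab.encode(text)
        true_length = min(len(ids), self._max_length)
        padded = np.full((self._max_length,), self._pad_id, dtype=np.int32)
        padded[:true_length] = ids[:true_length]
        return padded, true_length

    def decode(self, ids: Sequence[int]) -> str:
        return self._vocab.decode(list(ids))
```

An engine that wants a different padding scheme (or no padding at all) simply returns a different object from EngineAPI.get_tokenizer; JetStream only ever calls these methods.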
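Similarly, a sketch of the intended calling convention for ResultTokens: the consumer goes through the public methods and never reads internal fields. The method names below (copy_to_host_async, convert_to_numpy, get_result_at_slot) are taken from the engine_api.py linked above, but the exact signatures and the drain_results helper are assumptions for illustration:

```python
def drain_results(result_tokens, live_slots):
    """Consume generated tokens without touching ResultTokens' internal fields."""
    result_tokens.copy_to_host_async()       # start the device-to-host copy early
    host_tokens = result_tokens.convert_to_numpy()
    per_slot = {}
    for slot in live_slots:
        # Only the public accessor is used; fields such as .data or .tokens_idx
        # stay implementation details the engine is free to change.
        per_slot[slot] = host_tokens.get_result_at_slot(slot)
    return per_slot
```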