Rename the "float" compute type to "float32" for clarity. "float" is still accepted for backward compatibility.
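The backward-compatibility behavior can be pictured as a simple alias table. This is a hedged sketch with hypothetical names (`COMPUTE_TYPE_ALIASES`, `normalize_compute_type`), not CTranslate2's actual internals:

```python
# Hypothetical sketch: legacy compute type names map to their current
# equivalents, so "float" keeps working after the rename to "float32".
COMPUTE_TYPE_ALIASES = {"float": "float32"}

def normalize_compute_type(name):
    """Return the current name for a compute type, resolving legacy aliases."""
    return COMPUTE_TYPE_ALIASES.get(name, name)
```

With this mapping, `normalize_compute_type("float")` yields `"float32"`, while names that were never renamed pass through unchanged.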
New features
Add the environment variable CT2_CUDA_TRUE_FP16_GEMM. This flag is enabled by default so that FP16 GEMMs run entirely in FP16. When it is disabled, FP16 GEMMs use an FP32 compute type, which is what PyTorch and TensorFlow do by default.
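The precision difference between the two modes can be illustrated with a dot product (the core of a GEMM). The sketch below simulates FP16 accumulation by rounding the running sum to float16 after every step, versus accumulating the same FP16 products in float32; it is a NumPy illustration of the general effect, not CTranslate2's GPU code path:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float16)
b = rng.standard_normal(1024).astype(np.float16)

# "True FP16" style: the running sum is rounded to float16 at every step.
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)

# FP32 compute type: the same FP16 products, accumulated in float32.
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc32 = acc32 + np.float32(x) * np.float32(y)

# Float64 reference to measure both errors against.
ref = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
err16 = abs(float(acc16) - ref)
err32 = abs(float(acc32) - ref)
```

Because each FP16 product fits exactly in a float32, the FP32 accumulator only suffers tiny summation rounding, while the FP16 accumulator loses precision at every step; `err32` comes out well below `err16`.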
Fixes and improvements
Improve the numerical precision of Whisper models running in FP16 by setting the FP32 compute type for GEMMs (same behavior as PyTorch)
Improve support for running the Whisper models with INT16 quantization
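For readers unfamiliar with INT16 quantization, a minimal symmetric per-tensor scheme looks like the sketch below. The function names are hypothetical and the scheme CTranslate2 actually uses may differ (for example, per-row scales):

```python
import numpy as np

def quantize_int16(weights):
    """Symmetric per-tensor INT16 quantization sketch."""
    # One scale maps the largest-magnitude weight to the int16 extreme.
    scale = np.max(np.abs(weights)) / 32767.0
    q = np.round(weights / scale).astype(np.int16)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from int16 values and the scale."""
    return q.astype(np.float32) * scale
```

The round-trip error of each weight is bounded by about half the scale, which is why INT16 preserves model quality far better than INT8 at the cost of larger storage.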
Ensure the Whisper decoding does not continue past max_length, which could previously happen when the prompt was longer than max_length/2
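The corrected length bookkeeping can be pictured as a simple budget: the prompt and the generated tokens together must never exceed max_length. This helper is hypothetical, written only to illustrate the invariant, and is not the actual CTranslate2 code:

```python
def tokens_left(max_length, prompt_length, generated_length):
    # Invariant after the fix: decoding stops once
    # prompt_length + generated_length reaches max_length,
    # no matter how long the prompt is.
    return max(0, max_length - prompt_length - generated_length)
```

With max_length=448 and a 300-token prompt (longer than max_length/2), only 148 tokens may be generated before decoding must stop.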
Include the EOS score in the score returned by Whisper during greedy search