# Model Updates

> [!NOTE]
> Please refer to the front-page README for the latest verified release for each model.

## December 2, 2024

- Improved the decode performance of the 1B/3B/8B/11B text models (for 8B, throughput increased from ~23 to ~28 tokens per second per user, t/s/u) by using BFP4 weights (instead of BFP8) for FF1 and FF3 in the MLP.
- Added the option to specify custom model configurations, with two defaults for performance and accuracy already provided (see the configuration sketch below).
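
A minimal sketch of what choosing between the two defaults might look like. The dictionary keys here are hypothetical, not the repo's actual configuration schema; `ttnn.bfloat4_b` and `ttnn.bfloat8_b` are TT-NN's real block-float dtypes.

```python
import ttnn

# Hypothetical configuration sketch -- key names are illustrative only.
# BFP4 halves the MLP weight footprint relative to BFP8, trading a little
# accuracy for decode throughput.
PERFORMANCE = {
    "FF1_FF3_DTYPE": ttnn.bfloat4_b,  # block-float 4-bit MLP weights
}
ACCURACY = {
    "FF1_FF3_DTYPE": ttnn.bfloat8_b,  # block-float 8-bit MLP weights
}

model_config = PERFORMANCE  # or ACCURACY, or a user-supplied override
```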

## November 18, 2024

- Created a new shared codebase for the Llama3 family of models, adding support for Llama3.2-1B/3B/11B.
- Added support for the `ttnn.experimental.rotary_embedding_llama` op in decode mode, eliminating unnecessary device transfers of rotation matrices (see the usage sketch below).
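
A hedged usage sketch: only the op name comes from this changelog; the argument names and exact signature are assumptions made to illustrate the idea that the rotary tables stay resident on device, so nothing is re-transferred each decode step.

```python
# Assumed signature -- check the TT-NN docs for the real one.
q_rotated = ttnn.experimental.rotary_embedding_llama(
    q,                     # query tensor for the current decode step
    cos_cache, sin_cache,  # precomputed rotary tables, kept on device
    trans_mat,             # transformation matrix used by the fused op
    is_decode_mode=True,
)
```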

## October 21, 2024

- Enabled prefill workloads to pad to multiples of 1024 instead of powers of 2, improving overall performance for longer sequences (see the comparison below).
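
The effect is easy to see numerically; a quick pure-Python comparison of the two padding rules:

```python
import math

def padded_len_pow2(seq_len: int) -> int:
    """Old scheme: round up to the next power of two."""
    return 1 << math.ceil(math.log2(seq_len))

def padded_len_1024(seq_len: int) -> int:
    """New scheme: round up to the next multiple of 1024."""
    return math.ceil(seq_len / 1024) * 1024

# A 33,000-token prompt pads to 65,536 tokens under the old scheme but only
# 33,792 under the new one -- roughly half the wasted prefill work.
assert padded_len_pow2(33_000) == 65_536
assert padded_len_1024(33_000) == 33_792
```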

## October 7, 2024

- Added support for continuous batching.
- Added paged caching support for PagedAttention (a toy sketch of the bookkeeping follows this list).
- Added a demo which runs with TT-NN tracing (23 t/s/u decode on main).
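
Conceptually, paged caching treats the KV cache like virtual memory: a pool of fixed-size blocks plus a per-user page table. A toy sketch of that bookkeeping (illustrative only, not the repo's implementation):

```python
import torch

BLOCK_SIZE = 64      # tokens per KV block
NUM_BLOCKS = 1024    # size of the shared block pool
HEADS, HEAD_DIM = 8, 128

# One shared pool holds the K and V blocks for all users.
kv_pool = torch.zeros(NUM_BLOCKS, 2, HEADS, BLOCK_SIZE, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))
page_table: dict[int, list[int]] = {}  # user_id -> block indices, in order

def slot_for_token(user_id: int, pos: int) -> tuple[int, int]:
    """Return (block, offset) for token `pos`, allocating blocks on demand.
    Because blocks need not be contiguous, sequences can grow to long
    contexts, and finished users return their blocks to the pool."""
    blocks = page_table.setdefault(user_id, [])
    while pos // BLOCK_SIZE >= len(blocks):
        blocks.append(free_blocks.pop())
    return blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```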

## September 23, 2024

- Added support for 128K context length using PagedAttention.
- Added a continuous batching demo for running multiple batches of users consecutively.
- Added the option to enable TT-NN tracing (see the sketch below).
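
A hedged sketch of how tracing is typically used, assuming TT-NN's `begin_trace_capture` / `end_trace_capture` / `execute_trace` API; exact signatures may differ between releases, and `decode_step` is a hypothetical stand-in for one pass of the model's decode graph.

```python
# Capture the decode step once; subsequent iterations replay the recorded
# command stream, removing host-side op dispatch from the hot loop.
tid = ttnn.begin_trace_capture(device, cq_id=0)
logits = decode_step(tokens)  # ops are recorded instead of dispatched
ttnn.end_trace_capture(device, tid, cq_id=0)

for _ in range(max_new_tokens):
    ttnn.execute_trace(device, tid, cq_id=0, blocking=True)
```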

## September 9, 2024

> [!NOTE]
> This feature is available as of release v0.52.0-rc1.

- Added support for user prompts of any length up to 32k tokens.

## August 26, 2024

- Added a data parallel demo for a single Galaxy (32 chips).
- Refactored all modules and tests to use ttnn multi-device tensors (see the sketch at the end of this section).

> [!NOTE]
> This feature is available as of release v0.51.0-rc33.

- Added multi-batching support to the demo for running multiple batches of users consecutively.
- Improved end-to-end performance through optimizations to the attention mask in flash decoding.
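
A hedged sketch of data parallelism with ttnn multi-device tensors: the mesh-mapper classes come from TT-NN, but exact signatures may vary by release, and the 4x8 mesh shape simply models the Galaxy's 32 chips.

```python
import torch
import ttnn

mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(4, 8))  # 32-chip Galaxy

weights = torch.randn(4096, 4096)
batch = torch.randn(32, 1, 32, 4096)  # one user slice per chip

# Data parallel: replicate the weights to every chip, shard the batch.
w = ttnn.from_torch(
    weights, device=mesh_device, layout=ttnn.TILE_LAYOUT,
    mesh_mapper=ttnn.ReplicateTensorToMesh(mesh_device),
)
x = ttnn.from_torch(
    batch, device=mesh_device, layout=ttnn.TILE_LAYOUT,
    mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=0),
)
```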

## August 12, 2024

- Added support for flash decoding (a toy sketch of the split-and-merge math follows this list).
- Updated the demo to support multiple batches of users.
- Updated the demo to use the full prefill graph instead of processing the prompt one token at a time in decode mode.
- Added support for decode with a 32K context length using flash decoding.
- Fused the mixture-of-experts computation into a single operation using `ttnn.moe`.
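
Flash decoding splits the KV cache along the sequence axis so that attention for a single decode step can be parallelized, then merges the partial results with running log-sum-exp statistics. A toy PyTorch sketch of that merge (illustrative math only, not the TT-NN kernel):

```python
import torch

def flash_decode(q, k, v, chunk=2048):
    """Attend one query position against the KV cache chunk by chunk.
    q: [heads, d]; k, v: [seq, heads, d]."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    lse = torch.full(q.shape[:1], float("-inf"))  # per-head log-sum-exp
    for s in range(0, k.shape[0], chunk):
        kc, vc = k[s:s + chunk], v[s:s + chunk]
        scores = torch.einsum("hd,shd->hs", q, kc) * scale
        chunk_lse = torch.logsumexp(scores, dim=-1)
        chunk_out = torch.einsum("hs,shd->hd", torch.softmax(scores, dim=-1), vc)
        # Rescale the running result and the new chunk by their share of
        # the total softmax mass, so the merge equals full attention.
        new_lse = torch.logaddexp(lse, chunk_lse)
        out = (out * torch.exp(lse - new_lse).unsqueeze(-1)
               + chunk_out * torch.exp(chunk_lse - new_lse).unsqueeze(-1))
        lse = new_lse
    return out
```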

## July 29, 2024

- Added support for LLaMA 3.1 - 8B:
  - Runs fast prefill for sequence lengths of up to 512 tokens.
  - Supports a maximum context length of 8K tokens.
- Added support for LLaMA 3.1 70B (new scaled rotary position embeddings; see the sketch after this list):
  - Prefill and decode now support 8K context length with batch size 16.
  - Added prefill support for 4K context length, using scaled dot product attention.
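
The "scaled rotary position embeddings" refer to Llama 3.1's RoPE frequency scaling, which stretches the low-frequency components so the model extrapolates past Llama 3's 8K training context. A sketch following Meta's published reference code:

```python
import math

def apply_scaling(freqs: list[float]) -> list[float]:
    """Llama 3.1 RoPE frequency scaling: high-frequency components pass
    through unchanged, low-frequency ones are divided by the scale factor,
    and a smooth ramp blends the band in between."""
    scale_factor, low_freq_factor, high_freq_factor = 8.0, 1.0, 4.0
    old_context_len = 8192  # Llama 3's original training context
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    scaled = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:    # high frequency: unchanged
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:   # low frequency: fully scaled
            scaled.append(freq / scale_factor)
        else:                              # smooth interpolation in between
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return scaled
```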