# Model Updates

> [!NOTE]
> Please refer to the front-page README for the latest verified release for each model.

## December 2, 2024

- Improved the decode performance of the 1B/3B/8B/11B text models (for 8B, throughput increased from ~23 to ~28 tokens per second per user, t/s/u) by using BFP4 weights (instead of BFP8) for FF1 and FF3 in the MLP.
- Added the option to specify custom model configurations, with two defaults for performance and accuracy already provided (see the configuration sketch below).
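
A minimal sketch of what choosing between the two defaults might look like. The dictionary keys here are hypothetical, not the repo's actual configuration schema; `ttnn.bfloat4_b` and `ttnn.bfloat8_b` are TT-NN's real block-float dtypes.

```python
import ttnn

# Hypothetical configuration sketch -- key names are illustrative only.
# BFP4 halves the MLP weight footprint relative to BFP8, trading a little
# accuracy for decode throughput.
PERFORMANCE = {
    "FF1_FF3_DTYPE": ttnn.bfloat4_b,  # block-float 4-bit MLP weights
}
ACCURACY = {
    "FF1_FF3_DTYPE": ttnn.bfloat8_b,  # block-float 8-bit MLP weights
}

model_config = PERFORMANCE  # or ACCURACY, or a user-supplied override
```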

## November 18, 2024

- Created a new shared codebase for the Llama3 family of models, adding support for Llama3.2-1B/3B/11B.
- Added support for the `ttnn.experimental.rotary_embedding_llama` op in decode mode, eliminating unnecessary device transfers of rotation matrices (see the usage sketch below).
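
A hedged usage sketch: only the op name comes from this changelog; the argument names and exact signature are assumptions made to illustrate the idea that the rotary tables stay resident on device, so nothing is re-transferred each decode step.

```python
# Assumed signature -- check the TT-NN docs for the real one.
q_rotated = ttnn.experimental.rotary_embedding_llama(
    q,                     # query tensor for the current decode step
    cos_cache, sin_cache,  # precomputed rotary tables, kept on device
    trans_mat,             # transformation matrix used by the fused op
    is_decode_mode=True,
)
```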

## October 21, 2024

- Enabled prefill workloads to pad to multiples of 1024 instead of powers of 2, improving overall performance for longer sequences (see the comparison below).
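
The effect is easy to see numerically; a quick pure-Python comparison of the two padding rules:

```python
import math

def padded_len_pow2(seq_len: int) -> int:
    """Old scheme: round up to the next power of two."""
    return 1 << math.ceil(math.log2(seq_len))

def padded_len_1024(seq_len: int) -> int:
    """New scheme: round up to the next multiple of 1024."""
    return math.ceil(seq_len / 1024) * 1024

# A 33,000-token prompt pads to 65,536 tokens under the old scheme but only
# 33,792 under the new one -- roughly half the wasted prefill work.
assert padded_len_pow2(33_000) == 65_536
assert padded_len_1024(33_000) == 33_792
```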

## October 7, 2024

- Added support for continuous batching.
- Added paged caching support for PagedAttention (a toy sketch of the bookkeeping follows this list).
- Added a demo which runs with TT-NN tracing (23 t/s/u decode on main).
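
Conceptually, paged caching treats the KV cache like virtual memory: a pool of fixed-size blocks plus a per-user page table. A toy sketch of that bookkeeping (illustrative only, not the repo's implementation):

```python
import torch

BLOCK_SIZE = 64      # tokens per KV block
NUM_BLOCKS = 1024    # size of the shared block pool
HEADS, HEAD_DIM = 8, 128

# One shared pool holds the K and V blocks for all users.
kv_pool = torch.zeros(NUM_BLOCKS, 2, HEADS, BLOCK_SIZE, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))
page_table: dict[int, list[int]] = {}  # user_id -> block indices, in order

def slot_for_token(user_id: int, pos: int) -> tuple[int, int]:
    """Return (block, offset) for token `pos`, allocating blocks on demand.
    Because blocks need not be contiguous, sequences can grow to long
    contexts, and finished users return their blocks to the pool."""
    blocks = page_table.setdefault(user_id, [])
    while pos // BLOCK_SIZE >= len(blocks):
        blocks.append(free_blocks.pop())
    return blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```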

## September 23, 2024

- Added support for 128K context length using PagedAttention.
- Added a continuous batching demo for running multiple batches of users consecutively.
- Added the option to enable TT-NN tracing (see the sketch below).
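
A hedged sketch of how tracing is typically used, assuming TT-NN's `begin_trace_capture` / `end_trace_capture` / `execute_trace` API; exact signatures may differ between releases, and `decode_step` is a hypothetical stand-in for one pass of the model's decode graph.

```python
# Capture the decode step once; subsequent iterations replay the recorded
# command stream, removing host-side op dispatch from the hot loop.
tid = ttnn.begin_trace_capture(device, cq_id=0)
logits = decode_step(tokens)  # ops are recorded instead of dispatched
ttnn.end_trace_capture(device, tid, cq_id=0)

for _ in range(max_new_tokens):
    ttnn.execute_trace(device, tid, cq_id=0, blocking=True)
```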

## September 9, 2024

> [!NOTE]
> This feature is available as of release v0.52.0-rc1.

- Added support for user prompts of any length up to 32k tokens.

## August 26, 2024

- Added a data parallel demo for a single Galaxy (32 chips).
- Refactored all modules and tests to use ttnn multi-device tensors (see the sketch at the end of this section).

> [!NOTE]
> This feature is available as of release v0.51.0-rc33.

- Added multi-batching support to the demo for running multiple batches of users consecutively.
- Improved end-to-end performance through optimizations to the attention mask in flash decoding.
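
A hedged sketch of data parallelism with ttnn multi-device tensors: the mesh-mapper classes come from TT-NN, but exact signatures may vary by release, and the 4x8 mesh shape simply models the Galaxy's 32 chips.

```python
import torch
import ttnn

mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(4, 8))  # 32-chip Galaxy

weights = torch.randn(4096, 4096)
batch = torch.randn(32, 1, 32, 4096)  # one user slice per chip

# Data parallel: replicate the weights to every chip, shard the batch.
w = ttnn.from_torch(
    weights, device=mesh_device, layout=ttnn.TILE_LAYOUT,
    mesh_mapper=ttnn.ReplicateTensorToMesh(mesh_device),
)
x = ttnn.from_torch(
    batch, device=mesh_device, layout=ttnn.TILE_LAYOUT,
    mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=0),
)
```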

## August 12, 2024

- Added support for flash decoding (a toy sketch of the split-and-merge math follows this list).
- Updated the demo to support multiple batches of users.
- Updated the demo to use the full prefill graph instead of processing the prompt one token at a time in decode mode.
- Added support for decode with a 32K context length using flash decoding.
- Fused the mixture-of-experts computation into a single operation using `ttnn.moe`.
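
Flash decoding splits the KV cache along the sequence axis so that attention for a single decode step can be parallelized, then merges the partial results with running log-sum-exp statistics. A toy PyTorch sketch of that merge (illustrative math only, not the TT-NN kernel):

```python
import torch

def flash_decode(q, k, v, chunk=2048):
    """Attend one query position against the KV cache chunk by chunk.
    q: [heads, d]; k, v: [seq, heads, d]."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    lse = torch.full(q.shape[:1], float("-inf"))  # per-head log-sum-exp
    for s in range(0, k.shape[0], chunk):
        kc, vc = k[s:s + chunk], v[s:s + chunk]
        scores = torch.einsum("hd,shd->hs", q, kc) * scale
        chunk_lse = torch.logsumexp(scores, dim=-1)
        chunk_out = torch.einsum("hs,shd->hd", torch.softmax(scores, dim=-1), vc)
        # Rescale the running result and the new chunk by their share of
        # the total softmax mass, so the merge equals full attention.
        new_lse = torch.logaddexp(lse, chunk_lse)
        out = (out * torch.exp(lse - new_lse).unsqueeze(-1)
               + chunk_out * torch.exp(chunk_lse - new_lse).unsqueeze(-1))
        lse = new_lse
    return out
```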

## July 29, 2024

- Added support for LLaMA 3.1 - 8B:
  - Runs fast prefill for sequence lengths of up to 512 tokens.
  - Supports a maximum context length of 8K tokens.
- Added support for LLaMA 3.1 70B (new scaled rotary position embeddings; see the sketch after this list):
  - Prefill and decode now support 8K context length with batch size 16.
  - Added prefill support for 4K context length, using scaled dot product attention.
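
The "scaled rotary position embeddings" refer to Llama 3.1's RoPE frequency scaling, which stretches the low-frequency components so the model extrapolates past Llama 3's 8K training context. A sketch following Meta's published reference code:

```python
import math

def apply_scaling(freqs: list[float]) -> list[float]:
    """Llama 3.1 RoPE frequency scaling: high-frequency components pass
    through unchanged, low-frequency ones are divided by the scale factor,
    and a smooth ramp blends the band in between."""
    scale_factor, low_freq_factor, high_freq_factor = 8.0, 1.0, 4.0
    old_context_len = 8192  # Llama 3's original training context
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    scaled = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:    # high frequency: unchanged
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:   # low frequency: fully scaled
            scaled.append(freq / scale_factor)
        else:                              # smooth interpolation in between
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return scaled
```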