0.12.0 🚀

@b4rtaz released this 12 Feb 23:54

This version brings major changes to the project after a month-long refactor. The restructuring improves organization, making maintenance and future development significantly easier. You can find some details about the refactor in this pull request.

List of changes:

  • ✅ Introduced an abstract neural network model that uses opcodes to describe network behavior (see the sketch after this list)
  • ✅ Completely restructured the project
  • ✅ Added batch processing to support evaluation and prediction #138
  • ✅ Sped up the matmul operation for evaluation (using llamafile's sgemm)
  • ✅ Improved the tokenizer
  • ✅ Fixed obvious memory leaks (detected by -fsanitize=address)
  • ✅ ARM (NEON) and AVX2 optimizations for all opcodes
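
The opcode abstraction is only summarized above, so here is a minimal sketch of the idea: the network is described as a flat list of opcodes that a backend (NEON, AVX2, ...) executes with optimized kernels, which also makes it straightforward to push a whole batch of tokens through the same op list during evaluation. The names `OpCode`, `OpDesc` and `buildBlockSketch` are illustrative assumptions, not Distributed Llama's actual API.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical opcode set; the names are illustrative, not Distributed Llama's real ones.
enum class OpCode { RMS_NORM, MATMUL_Q40, ROPE, MULTIHEAD_ATT, SILU, MUL, ADD };

// One step of the abstract model: an opcode plus the buffers/weights it touches.
struct OpDesc {
    OpCode code;
    int input;   // index of the input buffer
    int weight;  // index of the weight slice, -1 if the op has no weights
    int output;  // index of the output buffer
};

// Describe (part of) a transformer block as a flat list of opcodes.
// A CPU backend walks this list and dispatches an optimized kernel per opcode;
// during evaluation each kernel can process a whole batch of tokens at once.
static std::vector<OpDesc> buildBlockSketch() {
    return {
        {OpCode::RMS_NORM,      0, -1, 1},
        {OpCode::MATMUL_Q40,    1,  0, 2}, // attention projection (q/k/v would be separate ops)
        {OpCode::ROPE,          2, -1, 2},
        {OpCode::MULTIHEAD_ATT, 2, -1, 3},
        {OpCode::MATMUL_Q40,    3,  1, 4}, // output projection
        {OpCode::ADD,           4, -1, 0}, // residual connection
    };
}

int main() {
    for (const OpDesc &op : buildBlockSketch())
        std::printf("op=%d in=%d w=%d out=%d\n", (int)op.code, op.input, op.weight, op.output);
    return 0;
}
```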

Fixes:

  • Fixed a bug in RoPE scaling (a reference sketch of the Llama 3.1 scaling follows this list)
  • Fixed a bug in the tokenizer that caused special tokens to be tokenized incorrectly
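
The RopeScaling line in the logs below (f=8.0, l=1.0, h=4.0, o=8192) corresponds to the published Llama 3.1 frequency scaling. As a reference for what the fixed code has to compute, here is a minimal standalone sketch of that formula; it mirrors the published Llama 3.1 reference, not Distributed Llama's actual implementation.

```cpp
#include <cmath>
#include <cstdio>

// Llama 3.1 RoPE frequency scaling with the parameters reported by the loader:
// factor f=8.0, lowFreqFactor l=1.0, highFreqFactor h=4.0, originalContextLength o=8192.
static float scaleRopeFreq(float freq, float factor, float lowFreqFactor,
                           float highFreqFactor, float origContextLength) {
    const float pi = 3.14159265358979323846f;
    float waveLen = 2.0f * pi / freq;
    float lowFreqWaveLen = origContextLength / lowFreqFactor;
    float highFreqWaveLen = origContextLength / highFreqFactor;
    if (waveLen < highFreqWaveLen)
        return freq;                  // high-frequency dims: unchanged
    if (waveLen > lowFreqWaveLen)
        return freq / factor;         // low-frequency dims: fully scaled
    // in between: smooth interpolation between scaled and unscaled frequency
    float smooth = (origContextLength / waveLen - lowFreqFactor) /
                   (highFreqFactor - lowFreqFactor);
    return (1.0f - smooth) * (freq / factor) + smooth * freq;
}

int main() {
    float ropeTheta = 500000.0f;
    int headDim = 128; // dim / nHeads = 4096 / 32
    for (int i = 0; i < headDim; i += 32) {
        float freq = std::pow(ropeTheta, -(float)i / headDim);
        std::printf("i=%3d freq=%.8f scaled=%.8f\n", i, freq,
                    scaleRopeFreq(freq, 8.0f, 1.0f, 4.0f, 8192.0f));
    }
    return 0;
}
```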

Tokenizer

The most important part of this change was delivering a stable version of Distributed Llama that can be used daily. Until now, this project was in the experimental stage and, due to multiple bugs, was not usable. This version focuses solely on the Llama 3 model family; other models are not supported at the moment. Llama 2 models may still work, but they are not a priority right now.

The tokenizer for Llama 3 has been improved, and the API and chat mode now function correctly.
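
As a reference for what chat mode has to produce, below is a minimal sketch that assembles a prompt in the standard Llama 3 Instruct chat format. The special token strings are the published Llama 3 ones (the session below shows <|begin_of_text|> as id 128000 and <|eot_id|> as the stop token); the helper itself is only an illustration, not Distributed Llama's internal API.

```cpp
#include <cstdio>
#include <string>

// Build a prompt in the standard Llama 3 Instruct chat format.
// <|begin_of_text|>, <|start_header_id|>, <|end_header_id|> and <|eot_id|>
// are special tokens handled by the tokenizer; generation stops on <|eot_id|>.
static std::string buildLlama3Prompt(const std::string &systemPrompt,
                                     const std::string &userMessage) {
    std::string p = "<|begin_of_text|>";
    if (!systemPrompt.empty()) {
        p += "<|start_header_id|>system<|end_header_id|>\n\n";
        p += systemPrompt + "<|eot_id|>";
    }
    p += "<|start_header_id|>user<|end_header_id|>\n\n";
    p += userMessage + "<|eot_id|>";
    p += "<|start_header_id|>assistant<|end_header_id|>\n\n";
    return p;
}

int main() {
    std::printf("%s\n", buildLlama3Prompt("", "hello").c_str());
    return 0;
}
```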

Chat Example

To paste the output here, I had to replace "```" with "` ` `".

📄 BosId: 128000 (<|begin_of_text|>)
📄 EosId: 128001 (<|end_of_text|>)
📄 ChatEosId: 128009 (<|end_of_text|>)
📄 RegularVocabSize: 128000
📄 SpecialVocabSize: 256
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 4096
💡 HiddenDim: 14336
💡 VocabSize: 128256
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 4096
💡 NormEpsilon: 0.000010
💡 RopeType: Llama3.1
💡 RopeTheta: 500000
💡 RopeScaling: f=8.0, l=1.0, h=4.0, o=8192
📀 RequiredMemory: 7329040 kB
🧠 CPU: neon dotprod fp16
💿 Weights loaded
⭐ Chat template: llama3
🛑 Stop: <|eot_id|>
💻 System prompt (optional): 

👱 User
> hello

🤖 Assistant
Hello! How can I help you today?
👱 User
> what is 1+1?

🤖 Assistant
 The answer is:

2!
👱 User
> could you implement a simple application that calculates a sum provided by a user? c++ please. the input is stdin.

🤖 Assistant
Here's a simple C++ program that calculates the sum of two numbers provided by the user:

` ` `
#include <iostream>

int main() {
    int num1, num2, sum;

    // Get input from user
    std::cout << "Enter the first number: ";
    std::cin >> num1;

    std::cout << "Enter the second number: ";
    std::cin >> num2;

    // Calculate sum
    sum = num1 + num2;

    // Print result
    std::cout << "The sum is: " << sum << std::endl;

    return 0;
}
` ` `

This program uses the `std::cin` function to read input from the user, and the `std::cout` function to print the result.

If you want to make the program more robust, you can add error handling to ensure that the user enters valid input. For example, you can use a loop to keep asking for input until the user enters a valid number.

Here's an example of how you could add error handling:

` ` `
#include <iostream>

int main() {
    int num1, num2, sum;

    // Get input from user
    while (!(std::cin >> num1)) {
        std::cout << "Invalid input. Please enter a valid number: ";
        std::cin.clear();
        std::cin.ignore(10000, '\n');
    }

    while (!(std::cin >> num2)) {
        std::cout << "Invalid input. Please enter a valid number: ";
        std::cin.clear();
        std::cin.ignore(10000, '\n');
    }

    // Calculate sum
    sum = num1 + num2;

    // Print result
    std::cout << "The sum is: " << sum << std::endl;

    return 0;
}
` ` `

In this version of the program, the `while` loop keeps asking for input until the user enters a valid number. The `std::cin.clear()` and `std::cin.ignore(10000, '\n')` statements are used to clear the input buffer and ignore any remaining input.
👱 User
> 

🚨 You need to re-download the models and tokenizers using launch.py.

Performance

This version introduces significant performance improvements on the CPU. All operations are optimized for NEON and AVX2. The most important change is that inference is now split into evaluation (processing the prompt tokens in batches) and prediction (generating tokens one by one). Evaluation is much faster than prediction because it can apply the SGEMM operation to a whole batch of tokens.
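
To illustrate why batched evaluation is so much faster, here is a simplified sketch of the two shapes of the core computation: prediction performs a matrix-vector multiply per generated token, while evaluation can run a single sgemm-style matrix-matrix multiply over all prompt tokens at once. This only shows the shape of the work; it is not the optimized NEON/AVX2 or llamafile sgemm code.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Prediction: one token at a time -> matrix-vector product (d x n weights, n-dim input).
static void matVec(const float *w, const float *x, float *y, size_t d, size_t n) {
    for (size_t i = 0; i < d; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < n; j++) sum += w[i * n + j] * x[j];
        y[i] = sum;
    }
}

// Evaluation: all prompt tokens at once -> matrix-matrix product (sgemm-like).
// Processing `batch` tokens per pass over the weights reuses each weight row many
// times, which is what makes batched evaluation much faster than per-token prediction.
// A real sgemm (as used in 0.12.0 via llamafile) would tile and vectorize this loop nest.
static void matMulBatch(const float *w, const float *x, float *y,
                        size_t d, size_t n, size_t batch) {
    for (size_t b = 0; b < batch; b++)
        matVec(w, x + b * n, y + b * d, d, n);
}

int main() {
    const size_t d = 64, n = 64, batch = 8;
    std::vector<float> w(d * n, 0.01f), x(batch * n, 1.0f), y(batch * d, 0.0f);
    matMulBatch(w.data(), x.data(), y.data(), d, n, batch);
    std::printf("y[0]=%f\n", y[0]);
    return 0;
}
```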

| Llama 3.1 8B Q40, MacBook M1 Pro 16 GB RAM | Evaluation | Prediction |
|---|---|---|
| Distributed Llama 0.11.2 | - | 10.28 tok/s |
| Distributed Llama 0.12.0 | 48.00 tok/s 🚀 | 19.70 tok/s |

On the Raspberry Pi 5 8GB, this version is not as fast as llama.cpp in evaluation, but it is slightly faster in prediction.

| Llama 3.1 8B Q40, Raspberry Pi 5 8GB | Evaluation | Prediction |
|---|---|---|
| llama.cpp 4667 | 12.52 tok/s | 2.03 tok/s |
| Distributed Llama 0.12.0 | 6.70 tok/s 🚀 | 2.47 tok/s |
Llama 3.1 8B Q40 - llama.cpp
build: 4667 (d2fe216f) with cc (Debian 12.2.0-14) 12.2.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 30 key-value pairs and 292 tensors from ../../../meta-llama-3.1-8b-instruct-q4_0.gguf?download=true (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:    2 tensors
llama_model_loader: - type q4_0:  224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 5.61 GiB (6.01 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta Llama 3.1 8B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:  CPU_AARCH64 model buffer size =  3744.00 MiB
load_tensors:   CPU_Mapped model buffer size =  5749.02 MiB
....................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2016
llama_init_from_model: n_ctx_per_seq = 2016
llama_init_from_model: n_batch       = 2016
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2016) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2016, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   252.00 MiB
llama_init_from_model: KV self size  =  252.00 MiB, K (f16):  126.00 MiB, V (f16):  126.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.49 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2016
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 1056835838
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2016
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 2016, n_batch = 2048, n_predict = 96, n_keep = 1

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing computation across multiple devices. Read more about our tensor parallelism technology here: https://www.tensorparallelism.com/
Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing computation across multiple devices. Read more about our tensor parallelism technology here: https://www.tensorparallelism.com/ #TensorParallelism #LLMs #AI
Tensor parallelism is all you need. Run LLMs on weak devices or

llama_perf_sampler_print:    sampling time =      15.19 ms /   121 runs   (    0.13 ms per token,  7964.19 tokens per second)
llama_perf_context_print:        load time =   44885.24 ms
llama_perf_context_print: prompt eval time =    1997.48 ms /    25 tokens (   79.90 ms per token,    12.52 tokens per second)
llama_perf_context_print:        eval time =   46819.21 ms /    95 runs   (  492.83 ms per token,     2.03 tokens per second)
llama_perf_context_print:       total time =   48993.09 ms /   120 tokens
Llama 3.1 8B Q40 - Distributed Llama
📄 BosId: 128000 (<|begin_of_text|>)
📄 EosId: 128009 (<|eot_id|>)
📄 ChatEosId: 128009 (<|eot_id|>)
📄 RegularVocabSize: 128000
📄 SpecialVocabSize: 256
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 4096
💡 HiddenDim: 14336
💡 VocabSize: 128256
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 1024
💡 NormEpsilon: 0.000010
💡 RopeType: Llama3.1
💡 RopeTheta: 500000
💡 RopeScaling: f=8.0, l=1.0, h=4.0, o=8192
📀 RequiredMemory: 6493072 kB
🧠 CPU: neon dotprod fp16
...
🔶 P  449 ms S     0 kB R     0 kB However
🔶 P  416 ms S     0 kB R     0 kB ,
🔶 P  421 ms S     0 kB R     0 kB  I
🔶 P  468 ms S     0 kB R     0 kB  can

Evaluation
   nBatches: 32
    nTokens: 24
   tokens/s: 6.70 (149.29 ms/tok)
Prediction
    nTokens: 40
   tokens/s: 2.47 (405.38 ms/tok)

A lot of the CPU optimizations were made by DeepSeek-R1. 🤖