Llama 2 is out #163
Replies: 6 comments
-
https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Some key info bits:
Not LLaMA related, but Microsoft also published a paper about a Transformer replacement yesterday, a new architecture they're calling Retentive Network (RetNet). The promises made are pretty wild. It's definitely worth reading.
-
So, it looks like LLaMA 2 13B is close enough to LLaMA 1 that ExLlama already works on it. I assume 7B works too but don't care enough to test. Here are a few benchmarks for 13B on a single 3090:

python test_benchmark_inference.py -d G:\models\Llama2-13B-128g-actorder-GPTQ\ -p -ppl gptq-for-llama -l 4096
python test_benchmark_inference.py -d G:\models\Llama2-13B-128g-actorder-GPTQ\ -p -ppl gptq-for-llama -a 6 -l 16384
python test_benchmark_inference.py -d G:\models\Llama2-13B-128g-actorder-GPTQ\ -p -ppl gptq-for-llama -a 8 -l 16384
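The -a flag in the longer runs sets the NTK RoPE scaling alpha. As a point of reference, here is a minimal sketch of the common NTK-aware recipe, which scales the rotary base by alpha^(d/(d-2)) for head dimension d rather than interpolating positions; this is the generic formulation, not necessarily ExLlama's exact implementation:

def ntk_scaled_inv_freq(head_dim=128, base=10000.0, alpha=1.0):
    # NTK-aware scaling: raise the rotary base so the longest wavelengths stretch
    # the most while the highest frequencies are barely affected.
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    return [1.0 / scaled_base ** (i / head_dim) for i in range(0, head_dim, 2)]

# alpha=1.0 reproduces vanilla RoPE; alpha=6 or 8 matches the 16384-context runs above
print(ntk_scaled_inv_freq(alpha=6.0)[:4])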
Subjectively, NTK RoPE scaling with alpha=4 loses coherence at some point before 16384 context. Next I tried alpha=8 (roughly what you'd want for 4x context with NTK v1 on LLaMA v1) and that's fine, and lastly alpha=6, which also seems to work, so perhaps NTK scaling behaves differently on the longer base context length of these models. Anyway, yeah, 70B is GQA with 64 heads and 8 groups:

config.json for 70B:
{
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 8192,
"initializer_range": 0.02,
"intermediate_size": 28672,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 8,
"pad_token_id": 0,
"rms_norm_eps": 1e-05,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.31.0.dev0",
"use_cache": true,
"vocab_size": 32000
}

A 4-bit, 128-group-size quant of 65B is 34.7 GB, while an equivalent quant of 70B is 36.7 GB. So to figure out how much context two 24 GB cards should fit with the new model, a close approximation would be to find out how much context you can currently fit using 65B with 2 GB of VRAM free, then multiply that number by 8, and you should be pretty close. I wonder what they've done for 34B; the old models had 32/40/52/64 heads for 7/13/33/65B respectively, while LLaMA 2 has 32/40/64 heads for 7/13/70B. 52 is not divisible by 8, which I think would make the math weird when applying GQA with 8 kv heads like 70B, so perhaps it's been bumped up slightly to 56?
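To put the "multiply by 8" estimate in concrete terms, here is a back-of-the-envelope FP16 KV-cache calculation per token of context, using the layer count, head counts and head dimension from the config above plus LLaMA 1 65B's 80 layers and 64 heads; it's plain arithmetic, not a measurement:

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values: one head_dim vector per layer per kv head, in FP16
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

llama1_65b = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
llama2_70b = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

print(llama1_65b / 2**20)  # 2.5 MiB per token with full MHA
print(llama2_70b / 2**20)  # 0.3125 MiB per token with 8 kv heads, i.e. 8x smaller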
-
GQA shouldn't be hard to add. I'm on it now. But the way they've done it in transformers is pretty wasteful, requiring reshaping the entire key/value cache for every token. That's a lot of large temporary tensors and a bunch of extra copying, which should be avoidable with broadcasting instead.
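For illustration, a minimal sketch of the difference, with made-up GQA shapes (40 query heads sharing 8 kv heads); this is just the idea of copying versus broadcasting, not the actual transformers or ExLlama code:

import torch

batch, n_heads, n_kv_heads, head_dim, seq = 1, 40, 8, 128, 4096
groups = n_heads // n_kv_heads

q = torch.randn(batch, n_heads, 1, head_dim)       # queries for the current token
k = torch.randn(batch, n_kv_heads, seq, head_dim)  # cached keys

# Copy-based: physically repeat the kv cache up to n_heads (a large temporary tensor)
k_rep = k.repeat_interleave(groups, dim=1)
scores_copy = q @ k_rep.transpose(-1, -2)

# Broadcast-based: group the query heads instead and let matmul broadcast over the group dim
q_grp = q.view(batch, n_kv_heads, groups, 1, head_dim)
scores_bcast = (q_grp @ k.unsqueeze(2).transpose(-1, -2)).view(batch, n_heads, 1, seq)

print(torch.allclose(scores_copy, scores_bcast))  # True: same scores, without repeating the cache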
-
@turboderp We are trying to implement GQA for the 13B LLaMA 2 model as well, in a bid to see if its memory usage can be optimised. In a naive experiment, we tried to change the num_key_value_heads setting in the config.
The reason we want to do this is that running in 4-bit severely affects the model's performance, and even when running in 8-bit, the memory usage of 13B pushes the limits of a single RTX 3090 with 24 GB of VRAM after a few inferences. Do you know what the correct way to implement GQA on the 13B models would be?
-
The model has to be trained with GQA from the beginning. I've heard of people merging the key and value projections successfully on a pretrained model, but it's a lot more involved than just changing the config. If you want to do this on the running model, i.e. without producing a new .safetensors file with the modified weights, you should probably start in Transformers, since ExLlama has many places where the same changes would need to be applied. Start in the attention function after the key and value projections are applied, then do whatever merging (averaging, I suppose) over the head dimension of the key and value state tensors. If you can make that work, you should be able to do the same to the weights and produce a GQA model that way. Of course, how well the model would perform after that kind of brain surgery, I have no idea.
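If it helps as a starting point, here's a rough, untested sketch of the weight-side version of that idea: mean-pooling a 13B layer's 40 key (or value) projection heads down to 8 shared heads, similar to the conversion step in the GQA paper. It assumes a standard Llama k_proj/v_proj weight of shape (n_heads * head_dim, hidden_size), and the choice of 8 groups just mirrors 70B; how much quality survives without further training is another question:

import torch

def mean_pool_kv_heads(weight, n_heads, n_kv_heads, head_dim):
    # weight: (n_heads * head_dim, hidden_size) k_proj or v_proj weight.
    # Averages each group of n_heads // n_kv_heads heads into one shared head.
    hidden_size = weight.shape[1]
    groups = n_heads // n_kv_heads
    w = weight.view(n_kv_heads, groups, head_dim, hidden_size)
    return w.mean(dim=1).reshape(n_kv_heads * head_dim, hidden_size)

# Hypothetical 13B layer: 40 heads, head_dim 128, hidden_size 5120
k_proj = torch.randn(40 * 128, 5120)
k_proj_gqa = mean_pool_kv_heads(k_proj, n_heads=40, n_kv_heads=8, head_dim=128)
print(k_proj_gqa.shape)  # torch.Size([1024, 5120])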
-
The original paper that introduced GQA was also largely focused on converting an existing MHA model.
-
https://ai.meta.com/llama/