Llama 2 is out #163
Replies: 6 comments
-
https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Some key info bits:
Not LLaMA related, but Microsoft also published a paper about a Transformer replacement yesterday, a new architecture they're calling Retentive Network (RetNet). The promises made are pretty wild. It's definitely worth reading.
-
So, it looks like LLaMA 2 13B is close enough to LLaMA 1 that ExLlama already works on it. I assume 7B works too but don't care enough to test. Here are a few benchmarks for 13B on a single 3090:

python test_benchmark_inference.py -d G:\models\Llama2-13B-128g-actorder-GPTQ\ -p -ppl gptq-for-llama -l 4096
python test_benchmark_inference.py -d G:\models\Llama2-13B-128g-actorder-GPTQ\ -p -ppl gptq-for-llama -a 6 -l 16384
python test_benchmark_inference.py -d G:\models\Llama2-13B-128g-actorder-GPTQ\ -p -ppl gptq-for-llama -a 8 -l 16384
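The -a flag in the longer runs sets the NTK RoPE scaling alpha. As a point of reference, here is a minimal sketch of the common NTK-aware recipe, which scales the rotary base by alpha^(d/(d-2)) for head dimension d rather than interpolating positions; this is the generic formulation, not necessarily ExLlama's exact implementation:

def ntk_scaled_inv_freq(head_dim=128, base=10000.0, alpha=1.0):
    # NTK-aware scaling: raise the rotary base so the longest wavelengths stretch
    # the most while the highest frequencies are barely affected.
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    return [1.0 / scaled_base ** (i / head_dim) for i in range(0, head_dim, 2)]

# alpha=1.0 reproduces vanilla RoPE; alpha=6 or 8 matches the 16384-context runs above
print(ntk_scaled_inv_freq(alpha=6.0)[:4])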
Subjectively, NTK RoPE scaling with alpha=4 loses coherence at some point before 16384 context. Next I tried alpha=8 (roughly what you'd want for 4x context with NTK v1 on LLaMA v1) and that's fine, and lastly alpha=6, which also seems to work, so perhaps NTK scaling behaves differently on the longer base context length of these models. Anyway, yeah, 70B is GQA with 64 heads and 8 groups:

config.json for 70B:
{
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 8192,
"initializer_range": 0.02,
"intermediate_size": 28672,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 8,
"pad_token_id": 0,
"rms_norm_eps": 1e-05,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.31.0.dev0",
"use_cache": true,
"vocab_size": 32000
}

A 4-bit, 128-group-size quant of 65B is 34.7 GB, while an equivalent quant of 70B is 36.7 GB. So to figure out how much context two 24 GB cards should fit with the new model, a close approximation would be to find out how much context you can currently fit using 65B with 2 GB of VRAM free, then multiply that number by 8, and you should be pretty close. I wonder what they've done for 34B; the old models had 32/40/52/64 heads for 7/13/33/65B respectively, while LLaMA 2 has 32/40/64 heads for 7/13/70B. 52 is not divisible by 8, which I think would make the math weird when applying GQA with 8 kv heads like 70B, so perhaps it's been bumped up slightly to 56?
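To put the "multiply by 8" estimate in concrete terms, here is a back-of-the-envelope FP16 KV-cache calculation per token of context, using the layer count, head counts and head dimension from the config above plus LLaMA 1 65B's 80 layers and 64 heads; it's plain arithmetic, not a measurement:

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values: one head_dim vector per layer per kv head, in FP16
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

llama1_65b = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
llama2_70b = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

print(llama1_65b / 2**20)  # 2.5 MiB per token with full MHA
print(llama2_70b / 2**20)  # 0.3125 MiB per token with 8 kv heads, i.e. 8x smaller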
-
GQA shouldn't be hard to add. I'm on it now. But the way they've done it in transformers is pretty wasteful, requiring reshaping the entire key/value cache for every token. That's a lot of large temporary tensors and a bunch of extra copying, which should be avoidable with broadcasting instead.
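For illustration, a minimal sketch of the difference, with made-up GQA shapes (40 query heads sharing 8 kv heads); this is just the idea of copying versus broadcasting, not the actual transformers or ExLlama code:

import torch

batch, n_heads, n_kv_heads, head_dim, seq = 1, 40, 8, 128, 4096
groups = n_heads // n_kv_heads

q = torch.randn(batch, n_heads, 1, head_dim)       # queries for the current token
k = torch.randn(batch, n_kv_heads, seq, head_dim)  # cached keys

# Copy-based: physically repeat the kv cache up to n_heads (a large temporary tensor)
k_rep = k.repeat_interleave(groups, dim=1)
scores_copy = q @ k_rep.transpose(-1, -2)

# Broadcast-based: group the query heads instead and let matmul broadcast over the group dim
q_grp = q.view(batch, n_kv_heads, groups, 1, head_dim)
scores_bcast = (q_grp @ k.unsqueeze(2).transpose(-1, -2)).view(batch, n_heads, 1, seq)

print(torch.allclose(scores_copy, scores_bcast))  # True: same scores, without repeating the cache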
-
@turboderp We are trying to implement GQA for the 13B LLaMA 2 model as well, in a bid to see if its memory usage can be optimised. In a naive experiment, we tried to change the num_key_value_heads setting in the config.
The reason we want to do this is that running in 4-bit severely affects the model's performance, and even when running in 8-bit, the memory usage of 13B pushes the limits of a single RTX 3090 with 24 GB of VRAM after a few inferences. Do you know what the correct way to implement GQA on the 13B models would be?
-
The model has to be trained with GQA from the beginning. I've heard of people merging the key and value projections successfully on a pretrained model, but it's a lot more involved than just changing the config. If you want to do this on the running model, i.e. without producing a new .safetensors file with the modified weights, you should probably start in Transformers, since ExLlama has many places where the same changes would need to be applied. Start in the attention function after the key and value projections are applied, then do whatever merging (averaging, I suppose) over the head dimension of the key and value state tensors. If you can make that work, you should be able to do the same to the weights and produce a GQA model that way. Of course, how well the model would perform after that kind of brain surgery, I have no idea.
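If it helps as a starting point, here's a rough, untested sketch of the weight-side version of that idea: mean-pooling a 13B layer's 40 key (or value) projection heads down to 8 shared heads, similar to the conversion step in the GQA paper. It assumes a standard Llama k_proj/v_proj weight of shape (n_heads * head_dim, hidden_size), and the choice of 8 groups just mirrors 70B; how much quality survives without further training is another question:

import torch

def mean_pool_kv_heads(weight, n_heads, n_kv_heads, head_dim):
    # weight: (n_heads * head_dim, hidden_size) k_proj or v_proj weight.
    # Averages each group of n_heads // n_kv_heads heads into one shared head.
    hidden_size = weight.shape[1]
    groups = n_heads // n_kv_heads
    w = weight.view(n_kv_heads, groups, head_dim, hidden_size)
    return w.mean(dim=1).reshape(n_kv_heads * head_dim, hidden_size)

# Hypothetical 13B layer: 40 heads, head_dim 128, hidden_size 5120
k_proj = torch.randn(40 * 128, 5120)
k_proj_gqa = mean_pool_kv_heads(k_proj, n_heads=40, n_kv_heads=8, head_dim=128)
print(k_proj_gqa.shape)  # torch.Size([1024, 5120])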
-
The original paper that introduced GQA was also largely focused on converting an existing MHA model.
-
https://ai.meta.com/llama/