Have you tried moving MLP layer to CPU RAM #184
Closed
kaiokendev started this conversation in Ideas
Replies: 1 comment 8 replies
Moving the hidden state back and forth wouldn't be much of a bottleneck, but why run the MLP specifically on the CPU?
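A quick back-of-envelope supports the "not much of a bottleneck" claim for the transfer itself. The figures below are assumptions (LLaMA-70B-like model: hidden size 8192, 80 layers, fp16, single-token decode, roughly PCIe 4.0 x16 effective bandwidth); per-transfer launch/sync latency is ignored and would likely dominate in practice:

```python
# Back-of-envelope: bytes moved per token if the hidden state round-trips
# to CPU at every layer. All model/bus figures are assumed, not measured.
hidden_size = 8192        # assumed LLaMA-70B hidden dim
bytes_per_elem = 2        # fp16
n_layers = 80             # assumed LLaMA-70B layer count
seq = 1                   # one decode step

bytes_per_hop = seq * hidden_size * bytes_per_elem   # one GPU<->CPU copy
round_trips = 2 * n_layers                           # to CPU and back, every layer
total_bytes = bytes_per_hop * round_trips            # per generated token
pcie_bytes_per_s = 16e9                              # ~PCIe 4.0 x16 effective
transfer_s = total_bytes / pcie_bytes_per_s

print(bytes_per_hop, total_bytes, transfer_s)
```

So the raw copy cost is on the order of a few MiB and well under a millisecond per token; the CPU matmul time and per-copy sync overhead are the real questions.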
It's one thing I want to try (to fit 70B on a single 3090): move just the MLP layers to CPU RAM, move the hidden state to CPU right before the post-attention layer norm, then move it back before the block norm. But I would have to touch the 4-bit CUDA matmul kernels. I'm wondering if you already tried it? I think the performance hit would not be too egregious, and maybe even better than running llama.cpp while offloading ~30% of the memory.
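The idea above can be sketched in plain PyTorch. This is a minimal illustration only, assuming an unquantized LLaMA-style gated MLP (gate/up/down projections, SiLU), not the 4-bit kernels the post actually targets; the module and sizes are hypothetical, and tiny dimensions stand in for the real 70B ones:

```python
# Hypothetical sketch: MLP weights live in CPU RAM; only the hidden state
# crosses the bus each layer. Real 70B sizes would be hidden=8192,
# intermediate=28672; small dims used here so the example runs quickly.
import torch
import torch.nn as nn

class CPUOffloadMLP(nn.Module):
    def __init__(self, hidden: int, inter: int):
        super().__init__()
        # Weights are constructed on (and stay on) the CPU. Pinned memory
        # and non_blocking copies would help real transfer speed.
        self.gate = nn.Linear(hidden, inter, bias=False)
        self.up = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dev = x.device
        h = x.to("cpu")  # hidden state hops to CPU before the MLP
        h = self.down(self.act(self.gate(h)) * self.up(h))
        return h.to(dev)  # and back to the GPU for the next block

mlp = CPUOffloadMLP(hidden=64, inter=256)
x = torch.randn(1, 1, 64)   # one decode-step hidden state
y = mlp(x)                  # same shape as x, computed on CPU
```

The open question the post raises still stands: doing this inside the existing 4-bit CUDA path means the CPU side needs its own dequant/matmul routine rather than a stock `nn.Linear`.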