Have you tried moving MLP layer to CPU RAM #184
Closed
kaiokendev started this conversation in Ideas
Replies: 1 comment 8 replies
Moving the hidden state back and forth wouldn't be much of a bottleneck, but why run the MLP specifically on the CPU?
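A quick back-of-envelope supports the "not much of a bottleneck" claim for the transfer itself. The figures below are assumptions (LLaMA-70B-like model: hidden size 8192, 80 layers, fp16, single-token decode, roughly PCIe 4.0 x16 effective bandwidth); per-transfer launch/sync latency is ignored and would likely dominate in practice:

```python
# Back-of-envelope: bytes moved per token if the hidden state round-trips
# to CPU at every layer. All model/bus figures are assumed, not measured.
hidden_size = 8192        # assumed LLaMA-70B hidden dim
bytes_per_elem = 2        # fp16
n_layers = 80             # assumed LLaMA-70B layer count
seq = 1                   # one decode step

bytes_per_hop = seq * hidden_size * bytes_per_elem   # one GPU<->CPU copy
round_trips = 2 * n_layers                           # to CPU and back, every layer
total_bytes = bytes_per_hop * round_trips            # per generated token
pcie_bytes_per_s = 16e9                              # ~PCIe 4.0 x16 effective
transfer_s = total_bytes / pcie_bytes_per_s

print(bytes_per_hop, total_bytes, transfer_s)
```

So the raw copy cost is on the order of a few MiB and well under a millisecond per token; the CPU matmul time and per-copy sync overhead are the real questions.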
It's one thing I want to try (to fit 70B on a single 3090): move just the MLP layers to CPU RAM, move the hidden state to CPU right before the post-attention layer norm, then move it back before the block norm. But I would have to touch the 4-bit CUDA matmul kernels. I'm wondering if you already tried it? I think the performance hit would not be too egregious, and maybe even better than running llama.cpp while offloading ~30% of the memory.
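The idea above can be sketched in plain PyTorch. This is a minimal illustration only, assuming an unquantized LLaMA-style gated MLP (gate/up/down projections, SiLU), not the 4-bit kernels the post actually targets; the module and sizes are hypothetical, and tiny dimensions stand in for the real 70B ones:

```python
# Hypothetical sketch: MLP weights live in CPU RAM; only the hidden state
# crosses the bus each layer. Real 70B sizes would be hidden=8192,
# intermediate=28672; small dims used here so the example runs quickly.
import torch
import torch.nn as nn

class CPUOffloadMLP(nn.Module):
    def __init__(self, hidden: int, inter: int):
        super().__init__()
        # Weights are constructed on (and stay on) the CPU. Pinned memory
        # and non_blocking copies would help real transfer speed.
        self.gate = nn.Linear(hidden, inter, bias=False)
        self.up = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dev = x.device
        h = x.to("cpu")  # hidden state hops to CPU before the MLP
        h = self.down(self.act(self.gate(h)) * self.up(h))
        return h.to(dev)  # and back to the GPU for the next block

mlp = CPUOffloadMLP(hidden=64, inter=256)
x = torch.randn(1, 1, 64)   # one decode-step hidden state
y = mlp(x)                  # same shape as x, computed on CPU
```

The open question the post raises still stands: doing this inside the existing 4-bit CUDA path means the CPU side needs its own dequant/matmul routine rather than a stock `nn.Linear`.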