Is it possible and efficient to load layers on demand? #267

Open
fahadh4ilyas opened this issue Aug 30, 2023 · 2 comments

Comments

@fahadh4ilyas

I have a GPU that I want to load multiple models into. Your ExLlama model loads all weights onto the GPU when ExLlama is instantiated. Would it be possible to load every decoder layer onto the CPU first and only move it to the GPU when forward is called on that layer (then move it back to the CPU once the forward pass is done)? It seems possible, but I'm not sure how generation time would be affected.
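
For concreteness, something like the sketch below is what I have in mind. It assumes a plain PyTorch nn.ModuleList of decoder layers; the wrapper name and structure are made up for illustration and are not ExLlama's actual API:

```python
import torch
import torch.nn as nn

class OnDemandLayer(nn.Module):
    """Keeps a decoder layer in system RAM and moves it to the GPU only for its forward pass."""

    def __init__(self, layer: nn.Module, device: str = "cuda"):
        super().__init__()
        self.layer = layer.cpu()   # park the weights on the CPU between calls
        self.device = device

    def forward(self, *args, **kwargs):
        self.layer.to(self.device)      # stream this layer's weights over PCIe
        out = self.layer(*args, **kwargs)
        self.layer.to("cpu")            # free the VRAM again for other models
        return out
```

Every forward pass through a wrapped layer would then pay the full CPU-to-GPU transfer for that layer's weights, which is the cost I'm asking about.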

@turboderp
Owner

turboderp commented Aug 30, 2023

I experimented with this early on, but I couldn't find a way to make it even remotely usable. The bottleneck during text generation is largely memory bandwidth, since every parameter of the model* is read at least once during a forward pass. If you're streaming layers from system RAM, even if you can get it running completely asynchronously, your inference speed will be limited by PCIe bandwidth. So you can expect a slowdown of whatever is the ratio between your PCIe and VRAM bandwidths, likely a slowdown on the order of 30x.

*) The exception is the token embedding layer, which ExLlama already keeps in system RAM.
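
As a rough back-of-the-envelope check (the bandwidth figures below are illustrative assumptions, not measurements):

```python
# Illustrative numbers: a high-end GPU's memory bandwidth vs. PCIe 4.0 x16.
vram_bandwidth_gb_s = 1000   # ~1 TB/s of VRAM bandwidth
pcie_bandwidth_gb_s = 32     # ~32 GB/s usable over PCIe 4.0 x16

print(f"expected slowdown ~ {vram_bandwidth_gb_s / pcie_bandwidth_gb_s:.0f}x")
# prints: expected slowdown ~ 31x
```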

@fahadh4ilyas
Author


Yeah, I've been testing this by copying the tensors variable from your exllama code to the GPU only when the q4 property is called from Ex4bitLinear (I skip the step that copies tensors to the GPU in the ExLlama.__init__ method). The generation process takes a really long time. I was thinking that moving the model to the GPU every time generation is called, and moving it back to the CPU when idle, might work better. But then again, why bother, since the number of models that fit on the GPU at once wouldn't increase.
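
What I was imagining for that per-generation swap is roughly the following. It's only a sketch assuming a Hugging Face-style model with a .generate() method, not ExLlama's actual classes:

```python
import torch

def generate_and_release(model, input_ids):
    """Move the whole model to the GPU for one generation call, then free the VRAM again."""
    model.to("cuda")                      # one big PCIe transfer per call
    try:
        with torch.no_grad():
            return model.generate(input_ids.to("cuda"))
    finally:
        model.to("cpu")                   # park the weights back in system RAM while idle
        torch.cuda.empty_cache()
```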
