PostMessage: Data cannot be cloned, out of memory #12
Comments
Yes, it seems like the issue is due to the way multiple files are being copied onto the web worker: we're currently copying all shards at once, which may cause it to run out of memory. The fix would be to copy the shards incrementally instead of all at once.
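For illustration, here is a minimal sketch of that idea in plain JavaScript. This is not wllama's actual internals; the worker file name and message shape are made up. The point is that each shard's ArrayBuffer is transferred to the worker one at a time instead of being structured-cloned in one giant postMessage.

```js
// Sketch only: stream shards to the worker one by one.
// 'model-worker.js' and the message format are hypothetical.
const worker = new Worker('model-worker.js');

async function sendShards(shardUrls) {
  for (const [index, url] of shardUrls.entries()) {
    const buf = await (await fetch(url)).arrayBuffer();
    // The transfer list moves ownership of the buffer to the worker,
    // so the main thread does not keep a second copy in memory.
    worker.postMessage({ type: 'shard', index, buf }, [buf]);
  }
  worker.postMessage({ type: 'done', total: shardUrls.length });
}
```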
This is becoming a bit of a showstopper, unfortunately. It seems to even affect small models that would load under … If you could help fix this issue, or give some pointers on how I could attempt to do so myself, that would be greatly appreciated. At this point I don't mind if a fix is slow or sub-optimal. I just want wllama to be reliable.
I'm planning to work on this issue in the next days. It may be more complicated than it looks, so I'll need time to figure that out. Please be patient.
That's great news! Thank you so much!
FYI, v1.7.0 has been released. It also comes with support for the progress feature; see wllama/examples/advanced/index.html (lines 53 to 57, commit d1ceeb6).

This issue (out-of-memory) is hopefully fixed by #14, but I'm not 100% sure. Please try again and let me know if it works. Also, it's now recommended to split the model into chunks of 256 MB or 512 MB. Again, see the "advanced" example: wllama/examples/advanced/index.html (lines 38 to 45, commit d1ceeb6).

Also have a look at the updated README: https://github.com/ngxson/wllama/tree/master?tab=readme-ov-file#prepare-your-model

Thank you!
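As a rough sketch of that recommendation: a split model can be loaded by passing the full list of chunk URLs to loadModelFromUrl, as in the advanced example. The shard URLs below are placeholders, and `wllama` is assumed to be an already-constructed Wllama instance created as in the repo's examples.

```js
// Sketch: load a model split into chunks of roughly 256-512 MB.
// The URLs are placeholders; substitute your own shard list.
const shardUrls = Array.from(
  { length: 8 },
  (_, i) =>
    `https://example.com/my-model.shard-${String(i + 1).padStart(5, '0')}-of-00008.gguf`
);

await wllama.loadModelFromUrl(shardUrls, {
  n_ctx: 4096, // a modest context also helps keep memory usage down
});
```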
The README mentions the progress feature (very nice bonus, thank you!), but just to be sure: does this also address the memory issue? Or is the intended fix for that to make the chunks smaller? Ah, reading again…
OK, I'll do that. Thank you.
@flatsiedatsie FYI, I uploaded v1.8.0, which should display a better error message (I don't know if it fixes the mentioned issue or not). Could you try again and see what the error is? Thanks.
I'm also still looking into your suggestion that it may be that the model is trying to load twice.
Your screenshot still shows "_wllama_decode_exception", which has already been removed in 1.8.0. Maybe your code is not using the latest version.
Correct, those are screenshots from yesterday. I'm updating it now.
OK, I've done some more testing. TL;DR: things are running a lot smoother now! It's just the big models or big contexts that run out of memory. But before I get into that, let me give a little context about what I'm trying to achieve. I'm trying to create a 100% browser-based online tool where people can not only chat with AI, but use it to work on documents. For that I need two types of models:

Mistral 7B with 32K context could be a good "middle of the road, do-it-all" option, so I've been trying to run that with Wllama today. I started by using your example code to eliminate the possibility of bugs in my project being the cause of issues. I also rebooted my laptop first (MacBook Pro with 16 GB of RAM) to have as much available memory as possible. Once I found that I got the same results with the example as with my code, I mostly reverted back to my project.

The only model I've been able to get to work with a 16K context. It crashes at its theoretical maximum, 32K.

In my main code I can now load NeuralReyna. However, if I try to use the full 32K, or even 16K, there are once again memory issues. With 8K it doesn't crash.

I chunked it into 250 MB parts, and it loads! Nice!

Here I tried to directly load a 1.96 GB .gguf file (Q3_K) and even that worked! This is pretty great, as llama.cpp support for this model is right around the corner. To be clear, I used it with a 4K context, since llama.cpp doesn't support a bigger context yet.

This model has memory issues. To make sure it wasn't my code, I tried loading the model in the advanced example too. Same result. Even setting the context to 1K doesn't help. The chunks I'm using are available here: https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked

With version 1.8, Wllama doesn't seem to raise an error, though? It just states the issue in the console, but my code thinks the model has loaded OK, even though it hasn't. Is there a way to get the failed state?

In summary, only the bigger models/contexts now seem to run into issues. I still have to test what happens on devices with less memory (e.g. an 8 GB MacBook Air).

Finally, I just want to say: thank you for all your work on this! It's such an exciting development. Everybody talks about bringing AI to the masses, but too few people realize the browser is the best platform to do that with. Wllama is awesome!
Just a quick check:
Thank you for the very detailed info! It's true that we will definitely struggle with the memory issue, because AFAIK browsers do have some limits on memory usage. Optimizing memory usage will surely be an area that I'll need to invest my time in.
FYI, n_ctx doesn't have to be a power of 2. It can be a multiple of 1024, for example 10 * 1024. Another trick to reduce memory usage is to use a quantized KV cache:

```js
wllama.loadModelFromUrl(MODEL, {
  n_ctx: 10 * 1024,
  cache_type_k: 'q4_0',
});
```
Yes, WebLLM offloads the model weights and KV cache to the GPU (not just Apple silicon, but also NVIDIA/AMD/Intel Arc GPUs). I couldn't find on Google what the hard limit for WebGPU memory is, so I suppose it can use all available GPU VRAM. It would be ideal to have support for WebGPU built directly into llama.cpp itself, but that's far too complicated, so for now there's not much choice left for us.
If you're not using the model for embedding, 1024 is probably fine. However, embedding models like BERT are non-causal, meaning they need to process the whole input in a single batch.
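A hedged sketch of that point: for a non-causal embedding model, the batch size has to be at least as large as the longest input you want to embed. The `embeddings` and `n_batch` option names below are assumptions mirroring llama.cpp's parameters, not confirmed wllama API; check the wllama docs for the exact field names.

```js
// Sketch only: option names `embeddings` and `n_batch` are assumed, not confirmed.
// For a non-causal (BERT-like) model, the batch must cover the whole input,
// so the batch size is set at least as large as the context here.
await wllama.loadModelFromUrl(EMBEDDING_MODEL_URL, {
  embeddings: true, // assumed flag to enable embedding mode
  n_ctx: 2048,
  n_batch: 2048,    // assumed option mirroring llama.cpp's batch size
});
```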
I've got a 7B Q2_K model working! (Total file size: 2.72 GB) I was able to use a context up to … The inference speed was around 2 tokens per second when using 6 threads. I've uploaded the split GGUF here. To try it, you can use this model URL array:

```js
Array.from({ length: 45 }, (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-smashed-WizardLM-2-7B/resolve/main/WizardLM-2-7B.Q2_K.shard-${(i + 1).toString().padStart(5, "0")}-of-00045.gguf`)
```
Now I've got a 7B Q3_K_M working! (Total file size: 3.52 GB)

```js
Array.from({ length: 43 }, (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-Mistral-7B-OpenOrca/resolve/main/Mistral-7B-OpenOrca-Q3_K_M.shard-${(i + 1).toString().padStart(5, "0")}-of-00043.gguf`)
```
*stops watching this space ;-) |
@flatsiedatsie, please confirm if you have set …
@felladrin You're right! I accidentally had that commented out for some testing. And… it's working!! Thank you both so much! Mistral! On CPU! In the browser! This is a game changer!
Does …
I'm happy to see it too! I usually leave the …
I'm trying to load Mistral 7B 32K. I've chunked the 4.3 GB model and uploaded it to Hugging Face.
When the download is seemingly complete, there is a warning about being out of memory:
It's a little odd, as I normally load bigger chunked models (Llama 8B) with WebLLM. The task manager also indicates that memory pressure is medium.