
Use TornadoVM for faster inferencing #25

Open
kevintanhongann opened this issue Nov 19, 2024 · 4 comments

Comments

kevintanhongann commented Nov 19, 2024

I suppose this could be improved by utilizing the GPU via TornadoVM.

mukel (Owner) commented Nov 19, 2024

Yes, it can!
There are some rough prototypes around, but nothing end-to-end.
There's one that offloads some matmuls to GPU: https://github.com/mikepapadim/llama2.tornadovm.java
Another one uses oneAPI's shared memory to avoid copying, but since inference is memory-bound, integrated GPUs don't give much benefit beyond a minor performance improvement and some power savings.
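
For context, the TaskGraph pattern those prototypes build on looks roughly like the sketch below. This is illustrative only, not code from either prototype: the MatmulOffload class, the matrix sizes, and the dummy data are made up, and API details may differ between TornadoVM versions.

import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class MatmulOffload {

    // out[i] = dot(w[i, :], x): the matrix-vector product that dominates transformer inference
    static void matmul(FloatArray w, FloatArray x, FloatArray out, int rows, int cols) {
        for (@Parallel int i = 0; i < rows; i++) {
            float sum = 0f;
            for (int j = 0; j < cols; j++) {
                sum += w.get(i * cols + j) * x.get(j);
            }
            out.set(i, sum);
        }
    }

    public static void main(String[] args) {
        int rows = 4096;
        int cols = 4096;
        FloatArray w = new FloatArray(rows * cols);
        FloatArray x = new FloatArray(cols);
        FloatArray out = new FloatArray(rows);
        w.init(0.01f);   // dummy weights
        x.init(1.0f);    // dummy activations

        TaskGraph graph = new TaskGraph("s0")
                // weights are copied once and stay resident on the device
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, w)
                // activations change on every token, so copy them each execution
                .transferToDevice(DataTransferMode.EVERY_EXECUTION, x)
                .task("matmul", MatmulOffload::matmul, w, x, out, rows, cols)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, out);

        ImmutableTaskGraph itg = graph.snapshot();
        TornadoExecutionPlan plan = new TornadoExecutionPlan(itg);
        plan.execute();   // JIT-compiles the kernel and runs it on the selected accelerator

        System.out.println("out[0] = " + out.get(0));
    }
}

The key point for inference is the transfer modes: weights go to the device once and stay resident, while activations and results move on every execution, which keeps per-token copy traffic small.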

kevintanhongann (Author) commented Nov 20, 2024

main...kevintanhongann:llama3.java:main

I don't know if this makes sense. I tried to run it, but it seems it still needs GraalVM-related pieces to work.

How to run mine:

jbang \
  -Dtornado.load.api.implementation=uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph \
  -Dtornado.load.runtime.implementation=uk.ac.manchester.tornado.runtime.TornadoCoreRuntime \
  -Dtornado.load.tornado.implementation=uk.ac.manchester.tornado.runtime.common.Tornado \
  -Dtornado.load.annotation.implementation=uk.ac.manchester.tornado.annotation.ASMClassVisitor \
  -Dtornado.load.annotation.parallel=uk.ac.manchester.tornado.api.annotations.Parallel \
  Llama3.java --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --chat
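
The same flags can also live in the script header so the command line stays short; if I'm not mistaken, jbang accumulates repeated //JAVA_OPTIONS directives, so something like this at the top of Llama3.java should be equivalent (untested sketch):

//JAVA_OPTIONS -Dtornado.load.api.implementation=uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph
//JAVA_OPTIONS -Dtornado.load.runtime.implementation=uk.ac.manchester.tornado.runtime.TornadoCoreRuntime
//JAVA_OPTIONS -Dtornado.load.tornado.implementation=uk.ac.manchester.tornado.runtime.common.Tornado
//JAVA_OPTIONS -Dtornado.load.annotation.implementation=uk.ac.manchester.tornado.annotation.ASMClassVisitor
//JAVA_OPTIONS -Dtornado.load.annotation.parallel=uk.ac.manchester.tornado.api.annotations.Parallel

jbang Llama3.java --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --chat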

ongoing troubleshooting...

kevintanhongann (Author) commented

Pulling @jjfumero into the discussion.

mikepapadim (Collaborator) commented

We are currently working on extending llama3.java with TornadoVM, with support for quantized types on the GPU, without the need to convert the GGUF model to use floats.
