Replies: 1 comment
-
It's mostly profile-guided, following basic CUDA guidelines (there's a rough sketch of the profiling loop after these notes). Some random points:
As for regression tests, I don't have a good framework set up. I mostly just run the previous version for reference and compare tensors at each step to make sure the output doesn't change too much. Tolerance is required because I'm consciously not worrying about making the implementation completely deterministic. Determinism would be nice, but I find the price is too high, and language models are probabilistic anyway. They're also quite robust, as shown by how well they hold up when quantized to a quarter of their original precision or less. I also don't think my setup is particularly great: I mostly develop in PyCharm, and I haven't yet found a good way to debug CUDA code.
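Not from the repo itself, but a minimal sketch of what this "compare tensors with tolerance" check can look like in PyTorch; `ref_fn`, `opt_fn`, and the shapes here are made-up stand-ins, not names from the codebase:

```python
import torch

def compare_outputs(ref_fn, opt_fn, x, rtol=1e-3, atol=1e-4):
    """Run the reference and the optimised implementation on the same
    input and require that they agree within tolerance, not bit-for-bit."""
    with torch.no_grad():
        ref = ref_fn(x)
        out = opt_fn(x)
    # Raises with a summary of the worst mismatches if tolerance is exceeded,
    # which helps pinpoint where a kernel change made the output drift.
    torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)

# Toy stand-ins for the "previous version" and the "optimised rewrite".
layer = torch.nn.Linear(4096, 4096)
ref_fn = layer
opt_fn = lambda t: t @ layer.weight.T + layer.bias  # same math, different kernel path

compare_outputs(ref_fn, opt_fn, torch.randn(8, 4096))
```

Running a check like this after each intermediate step of a rewrite is what keeps a loose tolerance honest: any single step that drifts more than `rtol`/`atol` fails immediately instead of compounding silently.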
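And since the question below asks about the PyTorch profiler: the answer doesn't name a specific tool, but one common "find the slow parts" loop with `torch.profiler` looks roughly like this (the toy model and shapes are made up for illustration):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy model standing in for a real transformer block.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)
x = torch.randn(8, 4096)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with torch.no_grad():
        model(x)

# Sort by self time to surface the ops worth hand-optimising first.
sort_key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```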
-
I think this repo is great. I would really like to do similar work optimising LLM performance for my particular use case.
@turboderp, would you be able to share some of your process for speeding up the models? I'm sure there are lots of others out there who would like to learn this too.
Do you just run the PyTorch profiler, pick out the slow parts, and try to optimise them? Or do you use other profiling tools?
How do you make sure you don't break inference while optimising a module? Do you write a regression test and check it at each stage?
If anyone else has done LLM optimisation work, I'd be interested in hearing how you approached it too.