How does turbodep go about optimising performance? #189

ri938 · 2023-07-24T09:45:13Z

ri938
Jul 24, 2023

I think this repo is great, I would really like to be able to do similar work on optimising performance of LLM for my particular use case.

@turboderp would you be able to share some of the process for how you go about speeding up the models? I'm sure there are lots of others out there who also want to learn more too.

Do you just run the pytorch profiler and pick out the slow parts and try to optimise them? Or do you have some other profiling tools?

How do you make sure that you dont break inference when you are optimising a module? Do you write a regression test and check it at each stage?

If anyone else has done some LLM optimisations I would be interested in hearing how you have approached it too.

turboderp · 2023-07-25T11:58:14Z

turboderp
Jul 25, 2023
Maintainer

It's mostly profile-guided, following basic CUDA guidelines. Some random points:

PyTorch profiling doesn't really help. What you want is NVIDIA Nsight Systems (or some equivalent) to show you a timeline of the forward pass. This highlights both which kernels take longer than they should to execute and whether or not you're even bottlenecked by CUDA.
Python is several orders of magnitude slower than any (JIT or not) compiled language, especially C. The reason you can use it efficiently for ML applications is that all the work ends up happening asynchronously on compute devices, so as long as the queue never runs out it doesn't matter how long it takes the CPU to append each new operation to the end of that queue. The CPU will reach a synchronization point and just wait, regardless.
That situation changes completely between the contexts of training neural networks (large, expensive matmuls) and generating a single string of tokens on a local instance of a "small" LLM (relatively small, fast matmuls, almost instantaneous on a Python scale). Here you want to fuse kernels, or at least fuse operations into C++ extension functions just so you can accomplish more in between each line of Python code.
Overall, CUDA is hard and I'm by no means an expert. Often the best I can do is just experiment, and I guess the real trick is to keep a mental catalog of all the things that might be worth trying, and then trying them. There are some very universal rules-of-thumb, like always coalesceing GMEM accesses, but it can be hard to say when, for instance, caching an intermediate is result going to help. SMEM is fast, but it's not faster than the L1 cache, for example. And it's definitely not faster than arithmetic operations that end up being free because you're just waiting for GMEM latency anyway. Either way the benchmark doesn't lie. Except when it does because you broke something.

As for regression tests, I don't have a good framework set up. I mostly just run the previous version for reference and compare tensors at each step to make sure the output doesn't change too much. torch.max((tensor_a - tensor_b).abs()) < tolerance type of thing.

Tolerance is required because I'm consciously not worrying about the implementation being completely deterministic. Determinism would be nice, but I find the price is too high, and language models are probabilistic anyway. They're also quite robust, as shown by how well they hold up when quantized to a quarter of their original precision or less.

I also don't think my setup is particularly great. I mostly develop in PyCharm, and I haven't yet found a good way to debug CUDA code.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How does turbodep go about optimising performance? #189

{{title}}

Replies: 1 comment

{{title}}

Select a reply

How does turbodep go about optimising performance? #189

ri938 Jul 24, 2023

Replies: 1 comment

turboderp Jul 25, 2023 Maintainer

ri938
Jul 24, 2023

turboderp
Jul 25, 2023
Maintainer