
ggml-cpu: use lookuptable for ggml op and parallelized some of the memcpy memset etc. calls before ggml_barriers #1101

Open
Kamayuq wants to merge 1 commit into master from refactor_ggml_barrier_calls

Conversation

@Kamayuq commented Feb 4, 2025

Handle all calls to ggml_barrier at the ggml_graph_compute_thread level, and parallelize some of the memset, memcpy, etc. calls that run before some of the barrier calls.

All barriers are now implemented by returning a continuation point; the tensor op is then re-run, very similar to how a coroutine would resume, but with much less boilerplate, since the tasks are simple enough that none is needed.

Ideally the code around barriers should be split up, but I also see the benefit of keeping the code together, so this was the best compromise I could come up with while still being able to move the barrier to a single place in the code (within ggml-cpu).
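
To illustrate the continuation idea described above, here is a minimal sketch, assuming a hypothetical op and dispatcher (none of these names come from the actual patch): the op returns the index of the barrier it stopped at, the dispatcher synchronizes the threads, and the op is re-entered with that index so it skips the phases it has already completed.

```c
/* Minimal sketch only -- hypothetical names, not the code from this PR.
 * An op returns OP_DONE when finished, or the index of the barrier it
 * wants the dispatcher to run before the op is re-entered. */
#include <stddef.h>
#include <string.h>

enum { OP_DONE = -1 };

static int my_op(int ith, int nth, int resume_from, void * dst, size_t n) {
    if (resume_from < 1) {
        /* phase 0: each thread clears its slice of dst */
        size_t chunk = (n + (size_t) nth - 1) / (size_t) nth;
        size_t start = (size_t) ith * chunk;
        size_t end   = start + chunk < n ? start + chunk : n;
        if (start < end) {
            memset((char *) dst + start, 0, end - start);
        }
        return 1; /* ask the dispatcher for barrier #1 */
    }
    /* phase 1: all threads now see the fully cleared dst */
    return OP_DONE;
}

/* What the per-thread dispatch loop would do conceptually: keep
 * re-running the op and issue the barrier in one single spot. */
static void run_op(int ith, int nth, void * dst, size_t n,
                   void (*barrier)(void * ctx), void * barrier_ctx) {
    int resume = 0;
    for (;;) {
        int next = my_op(ith, nth, resume, dst, n);
        if (next == OP_DONE) {
            break;
        }
        barrier(barrier_ctx); /* the only place a barrier is issued */
        resume = next;
    }
}
```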

@ggerganov (Member) commented:

Is the continuation counter really necessary? The memsets and memcpys are single-threaded because in the past we didn't have the barrier mechanism and instead we had the INIT/COMPUTE/FINALIZE phases of the computations. But after the ggml_barrier introduction, we can simply multi-thread these sections.
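
A rough sketch of what multi-threading such a section could look like, assuming hypothetical helper names rather than ggml's actual compute params (which expose a thread index ith and thread count nth): each thread clears its own contiguous slice, and a barrier separates the clearing from the compute phase.

```c
/* Rough sketch -- helper names are hypothetical; in ggml-cpu the thread
 * index/count come from the compute params and the synchronization
 * would be ggml's own barrier. */
#include <stddef.h>
#include <string.h>

static void clear_then_compute(int ith, int nth, float * dst, size_t n,
                               void (*barrier)(void * ctx), void * barrier_ctx) {
    /* split the memset into contiguous per-thread ranges */
    size_t per_thread = (n + (size_t) nth - 1) / (size_t) nth;
    size_t start = (size_t) ith * per_thread;
    size_t end   = start + per_thread < n ? start + per_thread : n;
    if (start < end) {
        memset(dst + start, 0, (end - start) * sizeof(float));
    }

    /* all threads must finish clearing before any of them reads dst */
    barrier(barrier_ctx);

    /* ... compute phase: every thread can now assume dst is zeroed ... */
}
```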

@Kamayuq (Author) commented Feb 7, 2025

I am going to remove the continuation stuff from this MR for now. I used it because I am locally experimenting with a scheduler that is more akin to what a game engine would use. For that, unfortunately, all the barriers have to go, because calling in all the threads and synchronizing them on the spot does not compose very well with the other parallelization primitives.

I also ordered a dual-socket Epyc to get a better idea of the NUMA memory-traffic costs. I will probably bring my proposal (or something similar) back in the future once I have had more time to solidify my WIP code. If you like, we can also talk about some ideas over a video call.

Kamayuq force-pushed the refactor_ggml_barrier_calls branch from daa949d to 97d27e6 on February 7, 2025 02:15
also parallelized some of the memset, memcpy etc. calls before some of the ggml_barriers.
Kamayuq force-pushed the refactor_ggml_barrier_calls branch from 97d27e6 to c323c56 on February 7, 2025 02:17
Kamayuq changed the title from "ggml-cpu: refactor use of ggml_barrier" to "ggml-cpu: use lookuptable for ggml op and parallelized some of the memcpy memset etc. calls before ggml_barriers" on Feb 7, 2025
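
For context on the "lookuptable for ggml op" part of the new title: dispatching ops through a table of function pointers indexed by the op enum, instead of a large switch, could look roughly like the sketch below (illustrative types and names, not the code in this PR).

```c
/* Illustrative sketch of op dispatch via a lookup table indexed by an
 * op enum, replacing a switch. Types and names are hypothetical. */
#include <stddef.h>

typedef enum { MY_OP_ADD, MY_OP_MUL, MY_OP_COUNT } my_op;

typedef void (*op_fn)(float * dst, const float * a, const float * b, size_t n);

static void op_add(float * dst, const float * a, const float * b, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}

static void op_mul(float * dst, const float * a, const float * b, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = a[i] * b[i];
}

/* one function pointer per op; indexed directly by the op enum */
static const op_fn op_table[MY_OP_COUNT] = {
    [MY_OP_ADD] = op_add,
    [MY_OP_MUL] = op_mul,
};

static void compute(my_op op, float * dst, const float * a, const float * b, size_t n) {
    op_table[op](dst, a, b, n); /* table lookup instead of switch (op) */
}
```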