
ggml-cpu: use lookuptable for ggml op and parallelized some of the memcpy memset etc. calls before ggml_barriers #1101

Open
Kamayuq wants to merge 1 commit into master from refactor_ggml_barrier_calls

Conversation

@Kamayuq commented Feb 4, 2025

Handle all calls to ggml_barrier at the ggml_graph_compute_thread level, and parallelize some of the memset, memcpy, etc. calls that run before some of the barrier calls.

All barriers are now implemented by returning a continuation point; the tensor op is then re-run, very similar to how a coroutine would resume, but with much less boilerplate, since the tasks are simple enough that none is needed.

Ideally the code around barriers should be split up, but I also see the benefit of keeping the code together, so this was the best compromise I could come up with while still being able to move the barrier to a single place in the code (within ggml-cpu).
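
To illustrate the continuation idea described above, here is a minimal sketch, assuming a hypothetical op and dispatcher (none of these names come from the actual patch): the op returns the index of the barrier it stopped at, the dispatcher synchronizes the threads, and the op is re-entered with that index so it skips the phases it has already completed.

```c
/* Minimal sketch only -- hypothetical names, not the code from this PR.
 * An op returns OP_DONE when finished, or the index of the barrier it
 * wants the dispatcher to run before the op is re-entered. */
#include <stddef.h>
#include <string.h>

enum { OP_DONE = -1 };

static int my_op(int ith, int nth, int resume_from, void * dst, size_t n) {
    if (resume_from < 1) {
        /* phase 0: each thread clears its slice of dst */
        size_t chunk = (n + (size_t) nth - 1) / (size_t) nth;
        size_t start = (size_t) ith * chunk;
        size_t end   = start + chunk < n ? start + chunk : n;
        if (start < end) {
            memset((char *) dst + start, 0, end - start);
        }
        return 1; /* ask the dispatcher for barrier #1 */
    }
    /* phase 1: all threads now see the fully cleared dst */
    return OP_DONE;
}

/* What the per-thread dispatch loop would do conceptually: keep
 * re-running the op and issue the barrier in one single spot. */
static void run_op(int ith, int nth, void * dst, size_t n,
                   void (*barrier)(void * ctx), void * barrier_ctx) {
    int resume = 0;
    for (;;) {
        int next = my_op(ith, nth, resume, dst, n);
        if (next == OP_DONE) {
            break;
        }
        barrier(barrier_ctx); /* the only place a barrier is issued */
        resume = next;
    }
}
```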

@ggerganov (Member) commented:

Is the continuation counter really necessary? The memsets and memcpys are single-threaded because in the past we didn't have the barrier mechanism and instead we had the INIT/COMPUTE/FINALIZE phases of the computations. But after the ggml_barrier introduction, we can simply multi-thread these sections.
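
A rough sketch of what multi-threading such a section could look like, assuming hypothetical helper names rather than ggml's actual compute params (which expose a thread index ith and thread count nth): each thread clears its own contiguous slice, and a barrier separates the clearing from the compute phase.

```c
/* Rough sketch -- helper names are hypothetical; in ggml-cpu the thread
 * index/count come from the compute params and the synchronization
 * would be ggml's own barrier. */
#include <stddef.h>
#include <string.h>

static void clear_then_compute(int ith, int nth, float * dst, size_t n,
                               void (*barrier)(void * ctx), void * barrier_ctx) {
    /* split the memset into contiguous per-thread ranges */
    size_t per_thread = (n + (size_t) nth - 1) / (size_t) nth;
    size_t start = (size_t) ith * per_thread;
    size_t end   = start + per_thread < n ? start + per_thread : n;
    if (start < end) {
        memset(dst + start, 0, (end - start) * sizeof(float));
    }

    /* all threads must finish clearing before any of them reads dst */
    barrier(barrier_ctx);

    /* ... compute phase: every thread can now assume dst is zeroed ... */
}
```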

@Kamayuq (Author) commented Feb 7, 2025

I am going to remove the continuation stuff from this MR for now. I used it because I am locally experimenting with a scheduler that is more akin to what a game engine would use. For that, unfortunately, all the barriers have to go, because calling in all the threads and synchronizing them on the spot does not compose very well with the other parallelization primitives.

I also ordered a dual-socket Epyc to get a better idea of the NUMA memory-traffic costs. I will probably bring my proposal (or something similar) back in the future once I have had more time to solidify my WIP code. If you like, we can also talk about some ideas over a video call.

Kamayuq force-pushed the refactor_ggml_barrier_calls branch from daa949d to 97d27e6 on February 7, 2025 02:15
also parallelized some of the memset, memcpy etc. calls before some of the ggml_barriers.
Kamayuq force-pushed the refactor_ggml_barrier_calls branch from 97d27e6 to c323c56 on February 7, 2025 02:17
Kamayuq changed the title from "ggml-cpu: refactor use of ggml_barrier" to "ggml-cpu: use lookuptable for ggml op and parallelized some of the memcpy memset etc. calls before ggml_barriers" on Feb 7, 2025
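
For context on the "lookuptable for ggml op" part of the new title: dispatching ops through a table of function pointers indexed by the op enum, instead of a large switch, could look roughly like the sketch below (illustrative types and names, not the code in this PR).

```c
/* Illustrative sketch of op dispatch via a lookup table indexed by an
 * op enum, replacing a switch. Types and names are hypothetical. */
#include <stddef.h>

typedef enum { MY_OP_ADD, MY_OP_MUL, MY_OP_COUNT } my_op;

typedef void (*op_fn)(float * dst, const float * a, const float * b, size_t n);

static void op_add(float * dst, const float * a, const float * b, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}

static void op_mul(float * dst, const float * a, const float * b, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = a[i] * b[i];
}

/* one function pointer per op; indexed directly by the op enum */
static const op_fn op_table[MY_OP_COUNT] = {
    [MY_OP_ADD] = op_add,
    [MY_OP_MUL] = op_mul,
};

static void compute(my_op op, float * dst, const float * a, const float * b, size_t n) {
    op_table[op](dst, a, b, n); /* table lookup instead of switch (op) */
}
```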