ggml-cpu: use lookup table for ggml ops and parallelize some of the memcpy/memset etc. calls before ggml_barriers #1101
Handle all calls to ggml_barrier at the ggml_graph_compute_thread level, and parallelize some of the memset, memcpy, etc. calls that precede some of the barrier calls.
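As a rough illustration of the "parallelize memset before the barrier" idea, here is a minimal, self-contained sketch (not the actual ggml-cpu code): each compute thread clears only its own slice of a buffer before all threads meet at the barrier, instead of one thread doing the whole memset while the others wait. The `ith`/`nth` fields mirror ggml's per-thread compute params; the buffer size, thread count, and pthread barrier are purely illustrative stand-ins.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N_THREADS 4
#define BUF_SIZE  (1 << 20)

static uint8_t           buf[BUF_SIZE];
static pthread_barrier_t barrier;   // stand-in for ggml_barrier

struct compute_params {
    int ith;  // thread index
    int nth;  // number of threads
};

static void * worker(void * arg) {
    struct compute_params * p = arg;

    // each thread clears its own contiguous slice of the buffer ...
    const size_t chunk = (BUF_SIZE + p->nth - 1)/p->nth;
    const size_t i0    = chunk*p->ith;
    const size_t i1    = i0 + chunk < BUF_SIZE ? i0 + chunk : BUF_SIZE;
    if (i0 < i1) {
        memset(buf + i0, 0, i1 - i0);
    }

    // ... and only then do all threads synchronize before using the buffer
    pthread_barrier_wait(&barrier);

    return NULL;
}

int main(void) {
    pthread_t             th[N_THREADS];
    struct compute_params params[N_THREADS];

    pthread_barrier_init(&barrier, NULL, N_THREADS);

    for (int i = 0; i < N_THREADS; i++) {
        params[i] = (struct compute_params) { .ith = i, .nth = N_THREADS };
        pthread_create(&th[i], NULL, worker, &params[i]);
    }
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(th[i], NULL);
    }

    printf("buffer cleared by %d threads\n", N_THREADS);
    return 0;
}
```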
All barriers are now implemented by returning a continuation point: the tensor op is then re-run, much like a coroutine resuming, but with far less boilerplate, since the tasks are simple enough that none is needed.
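The following is a minimal sketch of that continuation-point pattern, not the actual implementation: the op returns the index of the barrier it stopped at, the driver (the counterpart of ggml_graph_compute_thread) performs the barrier at a single call site, and then re-invokes the op, which skips the phases it has already completed. All names here (tensor_op, run_barrier, OP_DONE) are hypothetical.

```c
#include <stdio.h>

enum { OP_DONE = -1 };

// phase 0 produces partial results; phase 1 combines them,
// so a barrier is required between the two phases
static int tensor_op(int start_from) {
    switch (start_from) {
        case 0:
            printf("phase 0: compute partial results\n");
            return 0;        // "I reached barrier 0, resume me after it"
        case 1:
            printf("phase 1: combine partial results\n");
            return OP_DONE;  // no more barriers needed
    }
    return OP_DONE;
}

static void run_barrier(void) {
    printf("-- barrier --\n"); // stand-in for the single ggml_barrier call site
}

int main(void) {
    // driver loop: re-run the op after every barrier, like resuming a coroutine
    int cont = 0;
    for (;;) {
        cont = tensor_op(cont);
        if (cont == OP_DONE) {
            break;
        }
        run_barrier();
        cont += 1; // resume at the phase right after the barrier just passed
    }
    return 0;
}
```

The op itself stays a single function, so the phases remain readable side by side, while the barrier is invoked from exactly one place in the driver.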
Ideally the code around barriers would be split up, but I also see the benefit of keeping it together, so this is the best compromise I could come up with while still being able to move the barrier call to a single place (within ggml-cpu).