Replies: 11 comments 13 replies
-
Related: openscad/openscad#391
-
This makes sense, but do we have a sense of how much this can gain us? I wonder if the effort wouldn't be better spent parallelizing more of the single-threaded code. What fraction of total time are we spending on triangulation, decimation, and such?
-
Probably a lot, at least in the CUDA case. On my laptop with a mobile 3050 Ti and a 12900HK CPU, small models are 10 times slower with CUDA enabled, and large models are less than 10% faster with CUDA.
-
Oh wow, fair enough! My benchmarking has tended to focus on problems with large numbers of triangles (spheres, sponge). What do you think would be a good benchmark for small models?
-
Not sure; I am testing those Python examples. I think we can port some more simple OpenSCAD benchmarks, which are usually not too large.
-
@ochafik One possible reason is that
-
If we are concerned about that performance, maybe we can also make the collider update lazy (which actually seems like a good idea if users are going to do many transforms).
-
Isn't it already lazy, since it's part of the lazy application of transforms in general?
-
Well, we can be lazier still: don't compute the collider at all if the mesh is not used for further boolean operations.
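A minimal sketch of that dirty-flag idea; the type and member names here are hypothetical stand-ins, not Manifold's actual API:

```cpp
// Sketch only: Mat4, Collider, and MeshImpl are invented names for
// illustration; the real types live inside Manifold's implementation.
struct Mat4 { /* 4x4 affine transform */ };
struct Collider { /* BVH over the mesh's bounding boxes */ };

class MeshImpl {
 public:
  // Transforms stay cheap: they only invalidate the cached collider.
  void Transform(const Mat4& m) {
    transform_ = m;  // composition with the previous transform elided
    colliderDirty_ = true;
  }

  // Only boolean operations call this, so a mesh that is merely
  // transformed and written out never pays for a collider rebuild.
  const Collider& GetCollider() {
    if (colliderDirty_) {
      collider_ = BuildCollider();  // the expensive step
      colliderDirty_ = false;
    }
    return collider_;
  }

 private:
  Collider BuildCollider() { return Collider{}; }  // real rebuild elided
  Mat4 transform_;
  Collider collider_;
  bool colliderDirty_ = true;
};
```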
-
Thinking about this more, I wonder if we should just move to C++ parallel algorithms plus Vulkan compute shaders for GPU acceleration, and ditch thrust and CUDA altogether. Adding an abstraction over thrust and supporting Vulkan would be a large refactor, and would probably make things more complicated without gaining much performance. C++ parallel algorithms usually use a TBB backend, which is pretty fast judging from our existing code using thrust's TBB backend, and is likely more robust than thrust. A Vulkan backend means we have to write every GPU operation ourselves, so we would not get things like GPU sorting for free. I think this is fine, as GPU sorting is probably not much faster than a multithreaded sort once the extra memory transfers are accounted for. And writing the Vulkan backend ourselves would let us control memory synchronization behavior and launch multiple streams, which seems hard to do with thrust.
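For reference, the standard C++ parallel algorithms are easy to try; with libstdc++ the parallel policies dispatch to TBB. A minimal sketch, assuming a C++17 toolchain:

```cpp
#include <algorithm>
#include <execution>
#include <random>
#include <vector>

int main() {
  // Fill a large vector with random floats.
  std::vector<float> v(1 << 24);
  std::mt19937 rng(42);
  std::uniform_real_distribution<float> dist(0.f, 1.f);
  for (auto& x : v) x = dist(rng);

  // std::execution::par asks the implementation to parallelize;
  // with libstdc++ this runs on top of TBB.
  std::sort(std::execution::par, v.begin(), v.end());
  return 0;
}
```

With g++ and libstdc++ this needs TBB at link time, e.g. `g++ -std=c++17 -O2 sort.cpp -ltbb`.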
-
As mentioned previously, sort is one of our bottlenecks. I did a simple benchmark using g++ 11.3.0, rustc 1.64.0, and TBB 2020.3, and got the following results:
The benchmark just generates vectors with random elements, runs the sort 15 times, discards the first 5 runs, and takes the average.
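The benchmark source itself isn't included in the thread; a sketch matching that description (random data, 15 runs, first 5 discarded as warm-up) could look like this, using the parallel sort from above:

```cpp
#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
  constexpr size_t kN = 1 << 22;  // element count, chosen arbitrarily here
  constexpr int kRuns = 15, kWarmup = 5;

  std::mt19937 rng(0);
  std::uniform_int_distribution<int> dist;
  std::vector<double> times;

  for (int run = 0; run < kRuns; ++run) {
    std::vector<int> v(kN);
    for (auto& x : v) x = dist(rng);  // fresh random data each run

    auto t0 = std::chrono::steady_clock::now();
    std::sort(std::execution::par, v.begin(), v.end());
    auto t1 = std::chrono::steady_clock::now();

    if (run >= kWarmup)  // discard the first 5 runs as warm-up
      times.push_back(std::chrono::duration<double>(t1 - t0).count());
  }

  double avg = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
  std::cout << "avg sort time: " << avg << " s\n";
  return 0;
}
```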
-
The current heuristic in https://github.com/elalish/manifold/blob/master/src/utilities/include/par.h is basically an arbitrary threshold that gives OK-ish performance but is definitely not optimal. There are two problems here:
I think we need a more complete wrapper around thrust to do this, and VecDH should have a flag indicating whether it was last passed to the GPU or used on the host. Ideally we could also try things like Vulkan compute shaders as an alternative backend for this API, implementing it selectively for the functions that can get a good speedup.
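A minimal sketch of what that residency flag could look like; `Residency`, `Policy`, and `Autopolicy` are invented names for illustration, not Manifold's actual API:

```cpp
#include <cstddef>

// Where the vector's data was last touched, so dispatch can avoid
// needless host<->device transfers.
enum class Residency { Host, Device };
enum class Policy { Seq, Par, Gpu };

// Hypothetical slice of a device/host vector; real data members elided.
template <typename T>
class VecDH {
 public:
  size_t size() const { return size_; }
  Residency residency() const { return residency_; }
  void MarkDevice() { residency_ = Residency::Device; }  // set after a GPU pass
 private:
  size_t size_ = 0;
  Residency residency_ = Residency::Host;  // fresh data starts on the host
};

// Choose a backend from size *and* current residency instead of a single
// size threshold: prefer the GPU when the data already lives there, use
// TBB for large host-side data, and stay sequential when the problem is
// too small to amortize threading overhead.
template <typename T>
Policy Autopolicy(const VecDH<T>& v, size_t parThreshold = 1 << 14) {
  if (v.residency() == Residency::Device) return Policy::Gpu;
  return v.size() > parThreshold ? Policy::Par : Policy::Seq;
}
```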