Replies: 11 comments 13 replies
-
Related: openscad/openscad#391
-
This makes sense, but do we have a sense of how much this can gain us? I wonder if the effort wouldn't be better spent parallelizing more of the single-threaded code. What fraction of total time are we spending on triangulation, decimation, and such?
-
Probably a lot, at least in the CUDA case. On my laptop with a mobile 3050 Ti and a 12900HK CPU, small models are 10 times slower with CUDA enabled, and large models are less than 10% faster with CUDA.
-
Oh wow, fair enough! My benchmarking has tended to focus on problems with large numbers of triangles (spheres, sponge). What do you think would be a good benchmark for small models?
-
Not sure; I am testing those Python examples. I think we can port some more simple OpenSCAD benchmarks, which are usually not too large.
-
@ochafik One possible reason is that
-
If we are concerned about that performance, maybe we can also make the collider update lazy (which actually seems like a good idea if users are going to do many transforms).
-
Isn't it already lazy, since it's part of the lazy application of transforms in general?
-
Well, we can be lazier still: don't compute the collider at all if the mesh is not used for further boolean operations.
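A minimal sketch of that dirty-flag idea; the type and member names here are hypothetical stand-ins, not Manifold's actual API:

```cpp
// Sketch only: Mat4, Collider, and MeshImpl are invented names for
// illustration; the real types live inside Manifold's implementation.
struct Mat4 { /* 4x4 affine transform */ };
struct Collider { /* BVH over the mesh's bounding boxes */ };

class MeshImpl {
 public:
  // Transforms stay cheap: they only invalidate the cached collider.
  void Transform(const Mat4& m) {
    transform_ = m;  // composition with the previous transform elided
    colliderDirty_ = true;
  }

  // Only boolean operations call this, so a mesh that is merely
  // transformed and written out never pays for a collider rebuild.
  const Collider& GetCollider() {
    if (colliderDirty_) {
      collider_ = BuildCollider();  // the expensive step
      colliderDirty_ = false;
    }
    return collider_;
  }

 private:
  Collider BuildCollider() { return Collider{}; }  // real rebuild elided
  Mat4 transform_;
  Collider collider_;
  bool colliderDirty_ = true;
};
```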
-
Thinking about this more, I wonder if we should just move to C++ parallel algorithms plus Vulkan compute shaders for GPU acceleration, and ditch thrust and CUDA altogether. Adding an abstraction over thrust and supporting Vulkan would be a large refactor, and would probably make things more complicated without gaining much performance. C++ parallel algorithms usually use a TBB backend, which is pretty fast judging from our existing code using thrust's TBB backend, and is likely more robust than thrust. A Vulkan backend means we have to write every GPU operation ourselves, so we would not get things like GPU sorting for free. I think this is fine, as GPU sorting is probably not much faster than a multithreaded sort once the extra memory transfers are accounted for. And writing the Vulkan backend ourselves would let us control memory synchronization behavior and launch multiple streams, which seems hard to do with thrust.
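For reference, the standard C++ parallel algorithms are easy to try; with libstdc++ the parallel policies dispatch to TBB. A minimal sketch, assuming a C++17 toolchain:

```cpp
#include <algorithm>
#include <execution>
#include <random>
#include <vector>

int main() {
  // Fill a large vector with random floats.
  std::vector<float> v(1 << 24);
  std::mt19937 rng(42);
  std::uniform_real_distribution<float> dist(0.f, 1.f);
  for (auto& x : v) x = dist(rng);

  // std::execution::par asks the implementation to parallelize;
  // with libstdc++ this runs on top of TBB.
  std::sort(std::execution::par, v.begin(), v.end());
  return 0;
}
```

With g++ and libstdc++ this needs TBB at link time, e.g. `g++ -std=c++17 -O2 sort.cpp -ltbb`.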
-
As mentioned previously, sort is one of our bottlenecks. I did a simple benchmark using g++ 11.3.0, rustc 1.64.0, and TBB 2020.3, and got the following results:
The benchmark just generates vectors with random elements, runs the sort 15 times, discards the first 5 runs, and takes the average.
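The benchmark source itself isn't included in the thread; a sketch matching that description (random data, 15 runs, first 5 discarded as warm-up) could look like this, using the parallel sort from above:

```cpp
#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
  constexpr size_t kN = 1 << 22;  // element count, chosen arbitrarily here
  constexpr int kRuns = 15, kWarmup = 5;

  std::mt19937 rng(0);
  std::uniform_int_distribution<int> dist;
  std::vector<double> times;

  for (int run = 0; run < kRuns; ++run) {
    std::vector<int> v(kN);
    for (auto& x : v) x = dist(rng);  // fresh random data each run

    auto t0 = std::chrono::steady_clock::now();
    std::sort(std::execution::par, v.begin(), v.end());
    auto t1 = std::chrono::steady_clock::now();

    if (run >= kWarmup)  // discard the first 5 runs as warm-up
      times.push_back(std::chrono::duration<double>(t1 - t0).count());
  }

  double avg = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
  std::cout << "avg sort time: " << avg << " s\n";
  return 0;
}
```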
-
The current heuristic in https://github.com/elalish/manifold/blob/master/src/utilities/include/par.h is basically an arbitrary threshold that gives OK-ish performance but is definitely not optimal. There are two problems here:
I think we need a more complete wrapper around thrust to do this, and VecDH should have a flag indicating whether it was last passed to the GPU or used on the host. Ideally we could also try things like Vulkan compute shaders as an alternative backend for this API, implementing it selectively for the functions that can get a good speedup.
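A minimal sketch of what that residency flag could look like; `Residency`, `Policy`, and `Autopolicy` are invented names for illustration, not Manifold's actual API:

```cpp
#include <cstddef>

// Where the vector's data was last touched, so dispatch can avoid
// needless host<->device transfers.
enum class Residency { Host, Device };
enum class Policy { Seq, Par, Gpu };

// Hypothetical slice of a device/host vector; real data members elided.
template <typename T>
class VecDH {
 public:
  size_t size() const { return size_; }
  Residency residency() const { return residency_; }
  void MarkDevice() { residency_ = Residency::Device; }  // set after a GPU pass
 private:
  size_t size_ = 0;
  Residency residency_ = Residency::Host;  // fresh data starts on the host
};

// Choose a backend from size *and* current residency instead of a single
// size threshold: prefer the GPU when the data already lives there, use
// TBB for large host-side data, and stay sequential when the problem is
// too small to amortize threading overhead.
template <typename T>
Policy Autopolicy(const VecDH<T>& v, size_t parThreshold = 1 << 14) {
  if (v.residency() == Residency::Device) return Policy::Gpu;
  return v.size() > parThreshold ? Policy::Par : Policy::Seq;
}
```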