Use Manifold for much faster & multithreaded CSG & minkowski operations #4533

Merged
merged 9 commits into from
Mar 18, 2023

Conversation

ochafik
Contributor

@ochafik ochafik commented Feb 25, 2023

This is introducing experimental support for the https://github.com/elalish/manifold library by @elalish.

TL;DR: It's wicked fast, and ready to test / review. Maybe we get to ditch CGAL for most of the 3D rendering soon (except hulling & convex decomposition for minkowski), with 5-30x speedups over fast-csg (YMMV, multithreading works better on some models than others) and only theoretical precision downgrade (TBC, please help test!)

You can download Windows / Linux binaries from the last green CircleCI run (e.g. a Linux AppImage, a Win64 build, a Win32 build).

Note: fenced by the manifold experimental feature (needs enabling in the UI's settings or passing --enable=manifold in CLI - which doesn't affect the UI).

TODO:

  • Basic build & rendering plumbing
  • Find a way for Manifold's multithreading to kick in (there was some confusion about TBB flags). This incidentally fixes Multi-threaded Geometry rendering [$1,275] #391, although more work could be needed to maximize core utilization.
  • Fix segfault on tests/data/scad/issues/issue2342.scad (sent fix to Manifold: Simplify the CsgOpNodes as we build them, rather than in GetChildren elalish/manifold#368 )
  • Fix Minkowski. It's so fast now it's ridiculous. (read below)
  • Implement Hull
  • Ensure Manifold uses batch operations. More specifically, let it build deeper internal trees by not forcing them to leaf nodes with our facet-based priority queue logic (it seems to have the same, along with fast-union-like logic to compose non intersecting operands).
  • Fix tests (enabling manifold for all but a few known issues)
  • Ensure builds on all platforms
  • Clean up code & submit for review
  • Run Benchmarks (some results here, more to come)
  • Fix mirror tests (prolly need to ask Manifold to flip faces if the matrix transform determinant is < 0 as we do with our other engines)
  • Ensure works in WASM (CircleCI build env not updated yet)
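
The mirror item in the list above rests on a standard fact: a transform whose linear part has a negative determinant reverses orientation, so each triangle's winding has to be reversed to keep normals pointing outward. A minimal Python sketch of that rule, assuming a row-major 3x3 linear part and triangles stored as index tuples (`det3` and `transform_mesh` are hypothetical names, not OpenSCAD's or Manifold's API):

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists (row-major)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def transform_mesh(points, triangles, linear):
    """Apply a 3x3 linear transform; reverse winding if it mirrors."""
    moved = [
        tuple(sum(linear[r][k] * p[k] for k in range(3)) for r in range(3))
        for p in points
    ]
    if det3(linear) < 0:
        # Orientation flipped: reverse each triangle so normals stay outward.
        triangles = [tuple(reversed(t)) for t in triangles]
    return moved, list(triangles)


mirror_x = [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]
pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
tris = [(0, 1, 2)]
new_pts, new_tris = transform_mesh(pts, tris, mirror_x)
```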

A note on precision

Manifold uses single-precision floats, which on paper is a far cry from OpenSCAD's numeric handling, which is double precision at worst and, as a general rule, uses exact rationals (GMP / MPFR + CGAL) when doing CSG operations.

However, there are already substantial exceptions to OpenSCAD's exact numeric handling:

  • minkowski and hull use double precision conversions and lose the exact rationals
  • TransformNodes (multmatrix / translate / scale / rotate) transform things in double then reconvert them to "exact" rationals. There's no reasonable way to get exact transforms in the general case anyway, e.g. to guarantee that the result of N rotations of PI / N around the origin or that applying two successive opposite transforms produces exactly the same solid (although that may appear to work in some cases).

That said, rounding errors from transforming single-precision coordinates could potentially snowball in trees that have many nested transforms.

The simple solution I'm rooting for here is to push the double-precision world transforms all the way down to pseudo-leaf nodes (i.e. actual leaves but also shared subtrees, minkowski, render, 2D-3D transitions), similarly to what I did in #3637.
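
To make the snowballing concern concrete, here is a small Python sketch (not OpenSCAD code). It emulates single precision by round-tripping through `struct`, applies 1000 nested rotations with rounding after each one, and compares that with composing the transform in double precision first and rounding once at the "leaf". Summing the angles stands in for multiplying the world matrices; the helper names are hypothetical.

```python
import math
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def rotate_z(p, angle, rnd=lambda v: v):
    """Rotate point p around Z, rounding each coordinate with rnd."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (rnd(c * x - s * y), rnd(s * x + c * y), rnd(z))


def err(p):
    """Distance from the starting point (a full turn should land on it)."""
    return math.dist(p, start)


n = 1000
step = 2 * math.pi / n
start = (1.0, 0.0, 0.0)

# Nested transforms, rounded to single precision after every step.
p_single = start
for _ in range(n):
    p_single = rotate_z(p_single, step, f32)

# Same nesting, kept in double precision throughout.
p_double = start
for _ in range(n):
    p_double = rotate_z(p_double, step)

# Transform composed in double first, applied once, rounded once.
p_pushed = rotate_z(start, n * step, f32)

print(err(p_single), err(p_double), err(p_pushed))
```

The first figure is the one to watch: it grows with the nesting depth, while the "pushed down" variant pays the single-precision rounding only once.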

I'll explore that route separately in #4561 (after initially trying to push the transform on the fly in GeometryEvaluator, as CSGTreeEvaluator does with the state; I found out this messes with caching, so I'm going for a more traditional tree transform route).

Licensing: GPLv2+ + Apache 2 = GPLv3

Manifold is under Apache 2 license, which can be included in GPLv3 but not GPLv2 projects.

Since OpenSCAD source is GPLv2+, this means one can only release an OpenSCAD + Manifold binary under GPLv3. The same was already happening w/ CGAL (see this thread) so this doesn't change anything.

Building & running

To use Manifold as an experimental rendering engine, build locally (make sure to install the general prerequisites) and run with --enable=manifold (or enable the feature in the UI's preferences)

git clone --recurse-submodules https://github.com/ochafik/openscad --branch manifold1 openscad-manifold
cd openscad-manifold
( cd submodules/manifold && git apply thrust.diff )

( \
  mkdir -p build && \
  cd build && \
  cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=1 \
    -DEXPERIMENTAL=1 \
    .. && \
    time make -j12 \
)
alias openscad=./build/OpenSCAD.app/Contents/MacOS/OpenSCAD # On Mac

openscad examples/Basics/CSG.scad --render -o out.stl '-D$fn=300'
# 50s

openscad examples/Basics/CSG.scad --render -o out.stl '-D$fn=300' --enable=fast-csg
# 2.8s

openscad examples/Basics/CSG.scad --render -o out.stl '-D$fn=300' --enable=manifold
# 0.8s (!)

(Interestingly enough, the rendering of that model breaks down at like $fn=1000. Might get back to this later to see if it's a Manifold issue or if it's about how we build spheres)

minkowski, 20x faster (YMMV)

I've ported the OpenSCAD minkowski logic for the Manifold case, with a few upgrades:

  • Fixed slight inefficiency in the original logic (points collection and conversion was being redone over and over)
  • Sprinkled a bit of TBB-based parallelism (wink wink Multi-threaded Geometry rendering [$1,275] #391 ) since this brings that dependency anyway and there's less thread-unsafe CGAL numerics floating around to worry about. Used to decompose two operands to convex parts in parallel, and to execute all the cross products of decomposed part pairs in parallel. Can disable w/ OPENSCAD_NO_PARALLEL=1 env var to see the difference (doesn't disable Manifold's parallel operations tho).
  • Switched to CGAL's surface mesh for hulling
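
The structure described in the list above (decompose both operands into convex parts, then hull the point sums of every pair of parts in parallel) can be sketched in 2D with the Python standard library. This is only an illustration under stated assumptions: the PR works in 3D with CGAL and TBB, the convex decomposition is done by hand here, and `convex_minkowski` / `minkowski` are hypothetical names. Python threads show the task layout rather than a real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_minkowski(pair):
    """Minkowski sum of two convex parts: hull of all pairwise point sums."""
    a, b = pair
    return convex_hull([(p[0] + q[0], p[1] + q[1]) for p in a for q in b])


def minkowski(parts_a, parts_b):
    """Sum every pair of convex parts in parallel. The union of the
    returned hulls is the Minkowski sum of the two decomposed operands."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(convex_minkowski, product(parts_a, parts_b)))


# An L-shape decomposed by hand into two rectangles, plus a unit square.
l_shape = [[(0, 0), (2, 0), (2, 1), (0, 1)], [(0, 1), (1, 1), (1, 2), (0, 2)]]
square = [[(0, 0), (1, 0), (1, 1), (0, 1)]]
pieces = minkowski(l_shape, square)
```

The final union of `pieces` is left out; in the PR's setting that is the kind of work handed to the CSG engine.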

Long story short: minkowski is orders of magnitude faster in lots of cases.

For instance, @revarbat's BOSL2 offset3d docs essentially say the function is so slow they don't even bother generating a preview. Well, the following takes 4sec on my M1 mac (at roughly 2.5 cores utilization), instead of 4m31sec (or 1m34sec with fast-csg):

include <BOSL2/std.scad>
$fn=$preview ? 15 : 30;
offset3d(1)
    bottom_half()
        difference() {
            cube(10, center=true);
            xscale(0.9) yscale(1.1)
                sphere(d=10);
        }

image

(Ngl, I struggled to find cool minkowski examples - adding some here - which this PR might help make more practical and widespread)

That said it looks like the CGAL Nef-based convex component decomposition we use is quick to crash with random models. Might muster the courage to file a bug soon

@elalish

elalish commented Feb 25, 2023

Super exciting, @ochafik! Don't hesitate to open issues on Manifold if you find any. I'm about to push a new release, but soon I'd like to integrate Clipper lib for a proper 2D subsystem. Curious if that would be useful for you?

@ochafik
Contributor Author

ochafik commented Feb 27, 2023

Thanks @elalish ! So far it looks remarkably stable (and fast), amazing work!

I've seen some weird corrupted results when I dramatically increased the number of vertices. Not sure yet if it's a manifold or OpenSCAD thing, will circle back on that.

Also, I can't seem to get much multithreading going on w/ TBB, not sure I'm doing it right / for which ops it's meant to kick in (seemingly getting 1.5 cores used at most).

Re/ Clipper, think it's already used in OpenSCAD but I'm unfamiliar w/ its 2D stack. I'd be surprised if its performance were a bottleneck for most users, though.

Things that would be great to have in Manifold on the 3D front, although they could be quite expensive: quickhull, minkowski (OpenSCAD has its own optimized implementation, which relies on CGAL's Nef to decompose bodies in convex parts).

@elalish

elalish commented Feb 27, 2023

Yeah, I've never done much with the OMP/TBB backends of Thrust - considering it's an Nvidia library, my guess is they don't get a lot of optimization love. I would also guess our parallelization is a bit too fine-grained for multi-CPU to help that much.

I never prioritized convex hull or minkowski since I've never actually used them in my OpenSCAD designs. I'd guess Hull wouldn't be too hard with a sweep-plane - PRs welcome!

@pca006132
Member

pca006132 commented Feb 27, 2023

For multithreading, can you try building manifold with python and run time python bindings/python/examples/run_all.py?
On my Linux machine I got

Took 0.2ms for union_failure
Took 26.2ms for bricks
Took 35.3ms for cube_with_dents
Took 1200.6ms for maze
python bindings/python/examples/run_all.py  7.93s user 0.83s system 610% cpu 1.436 total

which is using a lot of cores, so I think thrust is working fine for this, perhaps it is an OS/build flag/TBB version thing?
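
For readers less used to `time` output, the "610% cpu" figure is CPU time (user + system) divided by wall-clock time. A tiny Python sketch of that arithmetic, using the numbers from the timing line quoted above (the function name is just for illustration):

```python
def cpu_utilization(user_s, system_s, wall_s):
    """Average number of cores kept busy: CPU time divided by wall time."""
    return (user_s + system_s) / wall_s


# Figures from the run_all.py timing line quoted above.
cores = cpu_utilization(user_s=7.93, system_s=0.83, wall_s=1.436)
print(f"{cores:.2f} cores busy on average ({cores:.0%} cpu)")
```

That works out to roughly 6.1 cores kept busy on average.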

@ochafik
Contributor Author

ochafik commented Feb 27, 2023

Ah yes, managing 2.5 cores with another model (condensed matter from #3641 (comment)) , guess it will depend on the model tree shape. Getting same utilisation figures as you with Manifold’s perfTest so flags are fine.

I’m working on separate PRs for tree transforms to alter said shape anyway, and on octree scene partitioning to max out parallelism (probably in a multi process setup for starters).

Edit: seeing peak core utilization of 9 cores (Avg. 4) 🎉 w/ a model such as examples/Old/example024.scad w/ -Dn=5, although I'll need a machine w/ more RAM to finish that render.

@pca006132
Member

Yes we have a very simple tree transform in https://github.com/elalish/manifold/blob/master/src/manifold/src/csg_tree.cpp, there is some parallelization going on but it is limited as I'm not that good at math. Feel free to open PRs to add octree partitioning for speedup!

@ochafik ochafik changed the title Use Manifold for much faster CSG operations (WIP) Use Manifold for much faster & multithreaded CSG operations (WIP) Feb 27, 2023
@gsohler
Member

gsohler commented Feb 27, 2023

Just tested your branch and managed to compile it. I have one polyhedron with points and faces in my source. It displays in default OpenSCAD, but not with your branch. Should it already display at this point?

@ochafik
Contributor Author

ochafik commented Feb 28, 2023

@gsohler thanks for trying it out! That's unexpected, could you share a model that exhibits the issue? Any special build flags beyond what I've listed above? And did you check that there's no other experimental features enabled? You could also try the rendering in command line (w/ --enable=manifold) to see if it makes a difference.

@gsohler
Member

gsohler commented Feb 28, 2023

Hi Oliver, thank you for your attention.
I compiled exactly as documented, lately, only make -j16 argument skipped i had to install tbb-devel.
i have only manifold experimental feature enabled.
other primitives show fine, just polyhedron does not.
background is to find out how manifold handles self-intersection by moving the last pt in pt list.
here is my model. it displays fine with default openscad

points=[
[0,0,0], [10,0,0],
[10,10,0],[0,10,0],
[1,0,1], [9,0,1],
[9,10,1],[1,10,1],
[0,0,10],[1,0,10],
[1,10,10],[0,10,10],
[9,0,10],[10,0,10],
[10,10,10],[9,10,10],
[8,5,5]
];

faces=[
[3,2,1,0],[4,5,6,7],
[8,9,10,11],[12,13,14,15],
[0,1,13,12,5,4,9,8],
[2,3,11,10,7,6,15,14],
[1,2,14,13], [0,3,11,8],
[5,12,15,6],[4,16,9],
[9,16,10], [10,16,7],
[7,16,4]
];

union()
{
//cube();
//sphere();
polyhedron(points=points, faces=faces);
}

@ochafik
Contributor Author

ochafik commented Feb 28, 2023

@gsohler ah great, thanks for sharing! If you inspect the console output (easier in command line) you’ll see the warning ERROR: [manifold] PolySet -> Manifold conversion failed: NotManifold. I’d assume manifold skips solids in that state.

I’ll see if it’s worth hacking in some mesh repair attempt when manifold chokes on inputs. CGAL has some helpers for that, although they won't be perfect (see this question on SO for instance: PMP::self_intersections + aggressive CGAL::Euler::remove_face + hole filling)
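
As a rough illustration of what "not manifold" means topologically, the Python sketch below checks that every directed edge of a face list appears exactly once and is matched by its reverse in a neighbouring face. It is an assumption that this resembles the condition behind the NotManifold error; it is not Manifold's actual code, and it does not detect geometric self-intersections, which need a separate test.

```python
from collections import Counter


def is_oriented_manifold(faces):
    """True if every directed edge appears once and is matched by its reverse."""
    edges = Counter()
    for face in faces:
        for i, a in enumerate(face):
            b = face[(i + 1) % len(face)]
            edges[(a, b)] += 1
    return all(
        count == 1 and edges.get((b, a), 0) == 1
        for (a, b), count in edges.items()
    )


tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
open_tetra = tetra[:-1]              # one face missing: leaves a hole
flipped = tetra[:-1] + [(0, 3, 2)]   # last face wound the wrong way

print(is_oriented_manifold(tetra),
      is_oriented_manifold(open_tetra),
      is_oriented_manifold(flipped))
```

The same function accepts polygonal faces such as the quads and octagons in the model above, since it only walks each face's boundary loop.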

@gsohler
Member

gsohler commented Mar 4, 2023

@elalish , can manifold be used to fix self-intersecting objects(maybe disable optimization, that edges never cut faces of the same object) ?

@mconsidine

mconsidine commented Mar 5, 2023

Hi,
Quick question (and apologies if I should ask this elsewhere): looks like you're working with manifold and the fast-linalg2 lines at the same time. Does one incorporate the other at the moment? Compiled the manifold branch on Linux Mint 21 (Ubuntu) just fine. But then wondered if the linalg2 branch was incorporating that as well. TIA. Great work ...
mconsidine
EDIT: I think I have this sorted out, after (mis)compiling each. They're separate. Thx....

@elalish

elalish commented Mar 6, 2023

@gsohler Not currently but I'd like to, see here. It'll be a different algorithm though, because the parallelized one can't do self-intersection removal.

@mconsidine

Hi,
When running cmake with the flags identified in the script, I get the following warning:
CMake Warning:
  Manually-specified variables were not used by the project:

    THRUST_HOST_SYSTEM
    THRUST_MULTICONFIG_ENABLE_SYSTEM_TBB

Can anyone tell me if this is expected or does it suggest I'm missing something somewhere?
TIA,
mconsidine

@ochafik
Contributor Author

ochafik commented Mar 14, 2023

@mconsidine thanks for reporting back on that, I'd missed that warning. Turns out these macros are already defined by Manifold's cmake build files, I got confused with Thrust's instructions / cmake variables that don't apply to the way Manifold includes it.

Hopefully you should see some multicore utilization regardless, depending on the model.

@mconsidine

mconsidine commented Mar 15, 2023

EDIT: Looks like my best route to go is to limit the number of processors being used. Once it compiles a "render" in openscad shows all 32 cores being engaged. Nice!

Fyi, compilation of a fresh clone of the branch brings a 32 core, 32 GB RAM Linux Mint 21 box to its knees. It happens reliably at the 80% point "Building CXX object CMakeFiles/OpenSCAD.dir/src/glview/cgal/CGALRenderer.cc.o"

If I reboot and don't clean the directory it appears to pick up where it left off. One thing I can try is to increase swap memory. But that's already at 4G and since it and RAM are maxed at this point, I'm suspecting I've got some flag set incorrectly or some configuration setting that is a problem.

Any advice?
mconsidine

@ochafik
Contributor Author

ochafik commented Mar 15, 2023

@mconsidine re/ build issue: you can be more gentle on your machine by reducing the build parallelism. make -j12 means up to 12 parallel compilations. You can bring it down to 2, or just drop the -j param for a fully serial build. That said, your specs seem more than capable to build with high parallelism. Which version of GCC or Clang are you building with? If that's also an issue with the master branch (as I'd assume it would), I'd suggest filing a separate issue.

And re/ utilization of all 32 cores during a render, that's good to read (I'm curious what kind of model you got this with, feel free to share!). There should be ways to limit the max parallelism to not hog the entire system (e.g. tbb::global_control(tbb::global_control::max_allowed_parallelism)), I'll look into it.

@mconsidine

I'll get some numbers to you in the next couple of days. The model (which is a WIP) is a reconstruction from blueprints of an old driving clock for a telescope. So I suspect most of the render resources are being used on calculation of threads. Haven't added gears yet.

Re: fast-linalg2 work. Is/was that specific to fastcsg, or are there yet-more speed up opportunities when paired with manifold? Just curious if I can start paying attention to only the manifold version.

Re core usage, I noticed that it seemed to be spreading the load out evenly. Ie if one was at 24% they pretty much all were. Ditto for other values. So I suppose there could be a model that would max all of them out. Or, put differently, if my machine only had 4 cores, I might bury it.

I do have a Linux laptop that I could run the experiment on. Compiling with all cores but 2 has been nice (ie make -j$(nproc --ignore=2) I think does the trick). But it was only with the latest work that I noticed all the RAM and swap partition getting used up. So I don't think I'll try compiling on the laptop :)

@ochafik ochafik changed the title Use Manifold for much faster & multithreaded CSG operations (WIP) Use Manifold for much faster & multithreaded CSG & minkowski operations (WIP) Mar 16, 2023
@ochafik
Contributor Author

ochafik commented Mar 16, 2023

@mconsidine fast-linalg2 is completely orthogonal to this PR (and isn't ready for prime time yet, although I might open a draft PR soon).

It's meant to optimize the evaluation stage of the model (execution of the OpenSCAD language itself) as opposed to the rendering of the geometry (done in C++ using CGAL or now, Manifold).

Only models that do huge amounts of geometry computations in their script will benefit from the linalg work (e.g. when going overboard w/ BOSL2). To know if it could be the case for your model, you could see how long it takes to render to CSG (i.e. just dump the render AST of the model).

@ochafik
Contributor Author

ochafik commented Mar 16, 2023

@t-paul I'd need some help w/ the build images for CircleCI:

  • Need CMake 3.18+ in openscad/appimage-x86_64-base and openscad/wasm-base:latest. I see you're including your own deb repo for CMake, but I wonder if something like apt install cmake/buster-backports could work? (it just so happens to have 3.18)
  • Need intel-tbb in openscad/mxe-i686-gui and openscad/mxe-x86_64-gui (sent Add intel-tbb to MXE images docker-openscad#14 for review)

@t-paul
Member

t-paul commented Mar 16, 2023

AppImage: I'll check the backports package. I suppose moving on to 20.04 would be fine by now too, but that might take longer.
MXE: I'll kick off the rebuild, that will take a while as it builds the whole world.

@t-paul
Member

t-paul commented Mar 16, 2023

I tried updating the AppImage to 20.04 which went surprisingly smoothly (well, it builds and runs on my machine 😀). It's grabbing cmake from https://apt.kitware.com/ubuntu bringing in version 3.26.0-0kitware1ubuntu20.04.1.

@mconsidine

Latest pull compiles without error.

FWIW, I am compiling with
"cmake version 3.25.2
CMake suite maintained and supported by Kitware (kitware.com/cmake)."
with issues on Linux Mint 21. I do need to constrain the number of parallel jobs in the "make" step.

And, fwiw, attached is the openscad file that I'm working with (definitely a work in progress, so there are probably a zillion ways to do things more effectively ...)
mconsidine
ClockDrive.zip

@t-paul
Member

t-paul commented Mar 16, 2023

Right, adding libtbb-dev...

@ochafik
Contributor Author

ochafik commented Mar 20, 2023

@t-paul btw, not sure how often the MacOS dev snapshot is built, is there something jammed there?

@t-paul
Member

t-paul commented Mar 20, 2023

It builds via scheduler as we are limited to 500 minutes per month.
The last build failed as it detected python2. I did not have time to look at this. I saw you did fight with that topic earlier already.

https://app.circleci.com/pipelines/github/openscad/openscad/3415/workflows/0bb8baf6-60ff-45cc-a246-73c4df61bec4/jobs/14012

@mconsidine

mconsidine commented Mar 20, 2023

@t-paul I drafted some STL output parallelization here (WIP), but I'm intrigued by these double shenanigans, thanks for the pointer!

And thanks for the new wasm image, will try and update the playground soon!

@ochafik Thank you for this. The WIP seems to work - below are results of testing it on my model. The bottleneck doesn't seem to have been in the STL generation, though there is a pick up. The enormous gain is in the use of manifold. In this version of the model I have added BOSL2 and the generation of gears with teeth and threads, rather than use stand-ins. I have not tried to evaluate the STL yet for printing, as there is still more model building to do. Also, I note that invoking manifold gets me around the errors shown doing it the former way.

Many, many thanks!


This only invokes 1 core:

time ./openscad /home/matt/Downloads/3D_design_files/EquatorialClockDrive_reorg.scad --render -out.stl '-D$fn=192'

EXPORT-WARNING: Exported object may not be a valid 2-manifold and may need repair
Geometries in cache: 815
Geometry cache size in bytes: 12152832
CGAL Polyhedrons in cache: 17
CGAL cache size in bytes: 51466704
Total rendering time: 0:17:14.302
Top level object is a 3D object:
Simple: no
Vertices: 156891
Halfedges: 507728
Edges: 253864
Halffacets: 194172
Facets: 97086
Volumes: 44
WARNING: Object may not be a valid 2-manifold and may need repair!

real 17m18.075s
user 17m12.266s
sys 0m4.544s

Missed seeing what cores were involved on save to STL. But interactively, I see multiple cores invoked.
It just doesn't seem to be the bottleneck I thought it was.

time ./openscad /home/matt/Downloads/3D_design_files/EquatorialClockDrive_reorg.scad --render --enable=parallel-stl -out1.stl '-D$fn=192'

EXPORT-WARNING: Exported object may not be a valid 2-manifold and may need repair
Geometries in cache: 815
Geometry cache size in bytes: 12152832
CGAL Polyhedrons in cache: 17
CGAL cache size in bytes: 51466704
Total rendering time: 0:16:57.683
Top level object is a 3D object:
Simple: no
Vertices: 156891
Halfedges: 507728
Edges: 253864
Halffacets: 194172
Facets: 97086
Volumes: 44
WARNING: Object may not be a valid 2-manifold and may need repair!

real 17m1.475s
user 17m21.748s
sys 0m5.519s


Multiple cores, "parallel-stl" not specified. (But does it remember that setting from an earlier use?)

time ./openscad /home/matt/Downloads/3D_design_files/EquatorialClockDrive_reorg.scad --render --enable=manifold -out2.stl '-D$fn=192'

Geometries in cache: 1079
Geometry cache size in bytes: 29945624
CGAL Polyhedrons in cache: 0
CGAL cache size in bytes: 0
Total rendering time: 0:00:35.667
Top level object is a 3D object:
Facets: 247120

real 0m38.276s
user 13m15.744s
sys 0m5.723s


Everybody's in the pool on this (32 cores):

time ./openscad /home/matt/Downloads/3D_design_files/EquatorialClockDrive_reorg.scad --render --enable=manifold --enable=parallel-stl -out3.stl '-D$fn=192'

Geometries in cache: 1079
Geometry cache size in bytes: 29945624
CGAL Polyhedrons in cache: 0
CGAL cache size in bytes: 0
Total rendering time: 0:00:33.999
Top level object is a 3D object:
Facets: 247120

real 0m36.637s
user 13m34.466s
sys 0m6.249s
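
To put the timings above side by side, here is a short Python sketch that converts the reported "real", "user" and "sys" values into a wall-clock speedup and an average busy-core count for the plain run and the --enable=manifold run (the numbers are copied from the logs above; the variable names are only for illustration):

```python
def seconds(minutes, secs):
    """Convert a 'XmY.YYYs' reading into seconds."""
    return 60 * minutes + secs


# "real", "user" and "sys" values from the first and third runs above.
cgal = {"real": seconds(17, 18.075), "user": seconds(17, 12.266), "sys": 4.544}
manifold = {"real": seconds(0, 38.276), "user": seconds(13, 15.744), "sys": 5.723}

speedup = cgal["real"] / manifold["real"]
cores = {
    name: (run["user"] + run["sys"]) / run["real"]
    for name, run in (("cgal", cgal), ("manifold", manifold))
}

print(f"wall-clock speedup: {speedup:.1f}x")
print(f"average busy cores: cgal {cores['cgal']:.1f}, "
      f"manifold {cores['manifold']:.1f}")
```

This gives roughly a 27x wall-clock speedup with about 21 cores busy on average. Total CPU time only drops from about 17 to about 13 minutes, which suggests that on this model much of the gain comes from spreading the work across cores.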

image

@elalish
Copy link

elalish commented Mar 20, 2023

It's so exciting to see this @ochafik! OpenSCAD is what got me into both 3D design and computational geometry; I started Manifold in hopes of contributing back, and it's amazing to see it actually happen. Big thank you to everyone in the community!

@butcherg

Compiled master on a 12-core machine, I've been playing with F6 all day on models that previously took upwards of .5hr to render - cool beans!

Just wondering, what's needed to enable CUDA?

@jordanbrown0
Contributor

I don't remember the details, but I believe I looked at how ASCII STLs are generated and found that they were just horrible in terms of buffering (or lack thereof).

@jordanbrown0
Contributor

#4316 has the details from my previous investigation of ASCII STL performance.

@ochafik
Contributor Author

ochafik commented Mar 21, 2023

@mconsidine thanks for the follow up!

re/ STL export speed, let's move the discussion to #4316 (@jordanbrown0 thanks for the pointer!), my initial parallel experiment doesn't help much but switching to indexed meshes does. In the meantime note that --export-format=binstl should help a lot compared to the text STL output.
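
For context on why the binary format tends to be cheaper to write than text, the Python sketch below encodes triangles in the binary STL layout: an 80-byte header, a little-endian 32-bit triangle count, then a fixed 50 bytes per triangle (normal, three vertices, a 16-bit attribute field). It is a stand-alone illustration, not OpenSCAD's exporter, and `binary_stl` is a hypothetical helper.

```python
import struct


def binary_stl(triangles):
    """Encode [((x, y, z), (x, y, z), (x, y, z)), ...] as binary STL bytes."""
    out = bytearray(b"sketch".ljust(80, b"\0"))  # 80-byte free-form header
    out += struct.pack("<I", len(triangles))     # little-endian triangle count
    for a, b, c in triangles:
        # Normal written as zero here; a real exporter would compute it.
        out += struct.pack("<12fH", 0.0, 0.0, 0.0, *a, *b, *c, 0)
    return bytes(out)


tri = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))]
data = binary_stl(tri)
print(len(data))  # 80 + 4 + 50 bytes for a single triangle
```

Each facet is one fixed-size `struct.pack` call rather than several formatted text lines, which is where much of the saving over ASCII output would be expected to come from.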

@butcherg Glad this helps! Re/ CUDA, we'd need someone with access to a CUDA-enabled machine to play with the build flags and check it all works. And find out how to setup the CI / release environments to probably build CUDA-specific binary variants.

@pca006132
Member

Just a note: As our execution policy is not yet optimized (elalish/manifold#380), CUDA may be slower than CPU only execution. But this should hopefully be fixed soon.

@ochafik
Contributor Author

ochafik commented Mar 21, 2023

@t-paul #4571 mmmmight work 🤞

@ochafik
Contributor Author

ochafik commented Mar 21, 2023

@pca006132 good to know, thanks! Also, noticed Manifold::Transform seems ~8x slower than OpenSCAD's PolySet::transform (which uses Eigen transforms); haven't fully investigated but wondering if TBB has too much overhead maybe? (even without Eigen's SIMD optimizations, I'd expect to match its speed when throwing a dozen cores at it). I tried batching the thrust::transform calls in Impl::Transform, to no avail.

@pca006132
Member

Yeah TBB probably has quite a bit of overhead. And maybe we should move that discussion to elalish/manifold#380.

@t-paul
Member

t-paul commented Mar 21, 2023

It would be nice to get some CUDA results for comparison, but I'm not sure this is going to be in an official release.

As far as I understand this is still tied to Nvidia only, and I'm not convinced we can maintain a whole GPU specific set of builds which is difficult to test. Maybe there's a chance for vendor independent solution at some point.

@butcherg

@ochafik, I have a GeForce GTX 1660 Ti GPU, and a compiled OpenSCAD from the master branch; I'm willing to do some testing.

@t-paul, fully agree with that, "vendor independence". I'm just starting my dig into current GPU software architectures, particularly to support an image processing program that uses Vulkan, not yet deep enough to know what functions equivalent to CUDA it provides. Maybe you can short-circuit that: would Vulkan be a viable alternative?

@elalish

elalish commented Mar 21, 2023

I would guess a better solution would be for us over at Manifold to switch from using Thrust to C++17 parallel algorithms, which is the standard the Thrust library evolved into. It would be good to know from a user's perspective how seamless that would make it to target the various parallel architectures out there. Certainly it would be nice to remove our related build flags and let the compilers take care of it for us.

@pca006132
Member

I think only nvc++ currently supports stdpar, and I don't see any news from AMD, clang or gcc. I don't think stdpar will help us enable GPU accelerated processing on other GPUs anytime soon. Also, from a compiler perspective, automatically parallelizing code to the GPU is an extremely hard problem: it depends on access pattern, size of the workload, and usually requires special algorithm or data structure design. At least I don't think C++ is a good language for this, so I doubt stdpar will be that useful if we already have code designed around the GPU.

I think compute shader is probably the way to go for accelerating some of the most time consuming parts and remains somewhat vendor neutral. We can incrementally switch over to compute shader, I think sorting, copying are probably the most time consuming parts. An alternative way would be to use google/highway for SIMD sorting, but their code currently does not support sort by key, and I am too lazy to try to adapt their algorithm for it...

Due to the nature of our workload, we still have to tune various thresholds for selecting single core, multicore or GPU accelerated computation, so we will probably still need to test multiple different configurations to deliver the best performance. And I don't think this is bad either: we can make some profiling code for users to run if they want to improve the performance, similar to building with -march=native, and potentially submit a PR to us if the default is very bad for certain configuration.

@ochafik
Contributor Author

ochafik commented Mar 21, 2023

@butcherg maybe just try adding -DMANIFOLD_USE_CUDA=1 to CMake's command line args (for the OpenSCAD build) as per Manifold's build instructions (Manifold is added as a submodule of OpenSCAD). In fact, I'd try building and running Manifold's tests on your CUDA hardware to make sure it all works. Then you could source benchmark scripts from here or here

@pca006132 @elalish Vulkan feels like the best standard practical way to harness both CPU and GPU. Also, I reckon it could definitely attract more OSS contributors (well, speaking for myself, that is; no plans to buy any NVIDIA hardware, and I used to be very much into OpenCL 😅). That said, I'm now super excited again about google/highway, might give it a deeper look.

@elalish

elalish commented Mar 21, 2023

Thanks, that's good feedback, @ochafik and @pca006132. I haven't looked into Vulkan much, but it does seem like a reasonable alternative. And then there's WebGPU, which is coming rapidly - I wonder if there's any inkling of a way to cross compile one to the other?

@mconsidine

mconsidine commented Mar 21, 2023 via email

@ochafik
Contributor Author

ochafik commented Mar 22, 2023

@mconsidine manifold is a git submodule and will be pulled for you if you follow the general OpenSCAD build instructions.

@butcherg

Yeah, you can have CUDA. I just finished restoring my NVIDIA driver configuration after my attempt to install the CUDA toolkit borked it all up. Probably my fault, but I'm just not into troubleshooting it, and I need to keep my GPU configuration clean for the Vulkan-based raw processing software to which I'm contributing. That, and its NVIDIA-dependency is not appealing to the open-architecture fool I am...

Manifold has already provided a multiple order-of-magnitude improvement to OpenSCAD's mesh rendering, to the point of up-showing preview rendering in some cases. Thanks, @elalish and @ochafik, keeps OpenSCAD relevant for my model-building.

https://glenn.pulpitrock.net/blog/

@mconsidine

mconsidine commented Mar 22, 2023

@ochafik Okay, it looks like (against all odds) I've got the CUDA toolkit installed. A separate download of the latest manifold source compiled, with a few tests failing out of 135 - as well as it seemingly getting hung up on another. Regardless of that, are there any flags you would recommend for a compile beyond this:
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -DMANIFOLD_USE_CUDA=ON -DMANIFOLD_PAR=TBB -DMANIFOLD_EXPORT=ON -DBUILD_TEST_CGAL=ON -DEXPERIMENTAL=ON ..

EDIT: Okay, I think I'm with @butcherg on this. CUDA-related compile errors cropped up in a couple of places, one seemingly being between manifold and CGAL 5.6 (latest master). Going back down to 5.5.3 seemed to deal with that, but then more CUDA issues cropped up. And this is just not a rabbit hole I need to live in. So I'm sticking with a non-CUDA version for the time being.

Given the size of the model I'm working with, there would be far more to be gained for me if Preview mode could be sped up, as I note that it is only using one core at a time.

Development

Successfully merging this pull request may close these issues.

Multi-threaded Geometry rendering [$1,275]
9 participants