Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
cuda.core
Is your feature request related to a problem? Please describe.
As mentioned in a previous issue, equivalent operations using CuPy can be significantly faster. In this issue, I am requesting that the initialization of cuda.core abstractions have less overhead. Specifically, the initialization of the Device, Stream, and Event abstractions is slower than that of their CuPy counterparts.
>>> import timeit
>>> timeit.timeit('cp.cuda.Device()', setup='import cupy as cp')
0.06881106700166129
>>> timeit.timeit('device = ccx.Device()', setup='import cuda.core.experimental as ccx')
0.5686513699911302
>>> timeit.timeit('cp.cuda.Stream()', setup='import cupy as cp')
1.0035127629962517
>>> timeit.timeit('device.create_stream()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
5.299269804003416
>>> timeit.timeit('cp.cuda.Event()', setup='import cupy as cp')
0.393913417996373
>>> timeit.timeit('device.create_event()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
3.1525100879953243
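For scale: `timeit.timeit` runs the statement 1,000,000 times by default, so each total above, in seconds, is numerically the mean per-call cost in microseconds. The following is a small sketch that only post-processes the numbers quoted above (no GPU required) to show the relative slowdown, which comes out to roughly 8x for Device, 5x for Stream, and 8x for Event. The `timings` dictionary and the `slowdown` helper are illustrative names, not part of cuda.core or CuPy.

```python
# Totals in seconds, copied from the timeit output above as (CuPy, cuda.core).
# timeit.timeit defaults to number=1_000_000, so a total in seconds is
# numerically equal to the mean cost per call in microseconds.
timings = {
    "Device": (0.06881106700166129, 0.5686513699911302),
    "Stream": (1.0035127629962517, 5.299269804003416),
    "Event": (0.393913417996373, 3.1525100879953243),
}


def slowdown(cupy_s, core_s):
    """Return how many times slower the cuda.core call is than the CuPy call."""
    return core_s / cupy_s


ratios = {name: slowdown(*pair) for name, pair in timings.items()}

for name, (cupy_s, core_s) in timings.items():
    print(
        f"{name}: CuPy {cupy_s:.2f} us/call, "
        f"cuda.core {core_s:.2f} us/call, "
        f"{ratios[name]:.1f}x slower"
    )
```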
This has caused noticeable performance regressions in nvmath-python when transitioning from cupy.cuda to cuda.core in our benchmarks for small and medium-sized arrays, or on faster devices, where Python overhead is more significant.
Specifically, we currently use event recording frequently in order to autoselect algorithms/plans and to wait for computation to complete before returning to the user (for host APIs), and we frequently use the Device constructor to check the current device.
Describe the solution you'd like
These init functions should be just as fast as, or faster than, CuPy's abstractions.
Describe alternatives you've considered
- Refactoring our internal implementation to call Device() less often by passing around one Device and being careful about context switching.
- Using less event recording? Trying to reuse the same two Events. I don't think this is feasible.
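To make the second alternative concrete, here is a minimal sketch of what object reuse could look like on our side. `ReusePool` and `fake_event` are hypothetical names for illustration only; in real use the factory would presumably be something like `device.create_event`, and the sketch assumes (without verifying) that a released event can safely be recorded again, which is exactly the part I doubt is feasible.

```python
from collections import deque


class ReusePool:
    """Hand out previously released objects instead of constructing new ones.

    Hypothetical helper: `factory` is any zero-argument callable that creates
    a new object (for example, something like device.create_event).
    """

    def __init__(self, factory):
        self._factory = factory
        self._free = deque()

    def acquire(self):
        # Reuse a released object if one is available; otherwise pay the
        # construction cost once.
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        # The caller must guarantee nothing is still waiting on `obj`.
        self._free.append(obj)


# Stand-in factory so the sketch runs without a GPU.
created = []


def fake_event():
    created.append(object())
    return created[-1]


pool = ReusePool(fake_event)
first = pool.acquire()
pool.release(first)
second = pool.acquire()  # Reuses `first`; no new construction happens.
```

This only moves the bookkeeping burden onto the caller, who has to track when each event is no longer in use.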
Additional context
It doesn't seem like the best long-term solution for nvmath-python to try to work around these issues.