
[FEA]: Faster initialization time for cuda.core abstractions #658

Closed
@carterbox

Description

Is this a duplicate?

Area

cuda.core

Is your feature request related to a problem? Please describe.

As mentioned in a previous issue, equivalent operations using CuPy can be significantly faster. In this issue, I am requesting that the initialization of cuda.core abstractions have less overhead. Specifically, initialization of the Device, Stream, and Event abstractions is slower than that of their CuPy counterparts.

>>> timeit.timeit('cp.cuda.Device()', setup='import cupy as cp')
0.06881106700166129
>>> timeit.timeit('device = ccx.Device()', setup='import cuda.core.experimental as ccx')
0.5686513699911302
>>> timeit.timeit('cp.cuda.Stream()', setup='import cupy as cp')
1.0035127629962517
>>> timeit.timeit('device.create_stream()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
5.299269804003416
>>> timeit.timeit('cp.cuda.Event()', setup='import cupy as cp')
0.393913417996373
>>> timeit.timeit('device.create_event()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
3.1525100879953243
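
For reference, the gaps above can be reproduced with a small script along these lines (a sketch only; the iteration count and the exact cuda.core.experimental calls mirror the snippets above and may need adjusting for your environment):

import timeit

# Sketch of the micro-benchmark quoted above; absolute numbers vary by GPU,
# driver, and CPU, so only the cupy vs. cuda.core ratio is meaningful.
N = 1_000_000  # timeit's default iteration count

cases = {
    "Device": ("cp.cuda.Device()", "ccx.Device()"),
    "Stream": ("cp.cuda.Stream()", "device.create_stream()"),
    "Event": ("cp.cuda.Event()", "device.create_event()"),
}

cupy_setup = "import cupy as cp"
core_setup = ("import cuda.core.experimental as ccx; "
              "device = ccx.Device(); device.set_current()")

for name, (cupy_stmt, core_stmt) in cases.items():
    t_cupy = timeit.timeit(cupy_stmt, setup=cupy_setup, number=N)
    t_core = timeit.timeit(core_stmt, setup=core_setup, number=N)
    print(f"{name}: cupy={t_cupy:.3f}s  cuda.core={t_core:.3f}s  "
          f"({t_core / t_cupy:.1f}x slower)")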

This has caused noticeable performance regressions in nvmath-python benchmarks when transitioning from cupy.cuda to cuda.core, particularly for small/medium-size arrays or on faster devices, where Python overhead is more significant.

Specifically, we currently use event recording frequently in order to autoselect algorithms/plans and to wait for computation to complete before returning to the user (for host APIs), and we frequently use the Device constructor to check the current device.
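
To make the usage pattern concrete, here is a rough sketch of what such a host API does (names are illustrative, not the actual nvmath-python code; it assumes Stream.record() and Event.sync() from cuda.core.experimental):

import cuda.core.experimental as ccx

def host_api():
    # Device() is called on every entry just to identify the current device,
    # so its constructor cost lands directly on the hot path.
    device = ccx.Device()
    device.set_current()
    stream = device.create_stream()

    # ... enqueue kernels / library calls on `stream` here ...

    # A fresh Event per call is recorded so the host API can block until the
    # enqueued work has finished before returning control to the user.
    done = stream.record()
    done.sync()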

Describe the solution you'd like

These init functions should be just as fast as, or faster than, CuPy's abstractions.

Describe alternatives you've considered

  • Refactoring our internal implementation to call Device() less often by passing around a single Device and being careful about context switching (see the sketch after this list).
  • Using less event recording, e.g. trying to reuse the same two Events; I don't think this is feasible.
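
A rough sketch of the first alternative, assuming a hypothetical cached_device() helper (not an existing nvmath-python or cuda.core API):

import functools

import cuda.core.experimental as ccx

@functools.lru_cache(maxsize=None)
def cached_device(device_id: int = 0) -> ccx.Device:
    # Construct the Device abstraction once per device id and reuse it,
    # instead of paying the constructor overhead on every hot-path query.
    return ccx.Device(device_id)

def some_host_api():
    device = cached_device(0)  # cheap after the first call
    device.set_current()       # still need to be careful about context switching
    # ... launch work, record events, etc. ...

Even with such caching, the Stream and Event constructors remain on the hot path, which is part of why working around this in nvmath-python seems less attractive than fixing the overhead in cuda.core itself.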

Additional context

It doesn't seem like the best long-term solution for nvmath-python to try to work around these issues.

Metadata

Labels

P0 (High priority - Must do!), cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements)
