[FEA]: Faster initialization time for `cuda.core` abstractions

### Is this a duplicate?

- [x] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cuda-python/issues) for this request and that I agree to the [Code of Conduct](CODE_OF_CONDUCT.md)

### Area

cuda.core

### Is your feature request related to a problem? Please describe.

As mentioned [in a previous issue](https://github.com/NVIDIA/cuda-python/issues/439), equivalent operations using CuPy can be significantly faster. In this issue, I am requesting that the initialization of `cuda.core` abstractions have less overhead. Specifically, the initialization of the `Device`, `Stream`, and `Event` abstractions is slower than that of their CuPy counterparts. Each timing below is the total in seconds for `timeit`'s default of 1,000,000 runs.

```
>>> timeit.timeit('cp.cuda.Device()', setup='import cupy as cp')
0.06881106700166129
>>> timeit.timeit('device = ccx.Device()', setup='import cuda.core.experimental as ccx')
0.5686513699911302
```

```
>>> timeit.timeit('cp.cuda.Stream()', setup='import cupy as cp')
1.0035127629962517
>>> timeit.timeit('device.create_stream()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
5.299269804003416
```

```
>>> timeit.timeit('cp.cuda.Event()', setup='import cupy as cp')
0.393913417996373
>>> timeit.timeit('device.create_event()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
3.1525100879953243
```
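Because `timeit.timeit` runs the statement 1,000,000 times by default, the totals above read directly as microseconds per call. For anyone who wants to reproduce the comparison, here is a rough sketch of a helper that reports per-call cost and takes the best of several repeats. The names `per_call_us`, `CASES`, and `run_gpu_cases` are made up for this example, and the GPU cases are skipped if the packages or a device are not available.

```python
import timeit


def per_call_us(stmt, setup="pass", number=100_000, repeat=5):
    """Best-of-`repeat` cost of a single execution of `stmt`, in microseconds."""
    timer = timeit.Timer(stmt, setup=setup)
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number * 1e6


# Statements and setups copied from the measurements in this issue.
_CCX_SETUP = (
    "import cuda.core.experimental as ccx; "
    "device = ccx.Device(); device.set_current()"
)
CASES = {
    "cupy Device": ("cp.cuda.Device()", "import cupy as cp"),
    "cuda.core Device": ("ccx.Device()", "import cuda.core.experimental as ccx"),
    "cupy Stream": ("cp.cuda.Stream()", "import cupy as cp"),
    "cuda.core Stream": ("device.create_stream()", _CCX_SETUP),
    "cupy Event": ("cp.cuda.Event()", "import cupy as cp"),
    "cuda.core Event": ("device.create_event()", _CCX_SETUP),
}


def run_gpu_cases(number=10_000, repeat=3):
    """Time every case; a case maps to None if it cannot run on this machine."""
    results = {}
    for name, (stmt, setup) in CASES.items():
        try:
            results[name] = per_call_us(stmt, setup, number=number, repeat=repeat)
        except Exception:  # missing package, no GPU, driver error, ...
            results[name] = None
    return results


if __name__ == "__main__":
    for name, cost in run_gpu_cases().items():
        print(name, "skipped" if cost is None else f"{cost:.3f} us/call")
```

Taking the minimum over repeats reduces noise from other processes, which matters when the quantities being compared are around a microsecond.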

This has caused noticeable performance regressions in our `nvmath-python` benchmarks when transitioning from `cupy.cuda` to `cuda.core`, particularly for small and medium-sized arrays or on faster devices, where Python overhead is more significant.

Specifically, we currently use event recording frequently in order to autoselect algorithms/plans and to wait for computation to complete before returning to the user (for host APIs), and we frequently use the Device constructor to check the current device.

### Describe the solution you'd like

These init functions should be just as fast as or faster than CuPy's abstractions.

### Describe alternatives you've considered

- Refactoring our internal implementation to call `Device()` less often by passing around one `Device` and being careful about context switching.
- Recording events less often, for example by trying to reuse the same two `Event`s. I don't think this is feasible.
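For illustration only, the reuse idea could take the shape of a small free-list pool that pays the construction cost once per object and hands out previously released objects afterwards. `ObjectPool` is a hypothetical helper, not part of `cuda.core` or `nvmath-python`. The sketch assumes that the factory would be something like `device.create_event` and that an event is released only once nothing still depends on its previous recording. It is also not thread-safe.

```python
from collections import deque


class ObjectPool:
    """Hypothetical free-list pool: reuse released objects instead of creating new ones."""

    def __init__(self, factory):
        self._factory = factory  # zero-argument callable, e.g. device.create_event (assumed)
        self._free = deque()     # objects that have been released and can be reused
        self.created = 0         # number of times the factory was actually called

    def acquire(self):
        # Prefer a previously released object; fall back to the factory.
        if self._free:
            return self._free.pop()
        self.created += 1
        return self._factory()

    def release(self, obj):
        # The caller must ensure nothing still waits on or queries `obj`.
        self._free.append(obj)


if __name__ == "__main__":
    # Stand-in factory so the sketch runs without a GPU.
    pool = ObjectPool(object)
    first = pool.acquire()
    pool.release(first)
    second = pool.acquire()
    print(first is second, pool.created)
```

This only moves the cost around, and tracking when an event is safe to release adds its own complexity, which is part of why a workaround on the caller's side is unattractive.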

### Additional context

It doesn't seem like the best long-term solution for `nvmath-python` to try to work around these issues.
