
Querying current device is slow compared to CuPy #439

Closed
@shwina

Description


Getting the current device with cuda.core is quite a bit slower than with CuPy:

In [1]: import cupy as cp

In [2]: %timeit cp.cuda.Device()
69 ns ± 0.496 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: from cuda.core.experimental import Device

In [4]: %timeit Device()
795 ns ± 0.273 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Ultimately, my goal is to get the compute capability of the current device, and this is even slower:

In [5]: %timeit cp.cuda.Device().compute_capability
89.1 ns ± 0.413 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: %timeit Device().compute_capability
2.64 μs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Are there tricks (e.g., caching) that CuPy is employing here which cuda.core could use as well? Alternatively, is there another way for me to use cuda.core or cuda.bindings to get this information quickly? Note that for my use case, I'm not super concerned about the first call to Device(), but I do want subsequent calls to be trivially inexpensive if the current device hasn't changed.


Using the low-level cuda.bindings directly is also not as fast as CuPy:

In [10]: from cuda.bindings import driver, runtime

In [11]: def get_cc():
    ...:     dev = runtime.cudaGetDevice()[1]
    ...:     return driver.cuDeviceComputeCapability(dev)
    ...:
In [12]: get_cc()
Out[12]: (<CUresult.CUDA_SUCCESS: 0>, 7, 5)

In [13]: %timeit get_cc()
597 ns ± 0.494 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
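For the stated goal (subsequent calls being trivially cheap while the device is unchanged), one option is to memoize the per-device driver query and re-fetch only the current-device ordinal on each call. Below is a minimal sketch of that pattern; the `make_cached_cc` helper and the stand-in query function are hypothetical illustrations (so the sketch runs without a GPU), not part of cuda.core. With the real bindings, the two arguments would be wired to the runtime/driver calls from the snippet above, e.g. `lambda: runtime.cudaGetDevice()[1]` and `lambda dev: driver.cuDeviceComputeCapability(dev)[1:]`.

```python
import functools

def make_cached_cc(get_device, query_cc):
    """Build a getter for the current device's compute capability.

    get_device is called on every lookup, so device switches are still
    observed; query_cc is called at most once per device ordinal thanks
    to the cache, making repeat calls a cheap dict lookup.
    """
    cached_query = functools.lru_cache(maxsize=None)(query_cc)

    def current_cc():
        return cached_query(get_device())

    return current_cc

# Stand-ins so the sketch runs anywhere: device 0 is always current,
# and the "driver query" records how often it is actually invoked.
calls = []

def fake_query(dev):
    calls.append(dev)
    return (7, 5)

get_cc_cached = make_cached_cc(lambda: 0, fake_query)
```

After the first call per device ordinal, the remaining cost is one `get_device()` call plus an `lru_cache` hit, which is the kind of "trivially inexpensive while the device hasn't changed" behavior described above.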

Labels

P1: Medium priority - Should do
cuda.bindings: Everything related to the cuda.bindings module
enhancement: Any code-related improvements
