
Querying current device is slow compared to CuPy #439

Open
shwina opened this issue Feb 7, 2025 · 4 comments
Labels: cuda.bindings (Everything related to the cuda.bindings module), enhancement (Any code-related improvements), P1 (Medium priority - Should do)

shwina (Contributor) commented Feb 7, 2025

Getting the current device using cuda.core is quite a bit slower than CuPy:

In [1]: import cupy as cp

In [2]: %timeit cp.cuda.Device()
69 ns ± 0.496 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: from cuda.core.experimental import Device

In [4]: %timeit Device()
795 ns ± 0.273 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Ultimately, my goal is to get the compute capability of the current device, and this is even slower:

In [5]: %timeit cp.cuda.Device().compute_capability
89.1 ns ± 0.413 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: %timeit Device().compute_capability
2.64 μs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Are there tricks (e.g., caching) CuPy is employing here that cuda.core can use as well? Alternatively, is there another way for me to use cuda.core or cuda.bindings to get this information quickly? Note that for my use case, I'm not super concerned about the first call to Device(), but I do want subsequent calls to be trivially inexpensive if the current device hasn't changed.

Using the low-level cuda.bindings directly is also not as fast as CuPy:

In [11]: def get_cc():
    ...:     dev = runtime.cudaGetDevice()[1]
    ...:     return driver.cuDeviceComputeCapability(dev)
    ...:

In [12]: get_cc()
Out[12]: (<CUresult.CUDA_SUCCESS: 0>, 7, 5)

In [13]: %timeit get_cc()
597 ns ± 0.494 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
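
For what it's worth, a memoized helper along these lines brings repeated lookups down to dictionary-hit cost. This is only a minimal sketch, assuming the from cuda.bindings import driver, runtime spelling (older releases expose the same functions as from cuda import cuda, cudart), with error handling reduced to an assert:

# Minimal caching sketch (not part of the original report): the compute
# capability of a given device ordinal never changes, so it can be memoized.
import functools

from cuda.bindings import driver, runtime  # assumed import spelling


@functools.lru_cache(maxsize=None)
def _compute_capability(dev: int) -> tuple[int, int]:
    err, major, minor = driver.cuDeviceComputeCapability(dev)
    assert err == driver.CUresult.CUDA_SUCCESS
    return major, minor


def get_cc() -> tuple[int, int]:
    # cudaGetDevice still runs on every call (the current device can change),
    # but the attribute query is served from the cache after the first hit.
    _, dev = runtime.cudaGetDevice()
    return _compute_capability(dev)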
github-actions bot added the triage (Needs the team's attention) label on Feb 7, 2025
leofang added the cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), and P0 (High priority - Must do!) labels and removed the triage (Needs the team's attention) label on Feb 7, 2025
leofang (Member) commented Feb 14, 2025

We'll have to cache the compute capability (CC) at the per-Device-object level to bring this down to the O(10) ns range.

In [32]: def get_cc(dev):
    ...:    if dev in data:
    ...:        return data[dev]
    ...:    data[dev] = (driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev)[1],
    ...:                 driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev)[1])
    ...:    return data[dev]
    ...: 

In [33]: get_cc(1)
Out[33]: (12, 0)

In [36]: %timeit get_cc(1)
51.7 ns ± 0.0214 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [37]: %timeit cp.cuda.Device().compute_capability
179 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

which is also what CuPy does internally:
https://github.com/cupy/cupy/blob/1f9c9d4d1eb2edcbeb2a9294def57c2252e18b92/cupy/cuda/device.pyx#L213-L214
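
For completeness, here is a self-contained, per-object variant of the same idea, with the cache attached to a hypothetical Device-like wrapper via functools.cached_property instead of a module-level dict. This is only an illustrative sketch, not cuda.core's actual implementation:

# Illustrative per-object caching sketch; the _Device class is hypothetical.
import functools

from cuda.bindings import driver  # assumed import spelling


class _Device:
    def __init__(self, device_id: int):
        self._id = device_id

    @functools.cached_property
    def compute_capability(self) -> tuple[int, int]:
        # Queried once per object; later accesses read a plain instance attribute.
        attr = driver.CUdevice_attribute
        major = driver.cuDeviceGetAttribute(
            attr.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, self._id)[1]
        minor = driver.cuDeviceGetAttribute(
            attr.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, self._id)[1]
        return (major, minor)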

leofang self-assigned this on Feb 14, 2025
leofang added this to the cuda.core beta 3 milestone on Feb 14, 2025
leofang (Member) commented Feb 21, 2025

I did some refactoring of Device.__new__() to replace cudart APIs with driver APIs, and found that performance got even worse. Out of curiosity, I did this quick profiling and was very surprised (in the following, the primary context is already set to current):

In [19]: %timeit runtime.cudaGetDevice()
338 ns ± 0.463 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [20]: %timeit driver.cuCtxGetDevice()
406 ns ± 1.79 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [21]: %timeit cp.cuda.runtime.getDevice()
112 ns ± 0.822 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

A simple get-device call using cuda.bindings is 3-4x slower than CuPy's. @vzhurba01, have we seen something like this before?
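
To reproduce that comparison outside IPython, a stand-alone timing script along these lines should show the same gap (only a sketch; it assumes both cuda.bindings and CuPy are installed, and the cp.arange call is there solely to make the primary context current before timing):

# Stand-alone reproduction sketch of the timings above (not from the original
# comment); assumes cuda.bindings and CuPy are both installed.
import timeit

import cupy as cp
from cuda.bindings import driver, runtime  # assumed import spelling

cp.arange(1)  # force CuPy to initialize and make the primary context current

for label, fn in [
    ("runtime.cudaGetDevice", runtime.cudaGetDevice),
    ("driver.cuCtxGetDevice", driver.cuCtxGetDevice),
    ("cupy runtime.getDevice", cp.cuda.runtime.getDevice),
]:
    total = timeit.timeit(fn, number=1_000_000)
    print(f"{label}: {total * 1e3:.0f} ns per call")  # total seconds -> ns/call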

leofang (Member) commented Feb 24, 2025

Accessing Device().compute_capability is being addressed in #459. Let me re-label this issue to track the remaining cuda.bindings performance issue.

leofang added the cuda.bindings (Everything related to the cuda.bindings module) and P1 (Medium priority - Should do) labels and removed the cuda.core (Everything related to the cuda.core module) and P0 (High priority - Must do!) labels on Feb 24, 2025
leofang (Member) commented Feb 27, 2025

@rwgk reported that cuDriverGetVersion is also sluggish when called repeatedly in a busy loop.
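
Since the installed driver version cannot change within a process, callers that hit this in a hot loop can query it once and reuse the value. A minimal workaround sketch, again assuming the cuda.bindings import spelling, rather than a fix for the binding itself:

# Workaround sketch, not a bindings fix: cache cuDriverGetVersion for the
# lifetime of the process, since the answer never changes.
import functools

from cuda.bindings import driver  # assumed import spelling


@functools.cache
def driver_version() -> int:
    err, version = driver.cuDriverGetVersion()
    assert err == driver.CUresult.CUDA_SUCCESS
    return version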
