Querying current device is slow compared to CuPy #439
Comments
We'll have to cache CC on a per-

In [32]: def get_cc(dev):
    ...:     if dev in data:
    ...:         return data[dev]
    ...:     data[dev] = (driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev)[1],
    ...:                  driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev)[1])
    ...:     return data[dev]
    ...:

In [33]: get_cc(1)
Out[33]: (12, 0)

In [36]: %timeit get_cc(1)
51.7 ns ± 0.0214 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [37]: %timeit cp.cuda.Device().compute_capability
179 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

which is also what CuPy does internally:
I did some refactoring of

In [19]: %timeit runtime.cudaGetDevice()
338 ns ± 0.463 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [20]: %timeit driver.cuCtxGetDevice()
406 ns ± 1.79 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [21]: %timeit cp.cuda.runtime.getDevice()
112 ns ± 0.822 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

A simple get-device call using
Accessing
@rwgk reported that
Getting the current device using cuda.core is quite a bit slower than CuPy:

Ultimately, my goal is to get the compute capability of the current device, and this is even slower:

Are there tricks (e.g., caching) CuPy is employing here that cuda.core can use as well? Alternately, is there another way for me to use cuda.core or cuda.bindings to get this information quickly? Note that for my use case, I'm not super concerned about the first call to Device(), but I do want subsequent calls to be trivially inexpensive if the current device hasn't changed.

Using the low-level cuda.bindings is also not quite as fast:
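
One way to read the caching question above is to memoize the compute capability per device ordinal, so that only the first lookup for a device pays for the attribute queries. The following is a minimal sketch of that idea, not what cuda.core or CuPy actually implement. `make_cached_cc`, `query_cc`, and `fake_query_cc` are hypothetical names introduced for illustration. The fake query stands in for the two `cuDeviceGetAttribute` calls so the example runs without a GPU, and its return values are made up.

```python
from typing import Callable, Dict, Tuple

ComputeCapability = Tuple[int, int]


def make_cached_cc(
    query_cc: Callable[[int], ComputeCapability],
) -> Callable[[int], ComputeCapability]:
    """Wrap a slow per-device query in a dict cache keyed by device ordinal."""
    cache: Dict[int, ComputeCapability] = {}

    def get_cc(dev: int) -> ComputeCapability:
        try:
            # Fast path: a single dict lookup once the device has been seen.
            return cache[dev]
        except KeyError:
            # Slow path: query once, remember the answer for later calls.
            cc = cache[dev] = query_cc(dev)
            return cc

    return get_cc


# Hypothetical stand-in for the real driver query (assumption: in practice
# this would issue the MAJOR and MINOR attribute queries for `dev`).
calls = []


def fake_query_cc(dev: int) -> ComputeCapability:
    calls.append(dev)  # record each time the "slow" query actually runs
    return (9, 0) if dev == 0 else (12, 0)  # made-up values for illustration


get_cc = make_cached_cc(fake_query_cc)
first = get_cc(1)   # triggers the underlying query
second = get_cc(1)  # served from the cache
```

This keeps the cached path to one dictionary lookup, but it assumes a device's compute capability never changes during the process lifetime. It does not remove the cost of finding out which device is current, and it is not thread-aware.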