[Refactor] cuda.parallel: Simplify TransformIterator implementation and refactor iterators to derive from a common base #3118
Conversation
@@ -92,5 +80,4 @@ struct cccl_iterator_t
   cccl_op_t dereference;
   cccl_type_info value_type;
   void* state;
-  cccl_string_views* ltoirs = nullptr;
Because we compile the op for TransformIterator on the Python side, we don't need to pass the LTOIR here. Generally speaking, any device functions used by the advance and dereference methods can be compiled purely on the Python side.
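The division of labor described above can be sketched in plain Python. Everything here is a hypothetical mock, not the real cuda.parallel API: compile_to_ltoir stands in for a numba-based device compilation, and the class only illustrates that the Python side produces all LTOIR, so just state and symbol names would need to cross into the C layer.

```python
def compile_to_ltoir(func, abi_name):
    # Stand-in for a numba device compilation returning LTOIR bytes.
    return f"<ltoir for {abi_name}>".encode()


class TransformIteratorSketch:
    """Toy model: all device code is compiled Python-side; only state
    and symbol names would be handed to the C layer."""

    def __init__(self, op, value_type):
        advance_name = f"advance_{value_type}"
        deref_name = f"deref_{op.__name__}_{value_type}"
        self.ltoirs = {
            advance_name: compile_to_ltoir(self.advance, advance_name),
            deref_name: compile_to_ltoir(op, deref_name),
        }

    @staticmethod
    def advance(state, distance):
        return state + distance


def double(x):
    return x * 2


it = TransformIteratorSketch(double, "int32")
assert sorted(it.ltoirs) == ["advance_int32", "deref_double_int32"]
```

The point of the sketch is only the shape of the interface: because the op is compiled before the C `*_build` step runs, the `cccl_iterator_t` struct no longer needs an `ltoirs` field.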
/ok to test
advance_ltoir, deref_ltoir = it.ltoirs
advance_op = _CCCLOp(
    _CCCLOpKindEnum.STATELESS,
    type(it).advance.__name__.encode("utf-8"),
Question: what are the requirements for the op's name here? Do the names need to be globally unique? Unique within the context of a single NVRTC compilation? Something else?
xref: #3118 (comment)
/ok to test
1 similar comment
/ok to test
Force-pushed from 7ac794c to 152345c (Compare)
/ok to test
@@ -13,7 +13,6 @@
 import numba.cuda
 import numba.types
 import cuda.parallel.experimental as cudax
-from cuda.parallel.experimental import _iterators
Note: I removed all usages of the private module from the tests as discussed in #2788 (comment). As a follow-up, do we want an additional test for pointer handling in TransformIterator?
Force-pushed from 152345c to dd84b01 (Compare)
🟩 CI finished in 37m 05s: Pass: 100%/3 | Total: 35m 35s | Avg: 11m 51s | Max: 26m 35s
Runner counts (total jobs: 3): 2 × linux-amd64-gpu-v100-latest-1, 1 × linux-amd64-cpu16
it_advance = numba.cuda.jit(type(it).advance, device=True)
it_dereference = numba.cuda.jit(type(it).dereference, device=True)
op = numba.cuda.jit(op, device=True)
This doesn't actually do a JIT compilation (there isn't any type information for compilation anyway). It makes it legal to call it_advance, it_dereference, and op from other numba-compiled device functions.
remark: this approach is so much better!
/ok to test
🟩 CI finished in 26m 40s: Pass: 100%/3 | Total: 35m 31s | Avg: 11m 50s | Max: 26m 40s
🟩 CI finished in 27m 38s: Pass: 100%/3 | Total: 36m 30s | Avg: 12m 10s | Max: 27m 08s
if it_ntype_ir is None:
    raise RuntimeError(f"Unsupported: {type(it.ntype)=}")

op_abi_name = f"{op.__name__}_{it.ntype.name}"
critical: I think we need a unique name per jit compilation (invocation of the Parallel C *_build step). Some algorithms, say reduce-by-key, have multiple iterators. If you try passing a transform of int and a transform of float, compiling the member functions of the corresponding iterators would result in the same symbol names, advance and dereference, regardless of the operator and data type they work with. Same with counting, constant, and pointer iterators. I'd suggest being on the safe side and having a "mangled" name for everything, like we used to.
Thanks for clarifying. I worry that without some additional information, the name of the iterator and its ops + value types may not be enough: for example, what if we have two ConstantIterators with the same value type, differing only in their state?
In [4]: CountingIterator(np.int32(1)).prefix
Out[4]: 'count_int32'
In [5]: CountingIterator(np.int32(2)).prefix
Out[5]: 'count_int32'
Should we include a unique UUID for safety?
We don't need a unique UUID here for safety. The value type on a constant iterator defines the codegen, but it's applied to different states. If you think about the C++ analogy, there's only one counting_iterator<int>::operator* symbol; only this differs between invocations.
We have to be careful with transform, though. The op name is not bijective:

def make_transform(constant):
    def op(val):
        return val * constant
    return iterators.Transform(..., op)

reduce_by_key(make_transform(1), make_transform(2.0), ...)

If dereference and advance are mangled by the op name alone, there's going to be a problem, so we need a way to distinguish operators.
Thanks - I fixed the ABI name for TransformIterator instances. Can you check if this is good now?
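The collision discussed in this thread can be avoided in pure Python. This is a hypothetical sketch, not the actual cuda.parallel fix: appending a process-wide counter to the op-derived name guarantees distinct ABI names even for closures that all share the same __name__.

```python
import itertools

# Process-wide counter; each mangled op gets a distinct suffix, so two
# closures both named "op" cannot collide within one NVRTC compilation.
_op_counter = itertools.count()


def mangle_op_name(op, value_type_name):
    # Hypothetical helper: op.__name__ + value type alone are not
    # bijective, so a running counter disambiguates.
    return f"{op.__name__}_{value_type_name}_{next(_op_counter)}"


def make_transform_op(constant):
    def op(val):
        return val * constant
    return op


name_a = mangle_op_name(make_transform_op(1), "int32")
name_b = mangle_op_name(make_transform_op(2.0), "float64")
assert name_a != name_b
```

A UUID would work just as well; the only requirement surfaced in the review is uniqueness within a single build step, which any monotonic counter satisfies.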
Force-pushed from a760108 to dd84b01 (Compare)
🟩 CI finished in 24m 21s: Pass: 100%/3 | Total: 32m 48s | Avg: 10m 56s | Max: 24m 01s
Force-pushed from 4d9140c to fa2e0af (Compare)
🟩 CI finished in 2h 22m: Pass: 100%/3 | Total: 36m 28s | Avg: 12m 09s | Max: 27m 17s
🟩 CI finished in 32m 32s: Pass: 100%/3 | Total: 36m 24s | Avg: 12m 08s | Max: 27m 47s
🟩 CI finished in 27m 04s: Pass: 100%/3 | Total: 36m 03s | Avg: 12m 01s | Max: 26m 05s
🟩 CI finished in 56m 48s: Pass: 100%/3 | Total: 37m 39s | Avg: 12m 33s | Max: 28m 02s
These are almost all just minor nits, except for relying on dict insertion order; that's a subtle risk we don't need to take.
@@ -13,7 +14,7 @@ def CacheModifiedInputIterator(device_array, modifier):
     value_type = device_array.dtype
     return _iterators.CacheModifiedPointer(
         device_array.__cuda_array_interface__["data"][0],
-        _iterators.numba_type_from_any(value_type),
+        numba.from_dtype(value_type),
I'd inline device_array.dtype here; value_type isn't used anywhere else.
Good idea - inlined it.
@@ -1,4 +1,5 @@
 from . import _iterators
+import numba
Just to explain, no change needed:
A few weeks ago @leofang explained to me that the long-term vision for cuda.parallel is that it works independently of numba. Therefore I tried to not have the numba import here.
At the moment we're so deeply dependent on numba, it seems fine that it comes through here (public API).
@@ -31,4 +32,4 @@ def CountingIterator(offset):


 def TransformIterator(op, it):
     """Python facade (similar to built-in map) mimicking a C++ Random Access TransformIterator."""
-    return _iterators.TransformIterator(op, it)
+    return _iterators.make_transform_iterator(it, op)
Do we need to reverse the order of the args here? I was going by map(op, it) (the Python built-in). It's a nit, and one case by itself doesn't matter much, but this kind of thing is likely to trip someone up. Generally I try to stay as compatible as possible with established APIs.
I should have left a comment here clarifying why I reversed the argument order:

- The order it, op is the same as in thrust::transform_iterator. As discussed in this thread, we should build an API as similar to the underlying C++ layer as possible, and in the future, a more Pythonic one on top of that.
- Since we're no longer using the name map, it's less likely to cause confusion. I absolutely agree that a function called map(it, op) would trip folks up.
@lru_cache(maxsize=256)
The function is just a dictionary lookup anyway.
Maybe simpler is better here?
NUMBA_TO_CTYPES_MAPPING = {
numba.types.int8: ctypes.c_int8,
numba.types.int16: ctypes.c_int16,
numba.types.int32: ctypes.c_int32,
numba.types.int64: ctypes.c_int64,
numba.types.uint8: ctypes.c_uint8,
numba.types.uint16: ctypes.c_uint16,
numba.types.uint32: ctypes.c_uint32,
numba.types.uint64: ctypes.c_uint64,
numba.types.float32: ctypes.c_float,
numba.types.float64: ctypes.c_double,
}
Then replace the function calls with NUMBA_TO_CTYPES_MAPPING[ntype].
In 1bffc98 I replaced the use of these utility functions with ones that numba provides.
Awesome, much better.
    (returns nothing).
    - a `dereference` (static) method that dereferences the state pointer
      and returns a value.
    """
Maybe add: "Note: advance and dereference exist exclusively for compilation with numba. They are not meant to be called from Python."
Added some clarifying notes.
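The contract discussed above can be sketched as follows. The class and method bodies are illustrative only (the real IteratorBase lives in cuda.parallel's _iterators module): the base raises for everything a subclass must supply, and advance/dereference are staticmethods intended solely for numba device compilation, not for calling from Python.

```python
class IteratorBase:
    """Sketch of the common iterator base described in the docstring."""

    @property
    def state(self):
        # A pointer-like handle to the iterator's device state.
        raise NotImplementedError("Subclasses must override the state property")

    @staticmethod
    def advance(state, distance):
        # Exists for numba compilation; advances the state in place.
        raise NotImplementedError("Subclasses must override advance")

    @staticmethod
    def dereference(state):
        # Exists for numba compilation; reads a value from the state.
        raise NotImplementedError("Subclasses must override dereference")


class CountingIteratorSketch(IteratorBase):
    """Toy subclass: state is just the current integer offset."""

    def __init__(self, offset):
        self._offset = offset

    @property
    def state(self):
        return self._offset

    @staticmethod
    def advance(state, distance):
        return state + distance

    @staticmethod
    def dereference(state):
        return state
```

Each concrete iterator then only initializes the base and supplies its own advance/dereference pair, which is the refactor this PR performs.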
alignment = context.get_value_type(numba_type).get_abi_alignment(
    context.target_data
)
(advance_abi_name, advance_ltoir), (deref_abi_name, deref_ltoir) = it.ltoirs.items()
it.ltoirs is a dict; relying on insertion order to know what is advance and what is deref is a little dicey. See below.
This was definitely true before Python 3.6, but it is explicitly legal since Python 3.7: iterating over a dict view (such as .items()) returns items in insertion order.
I know, but I really wouldn't rely on that. It's too brittle and unobvious, and it really buys nothing. Even a tuple of tuples isn't exactly great, but it's not quite as unobvious. Also, AFAIK there is no assurance that future versions of Python will guarantee insertion order.
    output="ltoir",
    abi_name=deref_abi_name,
)
return {advance_abi_name: advance_ltoir, deref_abi_name: deref_ltoir}
It's safer to simply return a tuple of tuples here; then there is no question about insertion order. Nicest would be a namedtuple or dataclass with abi_name and ltoir as members, but since there is only one caller (IIUC), a plain tuple of tuples would seem fine.
There's no risk here from using a dict. Python dicts guarantee insertion order since Python 3.7. From the Python docs: "[...] the built-in dict class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7)". A namedtuple may be a little nicer, but a dict serves fine, especially for an internal function.
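The namedtuple alternative the reviewer suggests can be sketched like this (all names and the placeholder bodies are hypothetical stand-ins for the real cuda_compile(..., output="ltoir") calls); unlike a two-entry dict, it names advance and dereference explicitly at the call site, independent of any ordering guarantee.

```python
from collections import namedtuple

# Each compiled function is an (abi_name, ltoir) pair, and the pair of
# pairs is returned with explicit field names instead of dict ordering.
CompiledOp = namedtuple("CompiledOp", ["abi_name", "ltoir"])
CompiledIteratorOps = namedtuple("CompiledIteratorOps", ["advance", "dereference"])


def compile_iterator_ops():
    # Placeholder bodies standing in for the real LTOIR compilation.
    return CompiledIteratorOps(
        advance=CompiledOp("advance_count_int32", b"<advance ltoir>"),
        dereference=CompiledOp("deref_count_int32", b"<deref ltoir>"),
    )


ops = compile_iterator_ops()
# The caller names what it wants instead of unpacking by position:
assert ops.advance.abi_name == "advance_count_int32"
```

Either choice works on Python >= 3.7; the namedtuple just makes the intent self-documenting.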
@property
def state(self):
    raise AttributeError("Subclasses must override advance staticmethod")
NotImplementedError?
Good call. Changed it to NotImplementedError.
)


@staticmethod
def advance(it, distance):
What do you think about state instead of it here? Or data, because that's what we're passing in the calls. Or it_data, maybe. (it looks too much like it's an actual iterator object.)
Agree - I changed it to state.
def __init__(self, it, op):
    self._it = it
    numba_type = it.numba_type
    op_abi_name = f"{self.__class__.__name__}_{op.py_func.__name__}"
Is this where we'll need to make the name more unique if there are two transform iterators in the same context in the future? Maybe add a TODO?
Yup - added a TODO
Force-pushed from 3d7a1fc to 9c66451 (Compare)
🟩 CI finished in 25m 50s: Pass: 100%/3 | Total: 35m 00s | Avg: 11m 40s | Max: 25m 50s
Looks great to me!
Description

Closes #3064

This PR primarily simplifies the implementation of TransformIterator to look more like the implementations of the other iterator types. The major change is using numba to compile the unary ("transform") operation on the Python side, rather than passing the LTOIR to C++ and compiling it there.

In addition, this PR introduces an IteratorBase class that encapsulates much of the common logic across all the iterator types. Each iterator subclass now simply needs to initialize the base and define the iterator-specific advance and dereference methods.

In terms of performance, there's a slight improvement (I believe primarily from some additional caching):
Benchmark
Before this PR
After this PR
Benchmarking script
Checklist