
[Refactor] cuda.parallel: Simplify TransformIterator implementation and refactor iterators to derive from a common base #3118

Merged
23 commits merged on Dec 14, 2024

Conversation

shwina
Contributor

@shwina shwina commented Dec 10, 2024

Description

Closes #3064

This PR primarily simplifies the implementation of TransformIterator to look more like the implementation of the other iterator types. The major change is using numba to compile the unary ("transform") operation on the Python side, rather than passing the LTOIR to C++ and compiling it there.

In addition, this PR introduces an IteratorBase class that encapsulates much of the logic common to all the iterator types. Each iterator subclass now simply initializes the base and defines the iterator-specific advance and dereference methods.
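
The shape of that refactor can be illustrated with a pure-Python sketch. Note that the class and method names below are illustrative only, not the actual cuda.parallel API; in the real code, advance and dereference are compiled as device functions.

```python
class IteratorBase:
    """Holds state common to all iterators; subclasses supply the
    iterator-specific `advance` and `dereference` methods."""

    def __init__(self, state, value_type):
        self.state = state
        self.value_type = value_type

    @staticmethod
    def advance(state, distance):
        raise NotImplementedError

    @staticmethod
    def dereference(state):
        raise NotImplementedError


class CountingIteratorSketch(IteratorBase):
    """Counting iterator analogy: state is just the current value."""

    def __init__(self, offset):
        super().__init__(state=offset, value_type=type(offset))

    @staticmethod
    def advance(state, distance):
        return state + distance

    @staticmethod
    def dereference(state):
        return state


it = CountingIteratorSketch(10)
state = type(it).advance(it.state, 5)
print(type(it).dereference(state))  # 15
```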

In terms of performance, there's a slight improvement (primarily, I believe, from some additional caching):

Benchmark

Before this PR

# 10_000_000 elements
Average time to reduce device array: 0.0002010730910114944
Average time to reduce CountingIterator: 0.014292097103316337
Average time to reduce TransformIterator(CountingIterator): 0.0567475771997123

After this PR

# 10_000_000 elements
Average time to reduce device array: 0.0001940533984452486
Average time to reduce CountingIterator: 0.0004284693975932896
Average time to reduce TransformIterator(CountingIterator): 0.04169541570590809
Benchmarking script
import timeit

import cupy as cp
import numpy as np

import cuda.parallel.experimental as cudax
from cuda.parallel.experimental.iterators import CountingIterator, TransformIterator

def binary_op(x, y): return x + y

def unary_op(x): return x if x % 2 == 0 else 1

def benchmark_reduce_array(num_runs=10):
    size = 10_000_000
    out = cp.empty(1, dtype=cp.int64)
    h_init = np.array([0], dtype="int64")

    inp = cp.arange(size)
    reducer = cudax.reduce_into(inp, out, binary_op, h_init)
    t1 = timeit.default_timer()
    for _ in range(num_runs):
        temp_storage_bytes = reducer(None, inp, out, size, h_init)
        d_temp_storage = cp.zeros(temp_storage_bytes, dtype=np.uint8)
        reducer(d_temp_storage, inp, out, size, h_init)
    t2 = timeit.default_timer()
    assert out.get() == inp.sum().get()
    print(f"Average time to reduce device array: {(t2 - t1) / num_runs}")


def benchmark_reduce_counting(num_runs=10):
    size = 10_000_000
    out = cp.empty(1, dtype=cp.int64)
    h_init = np.array([0], dtype="int64")

    inp = CountingIterator(np.int64(1))
    reducer = cudax.reduce_into(inp, out, binary_op, h_init)
    t1 = timeit.default_timer()
    for _ in range(num_runs):
        inp = CountingIterator(np.int64(1))
        temp_storage_bytes = reducer(None, inp, out, size, h_init)
        d_temp_storage = cp.zeros(temp_storage_bytes, dtype=np.uint8)
        reducer(d_temp_storage, inp, out, size, h_init)
    t2 = timeit.default_timer()
    assert out.get() == cp.arange(1, size + 1).sum().get()
    print(f"Average time to reduce CountingIterator: {(t2 - t1) / num_runs}")


def benchmark_reduce_transform_counting(num_runs=10):
    size = 10_000_000
    out = cp.empty(1, dtype=cp.int64)
    h_init = np.array([0], dtype="int64")

    inp = TransformIterator(unary_op, CountingIterator(np.int64(1)))
    reducer = cudax.reduce_into(inp, out, binary_op, h_init)
    t1 = timeit.default_timer()
    for _ in range(num_runs):
        inp = TransformIterator(unary_op, CountingIterator(np.int64(1)))
        temp_storage_bytes = reducer(None, inp, out, size, h_init)
        d_temp_storage = cp.zeros(temp_storage_bytes, dtype=np.uint8)
        reducer(d_temp_storage, inp, out, size, h_init)
    t2 = timeit.default_timer()
    expected = cp.arange(1, size + 1)
    expected[::2] = 1  # odd values map to 1 under unary_op
    assert out.get() == expected.sum().get()
    print(f"Average time to reduce TransformIterator(CountingIterator): {(t2 - t1) / num_runs}")

benchmark_reduce_array()
benchmark_reduce_counting()
benchmark_reduce_transform_counting()

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Dec 10, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@@ -92,5 +80,4 @@ struct cccl_iterator_t
cccl_op_t dereference;
cccl_type_info value_type;
void* state;
cccl_string_views* ltoirs = nullptr;
Contributor Author

Because we compile the op for TransformIterator on the Python side, we don't need to pass the LTOIR here.

Generally speaking, any device functions used by the advance and dereference methods can be compiled purely on the Python side.

@shwina
Contributor Author

shwina commented Dec 10, 2024

/ok to test

advance_ltoir, deref_ltoir = it.ltoirs
advance_op = _CCCLOp(
    _CCCLOpKindEnum.STATELESS,
    type(it).advance.__name__.encode("utf-8"),
Contributor Author

@shwina shwina Dec 10, 2024

Question: what are the requirements from the op's name here? Do the names need to be globally unique? Unique within the context of a single NVRTC compilation? Something else?

@shwina
Contributor Author

shwina commented Dec 10, 2024

/ok to test

1 similar comment
@shwina
Contributor Author

shwina commented Dec 10, 2024

/ok to test

@shwina shwina changed the title [Refactor] Simplify TransformIterator implementation and refactor iterators to derive from a common base [Refactor] cuda.parallel: Simplify TransformIterator implementation and refactor iterators to derive from a common base Dec 10, 2024
@shwina shwina force-pushed the refactor-and-simplify-iterators branch from 7ac794c to 152345c Compare December 10, 2024 20:20
@shwina
Contributor Author

shwina commented Dec 10, 2024

/ok to test

@@ -13,7 +13,6 @@
import numba.cuda
import numba.types
import cuda.parallel.experimental as cudax
from cuda.parallel.experimental import _iterators
Contributor Author

Note: I removed all usages of the private module from the tests as discussed in #2788 (comment). To follow up, do we want an additional test for testing pointer handling in TransformIterator?

@shwina shwina force-pushed the refactor-and-simplify-iterators branch from 152345c to dd84b01 Compare December 10, 2024 20:49
Contributor

🟩 CI finished in 37m 05s: Pass: 100%/3 | Total: 35m 35s | Avg: 11m 51s | Max: 26m 35s
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 00s | Avg: 4m 30s | Max: 7m 01s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 00s | Avg:  4m 30s | Max:  7m 01s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 00s | Avg:  4m 30s | Max:  7m 01s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 00s | Avg:  4m 30s | Max:  7m 01s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 00s | Avg:  4m 30s | Max:  7m 01s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 00s | Avg:  4m 30s | Max:  7m 01s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 00s | Avg:  4m 30s | Max:  7m 01s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 00s | Avg:  4m 30s | Max:  7m 01s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 59s | Avg:  1m 59s | Max:  1m 59s
      🟩 Test               Pass: 100%/1   | Total:  7m 01s | Avg:  7m 01s | Max:  7m 01s
    
  • 🟩 python: Pass: 100%/1 | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 26m 35s | Avg: 26m 35s | Max: 26m 35s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 3)

# Runner
2 linux-amd64-gpu-v100-latest-1
1 linux-amd64-cpu16

Comment on lines +209 to +255
it_advance = numba.cuda.jit(type(it).advance, device=True)
it_dereference = numba.cuda.jit(type(it).dereference, device=True)
op = numba.cuda.jit(op, device=True)
Contributor Author

@shwina shwina Dec 10, 2024

This doesn't actually do a JIT compilation (there isn't any type information to compile against yet). It makes it legal to call it_advance, it_dereference, and op from other numba-compiled device functions.

Collaborator

remark: this approach is so much better!

@shwina
Contributor Author

shwina commented Dec 10, 2024

/ok to test

@shwina shwina marked this pull request as ready for review December 10, 2024 21:02
@shwina shwina requested a review from a team as a code owner December 10, 2024 21:02
@shwina shwina requested a review from alliepiper December 10, 2024 21:02
if it_ntype_ir is None:
    raise RuntimeError(f"Unsupported: {type(it.ntype)=}")

op_abi_name = f"{op.__name__}_{it.ntype.name}"
Collaborator

critical: I think we need a unique name per JIT compilation (invocation of the C Parallel *_build step). Some algorithms, say reduce-by-key, have multiple iterators. If you pass a transform of int and a transform of float, compiling the member functions of the corresponding iterators would produce the same symbol names, advance and dereference, regardless of the operator and data type they work with. The same applies to counting, constant, and pointer iterators. I'd suggest being on the safe side and using a "mangled" name for everything, like we used to.

Contributor Author

Thanks for clarifying. I worry that without some additional information, the iterator's name plus its ops and value types may not be enough: for example, what if we have two CountingIterators with the same value type, differing only in their state?

In [4]: CountingIterator(np.int32(1)).prefix
Out[4]: 'count_int32'

In [5]: CountingIterator(np.int32(2)).prefix
Out[5]: 'count_int32'

Should we include a UUID for safety?

Collaborator

We don't need a UUID here for safety. The value type on a constant iterator defines the codegen, but that codegen is applied to different states. If you think about the C++ analogy, there's only one counting_iterator<int>::operator* symbol; only the state differs between invocations.

Collaborator

We have to be careful with transform, though: the op name is not bijective:

def make_transform(constant):
  def op(val):
    return val * constant
  return iterators.Transform(..., op)

reduce_by_key(make_transform(1), make_transform(2.0), ...)

If dereference and advance are mangled by the op name alone, there's going to be a problem, so we need a way to distinguish operators.

Contributor Author

Thanks - I fixed the ABI name for TransformIterator instances. Can you check if this is good now?
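
One way to guarantee distinct symbols, sketched in plain Python with a hypothetical make_abi_name helper (not the PR's actual implementation): combine the iterator kind, value type, and op name with a per-process counter, so two ops that happen to share a __name__ (e.g. two closures both named op) never collide.

```python
from itertools import count

# Per-process counter used to disambiguate same-named ops.
_abi_counter = count()

def make_abi_name(kind: str, value_type: str, op_name: str) -> str:
    """Build a mangled ABI name: kind, value type, op name, unique suffix."""
    return f"{kind}_{value_type}_{op_name}_{next(_abi_counter)}"

a = make_abi_name("transform", "int32", "op")
b = make_abi_name("transform", "int32", "op")
print(a, b)   # transform_int32_op_0 transform_int32_op_1
assert a != b  # same op name, but distinct symbols
```

A content hash of the op's bytecode and closure would be an alternative that also preserves caching across identical ops.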

@shwina shwina force-pushed the refactor-and-simplify-iterators branch from a760108 to dd84b01 Compare December 11, 2024 11:08
@shwina shwina force-pushed the refactor-and-simplify-iterators branch from 4d9140c to fa2e0af Compare December 11, 2024 19:47
Contributor

@rwgk rwgk left a comment


These are almost all just minor nits, except for the reliance on dict insertion order; that's a subtle risk we don't need to take.

@@ -13,7 +14,7 @@ def CacheModifiedInputIterator(device_array, modifier):
value_type = device_array.dtype
return _iterators.CacheModifiedPointer(
device_array.__cuda_array_interface__["data"][0],
_iterators.numba_type_from_any(value_type),
numba.from_dtype(value_type),
Contributor

I'd inline device_array.dtype here; value_type isn't used anywhere else.

Contributor Author

Good idea - inlined it.

@@ -1,4 +1,5 @@
from . import _iterators
import numba
Contributor

Just to explain, no change needed:

A few weeks ago @leofang explained to me that the long-term vision for cuda.parallel is that it works independently of numba. Therefore I tried to not have the numba import here.

At the moment we're so deeply dependent on numba, it seems fine that it comes through here (public API).

@@ -31,4 +32,4 @@ def CountingIterator(offset):

def TransformIterator(op, it):
"""Python facade (similar to built-in map) mimicking a C++ Random Access TransformIterator."""
return _iterators.TransformIterator(op, it)
return _iterators.make_transform_iterator(it, op)
Contributor

Do we need to reverse the order of args here?

I was going by map(op, it) (the Python built-in).

It's a nit, and one case by itself doesn't matter much, but this kind of thing is likely to trip up someone. Generally I try to stay as compatible as possible with established APIs.

Contributor Author

I should have left a comment here clarifying why I reversed the argument order:

  1. The order it, op is the same as in thrust::transform_iterator. As discussed in this thread, we should build an API as similar to the underlying C++ layer as possible - and in the future, a more Pythonic one on top of that.

  2. Since we're no longer using the name map, it's less likely to cause confusion. I absolutely agree that a function called map(it, op) would trip folks up.
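To make the trade-off concrete, here is a hedged pure-Python sketch of the two argument orders being compared; `make_transform_iterator` below is only a stand-in for demonstration, not the real cuda.parallel implementation (which compiles `op` with numba):

```python
# Python's built-in map takes the callable first: map(op, it).
squares_builtin = list(map(lambda x: x * x, range(5)))

# thrust::make_transform_iterator takes the iterator first, and the
# cuda.parallel facade mirrors that (it, op) order. This stand-in only
# demonstrates the argument order.
def make_transform_iterator(it, op):
    return map(op, it)

squares_facade = list(make_transform_iterator(range(5), lambda x: x * x))
print(squares_builtin == squares_facade)  # True
```

Either order produces the same iterator; the choice is purely about which established API (Python's `map` or Thrust's `make_transform_iterator`) the facade should echo.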



@lru_cache(maxsize=256)
Contributor

The function is just a dictionary lookup anyway.

Maybe simpler is better here?

NUMBA_TO_CTYPES_MAPPING = {
    numba.types.int8: ctypes.c_int8,
    numba.types.int16: ctypes.c_int16,
    numba.types.int32: ctypes.c_int32,
    numba.types.int64: ctypes.c_int64,
    numba.types.uint8: ctypes.c_uint8,
    numba.types.uint16: ctypes.c_uint16,
    numba.types.uint32: ctypes.c_uint32,
    numba.types.uint64: ctypes.c_uint64,
    numba.types.float32: ctypes.c_float,
    numba.types.float64: ctypes.c_double,
}

Then replace the function calls with NUMBA_TO_CTYPES_MAPPING[ntype].

Contributor Author

In 1bffc98 I replaced the use of these utility functions with ones that numba provides.

Contributor

Awesome, much better.

(returns nothing).
- a `dereference` (static) method that dereferences the state pointer
and returns a value.
"""
Contributor

Maybe add

Note: advance and dereference exist exclusively for compilation with numba. They are not meant to be called from Python.

Contributor Author

Added some clarifying notes.

alignment = context.get_value_type(numba_type).get_abi_alignment(
context.target_data
)
(advance_abi_name, advance_ltoir), (deref_abi_name, deref_ltoir) = it.ltoirs.items()
Contributor

it.ltoirs is a dict; relying on insertion order to know what is advance and what is deref is a little dicey.

See below.

Contributor Author

@shwina shwina Dec 13, 2024
This was definitely true before Python 3.6, but relying on insertion order is explicitly legal since Python 3.7. Iterating over a dict view (such as .items()) now returns entries in insertion order.

Contributor

I know, but I really wouldn't rely on that. It's too brittle and unobvious, and relying on it buys nothing. Even a tuple of tuples isn't exactly great, but it's not quite as unobvious.

Also, AFAIK there is no assurance that future versions of Python will guarantee insertion order.

output="ltoir",
abi_name=deref_abi_name,
)
return {advance_abi_name: advance_ltoir, deref_abi_name: deref_ltoir}
Contributor

It's safer to simply return a tuple of tuples here; then there is no question about insertion order.

Nicest would be a namedtuple or dataclass, with abi_name, ltoir as members. But since there is only one caller (IIUC) a plain tuple of tuples would seem fine.

Contributor Author

There's no risk here from using a dict. Python dicts guarantee maintaining insertion order since Python 3.7. From the Python docs:

[...] the built-in dict class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7):

A namedtuple may be a little nicer but a dict serves fine, especially for an internal function.
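As a minimal illustration of the pattern being debated (the names and byte strings here are placeholders, not the real ABI names or LTOIR payloads), unpacking a two-entry dict's items relies on the insertion-order guarantee, while a namedtuple makes the two roles explicit at the call site:

```python
from collections import namedtuple

# dict version: unpacking .items() relies on insertion order,
# which the language guarantees since Python 3.7.
ltoirs = {"advance_abi": b"<advance ltoir>", "deref_abi": b"<deref ltoir>"}
(advance_name, advance_ltoir), (deref_name, deref_ltoir) = ltoirs.items()

# namedtuple version: the roles are named, so no ordering knowledge
# is needed by the caller.
CompiledOps = namedtuple("CompiledOps", ["advance", "dereference"])
ops = CompiledOps(advance=("advance_abi", b"<advance ltoir>"),
                  dereference=("deref_abi", b"<deref ltoir>"))

print(advance_name == ops.advance[0])  # True
```

Both forms carry the same data; the namedtuple just moves the advance/dereference distinction from positional convention into explicit field names.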


@property
def state(self):
raise AttributeError("Subclasses must override advance staticmethod")
Contributor

NotImplementedError

?

Contributor Author

Good call. Changed it to NotImplementedError.

)

@staticmethod
def advance(it, distance):
Contributor

What do you think about state instead of it here?

Or data, because that's what we're passing in the calls.

Or it_data maybe.

(it looks too much like it's an actual iterator object.)

Contributor Author

Agree - I changed it to state.
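Putting the two review points together (raising NotImplementedError, and naming the first argument `state` rather than `it`), here is a hedged pure-Python sketch of the shape the base class takes; `CountingIteratorSketch` is an illustrative subclass, not the real implementation, and in the real code `advance` exists for numba compilation rather than being called from Python:

```python
class IteratorBase:
    @property
    def state(self):
        raise NotImplementedError("Subclasses must override the state property")

    @staticmethod
    def advance(state, distance):
        # Named `state` because a raw state value is passed in,
        # not an iterator object.
        raise NotImplementedError("Subclasses must override the advance staticmethod")


class CountingIteratorSketch(IteratorBase):
    def __init__(self, offset):
        self._offset = offset

    @property
    def state(self):
        return self._offset

    @staticmethod
    def advance(state, distance):
        return state + distance


it = CountingIteratorSketch(10)
print(CountingIteratorSketch.advance(it.state, 5))  # 15
```

NotImplementedError is the conventional choice here: AttributeError from a property would be swallowed by `hasattr` checks, whereas NotImplementedError signals unambiguously that the subclass forgot an override.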

def __init__(self, it, op):
self._it = it
numba_type = it.numba_type
op_abi_name = f"{self.__class__.__name__}_{op.py_func.__name__}"
Contributor

Is this where we'll need to make the name more unique if there are two transform iterators in the same context in the future? Maybe add a TODO?

Contributor Author

Yup - added a TODO
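For illustration only, one hypothetical way that TODO could be addressed (this is an assumption about a possible fix, not what the PR does) is to fold a per-process counter into the ABI name, so two transform iterators wrapping ops with identical `__name__`s still get distinct symbols:

```python
import itertools

# Hypothetical helper: a monotonically increasing id keeps ABI names
# unique even when two ops share the same function name.
_op_id = itertools.count()

def make_op_abi_name(class_name, op_name):
    return f"{class_name}_{op_name}_{next(_op_id)}"

first = make_op_abi_name("TransformIterator", "square")
second = make_op_abi_name("TransformIterator", "square")
print(first != second)  # True
```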

@shwina shwina force-pushed the refactor-and-simplify-iterators branch from 3d7a1fc to 9c66451 on December 13, 2024 13:42
@shwina shwina requested review from a team as code owners December 13, 2024 13:42
@shwina shwina requested a review from rwgk December 13, 2024 13:42
Contributor

🟩 CI finished in 25m 50s: Pass: 100%/3 | Total: 35m 00s | Avg: 11m 40s | Max: 25m 50s
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 10s | Avg: 4m 35s | Max: 6m 59s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  6m 59s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  6m 59s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  6m 59s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  6m 59s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  6m 59s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  6m 59s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  6m 59s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 11s | Avg:  2m 11s | Max:  2m 11s
      🟩 Test               Pass: 100%/1   | Total:  6m 59s | Avg:  6m 59s | Max:  6m 59s
    
  • 🟩 python: Pass: 100%/1 | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 25m 50s | Avg: 25m 50s | Max: 25m 50s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 3)

# Runner
2 linux-amd64-gpu-v100-latest-1
1 linux-amd64-cpu16

Contributor

@rwgk rwgk left a comment
Looks great to me!

@shwina shwina merged commit 3a7f2cd into NVIDIA:main Dec 14, 2024
31 checks passed
3 participants