[Refactor] cuda.parallel: Simplify TransformIterator implementation and refactor iterators to derive from a common base #3118
Conversation
@@ -92,5 +80,4 @@ struct cccl_iterator_t
   cccl_op_t dereference;
   cccl_type_info value_type;
   void* state;
-  cccl_string_views* ltoirs = nullptr;
Because we compile the op for TransformIterator on the Python side, we don't need to pass the LTOIR here. Generally speaking, any device functions used by the advance and dereference methods can be compiled purely on the Python side.
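The division of labor described above can be sketched in plain Python. Everything here is a hypothetical mock, not the real cuda.parallel API: compile_to_ltoir stands in for a numba-based device compilation, and the class only illustrates that the Python side produces all LTOIR, so just state and symbol names would need to cross into the C layer.

```python
def compile_to_ltoir(func, abi_name):
    # Stand-in for a numba device compilation returning LTOIR bytes.
    return f"<ltoir for {abi_name}>".encode()


class TransformIteratorSketch:
    """Toy model: all device code is compiled Python-side; only state
    and symbol names would be handed to the C layer."""

    def __init__(self, op, value_type):
        advance_name = f"advance_{value_type}"
        deref_name = f"deref_{op.__name__}_{value_type}"
        self.ltoirs = {
            advance_name: compile_to_ltoir(self.advance, advance_name),
            deref_name: compile_to_ltoir(op, deref_name),
        }

    @staticmethod
    def advance(state, distance):
        return state + distance


def double(x):
    return x * 2


it = TransformIteratorSketch(double, "int32")
assert sorted(it.ltoirs) == ["advance_int32", "deref_double_int32"]
```

The point of the sketch is only the shape of the interface: because the op is compiled before the C `*_build` step runs, the `cccl_iterator_t` struct no longer needs an `ltoirs` field.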
/ok to test
advance_ltoir, deref_ltoir = it.ltoirs
advance_op = _CCCLOp(
    _CCCLOpKindEnum.STATELESS,
    type(it).advance.__name__.encode("utf-8"),
Question: what are the requirements for the op's name here? Do the names need to be globally unique? Unique within the context of a single NVRTC compilation? Something else?
xref: #3118 (comment)
/ok to test
1 similar comment
/ok to test
Force-pushed from 7ac794c to 152345c (Compare)
/ok to test
@@ -13,7 +13,6 @@
 import numba.cuda
 import numba.types
 import cuda.parallel.experimental as cudax
-from cuda.parallel.experimental import _iterators
Note: I removed all usages of the private module from the tests as discussed in #2788 (comment). As a follow-up, do we want an additional test for pointer handling in TransformIterator?
Force-pushed from 152345c to dd84b01 (Compare)
🟩 CI finished in 37m 05s: Pass: 100%/3 | Total: 35m 35s | Avg: 11m 51s | Max: 26m 35s
Runner counts (total jobs: 3): 2 × linux-amd64-gpu-v100-latest-1, 1 × linux-amd64-cpu16
it_advance = numba.cuda.jit(type(it).advance, device=True)
it_dereference = numba.cuda.jit(type(it).dereference, device=True)
op = numba.cuda.jit(op, device=True)
This doesn't actually do a JIT compilation (there isn't any type information for compilation anyway). It makes it legal to call it_advance, it_dereference, and op from other numba-compiled device functions.
remark: this approach is so much better!
/ok to test
🟩 CI finished in 26m 40s: Pass: 100%/3 | Total: 35m 31s | Avg: 11m 50s | Max: 26m 40s
🟩 CI finished in 27m 38s: Pass: 100%/3 | Total: 36m 30s | Avg: 12m 10s | Max: 27m 08s
if it_ntype_ir is None:
    raise RuntimeError(f"Unsupported: {type(it.ntype)=}")

op_abi_name = f"{op.__name__}_{it.ntype.name}"
critical: I think we need a unique name per jit compilation (invocation of the Parallel C *_build step). Some algorithms, say reduce-by-key, have multiple iterators. If you try passing a transform of int and a transform of float, compiling the member functions of the corresponding iterators would result in the same symbol names, advance and dereference, regardless of the operator and data type they work with. Same with counting, constant, and pointer iterators. I'd suggest being on the safe side and having a "mangled" name for everything, like we used to.
Thanks for clarifying. I worry that without some additional information, the name of the iterator and its ops + value types may not be enough: for example, what if we have two ConstantIterators with the same value type, differing only in their state?
In [4]: CountingIterator(np.int32(1)).prefix
Out[4]: 'count_int32'
In [5]: CountingIterator(np.int32(2)).prefix
Out[5]: 'count_int32'
Should we include a unique UUID for safety?
We don't need a unique UUID here for safety. The value type on a constant iterator defines the codegen, but it's applied to different states. If you think about the C++ analogy, there's only one counting_iterator<int>::operator* symbol; only this differs between invocations.
We have to be careful with transform, though. The op name is not bijective:

def make_transform(constant):
    def op(val):
        return val * constant
    return iterators.Transform(..., op)

reduce_by_key(make_transform(1), make_transform(2.0), ...)

If dereference and advance are mangled by the op name alone, there's going to be a problem, so we need a way to distinguish operators.
Thanks - I fixed the ABI name for TransformIterator instances. Can you check if this is good now?
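The collision discussed in this thread can be avoided in pure Python. This is a hypothetical sketch, not the actual cuda.parallel fix: appending a process-wide counter to the op-derived name guarantees distinct ABI names even for closures that all share the same __name__.

```python
import itertools

# Process-wide counter; each mangled op gets a distinct suffix, so two
# closures both named "op" cannot collide within one NVRTC compilation.
_op_counter = itertools.count()


def mangle_op_name(op, value_type_name):
    # Hypothetical helper: op.__name__ + value type alone are not
    # bijective, so a running counter disambiguates.
    return f"{op.__name__}_{value_type_name}_{next(_op_counter)}"


def make_transform_op(constant):
    def op(val):
        return val * constant
    return op


name_a = mangle_op_name(make_transform_op(1), "int32")
name_b = mangle_op_name(make_transform_op(2.0), "float64")
assert name_a != name_b
```

A UUID would work just as well; the only requirement surfaced in the review is uniqueness within a single build step, which any monotonic counter satisfies.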
Force-pushed from a760108 to dd84b01 (Compare)
🟩 CI finished in 24m 21s: Pass: 100%/3 | Total: 32m 48s | Avg: 10m 56s | Max: 24m 01s
Force-pushed from 4d9140c to fa2e0af (Compare)
🟩 CI finished in 2h 22m: Pass: 100%/3 | Total: 36m 28s | Avg: 12m 09s | Max: 27m 17s
🟩 CI finished in 32m 32s: Pass: 100%/3 | Total: 36m 24s | Avg: 12m 08s | Max: 27m 47s
🟩 CI finished in 27m 04s: Pass: 100%/3 | Total: 36m 03s | Avg: 12m 01s | Max: 26m 05s
🟩 CI finished in 56m 48s: Pass: 100%/3 | Total: 37m 39s | Avg: 12m 33s | Max: 28m 02s
These are almost all just minor nits, except for relying on dict insertion order; that's a subtle risk we don't need to take.
@@ -13,7 +14,7 @@ def CacheModifiedInputIterator(device_array, modifier):
     value_type = device_array.dtype
     return _iterators.CacheModifiedPointer(
         device_array.__cuda_array_interface__["data"][0],
-        _iterators.numba_type_from_any(value_type),
+        numba.from_dtype(value_type),
I'd inline device_array.dtype here; value_type isn't used anywhere else.
Good idea - inlined it.
@@ -1,4 +1,5 @@
 from . import _iterators
+import numba
Just to explain, no change needed:
A few weeks ago @leofang explained to me that the long-term vision for cuda.parallel is that it works independently of numba. Therefore I tried to not have the numba import here.
At the moment we're so deeply dependent on numba, it seems fine that it comes through here (public API).
@@ -31,4 +32,4 @@ def CountingIterator(offset):


 def TransformIterator(op, it):
     """Python facade (similar to built-in map) mimicking a C++ Random Access TransformIterator."""
-    return _iterators.TransformIterator(op, it)
+    return _iterators.make_transform_iterator(it, op)
Do we need to reverse the order of the args here? I was going by map(op, it) (the Python built-in). It's a nit, and one case by itself doesn't matter much, but this kind of thing is likely to trip someone up. Generally I try to stay as compatible as possible with established APIs.
I should have left a comment here clarifying why I reversed the argument order:

- The order it, op is the same as in thrust::transform_iterator. As discussed in this thread, we should build an API as similar to the underlying C++ layer as possible, and in the future, a more Pythonic one on top of that.
- Since we're no longer using the name map, it's less likely to cause confusion. I absolutely agree that a function called map(it, op) would trip folks up.
@lru_cache(maxsize=256)
The function is just a dictionary lookup anyway.
Maybe simpler is better here?
NUMBA_TO_CTYPES_MAPPING = {
numba.types.int8: ctypes.c_int8,
numba.types.int16: ctypes.c_int16,
numba.types.int32: ctypes.c_int32,
numba.types.int64: ctypes.c_int64,
numba.types.uint8: ctypes.c_uint8,
numba.types.uint16: ctypes.c_uint16,
numba.types.uint32: ctypes.c_uint32,
numba.types.uint64: ctypes.c_uint64,
numba.types.float32: ctypes.c_float,
numba.types.float64: ctypes.c_double,
}
Then replace the function calls with NUMBA_TO_CTYPES_MAPPING[ntype].
In 1bffc98 I replaced the use of these utility functions with ones that numba provides.
Awesome, much better.
    (returns nothing).
    - a `dereference` (static) method that dereferences the state pointer
      and returns a value.
    """
Maybe add: "Note: advance and dereference exist exclusively for compilation with numba. They are not meant to be called from Python."
Added some clarifying notes.
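The contract discussed above can be sketched as follows. The class and method bodies are illustrative only (the real IteratorBase lives in cuda.parallel's _iterators module): the base raises for everything a subclass must supply, and advance/dereference are staticmethods intended solely for numba device compilation, not for calling from Python.

```python
class IteratorBase:
    """Sketch of the common iterator base described in the docstring."""

    @property
    def state(self):
        # A pointer-like handle to the iterator's device state.
        raise NotImplementedError("Subclasses must override the state property")

    @staticmethod
    def advance(state, distance):
        # Exists for numba compilation; advances the state in place.
        raise NotImplementedError("Subclasses must override advance")

    @staticmethod
    def dereference(state):
        # Exists for numba compilation; reads a value from the state.
        raise NotImplementedError("Subclasses must override dereference")


class CountingIteratorSketch(IteratorBase):
    """Toy subclass: state is just the current integer offset."""

    def __init__(self, offset):
        self._offset = offset

    @property
    def state(self):
        return self._offset

    @staticmethod
    def advance(state, distance):
        return state + distance

    @staticmethod
    def dereference(state):
        return state
```

Each concrete iterator then only initializes the base and supplies its own advance/dereference pair, which is the refactor this PR performs.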
alignment = context.get_value_type(numba_type).get_abi_alignment(
    context.target_data
)
(advance_abi_name, advance_ltoir), (deref_abi_name, deref_ltoir) = it.ltoirs.items()
it.ltoirs is a dict; relying on insertion order to know what is advance and what is deref is a little dicey. See below.
This was definitely true before Python 3.6, but it is explicitly legal since Python 3.7: iterating over a dict view (such as .items()) returns items in insertion order.
I know, but I really wouldn't rely on that. It's too brittle and unobvious, and it really buys nothing. Even a tuple of tuples isn't exactly great, but it's not quite as unobvious. Also, AFAIK there is no assurance that future versions of Python will guarantee insertion order.
    output="ltoir",
    abi_name=deref_abi_name,
)
return {advance_abi_name: advance_ltoir, deref_abi_name: deref_ltoir}
It's safer to simply return a tuple of tuples here; then there is no question about insertion order. Nicest would be a namedtuple or dataclass with abi_name and ltoir as members, but since there is only one caller (IIUC), a plain tuple of tuples would seem fine.
There's no risk here from using a dict. Python dicts guarantee insertion order since Python 3.7. From the Python docs: "[...] the built-in dict class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7)". A namedtuple may be a little nicer, but a dict serves fine, especially for an internal function.
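The namedtuple alternative the reviewer suggests can be sketched like this (all names and the placeholder bodies are hypothetical stand-ins for the real cuda_compile(..., output="ltoir") calls); unlike a two-entry dict, it names advance and dereference explicitly at the call site, independent of any ordering guarantee.

```python
from collections import namedtuple

# Each compiled function is an (abi_name, ltoir) pair, and the pair of
# pairs is returned with explicit field names instead of dict ordering.
CompiledOp = namedtuple("CompiledOp", ["abi_name", "ltoir"])
CompiledIteratorOps = namedtuple("CompiledIteratorOps", ["advance", "dereference"])


def compile_iterator_ops():
    # Placeholder bodies standing in for the real LTOIR compilation.
    return CompiledIteratorOps(
        advance=CompiledOp("advance_count_int32", b"<advance ltoir>"),
        dereference=CompiledOp("deref_count_int32", b"<deref ltoir>"),
    )


ops = compile_iterator_ops()
# The caller names what it wants instead of unpacking by position:
assert ops.advance.abi_name == "advance_count_int32"
```

Either choice works on Python >= 3.7; the namedtuple just makes the intent self-documenting.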
@property
def state(self):
    raise AttributeError("Subclasses must override advance staticmethod")
NotImplementedError?
Good call. Changed it to NotImplementedError.
)


@staticmethod
def advance(it, distance):
What do you think about state instead of it here? Or data, because that's what we're passing in the calls. Or it_data, maybe. (it looks too much like it's an actual iterator object.)
Agree - I changed it to state.
def __init__(self, it, op):
    self._it = it
    numba_type = it.numba_type
    op_abi_name = f"{self.__class__.__name__}_{op.py_func.__name__}"
Is this where we'll need to make the name more unique if there are two transform iterators in the same context in the future? Maybe add a TODO?
Yup - added a TODO
Force-pushed from 3d7a1fc to 9c66451 (Compare)
🟩 CI finished in 25m 50s: Pass: 100%/3 | Total: 35m 00s | Avg: 11m 40s | Max: 25m 50s
Looks great to me!
Description

Closes #3064

This PR primarily simplifies the implementation of TransformIterator to look more like the implementations of the other iterator types. The major change is using numba to compile the unary ("transform") operation on the Python side, rather than passing the LTOIR to C++ and compiling it there.

In addition, this PR introduces an IteratorBase class that encapsulates much of the common logic across all the iterator types. Each iterator subclass now simply needs to initialize the base and define the iterator-specific advance and dereference methods.

In terms of performance, there's a slight improvement (I believe primarily from some additional caching):
Benchmark
Before this PR
After this PR
Benchmarking script
Checklist