[WIP] Support fancy iterators in cuda.parallel (#2788)
* New python/cuda_parallel/cuda/parallel/experimental/random_access_iterators.py (with unit tests).
* Use TransformRAI to implement constant, counting, arbitrary RAIs.
* Transfer `num_item`-related changes from cuda_parallel_itertools branch. The other branch is: https://github.com/rwgk/cccl/tree/cuda_parallel_itertools
* Rename `op` to `reduction_op` in `cccl_device_reduce_build()`
* Transfer test_device_sum_repeat_1_equals_num_items() from cuda_parallel_itertools branch. The other branch is: https://github.com/rwgk/cccl/tree/cuda_parallel_itertools
* Add `class _TransformRAIUnaryOp`. Make `constant_op()` less trivial (to see if numba can still compile it).
* Rename `class _Op` to `_ReductionOp` for clarity.
* WIP: Use TransformRAIUnaryOp here.
* INCOMPLETE SNAPSHOT
* This links the input_unary_op() successfully into the nvrtc cubin, but then fails with: Fatal Python error: Floating point exception
* Passing `_type_to_info_from_numba_type(numba.int32)` as `value_type` resolves the Floating point exception (but the `cccl_device_reduce()` call still does not succeed).
* More debug output:
  ```
  LOOOK single_tile_kernel CALL /home/coder/cccl/c/parallel/src/reduce.cu:116
  LOOOK EXCEPTION CUDA error: invalid argument /home/coder/cccl/c/parallel/src/reduce.cu:703
  ```
* Substituting `fake_in_ptr` if `in_ptr == nullptr`: All tests pass.
* Rename new test function to `test_device_sum_input_unary_op()` and parametrize: `use_numpy_array`: `[True, False]`, `input_generator`: `["constant", "counting", "arbitrary", "nested"]`
* Remove python/cuda_parallel/cuda/parallel/experimental/random_access_iterators.py (because numba.cuda cannot JIT classes).
* Add `"nested_global"` test, but disable it.
* `cu_repeat()`, `cu_count()`, `cu_arbitrary()` functions that return a `unary_op`, which is then compiled with `numba.cuda.compile()` (see the sketch after this list).
* Snapshot DOES NOT WORK, from 2024-10-20+1646
* Files copy-pasted from 2479 comment, then: notabs.py, kernel.cpp, main.py
* Commands to build and run the POC.
* `RawPointer(..., dtype)`
* `assert_expected_output_array()`, `more_nested_map`
* Add `@register_jitable` to `cu_repeat()`, ..., `cu_map()`: this fixes the `"map_mul2"` test, and the added `"map_add10_map_mul2"` test works, too.
* Restore original c/parallel/src/reduce.cu (to avoid `git merge main` conflicts).
* Transfer a few minor `printf("\nLOOOK ...);` calls from the python_random_access_iterators branch.
* clang-format auto-fixes
* Change `ADVANCE(this, diff)` to `ADVANCE(data, diff)`
* Remove `input_unary_op`-related code.
* Plug Georgii's POC ConstantIterator, CountingIterator into cuda_parallel test_reduce.py
* Change `map(it, op)` to `cu_map(op, it)`, to be compatible with the built-in `map(func, *iterables)`.
* Plug Georgii's POC map code into cuda_parallel test_reduce.py; test FAILS:
  ```
  ptxas fatal   : Unresolved extern function 'transform_count_int32_mul2_advance'
  ERROR NVJITLINK_ERROR_PTX_COMPILE: JIT the PTX (ltoPtx)
  LOOOK EXCEPTION nvJitLink error: 4 /home/coder/cccl/c/parallel/src/reduce.cu:339
  =========================== short test summary info ============================
  FAILED tests/test_reduce.py::test_device_sum_sentinel_iterator[map_mul2-False] - ValueError: Error building reduce
  ======================= 1 failed, 11 deselected in 1.33s =======================
  ```
* Add `cccl_string_views* ltoirs` to `cccl_iterator_t`
* Populate `cccl_iterator_t::ltoirs` in `_itertools_iter_as_cccl_iter()`
* Add `nvrtc_sm_top_level::add_link_list()`
* Fix `_extract_ctypes_ltoirs()` implementation.
* Add `extract_ltoirs(const cccl_iterator_t&)` implementation. All tests pass, including the `cu_map()` test.
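For context on the `cu_repeat()` / `cu_count()` / `cu_arbitrary()` commits above, a minimal sketch of the capture-and-compile pattern they describe. The `output="ltoir"` option is an assumption about the numba version in use, and everything besides the `cu_repeat`/`unary_op` names is illustrative, not the actual cuda.parallel code:

```python
import numba
from numba import types


def cu_repeat(value):
    # Returns a unary_op; `value` is a Python capture baked in at
    # definition time, so each call site gets its own specialization.
    def unary_op(_):
        return value

    return unary_op


# Compile the returned unary_op for device-side linking.
ltoir, return_type = numba.cuda.compile(
    cu_repeat(1), sig=types.int32(types.int32), output="ltoir"
)
```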
* Plug Georgii's POC RawPointer, CacheModifiedPointer into cuda_parallel test_reduce.py
* Copy all sentinel iterators wholesale from georgii_poc_2479/pocenv/main.py to new python/cuda_parallel/cuda/parallel/experimental/sentinel_iterators.py
* Remove all sentinel iterator implementations from test_reduce.py and use cuda.parallel.experimental.sentinel_iterators instead.
* Cleanup
* Change `DEREF(this)` to `DEREF(data)`
* Make RawPointer, CacheModifiedPointer numba.cuda.compile() `sig` expressions more readable and fix `int32` vs `uint64` mixup for `advance` methods.
* Add `cccl_iterator_t::advance_multiply_diff_with_sizeof_value_t` feature.
* Make discovery mechanism for cuda/_include directory compatible with `pip install --editable`
* Add pytest-xdist to test requirements.
* Revert "Add `cccl_iterator_t::advance_multiply_diff_with_sizeof_value_t` feature." This reverts commit eaee196.
* pointer_advance_sizeof(), cache_advance_sizeof(): Use Python capture to achieve the `distance * sizeof_dtype` calculation (see the sketch after this list).
* Rename sentinel_iterators.py -> iterators.py
* ConstantIterator(..., dtype)
* Change `dtype`, `numba_type` variable names to `ntype`.
* CountingIterator(..., ntype)
* Add `ntype.name` to `self.prefix` for `RawPointer`, `CacheModifiedPointer`. Introduce `_ncc()` helper function.
* Trivial: Shuffle order of tests to match code in iterators.py
* ruff format iterators.py (NO manual changes).
* Add "raw_pointer_int16" test.
* Generalize `ldcs()` `@intrinsic` and add "streamed_input_int16" test.
* Expand tests for raw_pointer, streamed_input: (int, uint) x (16, 32, 64)
* _numba_type_as_ir_type() experiment: does not make a difference. Compared to the `int32` case, this line is inserted:
  ```
  cvt.u64.u32 %rd3, %r1;
  ```
  This is the generated ptx code (including that line) and the error message:
  ```
  // .globl cacheint64_dereference
  .common .global .align 8 .u64 _ZN08NumbaEnv4cuda8parallel12experimental9iterators20CacheModifiedPointer32cache_cache_dereference_bitwidth12_3clocals_3e17cache_dereferenceB3v48B96cw51cXTLSUwv1sCUt9Uw1VEw0NRRQPKiLTj0gIGIFp_2b2oLQFEYYkHSQB1OQAk0Bynm21OizQ1K0UoIGvDpQE8oxrNQE_3dE11int64_2a_2a;

  .visible .func (.param .b64 func_retval0) cacheint64_dereference(
      .param .b64 cacheint64_dereference_param_0
  )
  {
      .reg .b32 %r<2>;
      .reg .b64 %rd<4>;

      ld.param.u64 %rd2, [cacheint64_dereference_param_0];
      ld.u64 %rd1, [%rd2];
      // begin inline asm
      ld.global.cs.s64 %r1, [%rd1];
      // end inline asm
      cvt.u64.u32 %rd3, %r1;
      st.param.b64 [func_retval0+0], %rd3;
      ret;
  }

  ptxas application ptx input, line 6672; error   : Arguments mismatch for instruction 'ld'
  ptxas fatal   : Ptx assembly aborted due to errors
  ERROR NVJITLINK_ERROR_PTX_COMPILE: JIT the PTX (ltoPtx)
  ```
* Revert "_numba_type_as_ir_type() experiment: does not make a difference". This reverts commit c32e7ed.
* Plug in numba pointer arithmetic code provided by @gmarkall (with a small bug fix). @gmarkall's POC code: #2861. The small bug fix:
  ```diff
   def sizeof_pointee(context, ptr):
  -    size = context.get_abi_sizeof(ptr.type)
  +    size = context.get_abi_sizeof(ptr.type.pointee)
       return ir.Constant(ir.IntType(64), size)
  ```
* WIP
* Fix streamed_input_int64, streamed_input_uint64 issue using suggestion provided by @gevtushenko: #2788 (comment)
* Simplified and more readable `ldcs()` implementation.
* Add raw_pointer_floatXX, streamed_input_floatXX tests. Those run, but the reduction results are incorrect. Needs debugging.
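The `pointer_advance_sizeof()` / `cache_advance_sizeof()` commit above relies on Python capture; here is a minimal host-side sketch of that idea. The names and the byte-address model are illustrative, not the actual iterator code:

```python
import numpy as np


def make_advance_op(dtype):
    # The item size is captured here, so the advance function itself
    # only needs the pointer and the distance.
    sizeof_dtype = np.dtype(dtype).itemsize

    def advance(ptr, distance):
        # Model the pointer as a byte address: scale the element
        # distance by the captured item size.
        return ptr + distance * sizeof_dtype

    return advance


advance_int32 = make_advance_op(np.int32)
assert advance_int32(1000, 10) == 1040  # 10 elements * 4 bytes each
```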
* Use `numpy.dtype(intty)` to obtain an actual dtype object (rather than a numpy scalar type); see the example after this list.
* Fix `_itertools_iter_as_cccl_iter()` helper function to use `d_in.ntype` instead of hard-wired `int32`. This fixes the raw_pointer_floatXX, streamed_input_floatXX tests.
* Add raw_pointer_float16, streamed_input_float16 tests. Needs a change in iterators.py; not sure why, or if there is a better solution.
* Add constant_XXX tests. constant_float16 does not work because ctypes does not support float16.
* Add counting_XXX tests. counting_float32,64 do not work; needs debugging.
* Change cu_map() make_advance_codegen() distty to uint64, for consistency with all other iterator types.
* Replace all cu_map() int32 with either it.ntype or op_return_ntype.
* Fix ntype vs types.uint64 mixup for distance in ConstantIterator, CountingIterator. This fixes the counting_float32, counting_float64 tests.
* Remove float16 code and tests. Not needed at this stage.
* Fix trivial (but slightly confusing) typo.
* Remove unused host_address(self) methods.
* Introduce _DEVICE_POINTER_SIZE, _DEVICE_POINTER_BITWIDTH, _DISTANCE_NUMBA_TYPE, _DISTANCE_IR_TYPE. Resolve most TODOs.
* Generalize map_mul2 test as map_mul2_int32_int32, but still only test with int32.
* Add more map_mul2 tests that all pass with the production code as-is. map_mul2_float32_int32 does not pass (commented out in this commit).
* Make cu_map() code more readable by introducing _CHAR_PTR_NUMBA_TYPE, _CHAR_PTR_IR_TYPE.
* Archiving debugging code including bug fix in make_dereference_codegen()
* Revert "Archiving debugging code including bug fix in make_dereference_codegen()". This reverts commit b525b80.
* FIX, but does not work.
* Add more map_mul2 tests. For unknown reasons these all pass when run separately, or when running all tests in parallel (pytest -v -n 32).
* Add type information to cu_map() op abi_name.
* Simplify() helper function, reuse for compiling cu_map() op.
* Change map_mul2_xxx test names to map_mul2_count_xxx.
* Add map_mul3_map_mul2_count_int32_int32_int32, map_mul2_map_mul2_count_float64_float32_int16 tests. map_mul* tests are flaky; TBD why.
* Introduce op_caller() in cu_map(): this solves the test failures (TBH, I cannot explain why/how).
* Move `num_items` argument to just before `h_init`, for compatibility with the cub `DeviceReduce` API.
* Move iterators.py -> _iterators.py, add new iterators.py as public interface.
* Pull repeat(), count() out from _iterators.py into iterators.py; add map() forwarding function (to renamed cumap()).
* Move `class TransformIterator` out to module scope. This makes it more clear where Python captures are used. It might also be a minor performance improvement because the `class` definition code is only processed once.
* Replace the very awkward _Reduce.__handle_d_in() method with the new _d_in_as_cccl_iter() function.
* Move _iterators.cache() to iterators.cache_load_modifier().
* Fix minor oversights in docstrings.
* Enable passing a device array as `it` to `map()` (testing with a cupy array).
* Rename `l_input` -> `l_varr` (in test_reduce.py).
* Rename `ldcs()` -> `load_cs()` and add a reference to cub `CacheModifiedInputIterator` `LOAD_CS`.
* Move `*_advance` and `*_dereference` functions out of class scopes, to make it obvious that they are never used as Python methods, but exclusively as source for `numba.cuda.compile()`.
* Turn state_c_void_p, size, alignment methods into properties.
* Improved organization of newly added tests. NO functional changes. Applied ruff format to newly added code.
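Regarding the `numpy.dtype(intty)` commit at the top of this list: `np.int32` and friends are scalar type classes, not dtype instances, which is why the explicit wrapping is needed. A quick illustration:

```python
import numpy as np

# np.int32 is a scalar type class, not a dtype instance ...
assert not isinstance(np.int32, np.dtype)

# ... wrapping it in numpy.dtype() yields an actual dtype object,
# with attributes like itemsize and name.
dt = np.dtype(np.int32)
assert isinstance(dt, np.dtype)
assert dt.itemsize == 4 and dt.name == "int32"
```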
* Add comments pointing to issue #3064.
* Change `ntype` to `value_type` in public APIs. Introduce `numba_type_from_any(value_type)` function.
* Change names of functions in the public API.
* Effectively undo commit 708c341: Move `*_advance` and `*_dereference` functions back to class scope, with comments explicitly stating that these are not actual methods.
* Use `@pytest.fixture` to replace most `@pytest.mark.parametrize`, as suggested by @shwina (see the sketch after this list).
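A minimal sketch of the fixture-based parametrization mentioned in the last commit; a `params=` fixture replaces repeating `@pytest.mark.parametrize` on every test. The `dtype`/`test_sum` names are illustrative, not the actual test_reduce.py code:

```python
import numpy as np
import pytest


# One parametrized fixture instead of a @pytest.mark.parametrize
# decorator on each individual test function.
@pytest.fixture(params=[np.int32, np.int64, np.float32, np.float64])
def dtype(request):
    return request.param


def test_sum(dtype):
    # Runs once per fixture param, just like parametrize would.
    a = np.arange(10, dtype=dtype)
    assert a.sum() == 45
```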