
[Internal Cleanup] Resolve "ATTENTION: NOT op_caller here!" vs "ATTENTION: op_caller here" asymmetry in Transform Iterator implementation. #3064

Closed
rwgk opened this issue Dec 5, 2024 · 1 comment · Fixed by #3118
Assignees

Comments

@rwgk
Contributor

rwgk commented Dec 5, 2024

I will update this issue description with source code references after PR #2788 is merged.

Current best guess: Eliminating @overload will sidestep the issue.

Relevant links provided by @gmarkall (in a meeting on 2024-11-27):

@rwgk rwgk self-assigned this Dec 5, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 5, 2024
rwgk added a commit to rwgk/cccl that referenced this issue Dec 5, 2024
@shwina
Contributor

shwina commented Dec 6, 2024

Current best guess: Eliminating @overload will sidestep the issue.

In general, using the extension API to define the typing and lowering for iterator types and their advance/deref functions will simplify things greatly. See for example https://github.com/gmarkall/numba-accelerated-udfs/blob/main/filigree/numba_extension.py.

gevtushenko pushed a commit that referenced this issue Dec 6, 2024
* New python/cuda_parallel/cuda/parallel/experimental/random_access_iterators.py (with unit tests).

* Use TransformRAI to implement constant, counting, arbitrary RAIs.

* Transfer `num_item`-related changes from cuda_parallel_itertools branch.

The other branch is: https://github.com/rwgk/cccl/tree/cuda_parallel_itertools

* Rename `op` to `reduction_op` in `cccl_device_reduce_build()`

* Transfer test_device_sum_repeat_1_equals_num_items() from cuda_parallel_itertools branch.

The other branch is: https://github.com/rwgk/cccl/tree/cuda_parallel_itertools

* Add `class _TransformRAIUnaryOp`. Make `constant_op()` less trivial (to see if numba can still compile it).

* Rename `class _Op` to `_ReductionOp` for clarity.

* WIP Use TransformRAIUnaryOp here

* INCOMPLETE SNAPSHOT

* This links the input_unary_op() successfully into the nvrtc cubin, but then fails with: Fatal Python error: Floating point exception

* passing `_type_to_info_from_numba_type(numba.int32)` as `value_type` resolves the Floating point exception (but the `cccl_device_reduce()` call still does not succeed)

* More debug output.

LOOOK single_tile_kernel CALL /home/coder/cccl/c/parallel/src/reduce.cu:116

LOOOK EXCEPTION CUDA error: invalid argument  /home/coder/cccl/c/parallel/src/reduce.cu:703

* Substituting `fake_in_ptr` if `in_ptr == nullptr`: All tests pass.

* Rename new test function to `test_device_sum_input_unary_op()` and parametrize: `use_numpy_array`: `[True, False]`, `input_generator`: `["constant", "counting", "arbitrary", "nested"]`

* Remove python/cuda_parallel/cuda/parallel/experimental/random_access_iterators.py (because numba.cuda cannot JIT classes).

* Add `"nested_global"` test, but disable.

* `cu_repeat()`, `cu_count()`, `cu_arbitrary()` functions that return a `unary_op`, which is then compiled with `numba.cuda.compile()`
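The factory pattern described in this commit can be sketched in plain Python. The names `cu_repeat()`, `cu_count()`, and `cu_arbitrary()` come from the commit message, but the bodies below are illustrative assumptions; in the actual PR the returned `unary_op` is handed to `numba.cuda.compile()` rather than called on the host.

```python
# Hedged sketch: factory functions that return a unary_op, as the commit
# describes. Bodies are assumptions; the real ops are device-compiled.

def cu_repeat(value):
    """Return a unary op that ignores its index and yields `value`."""
    def unary_op(_index):
        return value
    return unary_op

def cu_count(start):
    """Return a unary op mapping index -> start + index."""
    def unary_op(index):
        return start + index
    return unary_op

def cu_arbitrary(values):
    """Return a unary op that cycles through a fixed lookup sequence."""
    def unary_op(index):
        return values[index % len(values)]
    return unary_op
```

Because each factory closes over its arguments, the resulting `unary_op` has the fixed `(index) -> value` signature that a transform iterator needs.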

* Snapshot DOES NOT WORK, from 2024-10-20+1646

* Files copy-pasted from 2479 comment, then: notabs.py kernel.cpp main.py

* Commands to build and run the POC.

* `RawPointer(..., dtype)`

* `assert_expected_output_array()`, `more_nested_map`

* Add `@register_jitable` to `cu_repeat()`, ..., `cu_map()`: this fixes the `"map_mul2"` test and the added `"map_add10_map_mul2"` test works, too.

* Restore original c/parallel/src/reduce.cu (to avoid `git merge main` conflicts).

* Transfer a few minor `printf("\nLOOOK ...);` from python_random_access_iterators branch.

* clang-format auto-fixes

* Change `ADVANCE(this, diff)` to `ADVANCE(data, diff)`

* Remove `input_unary_op`-related code.

* Plug Georgii's POC ConstantIterator, CountingIterator into cuda_parallel test_reduce.py

* Change `map(it, op)` to `cu_map(op, it)`, to be compatible with the built-in `map(func, *iterables)`
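The argument-order change can be modeled on the host: `cu_map(op, it)` puts the callable first, mirroring the built-in `map(func, *iterables)`. The generator-based stand-in below is an assumption for illustration; in `cuda.parallel` the result is a device-side transforming iterator, not a host generator.

```python
# Hedged host-side model of the cu_map(op, it) calling convention.
def cu_map(op, it):
    # The real implementation wraps `it` in a device transform iterator;
    # a generator expression models the same semantics on the host.
    return (op(x) for x in it)

def mul2(x):
    return x * 2

result = list(cu_map(mul2, range(4)))  # same shape as map(mul2, range(4))
```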

* Plug Georgii's POC map code into cuda_parallel test_reduce.py; test FAILS:

ptxas fatal   : Unresolved extern function 'transform_count_int32_mul2_advance'
ERROR NVJITLINK_ERROR_PTX_COMPILE: JIT the PTX (ltoPtx)
LOOOK EXCEPTION nvJitLink error: 4  /home/coder/cccl/c/parallel/src/reduce.cu:339
=========================== short test summary info ============================
FAILED tests/test_reduce.py::test_device_sum_sentinel_iterator[map_mul2-False] - ValueError: Error building reduce
======================= 1 failed, 11 deselected in 1.33s =======================

* Add `cccl_string_views* ltoirs` to `cccl_iterator_t`

* Populate `cccl_iterator_t::ltoirs` in `_itertools_iter_as_cccl_iter()`

* Add `nvrtc_sm_top_level::add_link_list()`

* Fix `_extract_ctypes_ltoirs()` implementation

* Add `extract_ltoirs(const cccl_iterator_t&)` implementation. All tests pass, including the `cu_map()` test.

* Plug Georgii's POC RawPointer, CacheModifiedPointer into cuda_parallel test_reduce.py

* Copy all sentinel iterators wholesale from georgii_poc_2479/pocenv/main.py to new python/cuda_parallel/cuda/parallel/experimental/sentinel_iterators.py

* Remove all sentinel iterator implementations from test_reduce.py and use cuda.parallel.experimental.sentinel_iterators instead.

* Cleanup

* Change `DEREF(this)` to `DEREF(data)`

* Make RawPointer, CacheModifiedPointer `numba.cuda.compile()` `sig` expressions more readable and fix `int32` vs `uint64` mixup for `advance` methods.

* Add `cccl_iterator_t::advance_multiply_diff_with_sizeof_value_t` feature.

* Make discovery mechanism for cuda/_include directory compatible with `pip install --editable`

* Make discovery mechanism for cuda/_include directory compatible with `pip install --editable`

* Add pytest-xdist to test requirements.

* Revert "Add `cccl_iterator_t::advance_multiply_diff_with_sizeof_value_t` feature."

This reverts commit eaee196.

* pointer_advance_sizeof(), cache_advance_sizeof(): Use Python capture to achieve `distance * sizeof_dtype` calculation.
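The "Python capture" technique named in this commit can be shown with a plain closure: the element size is captured from the enclosing scope when the advance function is defined, so the compiled function bakes in the `distance * sizeof_dtype` scaling instead of carrying it as runtime state. The function names below are illustrative, not the exact PR identifiers; the real functions are compiled with `numba.cuda.compile()`.

```python
# Hedged sketch of the Python-capture pattern for pointer advance.
def make_pointer_advance(sizeof_dtype):
    def pointer_advance(ptr, distance):
        # sizeof_dtype is a captured constant, not a runtime argument,
        # so each dtype gets its own specialized advance function.
        return ptr + distance * sizeof_dtype
    return pointer_advance

advance_int32 = make_pointer_advance(4)  # int32 is 4 bytes
advance_int64 = make_pointer_advance(8)  # int64 is 8 bytes
```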

* Rename sentinel_iterators.py -> iterators.py

* ConstantIterator(..., dtype)

* Change `dtype`, `numba_type` variable names to `ntype`.

* CountingIterator(..., ntype)

* Add `ntype.name` to `self.prefix` for `RawPointer`, `CacheModifiedPointer`. Introduce `_ncc()` helper function.

* Trivial: Shuffle order of tests to match code in iterators.py

* ruff format iterators.py (NO manual changes).

* Add "raw_pointer_int16" test.

* Generalize `ldcs()` `@intrinsic` and add "streamed_input_int16" test.

* Expand tests for raw_pointer, streamed_input: (int, uint) x (16, 32, 64)

* _numba_type_as_ir_type() experiment: does not make a difference

Compared to the `int32` case, this line is inserted:

```
        cvt.u64.u32     %rd3, %r1;
```

This is the generated ptx code (including that line) and the error message:

```
        // .globl       cacheint64_dereference
.common .global .align 8 .u64 _ZN08NumbaEnv4cuda8parallel12experimental9iterators20CacheModifiedPointer32cache_cache_dereference_bitwidth12_3clocals_3e17cache_dereferenceB3v48B96cw51cXTLSUwv1sCUt9Uw1VEw0NRRQPKiLTj0gIGIFp_2b2oLQFEYYkHSQB1OQAk0Bynm21OizQ1K0UoIGvDpQE8oxrNQE_3dE11int64_2a_2a;

.visible .func  (.param .b64 func_retval0) cacheint64_dereference(
        .param .b64 cacheint64_dereference_param_0
)
{
        .reg .b32       %r<2>;
        .reg .b64       %rd<4>;

        ld.param.u64    %rd2, [cacheint64_dereference_param_0];
        ld.u64  %rd1, [%rd2];
        // begin inline asm
        ld.global.cs.s64 %r1, [%rd1];
        // end inline asm
        cvt.u64.u32     %rd3, %r1;
        st.param.b64    [func_retval0+0], %rd3;
        ret;

}
ptxas application ptx input, line 6672; error   : Arguments mismatch for instruction 'ld'
ptxas fatal   : Ptx assembly aborted due to errors
ERROR NVJITLINK_ERROR_PTX_COMPILE: JIT the PTX (ltoPtx)
```

* Revert "_numba_type_as_ir_type() experiment: does not make a difference"

This reverts commit c32e7ed.

* Plug in numba pointer arithmetic code provided by @gmarkall (with a small bug fix):

@gmarkall's POC code: #2861

The small bug fix:

```diff
 def sizeof_pointee(context, ptr):
-    size = context.get_abi_sizeof(ptr.type)
+    size = context.get_abi_sizeof(ptr.type.pointee)
     return ir.Constant(ir.IntType(64), size)
```
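The bug in the diff above is taking the ABI size of the pointer type itself rather than of the pointed-to type. The same distinction can be demonstrated with stdlib `ctypes` (this is only an analogue; the PR code operates on llvmlite IR types, where `ptr.type.pointee` plays the role of the pointed-to type):

```python
# ctypes analogue of the sizeof_pointee fix: pointer size vs pointee size.
import ctypes

int32_ptr = ctypes.POINTER(ctypes.c_int32)

size_of_pointer = ctypes.sizeof(int32_ptr)       # 8 on 64-bit platforms
size_of_pointee = ctypes.sizeof(ctypes.c_int32)  # 4: what advance math needs
```

Using the pointer's own size would scale every advance by 8 regardless of dtype, which is exactly the kind of silent stride error the one-line fix removes.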

* WIP

* Fix streamed_input_int64, streamed_input_uint64 issue using suggestion provided by @gevtushenko:

#2788 (comment)

* Simplified and more readable `ldcs()` implementation.

* Add raw_pointer_floatXX, streamed_input_floatXX tests. Those run but the reduction results are incorrect. Needs debugging.

* Use `numpy.dtype(intty)` to obtain an actual dtype object (rather than a numpy scalar object).

* Fix `_itertools_iter_as_cccl_iter()` helper function to use `d_in.ntype` instead of hard-wired `int32`. This fixes the raw_pointer_floatXX, streamed_input_floatXX tests.

* Add raw_pointer_float16, streamed_input_float16 tests. Needs a change in iterators.py, not sure why, or if there is a better solution.

* Add constant_XXX tests. constant_float16 does not work because ctypes does not support float16.

* Add counting_XXX tests. counting_float32,64 do not work, needs debugging.

* Change cu_map() make_advance_codegen() distty to uint64, for consistency with all other iterator types.

* Replace all cu_map() int32 with either it.ntype or op_return_ntype

* Fix ntype vs types.uint64 mixup for distance in ConstantIterator, CountingIterator. This fixes the counting_float32, counting_float64 tests.

* Remove float16 code and tests. Not needed at this stage.

* Fix trivial (but slightly confusing) typo.

* Remove unused host_address(self) methods.

* Introduce _DEVICE_POINTER_SIZE, _DEVICE_POINTER_BITWIDTH, _DISTANCE_NUMBA_TYPE, _DISTANCE_IR_TYPE. Resolve most TODOs.

* Generalize map_mul2 test as map_mul2_int32_int32, but still only test with int32.

* Add more map_mul2 tests that all pass with the production code as-is. map_mul2_float32_int32 does not pass (commented out in this commit).

* Make cu_map() code more readable by introducing _CHAR_PTR_NUMBA_TYPE, _CHAR_PTR_IR_TYPE

* Archiving debugging code including bug fix in make_dereference_codegen()

* Revert "Archiving debugging code including bug fix in make_dereference_codegen()"

This reverts commit b525b80.

* FIX but does not work

* Add more map_mul2 tests. For unknown reasons these all pass when run separately, or when running all tests in parallel (pytest -v -n 32).

* Add type information to cu_map() op abi_name.

* Simplify() helper function, reuse for compiling cu_map() op

* Change map_mul2_xxx test names to map_mul2_count_xxx

* Add map_mul3_map_mul2_count_int32_int32_int32, map_mul2_map_mul2_count_float64_float32_int16 tests. map_mul* tests are flaky; TBD why.

* Introduce op_caller() in cu_map(): this solves the test failures (TBH, I cannot explain why/how).

* Move `num_items` argument to just before `h_init`, for compatibility with cub `DeviceReduce` API.

* Move iterators.py -> _iterators.py, add new iterators.py as public interface.

* Pull repeat(), count() out from _iterators.py into iterators.py; add map() forwarding function (to renamed cumap()).

* Move `class TransformIterator` out to module scope. This makes it clearer where Python captures are used. It might also be a minor performance improvement because the `class` definition code is only processed once.
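The reasoning in this commit can be demonstrated directly: a class defined inside a function has its body re-executed, producing a brand-new type object on every call, while a module-scope class body runs exactly once at import time. The names below are illustrative, not the actual `TransformIterator` implementation.

```python
# Hedged illustration: function-local class bodies re-run per call.
def make_iterator_class():
    class TransformIterator:  # body executes on every make_iterator_class() call
        pass
    return TransformIterator

class ModuleScopeIterator:    # body executes exactly once, at import time
    pass

# Each call manufactures a distinct class object with the same name:
a, b = make_iterator_class(), make_iterator_class()
```

At module scope, any values the class (or its methods) closes over are also visible at the top of the file, which is what makes the Python captures easy to audit.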

* Replace the very awkward _Reduce.__handle_d_in() method with the new _d_in_as_cccl_iter() function.

* Move _iterators.cache() to iterators.cache_load_modifier()

* Fix minor oversights in docstrings.

* Enable passing device array as `it` to `map()` (testing with cupy array).

* Rename `l_input` -> `l_varr` (in test_reduce.py).

* Rename `ldcs()` -> `load_cs()` and add reference to cub `CacheModifiedInputIterator` `LOAD_CS`

* Move `*_advance` and `*_dereference` functions out of class scopes, to make it obvious that they are never used as Python methods, but exclusively as source for `numba.cuda.compile()`

* Turn state_c_void_p, size, alignment methods into properties.

* Improved organization of newly added tests. NO functional changes. Applied ruff format to newly added code.

* Add comments pointing to issue #3064

* Change `ntype` to `value_type` in public APIs. Introduce `numba_type_from_any(value_type)` function.

* Change names of function in public API.

* Effectively undo commit 708c341: Move `*_advance` and `*_dereference` functions back to class scope, with comments to explicitly state that these are not actual methods.

* Use `@pytest.fixture` to replace most `@pytest.mark.parametrize`, as suggested by @shwina
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 10, 2024
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Dec 10, 2024
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Dec 14, 2024