make extract_gradient[_chunk]! GPU compatible #619
base: master
Conversation
Codecov Report: Base 86.83% // Head 89.74% // increases project coverage by +2.90%.

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master     #619     +/-  ##
=========================================
+ Coverage   86.83%   89.74%   +2.90%
=========================================
  Files           9       10       +1
  Lines         889      965      +76
=========================================
+ Hits          772      866      +94
+ Misses        117       99      -18
```

☔ View full report at Codecov.
This seems harmless. Maybe it would be worth adding some tests, using JLArrays so that they can run on GitHub?

Added a barebones set of JLArray tests.
Force-pushed from 0f0d0d9 to 9dc9769.
OK, found the magic code which gives no allocation or speed regressions vs. current master for Arrays of any dimensions and which works with CuArrays. It's this:

```julia
map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
    @inbounds partials(T, dual, i)
end
```

The trick is indeed, as @mcabbott suggested, to use a `ReshapedArray`, which unlike `reshape` does not allocate.

Squashed into a single commit. Ready on my end.
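As an aside, the difference can be seen in isolation with a small sketch (hypothetical function names `f_reshape`/`f_wrapper`; allocation behavior per JuliaLang/julia#36313, and exact numbers will vary by Julia version):

```julia
using BenchmarkTools

# `reshape(::Array)` goes through a runtime call the compiler cannot
# elide, so the reshaped wrapper shows up as an allocation...
f_reshape(A) = sum(view(reshape(A, length(A)), 2:5))

# ...whereas `Base.ReshapedArray` is a plain immutable struct that the
# compiler can keep off the heap entirely
f_wrapper(A) = sum(view(Base.ReshapedArray(A, (length(A),), ()), 2:5))

A = rand(100, 2)
@btime f_reshape($A)  # a couple of small allocations for the wrapper
@btime f_wrapper($A)  # 0 allocations
```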
I don't want this to get lost after the holidays; could it be looked at again?
To me this still seems mainly a CUDA issue. Changing the definitions to `copyto!` operating on `Tuple`s seems the most direct and natural implementation to me, and one that I would expect to work for any array, whereas the code in this PR seems a bit non-intuitive and exists solely because of CUDA. So I would suggest:

- using `copyto!` with `Tuple`s (as suggested in more detail above; a rough sketch follows)
- adding definitions of `copyto!` with `Tuple` to CUDA
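For concreteness, a sketch of what the suggested `copyto!`-based chunk extraction might look like (hypothetical helper name; the five-argument `copyto!` call form is the one quoted later in this thread, and `partials(T, dual, i)` is the existing ForwardDiff accessor):

```julia
# Build the chunk of partials as a Tuple, then let `copyto!` write it
# into `result` starting at `index`; GPU support would then hinge on
# CUDA providing a `copyto!` method for Tuple sources.
function extract_gradient_chunk_copyto!(::Type{T}, result, dual, index, chunksize) where {T}
    copyto!(result, index, ntuple(i -> partials(T, dual, i), chunksize), 1, chunksize)
    return result
end
```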
That seems like a stretch to me with CUDA, tbh; these things aren't specified perfectly. In the meantime, this PR has no regressions but enables the GPU, so can it still be merged? We can revisit later if we get the CUDA interface.
Here's an issue with the `copyto!` approach: `copyto!(result, index, ntuple(i->0, Val(chunksize)), 1, chunksize)` is not inferred, although the Julia compiler must work some magic, because it's allocation-free. But it's 2X slower than my solution in this PR. So, absent a fix for that, it seems that even if we get `copyto!` with `Tuple`s into CUDA, it wouldn't match this PR's performance.
I should also note one nice thing about the `map!` approach here; but in any case, I can't figure out how to do it with only `copyto!`.
No, not that I know of. But couldn't you use `reshape`?

I still think that it would be nice if we didn't have to use slightly unorthodox implementations just to add implicit support for another package. To mention the obvious: on Julia >= 1.9, yet another alternative would be to add e.g. GPUArraysCore (I guess that might be sufficient?) as a weak dependency and add a special method for GPU arrays there. The disadvantage is that it would not fix Julia < 1.9.
No, that's allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.
Yeah, I'm sympathetic to that, also because of the by-value thing. But ultimately, although the code looks a little weird, in the case of `Array` it's purely relying on the get/set-index interface, and in the case of `CuArray`s, on the `map!` machinery that GPU arrays already provide.
Oh, with "unorthodox" I was referring rather to the fact that the PR currently removes a simple for-loop in favor of the `ReshapedArray`/`map!` construction.
No, I get that. All I'm saying is: it's unorthodox in terms of the usage of `ReshapedArray`, sure, but in terms of the underlying native code, for `Array`s it's compiled to nearly identical code as before this PR, since both this and the old for-loop ultimately end up at the same scalar `getindex`/`setindex!` calls.
We could use `fill!`:

```julia
julia> function h2(::Type{T}, result, x, index, chunksize) where {T}
           fill!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), zero(x))
           return result
       end
h2 (generic function with 1 method)

julia> @btime h2(Float64, $(rand(100, 2)), 0f0, 20, 15);
  2.570 ns (0 allocations: 0 bytes)

julia> function h3(::Type{T}, result, x, index, chunksize) where {T}
           offset = index - 1
           for i in 1:chunksize
               result[i + offset] = zero(x)
           end
           return result
       end
h3 (generic function with 1 method)

julia> @btime h3(Float64, $(rand(100, 2)), 0f0, 20, 15);
  5.338 ns (0 allocations: 0 bytes)
```
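(As an aside, one way to check the earlier "nearly identical native code" claim is to compare the generated code for the two variants directly; `InteractiveUtils` is loaded by default in the REPL, and the `h2`/`h3` names are the benchmark functions just above:)

```julia
using InteractiveUtils

# compare the LLVM IR of the fill!/view variant and the plain for-loop
@code_llvm debuginfo=:none h2(Float64, rand(100, 2), 0f0, 20, 15)
@code_llvm debuginfo=:none h3(Float64, rand(100, 2), 0f0, 20, 15)
```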
Sorry, I'm confused: if you're OK with `ReshapedArray` there, what's the issue with the current PR?
Well, I still think that replacing the for-loop with `map!` is what introduces the cost here; the `map!` analogue of `h2` is noticeably slower:

```julia
julia> function h4(::Type{T}, result, x, index, chunksize) where {T}
           map!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), 1:chunksize) do _
               return zero(x)
           end
           return result
       end
h4 (generic function with 1 method)

julia> @btime h4(Float64, $(rand(100, 2)), 0f0, 20, 15);
  10.412 ns (0 allocations: 0 bytes)
```

(I'm not too surprised that it's easier for the compiler to optimize the `fill!` call.)
So it seems the PR actually introduces a regression in its current form?
I had benchmarked some full gradient calls, and there I saw essentially no difference (presumably extracting the gradient is not a dominant part of the computation except for trivial functions). But you're right, it does look like the `map!` version is slower in these microbenchmarks.

So at the moment I guess it boils down to a small speed regression in a microbenchmark versus gaining GPU compatibility.
OK, so what do you think about adding an implementation of `extract_gradient_chunk!(::Type{T}, result, dual::Real, index, chunksize)` using `fill!` with `ReshapedArray`, which seems to be fast? The PR already uses `ReshapedArray` anyway, so if we go with it, we might as well avoid the regression and use `ReshapedArray` once more. (A sketch of what that could look like is below.)

And afterwards maybe check that there are really no regressions left?
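A minimal sketch of that suggestion (assuming the surrounding ForwardDiff internals; a `dual::Real` carries no partials, so the chunk is simply zero-filled, exactly as in the `h2` benchmark above):

```julia
function extract_gradient_chunk!(::Type{T}, result, dual::Real, index, chunksize) where {T}
    # zero-fill result[index:index+chunksize-1] through a non-allocating
    # ReshapedArray view; fill! is both GPU-friendly and fast for Arrays
    fill!(view(Base.ReshapedArray(result, (length(result),), ()),
               index:(index + chunksize - 1)),
          zero(dual))
    return result
end
```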
The diff under discussion:

```diff
-    offset = index - 1
-    for i in 1:chunksize
-        result[i + offset] = partials(T, dual, i)
+    map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
```
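(Reassembled from the diff fragments, the new chunk method presumably reads in full as follows; the signature is assumed to match the old one:)

```julia
function extract_gradient_chunk!(::Type{T}, result, dual, index, chunksize) where {T}
    # write partials i = 1:chunksize into result[index:index+chunksize-1]
    map!(view(Base.ReshapedArray(result, (length(result),), ()),
              index:index+chunksize-1),
         1:chunksize) do i
        @inbounds partials(T, dual, i)
    end
    return result
end
```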
I was looking around a bit and also browsing old Julia issues; maybe we just have to use `ReshapedArray` for optimal performance, even though it's unexported.

In any case, I think it would be good to add some explanations for why the function is implemented in this way:

```diff
- map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
+ # `reshape` without allocations: https://github.com/JuliaLang/julia/issues/36313
+ out = view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1))
+ # use `map!` instead of a for-loop for GPU compatibility: #619
+ map!(out, 1:chunksize) do i
```
```diff
-    for i in 1:chunksize
-        result[i + offset] = partials(T, dual, i)
+    map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
+        @inbounds partials(T, dual, i)
```
Is this really safe? The current implementation does not use `@inbounds`:

```diff
- @inbounds partials(T, dual, i)
+ return partials(T, dual, i)
```
```julia
extract_gradient!(::Type{T}, result::AbstractArray, dual::Dual) where {T} =
    extract_gradient_chunk!(T, result, dual, 1, npartials(dual))
```
I'm still somewhat unhappy about this change; intuitively, it feels like it should be easier for Base and packages to optimize `copyto!(::AbstractArray, ::Tuple)` than the chunk-based version. But if benchmarks don't show any regression, I guess it seems OK, and we don't have to wait for CUDA to add the implementation.
@marius311 is this still being worked on?
Was this not merged earlier due to a regression on presently working cases? If so, then we can move this implementation to an extension package.
My summary, from what I remember, is what I said above.

I unfortunately ran out of time to dedicate to responding to the requests here, but anyone is absolutely welcome to finish it off, or to close it and do the extension-package solution.
Would that be a workable solution in your opinion, @devmotion?
@mcabbott what do you think about moving this change to a GPUArrays ext?
Fixes this:

My ad hoc tests suggest no performance impact on 1.8.3 for `result::Vector`; however, there is an extra `(2 allocations: 80 bytes)` for `result::Matrix`, I think due to a reshape/view thing. I'm not sure if there's some other form of the code that works without scalar indexing and without this allocation, but I'm open to suggestions. Another alternative is to write some glue code here or in CUDA.jl that handles `CuArray` specifically, but that seems more brittle.

I'd vote that this solution is OK (pending other feedback) for the benefit of CUDA compatibility with generic code, but that's just IMO.
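For reference, a minimal smoke test of the kind the added JLArray tests exercise (a sketch assuming this PR's changes are in place; `jl` converts to a `JLArray`, and `allowscalar(false)` from GPUArraysCore makes any scalar indexing throw instead of silently falling back):

```julia
using ForwardDiff, JLArrays
using GPUArraysCore: allowscalar
allowscalar(false)  # fail loudly on scalar indexing, as on a real GPU

x = jl(rand(Float32, 16))
# with this PR, gradient extraction goes through map!/fill! instead of a
# scalar for-loop, so this should run without scalar-indexing errors
g = ForwardDiff.gradient(v -> sum(abs2, v), x)
```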