Add checkbounds for gather and support empty source array #51
|
@mcabbott would you please review this? |
Update src/gather.jl Co-authored-by: Peter <adgjl5645@hotmail.com>
chengchingwen
I don't think I have the permission, so you would need someone else to review
|
@CarloLucibello would you please review this? |
|
We can make the bounds checking much faster. In the simplest case with integer indexes:
a = axes(src, ndims(src))
checkindex(Bool, a, collect(extrema(idx))) |
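The suggestion above can be sketched on plain CPU arrays as follows. This is only an illustration under stated assumptions: `inbounds_by_extrema` is a hypothetical helper name, not part of NNlib, and it covers only integer indices into the last dimension of `src`.

```julia
# Hypothetical helper (not NNlib API): bounds check through the index extrema.
function inbounds_by_extrema(src::AbstractArray, idx::AbstractArray{<:Integer})
    isempty(idx) && return true          # nothing to check
    a = axes(src, ndims(src))            # valid range of the last dimension
    lo, hi = extrema(idx)                # a single reduction over idx
    return checkindex(Bool, a, lo:hi)    # true iff both lo and hi lie in a
end

src = [3, 4, 5, 6, 7]
idx = [1 2 3 4;
       4 2 1 3;
       3 5 5 3]

ok  = inbounds_by_extrema(src, idx)      # every index lies in 1:5
bad = inbounds_by_extrema(src, [0, 2])   # 0 is outside 1:5
```

Checking only the smallest and largest index is enough here because the valid indices form a contiguous range.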
|
I just tried it. It seems that |
|
I think there won't be much difference in speed since they are all calling the |
|
We can check just the |
|
I think @chengchingwen is right. They all pass through the same GPU kernel, so computing over an array costs about the same time as computing a single value. Since
using CUDA
using BenchmarkTools
T = Float32
CT = CuArray{Float32}
src = CT([3, 4, 5, 6, 7])
idx = cu([1 2 3 4;
          4 2 1 3;
          3 5 5 3])
function checkbounds_src(src, dims::Union{Int, Val}, ::Type{<:Any})
    return i -> checkbounds(Bool, src, ntuple(x -> Colon(), dims)..., i...)
end
function checkbounds_src(src, dims::Union{Int, Val}, ::Type{<:CartesianIndex})
    return i -> checkbounds(Bool, src, ntuple(x -> Colon(), dims)..., i)
end
function checkbounds1(src, idx, dims)
    return map(checkbounds_src(src, Val(dims), eltype(idx)), idx)
end
function checkbounds2(src, idx, dims)
    a = axes(src, ndims(src))
    return checkindex(Bool, a, minimum(idx):maximum(idx))
end
checkbounds1(src, idx, 1)
checkbounds2(src, idx, 1)
julia> @benchmark CUDA.@sync checkbounds1($src, $idx, 1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 10.406 μs … 73.799 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.080 μs ┊ GC (median): 0.00%
Time (mean ± σ): 12.074 μs ± 3.770 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▆▅▆▅▄▃▂▁ ▂
█████████████▇▇▇▇▆▇▆▅▄▄▄▄▄▃▄▄▂▃▅▄▄▂▄▄▆▅▄▅▅▄▅▅▄▄▄▅▂▄▄▅▅▄▅▄▃▃ █
10.4 μs Histogram: log(frequency) by time 32.7 μs <
Memory estimate: 3.12 KiB, allocs estimate: 56.
julia> @benchmark CUDA.@sync checkbounds2($src, $idx, 1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 31.644 μs … 39.208 ms ┊ GC (min … max): 0.00% … 44.45%
Time (median): 33.708 μs ┊ GC (median): 0.00%
Time (mean ± σ): 40.350 μs ± 391.981 μs ┊ GC (mean ± σ): 4.32% ± 0.44%
▅██▇▆▄▂▁ ▁▁▁▁ ▂
█████████████▇█▇▇▇█████▇▇▇▇▇▇▆▅▆▅▆▅▇▆▆▆▆▇▆▆▆▆▅▅▅▄▄▅▅▄▄▅▄▅▅▅▄ █
31.6 μs Histogram: log(frequency) by time 79.8 μs <
Memory estimate: 4.88 KiB, allocs estimate: 88.
The original approach ( |
|
Any updates? |
|
Waiting for review. |
|
@yuehhua can you benchmark this PR vs master for a few input sizes? Just to make sure that bounds checking doesn't take more time than the real computation |
|
Benchmark code: This PR: master branch: It seems to be around 5 times slower than the master branch. |
|
Mhmh. That is a very small size, though. Can you check with 10x or 100x bigger arrays? |
|
PR: master branch: |
|
It seems to take too much GC time. Maybe the closure causes this. |
|
is |
|
# check bounds
in_bnd = map(checkbounds_src(src, Val(dims), eltype(idx)), idx)
if !all(in_bnd)
This line accounts for a large part of the slowdown
|
I have done some tests; the closure here is fine. It won't help to use |
|
The closure is resolved and benchmarked below: @MilkshakeForReal It is still 4 times slower. So we could come up with a more efficient CUDA kernel, find another efficient way to check bounds, or even skip bounds checking. |
|
For the last version, commenting out |
|
Drop |
|
@MilkshakeForReal For the last version, |
|
MRE:
function NNlib.gather!(dst::AnyCuArray, src::AnyCuArray, idx::AnyCuArray)
    # check dims
    dims = gather_check_dims(src, dst, idx)
    dims_size = size(src)[1:dims]
    max_dims_idx = prod(dims_size)
    max_idx = max_dims_idx * length(idx)
    # check bounds
    idx_bounds = size(src, ndims(src))#[dims+1:end]
    in_bnd = map(i -> i <= idx_bounds, idx)
    isempty(src) && return dst
    # cuda kernel
    args = dst, src, idx, max_idx, max_dims_idx, dims_size
    kernel = @cuda launch=false gather_kernel!(args...)
    config = launch_configuration(kernel.fun; max_threads=256)
    threads = min(max_idx, config.threads)
    blocks = cld(max_idx, threads)
    kernel(args...; threads=threads, blocks=blocks)
    return dst
end
julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 18.900 μs … 2.720 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 21.700 μs ┊ GC (median): 0.00%
Time (mean ± σ): 26.564 μs ± 71.784 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆██▇▆▅▄▃▃▃▂▂▁▁▁▁▁▁▂▁▁▁ ▂
███████████████████████████▇▇██▇▆▇▇▇▇▆▆▇▇▆▆▆▆▅▅▆▄▄▄▄▅▃▅▂▃▄▃ █
18.9 μs Histogram: log(frequency) by time 60.7 μs <
Memory estimate: 1.33 KiB, allocs estimate: 31.
With
julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 55.200 μs … 2.510 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 69.300 μs ┊ GC (median): 0.00%
Time (mean ± σ): 79.856 μs ± 102.221 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂█▇▃▂▁
▁▃██████▆▅▆▄▅▇▇▆▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
55.2 μs Histogram: frequency by time 174 μs <
Memory estimate: 3.72 KiB, allocs estimate: 72.
Without any bounds checking: |
I believe you can achieve an almost identical speedup by writing
in_bnd = mapreduce(checkbounds_src(src, Val(dims), eltype(idx)), &, idx)
in the last version. The speedup does not mainly come from resolving the closure.
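As a rough CPU-only illustration of the difference being discussed: `map` followed by `all` first materializes a `Bool` array and then reduces it, while `mapreduce` with `&` folds the per-index results into a single `Bool` in one pass. The predicate `in_range` below is a hypothetical stand-in for the `checkbounds_src` closure and only handles a vector source; it says nothing about GPU timings.

```julia
# Hypothetical per-index predicate for a vector source; it stands in for
# the checkbounds_src closure used in the PR.
in_range(src) = i -> checkindex(Bool, axes(src, 1), i)

src = Float32[3, 4, 5, 6, 7]
idx = [1 2 3 4;
       4 2 1 3;
       3 5 5 3]

# Two steps: build an intermediate Bool array, then reduce it.
ok_two_step = all(map(in_range(src), idx))

# One step: combine the predicate results with & into a single Bool.
ok_fused = mapreduce(in_range(src), &, idx)

# An out-of-range index (6) makes the fused check return false.
bad_fused = mapreduce(in_range(src), &, [1, 6])
```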
|
This is my implementation based on yours. It seems to work fine. I still don't know why we need to dispatch
function _checkbounds_indices(i::Tuple, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, i)
end
function _checkbounds_indices(i::CartesianIndex, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, Tuple(i))
end
function _checkbounds_indices(i::Int, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, (i,))
end
function NNlib.gather!(dst::AnyCuArray, src::AnyCuArray, idx::AnyCuArray)
    # check dims
    dims = gather_check_dims(src, dst, idx)
    dims_size = size(src)[1:dims]
    max_dims_idx = prod(dims_size)
    max_idx = max_dims_idx * length(idx)
    # check bounds
    idx_bounds = axes(src)[dims+1:end]
    in_bnd = mapreduce(Base.Fix2(_checkbounds_indices, idx_bounds), &, idx)
    if !in_bnd
        # whatever is here, we don't need to care about the speed when something is wrong.
    end
    isempty(src) && return dst
    # cuda kernel
    args = dst, src, idx, max_idx, max_dims_idx, dims_size
    kernel = @cuda launch=false gather_kernel!(args...)
    config = launch_configuration(kernel.fun; max_threads=256)
    threads = min(max_idx, config.threads)
    blocks = cld(max_idx, threads)
    kernel(args...; threads=threads, blocks=blocks)
    return dst
end |
The size of the improvement by removing |
We still have to raise a bounds error to users; otherwise there is no point in doing the bounds check. You could have other ways to replace |
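One way to reconcile the two points above (a cheap check on the happy path, but still an error for the user) might look like the sketch below. It is a CPU-only illustration under assumptions: `check_gather_bounds` is a hypothetical name, it handles only a vector source with integer indices, and it is not the code from this PR.

```julia
# Hypothetical helper: reduce to a single Bool first, and only look for the
# offending index once the check has already failed.
function check_gather_bounds(src::AbstractVector, idx::AbstractArray{<:Integer})
    a = axes(src, 1)
    inb = i -> checkindex(Bool, a, i)
    if !mapreduce(inb, &, idx; init=true)
        # Slow path: speed no longer matters because we are about to throw.
        bad = idx[findfirst(!inb, idx)]
        throw(BoundsError(src, bad))
    end
    return nothing
end

check_gather_bounds([3, 4, 5], [1 2; 3 1])    # passes silently

# An out-of-range index raises a BoundsError.
threw = try
    check_gather_bounds([3, 4, 5], [1, 4])
    false
catch err
    err isa BoundsError
end
```

The design choice here is that the extra `findfirst` pass runs only when an index is already known to be invalid, so it adds nothing to the cost of a successful call.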
|
It's just to demonstrate that we don't really need to avoid the closure in this case. I just tested the latest version and the performance was almost identical to mine, with or without the closure. Both of them use |
|
@MilkshakeForReal Avoiding the closure is meant to reduce GC time and memory allocation. |
|
Not much improvement for the latest proposal. |
|
Do you mean my proposal? There isn't any improvement. Just similar performance. |
|
@MilkshakeForReal We still need |
|
Where do we have this issue? In my code it's already reduced. The closure is also avoided if we don't want it.
I don't know much about GC time; I'm just speaking from my test results
|
Could we isolate the support for empty arrays and leave bounds checking to further discussion? |
Closes FluxML/NNlib.jl#416, FluxML/NNlib.jl#411