Fast path onehotbatch(::Vector{Int}, ::UnitRange)#27
Conversation
Codecov ReportBase: 95.96% // Head: 96.21% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #27 +/- ##
==========================================
+ Coverage 95.96% 96.21% +0.24%
==========================================
Files 3 4 +1
Lines 124 132 +8
==========================================
+ Hits 119 127 +8
Misses 5 5
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
| function onehotbatch(data::AbstractArray{<:Integer}, labels::AbstractUnitRange{<:Integer}) | ||
| # lo, hi = extrema(data) # fails on Julia 1.6 | ||
| lo, hi = minimum(data), maximum(data) | ||
| lo < first(labels) && error("Value $lo not found in labels") | ||
| hi > last(labels) && error("Value $hi not found in labels") | ||
| offset = 1 - first(labels) | ||
| indices = UInt32.(data .+ offset) | ||
| return OneHotArray(indices, length(labels)) |
There was a problem hiding this comment.
Unfortunately the bounds checking here is quite expensive, especially on GPU arrays where I think each of minimum & maximum forces synchronisation:
julia> let ci = cu(rand(1:99, 100))
@btime CUDA.@sync onehotbatch($ci, 1:99)
@btime CUDA.@sync OneHotMatrix($ci, 99)
end;
100.993 μs (86 allocations: 4.02 KiB)
2.803 μs (0 allocations: 0 bytes)
julia> let ci = cu(rand(1:99, 100))
@btime CUDA.@sync maximum($ci), minimum($ci)
@btime CUDA.@sync extrema($ci)
@btime CUDA.@sync map($ci) do i
0<i<100 || error("bad index")
UInt32(i+0)
end
end;
71.448 μs (58 allocations: 2.91 KiB)
38.094 μs (29 allocations: 1.47 KiB)
18.543 μs (30 allocations: 1.14 KiB)
julia> let ci = cu(rand(1:99, 100)) # without explicit CUDA.@sync
@btime extrema($ci)
@btime OneHotMatrix($ci, 99) # async, which is good
@btime OneHotMatrix(map($ci) do i # unfortunately not?
0<i<100 || error("bad index")
UInt32(i+0)
end, 99)
end;
35.544 μs (29 allocations: 1.47 KiB)
6.527 ns (0 allocations: 0 bytes)
10.619 μs (30 allocations: 1.14 KiB)Moving the check inside the broadcast is faster, at the cost of more obscure errors. Maybe that's ok? Still not fully async.
julia> map(cu(rand(1:199, 100))) do i
0<i<100 || error("bad index")
UInt32(i+0)
end
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.There was a problem hiding this comment.
it is rather obscure indeed. What if we wrap the map inside a try-catch and raise a proper error?
There was a problem hiding this comment.
I wonder if we can tell the GPU not to wait? This doesn't work but perhaps something similar does:
julia> let i = rand(1:99, 100)
@btime maximum($i)<100 || error("outside")
@btime @async maximum($i)<100 || error("outside")
end;
58.407 ns (0 allocations: 0 bytes)
759.747 ns (5 allocations: 496 bytes)
julia> let ci = cu(rand(1:99, 100))
@btime maximum($ci)<100 || error("outside")
@btime @async maximum($ci)<100 || error("outside")
end;
35.134 μs (29 allocations: 1.45 KiB)
# hangs?There was a problem hiding this comment.
I think the ideal solution would be something like JuliaGPU/CUDA.jl#1140. If we had a way to write kernels, another idea would be to create an ad-hoc in kernel which flips a one-element bool array to true if it finds a matching element.
There was a problem hiding this comment.
That sounds like the right thing. Perhaps rather than owning a kernel, this package could call checkbounds(out, inds, 1) or whatever -- that's essentially the same operation.
I wondered what gather did, and it turns out there is no check:
julia> NNlib.gather([1,20,300,4000] |> cu, [2,4,2,99] |> cu)
4-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
20
4000
20
0
julia> NNlib.gather([1,20,300,4000], [2,4,2,99])
ERROR: BoundsError: attempt to access 4-element Vector{Int64} at index [99]The PR to add one FluxML/NNlibCUDA.jl#51 has many benchmarks... perhaps also 10s of μs.
This adds the obvious shortcut when the data is already indicies. It's a bit quicker, but also a partial solution to #16, as this will work with GPU arrays too.
Needs tests, and probably an error check.Done.