Using AbstractFloat instead of Float64 in warning check for slow conv #475
base: master
Conversation
I like the idea, but do we support half precision in the fast path? I didn't think we did, since it relies on BLAS and that's usually 32/64-bit only.
Hmm, I think I've read somewhere something like "half-precision support", but indeed, BLAS only supports 32/64-bit. Still, if you mix Float32 and Float16 (which is what happened to me and led me to open this PR), there is no warning. Maybe for now, instead of checking for Float64 only, we check for Float64 and Float32?
There should be no warning, because we don't have another implementation of convolutions for those types! Or are you saying that we should be dissuading people from mixing FP16+FP32 altogether?
I don't think we should be dissuading people from mixing types or doing weird combinations, but I think the warning is valid and useful, since you can actually end up with this mix by mistake. Flux also currently issues a warning for a possible mix of Float16 and Float32. And if you have scalar indexing disabled, you get a scalar indexing error because of how the direct convolution is implemented. If you are new to Julia's ecosystem, it can be not so straightforward to work out that the problem is type mixing, IMO.
julia> x = CUDA.rand(Float32, 5, 5, 1, 1)
5×5×1×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
0.601529 0.142086 0.612069 0.094501 0.699099
0.821472 0.202396 0.484461 0.337047 0.106021
0.586893 0.62307 0.168788 0.329471 0.8758
0.213742 0.770866 0.132256 0.489098 0.570035
0.392795 0.27233 0.656506 0.795186 0.618649
julia> w = CUDA.rand(Float16, 3, 3, 1, 1)
3×3×1×1 CuArray{Float16, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
0.4766 0.009766 0.8623
0.6084 0.0 0.001953
0.8877 0.2754 0.6045
julia> conv(x, w)
ERROR: TaskFailedException
nested task error: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] assertscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/B3xv7/src/GPUArraysCore.jl:100
[3] getindex(::CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64, ::Int64, ::Int64, ::Vararg{Int64})
@ GPUArrays ~/.julia/packages/GPUArrays/5wTN2/src/host/indexing.jl:9
[4] getindex
@ ./subarray.jl:282 [inlined]
[5] conv_direct!(y::SubArray{Float32, 5, CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, x::SubArray{Float32, 5, CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, w::CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{3, 3, 3, 6, 3}, ::Val{(3, 3, 1)}, ::Val{1}, ::Val{(0, 0, 0, 0, 0, 0)}, ::Val{(1, 1, 1)}, ::Val{(1, 1, 1)}, fk::Val{false}; alpha::Float32, beta::Bool)
@ NNlib ~/.julia/packages/NNlib/TZPiH/src/impl/conv_direct.jl:104
[6] conv_direct!(y::SubArray{Float32, 5, CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, x::SubArray{Float32, 5, CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, w::CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{3, 3, 3, 6, 3}; alpha::Float32, beta::Bool)
@ NNlib ~/.julia/packages/NNlib/TZPiH/src/impl/conv_direct.jl:50
[7] conv_direct!
@ ~/.julia/packages/NNlib/TZPiH/src/impl/conv_direct.jl:47 [inlined]
[8] (::NNlib.var"#308#312"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, DenseConvDims{3, 3, 3, 6, 3}, SubArray{Float32, 5, CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, SubArray{Float32, 5, CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}})()
@ NNlib ./threadingconstructs.jl:258
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:436
[2] macro expansion
@ ./task.jl:455 [inlined]
[3] conv!(out::CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, in1::CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, in2::CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{3, 3, 3, 6, 3}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ NNlib ~/.julia/packages/NNlib/TZPiH/src/conv.jl:205
[4] conv!
@ ~/.julia/packages/NNlib/TZPiH/src/conv.jl:185 [inlined]
[5] #conv!#258
@ ~/.julia/packages/NNlib/TZPiH/src/conv.jl:145 [inlined]
[6] conv!
@ ~/.julia/packages/NNlib/TZPiH/src/conv.jl:140 [inlined]
[7] conv(x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, w::CuArray{Float16, 4, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ NNlib ~/.julia/packages/NNlib/TZPiH/src/conv.jl:88
[8] conv
@ ~/.julia/packages/NNlib/TZPiH/src/conv.jl:83 [inlined]
[9] #conv#231
@ ~/.julia/packages/NNlib/TZPiH/src/conv.jl:56 [inlined]
[10] conv(x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, w::CuArray{Float16, 4, CUDA.Mem.DeviceBuffer})
@ NNlib ~/.julia/packages/NNlib/TZPiH/src/conv.jl:50
[11] top-level scope
@ REPL[3]:1
The reason <64-bit and 64-bit mixing is so insidious is that it's so easy to write Julia code which promotes to the latter. That's less of a concern with 16/32, because one has to explicitly request arrays of that eltype. It's also not as if users have a different option for 16-bit conv operations, which is the case for both 32 and 64 bit. IMO the scalar indexing problem is a separate one and should be addressed by us adding a check for GPU arrays.
Edit: I might support having the fallback warning as a way to tell users that FP16 does not have a fast path on CPU, but again that's a different discussion from trying to extend the behaviour of the current warning (which, as far as I know, was meant as a "hey, this isn't hitting the BLAS path" warning) to different eltype combinations.
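As a hedged illustration of that separate GPU-array check, one possible shape might be the sketch below; the function name and error message are made up for this example and are not part of NNlib:
using GPUArraysCore: AbstractGPUArray

# Illustrative sketch only: fail fast with a clear error instead of letting
# the direct-convolution fallback trip over scalar indexing on GPU arrays.
# `check_no_gpu_fallback` is a hypothetical name, not an NNlib function.
function check_no_gpu_fallback(xs...)
    if any(x -> x isa AbstractGPUArray, xs)
        throw(ArgumentError(
            "no fast convolution path for this combination of element types " *
            "on GPU arrays; convert the inputs to a common supported eltype"))
    end
end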
PR Checklist
#383 silenced some warnings that are not necessarily useful (as discussed in that PR), but kept the warning for Float64 mixed with other types.
But with support for half-precision floats, some other combinations can happen and go unnoticed, such as a Float16 weight on a Float32 matrix, or a Float32 weight on an Int matrix ("oh no, I forgot to convert my integers to floats").
This PR changes the type check from Float64 to AbstractFloat, producing a warning for these unintentional and odd combinations while still avoiding warnings for ForwardDiff.Dual, which I think was the main purpose of the previous PR.
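A minimal sketch of the idea, assuming a simplified form of the condition; the names and exact predicate below are illustrative and do not mirror the actual NNlib source:
# Illustrative sketch only: widen the eltype test that decides whether the
# "slow fallback" warning fires.
should_warn_old(yT, xT, wT) = any(T -> T <: Float64,       (yT, xT, wT)) && !(yT == xT == wT)
should_warn_new(yT, xT, wT) = any(T -> T <: AbstractFloat, (yT, xT, wT)) && !(yT == xT == wT)

should_warn_old(Float32, Float32, Float16)  # false: silent on current NNlib
should_warn_new(Float32, Float32, Float16)  # true:  warns with this PR
should_warn_new(Float32, Int64,   Float32)  # true:  catches the forgotten Int-to-Float conversion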
Some examples
Current NNlib (only shows a warning when there is a Float64):
This PR (shows a warning for any mixture of different Floats):
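The screenshots from the original post are not reproduced here; as a rough sketch, the kind of comparison being shown corresponds to calls like the following (array sizes are arbitrary, warning text omitted):
using NNlib

# Involves Float64: warns on current NNlib and with this PR.
conv(rand(Float64, 5, 5, 1, 1), rand(Float32, 3, 3, 1, 1))

# Float32/Float16 mix: silent on current NNlib, warns with this PR.
conv(rand(Float32, 5, 5, 1, 1), rand(Float16, 3, 3, 1, 1))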
PS: I ran these tests after restarting the session each time, since the warnings use maxlog=1 and would otherwise only print once.