
How to set fast math for CUDA #491

Closed · Zentrik opened this issue Aug 11, 2023 · 6 comments · Fixed by JuliaGPU/CUDA.jl#2030

Comments

@Zentrik (Contributor) commented Aug 11, 2023

It seems to me that fast math is set in this line, and since --math-mode has been disabled (JuliaLang/julia#41638), there is currently no way to enable fast math.

fast_math = Base.JLOptions().fast_math == 1

Perhaps fast math should be settable by the user, similarly to max_regs.
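
For illustration, something along these lines (a sketch only: kernel!, y, and x are placeholder names, maxregs is the existing per-launch register cap accepted by @cuda, and the fastmath keyword is hypothetical here; the PR that later closed this issue, JuliaGPU/CUDA.jl#2030, added an option of this kind):

@cuda threads=256 maxregs=32 kernel!(y, x)        # existing: per-launch register cap
@cuda threads=256 fastmath=true kernel!(y, x)     # proposed: per-launch fast-math toggle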

@vchuravy (Member):

You can use the @fastmath macro locally.
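
For example (a minimal sketch; the kernel and argument names are illustrative), the macro can be scoped to just the expression that should use the fast variants:

using CUDA

function kernel!(y, x)
    i = threadIdx().x
    # only this expression is rewritten to its Base.FastMath counterparts
    @inbounds y[i] = @fastmath exp(x[i]) / (1 + x[i])
    return nothing
end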

@maleadt (Member) commented Aug 11, 2023

Since @fastmath is a syntactic transformation (JuliaLang/julia#26828), we cannot implement this as an option to @cuda without reimplementing it all using overlay tables. Not saying that shouldn't happen, and I'd be in favor, it's just that this shouldn't happen in CUDA.jl.
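
For reference, the rewrite happens at macro-expansion time, which is why it cannot be applied after the fact as a compiler option; a REPL sketch:

julia> @macroexpand @fastmath sqrt(x)
:(Base.FastMath.sqrt_fast(x))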

@maleadt closed this as not planned on Aug 11, 2023.
@Zentrik (Contributor, Author) commented Aug 11, 2023

Does @fastmath enable flushing denormals and the other things controlled here?

GPUCompiler.jl/src/ptx.jl, lines 427 to 441 at 15f0077:

fast_math = Base.JLOptions().fast_math == 1
# NOTE: we follow nvcc's --use_fast_math
reflect_val = if reflect_arg == "__CUDA_FTZ"
    # single-precision denormals support
    ConstantInt(reflect_typ, fast_math ? 1 : 0)
elseif reflect_arg == "__CUDA_PREC_DIV"
    # single-precision floating-point division and reciprocals
    ConstantInt(reflect_typ, fast_math ? 0 : 1)
elseif reflect_arg == "__CUDA_PREC_SQRT"
    # single-precision square root
    ConstantInt(reflect_typ, fast_math ? 0 : 1)
elseif reflect_arg == "__CUDA_FMAD"
    # contraction of floating-point multiplies and adds/subtracts into
    # floating-point multiply-add operations (FMAD, FFMA, or DFMA)
    ConstantInt(reflect_typ, fast_math ? 1 : 0)

That's what I cared about; I'm already using @fastmath to get the faster versions of functions.

It doesn't seem to, though. For example:

using CUDA

@fastmath function kernel!(y, x)
    i = threadIdx().x
    @inbounds y[i] = sqrt(x[i])
    return nothing
end

x = CuArray(Float32[])
@device_code_ptx @cuda launch=false always_inline=true kernel!(x, x)

Looking at the PTX of this, I see sqrt.rn.f32 %f2, %f1; whereas if I run a fork of GPUCompiler.jl with fast_math set to true, I get sqrt.approx.ftz.f32 %f2, %f1; and the SASS code looks significantly better.
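
Concretely, that just amounts to hard-coding the flag in the snippet quoted above, e.g.:

fast_math = true  # instead of fast_math = Base.JLOptions().fast_math == 1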

@Zentrik (Contributor, Author) commented Aug 11, 2023

Both @fastmath sqrt and plain sqrt compile to @llvm.sqrt.f32 in LLVM, as CUDA doesn't define a fast-math sqrt, so it's perhaps not the best example, but I think my point still stands. Doing @fastmath 1 / x[i], with fast_math = false I see div.approx in the PTX, and with fast_math = true I see div.approx.ftz.
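
For concreteness, the reciprocal variant of the kernel from the previous comment (illustrative name, same setup otherwise):

@fastmath function recip_kernel!(y, x)
    i = threadIdx().x
    @inbounds y[i] = 1 / x[i]   # div.approx in the PTX; the .ftz suffix only appears with fast_math set
    return nothing
end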

@maleadt (Member) commented Aug 11, 2023

> Does @fastmath enable flushing denormals and the other things controlled here?

Ah yes, that kind of stuff we should be able to control.

> Doing @fastmath 1 / x[i], with fast_math = false I see div.approx in the PTX, and with fast_math = true I see div.approx.ftz.

That won't be affected by the proposed fast_math flag though, which only affects the code from libdevice. IIUC Julia itself should change the emission of the fdiv LLVM IR instruction, adding appropriate fast-math flags (not saying that's implemented, but it's the level where this should be happening).

I guess we could have a GPUCompiler pass that adds fast-math stuff everywhere, but that feels like a hack.
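
As a sketch of where to check, dumping the device-side LLVM IR shows whether the division instruction carries fast-math flags (reusing kernel! and x from the earlier comment):

@device_code_llvm @cuda launch=false always_inline=true kernel!(x, x)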

@Zentrik (Contributor, Author) commented Aug 19, 2023

Thanks
