Yan's comment:
For binary ops we can't use the full float32 precision. The reason is this: both input tiles (from A and B) sit in local SRAM in fp32 format. The unpacker then puts them into the SrcA and SrcB registers, which only support the TF32 format, immediately losing 13 bits of mantissa. The operands are then placed in the DST register back in fp32, but the precision has already been lost. We do support direct SRAM-to-DST unpacking at full precision, but only for one of the two unpackers, so this works for unary ops but not for binary ops.
Ref: comment, ticket
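To make the mantissa loss concrete, here is a minimal, standalone C++ sketch (plain host code, not tt-metal kernel code) that emulates the TF32 conversion by dropping the low 13 mantissa bits of an fp32 value. The hardware conversion may round rather than truncate, but the point is the same: fp32 has 23 mantissa bits, TF32 keeps only 10.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Truncate a float32 value to TF32 precision by clearing the low 13 mantissa
// bits (bits [12:0]). Illustration only; real conversion may round to nearest.
static float to_tf32(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float a = 1.0000001f;   // differs from 1.0f only in the low mantissa bits
    float b = 3.14159274f;  // pi rounded to fp32

    // Full-precision fp32 add vs. the same add after a TF32 round-trip,
    // mimicking the SRAM -> SrcA/SrcB -> DST path described above.
    float full  = a + b;
    float lossy = to_tf32(a) + to_tf32(b);

    std::printf("fp32 result:  %.9g\n", full);
    std::printf("tf32 result:  %.9g\n", lossy);
    std::printf("difference:   %.3g\n", full - lossy);
    return 0;
}
```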
New LLK for binary SFPU ops - rd/binary_sfpu_pow
Goal:
Incorporate full float32 precision into the current eltwise binary implementation without disturbing the existing implementation, i.e. add a separate compute kernel and program factory for fp32. The criterion for picking full float32 precision, for now, is that both inputs have float32 dtype (see the sketch after this list). #15483: Initial setup for binary sfpu ops #15557
Need to support pre- and post-activations on input and output
Need to support chained binary ops
Do we need a typecast on the output? I don't think so, since this kernel exists for the purpose of providing full float32 precision.
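As a rough illustration of the selection criterion above, here is a hedged sketch with hypothetical names (DataType, Tensor, and use_fp32_sfpu_binary are placeholders, not the actual ttnn/tt-metal API): the fp32 SFPU path is taken only when both inputs are float32; everything else falls back to the existing binary implementation. The comment outline at the end restates the per-tile flow from this issue (direct SRAM-to-DST unpack, pre/post-activations, chained binary ops), not a definitive kernel design.

```cpp
#include <cstdio>

// Hypothetical stand-ins for illustration only.
enum class DataType { FLOAT32, BFLOAT16 };

struct Tensor {
    DataType dtype;
    // ... shape, layout, buffer, etc.
};

// Selection rule from the goal above: take the full-precision SFPU binary path
// only when both inputs are float32; otherwise use the existing binary path.
bool use_fp32_sfpu_binary(const Tensor& a, const Tensor& b) {
    return a.dtype == DataType::FLOAT32 && b.dtype == DataType::FLOAT32;
}

// Conceptual per-tile flow inside the fp32 compute kernel (outline only; the
// real kernel would use the new binary SFPU LLK calls referenced in this issue):
//   1. Unpack the A and B tiles straight from SRAM into DST at full fp32
//      precision, bypassing the TF32-only SrcA/SrcB registers.
//   2. Apply any pre-activations to the operands in DST.
//   3. Run the binary SFPU op (and any chained binary ops) on the DST slots.
//   4. Apply any post-activation to the result.
//   5. Pack the result tile to the output circular buffer.

int main() {
    Tensor a{DataType::FLOAT32}, b{DataType::FLOAT32}, c{DataType::BFLOAT16};
    std::printf("f32 + f32  -> SFPU path: %d\n", use_fp32_sfpu_binary(a, b));
    std::printf("f32 + bf16 -> SFPU path: %d\n", use_fp32_sfpu_binary(a, c));
    return 0;
}
```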