
Implement FLOAT32 support for eltwise binary ops #15483

Open · 3 tasks
KalaivaniMCW opened this issue Nov 27, 2024 · 0 comments

Comments

KalaivaniMCW (Contributor) commented Nov 27, 2024

Yan's comment:
For binary ops we can't use the full float32 precision, for the following reason. Both input tiles (from A and B) sit in local SRAM in the fp32 format. The unpacker then puts them into the SrcA and SrcB registers. Those registers only support the TF32 format, immediately losing 13 bits of mantissa. The values are then placed in the DST register back in fp32, but the precision has already been lost. We do support direct SRAM-to-DST unpacking with full precision, but only for one of the two unpackers, so this works for unary ops, not for binary.
Ref: comment, ticket
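
A minimal sketch of the precision loss described above, assuming TF32 keeps fp32's 8-bit exponent but only the top 10 of its 23 mantissa bits (the sketch truncates the low 13 bits; actual hardware conversion may round instead):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Simulate an fp32 value passing through a TF32-format source register:
// the 8-bit exponent survives, but only the top 10 of the 23 fp32
// mantissa bits are kept, so the low 13 bits are dropped.
float through_tf32(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFE000u;  // zero the low 13 mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof y);
    return y;
}

int main() {
    float a = 1.0000001f;  // representable in fp32, but not in 10 mantissa bits
    std::printf("fp32: %.9g, after TF32 register: %.9g\n", a, through_tf32(a));
    // prints: fp32: 1.00000012, after TF32 register: 1
}
```

This is why the SRAM -> SrcA/SrcB -> DST path loses precision for binary ops even though DST itself holds fp32.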

New LLK for binary SFPU ops - rd/binary_sfpu_pow

Goals:

  • Incorporate full float32 precision into the current eltwise binary implementation without disturbing the existing implementation, i.e. a separate compute kernel and program factory for fp32. The criterion for picking full float32 precision, for now, is that both inputs are in the float32 dtype (see the dispatch sketch after this list). #15483: Initial setup for binary sfpu ops #15557

  • Need to support pre- and post-activations on the input and output

  • Need to support chained binary ops

  • Do we need a typecast on the output? I don't think so, since this kernel exists to provide full float32 precision.
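
A hypothetical sketch of the dispatch criterion from the first goal (the names select_binary_kernel, DataType, and BinaryKernel are illustrative, not the actual ttnn/tt-metal API): the full-precision SFPU kernel and its program factory are chosen only when both inputs are FLOAT32, leaving the existing path untouched otherwise.

```cpp
// Illustrative only: enum values and function names are assumptions,
// not the real ttnn/tt-metal identifiers.
enum class DataType { BFLOAT16, FLOAT32 /* ... */ };
enum class BinaryKernel { FPU_DEFAULT, SFPU_FP32 };

BinaryKernel select_binary_kernel(DataType a, DataType b) {
    // Criterion from the issue: full float32 precision only when
    // both inputs are float32; everything else keeps the current path.
    if (a == DataType::FLOAT32 && b == DataType::FLOAT32) {
        return BinaryKernel::SFPU_FP32;  // separate fp32 compute kernel + program factory
    }
    return BinaryKernel::FPU_DEFAULT;
}
```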
