A detailed description of the approach implemented in this repository can be found in our FCCM'22 paper [1].
This repository implements an arbitrary precision floating point multiplier and adder using Vitis HLS targeting XRT-enabled Xilinx FPGAs, exposing them through a matrix multiplication primitive that allows running them at full throughput without becoming memory bound. The design is fully pipelined, yielding a MAC throughput equivalent to the frequency times the number of compute units instantiated.
Instantiations of the design on an Alveo U250 accelerator were shown to yield 2.0 GMAC/s of 512-bit matrix-matrix multiplication, an order of magnitude higher than a 36-core dual-socket Xeon node and corresponding to 375 CPU cores' worth of throughput [1].
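As a rough illustration of the throughput statement above, the peak rate of a fully pipelined design can be modeled as one multiply-accumulate (MAC) per clock cycle per compute unit. The sketch below assumes this idealized model; the helper name `PeakMacsPerSecond` is hypothetical and not part of this repository, and it does not account for memory stalls or the frequency actually achieved after place and route.

```cpp
#include <cassert>
#include <cstdint>

// Idealized peak throughput of a fully pipelined design: every compute unit
// is assumed to complete one MAC per clock cycle.
// Hypothetical helper for illustration only; not part of this repository.
inline std::uint64_t PeakMacsPerSecond(std::uint64_t frequency_hz,
                                       std::uint64_t compute_units) {
  assert(compute_units > 0 && "at least one compute unit is required");
  return frequency_hz * compute_units;
}
```

For example, a hypothetical build with 4 compute units clocked at 250 MHz would peak at `PeakMacsPerSecond(250000000, 4)`, which is 1,000,000,000 MAC/s (1 GMAC/s). These numbers are made up for the example and are not the configuration measured in [1].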
The hardware design is configured using CMake. The target Xilinx XRT-enabled platform must be specified with the APFP_PLATFORM parameter. The most important configuration parameters include:
- The width of the floating point representation is fixed at compile time using the APFP_BITS CMake parameter, of which 63 bits are used for the exponent, 1 bit for the sign, and the remaining bits for the mantissa. The value is currently expected to be a multiple of 512 so that it aligns with the memory interface width.
- To scale the design beyond a single pipelined multiplier, the APFP_COMPUTE_UNITS parameter can be used to replicate the full kernel. Each instantiation runs a fully independent matrix multiplication unit. These units can collaborate on a single matrix multiplication operation (see host/TestMatrixMultiplication.cpp for an example).
- The floating point multiplier uses Karatsuba decomposition to reduce the overall resource usage of the design. The decomposition bottoms out at APFP_MULT_BASE_BITS, after which it falls back on naive multiplication using DSPs as generated by the HLS tool. Similarly, APFP_ADD_BASE_BITS configures the number of bits to dispatch to the HLS tool's addition implementation; above this threshold, the addition is manually pipelined into multiple stages.
- To avoid being memory bound, the matrix multiplication implementation is tiled using the approach described in our FPGA'20 paper [2]. The tile sizes are exposed through the APFP_TILE_SIZE_N and APFP_TILE_SIZE_M parameters. The highest arithmetic intensity is achieved when these two quantities are equal and maximized, but relatively small tile sizes (e.g., 32x32) are sufficient to overcome the memory bottleneck. Larger tile sizes increase arithmetic intensity at the cost of BRAM usage and of potential overhead when the matrix dimensions are not multiples of the tile size.
- APFP_FREQUENCY can be used to change the maximum frequency targeted by the design. If unspecified, the default of the target platform is used.
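To make the role of the Karatsuba base case more concrete, here is a minimal software sketch of the decomposition on small integers. It is an illustrative model only and not the HLS implementation in this repository: the function name `KaratsubaMultiply` and its parameters are hypothetical, operands are limited to 32 bits so the product fits in a 64-bit integer, and `base_bits` merely plays the role that APFP_MULT_BASE_BITS plays in the hardware design.

```cpp
#include <cassert>
#include <cstdint>

// Multiplies two unsigned integers a, b < 2^bits (bits <= 32) with Karatsuba
// decomposition. Operands of at most base_bits bits are multiplied directly,
// analogous to falling back on a plain multiplier at the base width.
// Illustrative software model; names and limits are assumptions.
inline std::uint64_t KaratsubaMultiply(std::uint64_t a, std::uint64_t b,
                                       unsigned bits, unsigned base_bits) {
  assert(bits <= 32 && "the product must fit in 64 bits");
  assert(base_bits >= 3 && "smaller base widths would recurse forever");
  assert((a >> bits) == 0 && (b >> bits) == 0 && "operands exceed bit width");
  if (bits <= base_bits) {
    return a * b;  // Base case: naive multiplication
  }
  const unsigned lo_bits = bits / 2;        // Width of the low halves
  const unsigned hi_bits = bits - lo_bits;  // Width of the high halves
  const std::uint64_t mask = (std::uint64_t{1} << lo_bits) - 1;
  const std::uint64_t a_lo = a & mask, a_hi = a >> lo_bits;
  const std::uint64_t b_lo = b & mask, b_hi = b >> lo_bits;
  // Three recursive products instead of the four needed by the naive split
  const std::uint64_t z0 = KaratsubaMultiply(a_lo, b_lo, lo_bits, base_bits);
  const std::uint64_t z2 = KaratsubaMultiply(a_hi, b_hi, hi_bits, base_bits);
  // The sums of the halves can be one bit wider than the halves themselves
  const std::uint64_t cross = KaratsubaMultiply(a_lo + a_hi, b_lo + b_hi,
                                                hi_bits + 1, base_bits);
  const std::uint64_t z1 = cross - z0 - z2;  // a_lo * b_hi + a_hi * b_lo
  return (z2 << (2 * lo_bits)) + (z1 << lo_bits) + z0;
}
```

Each level replaces one multiplication with a few additions and subtractions, which is the source of the resource savings; a larger base width means fewer recursion levels but wider directly implemented multipliers.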
For more details on how to configure the project to achieve high throughput, see our paper [1].
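Two of the parameter relationships described above can be checked with simple arithmetic. The sketch below computes the mantissa width implied by a given APFP_BITS, and estimates the arithmetic intensity of a tile under the assumption of an outer-product schedule, in which each step loads one column of N elements and one row of M elements and performs N*M MACs. That schedule is an assumption made for this example, and both helper names are hypothetical; see [2] for the actual tiling scheme.

```cpp
#include <cassert>

// Bit budget as described above: 1 sign bit and 63 exponent bits, with the
// remaining bits of the APFP_BITS-wide representation used for the mantissa.
constexpr int kSignBits = 1;
constexpr int kExponentBits = 63;

// Hypothetical helper: mantissa width for a given total width.
inline int MantissaBits(int apfp_bits) {
  assert(apfp_bits > 0 && apfp_bits % 512 == 0 &&
         "the width is expected to be a multiple of 512");
  return apfp_bits - kSignBits - kExponentBits;
}

// Hypothetical helper: MACs performed per matrix element loaded for an
// N x M output tile, ASSUMING an outer-product schedule where every step
// loads N + M elements and performs N * M MACs.
inline double TileArithmeticIntensity(int tile_size_n, int tile_size_m) {
  assert(tile_size_n > 0 && tile_size_m > 0);
  return static_cast<double>(tile_size_n) * tile_size_m /
         (tile_size_n + tile_size_m);
}
```

Under these assumptions, `MantissaBits(512)` is 448, and a 32x32 tile performs 16 MACs per element loaded, whereas a 64x16 tile holding the same number of output elements performs only 12.8. This is consistent with square tiles giving the highest intensity for a fixed on-chip buffer size.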
Please make sure you clone the repository with git clone --recursive, or run git submodule update --init after cloning, to check out dependencies.
The minimum commands necessary to configure and build the code are:
mkdir build
cd build
cmake .. # Default parameters
make # Builds software components
make hw # Builds hardware accelerator
However, the accelerator should always be configured to match the target system using the parameters described in the previous section and in our paper [1]. The CMake configuration flow uses hlslib [3] to locate the Xilinx tools and expose hardware build targets.
Vitis, GMP, and MPFR must be available for the project to configure successfully.
We provide example host code that runs the matrix multiplication accelerator on a randomized input in host/TestMatrixMultiplication.cpp. See the executable for usage. An example invocation could be:
./TestMatrixMultiplicationHardware hw 256 256 256
To install the project, including both the software interface components and the hardware accelerator itself (built with make hw), simply run make install. The install location is configured with the CMAKE_INSTALL_PREFIX parameter.
[1] Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos Ziogas, David Simmons-Duffin, and Torsten Hoefler, "Fast Arbitrary Precision Floating Point on FPGA", in Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM'22).
[2] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler, "Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis", in Proceedings of the 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20).
[3] Johannes de Fine Licht and Torsten Hoefler, "hlslib: Software Engineering for Hardware Design", presented at the Fifth International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'19).