awesome-gemm

Introduction: This repository collects frameworks, libraries, and software for optimizing matrix-matrix multiplication (A * B = C). It is intended as a reference for developers and researchers working on high-performance computing, numerical analysis, and the optimization of matrix operations.

Fundamental Theories and Concepts

General Matrix Multiply (GEMM)
General Matrix Multiply (Intel)
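Strassen's Algorithm: A divide-and-conquer algorithm for matrix multiplication that uses 7 instead of 8 block multiplications per recursion step, giving O(n^2.807) rather than O(n^3) arithmetic operations. It beats the conventional algorithm only for sufficiently large matrices.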
Winograd's Algorithm: An efficient algorithm for matrix multiplication that reduces the number of multiplications.
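
To make the Strassen entry concrete, here is a minimal pure-Python sketch of the recursion next to the conventional triple loop. It assumes square matrices whose size is a power of two, stored as lists of rows. The function names and the `cutoff` parameter are illustrative and not taken from any library listed here.

```python
def naive_matmul(a, b):
    """Conventional O(n^3) product of two square matrices (lists of rows)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def _add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def _sub(x, y):
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def _split(m):
    """Split a square matrix into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, cutoff=2):
    """Strassen product; assumes n x n inputs with n a power of two."""
    n = len(a)
    assert n & (n - 1) == 0, "sketch only handles power-of-two sizes"
    if n <= cutoff:
        # Below the cutoff the conventional algorithm is used directly.
        return naive_matmul(a, b)
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)
    # Seven recursive block products instead of eight.
    m1 = strassen(_add(a11, a22), _add(b11, b22), cutoff)
    m2 = strassen(_add(a21, a22), b11, cutoff)
    m3 = strassen(a11, _sub(b12, b22), cutoff)
    m4 = strassen(a22, _sub(b21, b11), cutoff)
    m5 = strassen(_add(a11, a12), b22, cutoff)
    m6 = strassen(_sub(a21, a11), _add(b11, b12), cutoff)
    m7 = strassen(_sub(a12, a22), _add(b21, b22), cutoff)
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    # Stitch the quadrants back into one matrix.
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom
```

The extra block additions are why practical implementations fall back to the conventional kernel below some size. The right cutoff depends on the hardware, and the value used here is arbitrary.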

General Optimization Techniques

How To Optimize Gemm: A guide and tutorial on optimizing GEMM operations.
GEMM: From Pure C to SSE Optimized Micro Kernels: An in-depth look into optimizing GEMM from basic C to SSE.
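
Cache blocking (loop tiling) is usually an early step in tutorials of this kind. The sketch below shows only the loop structure for the BLAS-style update C = alpha * A * B + beta * C. The function name `gemm_blocked` and the `block` parameter are hypothetical. In pure Python the tiling gives no speedup, so treat it as an illustration of the iteration order that a C or SIMD kernel would use.

```python
def gemm_blocked(a, b, c, alpha=1.0, beta=1.0, block=2):
    """Update C in place with C = alpha * A @ B + beta * C, tile by tile.

    a is m x k, b is k x n, c is m x n, all stored as lists of rows.
    """
    m, k, n = len(a), len(b), len(b[0])
    # Scale C once up front so the tiled loops only accumulate.
    for row in c:
        for j in range(n):
            row[j] *= beta
    # Outer loops walk over tiles; inner loops stay inside one tile so the
    # working set of A, B and C touched at a time remains small.
    for i0 in range(0, m, block):
        for p0 in range(0, k, block):
            for j0 in range(0, n, block):
                for i in range(i0, min(i0 + block, m)):
                    ci = c[i]
                    for p in range(p0, min(p0 + block, k)):
                        aip = alpha * a[i][p]
                        bp = b[p]
                        for j in range(j0, min(j0 + block, n)):
                            ci[j] += aip * bp[j]
    return c
```

A call such as `gemm_blocked(a, b, c, alpha=1.0, beta=0.0)` overwrites `c` with the plain product. The `min(...)` bounds handle sizes that are not a multiple of the tile size.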

Frameworks and Development Tools

BLIS: A software framework for instantiating high-performance BLAS-like dense linear algebra libraries. BSD-3-Clause
- Created by SHPC at UT Austin (formerly FLAME).
BLISlab: A framework for experimenting with and learning about BLIS-like GEMM algorithms.
Tensile: AMD ROCm's library for JIT compiling kernels for matrix multiplications and tensor contractions. MIT

Libraries

CPU Libraries

OpenBLAS: An optimized BLAS library based on GotoBLAS2. BSD-3-Clause
- Created by Xianyi Zhang.
Intel MKL: Intel's Math Kernel Library for optimized mathematical operations.
oneDNN (formerly MKL-DNN): An open-source cross-platform performance library of deep learning building blocks, optimized for Intel architectures. Apache-2.0
FBGEMM: Facebook's CPU GEMM library optimized for server-side inference. BSD-3-Clause
Google gemmlowp: A small self-contained low-precision GEMM library. Apache-2.0
libFLAME: A high-performance dense linear algebra library. BSD-3-Clause
blis_apple: A BLIS library optimized for Apple M1. BSD-3-Clause
BLASFEO: Basic Linear Algebra Subroutines for Embedded Optimization, tailored for small to medium-sized matrices common in embedded optimization. BSD-2-Clause
LIBXSMM: A library targeting small, dense or sparse matrix multiplications, especially useful for small GEMM kernels. BSD-3-Clause

GPU Libraries

NVIDIA CUTLASS: NVIDIA's template library for CUDA GEMM kernels. BSD-3-Clause
NVIDIA cuBLAS: NVIDIA's implementation of BLAS for CUDA. NVIDIA Software License
NVIDIA cuSPARSE: NVIDIA's library for sparse matrix operations. NVIDIA Software License
NVIDIA cuDNN: NVIDIA's CUDA Deep Neural Network library, providing optimized primitives for deep learning, including matrix multiplication. NVIDIA Software License
hipBLAS: ROCm's BLAS implementation for GPU platforms. MIT
hipBLASLt: Lightweight BLAS implementation for ROCm. MIT
hipBLAS-common: Common utilities for hipBLAS implementations. MIT
OpenAI GEMM: OpenAI's optimized GEMM implementations. MIT
Grouped GEMM: Efficient implementation of grouped GEMM operations. Apache-2.0
CoralGemm: AMD's high-performance GEMM implementation. MIT
cutlass_fpA_intB_gemm: GEMM kernel for fp16 activation and quantized weight. Apache-2.0
DGEMM on Int8 Tensor Core: Library intercepting cuBLAS DGEMM function calls. MIT
chgemm: An int8 GEMM project.
clBLAS: A software library containing BLAS functions written in OpenCL, making it portable across different GPU vendors. Apache-2.0
CLBlast: An OpenCL BLAS library tuned for performance. Apache-2.0
ArrayFire: A general-purpose GPU library that simplifies GPU computing with high-level functions, including matrix operations. BSD-3-Clause

Cross-Platform Libraries

MAGMA: Matrix Algebra on GPU and Multicore Architectures. BSD-3-Clause
LAPACK: Software library for numerical linear algebra. BSD-3-Clause
ARM Compute Library: Machine learning functions optimized for ARM architectures. MIT, Apache-2.0
viennacl-dev: Free open-source linear algebra library for many-core architectures. MIT
CUSP: A C++ Templated Sparse Matrix Library. Apache-2.0
CUV: A C++ template and Python library for CUDA.
Ginkgo: A high-performance linear algebra library for many-core systems, designed for flexibility and efficiency. BSD-3-Clause

Language-Specific Libraries

NumPy: Python library for N-dimensional arrays and numerical computing. BSD-3-Clause
SciPy: Python library for scientific computing, built on NumPy. BSD-3-Clause
TensorFlow: Open-source software library for machine learning. Apache-2.0
TensorFlow XLA (Accelerated Linear Algebra): A domain-specific compiler for linear algebra that optimizes TensorFlow computations. Apache-2.0
JAX: A Python library for high-performance machine learning research, enabling transformations of numerical functions. Apache-2.0
PyTorch: Open-source software library for machine learning. BSD-3-Clause
GemmKernels.jl: Julia package for GEMM operations on GPUs. BSD-3-Clause
BLIS.jl: Julia wrapper for BLIS interface. BSD-3-Clause
Eigen: C++ template library for linear algebra. MPL-2.0
Blaze: High-performance C++ math library. BSD-3-Clause
Armadillo: C++ linear algebra library.
Boost uBLAS: C++ template class library for BLAS functionality. BSL-1.0

Development Software: Debugging and Profiling

Intel VTune Profiler: A performance analysis tool for various platforms, ideal for profiling and optimizing applications on Intel architectures.
Intel Advisor: A tool for vectorization optimization and memory layout transformations to improve application performance.
NVIDIA Nsight Systems: A system-wide performance analysis tool designed to visualize application algorithms, optimize performance, and enhance efficiency on NVIDIA GPUs. NVIDIA Software License Agreement
NVIDIA Nsight Compute: A performance analysis tool for CUDA kernels, providing detailed performance metrics and API debugging.
Nsight Visual Studio Edition: A Visual Studio extension for debugging and profiling CUDA applications.
nvprof: NVIDIA's command-line profiler for CUDA applications. NVIDIA End User License Agreement
ROCm Profiler: AMD's performance analysis tool for profiling applications running on ROCm platforms. MIT
HPCToolkit: An integrated suite of tools for program performance measurement and analysis across a range of architectures. BSD-3-Clause
TAU (Tuning and Analysis Utilities): A performance evaluation tool framework for high-performance parallel programs.
Perf: A performance analyzing tool in Linux, useful for profiling CPU performance counters and system-level metrics. GPLv2
gprof: A performance analysis tool for Unix applications, useful for identifying program bottlenecks. GPLv3
gprofng: The next-generation GNU profiling tool with improved capabilities. GPLv3
- gprofng-gui: A graphical user interface for gprofng. GPLv3
LIKWID: A suite of command-line tools for performance-oriented programmers to profile and optimize their applications. GPLv3
VAMPIR: A tool suite for performance analysis and visualization of parallel programs, aiding in identifying performance issues. Proprietary
Extrae: A package that generates trace files for performance analysis, which can be visualized with Paraver. LGPLv2.1
Memcheck (Valgrind): A memory error detector that helps identify issues like memory leaks and invalid memory access. GPLv2
FPChecker: A tool for detecting floating-point accuracy problems in applications. BSD-3-Clause
MegPeak: A tool for testing processor peak computation performance, useful for benchmarking. Apache-2.0

Learning Resources

University Courses & Tutorials

GPU MODE
HLS Tutorial and Deep Learning Accelerator Design Lab1
UCSB: CS 240A: Applied Parallel Computing
UC Berkeley: CS267
UT Austin: EE382 System-on-Chip (SoC) Design
UT Austin (FLAME): LAFF-On Programming for High Performance
MIT OpenCourseWare: Performance Engineering of Software Systems: Techniques for writing fast code, including optimization of matrix operations.

Selected Papers

BLIS: A Framework for Rapidly Instantiating BLAS Functionality. FG Van Zee, RA Van De Geijn. 2015.
Anatomy of High-Performance Many-Threaded Matrix Multiplication. TM Smith, R Van De Geijn, M Smelyanskiy, JR Hammond, FG Van Zee. 2014.
Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor. Z Xianyi, W Qian, Z Yunquan. 2012.
High-performance implementation of the level-3 BLAS. K Goto, R Van De Geijn. 2008.
Anatomy of high-performance matrix multiplication. K Goto, RA Van De Geijn. 2008.

Lecture Notes

Blogs

Other Resources

NVIDIA Developer Blog: New cuBLAS 12.0 Features and Matrix Multiplication Performance.
Matrix Multiplication Background User's Guide: Guide to matrix multiplication on NVIDIA GPUs.
Triton: Programming language for efficient GPU code.
perf-book: "Performance Analysis and Tuning on Modern CPUs" by Denis Bakhvalov.
The High-Performance Computing (HPC) Garage: A collection of HPC codes and tools from the Innovative Computing Laboratory (ICL) at the University of Tennessee.

Example Implementations

Toy HGEMM Library using Tensor Cores with MMA/WMMA/CuTe: Reported to reach 98% to 100% of cuBLAS performance. GPLv3
SGEMM_CUDA: Step-by-step optimization of matrix multiplication in CUDA. MIT
simple-gemm: Collection of simple GEMM implementations. MIT
YHs_Sample: A CUDA implementation of GEMM. GPLv3
how-to-optimize-gemm: A row-major matmul optimization tutorial. GPLv3
GEMM: Fast Matrix Multiplication Implementation in C. MIT
GEMM Optimization with LIBXSMM: Sample codes showing how to use LIBXSMM for optimizing small matrix multiplications. BSD-3-Clause
Deep Learning GEMM Benchmarks: Benchmarks for measuring the performance of basic deep learning operations including GEMM. Apache-2.0

This curated list aims to be a comprehensive resource for anyone interested in the optimization of matrix-matrix multiplication. Contributions and suggestions are welcome to help keep this list up-to-date and useful for the community.
