Introduction: This repository compiles an extensive list of frameworks, libraries, and software for optimizing matrix-matrix multiplication (A * B = C). It is intended as a resource for developers and researchers working on high-performance computing, numerical analysis, and the optimization of matrix operations.
- Fundamental Theories and Concepts
- General Optimization Techniques
- Frameworks and Development Tools
- Libraries
- Development Software: Debugging and Profiling
- Learning Resources
- Example Implementations
- General Matrix Multiply (GeMM)
- General Matrix Multiply (Intel)
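- Strassen's Algorithm: A divide-and-conquer algorithm for matrix multiplication that uses 7 instead of 8 block multiplications per recursion step, giving roughly O(n^2.81) operations versus O(n^3) for the conventional algorithm. The advantage shows up for large matrices.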
- Winograd's Algorithm: An efficient algorithm for matrix multiplication that reduces the number of multiplications.
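
As an illustration of the Strassen entry above, here is a minimal pure-Python sketch of the recursion. It assumes square matrices whose size is a power of two, stored as lists of lists; the helper names (`add`, `sub`, `split`, `naive`) and the `cutoff` parameter are hypothetical choices for this example and do not come from any of the linked resources. It is meant to show the seven block products, not to be fast.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def naive(A, B):
    """Conventional triple-loop product, used as base case and reference."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def split(M):
    """Split a square matrix of even size into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=2):
    """Strassen product; assumes n x n inputs with n a power of two."""
    n = len(A)
    if n <= cutoff:
        return naive(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive block products instead of eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    # Recombine the products into the four quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 2, 1], [3, 2, 1, 0]]
assert strassen(A, B, cutoff=1) == naive(A, B)
```

In practice the `cutoff` would be much larger, since the extra additions and temporaries only pay off once the blocks are big enough; the right value depends on the machine and is an assumption to tune, not a fixed constant.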
- How To Optimize Gemm: A guide and tutorial on optimizing GEMM operations.
- GEMM: From Pure C to SSE Optimized Micro Kernels: An in-depth look into optimizing GEMM from basic C to SSE.
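
Guides of this kind typically start from a plain triple loop and then restructure it for the memory hierarchy. The sketch below shows one such step, loop tiling (cache blocking), in pure Python. The function names and the `block` parameter are hypothetical and chosen for this example only; the code illustrates the loop structure rather than real performance, which needs a compiled language and a tuned block size.

```python
def gemm_naive(A, B):
    """Reference product C = A * B with the conventional i-j-p loop order."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def gemm_blocked(A, B, block=2):
    """Tiled product: work on block x block tiles so operands stay in cache."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):              # tile rows of A and C
        for p0 in range(0, k, block):          # tile the shared dimension
            for j0 in range(0, m, block):      # tile columns of B and C
                # Multiply one tile of A by one tile of B and
                # accumulate into the matching tile of C.
                for i in range(i0, min(i0 + block, n)):
                    Ci, Ai = C[i], A[i]
                    for p in range(p0, min(p0 + block, k)):
                        a, Bp = Ai[p], B[p]
                        for j in range(j0, min(j0 + block, m)):
                            Ci[j] += a * Bp[j]
    return C


A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]          # 3 x 5
B = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 0], [2, 2, 1, 1], [1, 0, 0, 3]]  # 5 x 4
assert gemm_blocked(A, B, block=2) == gemm_naive(A, B)
```

The `min(...)` bounds handle matrix sizes that are not multiples of the block size. The innermost loop runs over `j`, so both `C[i]` and `B[p]` are traversed along rows, which matches row-major storage.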
- BLIS: A software framework for instantiating high-performance BLAS-like dense linear algebra libraries.
BSD-3-Clause
- Created by SHPC at UT Austin (formerly FLAME).
- BLISlab: A framework for experimenting with and learning about BLIS-like GEMM algorithms.
- Tensile: AMD ROCm's library for JIT compiling kernels for matrix multiplications and tensor contractions.
MIT
- OpenBLAS: An optimized BLAS library based on GotoBLAS2.
BSD-3-Clause
- Created by Xianyi Zhang.
- Intel MKL: Intel's Math Kernel Library for optimized mathematical operations.
- oneDNN (formerly MKL-DNN): An open-source cross-platform performance library of deep learning building blocks, optimized for Intel architectures.
Apache-2.0
- FBGEMM: Facebook's CPU GEMM library optimized for server-side inference.
BSD-3-Clause
- Google gemmlowp: A small self-contained low-precision GEMM library.
Apache-2.0
- libFLAME: A high-performance dense linear algebra library.
BSD-3-Clause
- blis_apple: A BLIS library optimized for Apple M1.
BSD-3-Clause
- BLASFEO: Basic Linear Algebra Subroutines for Embedded Optimization, tailored for small to medium-sized matrices common in embedded optimization.
BSD-2-Clause
- LIBXSMM: A library targeting small, dense or sparse matrix multiplications, especially useful for small GEMM kernels.
BSD-3-Clause
- NVIDIA CUTLASS: NVIDIA's template library for CUDA GEMM kernels.
BSD-3-Clause
- NVIDIA cuBLAS: NVIDIA's implementation of BLAS for CUDA.
NVIDIA Software License
- NVIDIA cuSPARSE: NVIDIA's library for sparse matrix operations.
NVIDIA Software License
- NVIDIA cuDNN: NVIDIA's CUDA Deep Neural Network library, providing optimized primitives for deep learning, including matrix multiplication.
NVIDIA Software License
- hipBLAS: ROCm's BLAS implementation for GPU platforms.
MIT
- hipBLASLt: Lightweight BLAS implementation for ROCm.
MIT
- hipBLAS-common: Common utilities for hipBLAS implementations.
MIT
- OpenAI GEMM: OpenAI's optimized GEMM implementations.
MIT
- Grouped GEMM: Efficient implementation of grouped GEMM operations.
Apache-2.0
- CoralGemm: AMD's high-performance GEMM implementation.
MIT
- cutlass_fpA_intB_gemm: GEMM kernel for fp16 activation and quantized weight.
Apache-2.0
- DGEMM on Int8 Tensor Core: Library intercepting cuBLAS DGEMM function calls.
MIT
- chgemm: An int8 GEMM project.
- clBLAS: A software library containing BLAS functions written in OpenCL, making it portable across different GPU vendors.
Apache-2.0
- CLBlast: An optimized OpenCL BLAS library tuned for performance.
Apache-2.0
- ArrayFire: A general-purpose GPU library that simplifies GPU computing with high-level functions, including matrix operations.
BSD-3-Clause
- MAGMA: Matrix Algebra on GPU and Multicore Architectures.
BSD-3-Clause
- LAPACK: Software library for numerical linear algebra.
BSD-3-Clause
- ARM Compute Library: Machine learning functions optimized for ARM architectures.
MIT
Apache-2.0
- viennacl-dev: Free open-source linear algebra library for many-core architectures.
MIT
- CUSP: A C++ Templated Sparse Matrix Library.
Apache-2.0
- CUV: A C++ template and Python library for CUDA.
- Ginkgo: A high-performance linear algebra library for many-core systems, designed for flexibility and efficiency.
BSD-3-Clause
- NumPy: Python library for scientific computing.
BSD-3-Clause
- SciPy: Python library for scientific computing.
BSD-3-Clause
- TensorFlow: Open-source software library for machine learning.
Apache-2.0
- TensorFlow XLA (Accelerated Linear Algebra): A domain-specific compiler for linear algebra that optimizes TensorFlow computations.
Apache-2.0
- JAX: A Python library for high-performance machine learning research, enabling transformations of numerical functions.
Apache-2.0
- PyTorch: Open-source software library for machine learning.
BSD-3-Clause
- GemmKernels.jl: Julia package for GEMM operations on GPUs.
BSD-3-Clause
- BLIS.jl: Julia wrapper for BLIS interface.
BSD-3-Clause
- Eigen: C++ template library for linear algebra.
MPL-2.0
- Blaze: High-performance C++ math library.
BSD-3-Clause
- Armadillo: C++ linear algebra library.
- Boost uBLAS: C++ template class library for BLAS functionality.
Boost Software License 1.0
- Intel VTune Profiler: A performance analysis tool for various platforms, ideal for profiling and optimizing applications on Intel architectures.
- Intel Advisor: A tool for vectorization optimization and memory layout transformations to improve application performance.
- NVIDIA Nsight Systems: A system-wide performance analysis tool designed to visualize application algorithms, optimize performance, and enhance efficiency on NVIDIA GPUs.
NVIDIA SOFTWARE LICENSE AGREEMENT
- NVIDIA Nsight Compute: A performance analysis tool for CUDA kernels, providing detailed performance metrics and API debugging.
- Nsight Visual Studio Edition: An integrated development environment for debugging and profiling CUDA applications within Visual Studio.
- nvprof: NVIDIA's command-line profiler for CUDA applications.
NVIDIA End User License Agreement
- ROCm Profiler: AMD's performance analysis tool for profiling applications running on ROCm platforms.
MIT
- HPCToolkit: An integrated suite of tools for program performance measurement and analysis across a range of architectures.
BSD-3-Clause
- TAU (Tuning and Analysis Utilities): A performance evaluation tool framework for high-performance parallel programs.
- Perf: A performance analyzing tool in Linux, useful for profiling CPU performance counters and system-level metrics.
GPLv2
- gprof: A performance analysis tool for Unix applications, useful for identifying program bottlenecks.
GPLv3
- gprofng: The next-generation GNU profiling tool with improved capabilities.
GPLv3
- gprofng-gui: A graphical user interface for gprofng.
GPLv3
- LIKWID: A suite of command-line tools for performance-oriented programmers to profile and optimize their applications.
GPLv3
- VAMPIR: A tool suite for performance analysis and visualization of parallel programs, aiding in identifying performance issues.
Proprietary
- Extrae: A package that generates trace files for performance analysis, which can be visualized with Paraver.
LGPLv2.1
- Memcheck (Valgrind): A memory error detector that helps identify issues like memory leaks and invalid memory access.
GPLv2
- FPChecker: A tool for detecting floating-point accuracy problems in applications.
BSD-3-Clause
- MegPeak: A tool for testing processor peak computation performance, useful for benchmarking.
Apache-2.0
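
When reading the output of the tools above, GEMM performance is usually reported in GFLOP/s, using the conventional count of 2 * m * n * k floating-point operations for an (m x k) by (k x n) product. The sketch below shows that arithmetic with a wall-clock timer from the Python standard library. All function names are hypothetical, and the pure-Python kernel is only a stand-in for whichever implementation is being measured.

```python
import time


def gemm_flops(m, n, k):
    """Conventional flop count: one multiply and one add per inner step."""
    return 2 * m * n * k


def naive_matmul(A, B):
    """Stand-in kernel; replace with the implementation under test."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def measure_gflops(matmul, A, B, repeats=3):
    """Time several runs, keep the fastest, and convert to GFLOP/s."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(A, B)
        best = min(best, time.perf_counter() - start)
    m, k, n = len(A), len(B), len(B[0])
    # Guard against a zero reading from a very coarse timer.
    return gemm_flops(m, n, k) / max(best, 1e-12) / 1e9


size = 32
A = [[1.0] * size for _ in range(size)]
B = [[2.0] * size for _ in range(size)]
print(f"{measure_gflops(naive_matmul, A, B):.4f} GFLOP/s")
```

Taking the best of several runs reduces noise from other processes. Comparing the result against the processor's theoretical peak gives a rough efficiency figure.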
- GPU MODE
- HLS Tutorial and Deep Learning Accelerator Design Lab1
- UCSB: CS 240A: Applied Parallel Computing
- UC Berkeley: CS267
- UT Austin: EE382 System-on-Chip (SoC) Design
- UT Austin (Flame): LAFF-On Programming for High Performance
- MIT OpenCourseWare: Performance Engineering of Software Systems: Techniques for writing fast code, including optimization of matrix operations.
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality. FG Van Zee, RA Van De Geijn. 2015.
- Anatomy of High-Performance Many-Threaded Matrix Multiplication. TM Smith, R Van De Geijn, M Smelyanskiy, JR Hammond, FG Van Zee. 2014.
- Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor. Z Xianyi, W Qian, Z Yunquan. 2012.
- High-Performance Implementation of the Level-3 BLAS. K Goto, R Van De Geijn. 2008.
- Anatomy of High-Performance Matrix Multiplication. K Goto, RA Van De Geijn. 2008.
- ORNL: CUDA C++ Exercise: Basic Linear Algebra Kernels: GEMM Optimization Strategies
- Stanford: BLAS-level CPU Performance in 100 Lines of C
- Purdue: Optimizing Matrix Multiplication
- NJIT: Optimize Matrix Multiplication
- Optimizing Matrix Multiplication using SIMD and Parallelization
- Distributed GEMM - A novel CUTLASS-based implementation of Tensor Parallelism for NVLink-enabled systems
- Outperforming cuBLAS on H100: a Worklog
- Deep Dive on CUTLASS Ping-Pong GEMM Kernel
- CUTLASS Tutorial: Efficient GEMM kernel designs with Pipelining
- Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: A Worklog
- Fast Multidimensional Matrix Multiplication on CPU from Scratch
- Matrix Multiplication on CPU
- Optimizing Matrix Multiplication
- Optimizing Matrix Multiplication: Cache + OpenMP
- Tuning Matrix Multiplication (GEMM) for Intel GPUs
- Building a FAST Matrix Multiplication Algorithm
- Matrix-Matrix Product Experiments with BLAZE
- CUDA Learn Notes
- CUDA GEMM Optimization
- Why GEMM is at the heart of deep learning
- The OpenBLAS Project and Matrix Multiplication Optimization (Chinese)
- Step by Step Optimization of CUDA SGEMM (Chinese)
- OpenBLAS GEMM from Scratch (Chinese)
- The Proper Approach to CUDA for Beginners: How to Optimize GEMM (Chinese)
- ARMv7 4x4kernel Optimization Practice (Chinese)
- GEMM Caching (Chinese)
- NVIDIA Developer Blog: New cuBLAS 12.0 Features and Matrix Multiplication Performance.
- Matrix Multiplication Background User's Guide: Guide to matrix multiplication on NVIDIA GPUs.
- Triton: Programming language for efficient GPU code.
- perf-book: "Performance Analysis and Tuning on Modern CPUs" by Denis Bakhvalov.
- The High-Performance Computing (HPC) Garage: A collection of HPC codes and tools from the Innovative Computing Laboratory (ICL) at the University of Tennessee.
- Toy HGEMM Library using Tensor Cores with MMA/WMMA/CuTe: May achieve 98% to 100% of cuBLAS performance.
GPLv3
- SGEMM_CUDA: Step-by-step optimization of matrix multiplication in CUDA.
MIT
- simple-gemm: Collection of simple GEMM implementations.
MIT
- YHs_Sample: A CUDA implementation of GEMM.
GPLv3
- how-to-optimize-gemm: A row-major matmul optimization tutorial.
GPLv3
- GEMM: Fast Matrix Multiplication Implementation in C.
MIT
- GEMM Optimization with LIBXSMM: Sample codes showing how to use LIBXSMM for optimizing small matrix multiplications.
BSD-3-Clause
- Deep Learning GEMM Benchmarks: Benchmarks for measuring the performance of basic deep learning operations including GEMM.
Apache-2.0
This curated list aims to be a comprehensive resource for anyone interested in the optimization of matrix-matrix multiplication. Contributions and suggestions are welcome to help keep this list up-to-date and useful for the community.