LightNeuron is an efficient, educational neural network library written in C for x86-64 architectures. It aims to provide insight into neural network mechanics, profiling, and optimization, with a special focus on the efficiency of General Matrix Multiply (GEMM) operations.
Targeted primarily at students, researchers, and developers, LightNeuron offers a CNN inference framework that processes HDF5 model files, facilitating integration with models trained in frameworks such as PyTorch and TensorFlow. Key features include:
- Convolutional Layer Computation (`conv()`)
- Matrix Multiplication (`matmul()`)
- Activation Functions (`relu()`)
- Pooling (`pooling()`)
- Forward Pass Operations (`forwardPass()`)
- Feature Extraction and Interpretation
- Prediction (`softmax()`, `predict()`)
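To show how these primitives fit together, here is a minimal, self-contained C sketch of a dense-layer forward pass built from naive reference versions of a few of them. The shapes, names, and signatures are illustrative assumptions for this sketch, not LightNeuron's actual API.

```c
#include <math.h>
#include <stdio.h>

/* y = W x, where W is r x c (row-major) and x has length c */
static void matmul(const float *W, const float *x, float *y, int r, int c) {
    for (int i = 0; i < r; i++) {
        y[i] = 0.0f;
        for (int j = 0; j < c; j++)
            y[i] += W[i * c + j] * x[j];
    }
}

/* Clamp negative activations to zero in place */
static void relu(float *v, int n) {
    for (int i = 0; i < n; i++)
        if (v[i] < 0.0f) v[i] = 0.0f;
}

/* Normalize logits to probabilities (max-subtracted for stability) */
static void softmax(float *v, int n) {
    float max = v[0], sum = 0.0f;
    for (int i = 1; i < n; i++) if (v[i] > max) max = v[i];
    for (int i = 0; i < n; i++) { v[i] = expf(v[i] - max); sum += v[i]; }
    for (int i = 0; i < n; i++) v[i] /= sum;
}

/* Return the index of the most probable class */
static int predict(const float *probs, int n) {
    int best = 0;
    for (int i = 1; i < n; i++) if (probs[i] > probs[best]) best = i;
    return best;
}

int main(void) {
    float W[2 * 3] = { 0.5f, -1.0f, 0.25f,
                      -0.5f,  2.0f, 1.00f };
    float x[3] = { 1.0f, 0.5f, -2.0f };   /* stand-in for extracted features */
    float y[2];

    matmul(W, x, y, 2, 3);                /* one step of forwardPass()     */
    relu(y, 2);                           /* activation                    */
    softmax(y, 2);                        /* logits -> class probabilities */
    printf("predicted class: %d\n", predict(y, 2));
    return 0;
}
```

Compile with `gcc -O2 forward_sketch.c -lm`; a real forward pass would run `conv()` and `pooling()` stages before the dense layer in the same style.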
LightNeuron is optimized for x86-64 architectures, ensuring compatibility and efficiency on a wide range of systems. Below are the specifications of the primary development environment, which can serve as a benchmark for expected performance:
- Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
- Microarchitecture: Comet Lake
- Base frequency: 1.6 GHz
- 4 cores, 2 threads per core
- 16 DP FLOPs/cycle (AVX2 FMA: 2 FMA units * 4 FP64 lanes * 2 FLOPs per FMA)
- Single-core theoretical peak performance = 1.6 GHz * 16 FLOPs/cycle = 25.6 GFLOPS
- Multi-core theoretical peak performance = 25.6 GFLOPS * 4 cores = 102.4 GFLOPS
Ensure your system is ready for LightNeuron by installing the `perf` tool:

```sh
sudo apt-get install linux-tools-$(uname -r) linux-cloud-tools-$(uname -r)
```

Configure your system by adding the following to `/etc/sysctl.conf` (the first line grants unprivileged access to performance events; the second frees the hardware counter used by the NMI watchdog):

```
kernel.perf_event_paranoid = -1
kernel.nmi_watchdog = 0
```

Activate the changes:

```sh
sudo sysctl -p
```
- Clone the repository: `git clone [repository-url]`
- Download the MNIST dataset: `python get_data.py`
- Compile and run the labs: `make lab && ./lab`
Profile GEMM operations with specific targets and cache levels:

```sh
make perf TARGET=[your-target] CACHE_LEVEL=[your-cache-level] USE_PMU=1
```

- Replace `TARGET` with the GEMM implementation to profile (e.g., `matmul_naive`).
- Set `CACHE_LEVEL` to the desired cache level (`L1`, `L2`, or `L3`).
- `USE_PMU=1` activates the Performance Monitoring Unit for detailed hardware-level performance counters.

Example:

```sh
make perf TARGET=matmul_naive CACHE_LEVEL=L1 USE_PMU=1
```
LightNeuron places a strong emphasis on optimizing General Matrix Multiply (GEMM) operations, yielding large improvements in GFLOPS (giga floating-point operations per second) across a range of matrix sizes. Key strategies include:
- Loop Interchange: Reorders nested loops to improve memory access patterns and cache performance (e.g., ijk -> kji; see the first sketch after this list).
- Compiler Optimization Flags: Employs -O2/-O3 levels for code efficiency.
- Parallel Loops: Uses OpenMP directives to distribute loop execution across multiple CPU threads.
- Loop Tiling (Blocking): Optimizes spatial and temporal locality for caches.
- Divide-and-Conquer: Splits large matrices into smaller sub-matrices for better cache performance.
- SIMD Intrinsics with Data Alignment: Uses AVX2 instructions and 32-byte-aligned data to boost vectorized operations and memory throughput (see the second sketch below).
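The first sketch combines three of these strategies in one routine: an interchanged, cache-friendly loop order, OpenMP parallel loops, and tiling. The dimension `N` and tile size `T` are illustrative assumptions; this is a minimal sketch of the techniques, not LightNeuron's actual kernel.

```c
#include <stddef.h>

#define N 1024   /* matrix dimension; assumed divisible by T */
#define T 64     /* tile edge, chosen so the working set fits in cache */

/* C += A * B on row-major N x N doubles.
 * collapse(2) gives each thread whole T x T tiles of C, so no two
 * threads ever update the same element. Within a tile, the i-k-j
 * order streams through rows of B and C, and A[i][k] is hoisted
 * out of the innermost loop for reuse. */
void matmul_tiled(const double *A, const double *B, double *C) {
    #pragma omp parallel for collapse(2)
    for (size_t ii = 0; ii < N; ii += T)
        for (size_t jj = 0; jj < N; jj += T)
            for (size_t kk = 0; kk < N; kk += T)
                for (size_t i = ii; i < ii + T; i++)
                    for (size_t k = kk; k < kk + T; k++) {
                        double a = A[i * N + k];   /* reused across the j loop */
                        for (size_t j = jj; j < jj + T; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

Build with `gcc -O3 -fopenmp`; C must be zero-initialized (or pre-loaded) since the kernel accumulates.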
Together, these enhancements raise CPU computational efficiency considerably; the measurements below quantify the effect.
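As a second sketch, here is an AVX2 fused multiply-add inner kernel illustrating the SIMD-with-alignment strategy: each `_mm256_fmadd_pd` updates four doubles of a row of C at once. It assumes row-major storage, `n` divisible by 4, and 32-byte-aligned buffers (e.g., from `aligned_alloc(32, ...)`); again, an illustration rather than LightNeuron's actual code.

```c
#include <immintrin.h>

/* C += A * B on row-major n x n doubles, vectorized with AVX2 + FMA. */
void matmul_avx2(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            __m256d a = _mm256_set1_pd(A[i * n + k]);  /* broadcast A[i][k] */
            for (int j = 0; j < n; j += 4) {
                __m256d b = _mm256_load_pd(&B[k * n + j]); /* aligned loads */
                __m256d c = _mm256_load_pd(&C[i * n + j]);
                c = _mm256_fmadd_pd(a, b, c);              /* c += a * b   */
                _mm256_store_pd(&C[i * n + j], c);
            }
        }
}
```

Compile with `-mavx2 -mfma`. The measured kernels in the tables below pair this style of inner loop with the tiling or divide-and-conquer structure sketched above.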
The following `perf` counters compare cache behavior across implementations:

| Implementation | Cache References (millions) | L1-d Cache Misses (millions) | LL Cache Misses (millions) |
|---|---|---|---|
| +parallel loops | 4934.44 | 406.47 | 404.90 |
| +tiling | 5010.46 | 620.66 | 13.29 |
| +parallel divide-and-conquer | 1881.06 | 152.97 | 5.21 |
Tiling cuts last-level cache misses by roughly 97% relative to the parallel-loops version (from 404.90 to 13.29 million), and parallel divide-and-conquer further lowers overall cache references while keeping misses low.
The following table reports the GFLOPS performance of each kernel against Intel MKL for 1024x1024 matrices (GFLOPS = 2n^3 / running time, with 2n^3 ≈ 2.15 GFLOP at n = 1024).
| Version | Implementation | Running Time (ms) | Relative Speedup | Absolute Speedup | GFLOPS | Percent of Peak | Percent of Intel MKL |
|---|---|---|---|---|---|---|---|
| v1 | naive | 11190.93 | 1.00 | 1.00 | 0.19 | 0.19% | 0.25% |
| v2 | naive + interchange loops | 4267.47 | 2.62 | 2.62 | 0.50 | 0.49% | 0.65% |
| v3 | naive + interchange loops + optimization flags | 675.76 | 6.32 | 16.56 | 3.18 | 3.10% | 4.08% |
| v4 | naive + interchange loops + optimization flags + parallel loops | 147.87 | 4.57 | 75.68 | 14.52 | 14.18% | 18.62% |
| v5 | naive + interchange loops + optimization flags + parallel tiling | 101.30 | 1.46 | 110.47 | 21.20 | 20.70% | 27.19% |
| v6 | naive + interchange loops + optimization flags + parallel divide-and-conquer | 89.52 | 1.13 | 125.01 | 23.99 | 23.43% | 30.76% |
| v7 | naive + interchange loops + optimization flags + parallel divide-and-conquer + avx2 intrinsics + data alignment | 71.11 | 1.26 | 157.37 | 30.20 | 29.49% | 38.73% |
| v8 | naive + interchange loops + optimization flags + parallel tiling + avx2 intrinsics + data alignment | 62.41 | 1.14 | 179.31 | 34.41 | 33.60% | 44.13% |
| v9 | naive + interchange loops + optimization flags + parallel divide-and-conquer + avx2 intrinsics + data alignment + coarsening | 43.62 | 1.43 | 256.56 | 49.23 | 48.08% | 63.14% |
| v10 | Intel MKL | 27.54 | 1.58 | 406.35 | 77.98 | 76.15% | 100.00% |
v1 denotes the naive implementation, v2 through v9 add the successive enhancements described above, and v10 is the Intel MKL reference.