LightNeuron

LightNeuron is a highly efficient, educational neural network library written in C for x86-64 architectures. It aims to provide insight into neural network mechanics, profiling, and optimization, with a particular focus on the efficiency of General Matrix Multiply (GEMM) operations.

Overview

Targeted primarily at students, researchers, and developers, LightNeuron offers a CNN inference framework capable of processing HDF5 model files, which facilitates integration with models trained in frameworks such as PyTorch and TensorFlow. Key features include:

  • Convolutional Layer Computation (conv())
  • Matrix Multiplication (matmul())
  • Activation Functions (relu())
  • Pooling (pooling())
  • Forward Pass Operations (forwardPass())
  • Feature Extraction and Interpretation
  • Prediction (softmax(), predict())
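
For orientation, here is a minimal sketch of what two of these building blocks typically look like in plain C. This is illustrative only: the exact signatures and memory layout in LightNeuron's sources may differ; row-major float buffers and a zero-initialized output are assumed.

#include <stddef.h>

/* Naive GEMM of the kind matmul() implements: C += A (m x k) * B (k x n),
 * row-major float buffers, C zero-initialized by the caller. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t p = 0; p < k; p++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

/* Element-wise ReLU of the kind relu() applies after a layer. */
void relu_inplace(float *x, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (x[i] < 0.0f)
            x[i] = 0.0f;
}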

[Figure: LightNeuron framework overview]

Development Environment Specifications

LightNeuron is optimized for x86-64 architectures, ensuring compatibility and efficiency on a wide range of systems. The specifications of the primary development environment serve as the reference point for the performance figures reported below.

Prerequisites

Ensure your system is ready for LightNeuron by installing the perf tool:

sudo apt-get install linux-tools-$(uname -r) linux-cloud-tools-$(uname -r)

Configure your system by editing /etc/sysctl.conf. The first setting lets perf access hardware performance events without elevated privileges; the second disables the NMI watchdog, which otherwise reserves one of the PMU counters:

kernel.perf_event_paranoid = -1
kernel.nmi_watchdog = 0

Activate the changes:

sudo sysctl -p

Getting Started

  1. Clone the Repository:

    git clone [repository-url]
  2. Download MNIST Dataset:

    python get_data.py
  3. Compile and Run Labs:

    make lab && ./lab

Performance Profiling

Profile GEMM operations with specific targets and cache levels:

make perf TARGET=[your-target] CACHE_LEVEL=[your-cache-level] USE_PMU=1
  • Replace TARGET with the GEMM implementation to profile (e.g., matmul_naive).
  • Set CACHE_LEVEL to the desired cache level (e.g., L1, L2, L3).

Example:

make perf TARGET=matmul_naive CACHE_LEVEL=L1 USE_PMU=1

USE_PMU=1 activates the CPU's Performance Monitoring Unit (PMU) to collect detailed hardware-level performance counters.
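
For reference, the kind of measurement that USE_PMU=1 enables can also be reproduced directly from C through the Linux perf_event_open interface. The sketch below is illustrative only and is not LightNeuron's profiling code; it counts last-level cache misses around a region of interest.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

/* Thin wrapper: glibc provides no perf_event_open() symbol, only the syscall. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses on most CPUs */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the kernel under test here, e.g. a matmul call ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("LL cache misses: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}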

GEMM Optimization

LightNeuron places a strong emphasis on optimizing General Matrix Multiply (GEMM) operations. These optimizations yield significant performance improvements, measured in GFLOPS (giga floating-point operations per second), across a range of matrix dimensions. The key strategies are:

  • Loop Interchange: Reorders nested loops to improve memory access patterns and cache behavior, e.g., ijk -> kji.
  • Compiler Optimization Flags: Employs -O2/-O3 levels for code efficiency.
  • Parallel Loops: Uses OpenMP directives to distribute loop execution across multiple CPU threads.
  • Loop Tiling (Blocking): Improves spatial and temporal locality in the caches (a combined sketch of interchange, parallel loops, and tiling follows this list).
  • Divide-and-Conquer: Splits large matrices into smaller sub-matrices for better cache performance.
  • SIMD Intrinsics with Data Alignment: Uses AVX2 instructions and aligns data to boost vectorized operations and memory throughput.
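
The sketch below shows how loop interchange, parallel loops, and tiling can be combined. It is an illustration under assumptions, not the repository's kernel: row-major square matrices, n a multiple of TILE (a placeholder tile size), C zero-initialized by the caller, and compilation with -fopenmp; the interchanged ordering keeps j innermost for unit-stride access, whereas the ordering and tile size used in LightNeuron may differ.

#include <stddef.h>

#define TILE 64  /* placeholder tile size; tune per cache level */

/* C += A * B for n x n row-major float matrices. */
void matmul_tiled_parallel(const float *A, const float *B, float *C, size_t n)
{
    #pragma omp parallel for
    for (size_t ii = 0; ii < n; ii += TILE)            /* parallel over row tiles   */
        for (size_t kk = 0; kk < n; kk += TILE)        /* tiles of the k dimension  */
            for (size_t jj = 0; jj < n; jj += TILE)    /* tiles of the j dimension  */
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t k = kk; k < kk + TILE; k++) {
                        float a = A[i * n + k];        /* reused across inner loop  */
                        for (size_t j = jj; j < jj + TILE; j++)
                            C[i * n + j] += a * B[k * n + j];  /* unit stride       */
                    }
}

Threads write disjoint row tiles of C, so no synchronization is needed beyond the parallel-for itself.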

Together, these enhancements raise measured GEMM throughput from 0.19 GFLOPS for the naive kernel to 49.23 GFLOPS for the most optimized LightNeuron variant, as detailed in the benchmark table below.
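
The SIMD bullet can be illustrated in the same spirit. Again, this is an assumption-laden sketch rather than LightNeuron's kernel: it uses AVX2/FMA intrinsics to update eight floats of a C row per instruction with aligned loads and stores, and assumes n is a multiple of 8, buffers obtained with aligned_alloc(32, ...), C zero-initialized, and compilation with -mavx2 -mfma.

#include <immintrin.h>
#include <stddef.h>

/* C += A * B for n x n row-major, 32-byte-aligned float matrices. */
void matmul_avx2(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            __m256 a = _mm256_set1_ps(A[i * n + k]);       /* broadcast A[i][k]    */
            for (size_t j = 0; j < n; j += 8) {
                __m256 b = _mm256_load_ps(&B[k * n + j]);  /* aligned 8-wide load  */
                __m256 c = _mm256_load_ps(&C[i * n + j]);
                c = _mm256_fmadd_ps(a, b, c);              /* c += a * b via FMA   */
                _mm256_store_ps(&C[i * n + j], c);
            }
        }
}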

Implementation                 | Cache References (millions) | L1-d Cache Misses (millions) | LL Cache Misses (millions)
+parallel loops                | 4934.44                     | 406.47                       | 404.9
+tiling                        | 5010.46                     | 620.66                       | 13.29
+parallel divide-and-conquer   | 1881.06                     | 152.97                       | 5.21

Relative to the parallel-loops version, tiling achieves a 96% reduction in last-level cache misses, and parallel divide-and-conquer further reduces overall cache references and cache misses.

Performance Benchmark

The following table shows the GFLOPS achieved by each kernel compared to Intel MKL, at a matrix size of 1200x1200.

Version | Implementation | Running Time (ms) | Relative Speedup | Absolute Speedup | GFLOPS | Percent of Peak | Percent of Intel MKL
v1  | naive | 11190.93 | 1.00 | 1.00 | 0.19 | 0.19% | 0.25%
v2  | naive + interchange loops | 4267.47 | 2.62 | 2.62 | 0.50 | 0.49% | 0.65%
v3  | naive + interchange loops + optimization flags | 675.76 | 6.32 | 16.56 | 3.18 | 3.10% | 4.08%
v4  | naive + interchange loops + optimization flags + parallel loops | 147.87 | 4.57 | 75.68 | 14.52 | 14.18% | 18.62%
v5  | naive + interchange loops + optimization flags + parallel tiling | 101.3 | 1.46 | 110.47 | 21.20 | 20.70% | 27.19%
v6  | naive + interchange loops + optimization flags + parallel divide-and-conquer | 89.52 | 1.13 | 125.01 | 23.99 | 23.43% | 30.76%
v7  | naive + interchange loops + optimization flags + parallel divide-and-conquer + avx2 intrinsics + data alignment | 71.11 | 1.26 | 157.37 | 30.20 | 29.49% | 38.73%
v8  | naive + interchange loops + optimization flags + parallel tiling + avx2 intrinsics + data alignment | 62.41 | 1.14 | 179.31 | 34.41 | 33.60% | 44.13%
v9  | naive + interchange loops + optimization flags + parallel divide-and-conquer + avx2 intrinsics + data alignment + coarsening | 43.62 | 1.43 | 256.56 | 49.23 | 48.08% | 63.14%
v10 | Intel MKL | 27.54 | 1.58 | 406.35 | 77.98 | 76.15% | 100.00%

[Figure: GFLOPS benchmark of kernels v1 through v10]

In the chart, v1 denotes the naive implementation, while v2 through v10 correspond, in order, to the subsequent rows of the table above.
