GitHub - kig/correlate_opencl: Implementation of an image correlation algorithm in OpenCL for benchmark purposes

kig / correlate_opencl Public

Notifications You must be signed in to change notification settings
Fork 2
Star 14

Implementation of an image correlation algorithm in OpenCL for benchmark purposes

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
LICENSE		LICENSE
README		README
build.sh		build.sh
corr.cpp		corr.cpp
corr.ml		corr.ml
corr.py		corr.py
corr_500.cpp		corr_500.cpp
corr_gl.cpp		corr_gl.cpp
corr_naive.cpp		corr_naive.cpp
correlate.cl		correlate.cl
correlate.fs		correlate.fs
correlate.vs		correlate.vs
correlate2.cl		correlate2.cl
correlate_500.cl		correlate_500.cl
correlate_500x500.cl		correlate_500x500.cl
correlate_image.cl		correlate_image.cl
correlate_local_bw.cl		correlate_local_bw.cl
correlate_lockstep_bw.cl		correlate_lockstep_bw.cl
correlate_naive.cl		correlate_naive.cl
correlate_register_bw.cl		correlate_register_bw.cl
memsum.cl		memsum.cl
memsum.cpp		memsum.cpp
sse.h		sse.h

Repository files navigation

An implementation of the algorithm used in the paper 
"Believe it or Not! Multicore CPUs can Match GPUs for FLOP-intensive
Applications!" (http://bit.ly/a1xvZB) for comparing CPU and GPU 
performance on a cache-friendly image correlation algorithm.

And it was a good excuse to learn OpenCL.

Hardware tested:
  ATI Radeon HD 4850, 67.9 GBps memory bandwidth, 384 GBps L2, 480 GBps L1,
    1000 mul-add SP GFLOPS, hobbled by lack of local mem and texture support 
    in ATI's OpenCL drivers, resulting in most reads coming from main memory
  Core 2 Duo E6400, 6.4 GBps memory bandwidth, 18 GBps L2, 90.5 GBps L1,
    37.5 mul-add SP GFLOPS, hobbled by lack of performance
  Athlon II X4 640, 15 GBps memory bandwidth, 50 GBps L2, 250 GBps L1,
    96 mul-add SP GFLOPS, hobbled by small cache (not a problem here though)
    Interesting CPU result differences between the 32-bit and 64-bit versions.

Current results (8 FLOPS per 32 bytes memory read):
  The results from the paper are on 500x500 input images. The paper reported 
  runtimes in seconds, I calculated the bandwidth numbers by dividing 281 GB 
  by the reported runtimes.

  The OpenCL results are measured without the context and kernel creation
  and release times, as those are likely only relevant for a single-run 
  workload (and a 0.7 sec one-time overhead to do a 1 sec operation would
  be a bit skewy). The OpenCL results do include the data write and read times.

  850 GBps (212 GFLOPS) 2x2 blocked GLSL on Radeon HD 6950 (640x640 input)
  772 GBps (193 GFLOPS) 2x2 blocked GLSL on Radeon HD 6950 (500x500 input)
  671 GBps (168 GFLOPS) OpenCL image kernel on Radeon HD 6950 (512x512 input)
  618 GBps (154 GFLOPS) OpenCL image kernel on Radeon HD 6950 (500x500 input)
  339 GBps (85 GFLOPS) 2x2 blocked GLSL on Radeon HD 4850 (500x500 input)
  278 GBps (70 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (640x640 input)
 [276 GBps (69 GFLOPS) the paper's optimized impl on 8-core 3 GHz Power7]
  259 GBps (65 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (512x512 input)
  243 GBps (61 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (500x500 input)
  240 GBps (60 GFLOPS) optimized OpenCL CPU on Athlon II X4 640 (32-bit)
 [230 GBps (58 GFLOPS) the paper's GTX 285 implementation using local mem]
  204 GBps (51 GFLOPS) naive GLSL implementation on Radeon HD 4850
  201 GBps (50 GFLOPS) manually optimized OpenMP SSE on Athlon II X4 640 (64-bit)
 [159 GBps (40 GFLOPS) the paper's GTX 285 implementation]
 [150 GBps (38 GFLOPS) the paper's optimized impl on 2x4-core 2.93 GHz Nehalem]
  125 GBps (31 GFLOPS) optimized OpenMP SSE on Athlon II X4 (64-bit)
  103 GBps (26 GFLOPS) manually optimized OpenMP SSE on Athlon II X4 (32-bit)
   86 GBps (22 GFLOPS) naive OpenCL GPU implementation (1D workgroup)
   84 GBps (21 GFLOPS) optimized OpenCL CPU on Core 2 Duo E6400
   45 GBps (11 GFLOPS) manually optimized OpenMP SSE implementation on C2D
   25 GBps (6 GFLOPS) naive OpenCL GPU implementation (2D workgroup)
   13 GBps (3 GFLOPS) naive OpenMP SSE implementation on Athlon II X4
    5 GBps (1 GFLOPS) naive OpenMP SSE implementation on C2D
    5 GBps (1 GFLOPS) naive scalar C implementation on C2D
    1.1 GBps (0.3 GFLOPS) naive OCaml implementation on C2D
    0.03 GBps (0.008 GFLOPS) naive pure-Python implementation on C2D

Compiling and running:
  export ATISTREAMSDKROOT=path_to_ati_stream_sdk
  export LD_LIBRARY_PATH=$ATISTREAMSDKROOT/lib/x86:$LD_LIBRARY_PATH

  ./build.sh
  ./corr
  ./corr_500
  ./corr_gl
  ./corr_naive
  ./memsum

  ocamlopt -o ocorr corr.ml
  time ./ocorr # divide time by 1.21 to get GBps

  time python corr.py # divide time by 1.21 to get GBps