Skip to content

Latest commit

 

History

History
83 lines (68 loc) · 5.44 KB

AvxPi.md

File metadata and controls

83 lines (68 loc) · 5.44 KB

AvxPi

Serial solution using Avx vector instructions, including fused multiply add. With 256-bit registers this can perform 4 "iterations" at once, which is twice as many as the SSE2 solution. As the problem scales particularly well, we also get about twice the performance.

benchmark name                            samples    iterations          mean
-------------------------------------------------------------------------------
AvxPi                                           10             1    820.518 ms 

As intrinsics are directly used there is little difference in the compiler flags used and below are the high level statistics.

$ perf stat -d ./out/build/linux-gcc-profile/PiBench/PiBench --benchmark-no-analysis --benchmark-samples 10 "[!benchmark][AvxPi]"
 Performance counter stats for './out/build/linux-gcc-profile/PiBench/PiBench --benchmark-no-analysis --benchmark-samples 10 [!benchmark][AvxPi]':

         18,989.80 msec task-clock                       #    1.000 CPUs utilized             
                92      context-switches                 #    4.845 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
           420,156      page-faults                      #   22.125 K/sec                     
   104,838,147,991      cycles                           #    5.521 GHz                         (71.43%)
     8,033,075,338      stalled-cycles-frontend          #    7.66% frontend cycles idle        (71.43%)
   178,612,789,114      instructions                     #    1.70  insn per cycle            
                                                  #    0.04  stalled cycles per insn     (71.43%)
    23,683,097,575      branches                         #    1.247 G/sec                       (71.43%)
        74,687,544      branch-misses                    #    0.32% of all branches             (71.43%)
     6,359,120,038      L1-dcache-loads                  #  334.870 M/sec                       (71.43%)
       114,107,280      L1-dcache-load-misses            #    1.79% of all L1-dcache accesses   (71.43%)
   <not supported>      LLC-loads                                                             
   <not supported>      LLC-load-misses                                                       

      18.991871520 seconds time elapsed

      18.510092000 seconds user
       0.479950000 seconds sys

$ perf stat -M PipelineL2 ./out/build/linux-gcc-profile/PiBench/PiBench --benchmark-no-analysis --benchmark-samples 10 "[!benchmark][AvxPi]"
 Performance counter stats for './out/build/linux-gcc-profile/PiBench/PiBench --benchmark-no-analysis --benchmark-samples 10 [!benchmark][AvxPi]':

        73,431,668      ex_ret_brn_misp                  #      0.2 %  bad_speculation_mispredicts
                                                  #      0.0 %  bad_speculation_pipeline_restarts  (25.00%)
   165,497,956,702      de_src_op_disp.all                                                      (25.00%)
           462,005      resyncs_or_nc_redirects                                                 (25.00%)
   105,036,989,901      ls_not_halted_cyc                                                       (25.00%)
   164,421,923,897      ex_ret_ops                                                              (25.00%)
     5,164,675,949      ex_no_retire.load_not_complete   #     59.6 %  backend_bound_cpu      
                                                  #      4.4 %  backend_bound_memory     (25.00%)
   403,173,685,733      de_no_dispatch_per_slot.backend_stalls                                        (25.00%)
    74,857,151,606      ex_no_retire.not_complete                                               (25.00%)
   105,021,496,039      ls_not_halted_cyc                                                       (25.00%)
     5,458,038,363      ex_ret_ucode_ops                 #     25.2 %  retiring_fastpath      
                                                  #      0.9 %  retiring_microcode       (25.00%)
   105,061,826,471      ls_not_halted_cyc                                                       (25.00%)
   164,437,770,686      ex_ret_ops                                                              (25.00%)
    61,030,838,582      de_no_dispatch_per_slot.no_ops_from_frontend #      2.0 %  frontend_bound_bandwidth  (24.99%)
     8,019,190,455      cpu/de_no_dispatch_per_slot.no_ops_from_frontend,cmask=0x6/ #      7.6 %  frontend_bound_latency   (24.99%)
   105,079,829,837      ls_not_halted_cyc                                                       (24.99%)

      19.015071526 seconds time elapsed

      18.536473000 seconds user
       0.476960000 seconds sys


As expected the bulk of the work is done by vector instructions in the hot loop. Notably there isn't any loop unrolling, which would be intriguing to explore if/how much that would help.

$ perf annotate PiLib::AvxPi --stdio --no-source
 Percent |      Source code & Disassembly of PiBench for cycles:u (36227 samples, percent: local period)
--------------------------------------------------------------------------------------------------------
...
    0.00 :   1cea0:  vmulpd %ymm1,%ymm4,%ymm0
    0.00 :   1cea4:  add    $0x1,%rax
    0.00 :   1cea8:  vaddpd %ymm3,%ymm1,%ymm1
    0.00 :   1ceac:  vfmadd132pd %ymm0,%ymm5,%ymm0
    0.00 :   1ceb1:  vdivpd %ymm0,%ymm3,%ymm0
   46.52 :   1ceb5:  vaddpd %ymm0,%ymm2,%ymm2
   53.48 :   1ceb9:  cmp    %rax,%rdx
    0.00 :   1cebc:  jne    1cea0 <PiLib::AvxPi(long)+0x80>
...