clpeak's enqueueReadBuffer very slow #775

pioto1225 · 2024-11-11T06:27:06Z

Setup: Intel Arc A770 (56a0)
MSI PRO Z890-P WIFI, Intel Core Ultra 9 285K
Debian Trixie (6.11.5 kernel):

clpeak's enqueueReadBuffer performance is very poor, and the non-blocking version actually hangs.

dmesg shows some fence issues:
[ 352.610742] Fence expiration time out i915-0000:04:00.0:clpeak[5952]:a8!
[ 352.729366] Fence expiration time out i915-0000:04:00.0:clpeak[5952]:a6!

enqueueReadBuffer used to report over 8 GBPS on my previous setup (Z690/i9-14900K).


BartusW · 2024-11-15T09:03:48Z

Hello, thanks for the feedback. We are isolating the unexpected performance degradation with Debian.

pioto1225 · 2024-11-15T12:05:14Z

Thanks!
I checked, and the same thing happens on Ubuntu 24.04; please see the logs below.

$ uname -a
Linux pioto-pc 6.8.0-48-generic #48-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 14:04:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.1 LTS
Release:	24.04
Codename:	noble

$ clpeak

Platform: Intel(R) OpenCL Graphics
Device: Intel(R) Arc(TM) A770 Graphics
Driver version : 24.39.31294.12 (Linux x64)
Compute units : 512
Clock frequency : 2400 MHz

Global memory bandwidth (GBPS)
  float   : 396.60
  float2  : 402.80
  float4  : 405.45
  float8  : 410.53
  float16 : 415.86

Single-precision compute (GFLOPS)
  float   : 13007.47
  float2  : 11129.21
  float4  : 10398.92
  float8  : 10019.28
  float16 : 9692.58

Half-precision compute (GFLOPS)
  half   : 19535.93
  half2  : 19471.56
  half4  : 19512.96
  half8  : 19439.48
  half16 : 19321.75

No double precision support! Skipped

Integer compute (GIOPS)
  int   : 5455.01
  int2  : 5448.29
  int4  : 5434.15
  int8  : 5402.65
  int16 : 5369.13

Integer compute Fast 24bit (GIOPS)
  int   : 5440.72
  int2  : 5441.88
  int4  : 5430.21
  int8  : 5401.83
  int16 : 5378.00

Transfer bandwidth (GBPS)
  enqueueWriteBuffer              : 15.58
  enqueueReadBuffer               : 1.67
  enqueueWriteBuffer non-blocking : 18.00
  enqueueReadBuffer non-blocking  :

It never gets beyond the enqueueReadBuffer non-blocking test.

dmesg (clpeak was run twice):

$ sudo dmesg | grep i915
[ 3.957174] i915 0000:04:00.0: [drm] VT-d active for gfx access
[ 4.004342] i915 0000:04:00.0: vgaarb: deactivate vga console
[ 4.004799] i915 0000:04:00.0: [drm] Local memory IO size: 0x00000003fa000000
[ 4.004801] i915 0000:04:00.0: [drm] Local memory available: 0x00000003fa000000
[ 4.018474] i915 0000:04:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 4.022317] i915 0000:04:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[ 4.042430] i915 0000:04:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
[ 4.042433] i915 0000:04:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[ 4.050146] i915 0000:04:00.0: [drm] GT0: GUC: submission enabled
[ 4.050148] i915 0000:04:00.0: [drm] GT0: GUC: SLPC enabled
[ 4.050365] i915 0000:04:00.0: [drm] GT0: GUC: RC enabled
[ 4.063814] i915 0000:04:00.0: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 4.077969] [drm] Initialized i915 1.6.0 20230929 for 0000:04:00.0 on minor 1
[ 4.078531] i915 display info: display version: 13
[ 4.078532] i915 display info: cursor_needs_physical: no
[ 4.078537] i915 display info: has_cdclk_crawl: no
[ 4.078538] i915 display info: has_cdclk_squash: yes
[ 4.078538] i915 display info: has_ddi: yes
[ 4.078538] i915 display info: has_dp_mst: yes
[ 4.078539] i915 display info: has_dsb: yes
[ 4.078539] i915 display info: has_fpga_dbg: yes
[ 4.078539] i915 display info: has_gmch: no
[ 4.078540] i915 display info: has_hotplug: yes
[ 4.078540] i915 display info: has_hti: no
[ 4.078540] i915 display info: has_ipc: yes
[ 4.078540] i915 display info: has_overlay: no
[ 4.078541] i915 display info: has_psr: yes
[ 4.078541] i915 display info: has_psr_hw_tracking: no
[ 4.078541] i915 display info: overlay_needs_physical: no
[ 4.078542] i915 display info: supports_tv: no
[ 4.078542] i915 display info: has_hdcp: yes
[ 4.078542] i915 display info: has_dmc: yes
[ 4.078542] i915 display info: has_dsc: yes
[ 4.115701] snd_hda_intel 0000:05:00.0: bound 0000:04:00.0 (ops i915_audio_component_bind_ops [i915])
[ 4.115836] fbcon: i915drmfb (fb0) is primary device
[ 4.115837] i915 0000:04:00.0: [drm] fb0: i915drmfb frame buffer device
[ 4.696427] i915 0000:04:00.0: [drm] GT0: HuC: authenticated for all workloads
[ 4.696434] mei_pxp i915.mei-gsc.1024-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:04:00.0 (ops i915_pxp_tee_component_ops [i915])
[ 33.797738] i915 0000:04:00.0: Using 41-bit DMA addresses
[ 87.555553] Fence expiration time out i915-0000:04:00.0:clpeak[2947]:a8!
[ 87.685324] Fence expiration time out i915-0000:04:00.0:clpeak[2947]:a6!
[ 400.879244] Fence expiration time out i915-0000:04:00.0:clpeak[5008]:aa!
[ 401.005055] Fence expiration time out i915-0000:04:00.0:clpeak[5008]:a8!

Ubuntu 24.10 is the same.

Until last week I was running the A770 with an i9-14900K and never saw this issue. The problem surfaced after I swapped the CPU for a Core Ultra 9 285K, though that could be a coincidence.
I did notice this bug: #726, but the issue I am having is also happening on the latest 6.12-rc kernels.

pioto1225 · 2024-11-24T16:46:34Z

I did a little investigation, and the issue is related to host buffer alignment, for D2H transfers only (the H2D direction does not seem to care). I suppose it is not really related to Arc graphics, the NEO driver, or the operating system, but to the host CPU and how the application allocates its buffers.

Raptor Lake already showed worse D2H than H2D throughput in clpeak, but it got much worse with Arrow Lake:

i9-14900k: ~8GBps D2H/~20GBps H2D
Core Ultra 9 285K: ~1.6GBps D2H/~20GBps H2D

strace shows that H2D transfers in clpeak complete promptly (about 100 ms each), while D2H transfers are so slow that after 0.5 s NEO starts checking whether the GPU has been reset (DRM_IOCTL_I915_GET_RESET_STATS).

$ strace -tt clpeak -p 0 --transfer-bandwidth 2>&1 | grep -e 'DRM' -e 'enque'
...
15:55:28.964476 write(1, "      enqueueWriteBuffer        "..., 40      enqueueWriteBuffer              : ) = 40
...
15:55:29.630382 ioctl(4, DRM_IOCTL_I915_GEM_USERPTR, 0x7fff79b97880) = 0
15:55:29.630392 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b97530) = 0
15:55:29.636631 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b92f80) = 0
15:55:29.738278 ioctl(4, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff79b93490) = 0
15:55:29.738290 ioctl(4, DRM_IOCTL_GEM_CLOSE, 0x7fff79b93490) = 0
15:55:29.738308 ioctl(4, DRM_IOCTL_I915_GEM_USERPTR, 0x7fff79b97880) = 0
15:55:29.738318 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b97530) = 0
15:55:29.744423 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b92f80) = 0
15:55:29.845263 ioctl(4, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff79b93490) = 0
15:55:29.845282 ioctl(4, DRM_IOCTL_GEM_CLOSE, 0x7fff79b93490) = 0
...
15:55:31.262395 write(1, "      enqueueReadBuffer         "..., 40      enqueueReadBuffer               : ) = 40
15:55:31.262419 ioctl(4, DRM_IOCTL_I915_GEM_USERPTR, 0x7fff79b97460) = 0
15:55:31.262433 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b97110) = 0
15:55:31.268495 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b92b60) = 0
15:55:31.768556 ioctl(4, DRM_IOCTL_I915_GET_RESET_STATS, 0x7fff79b92ff0) = 0
15:55:32.268568 ioctl(4, DRM_IOCTL_I915_GET_RESET_STATS, 0x7fff79b92ff0) = 0
15:55:32.536409 ioctl(4, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff79b93070) = 0
15:55:32.536425 ioctl(4, DRM_IOCTL_GEM_CLOSE, 0x7fff79b93070) = 0
15:55:32.536459 ioctl(4, DRM_IOCTL_I915_GEM_USERPTR, 0x7fff79b97460) = 0
15:55:32.536471 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b97110) = 0
15:55:32.542946 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b92b60) = 0
15:55:33.043000 ioctl(4, DRM_IOCTL_I915_GET_RESET_STATS, 0x7fff79b92ff0) = 0
15:55:33.543013 ioctl(4, DRM_IOCTL_I915_GET_RESET_STATS, 0x7fff79b92ff0) = 0
15:55:33.810814 ioctl(4, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff79b93070) = 0
15:55:33.810845 ioctl(4, DRM_IOCTL_GEM_CLOSE, 0x7fff79b93070) = 0
....
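
For reference, the per-transfer times can be read off the strace timestamps above. A minimal C++ sketch, assuming the interval between the second DRM_IOCTL_I915_GEM_EXECBUFFER2 and the return of DRM_IOCTL_I915_GEM_WAIT approximates one transfer. `parseStraceTime` and `elapsedSeconds` are hypothetical helper names, not part of clpeak or NEO:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: convert an `strace -tt` timestamp ("HH:MM:SS.uuuuuu")
// to seconds since midnight. Returns -1.0 if the string does not parse.
inline double parseStraceTime(const std::string& ts) {
    int hours = 0;
    int minutes = 0;
    double seconds = 0.0;
    if (std::sscanf(ts.c_str(), "%d:%d:%lf", &hours, &minutes, &seconds) != 3) {
        return -1.0;
    }
    return hours * 3600.0 + minutes * 60.0 + seconds;
}

// Hypothetical helper: elapsed seconds between two timestamps taken on the
// same day. No handling of midnight wrap-around or malformed input.
inline double elapsedSeconds(const std::string& start, const std::string& end) {
    return parseStraceTime(end) - parseStraceTime(start);
}
```

With the timestamps above, `elapsedSeconds("15:55:29.636631", "15:55:29.738278")` gives about 0.10 s for one H2D iteration, and `elapsedSeconds("15:55:31.268495", "15:55:32.536409")` gives about 1.27 s for one D2H iteration, roughly a 12x gap.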

So, as mentioned earlier, upstream clpeak with a 16-byte-aligned host buffer does 1.6 GBps D2H on Arrow Lake:

Platform: Intel(R) OpenCL Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 24.39.31294.12 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 20.31
      enqueueReadBuffer               : 1.69
      enqueueWriteBuffer non-blocking : 21.36
      enqueueReadBuffer non-blocking  : ^C

Level Zero (L0) tests show healthy D2H transfers matching H2D (the L0 tests seem to use page-aligned buffers).

using blitter (I believe that is what OpenCL is using):

$ ./ze_bandwidth -t d2h -i 10 -g 1 -s 2147483648 -w 2
...
--------------------------------------------------------------------------------
Device->Host
	[Device 0 2147483648]:  BW = 22.441546 GBPS  Latency =  95692.32 usec
[Total    2147483648]:  BW = 22.439343 GBPS  Latency =  95701.72 usec
--------------------------------------------------------------------------------

and (somewhat worse) using compute engine (not relevant):

$ ./ze_bandwidth -t d2h -i 10 -g 0 -s 2147483648 -w 2
...
--------------------------------------------------------------------------------
Device->Host
	[Device 0 2147483648]:  BW = 13.989714 GBPS  Latency = 153504.48 usec
[Total    2147483648]:  BW = 13.989035 GBPS  Latency = 153511.92 usec
--------------------------------------------------------------------------------
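
The BW figures reported by ze_bandwidth are consistent with transfer size divided by latency, taking 1 GB as 10^9 bytes. A small C++ sketch of that arithmetic (`bandwidthGBps` is a hypothetical helper, not a ze_bandwidth function):

```cpp
#include <cassert>

// Hypothetical helper: bandwidth in GB/s (1 GB = 1e9 bytes) for a transfer of
// `bytes` bytes that took `usec` microseconds.
inline double bandwidthGBps(double bytes, double usec) {
    assert(usec > 0.0);
    const double seconds = usec * 1e-6;
    return bytes / seconds / 1e9;
}
```

For the two runs above, `bandwidthGBps(2147483648.0, 95692.32)` comes out at about 22.44 and `bandwidthGBps(2147483648.0, 153504.48)` at about 13.99, matching the reported per-device values.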

Forcing 64B host buffer alignment in clpeak fixes D2H performance on Arrow Lake, which then matches the L0 tests.

$ ./clpeak  --transfer-bandwidth

Platform: Intel(R) OpenCL Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 24.39.31294.12 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 20.51
      enqueueReadBuffer               : 21.02
      enqueueWriteBuffer non-blocking : 21.83
      enqueueReadBuffer non-blocking  : 22.36
      enqueueMapBuffer(for read)      : 20.37
        memcpy from mapped ptr        : 26.28
      enqueueUnmap(after write)       : 21.91
        memcpy to mapped ptr          : 26.92
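
For anyone hitting the same thing in their own application, here is a minimal sketch of allocating a 64-byte-aligned host buffer in C++17. It is an illustration only and not necessarily how the clpeak change is implemented; `roundUp`, `allocHostBuffer` and `isAligned` are hypothetical names. `std::aligned_alloc` expects the size to be a multiple of the alignment, hence the rounding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: round `size` up to the next multiple of `alignment`.
// `alignment` must be a non-zero power of two.
inline std::size_t roundUp(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: allocate a host buffer whose start address is a
// multiple of `alignment` (64 bytes by default). Returns nullptr on failure.
// Release the buffer with std::free().
inline void* allocHostBuffer(std::size_t size, std::size_t alignment = 64) {
    return std::aligned_alloc(alignment, roundUp(size, alignment));
}

// Hypothetical helper: check whether a pointer is aligned to `alignment` bytes.
inline bool isAligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

The returned pointer can then be used as the host pointer for enqueueReadBuffer, and `isAligned(ptr, 64)` is a quick way to check whether an existing allocation already meets the requirement.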

This was referenced Nov 24, 2024

Use stricter host buffer alignment (64B) required by modern CPUs. krrishnarraj/clpeak#121 (Merged)

Use stricter host buffer alignment (64B) required by modern CPUs. ProjectPhysX/OpenCL-Benchmark#19 (Open)
