clpeak's enqueueReadBuffer very slow #775

Open
pioto1225 opened this issue Nov 11, 2024 · 3 comments

Comments


pioto1225 commented Nov 11, 2024

Setup: Intel Arc A770 (56a0)
MSI PRO Z890-P WIFI, Intel Core Ultra 9 285K
Debian Trixie (6.11.5 kernel):

clpeak's enqueueReadBuffer performance is very poor, and the non-blocking version actually hangs.

dmesg shows some fence issues:
[ 352.610742] Fence expiration time out i915-0000:04:00.0:clpeak[5952]:a8!
[ 352.729366] Fence expiration time out i915-0000:04:00.0:clpeak[5952]:a6!
It used to report over 8 GBPS (Z690/i9-14900K).

Contributor

BartusW commented Nov 15, 2024

Hello, thanks for the feedback. We are isolating the unexpected performance degradation on Debian.

Author

pioto1225 commented Nov 15, 2024

Thanks!
I checked, and the same thing happens on Ubuntu 24.04; please see the logs below.

$ uname -a
Linux pioto-pc 6.8.0-48-generic #48-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 14:04:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.1 LTS
Release:	24.04
Codename:	noble

$ clpeak

Platform: Intel(R) OpenCL Graphics
Device: Intel(R) Arc(TM) A770 Graphics
Driver version : 24.39.31294.12 (Linux x64)
Compute units : 512
Clock frequency : 2400 MHz

Global memory bandwidth (GBPS)
  float   : 396.60
  float2  : 402.80
  float4  : 405.45
  float8  : 410.53
  float16 : 415.86

Single-precision compute (GFLOPS)
  float   : 13007.47
  float2  : 11129.21
  float4  : 10398.92
  float8  : 10019.28
  float16 : 9692.58

Half-precision compute (GFLOPS)
  half   : 19535.93
  half2  : 19471.56
  half4  : 19512.96
  half8  : 19439.48
  half16 : 19321.75

No double precision support! Skipped

Integer compute (GIOPS)
  int   : 5455.01
  int2  : 5448.29
  int4  : 5434.15
  int8  : 5402.65
  int16 : 5369.13

Integer compute Fast 24bit (GIOPS)
  int   : 5440.72
  int2  : 5441.88
  int4  : 5430.21
  int8  : 5401.83
  int16 : 5378.00

Transfer bandwidth (GBPS)
  enqueueWriteBuffer              : 15.58
  enqueueReadBuffer               : 1.67
  enqueueWriteBuffer non-blocking : 18.00
  enqueueReadBuffer non-blocking  :

It never gets past the enqueueReadBuffer non-blocking test.

dmesg (clpeak was run twice, hence the two PIDs in the fence messages):

$ sudo dmesg | grep i915
[ 3.957174] i915 0000:04:00.0: [drm] VT-d active for gfx access
[ 4.004342] i915 0000:04:00.0: vgaarb: deactivate vga console
[ 4.004799] i915 0000:04:00.0: [drm] Local memory IO size: 0x00000003fa000000
[ 4.004801] i915 0000:04:00.0: [drm] Local memory available: 0x00000003fa000000
[ 4.018474] i915 0000:04:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 4.022317] i915 0000:04:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[ 4.042430] i915 0000:04:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
[ 4.042433] i915 0000:04:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[ 4.050146] i915 0000:04:00.0: [drm] GT0: GUC: submission enabled
[ 4.050148] i915 0000:04:00.0: [drm] GT0: GUC: SLPC enabled
[ 4.050365] i915 0000:04:00.0: [drm] GT0: GUC: RC enabled
[ 4.063814] i915 0000:04:00.0: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 4.077969] [drm] Initialized i915 1.6.0 20230929 for 0000:04:00.0 on minor 1
[ 4.078531] i915 display info: display version: 13
[ 4.078532] i915 display info: cursor_needs_physical: no
[ 4.078537] i915 display info: has_cdclk_crawl: no
[ 4.078538] i915 display info: has_cdclk_squash: yes
[ 4.078538] i915 display info: has_ddi: yes
[ 4.078538] i915 display info: has_dp_mst: yes
[ 4.078539] i915 display info: has_dsb: yes
[ 4.078539] i915 display info: has_fpga_dbg: yes
[ 4.078539] i915 display info: has_gmch: no
[ 4.078540] i915 display info: has_hotplug: yes
[ 4.078540] i915 display info: has_hti: no
[ 4.078540] i915 display info: has_ipc: yes
[ 4.078540] i915 display info: has_overlay: no
[ 4.078541] i915 display info: has_psr: yes
[ 4.078541] i915 display info: has_psr_hw_tracking: no
[ 4.078541] i915 display info: overlay_needs_physical: no
[ 4.078542] i915 display info: supports_tv: no
[ 4.078542] i915 display info: has_hdcp: yes
[ 4.078542] i915 display info: has_dmc: yes
[ 4.078542] i915 display info: has_dsc: yes
[ 4.115701] snd_hda_intel 0000:05:00.0: bound 0000:04:00.0 (ops i915_audio_component_bind_ops [i915])
[ 4.115836] fbcon: i915drmfb (fb0) is primary device
[ 4.115837] i915 0000:04:00.0: [drm] fb0: i915drmfb frame buffer device
[ 4.696427] i915 0000:04:00.0: [drm] GT0: HuC: authenticated for all workloads
[ 4.696434] mei_pxp i915.mei-gsc.1024-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:04:00.0 (ops i915_pxp_tee_component_ops [i915])
[ 33.797738] i915 0000:04:00.0: Using 41-bit DMA addresses
[ 87.555553] Fence expiration time out i915-0000:04:00.0:clpeak[2947]:a8!
[ 87.685324] Fence expiration time out i915-0000:04:00.0:clpeak[2947]:a6!
[ 400.879244] Fence expiration time out i915-0000:04:00.0:clpeak[5008]:aa!
[ 401.005055] Fence expiration time out i915-0000:04:00.0:clpeak[5008]:a8!

Ubuntu 24.10 is the same.

Until last week I was running the A770 with an i9-14900K and never saw this issue. The problem surfaced after I swapped the CPU for a Core Ultra 9 285K, but I do not know for sure; it could be a coincidence.
I did notice bug #726, but the issue I am having also happens on the latest 6.12-rc kernels.

Author

pioto1225 commented Nov 24, 2024

I did a little investigation, and the issue has to do with host buffer alignment, for D2H transfers only (the H2D direction does not seem to care). I suppose it is not really related to Arc graphics, the NEO driver, or the operating system, but to the host CPU and how the application allocates its buffers.
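To check what alignment a host pointer actually has before it is handed to enqueueReadBuffer, something like the following can be used. This is only a diagnostic sketch: `is_aligned` and `pointer_alignment` are hypothetical helper names, not part of clpeak or the OpenCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `ptr` is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical helper: largest power-of-two alignment that `ptr` satisfies,
// capped at `cap` bytes (4096 by default, a common page size).
inline std::size_t pointer_alignment(const void* ptr, std::size_t cap = 4096) {
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    std::size_t a = 1;
    // Double `a` while the address is still a multiple of the next power of two.
    while (a < cap && (addr & ((a << 1) - 1)) == 0) {
        a <<= 1;
    }
    return a;
}
```

Printing `pointer_alignment(host_ptr)` right before the read call would show whether a given run is using a 16-byte, 64-byte, or page-aligned buffer.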

Raptor Lake already did worse at D2H than at H2D in clpeak, but it got much worse with Arrow Lake:

  • i9-14900K: ~8 GBps D2H / ~20 GBps H2D
  • Core Ultra 9 285K: ~1.6 GBps D2H / ~20 GBps H2D

strace shows that each H2D transfer in clpeak completes promptly (about 100 ms), while D2H transfers are so slow that after 0.5 s NEO starts checking whether the GPU has been reset (DRM_IOCTL_I915_GET_RESET_STATS).

$ strace -tt clpeak -p 0 --transfer-bandwidth 2>&1 | grep -e 'DRM' -e 'enque'
...
15:55:28.964476 write(1, "      enqueueWriteBuffer        "..., 40      enqueueWriteBuffer              : ) = 40
...
15:55:29.630382 ioctl(4, DRM_IOCTL_I915_GEM_USERPTR, 0x7fff79b97880) = 0
15:55:29.630392 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b97530) = 0
15:55:29.636631 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b92f80) = 0
15:55:29.738278 ioctl(4, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff79b93490) = 0
15:55:29.738290 ioctl(4, DRM_IOCTL_GEM_CLOSE, 0x7fff79b93490) = 0
15:55:29.738308 ioctl(4, DRM_IOCTL_I915_GEM_USERPTR, 0x7fff79b97880) = 0
15:55:29.738318 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b97530) = 0
15:55:29.744423 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b92f80) = 0
15:55:29.845263 ioctl(4, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff79b93490) = 0
15:55:29.845282 ioctl(4, DRM_IOCTL_GEM_CLOSE, 0x7fff79b93490) = 0
...
15:55:31.262395 write(1, "      enqueueReadBuffer         "..., 40      enqueueReadBuffer               : ) = 40
15:55:31.262419 ioctl(4, DRM_IOCTL_I915_GEM_USERPTR, 0x7fff79b97460) = 0
15:55:31.262433 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b97110) = 0
15:55:31.268495 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b92b60) = 0
15:55:31.768556 ioctl(4, DRM_IOCTL_I915_GET_RESET_STATS, 0x7fff79b92ff0) = 0
15:55:32.268568 ioctl(4, DRM_IOCTL_I915_GET_RESET_STATS, 0x7fff79b92ff0) = 0
15:55:32.536409 ioctl(4, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff79b93070) = 0
15:55:32.536425 ioctl(4, DRM_IOCTL_GEM_CLOSE, 0x7fff79b93070) = 0
15:55:32.536459 ioctl(4, DRM_IOCTL_I915_GEM_USERPTR, 0x7fff79b97460) = 0
15:55:32.536471 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b97110) = 0
15:55:32.542946 ioctl(4, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7fff79b92b60) = 0
15:55:33.043000 ioctl(4, DRM_IOCTL_I915_GET_RESET_STATS, 0x7fff79b92ff0) = 0
15:55:33.543013 ioctl(4, DRM_IOCTL_I915_GET_RESET_STATS, 0x7fff79b92ff0) = 0
15:55:33.810814 ioctl(4, DRM_IOCTL_I915_GEM_WAIT or DRM_IOCTL_RADEON_GEM_OP, 0x7fff79b93070) = 0
15:55:33.810845 ioctl(4, DRM_IOCTL_GEM_CLOSE, 0x7fff79b93070) = 0
....
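The per-transfer times can be read off the trace above by subtracting timestamps. The sketch below assumes that the interval between the second DRM_IOCTL_I915_GEM_EXECBUFFER2 and the following DRM_IOCTL_I915_GEM_WAIT is a rough proxy for one transfer; `strace_seconds` and `elapsed_seconds` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Converts an `strace -tt` timestamp ("HH:MM:SS.uuuuuu") to seconds since midnight.
inline double strace_seconds(const std::string& ts) {
    int h = 0;
    int m = 0;
    double s = 0.0;
    const int n = std::sscanf(ts.c_str(), "%d:%d:%lf", &h, &m, &s);
    assert(n == 3);  // all three fields must parse
    (void)n;         // silence unused warning when NDEBUG is defined
    return h * 3600.0 + m * 60.0 + s;
}

// Elapsed seconds between two timestamps taken on the same day.
inline double elapsed_seconds(const std::string& start, const std::string& end) {
    return strace_seconds(end) - strace_seconds(start);
}
```

With the timestamps from the trace, `elapsed_seconds("15:55:29.636631", "15:55:29.738278")` gives about 0.10 s for an H2D transfer, and `elapsed_seconds("15:55:31.268495", "15:55:32.536409")` gives about 1.27 s for a D2H transfer. That is roughly a 12x gap, in line with the ~20 GBps versus ~1.6 GBps figures.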

So, as mentioned earlier, upstream clpeak with a 16-byte-aligned host buffer manages only about 1.6 GBps D2H on Arrow Lake:

Platform: Intel(R) OpenCL Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 24.39.31294.12 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 20.31
      enqueueReadBuffer               : 1.69
      enqueueWriteBuffer non-blocking : 21.36
      enqueueReadBuffer non-blocking  : ^C

Level Zero (L0) tests show healthy D2H transfers that match H2D (the L0 tests seem to use page-aligned buffers).

Using the blitter (I believe that is what OpenCL is using):

$ ./ze_bandwidth -t d2h -i 10 -g 1 -s 2147483648 -w 2
...
--------------------------------------------------------------------------------
Device->Host
	[Device 0 2147483648]:  BW = 22.441546 GBPS  Latency =  95692.32 usec
[Total    2147483648]:  BW = 22.439343 GBPS  Latency =  95701.72 usec
--------------------------------------------------------------------------------

And, somewhat worse, using the compute engine (not relevant here):

$ ./ze_bandwidth -t d2h -i 10 -g 0 -s 2147483648 -w 2
...
--------------------------------------------------------------------------------
Device->Host
	[Device 0 2147483648]:  BW = 13.989714 GBPS  Latency = 153504.48 usec
[Total    2147483648]:  BW = 13.989035 GBPS  Latency = 153511.92 usec
--------------------------------------------------------------------------------
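As a sanity check on the two ze_bandwidth runs above, the reported bandwidth is consistent with the transfer size divided by the reported latency, with 1 GB taken as 10^9 bytes. A minimal sketch (`bandwidth_gbps` is a hypothetical helper, not part of ze_bandwidth):

```cpp
#include <cassert>
#include <cstdint>

// Bandwidth in decimal GB/s (1 GB = 1e9 bytes) for `bytes` moved
// in `latency_usec` microseconds.
inline double bandwidth_gbps(std::uint64_t bytes, double latency_usec) {
    assert(latency_usec > 0.0);
    return static_cast<double>(bytes) / (latency_usec * 1e-6) / 1e9;
}
```

`bandwidth_gbps(2147483648, 95692.32)` comes out at about 22.44 for the blitter run, and `bandwidth_gbps(2147483648, 153504.48)` at about 13.99 for the compute-engine run, matching the BW columns.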

Forcing 64-byte host buffer alignment in clpeak fixes the Arrow Lake D2H performance, which then matches the L0 tests.

$ ./clpeak  --transfer-bandwidth

Platform: Intel(R) OpenCL Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 24.39.31294.12 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 20.51
      enqueueReadBuffer               : 21.02
      enqueueWriteBuffer non-blocking : 21.83
      enqueueReadBuffer non-blocking  : 22.36
      enqueueMapBuffer(for read)      : 20.37
        memcpy from mapped ptr        : 26.28
      enqueueUnmap(after write)       : 21.91
        memcpy to mapped ptr          : 26.92
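The exact clpeak change is not shown in this thread. One way to get a 64-byte-aligned host buffer in C++17 is `std::aligned_alloc`, sketched below; `alloc_aligned_floats`, `AlignedBuffer`, and `FreeDeleter` are hypothetical names, and this is an assumption about how such a fix could look, not the actual patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Releases memory obtained from std::aligned_alloc.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};
using AlignedBuffer = std::unique_ptr<float[], FreeDeleter>;

// Allocates room for `count` floats starting at an address that is a
// multiple of `alignment` bytes (a power of two, 64 by default).
// Returns an empty pointer if the allocation fails.
inline AlignedBuffer alloc_aligned_floats(std::size_t count,
                                          std::size_t alignment = 64) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    const std::size_t bytes = count * sizeof(float);
    const std::size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    void* p = std::aligned_alloc(alignment, rounded);
    return AlignedBuffer(static_cast<float*>(p));
}
```

The resulting `buf.get()` would then be passed as the host pointer to the read call in place of a default-aligned allocation. Passing 4096 as the alignment would mimic the page-aligned buffers that the L0 tests appear to use.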
