Bug 108625

Summary: AMDGPU - Can't even get Xorg to start - Kernel driver hangs with ring buffer timeout on ARM64
Product: DRI
Reporter: Carsten Haitzler <raster>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
Severity: blocker
Priority: medium
CC: pbrobinson
Version: unspecified
Hardware: ARM
OS: Linux (All)
Attachments:
  log - dmesg
  log - xorg
  log - xorg - gdb attach + bt

Description Carsten Haitzler 2018-11-01 15:59:10 UTC
So we're going to have fun with this one...

Start Xorg. It hangs in screen setup:

  #0  ioctl () at ../sysdeps/unix/sysv/linux/aarch64/ioctl.S:25
  #1  0x0000ffffbb149334 in drmIoctl () from /lib/aarch64-linux-gnu/libdrm.so.2
  #2  0x0000ffffba5166b4 in amdgpu_cs_query_fence_status () from /lib/aarch64-linux-gnu/libdrm_amdgpu.so.1
  #3  0x0000ffffb9ef37f8 in ?? () from /usr/lib/aarch64-linux-gnu/dri/radeonsi_dri.so
  #4  0x0000ffffb9dd148c in ?? () from /usr/lib/aarch64-linux-gnu/dri/radeonsi_dri.so
  #5  0x0000ffffb993d448 in ?? () from /usr/lib/aarch64-linux-gnu/dri/radeonsi_dri.so
  #6  0x0000ffffb993d4ac in ?? () from /usr/lib/aarch64-linux-gnu/dri/radeonsi_dri.so
  #7  0x0000ffffba54425c in ?? () from /usr/lib/xorg/modules/drivers/amdgpu_drv.so
  #8  0x0000ffffba537ca8 in ?? () from /usr/lib/xorg/modules/drivers/amdgpu_drv.so
  #9  0x0000aaaae7133348 in MapWindow ()
  #10 0x0000aaaae710c820 in ?? ()
  #11 0x0000ffffbad52720 in __libc_start_main (main=0x0, argc=0, argv=0x0, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:310

And that ioctl hangs because of:

  [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=10, last emitted seq=11
  [drm] GPU recovery disabled.

The amdgpu kernel driver reports:

  [drm] amdgpu kernel modesetting enabled.
  amdgpu 0000:89:00.0: enabling device (0100 -> 0102)
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_mc.bin
  amdgpu 0000:89:00.0: BAR 2: releasing [mem 0x14010000000-0x140101fffff 64bit pref]
  amdgpu 0000:89:00.0: BAR 0: releasing [mem 0x14000000000-0x1400fffffff 64bit pref]
  amdgpu 0000:89:00.0: BAR 0: assigned [mem 0x14000000000-0x140ffffffff 64bit pref]
  amdgpu 0000:89:00.0: BAR 2: assigned [mem 0x14100000000-0x141001fffff 64bit pref]
  amdgpu 0000:89:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
  amdgpu 0000:89:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
  [drm] amdgpu: 4096M of VRAM memory ready
  [drm] amdgpu: 4096M of GTT memory ready.
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_pfp_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_me_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_ce_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_rlc.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_mec_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_mec2_2.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_sdma.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_sdma1.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_uvd.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_vce.bin
  amdgpu 0000:89:00.0: firmware: direct-loading firmware amdgpu/polaris11_k_smc.bin
  [drm] Initialized amdgpu 3.26.0 20150101 for 0000:89:00.0 on minor 1
  amdgpu 0000:89:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none

So here is where the fun begins. Kernel is:

  Linux noisy 4.18.0-2-arm64 #1 SMP Debian 4.18.10-2 (2018-10-07) aarch64 GNU/Linux

It's Debian unstable on a Cavium ThunderX2 64-bit ARM system (2 CPUs with 32 cores each; 256 hardware threads total with 4-way SMT enabled) with a bunch of PCIE slots. There is an Nvidia card that works... to a decent degree, and an on-board PCIE dumb-framebuffer display device (ASPEED), but I'd rather have a more open stack. I've fiddled with xorg configs to get it to ignore devices other than the AMD one, like with:

  Section "ServerFlags"
         Option "AutoAddGPU" "false"
  EndSection
  
  Section "Device"
         Identifier "amdgpu"
         Driver "amdgpu"
         BusID "PCI:137:0:0"
         Option "DRI" "2"
         Option "TearFree" "on"
  EndSection

I've even put the AMD card in the same slot as the Nvidia one with the same results, so it doesn't seem to be a slot-specific issue. So where should I start poking to see where this very early ring gfx timeout is originating from, specifically? I'm willing to start the fun of compiling kernels etc. to dig through this. So how can I help solve this and make AMD cards portable and usable? :)
Comment 1 Alex Deucher 2018-11-01 17:55:44 UTC
Please attach your full dmesg output and xorg log if using X.
Comment 2 Carsten Haitzler 2018-11-02 12:14:57 UTC
Created attachment 142337 [details]
log - dmesg
Comment 3 Carsten Haitzler 2018-11-02 12:15:14 UTC
Created attachment 142338 [details]
log - xorg
Comment 4 Carsten Haitzler 2018-11-02 12:15:44 UTC
Created attachment 142339 [details]
log - xorg - gdb attach + bt
Comment 5 Carsten Haitzler 2018-11-02 12:16:40 UTC
Attached them (too big to put inline as comments).
Comment 6 Alex Deucher 2018-11-02 18:41:06 UTC
It looks like something submitted by mesa caused a GPU hang.  You might try starting a bare X server and trying some simple OGL apps to start with.
Comment 7 Alex Deucher 2018-11-02 18:41:22 UTC
Or try a newer or older version of mesa.
Comment 8 Carsten Haitzler 2018-11-04 13:17:39 UTC
Actually no OGL client has even started. This is just the X server being started by slim (login manager), and that doesn't use OGL. It's really basic xlib stuff, so it is basically a raw X server... perhaps it's the glamor accel stuff... but no OGL clients. :) Never got that far.
Comment 9 Alex Deucher 2018-11-04 16:15:39 UTC
(In reply to Carsten Haitzler from comment #8)
> Actually no OGL client has even started. This is just the X server being
> started by slim (login manager), and that doesn't use OGL. It's really basic
> xlib stuff, so it is basically a raw X server... perhaps it's the glamor
> accel stuff... but no OGL clients. :) Never got that far.

Yeah, it would be GL via glamor in that case.
Comment 10 Carsten Haitzler 2018-11-05 09:08:04 UTC
So wouldn't that make it a necessity then, if even glamor needs it? I guess I can turn off glamor accel, but realistically GL is a necessity, so the problem needs to be addressed sooner or later.

The ring gfx timeout smells to me of "not a mesa bug", in that an ioctl going to the drm driver never returns when doing a simple query. It hangs, thus something lower down is having a bad day, if something as simple as querying a fence causes a hang... :)

What is this ring gfx thing exactly (it seems to be some command queue), and why would it be timing out? All the way back at seq 10/11... like right at the start of its use? It's almost like some interrupt or in-memory semaphore thing mapped from the card is messing up. I'm looking for something to look into more specifically.
Comment 11 Alex Deucher 2018-11-05 15:20:07 UTC
Does this patch help?
https://patchwork.freedesktop.org/patch/259364/

Does ARM support write combining?  The driver uses it pretty extensively.  You might try disabling GTT_USWC (uncached write combined) support in the kernel driver and just falling back to cached memory.
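
For reference, a sketch of what disabling USWC in the kernel driver could look like (an illustration, not upstream code: the AMDGPU_GEM_CREATE_CPU_GTT_USWC flag is the real uapi flag, but the placement of this hunk in the driver's buffer-object creation path is an assumption):

  /* Illustrative sketch only: mask out the userspace request for
   * uncached write-combined (USWC) CPU mappings of GTT memory on ARM,
   * falling back to cached memory for testing purposes. */
  #if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
  	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
  #endif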
Comment 12 Alex Deucher 2018-11-05 15:32:20 UTC
(In reply to Carsten Haitzler from comment #10)
> So wouldn't that make it a necessity then, if even glamor needs it? I guess
> I can turn off glamor accel, but realistically GL is a necessity, so the
> problem needs to be addressed sooner or later.
> 

If you're just starting a bare X server, you usually don't hit the glamor paths as extensively as with a full desktop environment.

> The ring gfx timeout smells to me of "not a mesa bug", in that an ioctl going
> to the drm driver never returns when doing a simple query. It hangs, thus
> something lower down is having a bad day, if something as simple as querying
> a fence causes a hang... :)
> 
> What is this ring gfx thing exactly (it seems to be some command queue), and
> why would it be timing out? All the way back at seq 10/11... like right at
> the start of its use? It's almost like some interrupt or in-memory semaphore
> thing mapped from the card is messing up. I'm looking for something to look
> into more specifically.

Each engine on the GPU (gfx, compute, video decode, encode, dma, etc.) has a ring buffer used to feed it. The work sent to the engines is managed by a software scheduler in the kernel. The kernel driver tests the rings as part of its init sequence, and it won't come up if the ring tests fail, so they are working at least until you start X. Presumably X submits (via glamor) some work to the GPU which causes the GPU to hang. The fence never signals because the GPU never finished processing the job due to the hang.
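
To connect this to the backtrace: frame #2 of the hung Xorg process is the libdrm fence-status call. A minimal sketch of that userspace wait follows (the function and its signature are the real libdrm_amdgpu API; the wrapper around it is hypothetical):

  /* Sketch: after submitting a command stream, userspace (Mesa) waits
   * on the submission's fence via libdrm_amdgpu.  If the GPU hangs and
   * GPU recovery is disabled, the fence seq never signals and this
   * call blocks inside the kernel ioctl indefinitely -- exactly what
   * the Xorg backtrace above shows. */
  #include <amdgpu.h>      /* libdrm_amdgpu API */
  #include <amdgpu_drm.h>  /* AMDGPU_TIMEOUT_INFINITE */
  #include <stdint.h>
  
  static int wait_for_submission(struct amdgpu_cs_fence *fence)
  {
  	uint32_t expired = 0;
  	int r = amdgpu_cs_query_fence_status(fence,
  					     AMDGPU_TIMEOUT_INFINITE,
  					     0 /* flags */, &expired);
  	return (r == 0 && expired) ? 0 : -1;
  }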

Another, simpler test would be to boot to a console (no X) and then try running some of the libdrm amdgpu tests. They are really simple (copying data around and verifying it using different engines, allocating and freeing memory, etc.):
https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu
See if some of the simple copy or write tests work.
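
For anyone following along, running those tests looks roughly like this (a sketch: the repository URL and build option names are assumptions and have changed across libdrm versions; amdgpu_test's -l and -s flags list and select test suites):

  # Build libdrm with its amdgpu tests enabled, then run them from a
  # console with no X server using the card.
  git clone https://gitlab.freedesktop.org/mesa/drm.git
  cd drm
  meson setup build -Damdgpu=true
  ninja -C build
  # Run as root so the tests can open the DRM device node.
  sudo ./build/tests/amdgpu/amdgpu_test -l    # list available test suites
  sudo ./build/tests/amdgpu/amdgpu_test -s 1  # run one suite, e.g. the basic tests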
Comment 13 Alex Deucher 2018-11-09 20:32:26 UTC
Does returning false in drm_arch_can_wc_memory() for ARM fix the issue?
Comment 14 Alex Deucher 2018-11-09 20:33:30 UTC
(In reply to Alex Deucher from comment #13)
> Does returning false in drm_arch_can_wc_memory() for ARM fix the issue?

This has enabled a working driver for others on ARM.
Comment 15 Carsten Haitzler 2018-11-19 13:00:31 UTC
And lo and behold:

--- ./include/drm/drm_cache.h~  2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h   2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
        return false;
 #else
-       return true;
+       return false;
 #endif
 }

Makes it work. Of course this isn't a brilliant patch, but indeed there is something up with the way write-combined memory is handled on ARM here. But disabling WC for all ARM DRM devices might be too much of a sledgehammer... I'm going to look into a less sledgehammer solution that might make this work more universally. I'll get back to you on that.
Comment 16 Christian König 2018-11-19 13:39:44 UTC
(In reply to Carsten Haitzler from comment #15)
> Makes it work. Of course this isn't a brilliant patch, but indeed there is
> something up with the way write combined memory is handled on ARM here.

Well, disabling WC is also a good way of reducing performance in general.

E.g. it could be that because you disabled WC the performance is reduced, and because of that the timing is changed....
