Summary: AMDGPU - Can't even get Xorg to start - Kernel driver hangs with ring buffer timeout on ARM64

Product: DRI
Component: DRM/AMDgpu
Status: RESOLVED FIXED
Severity: blocker
Priority: medium
Version: unspecified
Hardware: ARM
OS: Linux (All)
Reporter: Carsten Haitzler <raster>
Assignee: Default DRI bug account <dri-devel>
CC: pbrobinson
Attachments: 142337 (log - dmesg), 142338 (log - xorg), 142339 (log - xorg - gdb attach + bt)
Description
Carsten Haitzler
2018-11-01 15:59:10 UTC
Please attach your full dmesg output and xorg log if using X.

Created attachment 142337 [details]
log - dmesg

Created attachment 142338 [details]
log - xorg

Created attachment 142339 [details]
log - xorg - gdb attach + bt
Attached them (too big to put inline as comments).

It looks like something submitted by mesa caused a GPU hang. You might try starting a bare X server and trying some simple OGL apps to start with. Or try a newer or older version of mesa.

Actually, no OGL client has even started. This is just the xserver being started by slim (the login manager), and that doesn't use OGL. It's really basic xlib stuff, so it is basically a raw xserver... perhaps it's the glamor accel stuff... but... no OGL clients. :) Never got that far.

(In reply to Carsten Haitzler from comment #8)
> Actually, no OGL client has even started. This is just the xserver being
> started by slim (the login manager), and that doesn't use OGL. It's really
> basic xlib stuff, so it is basically a raw xserver... perhaps it's the
> glamor accel stuff... but... no OGL clients. :) Never got that far.

Yeah, it would be GL via glamor in that case.

So wouldn't that make it a necessity then, if even glamor needs it? I guess I can turn off glamor accel, but realistically GL is a necessity, so the problem needs to be addressed sooner or later.

The "ring gfx" timeout smells to me of "not a mesa bug", in that an ioctl going to the drm driver never returns when doing a simple query. It hangs, so something lower down is having a bad day, if something as simple as querying a fence causes a hang... :)

What is this "ring gfx" thing exactly (it seems to be some command queue), and why would it be timing out all the way back at seq 10/11, right at the start of its use? It's almost like some interrupt or in-memory semaphore mapped from the card is messing up. I'm looking for something to look into more specifically.

Does this patch help? https://patchwork.freedesktop.org/patch/259364/

Does ARM support write combining? The driver uses it pretty extensively. You might try disabling GTT_USWC (uncached write-combined) support in the kernel driver and just falling back to cached memory.

(In reply to Carsten Haitzler from comment #10)
> So wouldn't that make it a necessity then, if even glamor needs it? I guess
> I can turn off glamor accel, but realistically GL is a necessity, so the
> problem needs to be addressed sooner or later.

If you were starting a bare X server, you usually don't hit the glamor paths too extensively compared to a full desktop environment.

> The "ring gfx" timeout smells to me of "not a mesa bug", in that an ioctl
> going to the drm driver never returns when doing a simple query. It hangs,
> so something lower down is having a bad day, if something as simple as
> querying a fence causes a hang... :)
>
> What is this "ring gfx" thing exactly (it seems to be some command queue),
> and why would it be timing out all the way back at seq 10/11, right at the
> start of its use? It's almost like some interrupt or in-memory semaphore
> mapped from the card is messing up. I'm looking for something to look into
> more specifically.

Each engine on the GPU (gfx, compute, video decode, encode, dma, etc.) has a ring buffer used to feed it commands. The work sent to the engines is managed by a software scheduler in the kernel. The kernel driver tests the rings as part of its init sequence and won't come up if the ring tests fail, so they are working at least until you start X. Presumably X submits (via glamor) some work to the GPU which causes the GPU to hang; the fence never signals because the GPU never finished processing the job due to the hang.
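To make the ring/fence relationship concrete, here is a minimal sketch of the userspace side of such a fence query using libdrm's amdgpu API (compile with -I/usr/include/libdrm, link with -ldrm_amdgpu -ldrm). The render node path and the sequence number are illustrative stand-ins; a real sequence number is returned by an actual amdgpu_cs_submit() call. The point is only to show where the wait stalls when a ring hangs:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

int main(void)
{
	uint32_t major, minor, expired = 0;
	amdgpu_device_handle dev;
	amdgpu_context_handle ctx;
	struct amdgpu_cs_fence fence = { 0 };
	int fd = open("/dev/dri/renderD128", O_RDWR); /* node may differ */

	if (fd < 0 || amdgpu_device_initialize(fd, &major, &minor, &dev))
		return 1;
	if (amdgpu_cs_ctx_create(dev, &ctx))
		return 1;

	fence.context = ctx;
	fence.ip_type = AMDGPU_HW_IP_GFX; /* the "ring gfx" from the dmesg log */
	fence.ring = 0;
	fence.fence = 10; /* illustrative seq number, cf. the hang at seq 10/11 */

	/* Wait up to 1 second (relative timeout). If the GPU has hung, the
	 * fence never signals, expired stays 0, and the kernel eventually
	 * logs the ring timeout. */
	amdgpu_cs_query_fence_status(&fence, 1000000000ull, 0, &expired);
	printf("fence expired: %u\n", expired);

	amdgpu_cs_ctx_free(ctx);
	amdgpu_device_deinitialize(dev);
	return 0;
}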
Another simpler test would be to boot to a console (no X) and then try running some of the libdrm amdgpu tests. They are really simple (copying data around and verifying it using different engines, allocating and freeing memory, etc.): https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu
See if some of the simple copy or write tests work.

Does returning false in drm_arch_can_wc_memory() for ARM fix the issue?

(In reply to Alex Deucher from comment #13)
> Does returning false in drm_arch_can_wc_memory() for ARM fix the issue?

This has enabled a working driver for others on ARM.

And lo and behold:

--- ./include/drm/drm_cache.h~	2018-08-12 21:41:04.000000000 +0100
+++ ./include/drm/drm_cache.h	2018-11-16 11:06:16.976842816 +0000
@@ -48,7 +48,7 @@
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
 #else
-	return true;
+	return false;
 #endif
 }

Makes it work. Of course this isn't a brilliant patch, but indeed there is something up with the way write-combined memory is handled on ARM here. Disabling WC for all ARM DRM devices might be too much of a sledgehammer, though, so I'm going to look into a less sledgehammer solution that might make this work more universally. I'll get back to you on that.

(In reply to Carsten Haitzler from comment #15)
> Makes it work. Of course this isn't a brilliant patch, but indeed there is
> something up with the way write-combined memory is handled on ARM here.

Well, disabling WC is also a good way of reducing performance in general. E.g. it could be that because you disabled WC the performance is reduced, and because of that the timing is changed...
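One less sledgehammer direction, sketched here purely as an illustration (the CONFIG_ARM64 branch below is hypothetical, not a reviewed fix), would be to mask the write-combine request off per buffer object inside amdgpu itself rather than disabling WC for every ARM DRM driver. The amdgpu kernel driver already does exactly this for 32-bit x86 in drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:

/* Excerpt-style sketch of amdgpu_bo_do_create() from the 4.19-era
 * drivers/gpu/drm/amd/amdgpu/amdgpu_object.c. The CONFIG_X86_32 branch
 * exists upstream; the CONFIG_ARM64 branch is a hypothetical addition. */
static int amdgpu_bo_do_create(struct amdgpu_device *adev,
			       struct amdgpu_bo_param *bp,
			       struct amdgpu_bo **bo_ptr)
{
	struct amdgpu_bo *bo;

	/* ... allocation and flag setup elided ... */

#ifdef CONFIG_X86_32
	/* XXX: Write-combined CPU mappings of GTT seem broken on 32-bit
	 * See https://bugs.freedesktop.org/show_bug.cgi?id=84627
	 */
	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
#elif defined(CONFIG_ARM64)
	/* Hypothetical: fall back to cached GTT mappings on arm64 until
	 * the WC mapping problem is understood, without touching
	 * drm_arch_can_wc_memory() for other ARM DRM drivers. */
	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
#endif

	/* ... buffer placement and TTM object creation elided ... */
	return 0;
}

Scoping the workaround to amdgpu's own GTT allocations would at least confine the performance cost raised in the last comment to the one driver that needs it.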