104932 – Hang when running X11/Wayland on GFX8/Polaris10/Ellesmere/Rx-480-8GiB (agd5f a5592a6df4f45a018b48f252ad1c498e683e9b9d, hwentland's DC-Patches-Jan-31-2018.mbox applied)

Summary: Hang when running X11/Wayland on GFX8/Polaris10/Ellesmere/Rx-480-8GiB (agd5f ...

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-02-03 20:25 UTC by Robin Kauffman
Modified:	2018-03-07 17:57 UTC
CC List:	1 user

See Also:
i915 platform:
i915 features:

Attachments
Kernel serial log while firing up the Awesome WM (406 bytes, text/plain) 2018-02-03 20:25 UTC, Robin Kauffman
Kernel serial log while firing up gnome-shell as a Wayland compositor (369 bytes, text/plain) 2018-02-03 20:26 UTC, Robin Kauffman
Xorg.0.log from firing up X directly with Awesome as a WM (Boring) (38.19 KB, text/plain) 2018-02-03 20:27 UTC, Robin Kauffman

Description Robin Kauffman 2018-02-03 20:25:19 UTC

Created attachment 137158
Kernel serial log while firing up the Awesome WM

Hello-
    For a while now (and going back a fair way in the AMDGPU tree's commit history) I've been having an issue whereby the framebuffer & graphics stack (*usually* not the kernel writ large) hangs upon firing up the Xorg X server or a Wayland compositor (in my case, GNOME-Shell).
    What compounded my frustration was that I'd rarely (if ever) see any output from the Xorg server, the Wayland compositor, or the kernel indicating there was any sort of problem, save for a trapped SIGQUIT (trap3) hitting gnome-shell.
    Finally, in firing up X11 directly w/ Awesome as the WM I obtained some kernel output which will hopefully prove useful in diagnosing this issue.  It's below (and attached):
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00500002
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088002
amdgpu 0000:01:00.0: VM fault (0x02, vmid 5, pasid 32768) at page 5242882, read from 'TC6' (0x54433600) (136)
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, last signaled seq=45, last emitted seq=48
[drm] IP block:gfx_v8_0 is hung!
[drm] GPU recovery disabled.
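
For anyone reading the fault lines above, here is a rough sketch of how the logged values relate to one another. The log line and both register values are copied from this report; the bit positions in decode_status are assumptions, chosen only because they reproduce the numbers the kernel printed, and are not taken from register documentation. The helper names are made up for illustration.

```python
import re

# Values copied verbatim from the kernel output in this report.
LINE = ("amdgpu 0000:01:00.0: VM fault (0x02, vmid 5, pasid 32768) at page 5242882, "
        "read from 'TC6' (0x54433600) (136)")
STATUS = 0x0A088002      # VM_CONTEXT1_PROTECTION_FAULT_STATUS
FAULT_ADDR = 0x00500002  # VM_CONTEXT1_PROTECTION_FAULT_ADDR


def decode_status(status):
    """Split the status word into fields (bit positions are ASSUMED, see lead-in)."""
    return {
        "protections": status & 0xFF,          # assumed bits 0-7
        "mc_id": (status >> 12) & 0xFF,        # assumed bits 12-19
        "write": bool((status >> 24) & 0x1),   # assumed bit 24 (0 = read)
        "vmid": (status >> 25) & 0xF,          # assumed bits 25-28
    }


def decode_client(tag):
    """Interpret the 32-bit client tag as big-endian ASCII, dropping NUL padding."""
    return tag.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


match = re.search(
    r"at page (\d+), (read|write) from '(\w+)' \((0x[0-9a-fA-F]+)\) \((\d+)\)", LINE)
page = int(match.group(1))
fields = decode_status(STATUS)
client = decode_client(int(match.group(4), 16))

print(hex(page), fields, client)
```

Under those assumptions the decimal page number in the "VM fault" line is the same value as the FAULT_ADDR register (5242882 == 0x500002), and the vmid, protection bits, and client id in that line all fall out of the STATUS word, so the three lines describe a single fault rather than three separate ones.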

Doubt it's tremendously useful, but kernel output with gnome-shell (in this case acting as a Wayland compositor) is below and attached as well:
traps: gnome-shell[1260] trap int3 ip:7f6e4b956e01 sp:7ffe72a8aa50 error:0 in libglib-2.0.so.0.5200.3[7f6e4b908000+10f000]
traps: gnome-shell[1336] trap int3 ip:7f0f89b3ee01 sp:7ffefd8d3df0 error:0 in libglib-2.0.so.0.5200.3[7f0f89af0000+10f000]
traps: gnome-shell[1351] trap int3 ip:7f0984cb6e01 sp:7ffd3caa7650 error:0 in libglib-2.0.so.0.5200.3[7f0984c68000+10f000]
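
A small sketch of how the three trap lines above can be compared: subtracting the mapping base (the first number in the square brackets) from the instruction pointer gives the offset inside libglib, which does not change from run to run the way the absolute addresses do. The log text is copied from this report; the function and variable names are made up for illustration.

```python
import re

# Trap lines copied verbatim from the kernel output in this report.
TRAPS = """\
traps: gnome-shell[1260] trap int3 ip:7f6e4b956e01 sp:7ffe72a8aa50 error:0 in libglib-2.0.so.0.5200.3[7f6e4b908000+10f000]
traps: gnome-shell[1336] trap int3 ip:7f0f89b3ee01 sp:7ffefd8d3df0 error:0 in libglib-2.0.so.0.5200.3[7f0f89af0000+10f000]
traps: gnome-shell[1351] trap int3 ip:7f0984cb6e01 sp:7ffd3caa7650 error:0 in libglib-2.0.so.0.5200.3[7f0984c68000+10f000]
"""

PATTERN = re.compile(
    r"ip:([0-9a-f]+) sp:[0-9a-f]+ error:\d+ in ([^\[\s]+)\[([0-9a-f]+)\+([0-9a-f]+)\]")


def trap_offsets(log):
    """Return (library, offset-within-mapping) for every trap line in the log."""
    results = []
    for m in PATTERN.finditer(log):
        ip = int(m.group(1), 16)
        base = int(m.group(3), 16)
        size = int(m.group(4), 16)
        offset = ip - base
        # Sanity check: the instruction pointer must lie inside the mapping.
        if not 0 <= offset < size:
            raise ValueError(f"ip {ip:#x} is outside the mapping at {base:#x}")
        results.append((m.group(2), offset))
    return results


offsets = trap_offsets(TRAPS)
for library, offset in offsets:
    print(library, hex(offset))
```

All three processes stopped at the same offset (0x4ee01) in the same library, which suggests each gnome-shell instance hit the same code location in GLib. With debug symbols for that exact libglib build, the offset could be handed to a tool such as addr2line to find the source line; the report does not include that step.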

Lastly, my Xorg.0.log is attached, but you'll find it's quite boring (no complaints, which is good, but seemingly no useful debugging information, either).

Comment 1 Robin Kauffman 2018-02-03 20:26:14 UTC

Created attachment 137159
Kernel serial log while firing up gnome-shell as a Wayland compositor

Forgot that Bugzilla nominally allows only one attachment per comment.

Comment 2 Robin Kauffman 2018-02-03 20:27:11 UTC

Created attachment 137160
Xorg.0.log from firing up X directly with Awesome as a WM (Boring)

Finally, Xorg.0.log

Comment 3 Robin Kauffman 2018-02-03 20:47:09 UTC

Oops, discovered too late that INT3 is used to set a breakpoint.  Regardless, it's still something SIGTRAP-ish (i.e. SIGDUMP-CORE-AND-TERMINATE).

Comment 4 Michel Dänzer 2018-02-05 09:06:48 UTC

Which versions of Mesa & LLVM are you using?

Comment 5 Robin Kauffman 2018-02-07 17:51:06 UTC

(In reply to Michel Dänzer from comment #4)
> Which versions of Mesa & LLVM are you using?

LLVM & Clang 7.0 Git, LLVM commit e0c16f05e9fbc1dcd291814ceab9dbc5, Clang commit e0c16f05e9fbc1dcd291814ceab9dbc5.  Both were merged 2018/01/31.  Let me know if you need more detail.

Comment 6 Robin Kauffman 2018-02-10 19:33:49 UTC

(In reply to Michel Dänzer from comment #4)
> Which versions of Mesa & LLVM are you using?

My sincere apologies, I neglected to include the version of Mesa et al.  Unfortunately, it was reading comprehension that failed me, not attention or due diligence.  Here's what's installed:

libdrm Git master, commit 9e34ad590e0e1003a597b8cc790a3f36830ba993, merged 1/31/2018.
libclc Git master, commit 13541fc0c779cbadbda793c67a0402558ff92ab6, merged 1/31/2018.
Mesa Git master, commit bbef9474fa52d9aba06eeede52558fc5ccb762dd, merged 1/31/2018.

Hopefully this now makes my answer complete enough, and again, apologies for the delay.

Comment 7 Robin Kauffman 2018-02-11 22:54:38 UTC

(In reply to Michel Dänzer from comment #4)
> Which versions of Mesa & LLVM are you using?

I should also add that LLVM & Clang were compiled/installed *prior* to merging libclc & Mesa (probably prior to libdrm as well, but in full honesty I can't remember), and that libdrm, libclc & Mesa were merged in that order.

Comment 8 Robin Kauffman 2018-03-07 17:57:28 UTC

Hi Again-
    Well, after (repeatedly) breaking the Law of Sysadminning (changing more than one thing at a time) and upgrading userland (including, but sadly not limited to, the 3D stack) as well as the kernel, I have a working desktop.  What little I can divine from where I ended up is that RadeonSI/libdrm/AMDGPU may well *not* have been at fault to begin with.  (I almost labeled the resolution NOTOURBUG, but to my mind the probability that there was some slight issue in the RadeonSI/AMDGPU graphics driver itself is nonzero, even if scant.)
    I unfortunately neglected to bisect the kernel driver (or parts of userland) to see if I could get things working by going back in time and seeing if there was some commit somewhere that broke things (at least for my already-rickety userland).  Such a commit may well exist, but going back to try to find it would be a lengthy endeavor, and given that things more-or-less work for me now, the motivation for doing so has all but dried up.
    If someone wants to follow up with a likely cause for the kernel complaint I *did* see earlier, by all means do so, but I have a sneaking suspicion that I made things Not Work by having a haphazard and semi-out-of-date userland (in this case, *outside* of libdrm/LLVM/Clang/Mesa/etc).
    Thanks, and apologies for any time sunk on anyone else's behalf trying to suss this one out.

        -Robin K.
