Bug 103401

Summary: X with radeon/amdgpu crashes under lightdm and sddm but not xdm and startx
Product: Mesa
Reporter: Sergey Kondakov <virtuousfox>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED WORKSFORME
QA Contact: Default DRI bug account <dri-devel>
Severity: critical    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Attachments: dmesg_20171022
kernel config
XFX_RX-580P8DFD6_mod21-himem+default_voltage.rom
X log from lightdm
X log from startx
glxinfo_20171023

Description Sergey Kondakov 2017-10-22 12:27:53 UTC
Created attachment 134986 [details]
dmesg_20171022

My Radeon HD6780 card recently fried itself. Since then, with both a temporary HD5xxx-something card and my new RX 580, lightdm refuses to work, and the logs look as if the driver crashed. Everything works fine if I log in on the console and start the session with startx. sddm also fails to start, without writing anything meaningful to the logs, but xdm seems to work, as ugly as it is.

The only suspicious things in the logs are these bits from dmesg:
[    5.975054] [drm] Detected VRAM RAM=8192M, BAR=256M
[    5.975059] [drm] RAM width 256bits GDDR5
[    5.976134] [drm] amdgpu: 8192M of VRAM memory ready
[    5.976139] [drm] amdgpu: 8192M of GTT memory ready.
[    5.976158] [drm] GART: num cpu pages 2097152, num gpu pages 2097152
[    5.976341] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400111000 flags=0x0010]
[    5.976356] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400110040 flags=0x0010]
[    5.976369] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f4001113c0 flags=0x0010]
[    5.976381] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f4001111c0 flags=0x0010]
[    5.976394] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400111040 flags=0x0010]
[    5.976406] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400111200 flags=0x0010]
[    5.976419] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400111080 flags=0x0010]
[    5.976431] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400111240 flags=0x0010]
[    5.976444] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f4001110c0 flags=0x0010]
[    5.976456] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400111280 flags=0x0010]
[    5.976469] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f400111100 flags=0x0010]
[    5.976481] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f4001112c0 flags=0x0010]
[    5.976493] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f400111140 flags=0x0010]
[    5.976505] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f400111300 flags=0x0010]
[    5.976517] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f400111180 flags=0x0010]
[    5.976529] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f400111340 flags=0x0010]
[    5.976541] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f400111380 flags=0x0010]
[    5.976553] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f4001101c0 flags=0x0010]
[    5.976565] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f4001103c0 flags=0x0010]
[    5.978360] [drm] PCIE GART of 8192M enabled (table at 0x0000000000040000).
...
[    6.236889] amdgpu: [powerplay] [AVFS] Something is broken. See log!
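
For triage, the page faults in the excerpt above can be tallied with a short script. This is a minimal sketch in Python and not part of the original report: the function name `summarize_faults` and the embedded sample are illustrative assumptions, and the pattern only covers the two line shapes visible in this dmesg excerpt.

```python
import re

# Matches both dmesg spellings seen in the excerpt: the device given as a
# prefix ("amdgpu 0000:01:00.0: AMD-Vi: ...") or as a "device=" field.
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT\s+(?:device=\S+\s+)?"
    r"domain=(0x[0-9a-fA-F]+)\s+"
    r"address=(0x[0-9a-fA-F]+)\s+"
    r"flags=(0x[0-9a-fA-F]+)"
)


def summarize_faults(dmesg_text):
    """Return the fault count plus lowest and highest address, or None."""
    addresses = [int(m.group(2), 16) for m in FAULT_RE.finditer(dmesg_text)]
    if not addresses:
        return None
    return {
        "count": len(addresses),
        "lowest": min(addresses),
        "highest": max(addresses),
    }


# Three lines copied from the dmesg excerpt, one of them in the second format.
SAMPLE = """\
[    5.976341] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400111000 flags=0x0010]
[    5.976356] amdgpu 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x000000f400110040 flags=0x0010]
[    5.976469] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x000000f400111100 flags=0x0010]
"""

summary = summarize_faults(SAMPLE)
print(summary["count"], hex(summary["lowest"]), hex(summary["highest"]))
```

Running it over a full dmesg dump instead of `SAMPLE` shows whether the faults stay within one small address window, as they appear to in the excerpt, or spread further.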

The last time I saw something similar was when I tried to use a GeForce 6600 with nouveau: it spammed the system with endless AMD-Vi errors until it hung. With the IOMMU disabled there were no errors in the log, but the system still hung.
BUT with the nvidia driver nothing errors out or crashes.

It may have something to do with system updates or my custom kernel config. Or it may not, I have no idea.

X log from lightdm with RX580 says:
(II) [KMS] Kernel modesetting enabled.
(EE).
(EE) Backtrace:
(EE) 0: /usr/bin/X (xorg_backtrace+0x65) [0x55b88ef352c5]
(EE) 1: /usr/bin/X (0x55b88ed80000+0x1b9079) [0x55b88ef39079]
(EE) 2: /lib64/libpthread.so.0 (0x7f4bb2d61000+0x12270) [0x7f4bb2d73270]
(EE) 3: /lib64/libc.so.6 (0x7f4bb29a1000+0x98396) [0x7f4bb2a39396]
(EE) 4: /usr/lib64/dri/radeonsi_dri.so (0x7f4bace31000+0x7b5fa9) [0x7f4bad5e6fa9]
(EE) 5: /usr/lib64/dri/radeonsi_dri.so (0x7f4bace31000+0x7b63ab) [0x7f4bad5e73ab]
(EE) 6: /usr/lib64/dri/radeonsi_dri.so (0x7f4bace31000+0x73430) [0x7f4bacea4430]
(EE) 7: /lib64/ld-linux-x86-64.so.2 (0x7f4bb4f89000+0xfb8a) [0x7f4bb4f98b8a]
(EE) 8: /lib64/ld-linux-x86-64.so.2 (0x7f4bb4f89000+0xfc96) [0x7f4bb4f98c96]
(EE) 9: /lib64/ld-linux-x86-64.so.2 (0x7f4bb4f89000+0x143ee) [0x7f4bb4f9d3ee]
(EE) 10: /lib64/libc.so.6 (_dl_catch_error+0x84) [0x7f4bb2ad57e4]
(EE) 11: /lib64/ld-linux-x86-64.so.2 (0x7f4bb4f89000+0x13b99) [0x7f4bb4f9cb99]
(EE) 12: /lib64/libdl.so.2 (0x7f4bb4491000+0xf96) [0x7f4bb4491f96]
(EE) 13: /lib64/libc.so.6 (_dl_catch_error+0x84) [0x7f4bb2ad57e4]
(EE) 14: /lib64/libdl.so.2 (0x7f4bb4491000+0x1665) [0x7f4bb4492665]
(EE) 15: /lib64/libdl.so.2 (dlopen+0x41) [0x7f4bb4492021]
(EE) 16: /usr/lib64/libgbm.so.1 (0x7f4baf259000+0x4ca4) [0x7f4baf25dca4]
(EE) 17: /usr/lib64/libgbm.so.1 (0x7f4baf259000+0x4dcc) [0x7f4baf25ddcc]
(EE) 18: /usr/lib64/libgbm.so.1 (0x7f4baf259000+0x5208) [0x7f4baf25e208]
(EE) 19: /usr/lib64/libgbm.so.1 (gbm_create_device+0x57) [0x7f4baf25c127]
(EE) 20: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (0x7f4baf679000+0xf139) [0x7f4baf688139]
(EE) 21: /usr/bin/X (InitOutput+0xa3d) [0x55b88ee1a11d]
(EE) 22: /usr/bin/X (0x55b88ed80000+0x57fd3) [0x55b88edd7fd3]
(EE) 23: /lib64/libc.so.6 (__libc_start_main+0xea) [0x7f4bb29c1f4a]
(EE) 24: /usr/bin/X (_start+0x2a) [0x55b88edc1eba]
(EE).
(EE) Segmentation fault at address 0x0
Comment 1 Sergey Kondakov 2017-10-22 12:30:13 UTC
Created attachment 134987 [details]
kernel config

config.HSF from https://build.opensuse.org/package/show/home:X0F:HSF:Kernel/kernel-source for kernel 4.13.8-664.g7aed50c-HSF
Comment 2 Sergey Kondakov 2017-10-22 12:49:57 UTC
Created attachment 134988 [details]
XFX_RX-580P8DFD6_mod21-himem+default_voltage.rom

This is the custom BIOS that I use. It is based on the stock secondary XFX BIOS (1150/2100), which was aimed at low core power and overclocked memory. I set 1300/2000 while keeping the tight timings (compared to the primary stock BIOS, 1366/2000) and automatic voltage control ('6528[1-8]' values). I also lowered power throttling while dramatically increasing the aggressiveness of the fans and of temperature throttling and shutdown.

Both stock BIOSes had a static '750mV' voltage for the lowest core frequency, and the amdgpu kernel module didn't like that; it would complain about any static voltage.
I've also made custom Hynix timings for 500MHz (a replacement for the default 300MHz, in the hope that it will switch to 1000MHz less often), which wasn't easy, but it seems to work without issues now.
Comment 3 Michel Dänzer 2017-10-23 10:29:09 UTC
Please attach the full corresponding Xorg log file and output of glxinfo.

Looks like the crash happens in Mesa.

Would be helpful if you could get more information about the crash, either with gdb (see https://www.x.org/wiki/Development/Documentation/ServerDebugging/) or addr2line.
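
As a rough illustration of the addr2line route, the unresolved frames in the backtrace can be turned into addr2line invocations. This is a minimal sketch in Python under stated assumptions: the helper names `unresolved_frames` and `addr2line_commands` are hypothetical, the offset after the `+` is assumed to be usable directly as an address inside the module (typical for shared objects and PIE binaries, but not guaranteed), and useful output requires the matching debug symbols to be installed.

```python
import re

# Frame shape in the Xorg log:
#   (EE) 4: /path/module.so (0xBASE+0xOFFSET) [0xADDR]
# Frames printed as "(symbol+0xNN)" are already resolved and are skipped.
FRAME_RE = re.compile(
    r"\(EE\)\s+(\d+):\s+(\S+)\s+\(0x[0-9a-fA-F]+\+(0x[0-9a-fA-F]+)\)"
)


def unresolved_frames(log_text):
    """Return (frame number, module path, offset) for frames without a symbol."""
    return [
        (int(number), module, offset)
        for number, module, offset in FRAME_RE.findall(log_text)
    ]


def addr2line_commands(log_text):
    """Build one addr2line command line per unresolved frame."""
    return [
        f"addr2line -f -C -e {module} {offset}"
        for _, module, offset in unresolved_frames(log_text)
    ]


# Three frames copied from the backtrace in the description.
SAMPLE = """\
(EE) 0: /usr/bin/X (xorg_backtrace+0x65) [0x55b88ef352c5]
(EE) 4: /usr/lib64/dri/radeonsi_dri.so (0x7f4bace31000+0x7b5fa9) [0x7f4bad5e6fa9]
(EE) 6: /usr/lib64/dri/radeonsi_dri.so (0x7f4bace31000+0x73430) [0x7f4bacea4430]
"""

for command in addr2line_commands(SAMPLE):
    print(command)
```

The `-f` and `-C` options ask addr2line for function names and demangled symbols. If the distribution ships debug information in separate files, `-e` may need to point at the debug file rather than the installed library.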
Comment 4 Sergey Kondakov 2017-10-23 12:20:27 UTC
Created attachment 135003 [details]
X log from lightdm

It's not much bigger than the snippet above.
Comment 5 Sergey Kondakov 2017-10-23 12:21:13 UTC
Created attachment 135004 [details]
X log from startx

Not a hint of trouble.
Comment 6 Sergey Kondakov 2017-10-23 12:23:26 UTC
Created attachment 135005 [details]
glxinfo_20171023
Comment 7 Sergey Kondakov 2017-10-23 12:27:54 UTC
(In reply to Michel Dänzer from comment #3)
> Please attach the full corresponding Xorg log file and output of glxinfo.
> 
> Looks like the crash happens in Mesa.
> 
> Would be helpful if you could get more information about the crash, either
> with gdb (see
> https://www.x.org/wiki/Development/Documentation/ServerDebugging/) or
> addr2line.

That would be hard, since it crashes only when launched by a system service. At best I can try to install debug symbols to convert those numbers into something readable.
Comment 8 Sergey Kondakov 2017-10-25 07:47:52 UTC
I tried to get a trace via that Xgdb script, but X got stuck with a black screen and nothing in the logs. Then I ran a system update and suddenly lightdm and sddm started working. So the issue either resolved itself or was a result of https://bugs.freedesktop.org/show_bug.cgi?id=99591, which was fixed just now in openSUSE.

The "IO_PAGE_FAULT" messages and powerplay's "Something is broken" are still there, but they don't seem to manifest in any noticeable way yet.
