Bug 101998

Summary: X server 1.19.3 failure with amdgpu (radeon M295X) on Ubuntu HWE16.04 kernel 4.10.0-28
Product: xorg
Reporter: Meik Neubauer <tech>
Component: Driver/modesetting
Assignee: Xorg Project Team <xorg-team>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: blocker
Priority: medium
CC: adam.chou, agoins, chenhan.hsiao.tw, tsunghanliu
Version: 7.7 (2012.06)
Hardware: x86-64 (AMD64)
OS: Linux (All)
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=102184
Attachments (all with flags: none):
log file showing segmentation fault
older logfile without errors (compare this to the current logfile)
2017/08/02 error reproduce dmesg output
2017/08/02 error reproduce Xorg.0.log
'ps -ef' output after 'service lightdm stop'
20170804 reproduce (standby/resume) error report 1/3
20170804 reproduce (standby/resume) error report 2/3
20170804 reproduce (standby/resume) error report 3/3
20170804 reproduce (standby/resume) dmesg output
20170804 reproduce (standby/resume) Xorg.0.log.old
20170804 reproduce (standby/resume) Xorg.0.log
20170814 error reproduce Dependencies 1/3
20170814 error reproduce Dependencies 2/3
20170814 error reproduce Dependencies 3/3
20170814 error reproduce Disassembly + Journalerrors
20170814 error reproduce Proccpuinfo
20170814 error reproduce Procmaps
20170814 error reproduce Procstatus
20170814 error reproduce Registers
20170814 error reproduce Stacktrace
20170814 error reproduce Threadstacktrace
apport .crash file
extracted Core Dump from attachment 133603 (apport .crash file)

Description Meik Neubauer 2017-08-01 12:52:31 UTC
Created attachment 133168 [details]
log file showing segmentation fault

Note: The problem only occurs if an external monitor is attached!

Kernel initialization is OK: boot messages appear on all external monitors driven by the AMD GPU.

Subsequent Greeter load failures are accompanied by these messages:

Aug  1 13:24:59 mnmaster kernel: [   40.455942] [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* channel eq failed: 5 tries
Aug  1 13:24:59 mnmaster kernel: [   40.455954] [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* channel eq failed

After a few failures the greeter becomes stable and login is possible.
Xorg.0.log shows segmentation fault:

[   119.868] randr: falling back to unsynchronized pixmap sharing
[   120.087] (EE) 
[   120.087] (EE) Backtrace:
[   120.087] (EE) 0: /usr/lib/xorg/Xorg (xorg_backtrace+0x4e) [0x55cb016cbf7e]
[   120.087] (EE) 1: /usr/lib/xorg/Xorg (0x55cb0151a000+0x1b5ce9) [0x55cb016cfce9]
[   120.087] (EE) 2: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f1587a1d000+0x11390) [0x7f1587a2e390]
[   120.087] (EE) 
[   120.087] (EE) Segmentation fault at address 0x0
[   120.087] (EE) 
Fatal server error:
[   120.087] (EE) Caught signal 11 (Segmentation fault). Server aborting


I will add the current Xorg.0.log and also an older Xorg.0.log from April that used the 4.8.0-36 kernel. Please compare these logs to see that the root cause seems to be that amdgpu cannot allocate a frame buffer.

I should also mention that kern.log is flooded with these messages (but that may not be related to the error):
Aug  1 14:44:19 mnmaster kernel: [ 2759.820005] [drm:drm_mode_addfb2 [drm]] [FB:74]
Aug  1 14:44:19 mnmaster kernel: [ 2759.847678] [drm:drm_mode_addfb2 [drm]] [FB:77]
Comment 1 Meik Neubauer 2017-08-01 12:54:47 UTC
Created attachment 133169 [details]
older logfile without errors (compare this to the current logfile)
Comment 2 Michel Dänzer 2017-08-02 01:37:08 UTC
Please attach the corresponding dmesg output as well.

(In reply to Meik Neubauer from comment #0)
> [   120.087] (EE) Backtrace:
> [   120.087] (EE) 0: /usr/lib/xorg/Xorg (xorg_backtrace+0x4e)
> [0x55cb016cbf7e]
> [   120.087] (EE) 1: /usr/lib/xorg/Xorg (0x55cb0151a000+0x1b5ce9)
> [0x55cb016cfce9]

We need to know where in the Xorg code this is located. Make sure the xserver-xorg-core-dbg(sym) package is installed, then try

 addr2line -i -e /usr/lib/xorg/Xorg 0x55cb016cfce9

If that fails to resolve the source code location, manually start Xorg in gdb and get a backtrace when it crashes.
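
A note on the address passed to addr2line: the load base 0x55cb0151a000 in frame 1 suggests that Xorg is a position-independent executable here. If that assumption holds, addr2line needs the offset inside the file (the part after the '+' in the parentheses) rather than the absolute runtime address in the square brackets. The following Python sketch derives that offset from a backtrace line; the function name frame_to_addr2line and the regular expression are illustrative only and are not part of any Xorg tooling.

```python
import re

# Matches frames that carry a load base, such as:
#   1: /usr/lib/xorg/Xorg (0x55cb0151a000+0x1b5ce9) [0x55cb016cfce9]
FRAME_RE = re.compile(
    r"(?P<index>\d+):\s+(?P<module>\S+)\s+"
    r"\((?P<base>0x[0-9a-fA-F]+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s+"
    r"\[(?P<absolute>0x[0-9a-fA-F]+)\]"
)


def frame_to_addr2line(line):
    """Return an addr2line command for a 'base+offset' frame, or None.

    Hypothetical helper: frames of the form 'symbol+offset' carry no load
    base, so they are skipped.
    """
    match = FRAME_RE.search(line)
    if match is None:
        return None
    base = int(match.group("base"), 16)
    offset = int(match.group("offset"), 16)
    absolute = int(match.group("absolute"), 16)
    # Sanity check: the bracketed address must equal base + offset.
    if base + offset != absolute:
        raise ValueError("inconsistent frame: base + offset != absolute address")
    return "addr2line -i -e {} {:#x}".format(match.group("module"), offset)


frame = "[   120.087] (EE) 1: /usr/lib/xorg/Xorg (0x55cb0151a000+0x1b5ce9) [0x55cb016cfce9]"
print(frame_to_addr2line(frame))
```

For the frame quoted above this yields a command that uses 0x1b5ce9 instead of 0x55cb016cfce9. The debug symbols still have to be installed for the lookup to resolve a source location.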
Comment 3 Meik Neubauer 2017-08-02 14:14:39 UTC
The Ubuntu HWE16.04 package xserver-xorg-core-hwe-16.04-dbg exists but does not seem to contain any debugging symbols.
Installed files:
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/xserver-xorg-core-hwe-16.04-dbg
/usr/share/doc/xserver-xorg-core-hwe-16.04-dbg/changelog.Debian.gz
/usr/share/doc/xserver-xorg-core-hwe-16.04-dbg/copyright

Hence gdb will not debug this.


Also: What is the recommended way to safely bring down the existing xserver?
I tried 'service lightdm stop' from vt1, but this leaves Xorg and several other X-related processes (xinit, xfailsave) hanging around.


Nevertheless, I will attach a dmesg output with the corresponding Xorg.0.log.
Comment 4 Meik Neubauer 2017-08-02 14:16:17 UTC
Created attachment 133199 [details]
2017/08/02 error reproduce dmesg output
Comment 5 Meik Neubauer 2017-08-02 14:16:41 UTC
Created attachment 133200 [details]
2017/08/02 error reproduce Xorg.0.log
Comment 6 Timo Aaltonen 2017-08-03 08:42:16 UTC
Enable ddebs as described in
https://wiki.ubuntu.com/Debug%20Symbol%20Packages

and install xserver-xorg-core-hwe-16.04-dbgsym
Comment 7 Meik Neubauer 2017-08-04 11:52:37 UTC
Now on Ubuntu HWE16.04 kernel 4.10.0-30

Problem persists. Unable to drive external monitors.

I was unable to start Xorg in gdb. I do 'service lightdm stop', but this does not stop Xorg. Killing the process results in a new Xorg process being started.
I have not found out yet what is causing this behavior. Any clues welcome.
I will attach a 'ps -ef' output from after lightdm stop. This shows the related X processes.

What I could do is reproduce this after login, then suspend the machine and then resume. This produces a similar error and it brings up the Ubuntu error reporting. I took some screenshots of this. Not sure this will help, but I will attach it anyway, together with related logs and dmesg.
Comment 8 Meik Neubauer 2017-08-04 11:56:27 UTC
Created attachment 133236 [details]
'ps -ef' output after 'service lightdm stop'
Comment 9 Meik Neubauer 2017-08-04 11:58:47 UTC
Created attachment 133237 [details]
20170804 reproduce (standby/resume) error report 1/3
Comment 10 Meik Neubauer 2017-08-04 11:59:09 UTC
Created attachment 133238 [details]
20170804 reproduce (standby/resume) error report 2/3
Comment 11 Meik Neubauer 2017-08-04 11:59:34 UTC
Created attachment 133239 [details]
20170804 reproduce (standby/resume) error report 3/3
Comment 12 Meik Neubauer 2017-08-04 12:00:02 UTC
Created attachment 133240 [details]
20170804 reproduce (standby/resume) dmesg output
Comment 13 Meik Neubauer 2017-08-04 12:00:43 UTC
Created attachment 133241 [details]
20170804 reproduce (standby/resume) Xorg.0.log.old
Comment 14 Meik Neubauer 2017-08-04 12:01:17 UTC
Created attachment 133242 [details]
20170804 reproduce (standby/resume) Xorg.0.log
Comment 15 Michel Dänzer 2017-08-07 07:46:55 UTC
This is weird. The backtrace looks like ppriv->notify_on_damage == TRUE in ms_dirty_update, which indicates that msRequestSharedPixmapNotifyDamage was called before. But xf86-video-amdgpu doesn't call that hook.
Comment 16 Meik Neubauer 2017-08-14 12:18:27 UTC
Tried to reproduce today with kernel 4.10.0-32.

Booted, then 'service lightdm stop'. This time Xorg was unloaded from the system.

Started 'gdb --args /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch'
[Note: this is the normal command line I see on a running system]
Then ran 'service lightdm start', but it apparently did not use the Xorg started under gdb and created a new process instead. Hit the error again.

Rebooted with external monitor detached.
Ubuntu error reporting came up with the Xorg failure.
I will attach screenshots of this.
Comment 17 Meik Neubauer 2017-08-14 12:26:09 UTC
Created attachment 133483 [details]
20170814 error reproduce Dependencies 1/3
Comment 18 Meik Neubauer 2017-08-14 12:26:40 UTC
Created attachment 133484 [details]
20170814 error reproduce Dependencies 2/3
Comment 19 Meik Neubauer 2017-08-14 12:26:54 UTC
Created attachment 133485 [details]
20170814 error reproduce Dependencies 3/3
Comment 20 Meik Neubauer 2017-08-14 12:27:31 UTC
Created attachment 133486 [details]
20170814 error reproduce Disassembly + Journalerrors
Comment 21 Meik Neubauer 2017-08-14 12:27:48 UTC
Created attachment 133487 [details]
20170814 error reproduce Proccpuinfo
Comment 22 Meik Neubauer 2017-08-14 12:28:32 UTC
Created attachment 133488 [details]
20170814 error reproduce Procmaps
Comment 23 Meik Neubauer 2017-08-14 12:28:52 UTC
Created attachment 133489 [details]
20170814 error reproduce Procstatus
Comment 24 Meik Neubauer 2017-08-14 12:29:16 UTC
Created attachment 133490 [details]
20170814 error reproduce Registers
Comment 25 Meik Neubauer 2017-08-14 12:29:36 UTC
Created attachment 133491 [details]
20170814 error reproduce Stacktrace
Comment 26 Meik Neubauer 2017-08-14 12:30:00 UTC
Created attachment 133492 [details]
20170814 error reproduce Threadstacktrace
Comment 27 Michel Dänzer 2017-08-15 01:36:28 UTC
(In reply to Meik Neubauer from comment #16)
> Started 'gdb --args /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth
> /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch'

Can you try simply starting Xorg with something like

 Xorg :0 -retro

Does that avoid triggering the crash? What does

 DISPLAY=:0 xrandr

say at that point?


If you remove the xserver-xorg-video-amdgpu package, does the problem still occur?
Comment 28 Meik Neubauer 2017-08-18 12:02:57 UTC
I tried it out but couldn't get Xorg running under gdb, or attach gdb to a running Xorg without freezing the whole system.

I will upload a .crash file which has a coredump in it, which you can extract with apport-unpack <.crash-file>
Comment 29 Meik Neubauer 2017-08-18 12:05:26 UTC
Created attachment 133603 [details]
apport .crash file

Use
apport-unpack <.crash-filename> <empty_unpack_dir>
to retrieve the coredump (164MB)
Comment 30 Michel Dänzer 2017-08-22 07:03:00 UTC
(In reply to Meik Neubauer from comment #29)
> apport-unpack <.crash-filename> <empty_unpack_dir>
> to retrieve the coredump (164MB)

Unfortunately, that aborts for me with a Python exception, maybe because I'm on Debian instead of Ubuntu.

Anyway, right now I'm really looking for answers to the questions in comment 27.
Comment 31 Michel Dänzer 2017-08-22 07:08:06 UTC
(In reply to Meik Neubauer from comment #28)
> I tried it out but couldn't get Xorg running under gdb, or attach gdb to a
> running Xorg without freezing the whole system.

Sounds like you tried running gdb from the same machine? You have to run it remotely, see https://www.x.org/wiki/Development/Documentation/ServerDebugging/ for more information.
Comment 32 Meik Neubauer 2017-08-22 07:30:43 UTC
(In reply to Michel Dänzer from comment #30)
> (In reply to Meik Neubauer from comment #29)
> > apport-unpack <.crash-filename> <empty_unpack_dir>
> > to retrieve the coredump (164MB)
> 
> Unfortunately, that aborts for me with a Python exception, maybe because I'm
> on Debian instead of Ubuntu.

I will add the core dump in a simple compressed format...
Comment 33 Meik Neubauer 2017-08-22 07:32:24 UTC
Created attachment 133663 [details]
extracted Core Dump from attachment 133603 [details] (apport .crash file)
Comment 34 Michel Dänzer 2017-08-22 09:23:50 UTC
gdb cannot resolve symbols with only the core dump. Anyway, the screenshots showed the interesting parts of the backtrace. The problem is I have no clue how it could end up running the crashing code path.

We really need answers to the questions in comment 27.
Comment 35 Meik Neubauer 2017-08-23 14:32:02 UTC
(In reply to Michel Dänzer from comment #34)
> gdb cannot resolve symbols with only the core dump. Anyway, the screenshots
> showed the interesting parts of the backtrace. The problem is I have no clue
> how it could end up running the crashing code path.

I am not sure I understand this correctly. Maybe I am missing the point, because I am not familiar with the Intel platform (I work on old mainframes instead), but from what I can get out of gdb, the call stack is pretty clear: the failure occurs in hw/xfree86/drivers/modesetting/driver.c, in routine ms_dirty_update, where
ent->slave_dst->drawable.pScreen->SharedPixmapNotifyDamage(ent->slave_dst);
loads a zero address for the call.

The question is, which routine is responsible for setting up the structure in
ent->slave_dst->drawable.pScreen
and why is that not done correctly.

The related data areas seem to be all available in the core dump.

I agree that backtracking this is painful bit-counting, but it does not seem out of reach.

I would first try to identify the routine that sets up
ent->slave_dst->drawable.pScreen

I have not tried this yet, because without a complete build setup on my machine this is a cumbersome task. I guess that with a ready development environment on your side this is much easier to find out.


> We really need answers to the questions in comment 27.

I cannot set up a second machine. Due to network security restrictions here I cannot connect between two PCs. And I do not have an external monitor in any other location. So running a remote gdb debugging session is out of reach.
Also, this is my everyday production laptop, so I cannot conduct any wild experiments on it.
Comment 36 Michel Dänzer 2017-08-23 15:29:44 UTC
(In reply to Meik Neubauer from comment #35)
> [...] from what I can get out of gdb is that the call stack is pretty clear
> [...]

Right, which is why I'm saying we don't need more backtraces right now.


> The question is, which routine is responsible for setting up the structure in
> ent->slave_dst->drawable.pScreen and why is that not done correctly.

As I said in comment 15, the question is how ppriv->notify_on_damage ends up being TRUE in ms_dirty_update. The only way I can see that being the case is if msRequestSharedPixmapNotifyDamage was called before, but xf86-video-amdgpu doesn't have any code calling the RequestSharedPixmapNotifyDamage hook.


> > We really need answers to the questions in comment 27.
> 
> I cannot set up a second machine. Due to network security restrictions here
> I cannot connect between two PCs. And I do not have an external monitor in
> any other location. So running a remote gdb debugging session is out of
> reach.

None of the questions in comment 27 are related to remote access or gdb.
Comment 37 Michel Dänzer 2017-08-24 03:52:50 UTC
I was able to reproduce the problem. It's a modesetting driver bug. This code in ms_dirty_update:

            msPixmapPrivPtr ppriv =
                msGetPixmapPriv(&ms->drmmode, ent->slave_dst);

probably needs to be:

            msPixmapPrivPtr ppriv =
                msGetPixmapPriv(&ms->drmmode, ent->slave_dst->master_pixmap);

otherwise it tries to access random memory when the slave screen doesn't use the modesetting driver.

There might be more related issues.
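
The difference between the two lookups can be sketched with a toy model. This is Python, not the X server's C code, and every name in it (Screen, Pixmap, get_ms_priv, lookup_original, lookup_proposed) is hypothetical. It assumes the setup discussed in this report: the master screen uses the modesetting driver and the slave screen uses amdgpu. In C the wrong lookup reads whatever memory happens to be at the private offset; the model raises an exception instead so the mistake is visible.

```python
class Screen:
    """Toy stand-in for a screen; 'driver' names the driver that owns it."""

    def __init__(self, driver):
        self.driver = driver


class Pixmap:
    """Toy stand-in for a pixmap with an optional link to its master pixmap."""

    def __init__(self, screen, master_pixmap=None):
        self.screen = screen
        self.master_pixmap = master_pixmap
        # Assumption of this model: only a modesetting screen allocates the
        # modesetting private record for its pixmaps.
        self.ms_priv = (
            {"notify_on_damage": False} if screen.driver == "modesetting" else None
        )


def get_ms_priv(pixmap):
    """Hypothetical analogue of msGetPixmapPriv.

    The real C lookup performs no such check; it would simply read memory.
    """
    if pixmap.ms_priv is None:
        raise LookupError(
            "pixmap belongs to a %s screen, no modesetting private" % pixmap.screen.driver
        )
    return pixmap.ms_priv


def lookup_original(slave_dst):
    # Mirrors msGetPixmapPriv(&ms->drmmode, ent->slave_dst)
    return get_ms_priv(slave_dst)


def lookup_proposed(slave_dst):
    # Mirrors msGetPixmapPriv(&ms->drmmode, ent->slave_dst->master_pixmap)
    return get_ms_priv(slave_dst.master_pixmap)


master_pixmap = Pixmap(Screen("modesetting"))
slave_dst = Pixmap(Screen("amdgpu"), master_pixmap=master_pixmap)

print(lookup_proposed(slave_dst))
try:
    lookup_original(slave_dst)
except LookupError as exc:
    print("original lookup fails:", exc)
```

In this model the proposed lookup returns a record whose notify_on_damage is false, which matches the observation in comment 15 that nothing should have set that flag for an amdgpu slave.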
Comment 38 Meik Neubauer 2017-08-24 11:50:01 UTC
(In reply to Michel Dänzer from comment #36)
> > > We really need answers to the questions in comment 27.

I enabled the integrated Intel GFX for external monitors in the BIOS.
Then I booted the Linux kernel with parameter
module_blacklist=amdgpu

The error does not show up.
Currently driving two external monitors over the Intel GFX. A hot-plugged third is detected but not auto-configured correctly (no output).
Comment 39 Robert Liu 2018-05-23 03:48:41 UTC
I have a similar issue. After applying the patch from comment #37, the Xorg process no longer crashes when connecting a monitor.
Comment 40 Michel Dänzer 2018-05-24 08:33:52 UTC
*** Bug 106638 has been marked as a duplicate of this bug. ***
Comment 41 Michel Dänzer 2018-05-24 08:35:14 UTC
Someone who cares about the modesetting driver should create a proper patch and send it to the xorg-devel mailing list for review.
Comment 42 Michel Dänzer 2018-08-06 08:58:45 UTC
*** Bug 107437 has been marked as a duplicate of this bug. ***
Comment 43 Michel Dänzer 2018-08-29 08:27:41 UTC
Thanks for the report, fixed in xserver Git master:

commit f79e5368512b72bb463925983d265b070261b7aa
Author: Jim Qu <Jim.Qu@amd.com>
Date:   Mon Aug 27 13:37:38 2018 +0800

    modesetting: code refactor for PRIME sync
