Created attachment 133168 [details]
log file showing segmentation fault

Note: The problem only occurs if an external monitor is attached!

Kernel initialization is OK. Boot messages appear on all external monitors driven by AMD.

Subsequent greeter load failures are accompanied by these messages:

Aug 1 13:24:59 mnmaster kernel: [ 40.455942] [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* channel eq failed: 5 tries
Aug 1 13:24:59 mnmaster kernel: [ 40.455954] [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* channel eq failed

The greeter is then stable after a few failures and login is possible.

Xorg.0.log shows a segmentation fault:

[ 119.868] randr: falling back to unsynchronized pixmap sharing
[ 120.087] (EE)
[ 120.087] (EE) Backtrace:
[ 120.087] (EE) 0: /usr/lib/xorg/Xorg (xorg_backtrace+0x4e) [0x55cb016cbf7e]
[ 120.087] (EE) 1: /usr/lib/xorg/Xorg (0x55cb0151a000+0x1b5ce9) [0x55cb016cfce9]
[ 120.087] (EE) 2: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f1587a1d000+0x11390) [0x7f1587a2e390]
[ 120.087] (EE)
[ 120.087] (EE) Segmentation fault at address 0x0
[ 120.087] (EE)
Fatal server error:
[ 120.087] (EE) Caught signal 11 (Segmentation fault). Server aborting

I will add the current Xorg.0.log and also an older Xorg.0.log from April that used the 4.8.0-36 kernel. Please compare these logs; the root cause seems to be that amdgpu cannot allocate a frame buffer.

I should also mention that kern.log is flooded with these messages (but that may not be related to the error):

Aug 1 14:44:19 mnmaster kernel: [ 2759.820005] [drm:drm_mode_addfb2 [drm]] [FB:74]
Aug 1 14:44:19 mnmaster kernel: [ 2759.847678] [drm:drm_mode_addfb2 [drm]] [FB:77]
Created attachment 133169 [details] older logfile without errors (compare this to the current logfile)
Please attach the corresponding dmesg output as well.

(In reply to Meik Neubauer from comment #0)
> [ 120.087] (EE) Backtrace:
> [ 120.087] (EE) 0: /usr/lib/xorg/Xorg (xorg_backtrace+0x4e)
> [0x55cb016cbf7e]
> [ 120.087] (EE) 1: /usr/lib/xorg/Xorg (0x55cb0151a000+0x1b5ce9)
> [0x55cb016cfce9]

We need to know where in the Xorg code this is located. Make sure the xserver-xorg-core-dbg(sym) package is installed, then try

    addr2line -i -e /usr/lib/xorg/Xorg 0x55cb016cfce9

If that fails to resolve the source code location, manually start Xorg in gdb and get a backtrace when it crashes.
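A note on the address passed to addr2line: frame 1 in the backtrace is printed as base+offset followed by the absolute runtime address. If the Xorg binary is built as a position-independent executable (an assumption here; the 0x55... load address suggests it but does not prove it), the absolute address may not resolve, and the offset 0x1b5ce9 may be worth trying instead. The following Python sketch only does the arithmetic on a backtrace line; the helper names frame_offset and addr2line_command are made up for illustration and are not part of any Xorg tooling.

```python
import re

# Matches Xorg backtrace frames of the form
#   "1: /usr/lib/xorg/Xorg (0x55cb0151a000+0x1b5ce9) [0x55cb016cfce9]"
# Frames printed as "(symbol+0x4e)" are deliberately not matched.
FRAME_RE = re.compile(
    r"(?P<path>/\S+)\s+"
    r"\((?P<base>0x[0-9a-fA-F]+)\+(?P<off>0x[0-9a-fA-F]+)\)\s+"
    r"\[(?P<abs>0x[0-9a-fA-F]+)\]"
)


def frame_offset(line):
    """Return (binary_path, offset) for a base+offset frame, else None."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    base = int(m.group("base"), 16)
    off = int(m.group("off"), 16)
    absolute = int(m.group("abs"), 16)
    # Sanity check: the three numbers in the frame must agree.
    if base + off != absolute:
        raise ValueError("base + offset does not match the absolute address")
    return m.group("path"), off


def addr2line_command(line):
    """Build an addr2line argument list that uses the module-relative offset."""
    parsed = frame_offset(line)
    if parsed is None:
        return None
    path, off = parsed
    return ["addr2line", "-i", "-e", path, hex(off)]


if __name__ == "__main__":
    frame = ("[ 120.087] (EE) 1: /usr/lib/xorg/Xorg "
             "(0x55cb0151a000+0x1b5ce9) [0x55cb016cfce9]")
    print(" ".join(addr2line_command(frame)))
```

Either form of the address only resolves to a source location once the matching debug symbols are installed.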
The Ubuntu HWE16.04 package xserver-xorg-core-hwe-16.04-dbg exists but does not seem to contain any debugging symbols. Installed files:

/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/xserver-xorg-core-hwe-16.04-dbg
/usr/share/doc/xserver-xorg-core-hwe-16.04-dbg/changelog.Debian.gz
/usr/share/doc/xserver-xorg-core-hwe-16.04-dbg/copyright

Hence gdb will not debug this.

Also: What is the recommended way to safely bring down the existing xserver? I tried 'service lightdm stop' from vt1, but this leaves Xorg and several other X-related processes (xinit, xfailsave) hanging around.

Nevertheless, I will attach a dmesg output with the corresponding Xorg.0.log.
Created attachment 133199 [details] 2017/08/02 error reproduce dmesg output
Created attachment 133200 [details] 2017/08/02 error reproduce Xorg.0.log
Enable ddebs as described in https://wiki.ubuntu.com/Debug%20Symbol%20Packages and install xserver-xorg-core-hwe-16.04-dbgsym.
Now on Ubuntu HWE16.04 kernel 4.10.0-30. The problem persists: unable to drive external monitors.

I was unable to start Xorg in gdb. I run 'service lightdm stop', but this does not stop Xorg. Killing the process results in a new Xorg process being started. I have not yet found out what is causing this behavior. Any clues welcome. I will attach a 'ps -ef' output taken after the lightdm stop. This shows the related X processes.

What I could do is reproduce this after login, then suspend the machine and resume. This produces a similar error and brings up the Ubuntu error reporting. I took some screenshots of this. Not sure this will help, but I will attach them anyway, together with related logs and dmesg.
Created attachment 133236 [details] 'ps -ef' output after 'service lightdm stop'
Created attachment 133237 [details] 20170804 reproduce (standby/resume) error report 1/3
Created attachment 133238 [details] 20170804 reproduce (standby/resume) error report 2/3
Created attachment 133239 [details] 20170804 reproduce (standby/resume) error report 3/3
Created attachment 133240 [details] 20170804 reproduce (standby/resume) dmesg output
Created attachment 133241 [details] 20170804 reproduce (standby/resume) Xorg.0.log.old
Created attachment 133242 [details] 20170804 reproduce (standby/resume) Xorg.0.log
This is weird. The backtrace looks like ppriv->notify_on_damage == TRUE in ms_dirty_update, which indicates that msRequestSharedPixmapNotifyDamage was called before. But xf86-video-amdgpu doesn't call that hook.
Tried to reproduce today with kernel 4.10.0-32. Booted, then 'service lightdm stop'. This time Xorg was unloaded from the system. Started

    gdb --args /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch

[Rem.: this is the 'normal' cmdline I see in a running system]

then 'service lightdm start', but seemingly it did not use the Xorg started with gdb, but created a new process. Hit the error again.

Rebooted with external monitor detached. Ubuntu error reporting came up with the Xorg failure. I will attach screenshots of this.
Created attachment 133483 [details] 20170814 error reproduce Dependencies 1/3
Created attachment 133484 [details] 20170814 error reproduce Dependencies 2/3
Created attachment 133485 [details] 20170814 error reproduce Dependencies 3/3
Created attachment 133486 [details] 20170814 error reproduce Disassembly + Journalerrors
Created attachment 133487 [details] 20170814 error reproduce Proccpuinfo
Created attachment 133488 [details] 20170814 error reproduce Procmaps
Created attachment 133489 [details] 20170814 error reproduce Procstatus
Created attachment 133490 [details] 20170814 error reproduce Registers
Created attachment 133491 [details] 20170814 error reproduce Stacktrace
Created attachment 133492 [details] 20170814 error reproduce Threadstacktrace
(In reply to Meik Neubauer from comment #16)
> Started 'gdb --args /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth
> /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch'

Can you try simply starting Xorg with something like

    Xorg :0 -retro

That doesn't trigger the crash? What does

    DISPLAY=:0 xrandr

say at that point?

If you remove the xserver-xorg-video-amdgpu package, the problem doesn't occur anymore?
I tried it out but couldn't get Xorg running under gdb, or attach gdb to a running Xorg without freezing the whole system. I will upload a .crash file which has a coredump in it, which you can extract with apport-unpack <.crash-file>
Created attachment 133603 [details] apport .crash file use apport-unpack <.crash-filename> <empty_unpack_dir> to retrieve the coredump (164MB)
(In reply to Meik Neubauer from comment #29)
> apport-unpack <.crash-filename> <empty_unpack_dir>
> to retrieve the coredump (164MB)

Unfortunately, that aborts for me with a Python exception, maybe because I'm on Debian instead of Ubuntu.

Anyway, right now I'm really looking for answers to the questions in comment 27.
(In reply to Meik Neubauer from comment #28)
> I tried it out but couldn't get Xorg running under gdb, or attach gdb to a
> running Xorg without freezing the whole system.

Sounds like you tried running gdb from the same machine? You have to run it remotely, see https://www.x.org/wiki/Development/Documentation/ServerDebugging/ for more information.
(In reply to Michel Dänzer from comment #30)
> (In reply to Meik Neubauer from comment #29)
> > apport-unpack <.crash-filename> <empty_unpack_dir>
> > to retrieve the coredump (164MB)
>
> Unfortunately, that aborts for me with a Python exception, maybe because I'm
> on Debian instead of Ubuntu.

I will add the core dump in a simple compressed format...
Created attachment 133663 [details] extracted Core Dump from attachment 133603 [details] (apport .crash file)
gdb cannot resolve symbols with only the core dump. Anyway, the screenshots showed the interesting parts of the backtrace. The problem is I have no clue how it could end up running the crashing code path. We really need answers to the questions in comment 27.
(In reply to Michel Dänzer from comment #34)
> gdb cannot resolve symbols with only the core dump. Anyway, the screenshots
> showed the interesting parts of the backtrace. The problem is I have no clue
> how it could end up running the crashing code path.

I am not sure I understand this correctly. Maybe I am missing the point here, because I am not familiar with the Intel platform (I work on old mainframes instead), but from what I can get out of gdb, the call stack is pretty clear: the failure occurs in hw/xfree86/drivers/modesetting/driver.c in routine ms_dirty_update, where

    ent->slave_dst->drawable.pScreen->SharedPixmapNotifyDamage(ent->slave_dst);

loads a zero address for the call.

The question is which routine is responsible for setting up the structure in ent->slave_dst->drawable.pScreen, and why that is not done correctly. The related data areas all seem to be available in the core dump. I agree that backtracking this is painful bit-counting, but it does not seem out of reach. I would first try to identify the routine that sets up ent->slave_dst->drawable.pScreen.

I have not tried this yet, because without a complete build setup on my machine this is a cumbersome task. I guess that with a ready development environment on your side this is much easier to find out.

> We really need answers to the questions in comment 27.

I cannot set up a second machine. Due to network security restrictions here I cannot connect between two PCs. And I do not have an external monitor in any other location. So running a remote gdb debugging session is out of reach. Also, this is my everyday production laptop, so I cannot conduct any wild experiments on it.
(In reply to Meik Neubauer from comment #35)
> [...] from what I can get out of gdb is that the call stack is pretty clear
> [...]

Right, which is why I'm saying we don't need more backtraces right now.

> The question is, which routine is responsible for setting up the structure in
> ent->slave_dst->drawable.pScreen and why is that not done correctly.

As I said in comment 15, the question is how ppriv->notify_on_damage ends up being TRUE in ms_dirty_update. The only way I can see that being the case is if msRequestSharedPixmapNotifyDamage was called before, but xf86-video-amdgpu doesn't have any code calling the RequestSharedPixmapNotifyDamage hook.

> > We really need answers to the questions in comment 27.
>
> I cannot set up a second machine. Due to network security restrictions here
> I cannot connect between two PCs. And I do not have an external monitor in
> any other location. So running a remote gdb debugging session is out of
> reach.

None of the questions in comment 27 are related to remote access or gdb.
I was able to reproduce the problem. It's a modesetting driver bug. This code in ms_dirty_update:

    msPixmapPrivPtr ppriv = msGetPixmapPriv(&ms->drmmode, ent->slave_dst);

probably needs to be:

    msPixmapPrivPtr ppriv = msGetPixmapPriv(&ms->drmmode, ent->slave_dst->master_pixmap);

otherwise it tries to access random memory when the slave screen doesn't use the modesetting driver. There might be more related issues.
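To make the ownership problem concrete, here is a toy Python model of the situation described above. It is not xserver code: the Pixmap class, ms_get_pixmap_priv and make_prime_pair are hypothetical stand-ins, and the only thing taken from this report is that the slave-side pixmap belongs to a screen not driven by modesetting while its master_pixmap belongs to the modesetting screen. In the real C code the lookup on the foreign pixmap does not fail cleanly; it reads whatever memory happens to be there, which would explain a bogus TRUE in notify_on_damage. The model raises an error instead so the mistake is visible.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pixmap:
    """Toy pixmap: owned by one driver's screen, carrying that driver's privates."""
    owner: str                                    # driver of the owning screen
    privates: dict = field(default_factory=dict)  # private data keyed by driver name
    master_pixmap: Optional["Pixmap"] = None      # set on the slave-side pixmap


def ms_get_pixmap_priv(pixmap: Pixmap) -> dict:
    """Hypothetical stand-in for msGetPixmapPriv.

    Only meaningful for pixmaps owned by the modesetting screen. The real
    C lookup has no such check and would return unrelated memory.
    """
    if pixmap.owner != "modesetting":
        raise LookupError(f"pixmap belongs to {pixmap.owner}, not modesetting")
    return pixmap.privates["modesetting"]


def make_prime_pair():
    """Build a master pixmap (modesetting) and its slave counterpart (amdgpu)."""
    master = Pixmap(owner="modesetting",
                    privates={"modesetting": {"notify_on_damage": False}})
    slave_dst = Pixmap(owner="amdgpu", master_pixmap=master)
    return master, slave_dst


if __name__ == "__main__":
    master, slave_dst = make_prime_pair()
    try:
        ms_get_pixmap_priv(slave_dst)          # the lookup as currently written
    except LookupError as err:
        print("lookup on slave_dst:", err)
    # The suggested lookup goes through the pixmap the modesetting screen owns.
    print("lookup on master_pixmap:", ms_get_pixmap_priv(slave_dst.master_pixmap))
```

With the lookup routed through master_pixmap, notify_on_damage stays at the value the modesetting driver actually stored, so the slave screen's SharedPixmapNotifyDamage hook is not called unless it was requested.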
(In reply to Michel Dänzer from comment #36)
> > > We really need answers to the questions in comment 27.

I enabled the integrated Intel GFX for external monitors in the BIOS. Then I booted the Linux kernel with the parameter

    module_blacklist=amdgpu

The error does not show up. Currently driving two external monitors over the Intel GFX. A hot-plugged third is detected but not auto-configured correctly (no output).
I have a similar issue. After applying the patch from comment #37, the Xorg process no longer crashes when a monitor is connected.
*** Bug 106638 has been marked as a duplicate of this bug. ***
Someone who cares about the modesetting driver should create a proper patch and send it to the xorg-devel mailing list for review.
*** Bug 107437 has been marked as a duplicate of this bug. ***
Thanks for the report, fixed in xserver Git master:

commit f79e5368512b72bb463925983d265b070261b7aa
Author: Jim Qu <Jim.Qu@amd.com>
Date:   Mon Aug 27 13:37:38 2018 +0800

    modesetting: code refactor for PRIME sync