Bug 91954

Summary: "link training failed": nouveau does not recover from monitor suspend
Product: xorg
Component: Driver/nouveau
Status: NEW
Severity: normal
Priority: medium
Version: git
Hardware: Other
OS: All
Reporter: Dan Callaghan <djc>
Assignee: Nouveau Project <nouveau>
QA Contact: Xorg Project Team <xorg-team>
CC: bghome, jbeh, patrys, roflawl2009, sgonzalez
Attachments:
attachment 118176: kernel messages showing "link training failed" (flags: none)
attachment 120791: Kernel logs (flags: none)
attachment 121981: Kernel log of detaching and re-attaching external monitor (flags: none)
attachment 122088: Kernel 4.5-rc6 log of detaching and re-attaching external monitor (flags: none)

Description Dan Callaghan 2015-09-10 03:07:55 UTC
When my monitor wakes up from "power save mode" (DPMS suspend), nouveau does not display any output, and the monitor goes back to power save after a few seconds. The kernel says:

kernel: nouveau E[   PDISP][0000:03:00.0][0x00000006] 02:0006:0f42: link training failed
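To pull such messages out of a longer kernel log, a minimal Python sketch follows. The regular expressions are assumptions based only on the message formats quoted in this report (the `E[ PDISP]` form above and the `disp: outp` form that appears in later comments), and `find_training_failures` is a hypothetical helper name, not part of any nouveau tooling.

```python
import re

# Assumed patterns, derived only from the log lines quoted in this bug report.
OLD_FMT = re.compile(
    r"nouveau E\[\s*PDISP\]\[(?P<pci>[0-9a-f:.]+)\]\[0x[0-9a-f]+\]\s+"
    r"(?P<outp>[0-9a-f]{2}:[0-9a-f]{4}:[0-9a-f]{4}): (?P<msg>.*training failed)"
)
NEW_FMT = re.compile(
    r"nouveau (?P<pci>[0-9a-f:.]+): disp: outp "
    r"(?P<outp>[0-9a-f]{2}:[0-9a-f]{4}:[0-9a-f]{4}): (?P<msg>.*training failed)"
)


def find_training_failures(lines):
    """Return (pci_address, output_id, message) for each matching log line."""
    hits = []
    for line in lines:
        m = OLD_FMT.search(line) or NEW_FMT.search(line)
        if m:
            hits.append((m.group("pci"), m.group("outp"), m.group("msg")))
    return hits
```

It could be fed a saved log, for example `find_training_failures(open("dmesg.txt"))`, where `dmesg.txt` is a placeholder file name.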

The behaviour of the monitor might be a bit dodgy. When booting the machine I have to mash buttons on the monitor to keep it awake until the firmware starts displaying an image, otherwise I will get no output from the firmware at all.

When the monitor goes into power save mode, it stays awake showing a message "Power save mode" for about 3 seconds, then it actually powers down (its LED turns orange). I have noticed that nouveau can recover if I shake the mouse during those three seconds, but it can't recover if I let the monitor power all the way down.

Complete logs to follow.

Hardware:
Lenovo Thinkstation P500, all legacy/CSM features disabled ("pure UEFI")
Nvidia Quadro K620 (GM107/NV117)
Thinkvision 2840m connected via Displayport

Software:
xorg-x11-server-Xorg-1.17.2-2.fc22.x86_64
xorg-x11-drv-nouveau-1.0.11-2.fc22.x86_64
Kernel is 4.2.0 plus Fedora patches plus these four commits suggested by Ben Skeggs:
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?h=linux-4.3&id=7c11c99b3c66a8e03494e56ce6e6c5303ee85934
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?h=linux-4.3&id=f10956d4455fcb24ecbdca30e6d9d88c95dc2588
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?h=linux-4.3&id=fe0f5d08806dcf7fd51092dfc6ea666ea2392692
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?h=linux-4.3&id=2a89359415da2fc1250b4c205de3c384bd781f54
Comment 1 Dan Callaghan 2015-09-10 03:10:33 UTC
Created attachment 118176 [details]
kernel messages showing "link training failed"

Complete kernel messages showing "link training failed".

At 11:34 the console went blank and the monitor started to go into power save, but I hit a key while it was still powered on showing a "Power save mode" message, before it actually powered down. Nouveau said "training complete" and I got the image back.

At 11:44 the monitor went into power save again and I let it power all the way down. Nouveau says "link training failed" and then "training complete", but the monitor just goes back to power save; there is no image.
Comment 2 Dan Callaghan 2015-09-10 05:35:24 UTC
The Nvidia proprietary driver is able to recover from the monitor going to sleep.

Here is an mmiotrace showing the monitor going to sleep (xset dpms force suspend), eventually powering down, and then waking up and nvidia recovering the display:

https://fedorapeople.org/~dcallagh/fdo-bz91954-mmiotrace.log.xz
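For anyone trying to reproduce the suspend/wake cycle, a minimal Python sketch follows. It only wraps the `xset dpms force suspend` command mentioned above and the matching `xset dpms force on`; the function names and the 30-second wait are assumptions, and the wait has to be long enough for the monitor to power all the way down.

```python
import subprocess
import time


def dpms_cycle_commands():
    """Commands for one suspend/wake cycle, using xset's DPMS options."""
    return [
        ["xset", "dpms", "force", "suspend"],
        ["xset", "dpms", "force", "on"],
    ]


def run_cycle(sleep_seconds=30, run=subprocess.run, sleep=time.sleep):
    """Suspend the display, wait for the monitor to power down, then wake it."""
    suspend, wake = dpms_cycle_commands()
    run(suspend, check=True)
    sleep(sleep_seconds)  # assumed long enough for the monitor to power down
    run(wake, check=True)
```

Calling `run_cycle()` from a terminal inside the X session and then checking the kernel log shows whether link training failed on wake-up.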
Comment 3 Fredy Neeser 2015-10-20 16:42:48 UTC
For a similar issue with a Lenovo W530 and any Fedora kernel above 4.1.3-201,
see

  https://bugzilla.redhat.com/show_bug.cgi?id=1260053
  External monitor remains black with kernel 4.1.6-200.fc22.x86_64 - nouveau reports "link training failed"
Comment 4 Julien HENRY 2015-12-17 13:40:00 UTC
I have a similar issue on Fedora 23 using kernel rawhide 4.4.0-0.rc5.git0.1.fc24.x86_64 + xorg-x11-drv-nouveau-1.0.12-1.fc23.x86_64.

I have two monitors: DVI + Display port.

After letting them go into suspend mode, I moved the mouse: the DVI monitor woke up, but the image on the DisplayPort one was completely corrupted.
I went to the monitor settings to change the resolution, hoping to fix the issue without a reboot, and the UI froze completely (I can still SSH in from another computer).

[14390.513212] nouveau 0000:01:00.0: disp: outp 05:0006:0f44: link training failed
[14390.670611] ------------[ cut here ]------------
[14390.670617] WARNING: CPU: 3 PID: 26365 at include/drm/drm_crtc.h:1565 drm_helper_choose_encoder_dpms+0x8a/0x90 [drm_kms_helper]()
[14390.670618] Modules linked in: tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_broute bridge stp llc ebtable_filter ebtable_nat ebtables ip6table_raw ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_filter ip6_tables iptable_raw iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle binfmt_misc gspca_zc3xx gspca_main v4l2_common videodev media snd_usb_audio snd_usbmidi_lib snd_rawmidi joydev snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm iTCO_wdt iTCO_vendor_support intel_rapl eeepc_wmi iosf_mbi x86_pkg_temp_thermal coretemp asus_wmi sparse_keymap kvm rfkill irqbypass snd_timer
[14390.670639]  snd crct10dif_pclmul crc32_pclmul crc32c_intel soundcore mei_me mei lpc_ich i2c_i801 shpchp tpm_infineon tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc nouveau serio_raw r8169 mxm_wmi i2c_algo_bit drm_kms_helper uas mii ttm usb_storage drm hid_roccat_konepure hid_roccat hid_roccat_common video wmi fjes
[14390.670651] CPU: 3 PID: 26365 Comm: kworker/3:1 Tainted: G        W       4.4.0-0.rc5.git0.1.fc24.x86_64 #1
[14390.670652] Hardware name: ASUS All Series/Z87-C, BIOS 2103 08/15/2014
[14390.670663] Workqueue: events nvif_notify_work [nouveau]
[14390.670664]  0000000000000000 000000005f9a7b6d ffff8801a0fefd10 ffffffff813b022f
[14390.670666]  0000000000000000 ffff8801a0fefd48 ffffffff810a2ef2 ffff880212424000
[14390.670667]  ffff880036a50600 ffff880214c47000 0000000000000000 ffff880214eb63e8
[14390.670668] Call Trace:
[14390.670672]  [<ffffffff813b022f>] dump_stack+0x44/0x55
[14390.670674]  [<ffffffff810a2ef2>] warn_slowpath_common+0x82/0xc0
[14390.670675]  [<ffffffff810a303a>] warn_slowpath_null+0x1a/0x20
[14390.670677]  [<ffffffffa01192ca>] drm_helper_choose_encoder_dpms+0x8a/0x90 [drm_kms_helper]
[14390.670679]  [<ffffffffa01193bb>] drm_helper_connector_dpms+0x4b/0x100 [drm_kms_helper]
[14390.670695]  [<ffffffffa020574b>] nouveau_connector_hotplug+0x5b/0xb0 [nouveau]
[14390.670700]  [<ffffffffa0165a77>] nvif_notify_work+0x27/0xa0 [nouveau]
[14390.670702]  [<ffffffff81791f8e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[14390.670704]  [<ffffffff810ba9bd>] ? pwq_dec_nr_in_flight+0x4d/0xa0
[14390.670705]  [<ffffffff810bb07e>] process_one_work+0x19e/0x3f0
[14390.670706]  [<ffffffff810bb31e>] worker_thread+0x4e/0x450
[14390.670708]  [<ffffffff8178de30>] ? __schedule+0x3e0/0x9b0
[14390.670709]  [<ffffffff810bb2d0>] ? process_one_work+0x3f0/0x3f0
[14390.670710]  [<ffffffff810bb2d0>] ? process_one_work+0x3f0/0x3f0
[14390.670711]  [<ffffffff810c10a8>] kthread+0xd8/0xf0
[14390.670712]  [<ffffffff810c0fd0>] ? kthread_worker_fn+0x160/0x160
[14390.670714]  [<ffffffff8179284f>] ret_from_fork+0x3f/0x70
[14390.670715]  [<ffffffff810c0fd0>] ? kthread_worker_fn+0x160/0x160
[14390.670716] ---[ end trace 0fc951b1df0a1d95 ]--
Comment 5 Julien HENRY 2016-01-04 13:02:46 UTC
Created attachment 120791 [details]
Kernel logs
Comment 6 Géza Búza 2016-02-24 09:53:21 UTC
I just wanted to mention that kernel 4.5-rc5 is also affected by this issue.
Comment 7 Géza Búza 2016-02-26 10:57:35 UTC
Created attachment 121981 [details]
Kernel log of detaching and re-attaching external monitor

By looking at the log, my theory is that sometimes nouveau does not store the preferred mode correctly when probing for available modes and tries to set an invalid mode later.

In error_connecting_monitor.log, line 344 shows the list of available modes for the screen attached via DisplayPort. The native mode is 1920x1080, so Modeline 67 should be remembered. However, line 392 shows that Modeline 52 is set instead, which is not in the list of allowed modes.
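The check this theory implies can be sketched in Python as follows. It only illustrates the reasoning and is not nouveau's actual mode-selection code: `choose_mode` is a hypothetical name, and apart from Modelines 67 and 52 the mode IDs in the usage note are made up.

```python
def choose_mode(requested_id, probed_ids, preferred_id):
    """Return requested_id if it was probed, else fall back to preferred_id."""
    if requested_id in probed_ids:
        return requested_id
    if preferred_id not in probed_ids:
        raise ValueError("preferred mode is not in the probed list")
    return preferred_id
```

With a hypothetical probed set `{67, 68, 70}` and preferred mode 67, `choose_mode(52, {67, 68, 70}, 67)` falls back to 67 instead of setting a mode the monitor never advertised.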

Can any developer confirm this?
Comment 8 Géza Búza 2016-03-02 21:43:28 UTC
Created attachment 122088 [details]
Kernel 4.5-rc6 log of detaching and re-attaching external monitor

A fix that addresses this issue has landed in kernel 4.5-rc6. See Ben's commit here: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=95664e66fad964c3dd7945d6edfb1d0931844664

I have been running the rc6 kernel for 3 days now and haven't run into this bug since.

However, I still see the "link training failed" error in the logs when the monitor is unplugged. Fortunately, it does not bring down the Xorg server.
See the attached log file for more details.
Comment 9 Dolphykins 2016-07-08 20:07:30 UTC
I'm still running into this issue. I posted more detailed information and the output of dmesg and journalctl on Fedoraforum, here: http://forums.fedoraforum.org/showthread.php?t=310633

When my system is suspended and wakes back up, the primary LCD is stuck on a blank white image and sometimes cannot recover when switching TTYs. Dmesg fills up with this:
[ 270.115821] nouveau 0000:01:00.0: disp: outp 00:0006:0344: link training failed
[ 270.122646] nouveau 0000:01:00.0: disp: outp 00:0006:0344: link training failed
[ 270.123698] nouveau 0000:01:00.0: disp: outp 00:0006:0344: link not trained before attach

Hardware:
HP EliteBook 8440p, Legacy Boot
Nvidia GT218M (NVS 3100M)

Software:
Fedora 24, latest updates, Gnome 3.20.2
Gallium 0.4 on NVA8
Kernel - 4.6.3-300.fc24.x86_64
Comment 10 Dolphykins 2016-11-17 22:12:35 UTC
The issue is still present on my system several months and updates later.

Fedora 24, latest updates. The issue occurs under both Gnome and LXDE, as well as on the login screen.
GDM 3.20.1
Nouveau running on 4.8.7-200.fc24.x86_64
Comment 11 Martin Peres 2016-12-06 12:50:30 UTC
We are trying to get the DRM link-status property merged, which will allow userspace to react to errors like this.

I may be convinced to add support for this in Nouveau-drm and -nouveau, since I studied it and have patches for -modesetting already.
Comment 12 Dan Callaghan 2018-03-27 02:26:44 UTC
I'm no longer seeing this issue with the 4.15.10-200.fc26 kernel in Fedora.

When the monitor goes to sleep, and I wake it back up, I still see two kernel messages like this:

nouveau 0000:03:00.0: disp: outp 02:0006:0f42: training failed
nouveau 0000:03:00.0: disp: outp 02:0006:0f42: training failed

but it seems that nouveau recovers regardless. My X display comes back and everything keeps working.