107546 – Screen is frozen on second connection of DP MST dock

Summary: Screen is frozen on second connection of DP MST dock

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high normal
Assignee:	Stanislav Lisovskiy
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-08-11 18:14 UTC by Sergey Menshikov
Modified:	2019-05-14 11:12 UTC
CC List:	7 users

See Also:
i915 platform:	SKL
i915 features:	display/DP MST

Attachments
dmesg for drm-tip kernel connection error (51.28 KB, text/plain) 2018-08-22 01:42 UTC, Sergey Menshikov
xrandr without the dock (1.85 KB, text/plain) 2018-08-22 01:43 UTC, Sergey Menshikov
xrandr with the dock with dual displays connected (2.29 KB, text/plain) 2018-08-22 01:43 UTC, Sergey Menshikov
xrandr in the "displays disconnected but not disabled in xrandr" state (1.64 KB, text/plain) 2018-09-26 21:56 UTC, Chris Hobbs
dmesg for freeze on 4.15.0.26 with log_buf_len=4M and dr.debug=0x1e (380.17 KB, application/zip) 2018-10-06 00:40 UTC, Sergey Menshikov
Hang when turning 3rd monitor on (775.30 KB, text/x-log) 2018-10-19 17:07 UTC, Benjamin Berg
dmesg drm debug (968.99 KB, text/plain) 2018-11-13 23:37 UTC, Vedran Furač
dmesg drm 4.10 (2.12 MB, text/plain) 2018-11-20 22:04 UTC, Vedran Furač
dmesg drm 4.19.13 (14.83 KB, text/plain) 2019-01-11 01:48 UTC, Vedran Furač

Description Sergey Menshikov 2018-08-11 18:14:51 UTC

Kernel 4.15.0.30 Ubuntu 18.04.1 Xorg/Gnome Lenovo 4th gen X1 20FBxxx Skylake i7 Intel built-in video, Lenovo OneLink+ DP MST dock with dual dell 24" displays

Error messages upon dock disconnect:

[drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[drm:intel_mst_pre_enable_dp [i915]] *ERROR* failed to allocate vcpi

The next connection leaves the laptop screen frozen, with no change on the external monitors.

[drm:intel_wait_ddi_buf_idle [i915]] *ERROR* Timeout waiting for DDI BUF C idle bit

Disconnecting the dock restores the laptop screen, but every subsequent connection of the dock shows the same symptoms. A reboot fixes the issue until the first dock disconnect.

My workaround is to enable GuC firmware loading and boot without the dock attached.

sudo vi /etc/modprobe.d/i915.conf

options i915 enable_guc_loading=1 enable_guc_submission=1

sudo update-initramfs -u
sudo reboot

(reboot while not connected to dock)
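To confirm after the reboot that the options were actually picked up, the values can be read back from sysfs. This is a minimal sketch, assuming the i915 parameters are exposed under /sys/module/i915/parameters (the usual location for module parameters) and that the parameter names match the modprobe line above. On other kernel versions the names may differ, in which case the entry is reported as missing.

```python
from pathlib import Path

# Parameter names copied from the modprobe.d line above. They are an
# assumption for any kernel other than the one used in this report.
PARAMS = ("enable_guc_loading", "enable_guc_submission")


def read_module_params(names=PARAMS, base="/sys/module/i915/parameters"):
    """Return {name: value}; the value is None when the file is missing."""
    result = {}
    for name in names:
        try:
            result[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Parameter not exposed (module not loaded, or renamed/removed).
            result[name] = None
    return result


for name, value in read_module_params().items():
    print(f"{name} = {value if value is not None else '(not present)'}")
```

A "(not present)" result only says that the running kernel does not expose a parameter under that name; it does not say whether the firmware was loaded.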

Comment 1 Jani Saarinen 2018-08-13 08:27:38 UTC

Hi,
Could you try the latest drm-tip (https://cgit.freedesktop.org/drm-tip), boot with drm.debug=0x1e log_buf_len=4M, capture dmesg from boot until the problem occurs, and attach the log here as plain text?

Why do you use GuC firmware loading as a workaround? Why do you think it is connected to this problem?
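For readers unsure what the requested drm.debug value turns on: the parameter is a bitmask of DRM debug categories. The sketch below decodes a mask into category names. The bit table reflects the categories described for the drm.debug module parameter as far as this sketch knows them; it is deliberately incomplete, and newer kernels define additional bits, which are reported as unknown here.

```python
# Known drm.debug category bits. This table is an assumption based on the
# DRM core's parameter description; kernels may define further bits.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """Return the category names enabled by a drm.debug bitmask."""
    names = [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]
    unknown = mask & ~sum(DRM_DEBUG_BITS)  # bits not covered by the table
    if unknown:
        names.append(f"unknown(0x{unknown:x})")
    return names


print(decode_drm_debug(0x1E))  # ['DRIVER', 'KMS', 'PRIME', 'ATOMIC']
```

Under the same table, a mask of 0x14 decodes to KMS and ATOMIC only.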

Comment 2 Stanislav Lisovskiy 2018-08-13 10:14:10 UTC

(In reply to Sergey Menshikov from comment #0)
> Kernel 4.15.0.30 Ubuntu 18.04.1 Xorg/Gnome Lenovo 4th gen X1 20FBxxx Skylake
> i7 Intel built-in video, Lenovo OneLink+ DP MST dock with dual dell 24"
> displays
> 
> Error messages upon dock disconnect:
> 
> [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
> [drm:intel_mst_pre_enable_dp [i915]] *ERROR* failed to allocate vcpi
> 
> Next connection leads to frozen screen on laptop, no change on external
> monitors. 
> 
> [drm:intel_wait_ddi_buf_idle [i915]] *ERROR* Timeout waiting for DDI BUF C
> idle bit
> 
> Disconnecting the dock restores laptop screen functionality, but subsequent
> connection of the dock has the same symptoms repeating. Reboot fixes the
> issue until first dock disconnect. 
> 

Does it happen only on the laptop's integrated screen? This might be a duplicate of
https://bugs.freedesktop.org/show_bug.cgi?id=106250, which in fact consists of
two different bugs. One is related to the watermark calculation algorithm, which can exceed the available resources for a Y-tiled framebuffer in 4K screen modes. It manifests as a black integrated laptop screen, and with drm.debug=0x14 enabled you should see complaints about exceeding the available ddb_allocation.
The other bug it might be related to is that we sometimes don't get an RRSetCrtcConfig request from the X server client upon receiving the hotplug event,
which results in no drm_mode_setcrtc call for one of the screens.
This one seems to affect external displays and happens every second time you plug/unplug an external display.
So please attach dmesg traces with drm.debug=0x14 enabled, so that we can tell whether it's a duplicate or something else.
You can also try to enable the screen manually with xrandr --output (output name) --crtc (number), trying different CRTC numbers. If that helps, it might also mean that we simply don't get the modesetting call.
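The manual xrandr experiment can be written down as a short list of commands to try one after another. The sketch below only builds and prints the command lines; it does not run them. The output name "DP-1-1" and the CRTC range 0 to 2 are placeholders, not values taken from this report: use the output names from your own xrandr listing.

```python
import shlex


def crtc_probe_commands(output, crtcs=(0, 1, 2)):
    """Build one 'xrandr --output <name> --crtc <n>' argv per CRTC number."""
    return [["xrandr", "--output", output, "--crtc", str(n)] for n in crtcs]


# "DP-1-1" is a hypothetical output name used for illustration only.
for argv in crtc_probe_commands("DP-1-1"):
    print(shlex.join(argv))
```

Running the printed commands one at a time, and noting which one (if any) lights up the display, keeps the result easy to report back.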

Comment 3 Sergey Menshikov 2018-08-13 19:59:31 UTC

Jani, do I have to build anything beyond kernel and modules?

I built kernel and modules and installed them, but the experience is not exactly the same. (I am following this recipe https://01.org/linuxgraphics/documentation/build-guide-0)

Re: GuC firmware loading as a workaround: my theory is that it is a race condition with modesetting happening while the displays disconnect (taken from here https://patchwork.kernel.org/patch/8603521/). Anything can tip off a race condition, so I tried a number of i915 options, including firmware loading, and it seems to be helping (hard to prove a negative).

One additional piece of data I failed to mention: my setup is to turn off the laptop display when connecting the two external monitors, and turn it back on when I disconnect them.


Stanislav, laptop screen literally freezes - the image is still there, mouse moves, nothing happens on click or keyboard input. Two external monitors stay black.

Comment 4 Stanislav Lisovskiy 2018-08-14 06:55:35 UTC

Then it looks like it's something different. However, if this were a race condition, it typically wouldn't fail every time. We need dmesg traces then.
You can also still try to connect to your machine, run the modesetting through xrandr, and check what it says. It would also be nice to get the connector states from xrandr when you disconnect and connect; we might see something interesting there.

Comment 5 Sergey Menshikov 2018-08-22 01:42:29 UTC

Created attachment 141231
dmesg for drm-tip kernel connection error

Comment 6 Sergey Menshikov 2018-08-22 01:43:13 UTC

Created attachment 141232
xrandr without the dock

Comment 7 Sergey Menshikov 2018-08-22 01:43:53 UTC

Created attachment 141233
xrandr with the dock with dual displays connected

Comment 8 Sergey Menshikov 2018-08-22 01:46:36 UTC

I'm having trouble reproducing with the drm-tip kernel: it hangs on dock connection, but differently. Connecting the dock with the drm-tip kernel results in a strange picture on the external displays (a blown-up section of the screen, displayed in a limited square on one of the external screens). Disconnecting the dock leaves everything blank. Putting the laptop to sleep restores the laptop panel. dmesg (without debug) is attached. Any help on what's wrong is appreciated (perhaps I built drm-tip wrong).

On 4.15.0.30: unable to reproduce with log_buf_size option set to anything. Will try some more.

I was able to reproduce it again without kernel options on the 4.15.0.30 kernel. While the screen is frozen, "xrandr -display :0.0 -q"  from external ssh actually hangs until I disconnect the dock.

xrandr output with screens connected and disconnected during normal operation is attached.
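Because the xrandr query itself blocks while the screen is frozen, it helps to wrap it in a timeout when collecting state over ssh, so the shell is not stuck until the dock is unplugged. A minimal sketch using only the Python standard library; the 10 second default is an arbitrary choice, not a value from this report.

```python
import subprocess


def query_with_timeout(argv, timeout_s=10.0):
    """Run argv and return its stdout, or None if it did not finish in time."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing lingers.
        return None
    return done.stdout
```

Called as query_with_timeout(["xrandr", "-display", ":0.0", "-q"]), a None result would correspond to the hang described above, while a string is the normal xrandr listing.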

Comment 9 Stanislav Lisovskiy 2018-08-22 08:02:45 UTC

Sergey Menshikov, so does this issue happen every time with the drm-tip kernel?

If you were not able to reproduce it on 4.15.0.30, it would still be nice to get dmesg for that one too, in order to compare the drm messages. Or did you mean that the traces themselves affect the behavior?

I saw at least a few messages, which seem rather suspicious to me:

[   34.784041] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   45.024099] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   55.264172] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CONNECTOR:107:DP-3] flip_done timed out
[   65.504195] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   75.744168] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   86.496342] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CONNECTOR:107:DP-3] flip_done timed out
[   96.736191] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:42:plane 1B] flip_done timed out

Basically, it waited for the flip operation to finish and then failed with a timeout.
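To see the pattern in these messages at a glance, the timestamps and affected objects can be pulled out of the log. The sketch below parses the first three lines quoted above and prints the gap between consecutive timeouts. The regular expression is an assumption fitted to the line format shown in this comment, not a general dmesg parser.

```python
import re

# Sample copied from the dmesg excerpt above (first three lines only).
SAMPLE = """\
[   34.784041] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   45.024099] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   55.264172] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CONNECTOR:107:DP-3] flip_done timed out
"""

FLIP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"             # dmesg timestamp in seconds
    r"\[drm:(?P<func>\w+)(?: \[\w+\])?\]\s+"  # reporting function, optional module
    r"\*ERROR\*\s+\[(?P<obj>[^\]]+)\]\s+flip_done timed out"
)


def parse_flip_timeouts(text):
    """Return (timestamp, function, object) for each flip_done timeout line."""
    return [(float(m["ts"]), m["func"], m["obj"]) for m in FLIP_RE.finditer(text)]


def gaps(events):
    """Seconds between consecutive events, rounded to milliseconds."""
    ts = [e[0] for e in events]
    return [round(b - a, 3) for a, b in zip(ts, ts[1:])]


events = parse_flip_timeouts(SAMPLE)
for ts, func, obj in events:
    print(f"{ts:10.3f}  {obj:<22} {func}")
print("gaps:", gaps(events))
```

In the sample the messages arrive a little over ten seconds apart, which is what one would expect if each wait runs into the same fixed timeout before the next one starts.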

Comment 10 Artur Souza 2018-08-22 14:22:00 UTC

I am experiencing the same issue as well as my second monitor never turning on (not sure if related or not). I am running drm-tip. What kind of info is still needed to better investigate this bug?

Comment 11 Lakshmi 2018-08-23 11:28:44 UTC

> On 4.15.0.30: unable to reproduce with log_buf_size option set to anything.
> Will try some more.
Do you have dmesg log in this case?

Comment 12 Lakshmi 2018-08-30 07:20:28 UTC

Sergey, can you attach the logs from the scenario where you were unable to reproduce the issue? That will help in debugging the issue.

Comment 13 Lakshmi 2018-09-03 06:56:28 UTC

Reporter, any progress?

(In reply to Artur Souza from comment #10)
> I am experiencing the same issue as well as my second monitor never turning
> on (not sure if related or not). I am running drm-tip. What kind of info is
> still needed to better investigate this bug?

Since you were able to reproduce with latest drm-tip, can you send dmesg logs with drm.debug=0x1e log_buf_len=4M?

Comment 14 Lakshmi 2018-09-10 05:50:37 UTC

Artur, Ping?

Comment 15 Artur Souza 2018-09-10 14:14:17 UTC

I will try with drm.debug=0x1e log_buf_len=4M and will attach the logs here today.

Comment 16 Artur Souza 2018-09-10 21:32:10 UTC

I couldn't reproduce with drm-tip (07cf212bc704357ee60aba52ec40bab538222040)

Comment 17 Stanislav Lisovskiy 2018-09-11 12:56:32 UTC

I could reproduce it now with a different Dell docking station. It happens only with 2 screens, with exactly this drm-tip (07cf212bc704357ee60aba52ec40bab538222040).

There are lots of timeouts like:

[   34.784041] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   45.024099] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:55:pipe B] flip_done timed out

Everything hangs, however ssh access works. I will now investigate what is the problem.

Comment 18 Stanislav Lisovskiy 2018-09-12 10:26:34 UTC

I was wrong. The issue is no longer reproducible with 07cf212bc704357ee60aba52ec40bab538222040.

It was clearly reproducible on 4.19-rc1 at least, so this is probably already fixed.

Comment 19 Artur Souza 2018-09-13 17:52:49 UTC

I was able to reproduce it today with 4.19-rc2+. The issue is not a kernel freeze but it's X freezing. Going to terminal and restarting X helps recover without the need to reboot.

I will share logs asap.

Comment 20 Stanislav Lisovskiy 2018-09-14 07:29:40 UTC

That is interesting, because both the reporter and I were seeing lots of these coming from the kernel:

[   34.784041] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   45.024099] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   55.264172] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CONNECTOR:107:DP-3] flip_done timed out
[   65.504195] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   75.744168] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:55:pipe B] flip_done timed out
[   86.496342] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CONNECTOR:107:DP-3] flip_done timed out
[   96.736191] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:42:plane 1B] flip_done timed out

With recent drm-tip those were gone (or became harder to reproduce?). Or is it another issue now? I'm aware of at least some X server related issues as well: https://bugs.freedesktop.org/show_bug.cgi?id=106250.

Comment 21 Artur Souza 2018-09-14 17:01:43 UTC

One thing that I noticed is that if I remove the computer from the dock with the lid open, then everything works. If I remove from the dock with the lid closed, then the only way to get back is to switch to terminal, restart X (display-manager on Ubuntu) and move on.

Comment 22 Chris Hobbs 2018-09-26 21:43:27 UTC

I also have this bug on my Thinkpad 13 with a Onelink+ dock which contains a displayport MST adapter. I have two screens connected directly over displayport.

This issue only started happening recently. I will attempt to downgrade my kernel until I can no longer reproduce. I am currently running 4.18.9-arch1-1-ARCH.

Starting up with the dock connected or disconnected works fine, and so does disconnecting the dock.

Connecting the dock the first time after starting up with the dock disconnected works, any further connection attempt fails.

Disconnecting the dock when the display is frozen rectifies the situation, and the machine can be rebooted normally to restore display. Restarting X11 doesn't help, it has to be a reboot.

One interesting note is that if you disconnect the dock, then don't issue any more xrandr commands to disable the now-disconnected displays, you can plug the dock back in, and the issue doesn't reproduce. The issue only reproduces if you disable the disconnected displays using xrandr before plugging the dock back in. The external displays do not have to be enabled in xrandr to freeze the display.

And in fact, the "frozen screen" on the laptop is actually not frozen, it just only updates every 5 or so seconds. There's a lot of flickering on the laptop display when plugging in the dock. Unplugging the dock always restores.

On unplugging, nothing appears in dmesg.

When disabling the unplugged outputs in xrandr, this appears in dmesg:

[ 2037.023131] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[ 2037.291632] [drm:intel_encoders_pre_enable.isra.63 [i915]] *ERROR* failed to allocate vcpi

When the dock is plugged back in, this appears in dmesg many times:

[ 2143.988849] [drm:intel_ddi_prepare_link_retrain [i915]] *ERROR* Timeout waiting for DDI BUF C idle bit

The dmesg messages stop as soon as the dock is unplugged; however, the laptop display takes a few seconds to recover.

Hope this info helps.
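For anyone comparing their own logs against this sequence, the i915 error lines can be tallied by reporting function and message. The sketch below does this for the three lines quoted in this comment. The regular expression is an assumption based on the line format shown here; it strips compiler suffixes such as ".isra.63" so that the same function groups together across kernel builds.

```python
import re
from collections import Counter

# Sample copied from the dmesg lines quoted in this comment.
SAMPLE = """\
[ 2037.023131] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[ 2037.291632] [drm:intel_encoders_pre_enable.isra.63 [i915]] *ERROR* failed to allocate vcpi
[ 2143.988849] [drm:intel_ddi_prepare_link_retrain [i915]] *ERROR* Timeout waiting for DDI BUF C idle bit
"""

ERR_RE = re.compile(
    r"\[drm:(?P<func>\w+)(?:\.\w+)*"  # function name, dropping .isra.N style suffixes
    r"(?: \[\w+\])?\]\s+"             # optional module name, e.g. [i915]
    r"\*ERROR\*\s+(?P<msg>.+)"        # message text up to the end of the line
)


def count_errors(text):
    """Count DRM *ERROR* lines, keyed by (function, message)."""
    return Counter((m["func"], m["msg"].strip()) for m in ERR_RE.finditer(text))


for (func, msg), n in count_errors(SAMPLE).most_common():
    print(f"{n:4d}  {func}: {msg}")
```

On a log from an affected machine, a large count for the "DDI BUF C idle bit" message after the replug would match the flood described above.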

Comment 23 Chris Hobbs 2018-09-26 21:56:38 UTC

Created attachment 141753
xrandr in the "displays disconnected but not disabled in xrandr" state

I've attached the output of xrandr in the state after disconnecting the dock but before disabling the external displays in xrandr. In other words, this is before any errors in dmesg, and the state which can be recovered from.

Comment 24 Sergey Menshikov 2018-10-06 00:39:21 UTC

Apologies for the delay - the issue became very rare.

I reproduced it with kernel 4.15.0.36 with options drm.debug=0x1e log_buf_len=4M and dmesg is attached.

The key this time was to boot with dock attached, disconnect and reconnect 2 times. I can repro this reliably every time at the moment.

To reproduce with the drm-tip kernel I need more guidance (I am unable to connect the dock in the first place):

> Connecting the dock with drm-tip kernel results in strange picture on external
> displays (blown up section of screen, displayed in a limited square on one of
> the external screens). Disconnecting the dock leaves everything blank. Putting 
> laptop to sleep restores laptop panel.

Comment 25 Sergey Menshikov 2018-10-06 00:40:55 UTC

Created attachment 141917
dmesg for freeze on 4.15.0.26 with log_buf_len=4M and dr.debug=0x1e

Comment 26 Stanislav Lisovskiy 2018-10-16 08:29:47 UTC

Reporter, can you please try to reproduce the issue with recent drm-tip(>=4.19-rc7) - I can reproduce this issue with 4.19-rc2, but not with the recent one.

Comment 27 Benjamin Berg 2018-10-19 11:53:38 UTC

I am seeing a similar issue on an X1 4th Gen and using a dock with two attached monitors (DP). It seemed to be somewhat worse if one of the monitors has MST enabled (DP 1.2 switch). This was also on GNOME (Fedora 29, Kernel 4.18.13-300.fc29, mutter 3.30.1-2.fc29).

What works for me as a workaround is avoiding the "hotplug" by suspending the machine and then resuming it again after plugging in the monitors. Unplugging generally worked fine for me.

I'll do a local build of drm-tip and see what happens there.

Comment 28 Benjamin Berg 2018-10-19 17:07:46 UTC

Created attachment 142103
Hang when turning 3rd monitor on

OK, I tried drm-tip (9ef57c71386778d8425b4884252f1919184645a1).

Configuration:
 * X1 Carbon 4th Gen
 * Onelink+ Dock (yeah, same as the reporter it seems, maybe this dock is "special" in some way …)
 * Three monitors:
   - Internal screen, 1920x1080
   - DELL U2515H (right now DP 1.2/MST enabled), attached to dock, 2560x1440
   - DELL P2417H, attached to dock, 1920x1080

When I boot with everything connected:
 * Internal screen works
 * Graphics errors on P2417H
 * U2515H remains off (but laptop thinks it is used)

The attached log file is from booting with the P2417H disabled. I logged in and then switched the P2417H off, which caused my laptop to freeze after a short time. I did not manage to recover, but I did try to switch the U2515H off/on a few times and then unplugged the dock.

Comment 29 Benjamin Berg 2018-10-19 17:12:21 UTC

I switched the P2417H *on* not off.

The trick with suspending seems to also work with drm-tip, and I now have all three monitors working at the same time.

Comment 30 Stanislav Lisovskiy 2018-10-22 06:48:50 UTC

The suspend trick sounds really suspicious. Unfortunately I'm not able to reproduce this issue with my Dell docking station; I will probably have to order exactly the same one as here. However, there is one issue I'm currently investigating that might be related.

Can you try switching to Wayland to check whether it helps? Also, what happens if you boot into runlevel 3?

Comment 31 Benjamin Berg 2018-10-22 07:54:52 UTC

Hm, sorry if that was not clear. I am on wayland.

Comment 32 Benjamin Berg 2018-10-22 09:41:33 UTC

Hrm, I could have sworn that this has worked much better not long ago. However, I keep having issues, with all kernels I tried so far (at least 4.16.3, 4.18.9; best candidate for working would be 4.17.11, but haven't tested that so far).

I am wondering if there might be some issues with my connectors. This weird suspend trick works reliably though :-/

Comment 33 Benjamin Berg 2018-10-23 14:48:42 UTC

OK … trying to reproduce the issue today, and things just work as suddenly as my issues started.

So either some really weird timing issue, or maybe the dock is just rather bugged. I can try grabbing a debug log if/when I see it again, for now, I don't.

Comment 34 Vedran Furač 2018-11-13 23:35:41 UTC

Hello,

I think I have the same issue on Intel's NUC7i5BNB with dual Dell monitors using daisy chain (and a 3rd over HDMI). I've tested many kernels and have come to the conclusion that the problem was first introduced in 4.11 and that various manifestations of it appear in all subsequent kernels up to 4.19-rc7 from Debian. Kernels 4.9 and 4.10 work fine.

To reproduce it, all I need to do is switch the input using the primary Dell's OSD from DP to mDP (nothing on it) and then switch back to the DP input. This results in a flood of:

[1615629.735590] [drm:intel_ddi_prepare_link_retrain [i915]] *ERROR* Timeout waiting for DDI BUF C idle bit
[1615629.811616] [drm:intel_dp_start_link_train [i915]] *ERROR* Timed out waiting for DP idle patterns

messages, and the X server is frozen. SSH works normally, but if I try to restart X, the system usually hard locks and I need to cut the power (not even magic SysRq works). I was unable to find any workarounds. With 4.18 there's even a kernel oops (after returning to the DP input):

[  321.495462] [drm:intel_dp_check_mst_status [i915]] got esi 42 10 00
[  321.497662] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  321.497665] PGD 0 P4D 0 
[  321.497667] Oops: 0000 [#1] SMP NOPTI
[  321.497669] CPU: 1 PID: 188 Comm: kworker/u8:3 Tainted: G     U     O      4.18.0-2-amd64 #1 Debian 4.18.10-2
[  321.497670] Hardware name:  /NUC7i5BNB, BIOS BNKBL357.86A.0049.2017.0724.1541 07/24/2017
[  321.497689] Workqueue: i915-dp i915_digport_work_func [i915]
[  321.497692] RIP: 0010:refcount_inc_not_zero+0x0/0x50
[  321.497693] Code: c0 74 02 f3 c3 80 3d 64 bb d3 00 00 75 f5 48 c7 c7 70 9a 07 84 c6 05 54 bb d3 00 01 e8 f9 f3 cb ff 0f 0b c3 66 0f 1f 44 00 00 <8b> 07 85 c0 8d 50 01 74 35 85 d2 74 0b f0 0f b1 17 75 ef 83 fa ff 
[  321.497712] RSP: 0018:ffffa1f8c37c7d18 EFLAGS: 00010246
[  321.497713] RAX: 0000000000000000 RBX: ffff935493b129c8 RCX: 0000000000000000
[  321.497714] RDX: ffff93549270bd00 RSI: 0000000000000001 RDI: 0000000000000000
[  321.497715] RBP: 0000000000000000 R08: 00000000fffffffa R09: 0000000000000002
[  321.497716] R10: ffffa1f8c37c7cf0 R11: 0000000000000102 R12: ffff935493b12870
[  321.497717] R13: 0000000000000001 R14: ffffa1f8c37c7dc2 R15: ffff935493b12700
[  321.497718] FS:  0000000000000000(0000) GS:ffff9354be880000(0000) knlGS:0000000000000000
[  321.497720] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  321.497721] CR2: 0000000000000000 CR3: 00000001dda0a001 CR4: 00000000003606e0
[  321.497722] Call Trace:
[  321.497725]  refcount_inc+0x5/0x30
[  321.497730]  drm_dp_get_mst_branch_device+0xc2/0xe0 [drm_kms_helper]
[  321.497735]  drm_dp_mst_hpd_irq+0x104/0x8c0 [drm_kms_helper]
[  321.497750]  ? intel_dp_check_mst_status+0xba/0x1e0 [i915]
[  321.497763]  intel_dp_check_mst_status+0xba/0x1e0 [i915]
[  321.497776]  intel_dp_hpd_pulse+0x176/0x2e0 [i915]
[  321.497778]  ? __switch_to_asm+0x40/0x70
[  321.497791]  i915_digport_work_func+0x8f/0x120 [i915]
[  321.497794]  process_one_work+0x195/0x370
[  321.497795]  worker_thread+0x30/0x390
[  321.497797]  ? process_one_work+0x370/0x370
[  321.497799]  kthread+0x113/0x130
[  321.497800]  ? kthread_create_worker_on_cpu+0x70/0x70
[  321.497802]  ret_from_fork+0x35/0x40


I'll attach full debug output.

Regards,
Vedran
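When comparing an oops like the one above with backtraces in other reports, it is easier to work from a bare list of frames. The sketch below extracts function and module names from "Call Trace" lines in the format shown above. By default it skips frames prefixed with "?", which the kernel uses to mark addresses it found on the stack but could not confirm as part of the call chain. The regular expression is an assumption fitted to this trace format.

```python
import re

# Sample copied from the start of the call trace above.
SAMPLE = """\
[  321.497722] Call Trace:
[  321.497725]  refcount_inc+0x5/0x30
[  321.497730]  drm_dp_get_mst_branch_device+0xc2/0xe0 [drm_kms_helper]
[  321.497735]  drm_dp_mst_hpd_irq+0x104/0x8c0 [drm_kms_helper]
[  321.497750]  ? intel_dp_check_mst_status+0xba/0x1e0 [i915]
[  321.497763]  intel_dp_check_mst_status+0xba/0x1e0 [i915]
"""

FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\][ \t]+"                     # dmesg timestamp
    r"(?P<guess>\?[ \t]+)?"                       # '?' marks an unconfirmed frame
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"   # symbol+offset/size
    r"(?:[ \t]+\[(?P<mod>\w+)\])?[ \t]*$",        # optional [module]
    re.MULTILINE,
)


def frames(text, include_guesses=False):
    """Return (symbol, module or None) per stack frame, innermost first."""
    out = []
    for m in FRAME_RE.finditer(text):
        if m["guess"] and not include_guesses:
            continue
        out.append((m["sym"], m["mod"]))
    return out


for sym, mod in frames(SAMPLE):
    print(f"{sym} [{mod or 'core kernel'}]")
```

The innermost confirmed frames here are refcount_inc called from drm_dp_get_mst_branch_device, which is the part worth matching against other bug reports.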

Comment 35 Vedran Furač 2018-11-13 23:37:09 UTC

Created attachment 142457
dmesg drm debug

Comment 36 Vedran Furač 2018-11-14 20:00:22 UTC

Hello,

An update, seems that even with 4.10 it doesn't work perfectly on NUC. After switching input back to DP input for the 3rd or 4th time, nothing happened, display showed "no signal" error. However, X remained fully functional and after disconnecting the cable and reconnecting it, both displays were detected and worked fine afterwards.


I have a similar problem on a Thinkpad X1 Carbon 5th gen (4.18 kernel). Initially it works fine, but upon disconnecting the displays (TB/DP) there's this error:

[30373.030578] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[30373.298845] [drm:intel_encoders_pre_enable.isra.103 [i915]] *ERROR* failed to allocate vcpi

and after connecting them again:

[33022.853059] [drm:intel_dp_start_link_train [i915]] *ERROR* Timed out waiting for DP idle patterns
[33022.861393] [drm:intel_ddi_prepare_link_retrain [i915]] *ERROR* Timeout waiting for DDI BUF C idle bit
...
[33057.741347] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to get link status

and obviously no signal on displays. Luckily there is no locking on Thinkpad like on NUC and X still works fine, but I have to reboot to get external displays working again.

I'll try to get debug output from Thinkpad tomorrow.

Regards,
Vedran

Comment 37 Vedran Furač 2018-11-18 15:57:46 UTC

Hello,

To confirm what others have written before: on the Thinkpad, (un)plugging the cable while it is suspended to RAM mitigates the problem.

Regards,
Vedran

Comment 38 Stanislav Lisovskiy 2018-11-19 08:12:02 UTC

(In reply to Vedran Furač from comment #34)
> Hello,
> 
> I think I have the same issue on Intel's NUC7i5BNB with dual Dell monitors
> using daisy chain (and 3rd over HDMI). I've tested many kernels and have
> come to conclusion the problem was first introduced in 4.11 and various
> manifestation of it appear in all subsequent kernels up to 4.19-rc7 from
> Debian. Kernels 4.9 and 4.10 work fine.
> 
> To reproduce it, all I need to do is switch input using primary Dell OSD
> from DP to mDP (nothing on it) and then switch back to DP input. This
> results in flood of:
> 
> [1615629.735590] [drm:intel_ddi_prepare_link_retrain [i915]] *ERROR* Timeout
> waiting for DDI BUF C idle bit
> [1615629.811616] [drm:intel_dp_start_link_train [i915]] *ERROR* Timed out
> waiting for DP idle patterns
> 
> messages and the X server is frozen. SSH works normally, but if I try to
> restart X, system usually hard locks and I need to cut the power (not even
> sysrq magic works). I was unable to find any workarounds. With 4.18 there's
> even kernel oops (after returning to DP input):
> 
> [  321.495462] [drm:intel_dp_check_mst_status [i915]] got esi 42 10 00
> [  321.497662] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000000
> [  321.497665] PGD 0 P4D 0 
> [  321.497667] Oops: 0000 [#1] SMP NOPTI
> [  321.497669] CPU: 1 PID: 188 Comm: kworker/u8:3 Tainted: G     U     O    
> 4.18.0-2-amd64 #1 Debian 4.18.10-2
> [  321.497670] Hardware name:  /NUC7i5BNB, BIOS
> BNKBL357.86A.0049.2017.0724.1541 07/24/2017
> [  321.497689] Workqueue: i915-dp i915_digport_work_func [i915]
> [  321.497692] RIP: 0010:refcount_inc_not_zero+0x0/0x50
> [  321.497693] Code: c0 74 02 f3 c3 80 3d 64 bb d3 00 00 75 f5 48 c7 c7 70
> 9a 07 84 c6 05 54 bb d3 00 01 e8 f9 f3 cb ff 0f 0b c3 66 0f 1f 44 00 00 <8b>
> 07 85 c0 8d 50 01 74 35 85 d2 74 0b f0 0f b1 17 75 ef 83 fa ff 
> [  321.497712] RSP: 0018:ffffa1f8c37c7d18 EFLAGS: 00010246
> [  321.497713] RAX: 0000000000000000 RBX: ffff935493b129c8 RCX:
> 0000000000000000
> [  321.497714] RDX: ffff93549270bd00 RSI: 0000000000000001 RDI:
> 0000000000000000
> [  321.497715] RBP: 0000000000000000 R08: 00000000fffffffa R09:
> 0000000000000002
> [  321.497716] R10: ffffa1f8c37c7cf0 R11: 0000000000000102 R12:
> ffff935493b12870
> [  321.497717] R13: 0000000000000001 R14: ffffa1f8c37c7dc2 R15:
> ffff935493b12700
> [  321.497718] FS:  0000000000000000(0000) GS:ffff9354be880000(0000)
> knlGS:0000000000000000
> [  321.497720] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  321.497721] CR2: 0000000000000000 CR3: 00000001dda0a001 CR4:
> 00000000003606e0
> [  321.497722] Call Trace:
> [  321.497725]  refcount_inc+0x5/0x30
> [  321.497730]  drm_dp_get_mst_branch_device+0xc2/0xe0 [drm_kms_helper]
> [  321.497735]  drm_dp_mst_hpd_irq+0x104/0x8c0 [drm_kms_helper]
> [  321.497750]  ? intel_dp_check_mst_status+0xba/0x1e0 [i915]
> [  321.497763]  intel_dp_check_mst_status+0xba/0x1e0 [i915]
> [  321.497776]  intel_dp_hpd_pulse+0x176/0x2e0 [i915]
> [  321.497778]  ? __switch_to_asm+0x40/0x70
> [  321.497791]  i915_digport_work_func+0x8f/0x120 [i915]
> [  321.497794]  process_one_work+0x195/0x370
> [  321.497795]  worker_thread+0x30/0x390
> [  321.497797]  ? process_one_work+0x370/0x370
> [  321.497799]  kthread+0x113/0x130
> [  321.497800]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  321.497802]  ret_from_fork+0x35/0x40
> 
> 
> I'll attach full debug output.
> 
> Regards,
> Vedran

The backtrace attached here indicates that this is possibly a duplicate of
https://bugs.freedesktop.org/show_bug.cgi?id=108616.

Can you please try the fix, which is proposed there?

Comment 39 Vedran Furač 2018-11-20 21:53:52 UTC

Hi Stanislav,

Presumably you are referring to commit 23d800? Unfortunately, I won't be able to test this until it gets packaged for Debian (latest 4.18/19.x?). 

Today I encountered the same freeze on the NUC (but not the backtrace, so that was most likely a separate issue) with the 4.10 kernel. This is the first time it happened with 4.10, after at least 5-6 rounds of plugging and unplugging. I couldn't kill the X server and had to reboot. I'll attach the dmesg log even though I'm not 100% sure it captured the event.

Regards,
Vedran

Comment 40 Vedran Furač 2018-11-20 22:04:35 UTC

Created attachment 142529
dmesg drm 4.10

My bad, there actually is a backtrace there at 11:41:11. I missed it initially due to the ton of output, and it looks similar to the one before, so this might not be as useful.

Comment 41 Brian Ward 2018-11-29 02:06:22 UTC

Hi, 

I have a Lenovo T460s that has exhibited similar behaviors.  I think there are also many others with similar problems who have mistakenly looked at this bugzilla under xorg: https://bugzilla.redhat.com/show_bug.cgi?id=1470960

I am not on the latest (currently 4.19.3-200.fc28.x86_64) but I'm happy to build to suit.  I'll review the existing comments and attempt to build the drm-tip shortly to see if I can reproduce on that.  

In my case, I notice this whether or not I boot while docked, always with two monitors. On the first docking both monitors work fine. After the first undock followed by the second docking, I get errors like the ones others have noted.

I've tested docking with one monitor, either DVI or HDMI, and each one alone works fine with multiple docks.

One interesting note of difference I can add is that I've noticed my problem only exhibits when I'm docked at the office with one monitor that is DVI and one monitor that is HDMI.  When I have two displayports docked at home and undock, I never have this problem.

Another interesting note I've found is that when I set drm.debug=14 and log_buf_len=16M on the drm module in the boot options, I found that the behavior did not always reproduce.  

If anything stands out as distinctly different from the original post let me know and I can file another bug if I reproduce on drm-tip.  Sergey's config looks like they are both displayports.

T460S
thinkpad ultra dock P/N SD20A06046 Type 40A2 S/N M3-A0AC8E 16/11

Comment 42 Tuomas Kuosmanen 2018-11-29 12:38:56 UTC

I have a Thinkpad T460s. 

One screen with a Dell 27" screen with 2560x1440 resolution via displayport on the dock works fine. 

I added a second monitor (a NEC 1080p screen) also via displayport and I get this issue with two external screens.

So both are on displayport and I get this. Either monitor as the only external monitor works fine.

Comment 43 Brian Ward 2019-01-05 01:14:19 UTC

Hello all,

I tested from latest Fedora kernel and still reproduced my problem.
4.19.12-301.fc29.x86_64

I built from latest drm-tip 4.20 as of 1/2 and could not reproduce the problem. c6a0276a5007c01c64a8a80552b78c115e8a0dae

It would be great if anyone knows how to identify the code that I could use to patch a 4.19 and/or 4.18 kernel to make this work correctly.  Any hints would be appreciated.

Comment 44 Stanislav Lisovskiy 2019-01-07 07:29:14 UTC

(In reply to Brian Ward from comment #43)
> Hello all,
> 
> I tested from latest Fedora kernel and still reproduced my problem.
> 4.19.12-301.fc29.x86_64
> 
> I built from latest drm-tip 4.20 as of 1/2 and could not reproduce the
> problem. c6a0276a5007c01c64a8a80552b78c115e8a0dae
> 
> It would be great if anyone knows how to identify the code that I could use
> to patch a 4.19 and/or 4.18 kernel to make this work correctly.  Any hints
> would be appreciated.

Hi,

I guess you need to apply this patch:

https://patchwork.freedesktop.org/patch/261135/

Comment 45 Brian Ward 2019-01-09 14:30:25 UTC

(In reply to Stanislav Lisovskiy from comment #44)
> Hi,
> 
> I guess you need to apply this patch:
> 
> https://patchwork.freedesktop.org/patch/261135/

Thanks for the suggestion Stan, unfortunately that one didn't change any behavior for me, and it appears it is already committed to the 4.19 branch.

I've tested with a custom build direct from 4.19 as well as custom fedora build from 4.19, and latest fedora release 4.19.13-300.fc29.x86_64, all of which appear to already have that patch.

Looks like there are a lot of big changes in the two files that seem the most likely culprits for the errors...

$ git diff v4.20 v4.19 -- drivers/gpu/drm/i915/intel_dp_link_training.c | grep @@ | wc -l
5
$ git diff v4.20 v4.19 -- drivers/gpu/drm/i915/intel_ddi.c | grep @@ | wc -l
16

I'll try to find some time to actually debug the error on v4.19 and find out what changes resolve the problem, but it may be a while.
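
A small Python sketch of the same hunk counting, in case someone wants to script it over more files. It is only an illustration: count_hunks and hunks_between are hypothetical helper names, and the regex is stricter than a plain `grep @@` because it only matches real unified-diff hunk headers at the start of a line.

```python
import re
import subprocess

# A unified-diff hunk header looks like "@@ -12,7 +12,9 @@ optional context".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def count_hunks(diff_text: str) -> int:
    """Count hunk headers in unified diff text."""
    return len(HUNK_HEADER.findall(diff_text))


def hunks_between(repo: str, old: str, new: str, path: str) -> int:
    """Run `git diff old new -- path` inside `repo` and count the hunks.

    Assumes git is installed and `repo` is a kernel checkout that has
    both tags available.
    """
    result = subprocess.run(
        ["git", "-C", repo, "diff", old, new, "--", path],
        capture_output=True, text=True, check=True,
    )
    return count_hunks(result.stdout)
```

For example, hunks_between("linux", "v4.19", "v4.20", "drivers/gpu/drm/i915/intel_ddi.c") should report the same kind of number as the grep pipeline. To see the individual commits rather than just a count, `git log --oneline v4.19..v4.20 -- <path>` lists them for a given file.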

Comment 46 Stanislav Lisovskiy 2019-01-09 19:41:40 UTC

(In reply to Brian Ward from comment #45)
> (In reply to Stanislav Lisovskiy from comment #44)
> > Hi,
> > 
> > I guess you need to apply this patch:
> > 
> > https://patchwork.freedesktop.org/patch/261135/
> 
> Thanks for the suggestion Stan, unfortunately that one didn't change any
> behavior for me, and it appears it is already committed to the 4.19 branch.
> 
> I've tested with a custom build direct from 4.19 as well as custom fedora
> build from 4.19, and latest fedora release 4.19.13-300.fc29.x86_64, all of
> which appear to already have that patch.
> 
> Looks like there are a lot of big changes in the two files that seem most
> likely culprits to the errors...
> 
> $ git diff v4.20 v4.19 -- drivers/gpu/drm/i915/intel_dp_link_training.c |
> grep @@ | wc -l
> 5
> $ git diff v4.20 v4.19 -- drivers/gpu/drm/i915/intel_ddi.c | grep @@ | wc -l
> 16
> 
> I'll try to find some time to actually debug the error on v4.19 and find out
> what changes resolve the problem, but it may be a while.

Do you get a similar call trace?

> [  321.497722] Call Trace:
> [  321.497725]  refcount_inc+0x5/0x30
> [  321.497730]  drm_dp_get_mst_branch_device+0xc2/0xe0 [drm_kms_helper]
> [  321.497735]  drm_dp_mst_hpd_irq+0x104/0x8c0 [drm_kms_helper]
> [  321.497750]  ? intel_dp_check_mst_status+0xba/0x1e0 [i915]
> [  321.497763]  intel_dp_check_mst_status+0xba/0x1e0 [i915]
> [  321.497776]  intel_dp_hpd_pulse+0x176/0x2e0 [i915]
> [  321.497778]  ? __switch_to_asm+0x40/0x70
> [  321.497791]  i915_digport_work_func+0x8f/0x120 [i915]
> [  321.497794]  process_one_work+0x195/0x370
> [  321.497795]  worker_thread+0x30/0x390
> [  321.497797]  ? process_one_work+0x370/0x370
> [  321.497799]  kthread+0x113/0x130
> [  321.497800]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  321.497802]  ret_from_fork+0x35/0x40

The patch https://patchwork.freedesktop.org/patch/261135/ reportedly fixes the issue
in https://bugs.freedesktop.org/show_bug.cgi?id=108616, which has exactly the same backtrace. Otherwise I have a feeling that we are talking about different issues in the same ticket here.
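
To check whether two logs really contain the same backtrace, one option is to compare only the function names and ignore timestamps and offsets. A minimal Python sketch, assuming the dmesg frame format shown in the trace above; trace_symbols and same_trace are hypothetical names, and frames marked with "?" (entries the unwinder could not verify) are skipped by default.

```python
import re
from typing import List

# Matches frames such as
#   "[  321.497725]  refcount_inc+0x5/0x30"
#   "> [  321.497750]  ? intel_dp_check_mst_status+0xba/0x1e0 [i915]"
FRAME = re.compile(
    r"^\s*(?:>\s*)?\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_symbols(dmesg_text: str, include_unreliable: bool = False) -> List[str]:
    """Extract function names from call-trace lines, in order."""
    symbols = []
    for line in dmesg_text.splitlines():
        match = FRAME.match(line)
        if not match:
            continue  # not a stack frame (e.g. the "Call Trace:" header)
        if match.group(1) and not include_unreliable:
            continue  # "?" frame, skip unless explicitly requested
        symbols.append(match.group(2))
    return symbols


def same_trace(log_a: str, log_b: str) -> bool:
    """True if both logs contain the same sequence of reliable frames."""
    return trace_symbols(log_a) == trace_symbols(log_b)
```

This only compares the sequence of symbols, so two traces from different kernel builds can still match even when the offsets differ.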

Comment 47 Vedran Furač 2019-01-11 01:48:49 UTC

Created attachment 143065 [details]
dmesg drm 4.19.13

Hello,

For me, on an Intel NUC7i5BNB, the problem still occurs on 4.19.13 when switching inputs on the monitor (dual monitor setup). Dmesg output attached.

Regards,
Vedran

Comment 48 Shaun 2019-01-22 16:57:38 UTC

I see the same problem (screen freezes apart from mouse pointer) with a Dell Precision 5530 and the Dell TB-16 dock on Ubuntu 18.10.  That issue seems to be resolved by running upstream kernel v4.20.  I had previously tried v4.19 but that still showed the issue.

Even with v4.20 I still see 

    [drm:intel_mst_pre_enable_dp [i915]] *ERROR* failed to allocate vcpi

when plugging in the dock but the external screen works (on the displayport anyway, I'm not sure about the HDMI port).

Comment 49 Chris Hobbs 2019-01-24 00:00:41 UTC

I can confirm this is fixed in 4.20, with the same "[drm:intel_mst_pre_enable_dp [i915]] *ERROR* failed to allocate vcpi" message on unplug, but no crash. Should this be resolved?

Comment 50 Brian Ward 2019-01-28 14:10:03 UTC

I can confirm both my custom, early 4.20 build, and the latest fedora 4.20 build 4.20.3-200.fc29.x86_64 resolve my problems, with the same result as Chris Hobbs:

I still see 
[   75.398398] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[   75.666835] [drm:intel_mst_pre_enable_dp [i915]] *ERROR* failed to allocate vcpi

But the system does not hang anymore.  I'm happy with that.  

I have been unable to get a backtrace like the one Stanislav posted in comment #46 because my issue does not generate any backtraces in the logs and I have yet to try to debug it manually.  My issue is likely different from any issue resolved by https://patchwork.freedesktop.org/patch/261135/, given that the results are the same after confirming the patch is in the kernel.
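
For comparing logs between kernels, a quick way to pull out only the DRM error lines is sketched below in Python. It assumes the "[drm:function [module]] *ERROR* message" format seen in the two lines above; drm_errors is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Matches e.g.
# "[   75.666835] [drm:intel_mst_pre_enable_dp [i915]] *ERROR* failed to allocate vcpi"
ERROR_LINE = re.compile(r"\[drm:(\w+)(?: \[(\w+)\])?\] \*ERROR\* (.*)$")


def drm_errors(dmesg_text: str) -> List[Tuple[str, str]]:
    """Return (function, message) pairs for every DRM *ERROR* line."""
    found = []
    for line in dmesg_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            found.append((match.group(1), match.group(3).strip()))
    return found
```

Running it over the dmesg output from a 4.19 boot and a 4.20 boot would show whether the same two errors appear in both, even though only the older kernel hangs.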

Comment 51 Lakshmi 2019-03-18 10:57:15 UTC

Stan, if the issues mentioned above in Comment 48/Comment 49/Comment 50 are not related to the original problem, is a new bug report needed?

Comment 52 Stanislav Lisovskiy 2019-03-18 12:26:23 UTC

(In reply to Lakshmi from comment #51)
> Stan, If the above mentioned issues Comment 48/Comment 49/Comment 50 are not
> related to the original problem, a new bug report is needed?

As I understand it, those were solved by running the 4.20 kernel as described in the corresponding comments, so I guess no additional bugs are needed. There were multiple fixes concerning DP MST (one crash fixed by my patch, some dp-mst connector issues), so, considering we already have the 5.0.0 kernel, we should resolve this one.

Comment 53 Jani Saarinen 2019-05-14 11:12:34 UTC

According to Stan we should close. Please re-open if you still see issues on this.
