Bug 107738 - System hangs when hot-plugging Thunderbolt 3 dock with dual output (DP MST) connected
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high normal
Assignee: Stanislav Lisovskiy
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-29 16:04 UTC by Léo Grange
Modified: 2018-12-28 09:26 UTC

See Also:
i915 platform: SKL
i915 features: display/DP MST


Attachments
Kernel log from journalctl output (119.45 KB, text/x-log)
2018-08-29 16:04 UTC, Léo Grange
Xorg log (87.76 KB, text/x-log)
2018-08-29 16:06 UTC, Léo Grange
Kernel log using drm-tip kernel (117.47 KB, text/x-log)
2018-08-29 22:45 UTC, Léo Grange
Kernel log, drm-tip with DRM debug info (357.03 KB, text/x-log)
2018-08-30 08:16 UTC, Léo Grange
Kernel log, stock 4.18.5 kernel, without hotplug (no issue) (1.38 MB, text/x-log)
2018-08-30 08:27 UTC, Léo Grange
Possible NULL dereference fix (1009 bytes, patch)
2018-10-03 12:05 UTC, Stanislav Lisovskiy
Assign intel_dp->is_mst only after mgr structure is properly initialized (1.31 KB, patch)
2018-11-06 08:43 UTC, Stanislav Lisovskiy

Description Léo Grange 2018-08-29 16:04:25 UTC
Created attachment 141349 [details]
Kernel log from journalctl output

-- chipset: Intel HD Graphics 520 (GT2)
-- system architecture: x86_64
-- xorg-server: 1.20.1 (using generic modesetting driver)
-- libdrm: 2.4.93
-- kernel version: 4.18.5-arch1-1-ARCH
-- Linux distribution: Archlinux
-- Machine: Lenovo T470 20JN (Intel Core i5-6300U)
-- Display connector: HDMI over DP MST adapter plugged on Thunderbolt 3 port
-- Adapter reference: Cable Matters USB-C Multiport Travel Dock with Dual HDMI and PD

When plugging in the TB3 dock with two monitors already attached to it, the system hangs (I am unable to switch to a TTY or to log in blind to reboot cleanly).
If the dock is plugged in before booting, everything works fine, including the two monitors.
Additionally, if only one monitor is attached to the dock, hot-plugging appears to work too.

I am not sure of the root cause of the issue, as the kernel log shows several oopses, one of them in the Xorg process context (see below and the attachment), but I think the problem is either in the generic DRM/KMS code or in the Intel DRM code.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
CPU: 2 PID: 1216 Comm: Xorg Tainted: G           O      4.18.5-arch1-1-ARCH #1
[...]
Call Trace:
drm_dp_mst_wait_tx_reply+0x13e/0x1e0 [drm_kms_helper]
[...]
? drm_mode_connector_property_set_ioctl+0x60/0x60 [drm]
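
A small, nonzero fault address such as 0x320 is often what a read of a structure member through a NULL pointer looks like: the faulting address is then just the member's offset inside the structure. The sketch below illustrates that reading only. `struct fake_branch`, its layout, and `fault_address` are hypothetical names made up for this example and do not reflect the real DRM structures or which member was actually involved here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT a real DRM structure. */
struct fake_branch {
    unsigned char padding[0x320]; /* stands in for the preceding members */
    int tx_state;                 /* hypothetical member at offset 0x320 */
};

/* Compile-time check that the example layout matches the address in the oops. */
static_assert(offsetof(struct fake_branch, tx_state) == 0x320,
              "example member must sit at offset 0x320");

/*
 * Address that a read of base->tx_state would touch.
 * With base == 0 (a NULL pointer) this is simply the member offset.
 */
static uintptr_t fault_address(uintptr_t base)
{
    return base + offsetof(struct fake_branch, tx_state);
}
```

Calling `fault_address(0)` yields 0x320, which is why such an address hints at a NULL structure pointer rather than a wild pointer.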


I attached the kernel log from journalctl; the dock was re-connected at 16:32:30 (line 1029 in the log).
I don't have much time these days, but I will do my best to reproduce the bug with the drm-tip branch and to update the issue as soon as I have new information.
Please let me know if some other information is required.
Comment 1 Léo Grange 2018-08-29 16:06:17 UTC
Created attachment 141350 [details]
Xorg log
Comment 2 Léo Grange 2018-08-29 22:45:06 UTC
I was able to reproduce the bug with the drm-tip branch (commit 3c17d3c5703ec98d04ef2b5b735f297081ec0531).
I will attach the kernel log, but it seems to be the same error as with my Arch Linux stock kernel.

However, with drm-tip, I was unable to boot with the dock attached (several drm-related errors during the systemd boot, and it hangs just before the auto-login).
Should I open another bug report for that, or is this kind of regression expected on the drm-tip branch?
Comment 3 Léo Grange 2018-08-29 22:45:39 UTC
Created attachment 141366 [details]
Kernel log using drm-tip kernel
Comment 4 Lakshmi 2018-08-30 06:18:59 UTC
Leo, could you attach a dmesg log with kernel parameters drm.debug=0x1e log_buf_len=4M?

How often does it occur?
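
For reference, `drm.debug` is a bitmask of debug categories, so 0x1e selects several of them at once. The sketch below decodes such a value. The bit-to-name table reflects my understanding of the kernel's DRM debug categories for the low six bits only and should be checked against the running kernel (for example with `modinfo drm`); `decode_drm_debug` and `drm_debug_bits` are hypothetical helper names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct debug_bit {
    unsigned int mask;
    const char *name;
};

/* Assumed category bits (partial list); verify against your kernel version. */
static const struct debug_bit drm_debug_bits[] = {
    { 0x01, "CORE" },
    { 0x02, "DRIVER" },
    { 0x04, "KMS" },
    { 0x08, "PRIME" },
    { 0x10, "ATOMIC" },
    { 0x20, "VBL" },
};

/* Writes the space-separated names of the set bits into out and returns out. */
static const char *decode_drm_debug(unsigned int value, char *out, size_t out_len)
{
    size_t i;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';
    for (i = 0; i < sizeof(drm_debug_bits) / sizeof(drm_debug_bits[0]); i++) {
        if (!(value & drm_debug_bits[i].mask))
            continue;
        if (out[0] != '\0')
            strncat(out, " ", out_len - strlen(out) - 1);
        strncat(out, drm_debug_bits[i].name, out_len - strlen(out) - 1);
    }
    return out;
}
```

With a 64-byte buffer, decoding 0x1e gives "DRIVER KMS PRIME ATOMIC" under the table assumed above.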
Comment 5 Lakshmi 2018-08-30 06:30:20 UTC
It is recommended that the dmesg log contain the whole boot information.
Comment 6 Léo Grange 2018-08-30 08:12:34 UTC
(In reply to Lakshmi from comment #4)
> Leo, could you attach a dmesg log with kernel parameters drm.debug=0x1e
> log_buf_len=4M?
> 
> How often does it occur?

Thank you for looking at the issue Lakshmi.
Indeed, I forgot to set the debug flags...
This time the bug caused only one of the kernel oopses I had in the previous cases, and it occurred in a kworker context instead of the Xorg one (time 37.197 in the attached dmesg).

It occurs 100% of the time, as far as I tested, when I hotplug the adapter.


(In reply to Lakshmi from comment #5)
> It is recommended that the dmesg log contain the whole boot information.

Maybe I missed something; I believed my attachments contained the whole boot info. Which information is missing, and do you have any pointers on how to get it?
Comment 7 Léo Grange 2018-08-30 08:16:30 UTC
Created attachment 141371 [details]
Kernel log, drm-tip with DRM debug info

Booted without the adapter, opened an X session and then plugged the adapter (at ~36s in the logs) with the 2 monitors attached to it.
Comment 8 Léo Grange 2018-08-30 08:27:46 UTC
Created attachment 141372 [details]
Kernel log, stock 4.18.5 kernel, without hotplug (no issue)

In case it helps, I wanted to attach the dmesg of a case in which the adapter works without any issue.
Unfortunately, as some probably unrelated bugs occur with the drm-tip kernel, I had to use my distribution's stock kernel (4.18.5).
The adapter was connected _before_ powering on the laptop. In this case, everything works as expected: early KMS detects and uses the 3 monitors (laptop display and the 2 external ones), and the Xorg desktop can be extended across them without issue.
Comment 9 Lakshmi 2018-08-30 08:39:05 UTC
(In reply to Léo Grange from comment #7)
> Created attachment 141371 [details]
> Kernel log, drm-tip with DRM debug info
> 
> Booted without the adapter, opened an X session and then plugged the adapter
> (at ~36s in the logs) with the 2 monitors attached to it.

For now, this is enough. I will come back to you, if I need more info.
Comment 10 Stanislav Lisovskiy 2018-10-03 12:05:25 UTC
Created attachment 141847 [details] [review]
Possible NULL dereference fix

Reporter, can you please check whether the attached experimental patch prevents the oops?
Comment 11 Léo Grange 2018-10-05 15:09:33 UTC
(In reply to Stanislav Lisovskiy from comment #10)
> Created attachment 141847 [details] [review]
> Possible NULL dereference fix
> 
> Reporter, can you please check whether the attached experimental patch
> prevents the oops?

I just had enough time to test quickly this afternoon, using the latest drm-tip (commit 6b7a44d1597) with your patch applied.
I tried a few different configurations (coldplug, hotplug a few times...): everything appears to work as expected!

I think you can close this issue for now; if a related issue appears during further tests I will let you know.

Thanks a lot for your work and for the quality of the Intel graphics support on Linux in general!
Comment 12 Stanislav Lisovskiy 2018-10-08 07:59:46 UTC
(In reply to Léo Grange from comment #11)
> (In reply to Stanislav Lisovskiy from comment #10)
> > Created attachment 141847 [details] [review]
> > Possible NULL dereference fix
> > 
> > Reporter, can you please check whether the attached experimental patch
> > prevents the oops?
> 
> I just had enough time to test quickly this afternoon, using the latest
> drm-tip (commit 6b7a44d1597) with your patch applied.
> I tried a few different configurations (coldplug, hotplug a few times...):
> everything appears to work as expected!
> 
> I think you can close this issue for now; if a related issue appears during
> further tests I will let you know.
> 
> Thanks a lot for your work and for the quality of the Intel graphics support
> on Linux in general!

Great! Now I need to get this patch upstream :)
Comment 13 Stanislav Lisovskiy 2018-10-08 08:31:01 UTC
(In reply to Léo Grange from comment #11)

BTW: can you also try without this patch? It could be that the issue was simply fixed in recent drm-tip. Also, considering that I have only added a "not NULL" check against mgr->mst_primary, if that really helps, it means there is an internal logic problem that probably needs to be fixed somewhere else, and this check merely addresses a symptom.
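
As a rough illustration of what a "not NULL" check of this kind looks like, here is a minimal user-space sketch. It is not the attached patch: `struct fake_mgr`, `struct fake_mstb`, the `lct` member and `primary_lct` are hypothetical stand-ins, and only the idea of testing `mst_primary` before dereferencing it comes from the comment above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real DRM MST topology structures. */
struct fake_mstb {
    int lct; /* hypothetical payload read from the primary branch */
};

struct fake_mgr {
    struct fake_mstb *mst_primary; /* may be NULL, e.g. after a failed probe */
};

/*
 * Returns the primary branch's lct value, or -ENODEV when the topology
 * currently has no primary branch. Without the check, a NULL mst_primary
 * would be dereferenced and crash.
 */
static int primary_lct(const struct fake_mgr *mgr)
{
    assert(mgr != NULL);
    if (mgr->mst_primary == NULL)
        return -ENODEV;
    return mgr->mst_primary->lct;
}
```

The check turns a crash into an error return, but it does not explain why the pointer was NULL in the first place, which is the concern raised above.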
Comment 14 Lakshmi 2018-10-16 11:19:46 UTC
Leo, can you verify whether the issue can be reproduced with the latest drm-tip without the patch (comment 10)? This information is very much needed.
Comment 15 Lionel Landwerlin 2018-11-01 21:44:56 UTC
(In reply to Lakshmi from comment #14)
> Leo, can you verify whether the issue can be reproduced with the latest
> drm-tip without the patch (comment 10)? This information is very much needed.

I think I reproduced this on a fairly recent drm-tip; see https://bugs.freedesktop.org/show_bug.cgi?id=108616

I've been running with Stanislav's patch for a few hours and so far there has been no hang, even after a few rapid unplug/replug cycles.
Comment 16 Léo Grange 2018-11-02 15:32:43 UTC
Sorry for the lack of response during all this time...
I was quite busy over the last few weeks and had no access to the affected device.

I will do my best to test with and without the patch during the next week, using the latest drm-tip branch.
The description of bug #108616 does indeed seem similar to my issue, but in my case the hang occurs only when plugging the dock in, not when unplugging it.
Comment 17 Stanislav Lisovskiy 2018-11-06 08:43:31 UTC
Created attachment 142386 [details] [review]
Assign intel_dp->is_mst only after mgr structure is properly initialized

Please check my new patch, which attempts to fix the root cause of the problem rather than the symptom. Also, please remove my first patch before testing; otherwise it will hide the problem.
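
The ordering idea in the patch title (set the "is MST" flag only once the manager is properly initialized) can be sketched as follows. This is a user-space illustration under stated assumptions, not the attached patch: `struct fake_dp`, `struct fake_mgr`, `fake_mgr_init` and `fake_enable_mst` are hypothetical names, and the failure is simulated with a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real intel_dp or MST manager structures. */
struct fake_mgr {
    bool initialized;
};

struct fake_dp {
    bool is_mst;         /* other code paths trust this flag */
    struct fake_mgr mgr;
};

/* Hypothetical setup step that can fail (failure is simulated here). */
static int fake_mgr_init(struct fake_mgr *mgr, bool simulate_failure)
{
    if (simulate_failure)
        return -1;
    mgr->initialized = true;
    return 0;
}

/* Publish is_mst only after the manager setup has succeeded. */
static int fake_enable_mst(struct fake_dp *dp, bool simulate_failure)
{
    int ret;

    assert(dp != NULL);
    ret = fake_mgr_init(&dp->mgr, simulate_failure);
    if (ret != 0) {
        dp->is_mst = false; /* never advertise MST on a half-set-up manager */
        return ret;
    }
    dp->is_mst = true;
    return 0;
}
```

Setting the flag last means that any code which sees `is_mst == true` can rely on the manager state being valid, instead of each consumer needing its own NULL check.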
Comment 18 Ville Syrjala 2018-11-14 11:00:57 UTC

Presumably fixed by

commit 23d8003907d094f77cf959228e2248d6db819fa7
Author: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
Date:   Fri Nov 9 11:00:12 2018 +0200

    drm/dp_mst: Check if primary mstb is null
Comment 19 Francesco Balestrieri 2018-12-28 09:26:53 UTC
Closing with no feedback from the reporter. Leo, please reopen if the issue persists.

