Bug 92005

Summary: Linux 4.2 DisplayPort MST deadlock?
Product: DRI
Reporter: Adam J. Richter <adam_richter2004>
Component: General
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: nkim, sandeep
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
118304: don't send hotplug for input ports. (flags: none)
118305: cleanup ports better, and hotplug in a different place. (flags: none)
118306: dmesg log of some apparent delays, but not the deadlock, from kernel booted with "drm.debug=6 loglevel=7" (flags: none)

Description Adam J. Richter 2015-09-15 02:20:39 UTC
In Linux 4.2, there appears to be mutex contention, and possibly an occasional deadlock, between two kernel functions, drm_mode_getconnector and drm_fb_helper_hotplug_event, over the mutex dev->mode_config.mutex.
Neither of these functions is specific to Multi-Stream Transport or even DisplayPort generally, but I think that the DP-MST code might be unique in causing the contention and deadlock, either because the unusual tree structure of DP-MST was not anticipated or because of a bug in the DP-MST code.  In particular, I have at least once observed the following call trace, where I think drm_mode_getconnector took the mutex and then, through a long hierarchy of calls, eventually ended up calling drm_fb_helper_hotplug_event, which tried to take it again.
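
As a user-space analogy of the suspected pattern (not the kernel code itself), the sketch below uses Python's non-reentrant threading.Lock in place of dev->mode_config.mutex. The two functions are hypothetical stand-ins named after the outermost and innermost DRM frames in the trace; their bodies are assumptions for illustration. A non-blocking acquire is used so that the example reports the re-acquisition instead of hanging.

```python
import threading

# Stand-in for dev->mode_config.mutex. threading.Lock is not reentrant,
# so a second blocking acquire by the same thread would never return.
mode_config_mutex = threading.Lock()


def hotplug_event():
    """Hypothetical stand-in for drm_fb_helper_hotplug_event."""
    # Non-blocking attempt, so the sketch reports the problem rather than hanging.
    if not mode_config_mutex.acquire(blocking=False):
        return "would deadlock: mutex already held"
    try:
        return "hotplug handled"
    finally:
        mode_config_mutex.release()


def getconnector():
    """Hypothetical stand-in for drm_mode_getconnector."""
    with mode_config_mutex:
        # In the trace, probing the connector eventually reaches the
        # hotplug path while the mutex is still held by this thread.
        return hotplug_event()


result_nested = getconnector()   # hotplug path entered with the mutex held
result_direct = hotplug_event()  # hotplug path entered with the mutex free
```

With a blocking acquire in hotplug_event, the nested call would hang, which is the behavior suspected in the kernel case.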

I hesitate to open this ticket, because I am not sure that the "dev" variable at the top of this stack trace is the same one as at the bottom, especially considering that I did not notice the system complaining about attempting to block on a mutex where mutex->owner == current, even though CONFIG_DEBUG_MUTEXES was set.  The system that I got this trace from was blocked indefinitely as far as I could tell, which is unusual: the problem that I usually observe is "xrandr" taking on the order of a minute to complete, and often being inaccurate, but usually not hanging forever.

I suspect that what happened involved some intervening hotplug event, or perhaps kernel work functions, in a way that I do not completely understand, such that mutex->owner could somehow have been set to the "current" of a kernel worker thread instead of the X server.

Anyhow, the part that I think would likely be helpful to anyone working on this (and basically the reason I am posting now, rather than waiting) is that this stack trace might indicate some confusion in assumptions about whether dev->mode_config.mutex is held by the caller of certain functions in the middle of this stack trace.

[<ffffffffa01837c8>] drm_fb_helper_hotplug_event+0x138/0x150 [drm_kms_helper]
[<ffffffffa02de31e>] intel_fbdev_output_poll_changed+0x1e/0x30 [i915]
[<ffffffffa017755b>] drm_kms_helper_hotplug_event+0x2b/0x40 [drm_kms_helper]
[<ffffffffa02f1c15>] intel_dp_mst_hotplug+0x15/0x20 [i915]
[<ffffffffa017aef4>] drm_dp_destroy_port+0xd4/0xe0 [drm_kms_helper]
[<ffffffffa017af15>] drm_dp_put_port+0x15/0x20 [drm_kms_helper]
[<ffffffffa017b04e>] drm_dp_destroy_mst_branch_device+0x4e/0x100 [drm_kms_helper]
[<ffffffffa017b115>] drm_dp_put_mst_branch_device+0x15/0x20 [drm_kms_helper]
[<ffffffffa017b6fd>] drm_dp_mst_i2c_xfer+0x9d/0x270 [drm_kms_helper]
[<ffffffff814cca91>] __i2c_transfer+0x121/0x430
[<ffffffff814cce19>] i2c_transfer+0x79/0xb0
[<ffffffffa00bb4a9>] drm_do_probe_ddc_edid+0xc9/0x130 [drm]
[<ffffffffa00bb0fa>] drm_do_get_edid+0x17a/0x250 [drm]
[<ffffffffa00bca55>] drm_get_edid+0x45/0x3d0 [drm]
[<ffffffffa017b9ee>] drm_dp_mst_get_edid+0x7e/0xa0 [drm_kms_helper]
[<ffffffffa02f1c99>] intel_dp_mst_get_modes+0x29/0x50 [i915]
[<ffffffffa0177908>] drm_helper_probe_single_connector_modes_merge_bits+0x108/0x4e0 [drm_kms_helper]
[<ffffffffa0177cf3>] drm_helper_probe_single_connector_modes+0x13/0x20 [drm_kms_helper]
[<ffffffffa00b6ab9>] drm_mode_getconnector+0x389/0x410 [drm]
[<ffffffffa00a8685>] drm_ioctl+0x1a5/0x670 [drm]
[<ffffffffa00c4e53>] drm_compat_ioctl+0x33/0x40 [drm]
[<ffffffffa026bde2>] i915_compat_ioctl+0x32/0x40 [i915]
[<ffffffff812475f9>] compat_SyS_ioctl+0xc9/0x15d0
[<ffffffff8161ed22>] sysenter_dispatch+0xf/0x29
[<ffffffffffffffff>] 0xffffffffffffffff
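
To make the claim about the trace easier to check, here is a minimal Python sketch that parses frames in this format and tests whether one function was reached from another. The helper names (parse_frames, reached_from), the "(built-in)" label for frames without a module, and the abridged sample are assumptions for illustration; the frame lines themselves are copied from the trace above.

```python
import re

# Matches lines such as:
#   [<ffffffffa01837c8>] drm_fb_helper_hotplug_event+0x138/0x150 [drm_kms_helper]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(trace):
    """Return (symbol, module) pairs, innermost frame first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            # Frames without a [module] suffix are labeled "(built-in)" here.
            frames.append((match.group("symbol"), match.group("module") or "(built-in)"))
    return frames


def reached_from(frames, inner, outer):
    """True if `inner` appears closer to the top of the stack than `outer`."""
    symbols = [symbol for symbol, _ in frames]
    return (inner in symbols and outer in symbols
            and symbols.index(inner) < symbols.index(outer))


# Abridged copy of the trace above (most intermediate frames omitted).
SAMPLE = """\
[<ffffffffa01837c8>] drm_fb_helper_hotplug_event+0x138/0x150 [drm_kms_helper]
[<ffffffff814cca91>] __i2c_transfer+0x121/0x430
[<ffffffffa00b6ab9>] drm_mode_getconnector+0x389/0x410 [drm]
[<ffffffffffffffff>] 0xffffffffffffffff
"""

frames = parse_frames(SAMPLE)
nested = reached_from(frames, "drm_fb_helper_hotplug_event", "drm_mode_getconnector")
```

The terminating 0xffffffffffffffff line has no symbol+offset part, so it is skipped rather than reported as a frame.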

I expect that I will update or close this ticket as (or if) I learn more.

I hope this information is helpful.  Comments, and, of course, fixes, are most welcome.
Comment 1 Dave Airlie 2015-09-16 00:39:18 UTC
Created attachment 118304
don't send hotplug for input ports.

This might help; it should at least reduce the number of hotplug events sent.
Comment 2 Dave Airlie 2015-09-16 01:23:16 UTC
Created attachment 118305
cleanup ports better, and hotplug in a different place.

Actually, this is a more comprehensive version.
Comment 3 Adam J. Richter 2015-09-16 01:44:21 UTC
Created attachment 118306
dmesg log of some apparent delays, but not the deadlock, from kernel booted with "drm.debug=6 loglevel=7"

I have been asked to try to attach a dmesg log of this problem with the kernel booted with a "drm.debug=6" kernel command line argument.  The exact problem that produced this interesting stack trace is difficult to reproduce, but long waits when doing "xrandr --current" and other misbehavior happen frequently, and I suspect they may be related to this mutex contention.  In this case, "xrandr" takes a long time to run; examination of /proc/<x-server-pid>/stack shows that the X server usually seems to be requesting new EDID data via DisplayPort MST; and, in this particular instance, the end result was that one of the two screens connected to the DisplayPort MST hub is not showing any video (and "xrandr" does not seem to see its EDID information either).
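
For anyone who wants to repeat the /proc/<x-server-pid>/stack check, a rough Python sketch follows. It samples the file repeatedly and counts which function is at the top of the stack each time. All function names here are hypothetical helpers, reading the file usually requires root, and the file is only present on kernels built with stack trace support.

```python
import collections
import time


def innermost_symbol(stack_text):
    """Return the symbol of the first frame, or None if no frame is found."""
    for line in stack_text.splitlines():
        parts = line.split()
        # Frame lines look like "[<addr>] symbol+0xoff/0xsize [module]".
        if len(parts) >= 2 and "+0x" in parts[1]:
            return parts[1].split("+0x", 1)[0]
    return None


def read_proc_stack(pid):
    """Read the kernel stack of a process (usually needs root)."""
    with open(f"/proc/{pid}/stack") as stack_file:
        return stack_file.read()


def sample_innermost(pid, samples=20, interval=0.5, read_stack=read_proc_stack):
    """Count how often each function is the innermost frame across samples."""
    counts = collections.Counter()
    for _ in range(samples):
        counts[innermost_symbol(read_stack(pid))] += 1
        time.sleep(interval)
    return counts
```

For example, sample_innermost(x_server_pid).most_common(3) would list the three functions the X server was most often found in, where x_server_pid is whatever PID the X server has on the system being examined.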

I apologize if this attachment is essentially spam for a bug that might be completely separate from the issue for which I opened this ticket.  I am posting it just on the chance that it might be relevant (and also because it is as close as I think I'll get today to fulfilling the request for a drm.debug=6 dmesg log of the original problem).
Comment 4 Adam J. Richter 2015-09-27 03:19:15 UTC
Hi, Dave.

I have tried the patch you attached to this bug report on 2015-09-16 on a couple of different systems with Linux 4.3-rc2, and can confirm that I have so far not seen any DP-MST related kernel null pointer dereferences.  I do see such oopses on Linux 4.3-rc2 without your patch.  So, I would encourage you to send it or something similar upstream if you have not done so already.

By the way, just to warn you that not everything is perfect, I still see apparently invalid video from one of two DP-MST hubs that I have been using with 4.3-rc2 + your patch, but that is outside the scope of this freedesktop.org bug report.  So, I expect to test a little more and close this bug as "resolved" once your patch or the equivalent appears in a mainline Linux release candidate.

Thank you very much for your help!
Comment 5 Adam J. Richter 2015-09-30 02:32:05 UTC
The patch that Dave Airlie posted to this bug on 2015-09-16 is in Linux 4.3-rc3.  I have tried Linux 4.3-rc3, and do not see any kernel null pointer dereferences.  So, I am changing the status of this ticket to "RESOLVED FIXED".

Thank you very much for all of your help, Dave.
