89366 – DisplayPort MST (multi-stream transport) "atomic sleep" Linux kernel bug

Summary: DisplayPort MST (multi-stream transport) "atomic sleep" Linux kernel bug

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	General
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	Dave Airlie

Reported:	2015-02-28 07:19 UTC by Adam J. Richter
Modified:	2015-09-05 02:52 UTC
CC List:	2 users

Attachments
cancel work to avoid oops (1.00 KB, patch) 2015-06-19 00:17 UTC, Dave Airlie
try again, this should handle things better (1.58 KB, patch) 2015-06-19 00:53 UTC, Dave Airlie
dmesg from second patch (1.02 MB, text/plain) 2015-06-19 02:40 UTC, Jeff Mickey

Description Adam J. Richter 2015-02-28 07:19:18 UTC

[This is a slightly edited version of an email that I attempted to send to the dri-devel mailing list.]

CONFIG_DEBUG_ATOMIC_SLEEP complains about the following locking problem in linux-4.0-rc1/drivers/gpu/drm/drm_dp_mst_topology.c:

drm_dp_mst_wait_tx_reply --> wait_event_timeout --> check_txmsg_state  --> mutex_lock

I believe that any function called in the "condition" argument of the wait_event_timeout macro (in this case, check_txmsg_state) is not allowed to block while the condition is being evaluated to determine whether to unblock the process.
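The rule described above can be illustrated with a toy userspace model. This is only a sketch, not kernel code: every name below (model_mutex_lock, model_wait_once, cond_with_mutex, cond_lockless, count_warnings) is hypothetical, and the model assumes, roughly, that the wait loop marks the task as about to sleep before it evaluates the condition, so a blocking lock taken inside the condition is what the debug check complains about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the task state that the debug check inspects. */
enum task_state { TASK_RUNNING, TASK_UNINTERRUPTIBLE };

static enum task_state current_state = TASK_RUNNING;
static int sleep_warnings;  /* counts simulated "atomic sleep" complaints */
static int txmsg_state;     /* hypothetical stand-in for txmsg->state */

/* Stand-in for a blocking lock: complains if the task is marked as
 * about to sleep, then puts the task back in the running state, as a
 * lock that really slept would. */
void model_mutex_lock(void)
{
    if (current_state != TASK_RUNNING)
        sleep_warnings++;
    current_state = TASK_RUNNING;
}

void model_mutex_unlock(void)
{
}

/* Condition that takes a blocking lock (the problematic shape). */
bool cond_with_mutex(void)
{
    bool done;

    model_mutex_lock();
    done = (txmsg_state != 0);
    model_mutex_unlock();
    return done;
}

/* Condition that only reads the state (the non-blocking shape). */
bool cond_lockless(void)
{
    return txmsg_state != 0;
}

/* One pass of the modelled wait loop: mark the task as sleeping,
 * evaluate the condition, then mark it as running again. */
bool model_wait_once(bool (*cond)(void))
{
    bool done;

    assert(cond != NULL);
    current_state = TASK_UNINTERRUPTIBLE;
    done = cond();
    current_state = TASK_RUNNING;
    return done;
}

/* Returns how many complaints one wait pass produces for a condition. */
int count_warnings(bool (*cond)(void))
{
    sleep_warnings = 0;
    txmsg_state = 1;
    (void)model_wait_once(cond);
    return sleep_warnings;
}
```

In this model, count_warnings(cond_with_mutex) reports one complaint and count_warnings(cond_lockless) reports none, which mirrors the difference between a condition that may block and one that only reads a value.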

I think the problem is real.  On two different computers and three different DisplayPort MST hubs, plugging in a DisplayPort hub or having it plugged in from boot time results in a hang within a few minutes of doing a few "xrandr" commands.

At first glance, it looked to me like it might be safe to remove the mutex_{,un}lock calls from check_txmsg_state (which is not called from anywhere else), and change the integer field txmsg->state to be an atomic_t (although I'd be surprised if there is existing hardware that supports an MST hub where accessing that field is not atomic).  However, although removing those mutex calls eliminated the complaint from CONFIG_DEBUG_ATOMIC_SLEEP, it also resulted in the system sometimes seeming to ignore the MST hub and otherwise eventually getting a kernel memory fault in the DisplayPort MST code or another spontaneous reset (possibly a deadlock followed by a watchdog reset).
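As a rough illustration of the atomic-field idea, here is a userspace sketch that uses C11 atomics in place of the kernel's atomic_t. The struct, the enum values, and both function names are hypothetical stand-ins rather than the driver's real definitions, and the set of "finished" states is an assumption made for the example. It shows only that the state can be read without taking a lock; it does not address the later crashes described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message states; the real driver defines its own set. */
enum model_tx_state {
    MODEL_TX_QUEUED,
    MODEL_TX_SENT,
    MODEL_TX_RX,
    MODEL_TX_TIMEOUT
};

/* Hypothetical stand-in for the message structure. */
struct model_txmsg {
    _Atomic int state;
};

/* Lock-free check, safe to call from a wait condition: it only
 * performs an atomic load and never blocks. */
bool model_check_txmsg_state(struct model_txmsg *txmsg)
{
    int state;

    assert(txmsg != NULL);
    state = atomic_load_explicit(&txmsg->state, memory_order_acquire);
    return state == MODEL_TX_RX || state == MODEL_TX_TIMEOUT;
}

/* Writer side: publish the new state with a release store so that a
 * reader that sees the state also sees the data written before it. */
void model_set_txmsg_state(struct model_txmsg *txmsg, int state)
{
    assert(txmsg != NULL);
    atomic_store_explicit(&txmsg->state, state, memory_order_release);
}
```

A waiter would call model_check_txmsg_state() as its condition, and the code that completes or times out a message would call model_set_txmsg_state() before waking the waiter.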

Advice is welcome, although I am not blocked in my own efforts to analyze this further.  I plan to post updates when I have more news.

Comment 1 Jesse Barnes 2015-03-05 02:00:30 UTC

Dave has a fix!

Comment 2 Adam J. Richter 2015-03-05 05:00:37 UTC

Hi, Jesse.  Thanks for your encouraging response.  Can you tell me what I should do to obtain the possible fix to test?

I assume that by "Dave" you mean Dave Airlie.  When I pull the git tree at git://people.freedesktop.org/~airlied/linux , I do not see any *_mst_* files in linux/drivers/gpu/drm.

Thanks for your response, and thanks in advance for any guidance on this.

Comment 3 Adam J. Richter 2015-03-10 02:49:40 UTC

Just to keep the information current, I'll mention that the problem is still present in Linux 4.0-rc3.

Comment 4 Adam J. Richter 2015-03-17 00:24:15 UTC

linux-4.0-rc4 has the possible partial fix that I suggested (in drm_dp_mst_topology.c, remove mutex_{,un}lock from check_txmsg_state).  Unfortunately, as I originally mentioned was the case when I tried that patch, I also get a kernel memory fault from it when I plug in a DisplayPort multi-stream transport (MST) hub, in this case at drm_dp_add_port+0x2dc.

I expect I'll try to track this down further, although doing so in linux-4.0-rc4 is slightly more complex because 'rc4 seems to be crashing a couple of other X programs that seem to run fine with rc3, implying that there may now be another kernel-release graphics bug creating symptoms at the same time.

I'll try to post an update if I have further news.  Thanks for pushing my suggested change through though.  I believe that change will be part of the complete fix.

Comment 5 Adam J. Richter 2015-03-17 04:36:22 UTC

I wrote that I would provide an update if I had news, so here goes.

I think I have found the source of the next crash.  In drm_dp_mst_topology.c, drm_dp_send_link_address() can call a hotplug handler that can change port->mstb, but the callers of this function assume the value has not changed, and sometimes get a null pointer dereference when attempting to set port->mstb->link_address_sent to true.  So, I made a change that consolidates all the uses of link_address_sent inside drm_dp_send_link_address() to avoid this.
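The shape of that change can be sketched in a small userspace model. This is not the actual patch: the structs, the callback type, and the function names are hypothetical, and the model ignores the locking and reference counting that real kernel code would need. It only shows the flag being set inside the function, after the pointer has been re-read following the callback, instead of in a caller that still holds a stale assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the branch device and port structures. */
struct model_branch {
    bool link_address_sent;
};

struct model_port {
    struct model_branch *mstb;
};

/* Hypothetical hotplug callback; it is allowed to change port->mstb. */
typedef void (*model_hotplug_fn)(struct model_port *port);

/* Example callback that models the branch device going away. */
void model_hotplug_unplug(struct model_port *port)
{
    port->mstb = NULL;
}

/* Sends the (modelled) link address request and records that it was
 * sent. The flag is set here, after the callback has run and the
 * pointer has been re-read, so callers never touch port->mstb
 * themselves. Returns true if the flag was set. */
bool model_send_link_address(struct model_port *port,
                             model_hotplug_fn hotplug)
{
    struct model_branch *mstb;

    assert(port != NULL);
    if (hotplug != NULL)
        hotplug(port);   /* may clear or replace port->mstb */

    mstb = port->mstb;   /* re-read after the callback */
    if (mstb == NULL)
        return false;    /* branch went away; nothing to mark */

    mstb->link_address_sent = true;
    return true;
}
```

With no callback, the flag is set and the function returns true. With model_hotplug_unplug as the callback, the function returns false and never dereferences the cleared pointer.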

Unfortunately, there is another crash after that.  So, I think I'm probably stretching this bug report too much by trying to cover it, since I narrowly wrote the subject as just being about the atomic sleep symptom, which was addressed in 4.0-rc4.

So, I think I should mark this bug as resolved, and then open a new bug with a broader more functional problem description like "MST hotplug causes kernel memory fault" or something like that.

I'll leave this bug report open for at least the next ~15 hours to see if anyone asks me to do otherwise.  If I don't see any objections, I'll close this bug, open the new one, and put a comment in with a link to the new bug report.

Comment 6 Jesse Barnes 2015-05-22 04:14:36 UTC

My comment was based on an irc discussion with Dave. I'll ping him again (yes, Dave Airlie of Red Hat, to whom this bug is assigned).

Comment 7 Adam J. Richter 2015-05-22 06:49:45 UTC

Thank you, Jesse.

Comment 8 Dave Airlie 2015-05-22 20:02:38 UTC

please file a new bug for the new bugs

Comment 9 Dave Airlie 2015-05-25 05:09:25 UTC

I'm not having much luck reproducing these.

I'm not sure if maybe differing userspaces might have different access patterns.

can you give some more detailed info on the hw you have? The only DP MST machine I have is a Haswell Lenovo t440s with dock, and a few Dell DP monitors.

Comment 10 kijiki0 2015-05-27 02:47:31 UTC

I've seen what appears to be the same bug (well, same as the last incarnation):
Kernel takes a #PF in drm_dp_check_and_send_link_address because the 2nd param (mstb) is NULL.

It happens when I plug my T440p into a dock connected to 2x Dell 2001FPs.  Each monitor is connected to the dock's DisplayPort ports via a DVI<->DisplayPort adapter, as the monitors only support DVI.

Comment 11 kijiki0 2015-05-27 02:56:35 UTC

Oh, forgot to mention, I've seen it with Ubuntu's 3.19.0-18, as well as a mainline 4.1-rc2.

Comment 12 kijiki0 2015-05-30 03:49:58 UTC

Interestingly, I removed one of the DP->DVI adapters and connected that monitor to the dock's DVI port, and it seems to work fine now.

So to summarize:
2x DP->DVI = BAD
1x DP->DVI, 1x DVI = GOOD

Comment 13 Jeff Mickey 2015-06-18 02:06:18 UTC

I have a t440s laptop with a dock and a single monitor (monoprice) attached via a displayport cable.

I am hitting the error mentioned in comment #5 very regularly, and finally got a panic that saved the backtrace

https://bugs.archlinux.org/task/45369

https://bbs.archlinux.org/viewtopic.php?pid=1537752#p1537752

I can very easily reproduce this issue locally with the archlinux vanilla kernel. Please let me know if there is anything I can do to help debug this. If there is another bug opened for what Adam described, please let me know and I'll remove this comment and add this comment there.

Comment 14 Dave Airlie 2015-06-19 00:17:35 UTC

Created attachment 116586
cancel work to avoid oops

this should fix this, by cancelling the work queue earlier.

Comment 15 Dave Airlie 2015-06-19 00:53:56 UTC

Created attachment 116587
try again, this should handle things better

the last patch had lockdep warnings

Comment 16 Jeff Mickey 2015-06-19 02:40:13 UTC

Created attachment 116591
dmesg from second patch

Comment 17 Jeff Mickey 2015-06-19 02:40:29 UTC

Hmmm. Well it didn't seem to do too well. Survived for a bit longer, but when I disconnected from the dock, the laptop froze. Also seemed to really spam my dmesg.

I've attached the kernel logs, unfortunately the kernel freeze (if it was that) was not captured in these logs.

Comment 18 Dave Airlie 2015-06-19 03:51:44 UTC

well it was just meant to stop the oops you were seeing,

I'm suspecting the dock firmware is failing to deal with MST monitor

Jun 18 19:33:06 nevada kernel: [drm:drm_dp_mst_handle_down_rep] Got NAK reply: req 0x21, reason 0x08, nak data 0x10

those are never good.
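For anyone sifting through a large dmesg for these, a small userspace helper along the following lines could pull the three hex fields out of each "Got NAK reply" line. It is a sketch only: the struct and function names are hypothetical, the format string is taken from the log line quoted above, and the helper makes no attempt to interpret what the request or reason codes mean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three values printed in the log line. */
struct nak_fields {
    unsigned int req;
    unsigned int reason;
    unsigned int nak_data;
};

/* Extracts req, reason and nak data from a log line that contains
 * "Got NAK reply: req 0x.., reason 0x.., nak data 0x..".
 * Returns true only if all three fields were parsed. */
bool parse_nak_line(const char *line, struct nak_fields *out)
{
    const char *p;

    assert(line != NULL && out != NULL);
    p = strstr(line, "Got NAK reply:");
    if (p == NULL)
        return false;   /* not a NAK line */

    return sscanf(p, "Got NAK reply: req 0x%x, reason 0x%x, nak data 0x%x",
                  &out->req, &out->reason, &out->nak_data) == 3;
}
```

Feeding it the line quoted above fills the struct with 0x21, 0x08 and 0x10; lines without the "Got NAK reply:" marker are rejected.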

not sure what is causing all the reprobing in that log, that is just weird.

Comment 19 Jeff Mickey 2015-06-19 05:44:20 UTC

Thanks for looking into it! If there's anything else you'd like me to test please let me know.

It's a standard lenovo dock (the "ultra" dock I believe) at the latest firmware, with a monoprice monitor.

Will look into what the nacks mean.

Comment 20 Jeff Mickey 2015-09-03 23:47:53 UTC

With 4.1.6 I don't see these issues anymore on arch with an Ultra Dock (glorified port expander essentially). Just reporting this here in case other people on this bug still see this in 4.1.6 or later so they can update with what they see. Thanks for all the help again Dave, you really made using my laptop bearable in the interim with your information.

Comment 21 Adam J. Richter 2015-09-05 02:52:59 UTC

I am sorry I failed to mention earlier that around 2015-07-28, I opened a separate freedesktop.org bug report for the problem I mentioned in comment #5, at https://bugs.freedesktop.org/show_bug.cgi?id=91481 , which includes an illustrative if probably incorrect patch, and to which I will add a note about why I think a patch like the one I provided there is probably still necessary in 4.1.6.

I think that the discussion after comment #5 here is all about this other bug, so I am marking this ticket as resolved.  However, if you think you are discussing a problem covered by this bug report and not the new one, please feel free to change this ticket's status from resolved to whatever status you believe is more appropriate.

Otherwise, I invite everyone to continue the discussion at the new bug report.

Thank you all for your fixes and information.
