Bug 93746 - Black screen when reconnecting display
Summary: Black screen when reconnecting display
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-17 21:05 UTC by Bernd Steinhauser
Modified: 2018-07-10 14:43 UTC

See Also:
i915 platform:
i915 features:


Attachments
dmesg output (68.59 KB, text/plain)
2016-01-21 17:21 UTC, Bernd Steinhauser
Xorg log (66.45 KB, text/plain)
2016-01-21 17:21 UTC, Bernd Steinhauser
First proposed patch to fix this on radeon-kms (3.73 KB, patch)
2016-02-17 23:05 UTC, Mario Kleiner
Annotated and trimmed kernel log which points to the cause of the problem. (56.67 KB, text/plain)
2016-02-17 23:09 UTC, Mario Kleiner
Patch for fix on radeon-kms (v2) reviewed and tested. (4.06 KB, patch)
2016-02-19 00:04 UTC, Mario Kleiner
Port of the radeon-kms patch v2 to amdgpu (4.06 KB, patch)
2016-02-19 00:06 UTC, Mario Kleiner

Description Bernd Steinhauser 2016-01-17 21:05:07 UTC
One of my screens (Eizo EV2455) has an odd behaviour: When switched off, it completely kills the connection, which is very annoying, but that's not the issue here.

However, when I turn it on again, I'm presented with a black screen on that screen only (the other two work fine).
I can see the mouse pointer, and I can see it changing shape when over a text field, but the window itself is not visible; it's hidden behind the blackness.

The issue goes away when I set
Option      "DRI" "2"

Please see
https://bugs.kde.org/show_bug.cgi?id=357988
as well.

GPU is an AMD Kaveri. Kernel is 4.4.0. Mesa is currently scm, but I've seen this with 10.x as well. xf86-video-ati is 7.6.1.

What other info do you need?
Comment 1 Alex Deucher 2016-01-18 15:26:31 UTC
Please attach your xorg log and dmesg output.
Comment 2 Bernd Steinhauser 2016-01-21 17:21:03 UTC
Created attachment 121189
dmesg output

Output by dmesg up to the point where the screen was switched off and on again, including a VT change (working around the problem) afterwards.
Comment 3 Bernd Steinhauser 2016-01-21 17:21:35 UTC
Created attachment 121190
Xorg log

Corresponding X log.
Comment 4 Bernd Steinhauser 2016-01-21 17:22:34 UTC
(In reply to Bernd Steinhauser from comment #0)
> The issue goes away when I set
> Option      "DRI" "2"

I think I have to correct myself here. I've seen this with DRI2 as well. I have yet to find out what changed.
Comment 5 Michel Dänzer 2016-01-27 07:32:07 UTC
Does this only happen with Option "TearFree"? Only with a 4.4 kernel?

It sounds like it could be the problem discussed in http://lists.freedesktop.org/archives/dri-devel/2016-January/098823.html and followups.

P.S. How does this relate to bug 90987?
Comment 6 Bernd Steinhauser 2016-01-27 07:43:42 UTC
(In reply to Michel Dänzer from comment #5)
> Does this only happen with Option "TearFree"? Only with a 4.4 kernel?
No, it doesn't seem to be related to the Pageflip option either.

> 
> It sounds like it could be the problem discussed in
> http://lists.freedesktop.org/archives/dri-devel/2016-January/098823.html and
> followups.
Could be, yes. However, I think that I had problems with linux 4.2 as well. 4.3 for sure. I bought this screen (the Eizo) in September and since 4.2 came out in August, this seems likely.
I will try it out, though, during my next test run tomorrow.

> 
> P.S. How does this relate to bug 90987?
Not sure. The difference is that in this one only that single screen misbehaves, whereas in bug 90987 all of them do.
I can reproduce both individually.
Also, while this one is fixable by switching to VT2 and back, bug 90987 seems only fixable by restarting X (or at least I didn't find another way yet).
Comment 7 Bernd Steinhauser 2016-01-28 18:51:43 UTC
I'm not a 100% sure, but quite certain, that I couldn't reproduce this bug with kernel 4.2.5, so I think it's at least related to the linked email.

I was also able to reproduce this in such a way that all screens went black.
From the description of the option, I think that this was because I didn't have TearFree set.

So I'm not sure anymore if it is really a separate bug or a dupe of 90987. Could be that the options just change the buffer behavior in such a way that it can either be fixed by switching to VT2 or not.
If it were the same, I should have been able to reproduce it with 4.2.5, with the effect described in bug 90987.
Comment 8 Michel Dänzer 2016-01-29 02:34:19 UTC
(In reply to Bernd Steinhauser from comment #7)
> I'm not a 100% sure, but quite certain, that I couldn't reproduce this bug
> with kernel 4.2.5, so I think it's at least related to the linked email.

The problem referenced there was only introduced in 4.4, so if you can reproduce this with 4.3, it's probably a different issue; can you bisect in that case?
Comment 9 Bernd Steinhauser 2016-01-29 06:22:15 UTC
(In reply to Michel Dänzer from comment #8)
> (In reply to Bernd Steinhauser from comment #7)
> > I'm not a 100% sure, but quite certain, that I couldn't reproduce this bug
> > with kernel 4.2.5, so I think it's at least related to the linked email.
> 
> The problem referenced there was only introduced in 4.4, so if you can
> reproduce this with 4.3, it's probably a different issue; can you bisect in
> that case?

Will try, but it could take a while; I don't know if I'll find time for that during the weekend.
Comment 10 Bernd Steinhauser 2016-02-07 10:22:39 UTC
Ok, I'm pretty sure I tracked it down:

commit 5b5561b3660db734652fbd02b4b6cbe00434d96b
Author: Mario Kleiner
Date:   Wed Nov 25 20:14:31 2015 +0100

    drm/radeon: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v2)


Apparently the same (or a similar) commit also went into amdgpu, which I didn't test as it is marked experimental for Kaveri.

The revision before that didn't show the issue; with that revision, it occurred.

To be sure I also tested 4.4.1 vanilla and with the commit reverted.
With vanilla, reconnecting the screen resulted in a black screen as described.
With the patch reverted, it did not happen.
Comment 11 Bernd Steinhauser 2016-02-07 18:33:43 UTC
Ok, I compiled a kernel with amdgpu enabled and radeon disabled.
ddx is now xf86-video-amdgpu-scm.

So far I have not seen this issue with amdgpu, even though the corresponding version of that commit is still applied.
(commit 8e36f9d33c134d5c6448ad65b423a9fd94e045cf)

So at least it doesn't happen as often as with radeon (during my last tests, I've seen it every time, but I am not sure it happens always).
I will try it for a couple of days to see if it does occur or not.
Comment 12 Bernd Steinhauser 2016-02-07 18:34:49 UTC
Should note, I set the same options for amdgpu as for radeon:
Section "Device"
    Identifier  "AMDGPU"
    Driver      "amdgpu"
    Option      "TearFree" "On"
    Option      "DRI" "3"
EndSection
Comment 13 Mario Kleiner 2016-02-08 01:50:55 UTC
(In reply to Bernd Steinhauser from comment #12)
> Should note, I set the same options for amdgpu as for radeon:
> Section "Device"
>     Identifier  "AMDGPU"
>     Driver      "amdgpu"
>     Option      "TearFree" "On"
>     Option      "DRI" "3"
> EndSection

Hi,

I just cc'ed you on a series of patches with fixes for Linux 4.4 and later. Can you check whether applying those on top of 4.4 helps? At least it should remove the problems discussed in http://lists.freedesktop.org/archives/dri-devel/2016-January/098823.html, and maybe this is related.
Comment 14 Mario Kleiner 2016-02-17 23:03:58 UTC
Documenting some progress made outside of this bug report.

Attached are a proposed patch and a trimmed and annotated syslog output from some debugging with Bernd's help:


> On 15/02/16 19:34, Mario Kleiner wrote:
>> Bernd, can you apply the attached patch on top of the others and test with
>> drm.debug=35 again, and ideally provide syslog output with the microsecond
>> timestamps included? This adds some debug statements to see if something goes
>> wrong in radeon_flip_work_func on radeon-kms.
> Sure, I tried that, but I ran into this problem:
> Feb 16 21:21:17.870174 orionis systemd-journald[618]: Missed 3 kernel
> messages
>
> The log is spammed with these messages. So I did some research and was
> pointed to the kernel parameter log_buf_len, which I set to 16M
> (actually tried values up to 128M), but that didn't help.
>
> I'll attach the log anyway, but if you tell me how to get rid of that
> issue, I can of course give it another go.
> For journald, I set RateLimitBurst=0, which should prevent it from not
> accepting messages due to the log spam.
>
> I also did another experiment. The Dell U2415 is normally connected to
> HDMI-0.
> I connected that to DP-0 (with the Eizo disconnected) and found out,
> that now the Dell shows the same behavior as the Eizo. When disconnected
> and reconnected, the screen will be black.
> If I turn on the DP1.2 setting for the Dell, it will do so even when
> switched off, most likely because the connection is cut with DP1.2
> turned on (really really annoying behavior, that's also the reason why I
> have so many problems with the Eizo, but for that DP1.2 cannot be
> switched off).
>
> Thus, it doesn't seem like the problem is related to specific display
> hardware.
> If you'd like, I can test the HP ZR24W as well (normally connected to DVI).
>
> Best Regards,
> Bernd

Thanks, that one was useful. Can you revert the debug patch I sent you last and instead apply the attached patch and retest? Maybe also remove the patch "[PATCH 1/2] drm/radeon: Make vbl counter/ts queries robust against dpms on/off. [RFC]", so your tree corresponds more closely to what is in 4.4/4.5rc.

This one is a proposed fix for the problem, also for stable 4.4.

It seems that when the connection gets cut while a pageflip is being queued by the userspace driver, the radeon-kms driver does DPMS OFF as part of its hotplug work function (apparently only for DP displays, if I understand the code correctly). Then when radeon_flip_work_func() executes the "wait until real start of vblank" code introduced in Linux 4.4 to fix other regressions, that code runs while the display engine is already disabled and the scanout is no longer moving. That sends the wait code into an infinite loop, hence the huge number of messages in your syslog: flip_work_func hangs -> pageflip hangs -> game over.

So the attached patch should make that new wait code robust against such unexpected things as dpms off/on happening in parallel. Maybe we could also manage to get this into 4.5-rc5, now that the other vblank fixes have landed in Linus' tree.
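
For illustration only, the general idea of a wait loop that cannot hang on a disabled display can be sketched in plain userspace C. This is not the attached patch and not radeon-kms code: struct fake_crtc, fake_crtc_tick(), wait_for_vblank_start() and the poll budget are all hypothetical names and values made up for this sketch. It only shows the two escape hatches discussed above: bail out at once if the crtc is disabled, and give up after a bounded number of polls if the scanout position never reaches vblank.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a display controller (NOT a radeon-kms struct). */
struct fake_crtc {
    bool enabled;        /* false once DPMS OFF has been applied */
    int  scanline;       /* current simulated scanout position */
    int  vblank_start;   /* first scanline that counts as vblank */
    int  total_lines;    /* scanlines per frame, including vblank */
    int  lines_per_poll; /* scanout advance per poll; 0 = scanout stuck */
};

enum wait_result { WAIT_OK, WAIT_CRTC_DISABLED, WAIT_TIMED_OUT };

/* Advance the simulated scanout; a disabled crtc does not move at all. */
static void fake_crtc_tick(struct fake_crtc *c)
{
    if (c->enabled)
        c->scanline = (c->scanline + c->lines_per_poll) % c->total_lines;
}

/*
 * Wait for the start of vblank, but never forever:
 *  - return immediately if the crtc is disabled,
 *  - return a timeout after max_polls polls without reaching vblank.
 */
static enum wait_result wait_for_vblank_start(struct fake_crtc *c, int max_polls)
{
    assert(c != NULL);

    for (int polls = 0; polls < max_polls; polls++) {
        if (!c->enabled)
            return WAIT_CRTC_DISABLED;
        if (c->scanline >= c->vblank_start)
            return WAIT_OK;
        fake_crtc_tick(c); /* stands in for sleeping, then re-reading hardware */
    }
    return WAIT_TIMED_OUT;
}
```

In this sketch a caller would treat WAIT_CRTC_DISABLED and WAIT_TIMED_OUT the same way: stop waiting and carry on, rather than spinning while the scanout is frozen.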

I don't know if it is expected behavior that pageflips can be queued by userspace while the display is disabled, or if the ddx shouldn't already prevent that? The log suggests that the ddx got the hot(un)plug event before it tried to page flip anyway in TearFree.
Comment 15 Mario Kleiner 2016-02-17 23:05:11 UTC
Created attachment 121821
First proposed patch to fix this on radeon-kms
Comment 16 Mario Kleiner 2016-02-17 23:09:28 UTC
Created attachment 121822
Annotated and trimmed kernel log which points to the cause of the problem.

See "-->" for key events.
Comment 17 Bernd Steinhauser 2016-02-18 04:31:49 UTC
(In reply to Mario Kleiner from comment #15)
> Created attachment 121821
> First proposed patch to fix this on radeon-kms

This does work, I no longer observe the problem, thanks.

Maybe the reporter of bug 90987 should test the patch as well, since it was never clear whether that bug is the same as this one (maybe the patch fixes both).

(I know that this was a regression from 4.4, but maybe there was a different way to trigger this before.)
Comment 18 Mario Kleiner 2016-02-19 00:04:58 UTC
Created attachment 121834
Patch for fix on radeon-kms (v2) reviewed and tested.

Final patch for Linux 4.4 stable and later.
Comment 19 Mario Kleiner 2016-02-19 00:06:15 UTC
Created attachment 121835
Port of the radeon-kms patch v2 to amdgpu

Identical patch for amdgpu.
Comment 20 Michel Dänzer 2018-07-10 14:43:18 UTC
Thanks for the report. Resolving, as Mario's fixes landed long ago.

