Bug 91952 - [Bisected Regression] Blank screen from boot until any input on X
Summary: [Bisected Regression] Blank screen from boot until any input on X
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Maarten Lankhorst
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-09-10 01:24 UTC by João Paulo Rechi Vita
Modified: 2017-07-24 22:45 UTC

See Also:
i915 platform: BYT
i915 features:


Attachments
Revert "drm/i915: get rid of primary_enabled and use atomic state" (8.29 KB, patch)
2015-09-10 01:25 UTC, João Paulo Rechi Vita
Revert "drm/i915: Use the disable callback for disabling planes." (5.55 KB, patch)
2015-09-10 01:26 UTC, João Paulo Rechi Vita
dmesg with drm.debug=0x1e log_buf_len=1M when the problem happens (223.57 KB, text/plain)
2015-09-15 22:21 UTC, João Paulo Rechi Vita
dmesg with drm.debug=0x1e log_buf_len=1M when the problem does not happen (179.15 KB, text/plain)
2015-09-15 22:23 UTC, João Paulo Rechi Vita
dmesg with drm.debug=0x1e log_buf_len=1M when the problem happens v2 (207.63 KB, text/plain)
2015-09-16 13:04 UTC, João Paulo Rechi Vita
Kernel config (183.44 KB, text/plain)
2015-09-16 20:54 UTC, João Paulo Rechi Vita
Only disable planes that are potentially enabled. (604 bytes, patch)
2015-09-17 08:55 UTC, Maarten Lankhorst

Description João Paulo Rechi Vita 2015-09-10 01:24:00 UTC
Affected version
----------------

This happens on the v4.2 tag of the mainline kernel and is a regression from the v4.1 tag.


Affected platforms
------------------

It happens on a Cherryview laptop with the built-in display connected over eDP, and on a Celeron N2807 (I am not sure which processor family this one belongs to) with a monitor connected over HDMI. In both cases the kernel and userspace are 32-bit.


Description
-----------

The symptom is that the screen remains blank from boot until some input action (cursor movement or a keypress) is performed after X has started. Only then is the GDM login screen shown. You can tell the screen is powered on and the brightness is high. It does not happen on every boot; sometimes it takes up to 6-7 reboots to trigger the bug. On the Celeron, disabling the GRUB boot menu (timeout=0) makes it happen much more often.


Expected results
----------------

From boot, the plymouth splash animation should be displayed, followed by the GDM login screen.


Bisect results
--------------

I bisected the problem and arrived at "drm/i915: Use the disable callback for disabling planes." (27321ae88c70104df1ade701e079932b54360885) as the culprit. Reverting it and its predecessor makes the problem go away, but there were several conflicts during both reverts, so I'm not sure the reverts are 100% correct. I'm attaching the revert commits.

Let me know if there is any other information needed.
Comment 1 João Paulo Rechi Vita 2015-09-10 01:25:39 UTC
Created attachment 118174
Revert "drm/i915: get rid of primary_enabled and use atomic state"

This is needed to allow reverting "drm/i915: Use the disable callback for disabling planes."
Comment 2 João Paulo Rechi Vita 2015-09-10 01:26:20 UTC
Created attachment 118175
Revert "drm/i915: Use the disable callback for disabling planes."
Comment 3 Jani Nikula 2015-09-10 12:55:22 UTC
Any chance to test drm-intel-nightly branch of [1] please?

[1] http://cgit.freedesktop.org/drm-intel
Comment 4 João Paulo Rechi Vita 2015-09-10 22:13:13 UTC
I've tested today's drm-intel-nightly branch (HEAD 333b2479dc32eaf4343acd58adb25d1736c81588) on the Celeron N2807 and could not reproduce it.

I also tested the same branch on August 27th (when I first saw this bug) on the Cherryview, and could not reproduce it there either.
Comment 5 João Paulo Rechi Vita 2015-09-11 14:07:37 UTC
I think this bug is caused by some kind of race condition, given how hard it is to reproduce and the fact that this unrelated patch [1], which affects the kernel boot time, mitigates the symptoms:

author	Andy Whitcroft <apw@canonical.com>	2009-12-02 14:41:53 (GMT)
committer	Tim Gardner <tim.gardner@canonical.com>	2015-08-31 00:32:13 (GMT)
commit	2015a241030e89920d7d3613a81c30c36e7f11ac (patch)
tree	3d515264f0d291a6f83410e3f7421055eb1d26aa
parent	366eebd045e55262304d45c138bdcca004726c55 (diff)
UBUNTU: SAUCE: isapnp_init: make isa PNP scans occur async
The results of scanning for devices is to trigger udev events therefore
we can push this processing async.

This reduces kernel initialisation time (the time from bootloader to
starting userspace) by several 10ths of a second x86 32bit systems.

Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>


So I'm not sure whether this is actually fixed in drm-intel-nightly or whether that branch just has different timing that does not trigger the problem. In any case, it would be nice to have a fix for stable.

[1] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/wily/commit/?id=2015a241030e89920d7d3613a81c30c36e7f11ac
Comment 6 Daniel Drake 2015-09-11 18:06:40 UTC
I can also reproduce on N2807. The bug usually bites on cold boot, and not on reboot. I tested 20 cold boots and hit the bug 19 times.

Reverting that commit does convincingly avoid the problem: I tested another 20 cold boots and saw 0 failures.

I added some printk's to trace the logic in intel_crtc_disable_planes() and intel_commit_primary_plane(). There is no obvious behavioural difference in the case of reboot (working) vs cold boot (no display).
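
To put the "20 cold boots, 0 failures" result in perspective, here is a minimal sketch of the arithmetic, assuming each boot fails independently with a fixed probability (a simplification, since the thread suspects a timing-dependent race). The function name and the per-boot rates are illustrative: 19/20 is the cold-boot rate reported in this comment, and 1/7 is a rough reading of the "up to 6-7 reboots" figure in the description.

```python
def p_all_clean(failure_rate, boots):
    """Probability that `boots` independent boots all succeed,
    given a per-boot failure probability `failure_rate`."""
    return (1.0 - failure_rate) ** boots


# Rate observed on cold boot in this comment (19 failures out of 20).
cold_boot_rate = 19 / 20
# Rough rate implied by "up to 6-7 reboots" in the description (assumption).
reboot_rate = 1 / 7

# Chance of seeing 20 clean boots purely by luck if nothing was fixed.
print(p_all_clean(cold_boot_rate, 20))
print(p_all_clean(reboot_rate, 20))
```

At the cold-boot rate, 20 clean boots in a row would be essentially impossible without a real change in behaviour. At the lower reboot rate, a clean run of 20 is still possible by chance, which is why cold boots are the more convincing test.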
Comment 7 João Paulo Rechi Vita 2015-09-15 21:35:00 UTC
I've tested the following diff provided by Maarten on top of v4.2 with no success:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index a387b7db1970..2ba6f10eeca7 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -2570,6 +2570,11 @@ intel_find_initial_plane_obj(struct intel_crtc *intel_crtc,
 
 	kfree(plane_config->fb);
 
+	if (!drm_mm_initialized(&dev_priv->mm.stolen)) {
+		to_intel_plane(primary)->disable_plane(primary, &intel_crtc->base, true);
+		return;
+	}
+
 	/*
 	 * Failed to alloc the obj, check to see if we should share
 	 * an fb with another CRTC instead

I'm going to attach a dmesg log with drm.debug=0x1e as requested on IRC.
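
For readers wondering what drm.debug=0x1e selects: the value is a bitmask of DRM debug categories. The sketch below decodes it, assuming the category bit values used by kernels of the v4.2 era (CORE=0x01, DRIVER=0x02, KMS=0x04, PRIME=0x08, ATOMIC=0x10). The table and the helper name are written from memory for illustration and should be checked against the kernel headers for the version in use.

```python
# DRM debug category bits (assumed values for kernels around v4.2).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by `mask`,
    ordered from the lowest bit to the highest."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0x1E))
```

Under these assumptions, 0x1e enables every category except CORE, which keeps driver, modeset and atomic messages while leaving out the noisiest core output.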
Comment 8 João Paulo Rechi Vita 2015-09-15 22:21:56 UTC
Created attachment 118297
dmesg with drm.debug=0x1e log_buf_len=1M when the problem happens
Comment 9 João Paulo Rechi Vita 2015-09-15 22:23:20 UTC
Created attachment 118298
dmesg with drm.debug=0x1e log_buf_len=1M when the problem does not happen
Comment 10 Maarten Lankhorst 2015-09-16 09:22:31 UTC
Are these logs from the same kernel?

If so, does the bug go away when you boot with modprobe.blacklist=rtl8723be ?

Because what I see is the following:

NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [NetworkManager:227]
CPU: 1 PID: 227 Comm: NetworkManager Not tainted #41
Hardware name: Endless EC-200/Aptio CRB, BIOS E03 03/02/2015
Comment 11 João Paulo Rechi Vita 2015-09-16 13:03:20 UTC
Yes, these two logs are from the same kernel.

I have not seen that soft lockup before, but it is certainly an unrelated problem: the bug still happens when booting with modprobe.blacklist=rtl8723be, as you suggested, and the soft lockup is not always present when the bug happens.

Another piece of information: when the problem does not happen, I see one console message right before the plymouth bootsplash takes over, and it is timestamped 2.xxx seconds. When it does happen I don't see even that message, nor the plymouth splash or anything else, until X is reached and I move the cursor or press any key. So my uninformed guess is that the problem is related to something that happens during the first 3s of the boot process.

Sorry for the noise, I'm attaching a 2nd log from the same kernel where the problem happens in which the lockup does not happen. Please let me know if there is something else I can provide or check to help fight this regression. I can provide a video of the problem happening and the expected results if that is of any help.
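
One way to compare the attached logs around that early window is to filter the dmesg output by timestamp. The following is a minimal sketch, assuming the usual "[ seconds.micros] message" prefix. The function name and the 3-second cutoff are illustrative, and apart from the efi line (quoted later in this thread) the sample lines are placeholders, not real log output.

```python
import re

# Matches the standard dmesg prefix, e.g. "[    2.123456] some message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def early_messages(lines, cutoff=3.0):
    """Return (timestamp, message) pairs for lines stamped at or before
    `cutoff` seconds. Lines without a timestamp prefix are skipped."""
    result = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and float(match.group(1)) <= cutoff:
            result.append((float(match.group(1)), match.group(2)))
    return result


sample = [
    "[    0.000000] efi: No EFI runtime due to 32/64-bit mismatch with kernel",
    "[    2.500000] placeholder message inside the window",
    "[    7.250000] placeholder message outside the window",
    "line without a timestamp",
]

for stamp, message in early_messages(sample):
    print(stamp, message)
```

Running this over the "problem happens" and "problem does not happen" logs and diffing the two results would narrow the comparison to the first seconds of boot.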
Comment 12 João Paulo Rechi Vita 2015-09-16 13:04:16 UTC
Created attachment 118313
dmesg with drm.debug=0x1e log_buf_len=1M when the problem happens v2
Comment 13 João Paulo Rechi Vita 2015-09-16 20:54:29 UTC
Created attachment 118320
Kernel config

I'm also attaching the kernel config in which the bug happens, since it is not reproducible with i386_defconfig.
Comment 14 Maarten Lankhorst 2015-09-17 08:55:54 UTC
Created attachment 118324
Only disable planes that are potentially enabled.

Ok one thing I did notice:
[    0.000000] efi: No EFI runtime due to 32/64-bit mismatch with kernel

You're using a 32-bit kernel with a 64-bit EFI runtime. That is probably not harmful in itself, but you could check whether the problem goes away with noefi.

Taking a closer look at the reverts:

There appears to be a bug in the plane disable code where the call to intel_disable_primary_hw_plane does nothing.

This isn't noticed because the primary plane gets disabled through intel_disable_sprite_planes.

I'm guessing the attached patch might solve the issue too.

Can you check?
Comment 15 João Paulo Rechi Vita 2015-09-17 20:59:22 UTC
Yes, I'm aware we're running a 32-bit kernel on a 64-bit EFI, and as you said, this is not a problem in itself. We had the same situation with 4.1 and during my bisect, so I don't think it would make any difference.

After applying the patch you provided (attachment 118324) on top of a vanilla 4.2 and doing 20 boots (10 reboots and 10 cold boots) on each of the platforms, I don't see the problem anymore.

If you are going to submit a patch to stable/upstream please use my work email for any of the tags (reported/tested) you might want to add to the patch: jprvita@endlessm.com

Thank you very much for the fix!
Comment 16 João Paulo Rechi Vita 2015-09-18 15:46:16 UTC
I tested the following modified version of the patch, as requested on IRC, and things continue to look good:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 87476ff..a5f97cf 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -4863,7 +4863,8 @@ static void intel_crtc_disable_planes(struct drm_crtc *crtc)

        intel_crtc_dpms_overlay_disable(intel_crtc);
        for_each_intel_plane(dev, intel_plane) {
-               if (intel_plane->pipe == pipe) {
+               if (intel_plane->pipe == pipe &&
+                   to_intel_plane_state(intel_plane->base.state)->visible) {
                        struct drm_crtc *from = intel_plane->base.crtc;

                        intel_plane->disable_plane(&intel_plane->base,
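
The effect of the added condition can be illustrated with a small model. This is a sketch only, not kernel code: the Plane class, the function name and the sample planes are hypothetical stand-ins for struct intel_plane, the loop in intel_crtc_disable_planes() and a real plane list. It shows just the filtering that the diff introduces, where the disable callback is invoked only for planes on the given pipe whose state is marked visible.

```python
from dataclasses import dataclass


@dataclass
class Plane:
    """Hypothetical stand-in for a plane: its name, pipe and visibility."""
    name: str
    pipe: int
    visible: bool


def planes_to_disable(planes, pipe, require_visible):
    """Return the names of planes whose disable callback would be called.

    With require_visible=False this models the loop before the patch
    (every plane on the pipe); with True it models the patched loop
    (only planes on the pipe that are currently visible)."""
    return [
        plane.name
        for plane in planes
        if plane.pipe == pipe and (plane.visible or not require_visible)
    ]


# Illustrative plane list, not taken from real hardware state.
planes = [
    Plane("primary", pipe=0, visible=True),
    Plane("sprite", pipe=0, visible=False),
    Plane("cursor", pipe=1, visible=True),
]

print(planes_to_disable(planes, pipe=0, require_visible=False))
print(planes_to_disable(planes, pipe=0, require_visible=True))
```

In this model the unpatched loop touches the invisible sprite plane as well, while the patched loop leaves planes that are not visible alone.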
Comment 17 Maarten Lankhorst 2016-03-07 07:46:58 UTC
commit 634b3a4a476e96816d5d6cd5bb9f8900a53f56ba
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Mon Nov 23 10:25:28 2015 +0100

    drm/i915: Do a better job at disabling primary plane in the noatomic case.

