110838 – Black screen at desktop on kernel for ICL only

Summary: Black screen at desktop on kernel for ICL only

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high normal
Assignee:	Ville Syrjala
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:	bisected, regression

Depends on:
Blocks:

Reported:	2019-06-04 21:33 UTC by fjdegroo
Modified:	2019-08-20 16:00 UTC
CC List:	2 users

See Also:
i915 platform:	ICL
i915 features:	display/Other

Attachments
dmesg of failing case (black display) (69.09 KB, text/plain) 2019-06-05 21:33 UTC, fjdegroo
dmesg of failing case, try #2 (305.25 KB, text/plain) 2019-06-05 22:29 UTC, fjdegroo

Description fjdegroo 2019-06-04 21:33:14 UTC

Recent 5.2 kernels show a black screen after logging in on an ICL machine.  Using drm-tip packages, I tracked the regression to somewhere between the 2019-05-26 and 2019-05-28 builds.  All of these kernels boot on my KBL.

  GOOD: https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2019-05-24
  GOOD: https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2019-05-26
    *regression*
  BAD: https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2019-05-28

I am not sure how to bisect commits from here to find the commit that regressed the kernel on ICL.  If there is a straightforward process for this, please send me instructions.
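
For reference, bisecting between a known-good and a known-bad build is a binary search over the commit history, which is what `git bisect` automates. A minimal sketch of the idea in Python follows. The function name `first_bad` and the `is_bad` callback are hypothetical; in practice `is_bad` means "build and boot this commit, then check whether the desktop is black". It assumes the commits are ordered oldest to newest and that every commit after the regression is also bad.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit.

    Assumptions (hypothetical helper, not part of git):
    - commits are ordered oldest to newest,
    - commits[0] is known good and commits[-1] is known bad,
    - once a commit is bad, all later commits are bad too.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Toy history: c0..c7, where the regression landed in c5.
history = [f"c{i}" for i in range(8)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 5)
print(culprit)
```

Each step halves the remaining range, so even a few hundred commits between two drm-tip builds need only a handful of test boots.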

march: x86_64
HW: ICL D2
Display: DP

Comment 1 Jani Saarinen 2019-06-05 04:37:18 UTC

Instead of the ppa, can you also try to reproduce the error using the latest drm-tip (https://cgit.freedesktop.org/drm-tip) with kernel parameters drm.debug=0x1e log_buf_len=4M? If the problem persists, attach the full dmesg from boot.

Also report which BIOS version you have. You can see what we have from our CI (https://intel-gfx-ci.01.org/tree/drm-tip/?hosts=icl), e.g. on icl-u2: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6188/fi-icl-u2/boot0.log => DMI: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019.

Comment 2 Jani Saarinen 2019-06-05 04:51:45 UTC

Also note the CI page for u3: https://intel-gfx-ci.01.org/tree/drm-tip/fi-icl-u3.html. There was one set of patches that prevented ICL from booting at all; it was fixed in later builds. This bad build also falls within your timeline: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6159/git-log-oneline.log. Maybe try a later ppa too. It seems 6159 (a very bad build) was the last build from the 28th, which could explain this, as 6160 is already from the 29th: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6160/git-log-oneline.log

Comment 3 Mark Janes 2019-06-05 16:08:55 UTC

This was reproduced with a local build of drm-tip on 6/3.

I asked Rodrigo how to interpret the gfx ci web page to see if we could identify the regression in the results, and we couldn't figure it out.

Looking at the link you provided, I'm still unsure of how to figure out what indicates the regression.  Is there already a bug written up about this?

Comment 4 Ville Syrjala 2019-06-05 16:27:26 UTC

Please boot with drm.debug=0xe passed to the kernel cmdline and attach the resulting dmesg once you've hit the black screen. Also pass eg. log_buf_len=4M in case the log gets truncated. That should hopefully tell us if this is a display issue.
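
For readers unsure what the drm.debug masks requested in this report (0x1e and 0xe) enable: the value is a bitmask of debug categories. The sketch below decodes a mask in Python. The bit assignments are an assumption based on the DRM debug categories as I understand them for kernels of this period, and `decode_drm_debug` is a hypothetical helper name; check the kernel's DRM documentation for the tree you are running.

```python
from typing import List

# Assumed drm.debug category bits for kernels around 5.2
# (verify against the DRM documentation of your kernel tree).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask: int) -> List[str]:
    """Return the names of the debug categories enabled by mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


print(hex(0xE), decode_drm_debug(0xE))
print(hex(0x1E), decode_drm_debug(0x1E))
```

Under that assumption, 0xe enables driver, KMS and PRIME messages, and 0x1e additionally enables atomic modeset messages.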

Comment 5 Jani Saarinen 2019-06-05 16:32:41 UTC

Also still, please report your BIOS version.

Comment 6 Jani Saarinen 2019-06-05 16:57:56 UTC

Mark, if you look at https://intel-gfx-ci.01.org/tree/drm-tip/fi-icl-u2.html
and the column CI_DRM_6159, you will see that it is completely empty. We noticed this because none of the ICLs booted (e.g. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6159/fi-icl-u2/), including shards. No bug was filed, or maybe Martin knows, but one patch got reverted and there were some extra hiccups after the revert (https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6163/git-log-oneline.log). See those selftests (dmesg warnings). Anyway, there should not be any issues as the system boots.

Comment 7 Mark Janes 2019-06-05 20:43:37 UTC

Jani, thanks for the pointers.  I'm confident that this is a separate issue.  The system boots fine, but the display is blank.  The issue has persisted on i915's CI builds for weeks.

We are currently bisecting down to the commit.

Comment 8 Mark Janes 2019-06-05 20:53:11 UTC

Bisected to:

Author:     Ville Syrjälä <ville.syrjala@linux.intel.com>
drm/i915: Make sure we have enough memory bandwidth on ICL

ICL has so many planes that it can easily exceed the maximum
effective memory bandwidth of the system. We must therefore check
that we don't exceed that limit.

The algorithm is very magic number heavy and lacks sufficient
explanation for now. We also have no sane way to query the
memory clock and timings, so we must rely on a combination of
raw readout from the memory controller and hardcoded assumptions.
The memory controller values obviously change as the system
jumps between the different SAGV points, so we try to stabilize
it first by disabling SAGV for the duration of the readout.

The utilized bandwidth is tracked via a device wide atomic
private object. That is actually not robust because we can't
afford to enforce strict global ordering between the pipes.
Thus I think I'll need to change this to simply chop up the
available bandwidth between all the active pipes. Each pipe
can then do whatever it wants as long as it doesn't exceed
its budget. That scheme will also require that we assume that
any number of planes could be active at any time.

TODO: make it robust and deal with all the open questions

v2: Sleep longer after disabling SAGV
v3: Poll for the dclk to get raised (seen it take 250ms).
    If the system has 2133MT/s memory then we pointlessly
    wait one full second.
v4: Use the new pcode interface to get the qgv points rather
    than using hardcoded numbers
v5: Move the pcode stuff into intel_bw.c (Matt)
    s/intel_sagv_info/intel_qgv_info/
    Do the NV12/P010 as per spec for now (Matt)
    s/IS_ICELAKE/IS_GEN11/
v6: Ignore bandwidth limits if the pcode query fails

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Acked-by: Clint Taylor <Clinton.A.Taylor@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190524153614.32410-1-ville.syrjala@linux.intel.com

Comment 9 fjdegroo 2019-06-05 21:32:53 UTC

I tried to set the kernel cmdline to include "set drm.debug=0xe".  Not sure if I did it correctly.  Please tell me if I messed this up.  Attached dmesg.log.

File /var/log/Xorg.0.log was not created.

Bios=R00.3183.A00.1905020411

Comment 10 fjdegroo 2019-06-05 21:33:22 UTC

Created attachment 144464 [details]
dmesg of failing case (black display)

Comment 11 Ville Syrjala 2019-06-05 22:19:57 UTC

(In reply to fjdegroo from comment #10)
> Created attachment 144464 [details]
> dmesg of failing case (black display)

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.1.0-rc5c457d9+ root=UUID=06f9ea84-363e-408d-ad4b-048233161a7a ro quiet splash vt.handoff=1

drm.debug=0xe not there
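
The check above (did a parameter actually reach the kernel command line?) can be scripted before collecting a dmesg. A minimal sketch in Python, where `get_kernel_param` is a hypothetical helper and the sample string is the command line quoted from the attachment:

```python
from typing import Optional


def get_kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of a kernel parameter, or None if it is absent.

    Parameters given without '=' (such as 'quiet') yield an empty string.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value if sep else ""
    return None


# Command line from the first attached dmesg (see the quote above).
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-5.1.0-rc5c457d9+ "
    "root=UUID=06f9ea84-363e-408d-ad4b-048233161a7a "
    "ro quiet splash vt.handoff=1"
)

print(get_kernel_param(CMDLINE, "drm.debug"))  # parameter is missing
print(get_kernel_param(CMDLINE + " drm.debug=0xe log_buf_len=4M", "drm.debug"))
```

On a running system the same function can be fed the contents of /proc/cmdline, for example `get_kernel_param(open("/proc/cmdline").read(), "drm.debug")`.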

Comment 12 fjdegroo 2019-06-05 22:29:03 UTC

Created attachment 144465 [details]
dmesg of failing case, try #2

Comment 13 Jani Saarinen 2019-06-06 04:00:30 UTC

There is no way we would revert that change; it was a major fix for FIFO underruns. What displays are we talking about here? eDP and external ones?
I have ICL booting nicely with eDP (4K), DP and HDMI simultaneously on the latest drm-tip. It still looks like you have some issues on your system.

Comment 14 Jani Saarinen 2019-06-06 06:21:35 UTC

One difference is also that you use the ppa repo and not pure drm-tip...

Comment 15 Ville Syrjala 2019-06-06 08:20:43 UTC

[    4.878097] [drm:intel_bw_init_hw [i915]] QGV 0: DCLK=224 tRP=34 tRDPRE=14 tRAS=79 tRCD=34 tRC=113
[    4.878188] [drm:intel_bw_init_hw [i915]] BW0 / QGV 0: num_planes=4 deratedbw=7612
[    4.878273] [drm:intel_bw_init_hw [i915]] BW1 / QGV 0: num_planes=2 deratedbw=12465
[    4.878356] [drm:intel_bw_init_hw [i915]] BW2 / QGV 0: num_planes=1 deratedbw=17032

Looks like it only exposes a single QGV point, which I apparently failed to consider. I do remember thinking about that but apparently it slipped my mind.

BTW do you have SAGV disabled in the BIOS or something? Just wondering how we got to this state...
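
To spot this condition in other logs, the `intel_bw_init_hw` debug lines can be parsed and the QGV points counted. The following is a rough Python sketch that assumes the log format shown in the excerpt above; `parse_bw_log` is a hypothetical helper, not an existing tool.

```python
import re
from typing import Dict, List, Tuple

# Patterns assume the intel_bw_init_hw debug output format quoted above.
QGV_RE = re.compile(r"QGV (\d+): DCLK=(\d+)")
BW_RE = re.compile(r"BW(\d+) / QGV (\d+): num_planes=(\d+) deratedbw=(\d+)")

LOG = """\
[    4.878097] [drm:intel_bw_init_hw [i915]] QGV 0: DCLK=224 tRP=34 tRDPRE=14 tRAS=79 tRCD=34 tRC=113
[    4.878188] [drm:intel_bw_init_hw [i915]] BW0 / QGV 0: num_planes=4 deratedbw=7612
[    4.878273] [drm:intel_bw_init_hw [i915]] BW1 / QGV 0: num_planes=2 deratedbw=12465
[    4.878356] [drm:intel_bw_init_hw [i915]] BW2 / QGV 0: num_planes=1 deratedbw=17032
"""


def parse_bw_log(log: str) -> Tuple[Dict[int, int], List[Tuple[int, int, int, int]]]:
    """Return ({qgv_index: dclk}, [(bw_group, qgv_index, num_planes, deratedbw)])."""
    qgv_points: Dict[int, int] = {}
    bw_entries: List[Tuple[int, int, int, int]] = []
    for line in log.splitlines():
        bw = BW_RE.search(line)
        if bw:
            g, q, planes, derated = (int(x) for x in bw.groups())
            bw_entries.append((g, q, planes, derated))
            continue
        qgv = QGV_RE.search(line)
        if qgv:
            qgv_points[int(qgv.group(1))] = int(qgv.group(2))
    return qgv_points, bw_entries


qgv_points, bw_entries = parse_bw_log(LOG)
print(f"{len(qgv_points)} QGV point(s) exposed: {qgv_points}")
for entry in bw_entries:
    print("BW%d / QGV %d: num_planes=%d deratedbw=%d" % entry)
```

For the excerpt above this reports a single QGV point (index 0, DCLK=224), which matches the observation that the machine exposes fewer points than the code expected.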

Comment 16 Jani Saarinen 2019-06-06 13:07:24 UTC

Can you reset to BIOS defaults and comment on whether that helps? Otherwise I will also let Ville fix some issues ;).

Comment 17 fjdegroo 2019-06-06 16:43:40 UTC

Regarding the BIOS, we are a power and performance group and so regularly fix the IA/GT/Ring/DRAM frequencies to reduce run-to-run variance.  This is necessary for the performance issues that we routinely chase.  Part of this is fixing the SAGV to High in the BIOS.

I reset the SAGV to Enable in the bios and this display issue went away.  I can now see the desktop after login.

Re-enabling SAGV in the BIOS is a good workaround to get us unblocked. But long term we will need a solution that gives us a display when SAGV is fixed (for example, to High).

Comment 18 Lakshmi 2019-06-10 08:08:04 UTC

There is already a patch from Ville on the mailing list fixing this issue.

Comment 19 Ville Syrjala 2019-07-03 18:53:36 UTC

commit 56e9371bc3f3e7d6c1a197a45d550b2ce6af25f6
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Jun 6 15:42:10 2019 +0300

    drm/i915: Deal with machines that expose less than three QGV points

Comment 20 Jani Saarinen 2019-08-20 16:00:46 UTC

Closing
