Bug 89727 - [SKL bisected] system hang when changing resolution in games or killing gnome-session twice
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: highest critical
Assignee: Matt Roper
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-23 08:49 UTC by ye.tian
Modified: 2016-10-05 05:30 UTC

See Also:
i915 platform:
i915 features:


Attachments
dmesg info (121.46 KB, text/plain)
2015-03-23 08:53 UTC, ye.tian

Description ye.tian 2015-03-23 08:49:35 UTC
System Environment:       
-----------------------------------------------------
Platform: SKL 
Regression: Not sure. This bug was introduced by the fix for another known issue, bug 89388. We verified that the commit below (which exists in the drm-intel-next-queued branch) was intended to fix bug 89388 but causes this issue.
The console shows this error message: "[drm:gen8_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun"

==Kernel==
--------------------------------------------------
commit c9f038a1a5924352ab8e510e4a45ac57b08db391
Author:     Matt Roper <matthew.d.roper@intel.com>
AuthorDate: Mon Mar 9 11:06:02 2015 -0700
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Tue Mar 17 22:30:17 2015 +0100

    drm/i915: Don't assume primary & cursor are always on for wm calculation (v4)

    Current ILK-style watermark code assumes the primary plane and cursor
    plane are always enabled.  This assumption, along with the combination
    of two independent commits that got merged at the same time, results in
    a NULL dereference.  The offending commits are:

Bug detailed description:
--------------------------------------------------

Reproduce steps:
----------------------------
1. Boot up the system with an eDP display connected
2. xinit &
3. gnome-session
4. Kill X
5. Repeat steps 2-4; the machine will hang.
Comment 1 ye.tian 2015-03-23 08:53:20 UTC
Created attachment 114544
dmesg info
Comment 2 Gordon Jin 2015-03-23 14:03:15 UTC
This also happens when we change resolution in games (e.g. padman, etqw-demo, and many others) or exit/restart games, so this blocks most game testing.
Comment 3 Damien Lespiau 2015-03-23 15:14:06 UTC
I can't reproduce this one by starting X and killing it, nor by doing init 3/init 5 cycles. tjaalton on IRC has similar issues though and with the same commit:

<tjaalton> c9f038a1a592435 makes my skl-y hang for good on X restart

I don't see anything wrong with this commit. Maybe a side effect elsewhere in the WM code, triggered by that commit.
Comment 4 Gordon Jin 2015-03-24 00:44:50 UTC
(In reply to Damien Lespiau from comment #3)
> I can't reproduce this one by starting X and killing it, nor by doing init
> 3/init 5 cycles.

xinit is not sufficient. Could you try gnome-session as well?
Comment 5 ye.tian 2015-03-24 02:30:23 UTC
The first bad commit[1] is 
c9f038a1a5924352ab8e510e4a45ac57b08db391 
Author: Matt Roper <matthew.d.roper@intel.com>
Date:   Mon Mar 9 11:06:02 2015 -0700

    drm/i915: Don't assume primary & cursor are always on for wm calculation (v4)

[1] There are 2 ways to reproduce the issue: "kill gnome-session twice" and "change game resolution". Using "change game resolution", we could verify the first bad commit.
Comment 6 Jesse Barnes 2015-03-26 01:06:39 UTC
Matt can you take a look?
Comment 7 wendy.wang 2015-03-27 01:32:25 UTC
This bug shows the same bisected first bad commit as Bug 89731 - [SKL bisected regression] system hangs when restarting X: https://bugs.freedesktop.org/show_bug.cgi?id=89731
Comment 8 Matt Roper 2015-04-01 22:48:26 UTC
(In reply to Gordon Jin from comment #4)
> (In reply to Damien Lespiau from comment #3)
> > I can't reproduce this one by starting X and killing it, nor by doing init
> > 3/init 5 cycles.
> 
> xinit is not sufficient. Could you try gnome-session as well?

To confirm, you can xinit and kill X as many times as you like without problem, but if you start a full Gnome session, kill it, start a second Gnome session, and then kill it again, you see a hang?

I don't have access to a SKL platform, so I'm a bit blind here, but can you try a few things (assuming my understanding above is correct and also that you still see this issue on the latest -nightly)?
 - Test with xinit, but make sure you move your mouse into and out of the xterm that usually gets started by default to ensure the mouse cursor has to appear/change before you kill X.  I want to rule out the framebuffer reference counting issues we had with universal cursors recently.
 - Start with xinit, but run xrandr to set varying display modes.  I.e., can we easily trigger this crash by just switching modes with nothing else going on?
Comment 9 Matt Roper 2015-04-01 22:58:46 UTC
Also, what kind of displays are you using?  There's a similar report in bug 89731 which indicates:

> IIRC this only happens on the SKL-Y which has a builtin eDP panel,
> but not on SKL-S hooked to my DP monitor.

Are you also using eDP here?

It's unclear to me whether my commit caused the problem here, or whether it just allowed us to get farther along and hit a different problem in existing code.  We probably need someone with a SKL platform to do a little debugging to narrow down the area the crash is happening in.
Comment 10 ye.tian 2015-04-02 06:19:17 UTC
(In reply to Matt Roper from comment #8)
> (In reply to Gordon Jin from comment #4)
> > (In reply to Damien Lespiau from comment #3)
> > > I can't reproduce this one by starting X and killing it, nor by doing init
> > > 3/init 5 cycles.
> > 
> > xinit is not sufficient. Could you try gnome-session as well?
> 
> To confirm, you can xinit and kill X as many times as you like without
> problem, but if you start a full Gnome session, kill it, start a second
> Gnome session, and then kill it again, you see a hang?
> 
 Yes, the system hangs. I can see a yellow light on the motherboard.

> I don't have access to a SKL platform, so I'm a bit blind here, but can you
> try a few things (assuming my understanding above is correct and also that
> you still see this issue on the latest -nightly)?

 Yes, this issue still occurs on the latest -nightly.

>  - Test with xinit, but make sure you move your mouse into and out of the
> xterm that usually gets started by default to ensure the mouse cursor has to
> appear/change before you kill X.  I want to rule out the framebuffer
> reference counting issues we had with universal cursors recently.

 Confirmed, I can move the mouse into and out of the xterm before killing X.

>  - Start with xinit, but run xrandr to set varying display modes.  I.e., can
> we easily trigger this crash by just switching modes with nothing else going
> on?

 The screen turns black when setting varying display modes, and then the system hangs when X is killed.
Comment 11 ye.tian 2015-04-02 06:36:05 UTC
(In reply to Matt Roper from comment #9)
> Also, what kind of displays are you using?  There's a similar report in bug
> 89731 which indicates:
> 
> > IIRC this only happens on the SKL-Y which has a builtin eDP panel,
> > but not on SKL-S hooked to my DP monitor.
> 
> Are you also using eDP here?

 Yes, I am using eDP here.
Comment 12 Daniel Vetter 2015-04-02 08:19:30 UTC
Can we just revert the skl part of this patch as an interim solution to unblock QA? Matt, can you please prepare a patch.
Comment 13 ye.tian 2015-04-02 08:27:24 UTC
(In reply to Matt Roper from comment #8)
> (In reply to Gordon Jin from comment #4)
> > (In reply to Damien Lespiau from comment #3)
> > > I can't reproduce this one by starting X and killing it, nor by doing init
> > > 3/init 5 cycles.
> > 
> > xinit is not sufficient. Could you try gnome-session as well?
> 
> To confirm, you can xinit and kill X as many times as you like without
> problem,
 Confirmed. I killed X more than 15 times and the system still ran without problems.
Comment 14 Matt Roper 2015-04-02 14:26:01 UTC
(In reply to Daniel Vetter from comment #12)
> Can we just revert the skl part of this patch as an interim solution to
> unblock QA? Matt, can you please prepare a patch.

Reverting the SKL part of my patch will just result in NULL-dereference and immediate crashes when killing X or doing anything else that causes the primary plane to be disabled, so I think that will be even more crippling than the current situation (at least today it sounds like you can bring X down and back up at least once before running into problems).
Comment 15 Daniel Vetter 2015-04-09 08:22:44 UTC
(In reply to Matt Roper from comment #14)
> (In reply to Daniel Vetter from comment #12)
> > Can we just revert the skl part of this patch as an interim solution to
> > unblock QA? Matt, can you please prepare a patch.
> 
> Reverting the SKL part of my patch will just result in NULL-dereference and
> immediate crashes when killing X or doing anything else that causes the
> primary plane to be disabled, so I think that will be even more crippling
> than the current situation (at least today it sounds like you can bring X
> down and back up at least once before running into problems).

Hm right. Can we do a functional revert instead like in the other wm code, i.e. assuming if state->fb == NULL that bpp == 4 and the primary plane spans the full screen? Horrible hacks I know, but that should at least get us out of this until someone can look at skl wm for real.
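
The "functional revert" suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual i915 watermark code: the struct and function names (sketch_fb, sketch_plane_state, sketch_wm_params, sketch_primary_wm_params) are hypothetical, and "bpp == 4" is read here as four bytes per pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real i915 structures. */
struct sketch_fb {
    uint32_t bytes_per_pixel;
};

struct sketch_plane_state {
    const struct sketch_fb *fb; /* NULL when the primary plane is disabled */
    uint32_t src_w;             /* plane width in pixels */
    uint32_t src_h;             /* plane height in pixels */
};

struct sketch_wm_params {
    uint32_t bytes_per_pixel;
    uint32_t horiz_pixels;
    uint32_t vert_pixels;
};

/* Assumption: "bpp == 4" in the comment above means 4 bytes per pixel. */
#define SKETCH_FALLBACK_BPP 4u

/*
 * Build watermark inputs for the primary plane. If the plane has no
 * framebuffer, do not dereference it; instead pretend a full-screen,
 * 4-byte-per-pixel plane is present, as the old code effectively assumed.
 */
static struct sketch_wm_params
sketch_primary_wm_params(const struct sketch_plane_state *state,
                         uint32_t mode_hdisplay, uint32_t mode_vdisplay)
{
    struct sketch_wm_params p;

    if (state->fb == NULL) {
        p.bytes_per_pixel = SKETCH_FALLBACK_BPP;
        p.horiz_pixels = mode_hdisplay;
        p.vert_pixels = mode_vdisplay;
    } else {
        p.bytes_per_pixel = state->fb->bytes_per_pixel;
        p.horiz_pixels = state->src_w;
        p.vert_pixels = state->src_h;
    }
    return p;
}
```

With a disabled plane (fb == NULL) and a 1920x1080 mode, the helper returns 4 bytes per pixel and 1920x1080 rather than crashing. The trade-off is that watermarks are computed for a plane that is not actually on, which is why it is described as a hack.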
Comment 16 ye.tian 2015-04-15 02:30:28 UTC
Tested on the latest nightly kernel (5ea91d) and latest mesa (cc5860e4); this issue no longer exists on SKL.
Comment 17 Timo Aaltonen 2015-04-15 05:43:47 UTC
Yes, current nightly works for me too, drm-intel-next-2015-04-10 doesn't. Is it the scaler stuff that fixed it?
Comment 18 Jari Tahvanainen 2016-10-05 05:30:06 UTC
Closing verified+fixed.

