Bug 89600 - [BSW] GPU hangs after S4
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: highest major
Assignee: peter.antoine
QA Contact: Intel GFX Bugs mailing list
Reported: 2015-03-17 02:32 UTC by Jeff Zheng
Modified: 2017-10-06 14:30 UTC
CC List: 10 users

Attachments
dmesg log (303.57 KB, text/plain)
2015-03-17 02:32 UTC, Jeff Zheng
/sys/class/drm/card0/error (2.66 MB, text/plain)
2015-03-17 02:33 UTC, Jeff Zheng
dmesg after apply the patch (282.60 KB, text/plain)
2015-04-20 05:53 UTC, Jeff Zheng
dmesg info on BDW-Y after S3+S4 (123.40 KB, text/plain)
2015-04-27 08:37 UTC, ye.tian

Description Jeff Zheng 2015-03-17 02:32:05 UTC
Created attachment 114360 [details]
dmesg log

==System Environment==
--------------------------
Only eDP without external monitor.
BIOS: V59 
Regression: Not sure. Hard to test previous version because of bug 89005

Non-working platforms: BSW

==kernel==
--------------------------
-testing: drm-intel-testing-2015-03-13 (fails)

==Bug detailed description==
-----------------------------
I ran S3 and then S4, and dmesg showed a GPU hang.

==Reproduce steps==
---------------------------- 
1. Boot
2. echo mem > /sys/power/state and resume
3. echo disk > /sys/power/state and resume
4. dmesg shows:
[  289.878592] [drm] stuck on render ring
[  289.899951] [drm] GPU HANG: ecode 8:0:0xfffffffe, reason: Ring hung, action: reset
[  289.899959] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  289.899962] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  289.899965] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  289.899968] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  289.899972] [drm] GPU crash dump saved to /sys/class/drm/card0/error
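
The reproduce steps above can be sketched as a pair of shell helpers. This is only an illustration, not the script used for this report: the function names `count_gpu_hangs` and `s3_s4_cycle` are made up here, and the cycle helper needs root and really suspends and hibernates the machine.

```shell
# Count "GPU HANG" lines in kernel log text read from stdin.
# Prints 0 (and still returns success) when there is no match.
count_gpu_hangs() {
    grep -c 'GPU HANG' || true
}

# One S3 + S4 cycle as in steps 2 and 3 (hypothetical helper).
# Each write returns only after the machine has resumed.
s3_s4_cycle() {
    echo mem > /sys/power/state
    echo disk > /sys/power/state
}
```

A possible use after resuming is `dmesg | count_gpu_hangs`, which prints how many hang reports the kernel log currently contains.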
Comment 1 Jeff Zheng 2015-03-17 02:33:07 UTC
Created attachment 114361 [details]
/sys/class/drm/card0/error
Comment 2 Jeff Zheng 2015-03-31 08:00:59 UTC
Still exists on drm-intel-testing-2015-03-27.
Comment 3 Jeff Zheng 2015-04-13 03:04:24 UTC
Still exists on drm-intel-testing-2015-04-10.
Comment 4 Jeff Zheng 2015-04-14 01:38:16 UTC
glxgears looks fine after this issue appears.
Comment 5 David Weinehall 2015-04-14 14:53:50 UTC
I can reproduce this.  If I test again after the suspend/hibernate has failed it works though (with a similar work/fail/work/fail pattern).

Can you confirm a similar behaviour?
Comment 6 Jeff Zheng 2015-04-15 01:19:52 UTC
Yes, I saw similar behavior (5 GPU hangs after 10 suspend/resume cycles).
Comment 7 David Weinehall 2015-04-15 11:46:38 UTC
Could you try passing the enable_execlists=0 parameter to i915? This seems to "fix" (work around) the issue for me.
Comment 8 Jeff Zheng 2015-04-16 00:21:28 UTC
You're right. I don't see a GPU hang with i915.enable_execlists=0.
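
When comparing results, it helps to confirm which execlists setting the running kernel actually uses. A minimal sketch, assuming the parameter is exposed under `/sys/module/i915/parameters/` (it may be readable by root only, or missing on kernels without this parameter); the helper name `execlists_setting` is made up for illustration.

```shell
# Print the current i915 enable_execlists value, or "unknown" if it
# cannot be read. The path argument is optional and exists mainly so
# the helper can be pointed at a different file.
execlists_setting() {
    param_file="${1:-/sys/module/i915/parameters/enable_execlists}"
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        echo "unknown"
    fi
}
```

Calling `execlists_setting` without arguments after booting with `i915.enable_execlists=0` on the kernel command line should then report the value that was passed.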
Comment 9 Chris Harris 2015-04-17 15:37:26 UTC
Please could you check if https://patchwork.freedesktop.org/patch/47251/ fixes the problem.  Obviously make sure that you test with i915.enable_execlists=1
Comment 10 Jeff Zheng 2015-04-20 05:53:53 UTC
Created attachment 115204 [details]
dmesg after apply the patch

I still see a GPU hang after applying the patch to drm-intel-testing-2015-04-10.
Comment 11 David Weinehall 2015-04-20 08:26:59 UTC
Can you test if you can reproduce the bug with:

* enable_execlists=0 + only S3
* enable_execlists=0 + only S4

In my testing I can only reproduce the issue when S4 is involved; when execlists are disabled, S3 will not trigger any bugs (if execlists are used I can trivially trigger a similar, though not necessarily identical, issue by using just S3).
Comment 12 Jeff Zheng 2015-04-21 03:00:51 UTC
With i915.enable_execlists=0, I tried boot + 10 S3 cycles and boot + 10 S4 cycles, and I didn't see a GPU hang in either case.
Comment 13 peter.antoine 2015-04-21 11:56:54 UTC
Has anyone seen this on any platform other than BSW?
I'm currently on a BDW with 30+ cycles and no hangs; execlists are on.
Comment 14 David Weinehall 2015-04-22 15:14:55 UTC
So far my experience is that if I boot the system in single user mode I'm unable to reproduce this issue.  Booting in multi user mode and using X (in my case GNOME 3) during the test will however eventually trigger the GPU hangs.

Running a test set with only S3 involved seems to work fine.  S4 only seems to work fine too. I've only triggered this with the combination of S3 + S4 + X.

So my theory is that some state (that only matters when using X -- presumably the accelerated parts that GNOME 3 requires, but that's just a hunch) that both S3 and S4 mess with isn't handled properly by one of those two code paths.


I've tested both 61.4 and 66 and see similar results on both. I've yet to confirm whether this is specific to Braswell only.  If it is it could be a BIOS issue; I don't think we have any Braswell-specific S3/S4 kernel code (it could still be a kernel issue if Braswell introduces some new properties that we don't restore properly on resume/thaw).
Comment 15 Jeff Zheng 2015-04-23 01:49:05 UTC
Hi David,

With the default enable_execlists (which I believe is i915.enable_execlists=1), I can reproduce this issue without X on my test machine after several S3 cycles. My test machine is reworked, though.

You can ssh into x-bsw14 remotely if you want. The issue can also be reproduced with rtcwake.
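
For unattended runs, a loop around rtcwake could look like the sketch below. The helper name `suspend_cycles`, the `DRY_RUN` switch, and the 30-second wake delay are assumptions for illustration, not the settings used on x-bsw14; `-m` and `-s` are standard rtcwake options.

```shell
# Run N suspend/resume cycles via rtcwake, or only print the commands
# when DRY_RUN=1 is set. mode is "mem" for S3 or "disk" for S4.
suspend_cycles() {
    mode="$1"
    count="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "rtcwake -m $mode -s 30"
        else
            # Needs root; returns after the RTC alarm wakes the machine.
            rtcwake -m "$mode" -s 30 || return 1
        fi
        i=$((i + 1))
    done
}
```

For example, `DRY_RUN=1 suspend_cycles mem 10` prints the ten commands without suspending, which is a cheap way to check the loop before running it as root.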
Comment 16 David Weinehall 2015-04-23 08:04:09 UTC
(In reply to Jeff Zheng from comment #15)
> Hi David,
> 
> With the default enable_execlists (which I believe is
> i915.enable_execlists=1), I can reproduce this issue without X on my test
> machine after several S3 cycles. My test machine is reworked, though.
> 
> You can ssh into x-bsw14 remotely if you want. The issue can also be
> reproduced with rtcwake.

My tests were with enable_execlists=0; I'm perfectly aware that the enable_execlists=1 case is broken.
Comment 17 peter.antoine 2015-04-27 07:15:18 UTC
Seems to be a BSW issue.

With exactly the same kernel that ran on the BDW installed on the BSW, the first P3 gets a GPU hang.

Will see if I can find the cause now.
Comment 18 ye.tian 2015-04-27 07:34:21 UTC
Tested on BDW-Y with the testing kernel drm-intel-testing-2015-04-23. The problem also exists there, or when running S4 twice.
Comment 19 David Weinehall 2015-04-27 07:50:53 UTC
(In reply to ye.tian from comment #18)
> Tested on BDW-Y with the testing kernel drm-intel-testing-2015-04-23.
> The problem also exists there, or when running S4 twice.

OK, so this isn't a Braswell-specific issue, after all?
Comment 20 Jeff Zheng 2015-04-27 07:53:16 UTC
Hi Tian Ye, could you please also upload dmesg?
Comment 21 peter.antoine 2015-04-27 08:25:05 UTC
I was on the BDW-U (rvp).
Comment 22 ye.tian 2015-04-27 08:37:51 UTC
Created attachment 115366 [details]
dmesg info on BDW-Y after S3+S4

dmesg info on BDW-Y after S3+S4
Comment 23 ye.tian 2015-04-27 09:07:07 UTC
(In reply to peter.antoine from comment #21)
> I was on the BDW-U (rvp).

I am unable to reproduce this issue on the BDW-U.
Comment 24 David Weinehall 2015-04-27 09:49:14 UTC
When you reproduced it in BDW-Y, was that with execlists enabled or disabled?
Comment 25 peter.antoine 2015-04-27 14:15:47 UTC
This is a problem that has been seen before on other systems.
It is a timing issue with the way the registers are restored when the system comes out of power saving.

I have a patch that has survived 10 S3 and 10 S4 without a GPU hang.

The patch simply changes the order in which parts of the system are re-initialised after a resume.

Patch will be released to linux-gfx within the hour.
Comment 26 Chris Harris 2015-04-27 15:33:11 UTC
The possible fix from Peter is at:

https://patchwork.freedesktop.org/patch/48028/

Please could you try this patch and see if it fixes the problem?
Comment 27 Jeff Zheng 2015-04-28 01:08:06 UTC
(In reply to Chris Harris from comment #26)
> The possible fix from Peter is at:
> 
> https://patchwork.freedesktop.org/patch/48028/
> 
> Please could you try this patch and see if it fixes the problem?


On BSW, I applied the patch to drm-intel-testing-2015-04-23. I tried 10 S3 cycles and then S3+S4 three times, and could not reproduce this issue.

Without the patch, I can easily reproduce this issue.
Comment 28 ye.tian 2015-04-28 01:41:24 UTC
(In reply to David Weinehall from comment #24)
> When you reproduced it in BDW-Y, was that with execlists enabled or disabled?

With execlists enabled.

Tested on the latest nightly kernel with this patch; the problem no longer occurs.
Comment 29 David Weinehall 2015-04-28 14:09:35 UTC
I can confirm that the patch does the trick, even with execlists enabled.

@Peter: Nice work!
Comment 30 Jeff Zheng 2015-05-11 02:33:41 UTC
Still exists on drm-intel-testing-2015-05-08... Is the patch checked in?
Comment 31 David Weinehall 2015-05-11 07:30:02 UTC
No, the patch has not been merged yet.  The patch, while indeed fixing the issue, was deemed to be too "ugly".  Discussions are currently taking place on the intel-gfx mailing list as to what a better solution should look like.
Comment 32 peter.antoine 2015-05-11 08:08:33 UTC
The patch was too heavy-handed.
For now I have only included the part that addresses what is directly causing the issue. There are some other issues that will need to be fixed, but these are being scheduled for later.

New patches are on the mailing list.
Comment 33 vivekanandhan J 2015-05-12 07:02:16 UTC
Observing this with the stack below on BSW:

OS: Ubuntu 14.04.01. 64-bit
kernel: Eywa-4.0.0-rc7
Bios: BSW_SPI_Quad_R10_Config3_PreProduction_BRASWEL_X64_R_X068_01_ME-2.0.0.2060
KSC: 1.08

Software Stack:
===============

mesa - 10.6.0-devel ef5d4bcc3a21f1aa3e6a919c8888f26ec754707f
libdrm - 2.4.60 812e8fe6ce46d733c30207ee26c788c61f546294
libva - 0.37.1 9bfde38f19d81b7f33db8c4c8e80420c9e60429e
Xf86_video_intel - 5054e2271210a52bf88b0f12c35d687ce9e8210d
xserver - 1.15.1 b1029716e41e252f149b82124a149da180607c96
intel-driver - 37d1ee43a223766164ccc1de9079cac27c44e8f0
Comment 35 Jani Nikula 2015-05-12 08:33:53 UTC
Fixed by

commit 364aece01a2dd748fc36a1e8bf52ef639b0857bd
Author: Peter Antoine <peter.antoine@intel.com>
Date:   Mon May 11 08:50:45 2015 +0100

    drm/i915: Avoid GPU hang when coming out of s3 or s4

in drm-intel-fixes.
Comment 36 Jeff Zheng 2015-05-15 06:53:15 UTC
Tested on latest nightly 2015_05_15 and this issue is fixed.
Comment 37 vivekanandhan J 2015-05-15 09:19:22 UTC
The GPU hang issue still exists with the stack below:

Kernel: Eywa-4.1.0-rc3
commit id: 21cb3a48ab8b421aba19939151d7ad4cd8c6e531 
bios ver: 68.1
ksc: 1.08
Comment 38 Jani Nikula 2015-05-15 09:38:41 UTC
(In reply to vivekanandhan J from comment #37)
> The GPU hang issue still exists with the stack below:
> 
> Kernel: Eywa-4.1.0-rc3
> commit id: 21cb3a48ab8b421aba19939151d7ad4cd8c6e531 
> bios ver: 68.1
> ksc: 1.08

Well, do you have the commit referenced in comment #35 above?
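
One way to answer that question from inside a kernel git checkout is to search the history by the commit subject, since a backport or cherry-pick into another tree may carry a different hash than the one in drm-intel-fixes. A minimal sketch; the helper name `has_fix` is made up for illustration and the subject string is taken from comment #35.

```shell
# Succeed if the history of the given ref (default HEAD) contains a
# commit whose message includes the subject of the fix.
has_fix() {
    ref="${1:-HEAD}"
    [ -n "$(git log -1 --format=%H -F \
        --grep='drm/i915: Avoid GPU hang when coming out of s3 or s4' \
        "$ref" 2>/dev/null)" ]
}
```

Run from the top of the kernel tree, `has_fix && echo present || echo missing` gives a quick answer; a match on the subject alone does not prove the backport is identical to the original commit.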
Comment 39 appala 2015-10-09 12:40:08 UTC
<changing title>
GPU hang observed with s4 exit only
Comment 40 David Weinehall 2015-10-12 05:55:01 UTC
(In reply to appala from comment #39)
> <changing title>
> GPU hang observed with s4 exit only

If the observed behaviour doesn't match the reported behaviour of a pre-existing bug (which yours does not seem to do), then please file a new bug rather than re-using an existing bug.
Comment 41 appala 2015-10-12 09:21:20 UTC
Hi David,

We have filed a new bug for the GPU hang issue; the bug ID is 92435.
Comment 42 Elizabeth 2017-10-06 14:30:59 UTC
Closing old verified.