Bug 89915 - [bdw execlists 4.0.6] GPU hang after S3
Summary: [bdw execlists 4.0.6] GPU hang after S3
Status: CLOSED DUPLICATE of bug 95019
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: highest normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 91252
Depends on:
Blocks:
 
Reported: 2015-04-06 08:59 UTC by Eugenio
Modified: 2016-04-20 16:23 UTC
CC List: 6 users

See Also:
i915 platform: BDW
i915 features: GEM/execlists, GPU hang


Attachments
GPU crash dump (2.81 MB, text/plain)
2015-04-06 08:59 UTC, Eugenio
dmesg (80.93 KB, text/plain)
2015-04-06 08:59 UTC, Eugenio

Description Eugenio 2015-04-06 08:59:10 UTC
Created attachment 114884
GPU crash dump

Hello,

I was running Ubuntu 14.04 with kernel 3.16 on an Asus UX303LN-R4281H (Optimus Intel + Nvidia; the Nvidia drivers are currently not installed). I updated to 4.0.0rc6 to solve a few issues unrelated to graphics (e.g. the touchpad). Since the update, resuming from suspend has problems: the system resumes, freezes for a few seconds, then unfreezes, and in the kernel log I find:

[   71.845023] [drm] stuck on render ring
[   71.845672] [drm] GPU HANG: ecode 8:0:0xfffffffe, in Xorg [1284], reason: Ring hung, action: reset
[   71.845674] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   71.845674] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   71.845675] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   71.845676] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   71.845676] [drm] GPU crash dump saved to /sys/class/drm/card0/error

After a couple minutes, kernel log is full of:

[drm:hsw_unclaimed_reg_detect.isra.10 [i915]] *ERROR* Unclaimed register detected. Please use the i915.mmio_debug=1 to debug this problem.

Note: I was using Chrome in the meantime, and two tabs were probably using acceleration (e.g. 3D Google Maps). I don't know if it is related, but some minutes later the system froze completely and did not even respond to SysRq. This never happened before the kernel upgrade.

I have now rebooted with i915.mmio_debug=1, then suspended and resumed: it froze again for a few seconds. Part of the output of dmesg | grep i915:

[   39.884700] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
...
[   71.852974] drm/i915: Resetting chip after gpu hang
[   77.848497] [drm:i915_set_reset_status.part.38 [i915]] *ERROR* gpu hanging too fast, banning!
[   77.854831] drm/i915: Resetting chip after gpu hang

Attaching GPU crash dump and dmesg after reboot.
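
For triaging logs like the excerpts above, the hang lines can be pulled out mechanically. The following is a minimal sketch in Python and is not part of the original report: the regular expression is modelled only on the "GPU HANG" line quoted above and may not match other kernel versions, and the names HANG_RE and find_gpu_hangs are hypothetical.

```python
import re

# Pattern modelled on the "[drm] GPU HANG" line quoted in this report.
# The exact message format is an assumption and may differ between kernels.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm\] GPU HANG: ecode (?P<ecode>\S+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], reason: (?P<reason>[^,]+), "
    r"action: (?P<action>\w+)"
)


def find_gpu_hangs(log_text):
    """Return one dict per GPU HANG line found in the given kernel log text."""
    hangs = []
    for line in log_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            entry = match.groupdict()
            entry["time"] = float(entry["time"])  # seconds since boot
            entry["pid"] = int(entry["pid"])
            hangs.append(entry)
    return hangs


# Two lines copied from the kernel log in this report.
SAMPLE = """\
[   71.845023] [drm] stuck on render ring
[   71.845672] [drm] GPU HANG: ecode 8:0:0xfffffffe, in Xorg [1284], reason: Ring hung, action: reset
"""

if __name__ == "__main__":
    for hang in find_gpu_hangs(SAMPLE):
        print(hang)
```

The same function could be fed the text of the attached dmesg file to list every hang together with the process that triggered it.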
Comment 1 Eugenio 2015-04-06 08:59:43 UTC
Created attachment 114885
dmesg
Comment 2 Chris Wilson 2015-04-06 09:01:58 UTC
Try i915.enable_execlists=0
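
A module parameter such as this one is normally passed on the kernel command line at boot. To confirm afterwards that it took effect, the value can be read back from sysfs. The sketch below is an illustration only and is not from the original comment: the path /sys/module/i915/parameters/enable_execlists is an assumption (it may be missing on some kernels or readable only by root), and the names PARAM_PATH and read_int_param are hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the parameter; it may be absent on some kernels
# or readable only by root.
PARAM_PATH = Path("/sys/module/i915/parameters/enable_execlists")


def read_int_param(path=PARAM_PATH):
    """Return the integer stored at path, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file, permission denied, or non-integer content.
        return None


if __name__ == "__main__":
    value = read_int_param()
    if value is None:
        print("enable_execlists not readable (module not loaded, or root required)")
    else:
        print(f"i915.enable_execlists = {value}")
```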
Comment 3 Eugenio 2015-04-09 16:07:11 UTC
Thanks for the suggestion. The error is not easily reproducible and shows up quite randomly after intense use of Chrome. However, I ran the laptop for 2 days with "i915.enable_execlists=0" and it did not show up. I then rebooted once without the option, and after one suspend I got the error again. So the option likely solved it.

After 2 days with "i915.enable_execlists=0" I found only this instead:

[ 8171.802098] PM: Entering mem sleep
[ 8171.802110] Suspending console(s) (use no_console_suspend to debug)
[ 8171.803545] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 8171.807718] sd 0:0:0:0: [sda] Stopping disk
[ 8172.835579] [drm:stop_ring [i915]] *ERROR* render ring : timed out trying to stop ring
[ 8173.267949] PM: suspend of devices complete after 1464.521 msecs

which seems to be related to the start of the suspend and did not cause any freeze or crash.
Also, I now have thousands of "Unclaimed register detected" messages. They are neither fatal nor problematic, just annoying because they fill the log. I am not sure whether they are related.
Comment 4 Michel Thierry 2015-07-03 09:15:35 UTC
Looking at the reported date, isn't this a duplicate of bug 89600?
Comment 5 Chris Wilson 2015-07-03 09:24:39 UTC
Possibly, but bug 89600 was only confirmed for BSW as the reporters there had working BDW.
Comment 6 Chris Wilson 2015-07-07 08:31:03 UTC
*** Bug 91252 has been marked as a duplicate of this bug. ***
Comment 7 Chris Wilson 2015-07-07 09:43:42 UTC
Michel Thierry:

Hi,

I still think it's the same problem fixed by Peter (http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=364aece01a2dd748fc36a1e8bf52ef639b0857bd).
The issue was a race between enabling the interrupts and completing
the first batchbuffer, that's probably why we only saw it in chv, but
it's the same code bdw uses.

v4.0.6 didn't get the fix,
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/drivers/gpu/drm/i915/i915_drv.c?id=v4.0.6

Only v4.0.7:
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/drivers/gpu/drm/i915/i915_drv.c?id=v4.0.7


This echoes Michel's comment 4 that it looks like bug 89600 (which at the time was thought not to affect bdw), but it should be easy enough for everyone to test whether this is now fixed in 4.0.7.
Comment 8 Jesse Barnes 2015-08-17 21:01:01 UTC
Assuming this is fixed then?  Please re-open if not...
Comment 9 Eugenio 2016-04-19 13:12:33 UTC
I clicked on reopen, but I think I created a new bug instead: 95019
Comment 10 yann 2016-04-20 16:23:45 UTC

*** This bug has been marked as a duplicate of bug 95019 ***

