64725 – [snb regression] Hung GPU when resuming from suspend-to-disk on kernel 3.8

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	x86-64 (AMD64) All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

Reported:	2013-05-17 19:51 UTC by Thiago Macieira
Modified:	2017-07-24 22:58 UTC
CC List:	0 users

Attachments
i915_error_state after the hung messages (2.06 MB, text/plain) 2013-05-17 19:51 UTC, Thiago Macieira
New error_state (2.05 MB, application/octet-stream) 2013-06-03 20:30 UTC, Thiago Macieira
Invalidate ring TLBs (2.42 KB, patch) 2013-08-06 18:01 UTC, Chris Wilson

Description Thiago Macieira 2013-05-17 19:51:15 UTC

Created attachment 79482
i915_error_state after the hung messages

When resuming a hibernated Linux 3.8 (suspend-to-disk), I get the following errors in dmesg:

[133824.298846] Restarting tasks ... done.
[133824.302138] video LNXVIDEO:00: Restoring backlight state
[133824.574336] [drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off
[133839.582069] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133839.582074] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[133847.567832] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133855.565593] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[-- a couple more --]
[133873.569751] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[133873.570591] [drm:i915_reset] *ERROR* Failed to reset chip.
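
For triaging logs like the excerpt above, here is a minimal Python sketch that counts the hangcheck reports and notes whether the driver gave up. The function name `summarize_hangs` and the embedded sample are illustrative assumptions, not part of any i915 tooling; the matching relies only on the message text shown in this report.

```python
import re

# Matches kernel log lines such as
# "[133839.582069] [drm:i915_hangcheck_hung] *ERROR* ...".
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")


def summarize_hangs(dmesg_text):
    """Return hangcheck timestamps and whether the GPU was declared wedged."""
    hang_times = []
    wedged = False
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips placeholders such as "[-- a couple more --]"
        msg = m.group("msg")
        if "i915_hangcheck_hung" in msg and "GPU hung" in msg:
            hang_times.append(float(m.group("ts")))
        elif "declaring wedged" in msg:
            wedged = True
    return {"hang_times": hang_times, "count": len(hang_times), "wedged": wedged}


# Sample copied from the dmesg excerpt in this report.
SAMPLE = """\
[133824.574336] [drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off
[133839.582069] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133839.582074] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[133847.567832] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133855.565593] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[-- a couple more --]
[133873.569751] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
"""

if __name__ == "__main__":
    print(summarize_hangs(SAMPLE))
```

In practice the input would come from a saved `dmesg` capture rather than an embedded string.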

Similar hang messages appear in Xorg.0.log, but I failed to capture them before rebooting.

This is a regression from 3.7. With the exact same userspace, resume-from-disk works on all 3.7 kernels I've tried, and it fails on all 3.8 kernels I've tried.

Steps to reproduce:
1) suspend-to-disk
2) resume-from-disk

I initially thought this was related to suspend-to-disk while an extra output was enabled, besides LVDS, and resuming while that output was not connected. That particular bug happened in 3.5 or 3.6, but disappeared in 3.7.

Hardware:
 - SandyBridge (Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz)
 - Dell Latitude E6420, BIOS version A05
 00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 0493
        Flags: bus master, fast devsel, latency 0, IRQ 43
        Memory at e1400000 (64-bit, non-prefetchable) [size=4M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 4000 [size=64]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [d0] Power Management version 2
        Capabilities: [a4] PCI Advanced Features
        Kernel driver in use: i915

Kernel:
 Stock Fedora 17 kernels.
 Last one to show failure: kernel-3.8.11-100
 (there's an updated 3.8.12, but I haven't tested)

Comment 1 Thiago Macieira 2013-05-17 19:53:04 UTC

PCI ID is 8086:0126

Comment 2 Daniel Vetter 2013-05-20 18:41:53 UTC

Hm, it seems to die on the very first command it reads, and the CS (command streamer) seems to read complete garbage. No idea how that happened, so a bisect sounds useful. Also, if you can, please rehang your machine and grab another error_state, just to check whether this is the right pattern.

Comment 3 Chris Wilson 2013-05-21 12:53:05 UTC

In v3.8, there are a few scary patches that tried to speed up resume, and in particular started to diverge the suspend/hibernate paths - such as

commit 1abd02e2dd7e0bd577000301fb2fd47780637387
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Fri Nov 2 11:14:02 2012 -0700

    drm/i915: don't rewrite the GTT on resume v4

Equally, there are quite a few new TLB-invalidate workarounds (w/a) in v3.8 which might be missed on the hibernate path. If you have time for a bisect, that would be very useful. I've been successfully hibernating my own f18 SNB, but that has been on 3.9 for some time.
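
To illustrate what the requested bisect does, here is a minimal Python sketch of the underlying binary search. It assumes a linear history with a known-good start (standing in for v3.7), a known-bad end (standing in for v3.8), and a single good-to-bad transition; the commit names and the `hangs_on_resume` predicate are hypothetical. A real run would use `git bisect` on the kernel tree and a hibernate/resume cycle as the test at each step.

```python
def bisect_first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and the history flips from good to bad once.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at or before mid
        else:
            good = mid  # the regression is after mid
    return commits[bad]


if __name__ == "__main__":
    # Hypothetical history: c0 stands in for v3.7, c15 for v3.8.
    history = ["c%d" % i for i in range(16)]
    culprit_index = 11  # made up for the demonstration
    tests_run = []

    def hangs_on_resume(commit):
        # Stand-in for "build, boot, hibernate, resume, check dmesg".
        tests_run.append(commit)
        return history.index(commit) >= culprit_index

    print(bisect_first_bad(history, hangs_on_resume), len(tests_run))
```

The number of kernels to build and test grows with the logarithm of the number of commits between the two releases, which is why a bisect stays feasible even across a full kernel release.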

Comment 4 Thiago Macieira 2013-05-21 14:19:25 UTC

I do not have the time or skill to bisect changes to the kernel.

I will provide a new error_state dump during this week.

Comment 5 Thiago Macieira 2013-06-03 20:30:33 UTC

Created attachment 80247
New error_state

Sorry for the delay. Here's the new error state. Steps taken to reproduce:

1. boot kernel 3.8.13-100.fc17 (Fedora 17 latest)
2. suspend to disk
3. resume from disk
4. wait until driver claims the GPU is wedged

Comment 6 Chris Wilson 2013-06-03 20:41:12 UTC

Here the seqno for the render ring never made it to the HWS (hardware status page), which is more believable as an error occurring after resume.

Comment 7 Chris Wilson 2013-06-12 09:31:32 UTC

Can you please try with this patch: https://patchwork.kernel.org/patch/2707341/ as it claims to fix some instability with rc6 on SandyBridge?

Comment 8 Thiago Macieira 2013-06-12 15:14:12 UTC

(In reply to comment #7)
> Can you please try with this patch:
> https://patchwork.kernel.org/patch/2707341/ as it claims to fix some
> instability with rc6 on SandyBridge?

Chris, I'll be happy to try it, but I don't see how it could have any bearing on the problem at hand. RC6 is disabled on my machine via a kernel boot option (i915.i915_enable_rc6=0) because of the instability.

I'll be quite happy to re-enable RC6, but a patch tuning RC6 can't really fix the resume-from-disk problem when RC6 isn't even enabled. Can it?
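
To confirm that a boot option like the one above is actually in effect, the kernel command line can be inspected. Below is a minimal Python sketch; the helper names `i915_params` and `rc6_disabled` are illustrative assumptions, and the only option it looks for is the `i915.i915_enable_rc6=0` setting quoted in this comment.

```python
def i915_params(cmdline):
    """Collect i915.* options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("i915.") and "=" in token:
            key, value = token[len("i915."):].split("=", 1)
            params[key] = value
    return params


def rc6_disabled(cmdline):
    """True if the command line carries i915.i915_enable_rc6=0."""
    return i915_params(cmdline).get("i915_enable_rc6") == "0"


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the command line of the running kernel.
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""  # not on Linux, or /proc is unavailable
    print("i915 options:", i915_params(current))
    print("RC6 disabled via boot option:", rc6_disabled(current))
```

The "RC6 off, RC6p off, RC6pp off" line in the dmesg excerpt from the description is consistent with this option being set.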

Comment 9 Chris Wilson 2013-06-12 15:22:34 UTC

My fault, that was a mass Bugzilla request. I intended to skip this bug; I thought I had selected an older RC6-related SNB resume issue.

Comment 10 Thiago Macieira 2013-06-12 17:16:34 UTC

Reopening then

Comment 11 Thiago Macieira 2013-07-01 19:18:48 UTC

Do you guys need any new info?

Comment 12 Thiago Macieira 2013-07-15 07:13:34 UTC

Update: still happening on Fedora kernel 3.9.8-100 (fc17.x86_64)

Comment 13 Chris Wilson 2013-08-06 18:01:56 UTC

Created attachment 83729
Invalidate ring TLBs

Comment 14 Chris Wilson 2013-08-11 11:14:25 UTC

We would like to get testing feedback on the patch to see if it is as good as it claims. :)

Comment 15 Thiago Macieira 2013-08-11 18:34:29 UTC

(In reply to comment #14)
> We would like to get testing feedback on the patch to see if it is as good
> as it claims. :)

I'll try. So far I'm running into kernel build issues.

extracting debug info from /root/rpmbuild/BUILDROOT/kernel-3.9.10-100+i915fix.fc17.x86_64/lib/modules/3.9.10-100+i915fix.fc17.x86_64.debug/extra/fs/fuse/cuse.ko

Comment 16 Thiago Macieira 2013-08-16 00:39:47 UTC

First resume with the patch, no problem. Hibernated with two connections, resumed with the same two.

Will continue testing.

Comment 17 Thiago Macieira 2013-08-16 02:20:46 UTC

Hibernate with two, resume with one - OK

Comment 18 Thiago Macieira 2013-08-17 23:16:37 UTC

After a couple of resumes from hibernate, it seems the problem is gone. Please submit the patch upstream :-)

Comment 19 Daniel Vetter 2013-08-18 17:37:14 UTC

(In reply to comment #18)
> After a couple of resumes from hibernate, it seems the problem is gone.
> Please submit the patch upstream :-)

As you wish ;-) Fix pushed to -fixes with cc: stable:

commit 2156b612b7589caf39eded17545fa6b77e072f10
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 6 19:01:14 2013 +0100

    drm/i915: Invalidate TLBs for the rings after a reset
