Created attachment 79482 [details]
i915_error_state after the hung messages

When resuming a hibernated Linux 3.8 (suspend-to-disk), I get the following errors in dmesg:

[133824.298846] Restarting tasks ... done.
[133824.302138] video LNXVIDEO:00: Restoring backlight state
[133824.574336] [drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off
[133839.582069] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133839.582074] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[133847.567832] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133855.565593] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[-- a couple more --]
[133873.569751] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[133873.570591] [drm:i915_reset] *ERROR* Failed to reset chip.

Similar hung messages are to be found in Xorg.0.log, but I failed to capture it before rebooting.

This is a regression from 3.7. With the exact same userspace, resume-from-disk works on all 3.7 kernels I've tried, and it fails on all 3.8 kernels I've tried.

Steps to reproduce:
1) suspend-to-disk
2) resume-from-disk

I initially thought this was related to suspend-to-disk while an extra output was enabled, besides LVDS, and resuming while that output was not connected. That particular bug happened in 3.5 or 3.6, but disappeared in 3.7.
Hardware:
- SandyBridge (Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz)
- Dell Latitude E6420, BIOS version A05

00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Dell Device 0493
	Flags: bus master, fast devsel, latency 0, IRQ 43
	Memory at e1400000 (64-bit, non-prefetchable) [size=4M]
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at 4000 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 2
	Capabilities: [a4] PCI Advanced Features
	Kernel driver in use: i915

Kernel: Stock Fedora 17 kernels. Last one to show failure: kernel-3.8.11-100 (there's an updated 3.8.12, but I haven't tested)
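For anyone comparing logs from several resume attempts, the dmesg excerpt in the report can be summarized with a small script. This is only a sketch: the function name `summarize_hangs` and the `SAMPLE` constant are made up for illustration, the sample lines are copied from the report, and the regular expression assumes the `[timestamp] [drm:function] *ERROR* message` layout seen there.

```python
import re

# Assumed layout, taken from the report: "[133839.582069] [drm:func] *ERROR* message"
ERROR_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\[drm:(\w+)\]\s+\*ERROR\*\s+(.*)$")

# Sample lines copied from the dmesg output in the report.
SAMPLE = """\
[133824.574336] [drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off
[133839.582069] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133847.567832] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133855.565593] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133873.569751] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
"""


def summarize_hangs(dmesg_text):
    """Count hangcheck errors, list the gaps between them, and note a wedged GPU."""
    hang_times = []
    wedged = False
    for line in dmesg_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match is None:
            continue  # not a drm *ERROR* line
        timestamp = float(match.group(1))
        function = match.group(2)
        message = match.group(3)
        if function == "i915_hangcheck_hung":
            hang_times.append(timestamp)
        elif function == "i915_reset" and "wedged" in message:
            wedged = True
    # Seconds between consecutive hangcheck reports.
    intervals = [later - earlier for earlier, later in zip(hang_times, hang_times[1:])]
    return {"hang_count": len(hang_times), "intervals": intervals, "wedged": wedged}


if __name__ == "__main__":
    print(summarize_hangs(SAMPLE))
```

On a live system the input would be the saved output of `dmesg` rather than the embedded sample.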
PCI ID is 8086:0126
Hm, it seems to die on the very first command it reads, and the CS seems to read complete garbage. No idea how that one happened, so a bisect sounds useful. Also, if you can please rehang your machine and grab another error_state, just to check whether this is the right pattern.
In v3.8, there are a few scary patches that tried to speed up resume, and in particular started to diverge the suspend/hibernate paths - such as

commit 1abd02e2dd7e0bd577000301fb2fd47780637387
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Fri Nov 2 11:14:02 2012 -0700

    drm/i915: don't rewrite the GTT on resume v4

Equally there are quite a few new TLB invalidate w/a in v3.8 which might be missed for the hibernate path. If you have time for a bisect, that would be very useful. I've been successfully hibernating my own f18 SNB - but that has been on 3.9 for some time.
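For readers unfamiliar with the bisect being requested: it is a binary search over the commits between a known-good kernel (here 3.7) and a known-bad one (here 3.8), testing hibernate/resume at each step. The sketch below only illustrates the idea. The function `first_bad`, the commit labels, and the `is_bad` callback are hypothetical, and it assumes every commit after the first bad one is also bad. In practice `git bisect` does this bookkeeping and the "test" is a manual kernel build plus a resume attempt.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of tests run).

    commits is ordered oldest to newest; commits[0] is assumed known good
    and commits[-1] known bad, as when bisecting between two releases.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad], tests


if __name__ == "__main__":
    # Hypothetical history: 16 commits, regression introduced at "c11".
    history = ["c%d" % i for i in range(16)]
    culprit, tests = first_bad(history, lambda c: int(c[1:]) >= 11)
    print(culprit, tests)
```

The number of tests grows with the logarithm of the commit count, which is why a bisect across a whole release cycle usually needs only a dozen or so builds.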
I do not have the time or skill to bisect changes to the kernel. I will provide a new error_state dump during this week.
Created attachment 80247 [details]
New error_state

Sorry for the delay. Here's the new error state. Steps taken to reproduce:
1. boot kernel 3.8.13-100.fc17 (Fedora 17 latest)
2. suspend to disk
3. resume from disk
4. wait until driver claims the GPU is wedged
Here the seqno for the render ring never made it to the HWS, which is more believable as an error occurring after resume.
Can you please try with this patch: https://patchwork.kernel.org/patch/2707341/ as it claims to fix some instability with rc6 on SandyBridge?
(In reply to comment #7)
> Can you please try with this patch:
> https://patchwork.kernel.org/patch/2707341/ as it claims to fix some
> instability with rc6 on SandyBridge?

Chris, I'll be happy to try it, but I don't see how it could have any bearing on the problem at hand. RC6 is disabled on my machine via a kernel boot option (i915.i915_enable_rc6=0) because of the instability. I'll be quite happy to re-enable RC6, but a patch tuning RC6 can't really fix the resume-from-disk problem when RC6 isn't even enabled. Can it?
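To confirm which value of the boot option mentioned above is actually in effect, the kernel command line can be inspected. A minimal sketch follows, assuming the command line is available as a string (on Linux it can be read from /proc/cmdline). The helper name `boot_option` and the example command line are hypothetical, and the sketch does not handle quoted values or the kernel's treatment of dashes and underscores as equivalent.

```python
def boot_option(cmdline, name):
    """Return the value given for `name` on a kernel command line, or None.

    If the option appears more than once, the last occurrence is returned
    (an assumption made for this sketch).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


if __name__ == "__main__":
    # Hypothetical command line for illustration only.
    example = "ro root=/dev/sda1 quiet i915.i915_enable_rc6=0"
    print(boot_option(example, "i915.i915_enable_rc6"))
```

A result of "0" matches the dmesg line "Enabling RC6 states: RC6 off, RC6p off, RC6pp off" from the original report.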
My fault, mass bugzilla request. I intended to skip this one and thought I had selected an older rc6 related snb resume issue.
Reopening then
Do you guys need any new info?
Update: still happening on Fedora kernel 3.9.8-100 (fc17.x86_64)
Created attachment 83729 [details] [review] Invalidate ring TLBs
We would like to get testing feedback on the patch to see if it is as good as it claims. :)
(In reply to comment #14)
> We would like to get testing feedback on the patch to see if it is as good
> as it claims. :)

I'll try. So far I'm running into kernel build issues.

extracting debug info from /root/rpmbuild/BUILDROOT/kernel-3.9.10-100+i915fix.fc17.x86_64/lib/modules/3.9.10-100+i915fix.fc17.x86_64.debug/extra/fs/fuse/cuse.ko
First resume with the patch, no problem. Hibernated with two connections, resumed with the same two. Will continue testing.
Hibernate with two, resume with one - OK
After a couple of resumes from hibernate, it seems the problem is gone. Please submit the patch upstream :-)
(In reply to comment #18)
> After a couple of resumes from hibernate, it seems the problem is gone.
> Please submit the patch upstream :-)

As you wish ;-)

Fix pushed to -fixes with cc: stable:

commit 2156b612b7589caf39eded17545fa6b77e072f10
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 6 19:01:14 2013 +0100

    drm/i915: Invalidate TLBs for the rings after a reset