Summary: | [i915gm] GPU lockup (ESR: 0x00000001 IPEHR: 0x02000004) | | |
---|---|---|---|
Product: | xorg | Reporter: | Bryce Harrington <bryce> |
Component: | Driver/intel | Assignee: | Chris Wilson <chris> |
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> |
Severity: | major | | |
Priority: | medium | CC: | chewi, daniel, davidcoggins1, elliot.orwells, ermonnezza, ranma+freedesktop |
Version: | 7.5 (2009.10) | | |
Hardware: | x86 (IA32) | | |
OS: | Linux (All) | | |
Whiteboard: | | | |
i915 platform: | | i915 features: | |
Description
Bryce Harrington
2011-02-07 18:26:41 UTC
Created attachment 43065 [details]
i915_error_state.txt
Created attachment 43066 [details]
BootDmesg.txt
Created attachment 43067 [details]
CurrentDmesg.txt
Created attachment 43068 [details]
XorgLog.txt
Created attachment 43069 [details]
XorgLogOld.txt
This bugzilla won't let me attach the gpu dump, but here's a permalink to it:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/714719/+attachment/1836510/+files/IntelGpuDump.txt

*** Bug 34015 has been marked as a duplicate of this bug. ***

This patch would confirm my hypothesis that it is an invalid unfenced alignment:

    diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
    index f136899..c970b81 100644
    --- a/drivers/gpu/drm/i915/i915_gem.c
    +++ b/drivers/gpu/drm/i915/i915_gem.c
    @@ -1416,6 +1416,7 @@ i915_gem_get_unfenced_gtt_alignment(struct drm_i915_gem_ob
                 obj->tiling_mode == I915_TILING_NONE)
                     return 4096;
     
    +        return i915_gem_get_gtt_size(obj);
             /*
              * Older chips need unfenced tiled buffers to be aligned to the left
              * edge of an even tile row (where tile rows are counted as if the bo is

We packaged this patch into a kernel for the bug reporter to test:
http://people.canonical.com/~apw/lp714719-natty/

We have not heard back from him in a couple of weeks. However, we asked other bug reporters with vaguely similar lockups to test as well, and this past weekend one of them tested it and provided the following dmesg after reproducing a lockup:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/718767/+attachment/1861287/+files/dmesg.txt

Hmm, I think I'm seeing this too on my X41T:

Recently upgraded Debian and kernel and got gpu hangs again. I upgraded to latest libdrm2 and xf86-video-intel, but still getting gpu hangs. Especially chrome seems to have a knack for causing these (aggressive use of acceleration features I guess).

    Linux navi 2.6.38-rc7 #64 PREEMPT Sun Mar 6 14:32:50 CET 2011 i686 GNU/Linux
    ii libdrm2 2.4.24-1 Userspace interface to kernel DRM services -
    ii xserver-xorg-v 2:2.14.901-1 X.Org X server -- Intel i8xx, i9xx display d

(Both built myself from newest upstream packages released last week).

intel_gpu_dump:

    ACTHD: 0xffffffff
    EIR: 0x00000000
    EMR: 0xffffffed
    ESR: 0x00000001
    PGTBL_ER: 0x00000000
    IPEHR: 0x02000004
    IPEIR: 0x00000000
    INSTDONE: 0x038ff8c1
        busy: IDCT
        busy: IQ
        busy: PR
        busy: VLD
        busy: Instruction parser
        busy: Setup engine
        busy: Windowizer
        busy: Intermediate Z
        busy: Bypass FIFO
        busy: Pixel shader
        busy: Color calculator
    Ringbuffer:
    Reminder: head pointer is GPU read, tail pointer is CPU write
    ringbuffer at 0x00000000:

(copy&paste from terminal, forgot to redirect into file before resetting the gpu with a suspend-resume cycle).

dmesg:

    [29103.032023] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
    [29103.032023] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 1775973 at 1775971, next 1775974)
    [29103.032023] [drm:i915_reset] *ERROR* Failed to reset chip.

    00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
    00:02.1 Display controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
    00:02.0 0300: 8086:2592 (rev 03)
    00:02.1 0380: 8086:2792 (rev 03)
    Vendor: 0x8086, Device: 0x2592, Revision: 0x03 (B1/C0)

BTW, while a suspend-resume should reset the gpu, I see this:

    [31055.564022] [drm] Manually setting wedged to 0
    [31055.564022] [drm:i915_reset] *ERROR* Failed to reset chip.

Why does it fail?
The units are not busy anymore according to intel_gpu_top, so I'd expect "echo 0 > /sys/kernel/debug/dri/0/i915_wedged" should unwedge it, but it doesn't.

Created attachment 44183 [details]
i915 dump after s2mem (tried to recover from wedged gpu), but i915 claims it still can't reset the gpu
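As an aside on reading the dump above: IPEHR holds the header of the command the parser was executing when the error was latched. The sketch below splits such a header into fields. It assumes the usual gen3 command layout (client in bits 31:29, MI opcode in bits 28:23, flags in the low bits) and that MI opcode 0x04 is MI_FLUSH with bit 2 meaning "no write flush", as I recall from the kernel's i915_reg.h; the helper names are made up for illustration and are not part of any driver or tool.

```c
#include <stdint.h>

/* Hypothetical helpers, not kernel API. Field layout is an assumption:
 * bits 31:29 = command client (0 = MI), bits 28:23 = MI opcode,
 * bits 22:0  = opcode-specific flags/length. */
static inline uint32_t cmd_client(uint32_t hdr)
{
    return hdr >> 29;
}

static inline uint32_t mi_opcode(uint32_t hdr)
{
    return (hdr >> 23) & 0x3f;
}

static inline uint32_t mi_flags(uint32_t hdr)
{
    return hdr & 0x7fffff;
}
```

Under those assumptions, IPEHR 0x02000004 would decode as client 0 (MI), opcode 0x04, flags 0x4, which reads as an MI_FLUSH with the no-write-flush bit set. That only says which command was in flight, not why the chip stopped.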
(In reply to comment #11)
> BTW, while a suspend-resume should reset the gpu, I see this:
>
> [31055.564022] [drm] Manually setting wedged to 0
> [31055.564022] [drm:i915_reset] *ERROR* Failed to reset chip.
> Why does it fail?

It fails because we have not found the means to successfully reset that chipset yet. It may well be the only way is to power cycle the PCI device. Meh.

> The units are not busy anymore according to intel_gpu_top, so I'd expect "echo
> 0 > /sys/kernel/debug/dri/0/i915_wedged" should unwedge it, but it doesn't

The units are idle because the chip hit a fatal error and disabled those units.

(In reply to comment #13)
> (In reply to comment #11)
> > BTW, while a suspend-resume should reset the gpu, I see this:
> >
> > [31055.564022] [drm] Manually setting wedged to 0
> > [31055.564022] [drm:i915_reset] *ERROR* Failed to reset chip.
> > Why does it fail?
>
> It fails because we have not found the means to successfully reset that chipset
> yet. It may well be the only way is to power cycle the PCI device. Meh.
>
> > The units are not busy anymore according to intel_gpu_top, so I'd expect "echo
> > 0 > /sys/kernel/debug/dri/0/i915_wedged" should unwedge it, but it doesn't
>
> The units are idle because the chip hit a fatal error and disabled those units.

I don't think so. They are only idle after coming back out of suspend to ram, so I think it's probably because the GPU was power-cycled. Both resume from disk and resume from ram have the same effect here.

I think it would be very helpful if KMS/DRM could recover from the GPU hang after suspend to ram or suspend to disk, when the GPU was power-cycled. It used to be the case that 'echo 1 > i915_wedged' would restart the driver after resume, but it seems some internals have changed so that this no longer works. If it were able to recover in this case, it would avoid the need to completely reboot the system to recover.

*** Bug 34948 has been marked as a duplicate of this bug. ***

Created attachment 44468 [details]
i915_error_state from #34948
Attaching another i915_error_state variant.
Can you give drm-intel-staging, and in particular,

    commit 0faba0d4e49361886b16c703995a3477951b14e5
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date: Thu Mar 17 15:23:22 2011 +0000

        drm/i915: Fix tiling corruption from pipelined fencing

        ... even though it was disabled. A mistake in the handling of fence reuse
        caused us to skip the vital delay of waiting for the object to finish
        rendering before changing the register.

        Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34584
        Cc: Andy Whitcroft <apw@canonical.com>
        Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
        Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
        [Note for 2.6.38-stable, we need to reintroduce the interruptible passing]
        Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

a whirl?

Working on the theory that it is one and the same bug:

    commit b5b5ac2dec49ea5ae033434efa90863aa5cdfb2c
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date: Thu Mar 17 15:23:22 2011 +0000

        drm/i915: Fix tiling corruption from pipelined fencing

        ... even though it was disabled. A mistake in the handling of fence reuse
        caused us to skip the vital delay of waiting for the object to finish
        rendering before changing the register.

        Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34584
        Cc: Andy Whitcroft <apw@canonical.com>
        Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
        Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
        [Note for 2.6.38-stable, we need to reintroduce the interruptible passing]
        Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
        Tested-by: Dave Airlie <airlied@linux.ie>

The original reporter tested a kernel with commit b5b5ac2d patched in and says he still sees the hang:

David Coggins wrote on 2011-03-20:

The system froze for me testing the latest natty 2.6.38-7.36 which should incorporate the fix for bug 717114
drm/i915: Fix tiling corruption from pipelined fencing

    Mar 21 11:29:13 eee kernel: [ 0.000000] Linux version 2.6.38-7-generic (buildd@roseapple) (gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-6ubuntu4) ) #36-Ubuntu SMP Fri Mar 18 22:05:25 UTC 2011 (Ubuntu 2.6.38-7.36-generic 2.6.38)
    Mar 21 11:47:30 eee kernel: [ 1115.992048] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
    Mar 21 11:47:30 eee kernel: [ 1115.998408] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 110179 at 110177, next 110180)

Apport is not generating a problem popup when I next reboot at the moment. A small amount of testing with the terminal does not show any of the corruption I was seeing 2 weeks ago (bug 717114).

*** Bug 35608 has been marked as a duplicate of this bug. ***

Created attachment 44880 [details]
i915_error_state from #35608
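The i915_do_wait_request lines quoted in this thread each carry three sequence numbers. My reading, which is an assumption based on how the driver printed this message at the time, is that "awaiting" is the request being waited for, "at" is the last sequence number the ring reported as completed, and "next" is the next one to be handed out; the -11 is -EAGAIN. A minimal sketch for pulling the numbers out of such a line follows. The struct and function names are hypothetical, for illustration only.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three seqnos in the log message. */
struct hang_seqnos {
    unsigned int awaiting; /* seqno the waiter needs */
    unsigned int at;       /* last seqno reported by the ring */
    unsigned int next;     /* next seqno to be allocated */
};

/* Returns 0 on success, -1 if the line does not contain the message. */
static int parse_wait_request(const char *line, struct hang_seqnos *out)
{
    const char *p = strstr(line, "(awaiting ");

    if (p == NULL)
        return -1;
    if (sscanf(p, "(awaiting %u at %u, next %u",
               &out->awaiting, &out->at, &out->next) != 3)
        return -1;
    return 0;
}
```

For both log excerpts in this report the difference between "awaiting" and "at" is 2, i.e. the GPU stopped two requests short of the one being waited on, if the interpretation above holds.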
*** Bug 35647 has been marked as a duplicate of this bug. ***

Created attachment 44881 [details]
i915_error_state from #35647
*** Bug 36000 has been marked as a duplicate of this bug. ***

Created attachment 45335 [details]
i915_error_state from #36000
I suspect that this bug is related to Bug 36147.

Test if reverting commit cc930a37612341a1f2457adb339523c215879d82 helps.

Bryce, I'm confident that Knut identified the same issue and so disabling relaxed-fencing for the release should fix these as well. (I have lingering doubts since we tried the obvious kernel workarounds, but then again I think we may have a fundamental bug in our allocation ala gen2.) Obviously, if I am wrong, let's open the bug again.

    commit 686018f283f1d131073ef5917213e6a8ac013f26
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date: Tue Apr 12 08:23:04 2011 +0100

        Turn relaxed-fencing off by default for older (pre-G33) chipsets

        There are still too many unresolved bugs, typically GPU hangs, that are
        related to using relaxed fencing (i.e. only allocating the minimal
        amount of memory required for a buffer) on older hardware, so turn off
        the feature by default for the release.

        Reported-and-tested-by: Knut Petersen <Knut_Petersen@t-online.de>
        Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=36147
        Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
        Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I can't look too deeply into it right now but it looks like this hasn't fixed it for me. The xf86-video-intel I built definitely included that commit and I was running 2.6.38.2.

Reopening, though I'm not sure if Cuirot is the reporter. Chris, if it does fix it, I'd suggest marking dup as resolution.

If we're going to use surnames, it's Le Cuirot please! I'm not the reporter and I'm not 100% sure that my issue is the same but it is very telling that all these similar bug reports sprang up around the same time. I would do a bisect but it's my wife's laptop and I haven't found a quick way to reproduce the issue. It usually occurs around 15 minutes into using Chromium. If someone could suggest a reliable way to reproduce it (like a GPU stress tester?) then I'll give it a try.

Still happening on 2.6.39. :(

Created attachment 48884 [details] [review]
Use full-fence size for alignment on pre-G33

The complication was a second bug that kept the original patch from preventing the misalignment of the buffers.

Patch posted for inclusion.

    commit e28f87116503f796aba4fb27d81e2c3d81966174
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date: Mon Jul 18 13:11:49 2011 -0700

        drm/i915: Fix unfenced alignment on pre-G33 hardware

        Align unfenced buffers on older hardware to the power-of-two object
        size. The docs suggest that it should be possible to align only to a
        power-of-two tile height, but using the already computed fence size is
        easier and always correct. We also have to make sure that we unbind
        misaligned buffers upon tiling changes.

        In order to prevent a repetition of this bug, we change the interface
        to the alignment computation routines to force the caller to provide
        the requested alignment and size of the GTT binding rather than assume
        the current values on the object.

        Reported-and-tested-by: Sitosfe Wheeler <sitsofe@yahoo.com>
        Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=36326
        Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
        Cc: stable@kernel.org
        Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
        Signed-off-by: Keith Packard <keithp@keithp.com>
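To make the final fix concrete, here is a standalone sketch of the alignment rule the commit message describes: on pre-G33 hardware a tiled but unfenced buffer is aligned to the power-of-two fence size rather than to a single page. This is not the kernel code. The function names and parameters are invented for illustration, and the minimum fence sizes (1 MiB on gen3, 512 KiB on gen2) are my recollection of the driver, so treat them as assumptions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the driver's fence-size computation:
 * smallest power-of-two region that covers the object, starting from an
 * assumed minimum of 1 MiB (gen3) or 512 KiB (gen2). Gen4+ and untiled
 * objects need no rounding. */
static uint64_t fence_size(int gen, uint64_t obj_size, int tiled)
{
    uint64_t size;

    if (gen >= 4 || !tiled)
        return obj_size;

    size = (gen == 3) ? 1024 * 1024 : 512 * 1024;
    while (size < obj_size)
        size <<= 1;
    return size;
}

/* Alignment for a buffer bound without a fence register. Page alignment
 * is enough on gen4+, on G33, and for untiled buffers; everything else
 * uses the full fence size, as in the fix described above. */
static uint64_t unfenced_alignment(int gen, int is_g33,
                                   uint64_t obj_size, int tiled)
{
    if (gen >= 4 || is_g33 || !tiled)
        return 4096;
    return fence_size(gen, obj_size, tiled);
}
```

For example, under these assumptions a 1,500,000-byte tiled buffer on the reporter's 915GM (gen3) would be aligned to 2 MiB, whereas the pre-fix code path could place it on a much smaller boundary.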