Bug 34014

Summary:

[i915gm] GPU lockup (ESR: 0x00000001 IPEHR: 0x02000004)

Product:

xorg

Reporter:

Bryce Harrington <bryce>

Component:

Driver/intel

Assignee:

Chris Wilson <chris>

Status:

RESOLVED FIXED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

major

Priority:

medium

CC:

chewi, daniel, davidcoggins1, elliot.orwells, ermonnezza, ranma+freedesktop

Version:

7.5 (2009.10)

Hardware:

x86 (IA32)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
i915_error_state.txt	none
BootDmesg.txt	none
CurrentDmesg.txt	none
XorgLog.txt	none
XorgLogOld.txt	none
i915 dump after s2mem (tried to recover from wedged gpu), but i915 claims it still can't reset the gpu	none
i915_error_state from #34948	none
i915_error_state from #35608	none
i915_error_state from #35647	none
i915_error_state from #36000	none
Use full-fence size for alignment on pre-G33	none

Description Bryce Harrington 2011-02-07 18:26:41 UTC

Forwarding this bug from Ubuntu reporter mkis62:
http://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/714719

[Problem]
GPU lockup (of the "Hangcheck timer elapsed" variety) on 2.6.38-2 kernel and 2.14.0 intel driver with i915gm hardware.  No compositor is running.

[Original Description]
X crashed while setting preferences in Decibel Audio Player
tty1-6 works ... rebooting...

From GPU dump:
ACTHD: 0xffffffff
EIR: 0x00000000
EMR: 0xffffffed
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x02000004
IPEIR: 0x00000000
INSTDONE: 0x03c7c081
    busy: IDCT
    busy: IQ
    busy: PR
    busy: VLD
    busy: Instruction parser
    busy: Windowizer
    busy: Intermediate Z
    busy: Perspective interpolation
    busy: Texture decompression
    busy: Sampler Cache
    busy: Filtering
    busy: Bypass FIFO
    busy: Pixel shader
    busy: Color calculator
    busy: Map L2

From dmesg:
[ 2026.252160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 2026.254795] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 402290 at 402288, next 402291)
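
The i915_do_wait_request line above carries three sequence numbers. A minimal sketch of pulling them out of a dmesg line follows, assuming the message format is exactly as logged here and reading the numbers as the awaited request, the last request the GPU reported, and the next request number to be issued (an interpretation of the message text, not something stated in this report). `struct wait_info`, `parse_wait_request` and `requests_outstanding` are hypothetical names for illustration, not part of the driver or of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one i915_do_wait_request line. */
struct wait_info {
    int ret;            /* return code; -11 corresponds to -EAGAIN on Linux */
    unsigned awaiting;  /* request number being waited for (assumed meaning) */
    unsigned at;        /* last request number the GPU reported (assumed meaning) */
    unsigned next;      /* next request number to be issued (assumed meaning) */
};

/* Returns 1 if the line matched the expected format, 0 otherwise. */
int parse_wait_request(const char *line, struct wait_info *out)
{
    /* Skip the timestamp and "[drm:...] *ERROR*" prefix. */
    const char *p = strstr(line, "i915_do_wait_request returns");
    if (p == NULL)
        return 0;
    return sscanf(p,
                  "i915_do_wait_request returns %d (awaiting %u at %u, next %u)",
                  &out->ret, &out->awaiting, &out->at, &out->next) == 4;
}

/* How many requests the GPU still had to complete for the waiter to proceed. */
unsigned requests_outstanding(const struct wait_info *w)
{
    return w->awaiting - w->at;
}
```

For the line logged above this gives awaiting 402290 and at 402288, i.e. the GPU stopped two requests short of the one being waited for.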


DistroRelease: Ubuntu 11.04
Package: xserver-xorg-video-intel 2:2.14.0-1ubuntu6
ProcVersionSignature: Ubuntu 2.6.38-2.29-generic 2.6.38-rc3
Uname: Linux 2.6.38-2-generic i686
Architecture: i386
Chipset: i915gm
CompisitorRunning: None
DRM.card0.LVDS.1:
 status: connected
 enabled: enabled
 dpms: On
 modes: 1024x768
 edid-base64:
DRM.card0.VGA.1:
 status: disconnected
 enabled: disabled
 dpms: Off
 modes:
 edid-base64:
Date: Mon Feb  7 18:50:19 2011
DistUpgraded: Yes, recently upgraded Log time: 2011-01-03 14:04:23.058239
DistroCodename: natty
DistroVariant: ubuntu
DumpSignature: 82856c05
ExecutablePath: /usr/share/apport/apport-gpu-error-intel.py
GconfCompiz:

GraphicsCard:
 Subsystem: Acer Incorporated [ALI] Device [1025:006a]
   Subsystem: Acer Incorporated [ALI] Device [1025:006a]
InterpreterPath: /usr/bin/python2.7
MachineType: Acer TravelMate 2410
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcCmdline: /usr/bin/python /usr/share/apport/apport-gpu-error-intel.py
ProcEnviron:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.38-2-generic root=UUID=263aecd1-0156-49f9-8d5e-99e8079b240f ro gfxpayload=true quiet splash vt.handoff=7
ProcKernelCmdLine_: BOOT_IMAGE=/boot/vmlinuz-2.6.38-2-generic root=UUID=263aecd1-0156-49f9-8d5e-99e8079b240f ro gfxpayload=true quiet splash vt.handoff=7
RelatedPackageVersions:
 xserver-xorg 1:7.6~3ubuntu3
 libdrm2 2.4.23-1ubuntu3
 xserver-xorg-video-intel 2:2.14.0-1ubuntu6
Renderer: Hardware acceleration
SourcePackage: xserver-xorg-video-intel
Title: [i915gm] GPU lockup 82856c05
UserGroups:

dmi.bios.date: 02/07/2006
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: V1.09
dmi.board.name: Morar
dmi.board.vendor: Acer
dmi.board.version: Rev
dmi.chassis.asset.tag: None
dmi.chassis.type: 10
dmi.chassis.vendor: Acer
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvrV1.09:bd02/07/2006:svnAcer:pnTravelMate2410:pvr0100:rvnAcer:rnMorar:rvrRev:cvnAcer:ct10:cvrN/A:
dmi.product.name: TravelMate 2410
dmi.product.version: 0100
dmi.sys.vendor: Acer
version.compiz: compiz N/A
version.libdrm2: libdrm2 2.4.23-1ubuntu3
version.libgl1-mesa-glx: libgl1-mesa-glx 7.10-1ubuntu1
version.xserver-xorg: xserver-xorg 1:7.6~3ubuntu3
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.14.0-1ubuntu6

Comment 1 Bryce Harrington 2011-02-07 18:27:09 UTC

Created attachment 43065 [details]
i915_error_state.txt

Comment 2 Bryce Harrington 2011-02-07 18:28:40 UTC

Created attachment 43066 [details]
BootDmesg.txt

Comment 3 Bryce Harrington 2011-02-07 18:29:11 UTC

Created attachment 43067 [details]
CurrentDmesg.txt

Comment 4 Bryce Harrington 2011-02-07 18:29:45 UTC

Created attachment 43068 [details]
XorgLog.txt

Comment 5 Bryce Harrington 2011-02-07 18:30:08 UTC

Created attachment 43069 [details]
XorgLogOld.txt

Comment 6 Bryce Harrington 2011-02-07 18:30:37 UTC

This bugzilla won't let me attach the gpu dump, but here's a permalink to it:

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/714719/+attachment/1836510/+files/IntelGpuDump.txt

Comment 7 Chris Wilson 2011-02-08 01:40:02 UTC

*** Bug 34015 has been marked as a duplicate of this bug. ***

Comment 8 Chris Wilson 2011-02-08 01:42:02 UTC

This patch would confirm my hypothesis that this is an invalid unfenced alignment:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f136899..c970b81 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1416,6 +1416,7 @@ i915_gem_get_unfenced_gtt_alignment(struct drm_i915_gem_ob
            obj->tiling_mode == I915_TILING_NONE)
                return 4096;
 
+       return i915_gem_get_gtt_size(obj);
        /*
         * Older chips need unfenced tiled buffers to be aligned to the left
         * edge of an even tile row (where tile rows are counted as if the bo is
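
The one-line change above makes the unfenced alignment of a tiled buffer equal to its full fence size. As a rough sketch of what that means, assume (from the driver source of that era, not from this report) that a pre-G33 fence region is the smallest power of two that covers the object, with a 1 MiB floor on gen3. The function names, the `GEN3_MIN_FENCE` constant and the simplified parameters below are illustrative only and are not the kernel's interface.

```c
#include <stdint.h>

#define PAGE_ALIGNMENT 4096u
#define GEN3_MIN_FENCE (1024u * 1024u)  /* assumed minimum fence size on gen3 */

/* Smallest power-of-two region, at least GEN3_MIN_FENCE, that covers the object.
 * A 64-bit result keeps the doubling loop from overflowing for any 32-bit size. */
uint64_t fence_size_gen3(uint32_t obj_size)
{
    uint64_t size = GEN3_MIN_FENCE;
    while (size < obj_size)
        size <<= 1;
    return size;
}

/* Alignment with the test patch applied: untiled buffers need only page
 * alignment, tiled buffers are aligned to their full fence size. */
uint64_t unfenced_alignment_gen3(uint32_t obj_size, int tiled)
{
    if (!tiled)
        return PAGE_ALIGNMENT;
    return fence_size_gen3(obj_size);
}
```

Under these assumptions a 3 MiB tiled buffer would be aligned to 4 MiB, while the same buffer untiled needs only 4096-byte alignment.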

Comment 9 Bryce Harrington 2011-02-22 15:29:38 UTC

We packaged this patch into a kernel for the bug reporter to test:

   http://people.canonical.com/~apw/lp714719-natty/

We have not heard back from him in a couple of weeks.

However, we asked other bug reporters with vaguely similar lockups to test as well, and this past weekend one of them tested it and provided the following dmesg after reproducing a lockup.

   https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/718767/+attachment/1861287/+files/dmesg.txt

Comment 10 Tobias Diedrich 2011-03-06 14:12:21 UTC

Hmm, I think I'm seeing this too on my X41T:

Recently upgraded Debian and kernel and got gpu hangs again.
I upgraded to latest libdrm2 and xf86-video-intel, but still getting gpu hangs.
Especially chrome seems to have a knack for causing these (aggressive use of acceleration features I guess).

Linux navi 2.6.38-rc7 #64 PREEMPT Sun Mar 6 14:32:50 CET 2011 i686 GNU/Linux

ii  libdrm2        2.4.24-1       Userspace interface to kernel DRM services -
ii  xserver-xorg-v 2:2.14.901-1   X.Org X server -- Intel i8xx, i9xx display d

(Both built myself from newest upstream packages released last week).


intel_gpu_dump:
ACTHD: 0xffffffff
EIR: 0x00000000
EMR: 0xffffffed
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x02000004
IPEIR: 0x00000000
INSTDONE: 0x038ff8c1
    busy: IDCT
    busy: IQ
    busy: PR
    busy: VLD
    busy: Instruction parser
    busy: Setup engine
    busy: Windowizer
    busy: Intermediate Z
    busy: Bypass FIFO
    busy: Pixel shader
    busy: Color calculator
Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:
(copy&paste from terminal, forgot to redirect into file before resetting the gpu with a suspend-resume cycle).

dmesg:
[29103.032023] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[29103.032023] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 1775973 at 1775971, next 1775974)
[29103.032023] [drm:i915_reset] *ERROR* Failed to reset chip.



00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)

00:02.0 0300: 8086:2592 (rev 03)
00:02.1 0380: 8086:2792 (rev 03)


Vendor: 0x8086, Device: 0x2592, Revision: 0x03 (B1/C0)

Comment 11 Tobias Diedrich 2011-03-06 14:18:49 UTC

BTW, while a suspend-resume should reset the gpu, I see this:

[31055.564022] [drm] Manually setting wedged to 0
[31055.564022] [drm:i915_reset] *ERROR* Failed to reset chip.
Why does it fail?
The units are not busy anymore according to intel_gpu_top, so I'd expect "echo 0 > /sys/kernel/debug/dri/0/i915_wedged" should unwedge it, but it doesn't

Comment 12 Tobias Diedrich 2011-03-06 14:21:19 UTC

Created attachment 44183 [details]
i915 dump after s2mem (tried to recover from wedged gpu), but i915 claims it still can't reset the gpu

Comment 13 Chris Wilson 2011-03-08 02:59:29 UTC

(In reply to comment #11)
> BTW, while a suspend-resume should reset the gpu, I see this:
> 
> [31055.564022] [drm] Manually setting wedged to 0
> [31055.564022] [drm:i915_reset] *ERROR* Failed to reset chip.
> Why does it fail?

It fails because we have not found the means to successfully reset that chipset yet. It may well be the only way is to power cycle the PCI device. Meh.

> The units are not busy anymore according to intel_gpu_top, so I'd expect "echo
> 0 > /sys/kernel/debug/dri/0/i915_wedged" should unwedge it, but it doesn't

The units are idle because the chip hit a fatal error and disabled those units.

Comment 14 Tobias Diedrich 2011-03-13 12:40:32 UTC

(In reply to comment #13)
> (In reply to comment #11)
> > BTW, while a suspend-resume should reset the gpu, I see this:
> > 
> > [31055.564022] [drm] Manually setting wedged to 0
> > [31055.564022] [drm:i915_reset] *ERROR* Failed to reset chip.
> > Why does it fail?
> 
> It fails because we have not found the means to successfully reset that chipset
> yet. It may well be the only way is to power cycle the PCI device. Meh.
> 
> > The units are not busy anymore according to intel_gpu_top, so I'd expect "echo
> > 0 > /sys/kernel/debug/dri/0/i915_wedged" should unwedge it, but it doesn't
> 
> The units are idle because the chip hit a fatal error and disabled those units.

I don't think so.  They are only idle after coming back out of suspend to ram, so I think it's probably because the GPU was power-cycled.
Both resume from disk and resume from ram have the same effect here.
I think it would be very helpful if KMS/DRM could recover from the GPU hang after suspend to ram or suspend to disk, when the GPU was power-cycled.  It used to be the case that 'echo 1 > i915_wedged' would restart the driver after resume, but it seems some internals have changed so that this no longer works.  If it would be able to recover in this case it would avoid the need to completely reboot the system to recover.

Comment 15 Chris Wilson 2011-03-15 01:53:40 UTC

*** Bug 34948 has been marked as a duplicate of this bug. ***

Comment 16 Chris Wilson 2011-03-15 01:55:18 UTC

Created attachment 44468 [details]
i915_error_state from #34948

Attaching another i915_error_state variant.

Comment 17 Chris Wilson 2011-03-20 04:05:05 UTC

Can you give drm-intel-staging, and in particular,

commit 0faba0d4e49361886b16c703995a3477951b14e5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 17 15:23:22 2011 +0000

    drm/i915: Fix tiling corruption from pipelined fencing
    
    ... even though it was disabled. A mistake in the handling of fence reuse
    caused us to skip the vital delay of waiting for the object to finish
    rendering before changing the register.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34584
    Cc: Andy Whitcroft <apw@canonical.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    [Note for 2.6.38-stable, we need to reintroduce the interruptible passing]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

a whirl?

Comment 18 Chris Wilson 2011-03-22 23:54:15 UTC

Working on the theory that it is one and the same bug:

commit b5b5ac2dec49ea5ae033434efa90863aa5cdfb2c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 17 15:23:22 2011 +0000

    drm/i915: Fix tiling corruption from pipelined fencing
    
    ... even though it was disabled. A mistake in the handling of fence reuse
    caused us to skip the vital delay of waiting for the object to finish
    rendering before changing the register.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34584
    Cc: Andy Whitcroft <apw@canonical.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    [Note for 2.6.38-stable, we need to reintroduce the interruptible passing]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Dave Airlie <airlied@linux.ie>

Comment 19 Bryce Harrington 2011-03-25 01:01:47 UTC

Original reporter tested a kernel with commit b5b5ac2d patched in and says he still sees the hang:

David Coggins wrote on 2011-03-20:
The system froze for me testing the latest natty 2.6.38-7.36 which should incorporate the fix for bug 717114

drm/i915: Fix tiling corruption from pipelined fencing

Mar 21 11:29:13 eee kernel: [ 0.000000] Linux version 2.6.38-7-generic (buildd@roseapple) (gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-6ubuntu4) ) #36-Ubuntu SMP Fri Mar 18 22:05:25 UTC 2011 (Ubuntu 2.6.38-7.36-generic 2.6.38)

Mar 21 11:47:30 eee kernel: [ 1115.992048] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 21 11:47:30 eee kernel: [ 1115.998408] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 110179 at 110177, next 110180)

Apport is not generating a problem popup when I next reboot at the moment.
A small amount of testing with the terminal does not show any of the corruption I was seeing 2 weeks ago (bug 717114).

Comment 20 Chris Wilson 2011-03-26 02:18:20 UTC

*** Bug 35608 has been marked as a duplicate of this bug. ***

Comment 21 Chris Wilson 2011-03-26 02:19:16 UTC

Created attachment 44880 [details]
i915_error_state from #35608

Comment 22 Chris Wilson 2011-03-26 02:22:22 UTC

*** Bug 35647 has been marked as a duplicate of this bug. ***

Comment 23 Chris Wilson 2011-03-26 02:23:21 UTC

Created attachment 44881 [details]
i915_error_state from #35647

Comment 24 Chris Wilson 2011-04-05 23:05:54 UTC

*** Bug 36000 has been marked as a duplicate of this bug. ***

Comment 25 Chris Wilson 2011-04-05 23:06:30 UTC

Created attachment 45335 [details]
i915_error_state from #36000

Comment 26 Knut Petersen 2011-04-12 00:01:22 UTC

I suspect that this bug is related to Bug 36147.

Please test whether reverting commit cc930a37612341a1f2457adb339523c215879d82 helps.

Comment 27 Chris Wilson 2011-04-12 01:10:46 UTC

Bryce, I'm confident that Knut identified the same issue, so disabling relaxed-fencing for the release should fix these as well. (I have lingering doubts since we tried the obvious kernel workarounds, but then again I think we may have a fundamental bug in our allocation, à la gen2.) Obviously, if I am wrong, let's open the bug again.


commit 686018f283f1d131073ef5917213e6a8ac013f26
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 12 08:23:04 2011 +0100

    Turn relaxed-fencing off by default for older (pre-G33) chipsets
    
    There are still too many unresolved bugs, typically GPU hangs, that are
    related to using relaxed fencing (i.e. only allocating the minimal
    amount of memory required for a buffer) on older hardware, so turn off
    the feature by default for the release.
    
    Reported-and-tested-by: Knut Petersen <Knut_Petersen@t-online.de>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=36147
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Comment 28 James Le Cuirot 2011-04-12 08:29:42 UTC

I can't look too deeply into it right now but it looks like this hasn't fixed it for me. The xf86-video-intel I built definitely included that commit and I was running 2.6.38.2.

Comment 29 Gordon Jin 2011-04-25 00:22:05 UTC

Reopening, though I'm not sure if Cuirot is the reporter.

Chris, if it does fix the issue, I'd suggest resolving this bug as a duplicate.

Comment 30 James Le Cuirot 2011-04-25 04:59:28 UTC

If we're going to use surnames, it's Le Cuirot please!

I'm not the reporter and I'm not 100% sure that my issue is the same, but it is very telling that all these similar bug reports sprang up around the same time.

I would do a bisect but it's my wife's laptop and I haven't found a quick way to reproduce the issue. It usually occurs around 15 minutes into using Chromium. If someone could suggest a reliable way to reproduce it (like a GPU stress tester?) then I'll give it a try.

Comment 31 James Le Cuirot 2011-05-25 04:57:54 UTC

Still happening on 2.6.39. :(

Comment 32 Chris Wilson 2011-07-08 03:16:04 UTC

Created attachment 48884 [details] [review]
Use full-fence size for alignment on pre-G33

The complication was a second bug that stopped the original patch from preventing misaligned buffers.

Comment 33 Chris Wilson 2011-07-18 07:43:57 UTC

Patch posted for inclusion.

Comment 34 Chris Wilson 2011-07-29 02:38:25 UTC

commit e28f87116503f796aba4fb27d81e2c3d81966174
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 18 13:11:49 2011 -0700

    drm/i915: Fix unfenced alignment on pre-G33 hardware
    
    Align unfenced buffers on older hardware to the power-of-two object
    size.  The docs suggest that it should be possible to align only to a
    power-of-two tile height, but using the already computed fence size is
    easier and always correct. We also have to make sure that we unbind
    misaligned buffers upon tiling changes.
    
    In order to prevent a repetition of this bug, we change the interface
    to the alignment computation routines to force the caller to provide
    the requested alignment and size of the GTT binding rather than assume
    the current values on the object.
    
    Reported-and-tested-by: Sitosfe Wheeler <sitsofe@yahoo.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=36326
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Keith Packard <keithp@keithp.com>
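
The commit message above combines two ideas: aligning unfenced buffers to the power-of-two fence size, and having the caller pass the size and tiling mode explicitly so that a tiling change is checked against the requested values rather than the object's current ones. A minimal sketch of that check follows. It is not the kernel code: `enum tiling`, `pot_fence_size`, `required_alignment` and `needs_unbind` are hypothetical names, and the 1 MiB floor is an assumption for gen3 hardware.

```c
#include <stdbool.h>
#include <stdint.h>

enum tiling { TILING_NONE, TILING_X, TILING_Y };

/* Power-of-two fence size covering `size` bytes (1 MiB floor assumed for gen3). */
uint64_t pot_fence_size(uint32_t size)
{
    uint64_t fence = 1024u * 1024u;
    while (fence < size)
        fence <<= 1;
    return fence;
}

/* The caller supplies the size and the requested tiling mode explicitly,
 * so nothing is read from the object's current state. */
uint64_t required_alignment(uint32_t size, enum tiling mode)
{
    return mode == TILING_NONE ? 4096u : pot_fence_size(size);
}

/* True if a buffer bound at gtt_offset is misaligned for new_mode and would
 * have to be unbound before the tiling change takes effect. */
bool needs_unbind(uint64_t gtt_offset, uint32_t size, enum tiling new_mode)
{
    uint64_t align = required_alignment(size, new_mode);
    return (gtt_offset & (align - 1)) != 0;
}
```

For example, a 2 MiB buffer bound at offset 0x300000 is acceptable while untiled, but under these assumptions it would have to be unbound before being made X-tiled, since 0x300000 is not a multiple of 2 MiB.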
