Bug 74176

Summary: [snb] GPU hang IPEHR: 0x7a000002
Product: xorg
Reporter: Jan Alexander Steffens (heftig) <jan.steffens>
Component: Driver/intel
Assignee: Chris Wilson <chris>
Status: VERIFIED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs, ziktofel
Version: git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                  Flags
/sys/class/drm/card0/error   none
gdb log                      none

Description Jan Alexander Steffens (heftig) 2014-01-29 12:30:52 UTC
Created attachment 92991
/sys/class/drm/card0/error

Arch Linux x86_64
Thinkpad X220 (SNB)
Kernel: 3.13.0
Mesa: 10.0.2
Xorg: 1.15.0
xf86-video-intel: 2.99.907-62-g872468a

Using GNOME Shell and Firefox.

dmesg:
[ 8675.044652] [drm] stuck on render ring
[ 8675.044660] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 8675.044661] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 8675.044663] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 8675.044664] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 8675.044665] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 8675.047739] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x741000 ctx 0) at 0x742268

Error state attached.
Comment 1 Chris Wilson 2014-01-29 13:09:49 UTC
It appears that the binding table for the source is a stale value (and points before the start of the batch). This is impossible - so perhaps a use-after-free?

commit 7df3da10e744d7f168ea3f30b21c434f99beae17
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 29 13:06:08 2014 +0000

    sna/gen4+: Assert that the cached binding location is valid
    
    We can at least check that it is in the right region (i.e. not past
    where the current surface has been allocated from).
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=74176
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

That should catch this particular error, but I hope compiling with assertions enabled (--enable-debug) will detect the fault much earlier.
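
To illustrate the kind of check that commit message describes, here is a minimal sketch in Python rather than the driver's C. It assumes that surface state is carved downward from the end of the batch buffer, so a cached binding offset is only plausible if it lies between the current allocation mark and the end of the batch. The function and parameter names are hypothetical and are not taken from the driver source.

```python
def binding_offset_is_valid(offset, surface_low, batch_size):
    """Return True if a cached binding-table offset is plausible.

    Assumption (hypothetical model, not the driver's code): surface state
    is allocated downward from the end of the batch, so every live
    binding sits in [surface_low, batch_size).
    """
    return surface_low <= offset < batch_size


# A stale offset left over from an earlier batch can fall outside the
# region allocated so far; a debug build would assert on this condition.
cached = 800
if not binding_offset_is_valid(cached, surface_low=900, batch_size=1024):
    print("stale binding offset:", cached)
```

A check like this only catches offsets that land in the wrong region. A stale offset that happens to fall inside the valid range would still pass.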
Comment 2 Jan Alexander Steffens (heftig) 2014-01-29 13:14:24 UTC
Xorg is crashing a lot as well, now. Possibly related? I think the hanging and crashing started with a recent change to xf86-video-intel.

Xorg log:

[ 11820.893] (EE) Backtrace:
[ 11820.893] (EE) 0: /usr/bin/Xorg (xorg_backtrace+0x48) [0x5853a8]
[ 11820.893] (EE) 1: /usr/bin/Xorg (0x400000+0x189369) [0x589369]
[ 11820.893] (EE) 2: /usr/lib/libpthread.so.0 (0x7fb1e9138000+0xf870) [0x7fb1e9147870]
[ 11820.893] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7fb1e39d6000+0x289da) [0x7fb1e39fe9da]
[ 11820.893] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7fb1e39d6000+0xca5db) [0x7fb1e3aa05db]
[ 11820.893] (EE) 5: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7fb1e39d6000+0xcad96) [0x7fb1e3aa0d96]
[ 11820.893] (EE) 6: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7fb1e39d6000+0xce2e6) [0x7fb1e3aa42e6]
[ 11820.893] (EE) 7: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7fb1e39d6000+0x5b8b2) [0x7fb1e3a318b2]
[ 11820.893] (EE) 8: /usr/bin/Xorg (0x400000+0x105d31) [0x505d31]
[ 11820.893] (EE) 9: /usr/bin/Xorg (0x400000+0x35f8e) [0x435f8e]
[ 11820.893] (EE) 10: /usr/bin/Xorg (0x400000+0x39d9a) [0x439d9a]
[ 11820.893] (EE) 11: /usr/lib/libc.so.6 (__libc_start_main+0xf5) [0x7fb1e7dabb05]
[ 11820.893] (EE) 12: /usr/bin/Xorg (0x400000+0x2533e) [0x42533e]
[ 11820.893] (EE) 
[ 11820.893] (EE) Segmentation fault at address 0x18

The lines from intel_drv.so are, in order:
src/intel_list.h:161
src/sna/gen6_render.c:1072
src/sna/gen6_render.c:2989
src/sna/gen6_render.c:3114
src/sna/sna_composite.c:987

I'll also compile with debug, next.
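
For anyone repeating the lookup: each intel_drv.so frame in the backtrace has the form `(load base+offset) [absolute address]`, and the module-relative offset is what a symbolizer needs. The sketch below is a hypothetical helper, not part of Xorg or the driver. It extracts those offsets and checks that base plus offset equals the absolute address. The sample text is two frames copied from the log above.

```python
import re

# Matches frames like:
#   (EE) 3: /path/intel_drv.so (0x7fb1e39d6000+0x289da) [0x7fb1e39fe9da]
# Frames given as (symbol+0x48) do not match because the base must be hex.
FRAME_RE = re.compile(
    r"\(EE\)\s+(?P<index>\d+):\s+(?P<module>\S+)\s+"
    r"\((?P<base>0x[0-9a-f]+)\+(?P<offset>0x[0-9a-f]+)\)\s+"
    r"\[(?P<addr>0x[0-9a-f]+)\]"
)


def module_offsets(log_text, module_suffix):
    """Return (frame index, module-relative offset) for matching frames."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m is None or not m.group("module").endswith(module_suffix):
            continue
        base = int(m.group("base"), 16)
        offset = int(m.group("offset"), 16)
        addr = int(m.group("addr"), 16)
        if base + offset != addr:
            continue  # skip frames whose numbers are inconsistent
        frames.append((int(m.group("index")), offset))
    return frames


SAMPLE_LOG = """\
[ 11820.893] (EE) 0: /usr/bin/Xorg (xorg_backtrace+0x48) [0x5853a8]
[ 11820.893] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7fb1e39d6000+0x289da) [0x7fb1e39fe9da]
[ 11820.893] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0x7fb1e39d6000+0xca5db) [0x7fb1e3aa05db]
"""

for index, offset in module_offsets(SAMPLE_LOG, "intel_drv.so"):
    print(f"frame {index}: offset {offset:#x}")
```

With debug symbols available, each offset can then be passed to a symbolizer, for example `addr2line -e intel_drv.so 0x289da`. Which tool was actually used for the list above is not stated here.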
Comment 3 Chris Wilson 2014-01-29 13:21:11 UTC
Yes, that smells of the same bo use-after-free. :(
Comment 4 Jan Alexander Steffens (heftig) 2014-01-29 13:33:43 UTC
Created attachment 92996
gdb log

Caught an assertion.
Comment 5 Chris Wilson 2014-01-29 13:41:34 UTC
D'oh, that was silly.

commit d70620d9789da1cf983dac318d9ca9149f11ff20
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 29 13:39:20 2014 +0000

    sna: We can only retire a bo if is not referenced by the current batch
    
    Fixes regression from
    commit 8b0ebebcab21647348f769c25ca0c1d81d169e75
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Tue Jan 28 16:30:47 2014 +0000
    
        sna: Be a little more assertive in retiring after set-domain
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=74176
    Reported-by: Jan Alexander Steffens <jan.steffens@gmail.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

That should set everything back to normal...
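
The rule in that commit subject can be shown with a toy model. This is only a sketch of the invariant, not the driver's data structures: `Batch`, `can_retire`, and the integer handles are all hypothetical names. The point is that a buffer object the kernel reports as idle may still be referenced by the batch under construction, which has not been submitted yet. Retiring it at that moment would let it be freed or reused while the batch still points at it, which would match the use-after-free suspected in comment 1.

```python
class Batch:
    """Toy stand-in for the batch being built; tracks referenced handles."""

    def __init__(self):
        self.referenced = set()

    def add(self, handle):
        # Emitting a command that uses the buffer adds a reference.
        self.referenced.add(handle)

    def submit(self):
        # After submission the batch no longer holds pending references.
        self.referenced.clear()


def can_retire(handle, batch, kernel_reports_idle):
    """A bo may be retired only if it is idle AND not in the current batch."""
    return kernel_reports_idle and handle not in batch.referenced


batch = Batch()
batch.add(7)
print(can_retire(7, batch, kernel_reports_idle=True))  # still referenced
print(can_retire(8, batch, kernel_reports_idle=True))  # not referenced
```

In this model the missing membership test is the whole regression: without it, handle 7 would be treated as retirable while the unsubmitted batch still uses it.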
Comment 6 Jan Alexander Steffens (heftig) 2014-01-29 14:37:02 UTC
Yep, I think that did it.
Comment 7 Gordon Jin 2014-07-08 06:29:49 UTC
*** Bug 74186 has been marked as a duplicate of this bug. ***
