Bug 42141

Summary: [915GM] gpu hung garbage in batch buffer
Product: xorg
Reporter: Knut Petersen <Knut_Petersen>
Component: Driver/intel
Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: normal
Priority: medium
CC: eugeni
Version: git
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                 Flags
dmesg                       none
i915_error_state            none
Xorg log                    none
another i915_error_state    none
another xorg log            none

Description Knut Petersen 2011-10-23 21:33:44 UTC
Well, again a hung gpu.

dmesg:
[41621.520021] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[41621.520268] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[41621.523678] [drm:i915_reset] *ERROR* Failed to reset chip.

Xorg log:
[ 41621.655] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[ 41621.656] (EE) intel(0): When reporting this, please include i915_error_state from debugfs and the full dmesg.

Switching to the framebuffer console and back worked, but the display was completely distorted. Restarting Xorg without a system reboot worked.

cu,
 knut
Comment 1 Knut Petersen 2011-10-23 21:35:19 UTC
Created attachment 52665 [details]
dmesg
Comment 2 Knut Petersen 2011-10-23 21:37:39 UTC
Created attachment 52666 [details]
i915_error_state
Comment 3 Knut Petersen 2011-10-23 21:43:58 UTC
Created attachment 52667 [details]
Xorg log

Xorg: git tree, fetched and compiled on October 23, last intel driver commit a18f559961135fa288dda3b94207abb0b6d4d302

Kernel: linux 3.0.7

hardware: AOpen i915GMm-hfs with Pentium M Dothan 1.86 GHz cpu

os: opensuse 11.4
Comment 4 Chris Wilson 2011-10-24 00:33:41 UTC
batchbuffer at 0x09d43000:
0x09d43000:      0x00000000: MI_NOOP
0x09d43004:      0x00000000: MI_NOOP
0x09d43008:      0x00000000: MI_NOOP
0x09d4300c:      0x00000000: MI_NOOP
0x09d43010:      0x00000000: MI_NOOP
0x09d43014:      0x00000000: MI_NOOP
0x09d43018:      0x00000000: MI_NOOP
0x09d4301c:      0x00000000: MI_NOOP
0x09d43020:      0x00000000: MI_NOOP
...
0x09d43d10:      0x3f800000:    UNKNOWN
0x09d43d14:      0x3f800000:    UNKNOWN
0x09d43d18: HEAD 0x00000000: MI_NOOP
0x09d43d1c:      0x00000000: MI_NOOP
0x09d43d20:      0x00000000: MI_NOOP
0x09d43d24:      0x00000000: MI_NOOP
0x09d43d28:      0x00000000: MI_NOOP
0x09d43d2c:      0x3f800000:    UNKNOWN
0x09d43d30:      0x3f800000:    UNKNOWN
0x09d43d34:      0x3f800000:    UNKNOWN
0x09d43d38:      0x3f800000:    UNKNOWN
0x09d43d3c:      0x3f800000:    UNKNOWN
0x09d43d40:      0x00000000: MI_NOOP
0x09d43d44:      0x00000000: MI_NOOP
0x09d43d48:      0x000000aa: MI_NOOP
0x09d43d4c:      0x7d040400: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x09d43d50:      0x00008266:    S6: alpha_test=always, alpha_ref=0x0, depth_test=always, cbuf blend enable, src_blnd_fct=one, dst_blnd_fct=inv_src_alpha, cbuf write enable, tristrip_provoking_vertex=2
0x09d43d54:      0x7c800003: 3DSTATE_SCISSOR_ENABLE enabled
0x09d43d58:      0x7d810001: 3DSTATE_SCISSOR_RECTANGLE
0x09d43d5c:      0x01040009:    (9,260)
0x09d43d60:      0x00000000:    (0,0)

This is a weird mixture of overwritten and stale contents. The corruption doesn't seem to be aligned to fence pitches.
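
For reference, the repeated 0x3f800000 dwords that the decoder marks UNKNOWN are the IEEE-754 single-precision bit pattern for 1.0. That would be consistent with data (for example vertex or colour values) having landed in the batch, although the dump alone does not prove where it came from. A minimal Python sketch for checking the value; the helper name dword_as_float is made up for this illustration and is not part of any decoder tool:

```python
import struct


def dword_as_float(dword):
    """Reinterpret a 32-bit little-endian dword as an IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<I", dword))[0]


# The value that shows up as UNKNOWN in the batchbuffer dump above.
value = dword_as_float(0x3F800000)
print(value)
```

The same helper can be applied to other UNKNOWN dwords in an error state to see whether they look like plausible floating-point data rather than commands.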
Comment 5 Knut Petersen 2011-10-24 01:09:24 UTC
Created attachment 52676 [details]
another i915_error_state
Comment 6 Knut Petersen 2011-10-24 01:13:42 UTC
Created attachment 52677 [details]
another xorg log

dmesg:

[ 8113.460031] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 8113.460269] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 8113.462038] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 882756 at 882701, next 882757)
[ 8113.462775] [drm:i915_reset] *ERROR* Failed to reset chip.
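
A note on reading the i915_wait_request line: -11 is -EAGAIN on Linux, and the three numbers appear to be the sequence number being waited for, the one the ring had last reported, and the next one to be handed out. Under that assumption, the gap between "awaiting" and "at" is how far the GPU was behind when it hung. A small Python sketch that extracts the fields; the function name and field names are chosen for this illustration only:

```python
import re

# The line from the dmesg excerpt above, copied verbatim.
LINE = ("[ 8113.462038] [drm:i915_wait_request] *ERROR* i915_wait_request "
        "returns -11 (awaiting 882756 at 882701, next 882757)")

PATTERN = re.compile(
    r"returns (?P<ret>-?\d+) \(awaiting (?P<awaiting>\d+) "
    r"at (?P<current>\d+), next (?P<next>\d+)\)"
)


def parse_wait_request(line):
    """Return the numeric fields of an i915_wait_request error line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {name: int(text) for name, text in match.groupdict().items()}


fields = parse_wait_request(LINE)
# Assumed meaning: requests still outstanding between the ring and the waiter.
outstanding = fields["awaiting"] - fields["current"]
print(fields, outstanding)
```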


Same server, same kernel. 

Shall I try to reproduce the problem with some special debug parameters?

cu,
 Knut
Comment 7 Eugeni Dodonov 2011-10-24 05:24:23 UTC
@Knut - so this is a regression? If yes, could you please bisect it within the xf86-video-intel driver?


But besides that, from the latest X.log, the following seems suspicious:

[  8111.354] [mi] EQ overflowing. The server is probably stuck in an infinite loop.
[  8111.355] 
Backtrace:
[  8111.355] 0: /usr/bin/Xorg (xorg_backtrace+0x2e) [0x81d5eee]
[  8111.355] 1: /usr/bin/Xorg (mieqEnqueue+0x13e) [0x81b459e]
[  8111.355] 2: /usr/bin/Xorg (QueuePointerEvents+0x5d) [0x809433d]
[  8111.355] 3: /usr/bin/Xorg (xf86PostMotionEventM+0xdd) [0x80cd2dd]
[  8111.356] 4: /usr/lib/xorg/modules/input/evdev_drv.so (0xb721c000+0x6954) [0xb7222954]
[  8111.356] 5: /usr/bin/Xorg (0x8048000+0x740b1) [0x80bc0b1]
[  8111.356] 6: /usr/bin/Xorg (0x8048000+0x9a484) [0x80e2484]
[  8111.356] 7: (vdso) (__kernel_sigreturn+0x0) [0xb7883400]
[  8111.356] 8: /usr/lib/libdrm_intel.so.1 (drm_intel_gem_bo_map_gtt+0x67) [0xb71fd6a7]
[  8111.356] 9: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb722b000+0x12340) [0xb723d340]
[  8111.356] 10: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb722b000+0x2d5d4) [0xb72585d4]
[  8111.357] 11: /usr/bin/Xorg (0x8048000+0x17cb64) [0x81c4b64]
[  8111.357] 12: /usr/bin/Xorg (0x8048000+0xc8b7f) [0x8110b7f]
[  8111.357] 13: /usr/bin/Xorg (0x8048000+0x2dfe7) [0x8075fe7]
[  8111.357] 14: /usr/bin/Xorg (0x8048000+0x3098f) [0x807898f]
[  8111.357] 15: /usr/bin/Xorg (0x8048000+0x1e26d) [0x806626d]
[  8111.357] 16: /lib/libc.so.6 (__libc_start_main+0xfe) [0xb743fc2e]
[  8113.593] (EE) intel(0): Detected a hung GPU, disabling acceleration.

There were lots of changes in X.org input in the past weeks, I wonder if it could come from one of those somehow?
Comment 8 Knut Petersen 2011-10-24 06:04:35 UTC
(In reply to comment #7)
> @Knut - so this is a regression? If yes, could you please bisect it within the
> xf86-video-intel driver?

Well, I do not know if it is a regression. Half a year ago I switched off KDE desktop effects because of similar "composite" related problems, see bug #36151.

Now I had a little time and tried to reenable them, with little success. Maybe it's still broken, maybe it's broken again.

> 
> But besides that, from the latest X.log, the following seems suspicious:
> 

I wonder if this is only one bug.

> There were lots of changes in X.org input in the past weeks, I wonder if it
> could come from one of those somehow?

Well, I'll try to find a "known good" starting point.

cu,
 Knut
Comment 9 Chris Wilson 2012-03-26 01:44:51 UTC
I believe these are all related to the underlying bug:

commit c501ae7f332cdaf42e31af30b72b4b66cbbb1604
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Dec 14 13:57:23 2011 +0100

    drm/i915: Only clear the GPU domains upon a successful finish
    
    By clearing the GPU read domains before waiting upon the buffer, we run
    the risk of the wait being interrupted and the domains prematurely
    cleared. The next time we attempt to wait upon the buffer (after
    userspace handles the signal), we believe that the buffer is idle and so
    skip the wait.
    
    There are a number of bugs across all generations which show signs of an
    overly haste reuse of active buffers.
    
    Such as:
    
      https://bugs.freedesktop.org/show_bug.cgi?id=29046
      https://bugs.freedesktop.org/show_bug.cgi?id=35863
      https://bugs.freedesktop.org/show_bug.cgi?id=38952
      https://bugs.freedesktop.org/show_bug.cgi?id=40282
      https://bugs.freedesktop.org/show_bug.cgi?id=41098
      https://bugs.freedesktop.org/show_bug.cgi?id=41102
      https://bugs.freedesktop.org/show_bug.cgi?id=41284
      https://bugs.freedesktop.org/show_bug.cgi?id=42141
    
    A couple of those pre-date i915_gem_object_finish_gpu(), so may be
    unrelated (such as a wild write from a userspace command buffer), but
    this does look like a convincing cause for most of those bugs.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
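
The ordering problem described in the commit message can be illustrated with a toy model. This is a rough Python sketch, not the kernel code: ToyBuffer, GPU_DOMAINS, Interrupted and both finish_gpu_* functions are hypothetical names, and the mask value is arbitrary. It only shows why clearing the read domains before the wait lets an interrupted wait be skipped on retry.

```python
GPU_DOMAINS = 0x3E  # arbitrary mask standing in for the GPU read domains


class Interrupted(Exception):
    """Stands in for a wait that is interrupted by a signal."""


class ToyBuffer:
    def __init__(self):
        self.read_domains = GPU_DOMAINS  # GPU may still be reading the buffer
        self.gpu_busy = True             # rendering has not completed yet

    def wait_rendering(self, interrupt):
        if interrupt:
            raise Interrupted()          # signal arrives before the GPU is done
        self.gpu_busy = False            # the wait ran to completion


def finish_gpu_buggy(bo, interrupt):
    if not bo.read_domains & GPU_DOMAINS:
        return                           # believed idle, so the wait is skipped
    bo.read_domains &= ~GPU_DOMAINS      # cleared BEFORE the wait has succeeded
    bo.wait_rendering(interrupt)


def finish_gpu_fixed(bo, interrupt):
    if not bo.read_domains & GPU_DOMAINS:
        return
    bo.wait_rendering(interrupt)
    bo.read_domains &= ~GPU_DOMAINS      # cleared only after a successful wait


def reused_while_busy(finish):
    bo = ToyBuffer()
    try:
        finish(bo, interrupt=True)       # first attempt is interrupted
    except Interrupted:
        pass
    finish(bo, interrupt=False)          # userspace retries after the signal
    return bo.gpu_busy                   # True: caller proceeds on a busy buffer


print(reused_while_busy(finish_gpu_buggy), reused_while_busy(finish_gpu_fixed))
```

In the buggy ordering the retry sees no GPU read domains and returns without waiting, so the buffer is treated as idle while the GPU may still be using it, which matches the "overwritten and stale contents" seen in the batchbuffer dump in comment 4.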
