Bug 35284

Summary: [SNB bisected] GPU hangs when running firefox-36-20090609.trace
Product: DRI Reporter: meng <mengmeng.meng>
Component: DRM/Intel Assignee: Chris Wilson <chris>
Status: CLOSED FIXED QA Contact:
Severity: normal    
Priority: medium CC: jbarnes
Version: unspecified   
Hardware: Other   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                       Flags
The dmesg for this bug            none
Flush BLT before the interrupt    none

Description meng 2011-03-14 01:07:51 UTC
System Environment:
--------------------------
Platform:        SugarBay
Libdrm:	        (master)2.4.24-6-g3b04c73650b5e9bbcb602fdb8cea0b16ad82d0c0
Mesa:	        (master)dedc81e1dced8768334c300d630b4683fd8a1ba2
Xserver:	        (master)xorg-server-1.10.0-77-ga19771e4337d1c4600550314bbc42a1495a023ff
Xf86_video_intel: (master)2.14.901-13-g5c81886c23b6e92f224d40592b077f4817b408b8
Cairo:		  (master)f1d313e042af89b2f5f5d09d3eb1703d0517ecd7
Kernel:	(drm-intel-next) 47ae63e0c2e5fdb582d471dc906eb29be94c732f

Bug detailed description:
-------------------------
GPU hangs with the xlib backend when running firefox-talos-gfx.trace on
a SugarBay (i5-2500K, 0112 (rev09)). It is a kernel regression, and it works
fine on Piketon. Please see the attached dmesg. Bisecting shows that
d7b9935a347ae954be907ea3d5eb4564ff124c53 is the first bad commit.

Backtrace(sometimes):
0: X (xorg_backtrace+0x28) [0x457718]
1: X (mieqEnqueue+0x1f4) [0x457594]
2: X (xf86PostMotionEventM+0x97) [0x475217]
3: /opt/X11R7/lib/xorg/modules/input/evdev_drv.so (0x7fe92b936000+0x5531) [0x7fe92b93b531]
4: X (0x400000+0x68567) [0x468567]
5: X (0x400000+0x115753) [0x515753]
6: /lib64/libpthread.so.0 (0x37f7400000+0xf3c0) [0x37f740f3c0]
7: /lib64/libc.so.6 (ioctl+0x7) [0x37f70dc7b7]
8: /opt/X11R7/lib/libdrm.so.2 (drmIoctl+0x28) [0x7fe92cbd12a8]
9: /opt/X11R7/lib/libdrm_intel.so.1 (drm_intel_gem_bo_map_gtt+0x7e) [0x7fe92c36c92e]
10: /opt/X11R7/lib/xorg/modules/drivers/intel_drv.so (0x7fe92c571000+0x10b53) [0x7fe92c581b53]
11: /opt/X11R7/lib/xorg/modules/drivers/intel_drv.so (0x7fe92c571000+0x282a2) [0x7fe92c5992a2]
12: X (0x400000+0x155d89) [0x555d89]
13: X (0x400000+0xa912e) [0x4a912e]
14: X (0x400000+0x53975) [0x453975]
15: X (0x400000+0x54541) [0x454541]
16: X (0x400000+0x214fb) [0x4214fb]
17: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x37f701ee7d]
18: X (0x400000+0x21089) [0x421089]

Reproduce steps:
----------------
1 xinit&
2 ./cairo-perf-trace firefox-36-20090609.trace 

commit d7b9935a347ae954be907ea3d5eb4564ff124c53
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jan 20 13:19:55 2011 -0800

    i915: Fix i915 suspend delay

    During system suspend, the "wait for ring buffer to empty" loop would
    always time out after three seconds, because the faster cached ring
    buffer head read would always return zero.  Force the slow-and-careful
    PIO read on all but the first iterations of the loop to fix it.

    This also removes the unused (and useless) 'actual_head' variable that
    tried to approximate doing this, but did it incorrectly.

    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Rafael J. Wysocki <rjw@sisk.pl>
    Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
    Cc: Dave Airlie <airlied@linux.ie>
    Cc: DRI mailing list <dri-devel@lists.freedesktop.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
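
The wait loop described in the commit message above can be modeled roughly as follows. This is a toy Python simulation, not the kernel code: the `Ring` class, the 8-byte gap in `free_space`, the 64-byte hardware step and all function names are assumptions made for illustration. It only shows why a cached head value that stays at zero makes the loop time out, and why forcing the slow register read on every iteration after the first lets it finish.

```python
class Ring:
    """Toy ring buffer (hypothetical model, not the i915 structure)."""

    def __init__(self, size, tail):
        self.size = size
        self.tail = tail
        self.hw_head = 0      # what a slow register (PIO) read would return
        self.cached_head = 0  # fast cached copy; stuck at zero in this model

    def step_hardware(self, nbytes):
        """Pretend the GPU consumed nbytes more, never passing the tail."""
        self.hw_head = min(self.hw_head + nbytes, self.tail)


def free_space(ring, head):
    """Free bytes for a given head value; the 8-byte gap is an assumption."""
    space = head - (ring.tail + 8)
    if space < 0:
        space += ring.size
    return space


def wait_for_space(ring, needed, max_iters, force_slow_read):
    """Return the iteration at which enough space was seen, or None on timeout.

    With force_slow_read, only the first iteration uses the cached head;
    every later iteration uses the register value instead.
    """
    for i in range(max_iters):
        use_slow = force_slow_read and i > 0
        head = ring.hw_head if use_slow else ring.cached_head
        if free_space(ring, head) >= needed:
            return i
        ring.step_hardware(64)  # hardware makes progress while we wait
    return None
```

In this model, waiting for an empty 4096-byte ring (`needed=4088`) with the tail at 4000 never succeeds when only the cached head is consulted, while the variant that switches to the register read completes once the simulated hardware head reaches the tail.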
Comment 1 meng 2011-03-14 01:08:33 UTC
Created attachment 44430
The dmesg for this bug
Comment 2 Chris Wilson 2011-03-14 03:26:54 UTC
OK, I've reproduced this on next, and it does seem to be a ringbuffer overflow. Much to my surprise.
Comment 3 Chris Wilson 2011-03-14 05:42:03 UTC
After fiddling a little bit, it still hangs without the suspicious ringbuffer wrapping.
Comment 4 Chris Wilson 2011-03-14 12:08:12 UTC
Also reproduced a very similar GPU hang running the trace on a HuronRiver (rev09).
Comment 5 Chris Wilson 2011-03-19 15:35:54 UTC
Created attachment 44622
Flush BLT before the interrupt
Comment 6 Chris Wilson 2011-03-21 23:51:06 UTC
I am waiting on confirmation that


commit fa0fd4d6f815d05c6f87f11df2cac8a9003cab74
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Mar 19 22:26:49 2011 +0000

    drm/i915: Restore missing command flush before interrupt on BLT ring
    
    We always skipped flushing the BLT ring if the request flush did not
    include the RENDER domain. However, this neglects that we try to flush
    the COMMAND domain after every batch and before the breadcrumb interrupt
    (to make sure the batch is indeed completed prior to the interrupt
    firing and so insuring CPU coherency). As a result of the missing flush,
    incoherency did indeed creep in, most notable when using lots of command
    buffers and so potentially rewritting an active command buffer (i.e.
    the GPU was still executing from it even though the following interrupt
    had already fired and the request/buffer retired).
    
    As all ring->flush routines now have the same preconditions, de-duplicate
    and move those checks up into i915_gem_flush_ring().
    
    Fixes gem_linear_blit.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35284
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

in drm-intel-staging fixes this bug.
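
The change in flush preconditions that the commit message describes can be sketched as a small Python model. This is an illustration under stated assumptions, not the driver code: the domain constants, the `GPU_DOMAINS` mask and both function names are hypothetical stand-ins, and each function only reports whether a flush would be emitted for a given request.

```python
# Illustrative domain bits (assumed values, chosen only for this sketch).
DOMAIN_RENDER = 0x2
DOMAIN_COMMAND = 0x8
GPU_DOMAINS = DOMAIN_RENDER | DOMAIN_COMMAND  # hypothetical "any GPU domain" mask


def blt_flush_before_fix(invalidate, flush):
    """Behaviour described as the bug: the BLT ring flush was skipped
    unless the request included the RENDER domain."""
    return bool((invalidate | flush) & DOMAIN_RENDER)


def flush_ring_after_fix(invalidate, flush):
    """Behaviour described as the fix: one shared precondition, checked
    before calling any ring's flush routine, that accepts any GPU domain."""
    return bool((invalidate | flush) & GPU_DOMAINS)
```

A COMMAND-only flush request, such as the one issued before the breadcrumb interrupt, is dropped by the first function and honoured by the second, which is the difference the commit message attributes the incoherency to.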
Comment 7 meng 2011-03-22 03:23:16 UTC
We cannot compile drm-intel-staging successfully unless we delete the statement at drivers/usb/serial/usb_wwan.c:702. Even when it compiles successfully, the system does not work normally on SugarBay. I'm sorry about that. Could you please look at the error reported when compiling?


drivers/usb/serial/usb_wwan.c: In function ‘play_delayed’:
drivers/usb/serial/usb_wwan.c:702: error: ‘struct dev_pm_info’ has no member named ‘usage_count’
make[5]: *** [drivers/usb/serial/usb_wwan.o] Error 1
make[4]: *** [drivers/usb/serial] Error 2
make[3]: *** [drivers/usb] Error 2
make[2]: *** [drivers] Error 2
make[1]: *** [binrpm-pkg] Error 2
make: *** [binrpm-pkg] Error 2
Comment 8 Chris Wilson 2011-03-22 03:32:39 UTC
I rebased drm-intel-staging on drm-core-next, so the compilation issue should be gone and the branch should provide a working system on which to test.
Comment 9 meng 2011-03-22 20:11:37 UTC
It works fine on SugarBay when tested with commit 87862b8b (drm/i915: Restore missing command flush before interrupt on BLT ring).
Comment 10 Chris Wilson 2011-03-22 23:50:37 UTC
commit d2023bf8be6c39d45a1a08d0bd8efb126701634c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Mar 19 22:26:49 2011 +0000

    drm/i915: Restore missing command flush before interrupt on BLT ring
    
    We always skipped flushing the BLT ring if the request flush did not
    include the RENDER domain. However, this neglects that we try to flush
    the COMMAND domain after every batch and before the breadcrumb interrupt
    (to make sure the batch is indeed completed prior to the interrupt
    firing and so insuring CPU coherency). As a result of the missing flush,
    incoherency did indeed creep in, most notable when using lots of command
    buffers and so potentially rewritting an active command buffer (i.e.
    the GPU was still executing from it even though the following interrupt
    had already fired and the request/buffer retired).
    
    As all ring->flush routines now have the same preconditions, de-duplicate
    and move those checks up into i915_gem_flush_ring().
    
    Fixes gem_linear_blit.
  
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35284
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Tested-by: mengmeng.meng@intel.com
Comment 11 meng 2011-03-23 03:22:52 UTC
Verified with commit d2023bf8be6c39d45a1a08d0bd8efb126701634c; it works fine.
Comment 12 Jari Tahvanainen 2017-09-04 10:05:06 UTC
Closing old verified+fixed.
