Bug 35284

Summary:

[SNB bisected] GPU hangs when running firefox-36-20090609.trace

Product:

DRI

Reporter:

meng <mengmeng.meng>

Component:

DRM/Intel

Assignee:

Chris Wilson <chris>

Status:

CLOSED FIXED

QA Contact:

Severity:

normal

Priority:

medium

CC:

jbarnes

Version:

unspecified

Hardware:

Other

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
The dmesg for this bug	none
Flush BLT before the interrupt	none

Description meng 2011-03-14 01:07:51 UTC

System Environment:
--------------------------
Platform:        SugarBay
Libdrm:	        (master)2.4.24-6-g3b04c73650b5e9bbcb602fdb8cea0b16ad82d0c0
Mesa:	        (master)dedc81e1dced8768334c300d630b4683fd8a1ba2
Xserver:	        (master)xorg-server-1.10.0-77-ga19771e4337d1c4600550314bbc42a1495a023ff
Xf86_video_intel: (master)2.14.901-13-g5c81886c23b6e92f224d40592b077f4817b408b8
Cairo:		  (master)f1d313e042af89b2f5f5d09d3eb1703d0517ecd7
Kernel:	(drm-intel-next) 47ae63e0c2e5fdb582d471dc906eb29be94c732f

Bug detailed description:
-------------------------
The GPU hangs with the xlib backend when running firefox-talos-gfx.trace on
a SugarBay (i5-2500K, 0112 (rev09)). This is a kernel regression; the same test
works fine on Piketon. Please see the attached dmesg. Bisecting shows that
d7b9935a347ae954be907ea3d5eb4564ff124c53 is the first bad commit.

Backtrace (sometimes):
0: X (xorg_backtrace+0x28) [0x457718]
1: X (mieqEnqueue+0x1f4) [0x457594]
2: X (xf86PostMotionEventM+0x97) [0x475217]
3: /opt/X11R7/lib/xorg/modules/input/evdev_drv.so (0x7fe92b936000+0x5531) [0x7fe92b93b531]
4: X (0x400000+0x68567) [0x468567]
5: X (0x400000+0x115753) [0x515753]
6: /lib64/libpthread.so.0 (0x37f7400000+0xf3c0) [0x37f740f3c0]
7: /lib64/libc.so.6 (ioctl+0x7) [0x37f70dc7b7]
8: /opt/X11R7/lib/libdrm.so.2 (drmIoctl+0x28) [0x7fe92cbd12a8]
9: /opt/X11R7/lib/libdrm_intel.so.1 (drm_intel_gem_bo_map_gtt+0x7e) [0x7fe92c36c92e]
10: /opt/X11R7/lib/xorg/modules/drivers/intel_drv.so (0x7fe92c571000+0x10b53) [0x7fe92c581b53]
11: /opt/X11R7/lib/xorg/modules/drivers/intel_drv.so (0x7fe92c571000+0x282a2) [0x7fe92c5992a2]
12: X (0x400000+0x155d89) [0x555d89]
13: X (0x400000+0xa912e) [0x4a912e]
14: X (0x400000+0x53975) [0x453975]
15: X (0x400000+0x54541) [0x454541]
16: X (0x400000+0x214fb) [0x4214fb]
17: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x37f701ee7d]
18: X (0x400000+0x21089) [0x421089]

Reproduce steps:
----------------
1. xinit &
2. ./cairo-perf-trace firefox-36-20090609.trace

commit d7b9935a347ae954be907ea3d5eb4564ff124c53
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jan 20 13:19:55 2011 -0800

    i915: Fix i915 suspend delay

    During system suspend, the "wait for ring buffer to empty" loop would
    always time out after three seconds, because the faster cached ring
    buffer head read would always return zero.  Force the slow-and-careful
    PIO read on all but the first iterations of the loop to fix it.

    This also removes the unused (and useless) 'actual_head' variable that
    tried to approximate doing this, but did it incorrectly.

    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Rafael J. Wysocki <rjw@sisk.pl>
    Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
    Cc: Dave Airlie <airlied@linux.ie>
    Cc: DRI mailing list <dri-devel@lists.freedesktop.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
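
The loop described in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the i915 code: `struct demo_ring`, `read_head_slow()`, `ring_space()`, `wait_for_space()` and the 8-byte `RING_GAP` are hypothetical names and values invented here. The first iteration trusts the cheap cached head; every later iteration falls back to the authoritative (slow) read, so a stale cached value of zero cannot make the wait time out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a ring buffer; not the real i915 structures. */
struct demo_ring {
    uint32_t cached_head;  /* cheap copy of the head, may be stale */
    uint32_t hw_head;      /* authoritative head, models the slow register read */
    uint32_t tail;         /* where the CPU writes next */
    uint32_t size;         /* ring size in bytes */
    unsigned slow_reads;   /* counts how often the slow path was taken */
};

#define RING_GAP 8u /* assumed safety gap between tail and head */

/* Models the slow-and-careful read of the head pointer. */
static uint32_t read_head_slow(struct demo_ring *r)
{
    r->slow_reads++;
    return r->hw_head;
}

/* Free bytes between tail and head, accounting for wrap-around. */
static uint32_t ring_space(const struct demo_ring *r, uint32_t head)
{
    int32_t space = (int32_t)head - (int32_t)(r->tail + RING_GAP);
    if (space < 0)
        space += (int32_t)r->size;
    return (uint32_t)space;
}

/* Returns true once at least n bytes are free, false after max_iters tries. */
static bool wait_for_space(struct demo_ring *r, uint32_t n, unsigned max_iters)
{
    for (unsigned i = 0; i < max_iters; i++) {
        /* Fast cached read only on the first pass, slow read afterwards. */
        uint32_t head = (i == 0) ? r->cached_head : read_head_slow(r);
        if (ring_space(r, head) >= n)
            return true;
    }
    return false;
}
```

With a stale `cached_head` of 0 and the hardware head already caught up with the tail, a single iteration (cached read only) reports too little space, while a second iteration using the slow read succeeds.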

Comment 1 meng 2011-03-14 01:08:33 UTC

Created attachment 44430 [details]
The dmesg for this bug

Comment 2 Chris Wilson 2011-03-14 03:26:54 UTC

OK, I've reproduced this on next, and it does seem to be a ringbuffer overflow. Much to my surprise.

Comment 3 Chris Wilson 2011-03-14 05:42:03 UTC

After fiddling a little bit, it still hangs without the suspicious ringbuffer wrapping.

Comment 4 Chris Wilson 2011-03-14 12:08:12 UTC

Also reproduced a very similar GPU hang running the trace on a HuronRiver (rev09).

Comment 5 Chris Wilson 2011-03-19 15:35:54 UTC

Created attachment 44622 [details] [review]
Flush BLT before the interrupt

Comment 6 Chris Wilson 2011-03-21 23:51:06 UTC

I am waiting on confirmation that


commit fa0fd4d6f815d05c6f87f11df2cac8a9003cab74
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Mar 19 22:26:49 2011 +0000

    drm/i915: Restore missing command flush before interrupt on BLT ring
    
    We always skipped flushing the BLT ring if the request flush did not
    include the RENDER domain. However, this neglects that we try to flush
    the COMMAND domain after every batch and before the breadcrumb interrupt
    (to make sure the batch is indeed completed prior to the interrupt
    firing and so insuring CPU coherency). As a result of the missing flush,
    incoherency did indeed creep in, most notable when using lots of command
    buffers and so potentially rewritting an active command buffer (i.e.
    the GPU was still executing from it even though the following interrupt
    had already fired and the request/buffer retired).
    
    As all ring->flush routines now have the same preconditions, de-duplicate
    and move those checks up into i915_gem_flush_ring().
    
    Fixes gem_linear_blit.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35284
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

in drm-intel-staging fixes this bug.
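
The change described in the commit message above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the domain bit values, `struct sketch_ring`, `blt_flush_old()`, `ring_flush_emit()` and `flush_ring()` are hypothetical and are not copied from the i915 sources. The idea shown is that a per-ring check requiring the RENDER domain silently drops a COMMAND-only flush, whereas a single shared precondition in the caller lets it through.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative domain bits; the values are assumptions, not taken from i915_drm.h. */
#define DOMAIN_RENDER  (1u << 0)
#define DOMAIN_COMMAND (1u << 1)
#define GPU_DOMAINS    (DOMAIN_RENDER | DOMAIN_COMMAND)

/* Hypothetical ring that only counts the flushes it emits. */
struct sketch_ring {
    unsigned flushes_emitted;
};

/* Old BLT behaviour as the commit message describes it:
 * the flush is skipped unless the RENDER domain is involved. */
static void blt_flush_old(struct sketch_ring *ring,
                          uint32_t invalidate, uint32_t flush)
{
    if (((invalidate | flush) & DOMAIN_RENDER) == 0)
        return; /* a COMMAND-only flush is lost here */
    ring->flushes_emitted++;
}

/* Per-ring emit routine with no precondition checks of its own. */
static void ring_flush_emit(struct sketch_ring *ring)
{
    ring->flushes_emitted++;
}

/* Shared entry point: one precondition for every ring, checked once. */
static void flush_ring(struct sketch_ring *ring,
                       uint32_t invalidate, uint32_t flush)
{
    if (((invalidate | flush) & GPU_DOMAINS) == 0)
        return; /* nothing GPU-side to flush */
    ring_flush_emit(ring);
}
```

In this sketch, a request with only `DOMAIN_COMMAND` set emits nothing through `blt_flush_old()` but does emit a flush through `flush_ring()`, which is the case the breadcrumb interrupt depends on.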

Comment 7 meng 2011-03-22 03:23:16 UTC

We cannot compile drm-intel-staging successfully unless we delete the statement at line 702 of
drivers/usb/serial/usb_wwan.c. Even when it compiles, the system does not work normally on SugarBay. Sorry about that. Could you please look at the compile error below?


drivers/usb/serial/usb_wwan.c: In function ‘play_delayed’:
drivers/usb/serial/usb_wwan.c:702: error: ‘struct dev_pm_info’ has no member named ‘usage_count’
make[5]: *** [drivers/usb/serial/usb_wwan.o] Error 1
make[4]: *** [drivers/usb/serial] Error 2
make[3]: *** [drivers/usb] Error 2
make[2]: *** [drivers] Error 2
make[1]: *** [binrpm-pkg] Error 2
make: *** [binrpm-pkg] Error 2

Comment 8 Chris Wilson 2011-03-22 03:32:39 UTC

I rebased drm-intel-staging on drm-core-next, so the compilation issue should be gone and the tree should provide a working system on which to test.

Comment 9 meng 2011-03-22 20:11:37 UTC

It works fine on SugarBay when tested with commit 87862b8b (drm/i915: Restore missing command flush before interrupt on BLT ring).

Comment 10 Chris Wilson 2011-03-22 23:50:37 UTC

commit d2023bf8be6c39d45a1a08d0bd8efb126701634c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Mar 19 22:26:49 2011 +0000

    drm/i915: Restore missing command flush before interrupt on BLT ring
    
    We always skipped flushing the BLT ring if the request flush did not
    include the RENDER domain. However, this neglects that we try to flush
    the COMMAND domain after every batch and before the breadcrumb interrupt
    (to make sure the batch is indeed completed prior to the interrupt
    firing and so insuring CPU coherency). As a result of the missing flush,
    incoherency did indeed creep in, most notable when using lots of command
    buffers and so potentially rewritting an active command buffer (i.e.
    the GPU was still executing from it even though the following interrupt
    had already fired and the request/buffer retired).
    
    As all ring->flush routines now have the same preconditions, de-duplicate
    and move those checks up into i915_gem_flush_ring().
    
    Fixes gem_linear_blit.
  
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35284
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Tested-by: mengmeng.meng@intel.com

Comment 11 meng 2011-03-23 03:22:52 UTC

Verified with commit d2023bf8be6c39d45a1a08d0bd8efb126701634c; it works fine.

Comment 12 Jari Tahvanainen 2017-09-04 10:05:06 UTC

Closing old verified+fixed.
