Bug 47535

Summary: [snb] gpu hang with google maps (GL version)
Product: Mesa Reporter: Aaron <aaron667>
Component: Drivers/DRI/i965 Assignee: Kenneth Graunke <kenneth>
Status: RESOLVED FIXED QA Contact:
Severity: critical    
Priority: medium CC: anuj.phogat, arun, brot+bfdo, bryce, chadversary, daniel, erappleman, futuredub, idr, kenneth, radicaledward
Version: git   
Hardware: All   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: i915_error_state
i915_error_state after upgrade to mesa (git: ca760181b4420696c7e86aa2951d7203522ad1e8 )
i915_error_state running with drm-intel-next-queued kernel

Description Aaron 2012-03-19 13:34:20 UTC
Created attachment 58707
i915_error_state

xf86-video-intel git: 1c2932e9cb283942567c3dd2695d03b8045da27f
xorg-server: 1.12
firefox: 11.0
kernel: 3.3.0

dmesg:
[  357.750313] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  357.750325] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  357.767446] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 7401 at 7390, next 7402)

Xorg.log:
[   353.425] [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
[   353.425] 
[   353.425] Backtrace:
[   353.426] 0: /usr/bin/X (xorg_backtrace+0x36) [0x56d2d6]
[   353.426] 1: /usr/bin/X (mieqEnqueue+0x273) [0x54dd43]
[   353.426] 2: /usr/bin/X (0x400000+0x4a3ae) [0x44a3ae]
[   353.426] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fdf2082c000+0x62d1) [0x7fdf208322d1]
[   353.426] 4: /usr/bin/X (0x400000+0x72267) [0x472267]
[   353.426] 5: /usr/bin/X (0x400000+0x97ab3) [0x497ab3]
[   353.426] 6: /lib64/libpthread.so.0 (0x7fdf24db8000+0x10420) [0x7fdf24dc8420]
[   353.426] 7: /lib64/libc.so.6 (ioctl+0x7) [0x7fdf23da8807]
[   353.426] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x7fdf2233ef08]
[   353.426] 9: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fdf21e07000+0x37524) [0x7fdf21e3e524]
[   353.426] 10: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fdf21e07000+0x3a353) [0x7fdf21e41353]
[   353.426] 11: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fdf21e07000+0x6256c) [0x7fdf21e6956c]
[   353.426] 12: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fdf21e07000+0x70720) [0x7fdf21e77720]
[   353.426] 13: /usr/bin/X (WakeupHandler+0xa0) [0x43a280]
[   353.426] 14: /usr/bin/X (WaitForSomething+0x1bc) [0x56a8cc]
[   353.426] 15: /usr/bin/X (0x400000+0x35e52) [0x435e52]
[   353.426] 16: /usr/bin/X (0x400000+0x24e6a) [0x424e6a]
[   353.426] 17: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x7fdf23cf522d]
[   353.426] 18: /usr/bin/X (0x400000+0x24a09) [0x424a09]
[   353.426] 
[   353.426] [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
[   353.426] [mi] mieq is *NOT* the cause.  It is a victim.
[   354.097] [mi] EQ overflow continuing.  100 events have been dropped.
Comment 1 Aaron 2012-03-19 14:06:00 UTC
Created attachment 58710
i915_error_state after upgrade to mesa (git: ca760181b4420696c7e86aa2951d7203522ad1e8 )
Comment 2 Kenneth Graunke 2012-03-31 02:20:30 UTC
It appears that the hang goes away if you disable HiZ.  (This can be done via driconf, or by setting the environment variable hiz=false.)

The simulator is giving me interesting complaints; investigating further...
Comment 3 Kenneth Graunke 2012-04-12 14:09:12 UTC
*** Bug 45806 has been marked as a duplicate of this bug. ***
Comment 4 Kenneth Graunke 2012-04-12 14:09:16 UTC
*** Bug 44108 has been marked as a duplicate of this bug. ***
Comment 5 Eric Appleman 2012-04-12 16:05:56 UTC
Confirming that disabling Hierarchical Z stops the hang. 

However, overall MapsGL performance is uninspiring, probably Google's fault since it's still experimental.
Comment 6 Chad Versace 2012-04-13 11:25:21 UTC
MapsGL is also uninspiring on my MacBook Pro with a discrete NVidia card. It's definitely Google's fault.
Comment 7 Daniel Vetter 2012-04-17 13:29:44 UTC
Maybe we're lucky and one of the SNB workarounds fixed this. Can someone try the latest kernel from my drm-intel-next-queued branch?
Comment 8 Aaron 2012-04-22 04:15:12 UTC
I just gave drm-intel-next-queued (a360bb1a83279243a0945a0e646fd6c66521864e) a try (running with mesa 8.0.2, libdrm-2.4.33 and xf86-video-intel-caf9144271a10f90ea580c246b2df3f69a10b7a0) and I'd say the situation definitely improved. I got one initial hang when using google maps; after that it ran almost smoothly. Previously, every single zoom on the map caused yet another hangcheck_hung entry.

The log message I got was:
[   79.722665] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[   79.722669] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[   79.734067] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
Comment 9 Aaron 2012-04-22 04:16:43 UTC
Created attachment 60446
i915_error_state running with drm-intel-next-queued kernel
Comment 10 Kenneth Graunke 2012-04-26 02:27:28 UTC
Mashing this register makes the problem go away for me:

$ sudo intel_reg_write 0x2120 '0x1206800'
Value before: 0x6820
Value after: 0x6800

Anyone care to try it out?  I'll put together a kernel patch soon.
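Note that 0x1206800 was written, yet the register reads back 0x6800. That is consistent with CACHE_MODE_0 being a "masked" register, assuming the usual Intel convention where the upper 16 bits of a write are per-bit write enables for the lower 16 bits. The following Python sketch models that assumption; `masked_write` is a hypothetical helper for illustration, not part of intel_reg_write or the kernel.

```python
# Minimal model of a write to a "masked" 16-bit register.
# ASSUMPTION: the upper 16 bits of the written value are per-bit write
# enables for the lower 16 bits; bits whose enable is 0 keep their old value.


def masked_write(old: int, written: int) -> int:
    """Return the register contents after writing `written` (hypothetical helper)."""
    mask = (written >> 16) & 0xFFFF  # which bits are allowed to change
    data = written & 0xFFFF          # new values for those bits
    return (old & ~mask & 0xFFFF) | (data & mask)


written = 0x1206800              # value passed to intel_reg_write above
mask = (written >> 16) & 0xFFFF  # 0x0120: write enables for bits 5 and 8
data = written & 0xFFFF          # 0x6800: both of those bits are 0 here

before = 0x6820
after = masked_write(before, written)
print(f"Value before: {before:#x}")
print(f"Value after: {after:#x}")
for bit in range(16):
    if (mask >> bit) & 1:
        print(f"bit {bit} forced to {(data >> bit) & 1}")
```

Under this model the only bit that actually changes is bit 5 (0x20); bit 8 is also write-enabled but was already 0 in the 0x6820 reading.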
Comment 11 Rob Castle 2012-04-26 08:34:40 UTC
I can confirm it works, for me at least.
Comment 12 Aaron 2012-04-26 08:43:51 UTC
Works for me, too.
Comment 13 Kenneth Graunke 2012-04-26 23:48:49 UTC
It turns out that Daniel already implemented a workaround for this in the kernel...but the register write isn't sticking.  Apparently this register needs to be written later in the initialization process.

I've sent a preliminary (lame) patch to intel-gfx to spark some discussion on what the right fix should be.  Hopefully we can land on something soon...

Reassigning to me.
Comment 14 Kenneth Graunke 2012-04-29 22:14:45 UTC
Linus merged the patch into the upstream kernel, and Greg picked it up for 3.3 stable.  Hopefully should be landing in a distro near you. :)

Marking fixed.  If upgrading kernels is inconvenient, you can also apply the workaround manually via "sudo intel_reg_write 0x2120 0x1206800", or by disabling HiZ and separate stencil (export hiz=false).  (The kernel patch does the register write, so the intel_reg_write workaround is just as good.)

commit 3a69ddd6f872180b6f61fda87152b37202118fbc
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Fri Apr 27 12:44:41 2012 -0700

    drm/i915: Set the Stencil Cache eviction policy to non-LRA mode.
    
    Clearing bit 5 of CACHE_MODE_0 is necessary to prevent GPU hangs in
    OpenGL programs such as Google MapsGL, Google Earth, and gzdoom when
    using separate stencil buffers.  Without it, the GPU tries to use the
    LRA eviction policy, which isn't supported.  This was supposed to be off
    by default, but seems to be on for many machines.
    
    This cannot be done in gen6_init_clock_gating with most of the other
    workaround bits; the render ring needs to exist.  Otherwise, the
    register write gets dropped on the floor (one printk will show it
    changed, but a second printk immediately following shows the value
    reverts to the old one).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=47535
    Cc: stable@vger.kernel.org
    Cc: Rob Castle <futuredub@gmail.com>
    Cc: Eric Appleman <erappleman@gmail.com>
    Cc: aaron667@gmx.net
    Cc: Keith Packard <keithp@keithp.com>
    Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
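The commit message describes the fix as clearing bit 5 of CACHE_MODE_0. Assuming the masked-register convention (upper 16 bits are write enables for the lower 16), a write that clears only bit 5 needs just that bit's enable with a 0 data bit, whereas the manual 0x1206800 command also enables bit 8. The helper names here (`masked_bit_enable`, `masked_bit_disable`, `masked_write`) are hypothetical illustrations, not the kernel's actual macros or code.

```python
# Hypothetical helpers for a masked register.
# ASSUMPTION: the upper 16 bits of a write are per-bit write enables
# for the lower 16 bits.

STC_EVICT_BIT = 5  # "bit 5 of CACHE_MODE_0", per the commit message


def masked_bit_enable(bits: int) -> int:
    # Enable writes to `bits` and set them to 1.
    return (bits << 16) | bits


def masked_bit_disable(bits: int) -> int:
    # Enable writes to `bits` but leave their data bits at 0, clearing them.
    return bits << 16


def masked_write(old: int, written: int) -> int:
    mask = (written >> 16) & 0xFFFF
    return (old & ~mask & 0xFFFF) | (written & mask)


clear_bit5 = masked_bit_disable(1 << STC_EVICT_BIT)
print(f"{clear_bit5:#010x}")                      # 0x00200000
print(f"{masked_write(0x6820, clear_bit5):#x}")   # 0x6800
# Round trip: setting the bit again restores the original reading.
print(f"{masked_write(0x6800, masked_bit_enable(1 << STC_EVICT_BIT)):#x}")  # 0x6820
```

On the 0x6820 reading, both 0x00200000 and 0x1206800 yield 0x6800 under this model, which fits the remark that the intel_reg_write workaround is just as good as the kernel patch.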
Comment 15 Kenneth Graunke 2012-05-03 16:18:36 UTC
*** Bug 48526 has been marked as a duplicate of this bug. ***
Comment 16 Kenneth Graunke 2012-05-03 16:18:52 UTC
*** Bug 48791 has been marked as a duplicate of this bug. ***
Comment 17 Kenneth Graunke 2012-05-04 08:55:19 UTC
*** Bug 48748 has been marked as a duplicate of this bug. ***
Comment 18 Arun Raghavan 2012-05-04 08:59:28 UTC
Works a treat. Thank you!
