Bug 50619

Summary: [SNB rc6] drop MMIO writes leading to GPU hang
Product: DRI
Reporter: Matt Turner <mattst88>
Component: DRM/Intel
Assignee: Daniel Vetter <daniel>
Status: CLOSED WORKSFORME
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: highest
CC: ben, carthexis, chris, daniel, eugeni, huax.lu, igaldino, jbarnes, peter.alfredsen
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                                            Flags
i915_error_state                                       none
dmesg                                                  none
/sys/kernel/debug/dri/0/i915_error_state after crash   none
dmesg before crash                                     none
dmesg after crash                                      none
Xorg.0.log                                             none

Description Matt Turner 2012-06-02 09:42:32 UTC
Created attachment 62426 [details]
i915_error_state

kernel commit - 829f51dbd825256197fb2a89705d42ad83f958ef

This started happening sometime after 3.3.
Comment 1 Matt Turner 2012-06-02 09:42:47 UTC
Created attachment 62427 [details]
dmesg
Comment 2 Chris Wilson 2012-06-02 14:09:19 UTC
Hmm, that is worrying. It really does look like it is dropping critical writes whilst programming the GPU. Any chance that this is bisectable?
Comment 3 Daniel Vetter 2012-06-04 02:10:32 UTC
If you bisect this, please use the i915_error_state dump/full gpu hang as an indicator - we've added the gtfifo WARN in 3.4. Patches are 67a3744f7515edda988 and ee64cbdbf6170679881 if you're curious.
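A minimal sketch of how the "error state as indicator" advice could be automated for `git bisect run`. This is an editorial illustration, not something posted in the thread. It assumes the debugfs path used for the attachments in this report and assumes that the file contains only the placeholder text "no error state collected" when no hang has been recorded; both should be checked against the kernel under test. The function name `bisect_exit_code` is hypothetical.

```python
from pathlib import Path

# Assumed location of the error state; debugfs must be mounted and readable.
ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")

# Assumption: this placeholder is all the file holds when no hang was recorded.
NO_ERROR_MARKER = "no error state collected"


def bisect_exit_code(error_state_text: str) -> int:
    """Map error-state contents to an exit code understood by `git bisect run`."""
    text = error_state_text.strip()
    if not text:
        return 125  # nothing readable: tell git bisect to skip this commit
    if text.lower().startswith(NO_ERROR_MARKER):
        return 0    # no hang recorded: treat the commit as good
    return 1        # an error state was captured: treat the commit as bad


def read_error_state(path: Path = ERROR_STATE) -> str:
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        return path.read_text(errors="replace")
    except OSError:
        return ""
```

A wrapper script would run the workload that triggers the hang, then exit with `bisect_exit_code(read_error_state())`. The gtfifo WARN in dmesg is deliberately ignored here, since it only exists from 3.4 onwards.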
Comment 4 Matt Turner 2012-06-10 09:47:03 UTC
This probably isn't the news you wanted to hear: it seems related to rc6.

When I boot with i915.i915_enable_rc6=-1 I get GPU hangs when switching between windows in gnome-shell. When I boot with i915.i915_enable_rc6=0, I cannot reproduce the hangs.

I tried to bisect between v3.3 and v3.4 and the offending commit was simply enabling rc6 by default, so I'll have to redo the bisection with rc6 enabled, but I'm not sure if I've ever had rc6 stable on my system.
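For anyone comparing configurations, a small sketch for confirming which rc6 setting a boot actually used, by parsing a kernel command line such as the contents of /proc/cmdline. This is an editorial illustration; the function names are hypothetical, and the reading of -1 as "per-chip driver default" and 0 as "disabled" is an assumption based on the i915 module parameter description of that era.

```python
import re
from typing import Optional


def rc6_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the i915.i915_enable_rc6 value from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        match = re.fullmatch(r"i915\.i915_enable_rc6=(-?\d+)", token)
        if match:
            # Assumption: if the option is repeated, the last one takes effect.
            value = int(match.group(1))
    return value


def describe_rc6(value: Optional[int]) -> str:
    """Give a short, hedged description of the parsed setting."""
    if value is None:
        return "not set on the command line (driver default applies)"
    if value == 0:
        return "rc6 disabled"
    if value == -1:
        return "per-chip driver default"
    return f"explicit rc6 bitmask {value}"
```

Typical use would be `describe_rc6(rc6_from_cmdline(open("/proc/cmdline").read()))`, recorded alongside each dmesg so that good and bad runs can be told apart later.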
Comment 5 Eugeni Dodonov 2012-06-15 13:06:05 UTC
Could you please try with the patch from http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/11894 ?
Comment 6 Matt Turner 2012-06-16 11:15:21 UTC
It does not help. I applied patches 1/2 and 2/2 and gnome-shell still died the same way. dmesg is the same.
Comment 7 Chris Wilson 2012-07-04 11:09:34 UTC
*** Bug 50652 has been marked as a duplicate of this bug. ***
Comment 8 Chris Wilson 2012-07-04 11:11:01 UTC
Matt, Lorenz and Emmanuel can we have the full details about your system and configuration? Hopefully we can source a system and reproduce...
Comment 9 Matt Turner 2012-07-06 11:49:36 UTC
(In reply to comment #8)
> Matt, Lorenz and Emmanuel can we have the full details about your system and
> configuration? Hopefully we can source a system and reproduce...

My hardware configuration is
 - 2500K
 - ASRock H67M motherboard (http://www.asrock.com/mb/overview.asp?Model=H67M), BIOS 1.50

I don't think I've ever had a configuration where RC6 worked, I just didn't notice that it was broken because it wasn't on by default.

What else can I provide?
Comment 10 Lorenz Huedepohl 2012-07-15 16:22:31 UTC
My hardware:

- Intel Core i7-2600K
- Also an ASRock motherboard, H67M-GE/HT

I noticed that I wrote "HDMI" as my connector type earlier, this is wrong: My monitor is in fact connected via DisplayPort (whether that matters or not :) ).

- Monitor: Dell U2711 via DisplayPort, 2560x1440

Anything else you need?
Comment 11 Daniel Vetter 2012-07-15 17:00:45 UTC
Can you please test the latest version of the drm-intel-next-queued branch, specifically:

commit af7e1ae36b2be174b547f093babe0f8b545898ec
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jul 15 09:42:38 2012 +0100

    drm/i915: Workaround hang with BSD and forcewake on SandyBridge

Maybe we're lucky and this helps.
Comment 12 Lorenz Huedepohl 2012-07-15 20:00:25 UTC
Unfortunately, this didn't cure the problem for me. Interested in another copy of dmesg, i915_error_state and such? dmesg is still full of

"MMIO read or write has been dropped"
Comment 13 Daniel Vetter 2012-07-15 20:58:22 UTC
> --- Comment #12 from Lorenz Huedepohl <carthexis@yahoo.de> 2012-07-15 20:00:25 PDT ---
> Unfortunately, this didn't cure the problem for me. Interested in another copy
> of dmesg, i915_error_state and such? dmesg is still full of
> 
> "MMIO read or write has been dropped"

No, I don't think anything really is new. Thanks for testing this anyway.
Comment 14 Matt Turner 2012-07-16 00:24:12 UTC
Kenneth suggested to me that this might be related to VT-d, though I'm not sure whether it's enabled. The computer is now packed and won't be unpacked for at least 6 weeks, so I can't check. Others with the same problem may want to check.

The fact that Lorenz's motherboard is also an ASRock H67M is quite a coincidence if it's not actually related to the board.
Comment 15 Daniel Vetter 2012-07-16 08:28:10 UTC
Neither DMAR nor IOMMU to be found in the dmesg, so no VT-d.
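Others who want to repeat this check on their own logs could use something like the following sketch. It is an editorial illustration with a hypothetical function name; it only looks for the two strings named above, and an empty result is a hint rather than proof that VT-d is off.

```python
import re
from typing import List

# Case-insensitive search for the two markers mentioned in the comment above.
VTD_PATTERN = re.compile(r"DMAR|IOMMU", re.IGNORECASE)


def vtd_lines(dmesg_text: str) -> List[str]:
    """Return the dmesg lines that mention DMAR or IOMMU."""
    return [line for line in dmesg_text.splitlines() if VTD_PATTERN.search(line)]
```

Feeding it the saved dmesg attachment, for example `vtd_lines(open("dmesg").read())`, and getting an empty list matches the "no VT-d" reading given here.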
Comment 16 Daniel Vetter 2012-08-06 09:30:23 UTC
Can you please retest this with latest drm-intel-fixes?
Comment 17 Matt Turner 2012-08-06 16:24:36 UTC
(In reply to comment #16)
> Can you please retest this with latest drm-intel-fixes?

I will as soon as I can, but that probably won't be until next month. My system is packed up for the move to Oregon.
Comment 18 Lorenz Huedepohl 2012-08-07 20:31:35 UTC
(In reply to comment #16)
> Can you please retest this with latest drm-intel-fixes?

I'm trying c4aed35 at the moment, it seems to be good so far (uptime 2 hours). I will report back in a few days or in case of trouble.


In dmesg, I get entries like these from time to time:

[ 5180.774833] ------------[ cut here ]------------
[ 5180.774860] WARNING: at drivers/gpu/drm/i915/i915_drv.c:527 __gen6_gt_wait_for_fifo+0x96/0xa0 [i915]()
[ 5180.774862] Hardware name: To Be Filled By O.E.M.
[ 5180.774927]  thermal
[ 5180.774928]  thermal_sys
[ 5180.774933] Pid: 2674, comm: Xorg Tainted: G        W    3.5.0-rc5-32-desktop+ #12
[ 5180.774934] Call Trace:
[ 5180.774944]  [<ffffffff8104140a>] warn_slowpath_common+0x7a/0xb0
[ 5180.774948]  [<ffffffff81041455>] warn_slowpath_null+0x15/0x20
[ 5180.774960]  [<ffffffffa0113cc6>] __gen6_gt_wait_for_fifo+0x96/0xa0 [i915]
[ 5180.774971]  [<ffffffffa0114662>] i915_write32+0x92/0x130 [i915]
[ 5180.774991]  [<ffffffffa0155c7c>] ring_write_tail+0x1c/0x20 [i915]
[ 5180.775008]  [<ffffffffa01574f1>] intel_ring_advance+0x41/0x50 [i915]
[ 5180.775023]  [<ffffffffa0158233>] gen6_render_ring_flush+0xc3/0xd0 [i915]
[ 5180.775036]  [<ffffffffa0125ea9>] i915_gem_flush_ring.part.23+0x39/0x100 [i915]
[ 5180.775048]  [<ffffffffa0126a64>] i915_gem_flush_ring+0x14/0x20 [i915]
[ 5180.775061]  [<ffffffffa012bec3>] i915_gem_execbuffer_move_to_gpu+0x1b3/0x200 [i915]
[ 5180.775074]  [<ffffffffa012ca81>] i915_gem_do_execbuffer.isra.11+0x621/0x970 [i915]
[ 5180.775086]  [<ffffffffa012d295>] i915_gem_execbuffer2+0x95/0x260 [i915]
[ 5180.775098]  [<ffffffffa00af404>] drm_ioctl+0x444/0x510 [drm]
[ 5180.775103]  [<ffffffff8116245a>] ? do_sync_read+0xca/0x110
[ 5180.775116]  [<ffffffffa012d200>] ? i915_gem_execbuffer+0x430/0x430 [i915]
[ 5180.775121]  [<ffffffff81174215>] do_vfs_ioctl+0x85/0x300
[ 5180.775124]  [<ffffffff81162df8>] ? vfs_read+0x108/0x170
[ 5180.775128]  [<ffffffff81174521>] sys_ioctl+0x91/0xa0
[ 5180.775132]  [<ffffffff815b3779>] system_call_fastpath+0x16/0x1b
[ 5180.775135] ---[ end trace b47a98054c149157 ]---
[ 5180.780250] ------------[ cut here ]------------

However, without an apparent visual effect :)

Kind regards,
  Lorenz
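To judge how often these warnings fire before a crash, the dmesg output can be scanned for the WARNING header shown in the trace above. The following is an editorial sketch, not part of the original report; the function name is hypothetical and the regular expression assumes the exact header layout quoted in this comment, which may differ between kernel versions.

```python
import re
from typing import List

# Matches lines shaped like:
# [ 5180.774860] WARNING: at drivers/.../i915_drv.c:527 __gen6_gt_wait_for_fifo+0x96/0xa0 [i915]()
WARN_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+WARNING: at \S+ __gen6_gt_wait_for_fifo\b"
)


def fifo_warning_times(dmesg_text: str) -> List[float]:
    """Return the kernel timestamps (seconds) of each gtfifo WARNING header."""
    return [float(m.group(1)) for m in WARN_RE.finditer(dmesg_text)]
```

Only the WARNING header is counted, so the matching call-trace line for the same function does not inflate the total. Comparing the last timestamp with the time of the hang gives a rough idea of whether the warnings precede the crash.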
Comment 19 Lorenz Huedepohl 2012-08-07 21:06:35 UTC
Sorry, it didn't last that long, it crashed a few minutes after my last report :(

I attach dmesg, Xorg.log and i915_error_state again, as dmesg shows different entries than the other times (see comment 18).
Comment 20 Lorenz Huedepohl 2012-08-07 21:08:15 UTC
Created attachment 65251 [details]
/sys/kernel/debug/dri/0/i915_error_state after crash
Comment 21 Lorenz Huedepohl 2012-08-07 21:09:00 UTC
Created attachment 65252 [details]
dmesg before crash
Comment 22 Lorenz Huedepohl 2012-08-07 21:09:25 UTC
Created attachment 65253 [details]
dmesg after crash
Comment 23 Lorenz Huedepohl 2012-08-07 21:09:49 UTC
Created attachment 65254 [details]
Xorg.0.log
Comment 24 Chris Wilson 2012-08-13 17:55:16 UTC
*** Bug 53186 has been marked as a duplicate of this bug. ***
Comment 25 igaldino 2012-10-02 23:30:21 UTC
I'm glad to help on this one. My Bug 53186 was related to this one.

I disabled rc6 and so far nothing. What needs to be done?

What do we need to test? Please let me know.

Thanks.
Comment 26 Matt Turner 2012-10-12 19:39:59 UTC
I'm still able to reproduce this bug, but it's gotten much harder to do after updating from Gnome 3.2 to 3.4.

I updated my kernel and userspace and on Gnome 3.2 start up, the GPU would reset completely reproducibly. I updated to Gnome 3.4 and the GPU no longer hangs on reset, but after a long time of use, I will get dropped MMIO writes and the GPU will reset. I think I've had to leave the system on for more than a day to see this.

I tried Ben's 450mV work-around, but it didn't help since my system wasn't affected. I tried the "drm/i915: Fix GT_MODE default value" patch, but it didn't help (my system did /not/ have the doc-suggested bits set in GT_MODE). I tried to update my BIOS (from v1.50 to v2.10) but my system's chipset is revision B2 which is apparently too old to update with a v2.10 BIOS that supports Ivy Bridge.

I have not tried disabling contexts yet, but since I have experienced this RC6 bug ever since RC6 was first enabled by default (~kernel 3.3?) and contexts were only implemented in 3.6, I highly doubt they're involved. I'll test with contexts disabled in any case.

What more can I try? It's significantly more difficult now that Gnome 3.4 doesn't reliably trigger the reset.
Comment 27 Chris Wilson 2012-12-09 15:40:28 UTC
*** Bug 55349 has been marked as a duplicate of this bug. ***
Comment 28 Chris Wilson 2012-12-12 09:27:24 UTC
*** Bug 52424 has been marked as a duplicate of this bug. ***
Comment 29 Chris Wilson 2012-12-12 09:28:29 UTC
*** Bug 52945 has been marked as a duplicate of this bug. ***
Comment 30 Stefan 2012-12-28 18:56:18 UTC
Hello all, I used to see very similar behaviour on my Sandy Bridge box. Interestingly I'm also using an Asrock Board (H67M-GE). That indicates that it might be a hardware issue. I flashed the latest BIOS version provided by Asrock (V 2.10, used to have 1.40 or 1.50 before, don't remember exactly) and the issue went away for me, hasn't ever re-appeared since.

If I remember correctly, the BIOS changelog mentioned something about fixing the voltage settings of the GPU for the change to version 1.60. Asrock's documentation is rather poor unfortunately, but searching the web for "h67m-ge igpu voltage" yields several hits with descriptions of the BIOS update:
"Modify default setting for IGPU voltage."

As the bug seems to appear only with RC6 enabled there might be a connection here.
If there's no new BIOS available for other boards, manually changing the IGPU voltage in the BIOS/UEFI might also work around the issue.

Maybe that helps someone.

Regards,
Stefan
Comment 31 Chris Wilson 2012-12-28 19:07:05 UTC
That sounds similar to:

commit 31643d54a739382626c27c0f2a12b3bbc22d1a38
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Wed Sep 26 10:34:01 2012 -0700

    drm/i915: Workaround to bump rc6 voltage to 450
    
    BIOS should be setting the minimum voltage for rc6 to be 450mV. Old or
    buggy BIOSen may not be doing this, so we correct it for them. Ideally
    customers should update the BIOS as only it would know the optimal
    values for the platform, so we leave that fact as a DRM_ERROR for the
    user to see.

in 3.8. The people we asked to test that patch continued to report issues afterwards, so maybe we need to boost that value a little more, or maybe we didn't have the same root cause.

What would be useful would be a debugfs entry for us to see what values were changed in the BIOS.
Comment 32 Matt Turner 2013-01-07 02:56:44 UTC
(In reply to comment #30)
> Hello all, I used to see very similar behaviour on my Sandy Bridge box.
> Interestingly I'm also using an Asrock Board (H67M-GE). That indicates that
> it might be a hardware issue. I flashed the latest BIOS version provided by
> Asrock (V 2.10, used to have 1.40 or 1.50 before, don't remember exactly)
> and the issue went away for me, hasn't ever re-appeared since.
> 
> If I remember correctly, the BIOS changelog mentioned something about fixing
> the voltage settings of the GPU for the change to version 1.60. Asrock's
> documentation is rather poor unfortunately, but searching the web for
> "h67m-ge igpu voltage" yields several hits with descriptions of the BIOS
> update:
> "Modify default setting for IGPU voltage."
> 
> As the bug seems to appear only with RC6 enabled there might be a connection
> here.
> If there's no new BIOS available for other boards, manually changing the IGPU
> voltage in the BIOS/UEFI might also work around the issue.
> 
> Maybe that helps someone.
> 
> Regards,
> Stefan

Good call! ASRock's website doesn't appear to have direct links to older BIOS versions, but if you guess the links, these versions turn up:

http://download.asrock.com/bios/1155/H67M(1.60)WIN.zip
http://download.asrock.com/bios/1155/H67M(1.70)WIN.zip
http://download.asrock.com/bios/1155/H67M(1.80)WIN.zip

So if someone has an older revision B2 chipset they can try these versions. I RMA'd my board (given that it /was/ recalled because of the SATA issues) and got a new B3 chipset that supports Ivy Bridge and has BIOS v2.10.

No more rc6 failures with this board.
Comment 33 Daniel Vetter 2013-01-07 10:28:11 UTC
Ok, sounds like the original issue was a combination of busted hw/bios, and we're robbed of any chances to do further debug :( Thanks for reporting, I'll close this now.
Comment 34 Jari Tahvanainen 2016-10-07 05:27:48 UTC
Really closing.
