Summary: | [SNB rc6] drop MMIO writes leading to GPU hang | |
---|---|---|---
Product: | DRI | Reporter: | Matt Turner <mattst88>
Component: | DRM/Intel | Assignee: | Daniel Vetter <daniel>
Status: | CLOSED WORKSFORME | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | normal | |
Priority: | highest | CC: | ben, carthexis, chris, daniel, eugeni, huax.lu, igaldino, jbarnes, peter.alfredsen
Version: | XOrg git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | | |
i915 platform: | | i915 features: |
Attachments: |
|
Created attachment 62427 [details]
dmesg
Hmm, that is worrying. It really does look like it is dropping critical writes whilst programming the GPU. Any chance that this is bisectable? If you bisect this, please use the i915_error_state dump/full gpu hang as an indicator - we've added the gtfifo WARN in 3.4. Patches are 67a3744f7515edda988 and ee64cbdbf6170679881 if you're curious.

This probably isn't the news you wanted to hear: it seems related to rc6. When I boot with i915.i915_enable_rc6=-1 I get GPU hangs when switching between windows in gnome-shell. When I boot with i915.i915_enable_rc6=0, I cannot reproduce the hangs. I tried to bisect between v3.3 and v3.4 and the offending commit was simply enabling rc6 by default, so I'll have to redo the bisection with rc6 enabled, but I'm not sure if I've ever had rc6 stable on my system.

Could you please try with the patch from http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/11894 ?

It does not help. I applied patches 1/2 and 2/2 and gnome-shell still died the same way. dmesg is the same.

*** Bug 50652 has been marked as a duplicate of this bug. ***

Matt, Lorenz and Emmanuel can we have the full details about your system and configuration? Hopefully we can source a system and reproduce...

(In reply to comment #8)
> Matt, Lorenz and Emmanuel can we have the full details about your system and
> configuration? Hopefully we can source a system and reproduce...

My hardware configuration is
- 2500K
- ASRock H67M motherboard (http://www.asrock.com/mb/overview.asp?Model=H67M), BIOS 1.50

I don't think I've ever had a configuration where RC6 worked, I just didn't notice that it was broken because it wasn't on by default. What else can I provide?

My hardware:
- Intel Core i7-2600K
- Also an ASRock motherboard, H67M-GE/HT
- Monitor: Dell U2711 via DisplayPort, 2560x1440

I noticed that I wrote "HDMI" as my connector type earlier, this is wrong: My monitor is in fact connected via DisplayPort (whether that matters or not :) ).

Anything else you need?

Can you please test the latest version of the drm-intel-next-queued branch, specifically:

commit af7e1ae36b2be174b547f093babe0f8b545898ec
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Jul 15 09:42:38 2012 +0100

    drm/i915: Workaround hang with BSD and forcewake on SandyBridge

Maybe we're lucky and this helps.

Unfortunately, this didn't cure the problem for me. Interested in another copy of dmesg, i915_error_state and such? dmesg is still full of

"MMIO read or write has been dropped"

> --- Comment #12 from Lorenz Huedepohl <carthexis@yahoo.de> 2012-07-15 20:00:25 PDT ---
> Unfortunately, this didn't cure the problem for me. Interested in another copy
> of dmesg, i915_error_state and such? dmesg is still full of
>
> "MMIO read or write has been dropped"
No, I don't think anything really is new. Thanks for testing this anyway.
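For anyone comparing kernels or boot parameters (for example i915.i915_enable_rc6=0 versus the default), it can help to put a number on "dmesg is still full of" the dropped-MMIO message. The following is a minimal sketch, assuming the dmesg output has been saved to a text file and that each occurrence sits on its own line; the function name and the sample lines are made up for illustration and are not part of any driver or tool.

```python
# Hypothetical helper: count dropped-MMIO messages in saved dmesg text.
# The message text is the one quoted in this report; everything else
# (function name, sample lines) is an assumption for illustration.
DROPPED_MMIO = "MMIO read or write has been dropped"


def count_dropped_mmio(dmesg_text: str) -> int:
    """Return the number of lines that contain the dropped-MMIO message."""
    return sum(1 for line in dmesg_text.splitlines() if DROPPED_MMIO in line)


if __name__ == "__main__":
    # Made-up sample lines; real input would come from a saved dmesg file.
    sample = (
        "[  100.000001] some unrelated line\n"
        "[  100.000002] MMIO read or write has been dropped 1\n"
        "[  100.000003] MMIO read or write has been dropped 2\n"
    )
    print(count_dropped_mmio(sample))
```

Running the same count on logs from an rc6-enabled and an rc6-disabled boot gives a rough comparison, though a non-zero count alone does not prove that a full GPU hang occurred.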
Kenneth suggested to me that this might be related to VT-d, which I'm not sure is enabled. The computer is now packed and won't be unpacked for at least 6 weeks, so I can't check. Others with the same problem may want to check. The fact that Lorenz's motherboard is also an ASRock H67M is quite a coincidence if it's not actually related to the board.

Neither DMAR nor IOMMU to be found in the dmesg, so no VT-d.

Can you please retest this with latest drm-intel-fixes?

(In reply to comment #16)
> Can you please retest this with latest drm-intel-fixes?

I will as soon as I can, but that probably won't be until next month. My system is packed up for the move to Oregon.

(In reply to comment #16)
> Can you please retest this with latest drm-intel-fixes?

I'm trying c4aed35 at the moment, it seems to be good so far (uptime 2 hours). I will report back in a few days or in case of trouble. In dmesg, I get entries like these from time to time:

[ 5180.774833] ------------[ cut here ]------------
[ 5180.774860] WARNING: at drivers/gpu/drm/i915/i915_drv.c:527 __gen6_gt_wait_for_fifo+0x96/0xa0 [i915]()
[ 5180.774862] Hardware name: To Be Filled By O.E.M.
[ 5180.774927] thermal
[ 5180.774928] thermal_sys
[ 5180.774933] Pid: 2674, comm: Xorg Tainted: G W 3.5.0-rc5-32-desktop+ #12
[ 5180.774934] Call Trace:
[ 5180.774944] [<ffffffff8104140a>] warn_slowpath_common+0x7a/0xb0
[ 5180.774948] [<ffffffff81041455>] warn_slowpath_null+0x15/0x20
[ 5180.774960] [<ffffffffa0113cc6>] __gen6_gt_wait_for_fifo+0x96/0xa0 [i915]
[ 5180.774971] [<ffffffffa0114662>] i915_write32+0x92/0x130 [i915]
[ 5180.774991] [<ffffffffa0155c7c>] ring_write_tail+0x1c/0x20 [i915]
[ 5180.775008] [<ffffffffa01574f1>] intel_ring_advance+0x41/0x50 [i915]
[ 5180.775023] [<ffffffffa0158233>] gen6_render_ring_flush+0xc3/0xd0 [i915]
[ 5180.775036] [<ffffffffa0125ea9>] i915_gem_flush_ring.part.23+0x39/0x100 [i915]
[ 5180.775048] [<ffffffffa0126a64>] i915_gem_flush_ring+0x14/0x20 [i915]
[ 5180.775061] [<ffffffffa012bec3>] i915_gem_execbuffer_move_to_gpu+0x1b3/0x200 [i915]
[ 5180.775074] [<ffffffffa012ca81>] i915_gem_do_execbuffer.isra.11+0x621/0x970 [i915]
[ 5180.775086] [<ffffffffa012d295>] i915_gem_execbuffer2+0x95/0x260 [i915]
[ 5180.775098] [<ffffffffa00af404>] drm_ioctl+0x444/0x510 [drm]
[ 5180.775103] [<ffffffff8116245a>] ? do_sync_read+0xca/0x110
[ 5180.775116] [<ffffffffa012d200>] ? i915_gem_execbuffer+0x430/0x430 [i915]
[ 5180.775121] [<ffffffff81174215>] do_vfs_ioctl+0x85/0x300
[ 5180.775124] [<ffffffff81162df8>] ? vfs_read+0x108/0x170
[ 5180.775128] [<ffffffff81174521>] sys_ioctl+0x91/0xa0
[ 5180.775132] [<ffffffff815b3779>] system_call_fastpath+0x16/0x1b
[ 5180.775135] ---[ end trace b47a98054c149157 ]---
[ 5180.780250] ------------[ cut here ]------------

However, without an apparent visual effect :)

Kind regards,
Lorenz

Sorry, it didn't last that long, it crashed a few minutes after my last report :( I attach dmesg, Xorg.log and i915_error_state again, as dmesg shows different entries than the other times (see comment 18).

Created attachment 65251 [details]
/sys/kernel/debug/dri/0/i915_error_state after crash
Created attachment 65252 [details]
dmesg before crash
Created attachment 65253 [details]
dmesg after crash
Created attachment 65254 [details]
Xorg.0.log
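Two of the dmesg checks mentioned in the comments above can be scripted for others who hit the same problem: looking for DMAR or IOMMU mentions (their absence was taken above to mean no VT-d), and picking out the gtfifo WARN raised from __gen6_gt_wait_for_fifo. This is only a sketch under assumptions: the helper names are hypothetical, the matching is plain substring search on saved dmesg text, and a missing DMAR/IOMMU line is a hint rather than proof that VT-d is off.

```python
from typing import List


def mentions_vtd(dmesg_text: str) -> bool:
    """Return True if the log mentions DMAR or IOMMU (case-insensitive)."""
    upper = dmesg_text.upper()
    return "DMAR" in upper or "IOMMU" in upper


def fifo_wait_warnings(dmesg_text: str) -> List[str]:
    """Return the WARNING lines raised from __gen6_gt_wait_for_fifo."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if "WARNING:" in line and "__gen6_gt_wait_for_fifo" in line
    ]


if __name__ == "__main__":
    # Two lines taken from the trace quoted in this report.
    sample = (
        "[ 5180.774860] WARNING: at drivers/gpu/drm/i915/i915_drv.c:527 "
        "__gen6_gt_wait_for_fifo+0x96/0xa0 [i915]()\n"
        "[ 5180.774960] [<ffffffffa0113cc6>] "
        "__gen6_gt_wait_for_fifo+0x96/0xa0 [i915]\n"
    )
    print(len(fifo_wait_warnings(sample)), mentions_vtd(sample))
```

Only the WARNING header line is counted, not the call-trace entry that repeats the function name, so each warning is reported once.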
*** Bug 53186 has been marked as a duplicate of this bug. ***

I'm glad to help on this one. My Bug 53186 was related to this one. I disabled rc6 and so far nothing. What needs to be done? What do we need to test? Please let me know. Thanks.

I'm still able to reproduce this bug, but it's gotten much harder to do after updating from Gnome 3.2 to 3.4. I updated my kernel and userspace and on Gnome 3.2 start up, the GPU would reset completely reproducibly. I updated to Gnome 3.4 and the GPU no longer hangs on reset, but after a long time of use, I will get dropped MMIO writes and the GPU will reset. I think I've had to leave the system on for more than a day to see this.

I tried Ben's 450mV work-around, but it didn't help since my system wasn't affected.

I tried the "drm/i915: Fix GT_MODE default value" patch, but it didn't help (my system did /not/ have the doc-suggested bits set in GT_MODE).

I tried to update my BIOS (from v1.50 to v2.10) but my system's chipset is revision B2 which is apparently too old to update with a v2.10 BIOS that supports Ivy Bridge.

I have not tried disabling contexts yet, but since I have experienced this RC6 bug ever since RC6 was first enabled by default (~kernel 3.3?) and contexts were only implemented in 3.6, I highly doubt they're involved. I'll test with contexts disabled in any case.

What more can I try? It's significantly more difficult now that Gnome 3.4 doesn't reliably trigger the reset.

*** Bug 55349 has been marked as a duplicate of this bug. ***

*** Bug 52424 has been marked as a duplicate of this bug. ***

*** Bug 52945 has been marked as a duplicate of this bug. ***

Hello all, I used to see very similar behaviour on my Sandy Bridge box. Interestingly I'm also using an Asrock Board (H67M-GE). That indicates that it might be a hardware issue. I flashed the latest BIOS version provided by Asrock (V 2.10, used to have 1.40 or 1.50 before, don't remember exactly) and the issue went away for me, hasn't ever re-appeared since.

If I remember correctly, the BIOS changelog mentioned something about fixing the voltage settings of the GPU for the change to version 1.60. Asrock's documentation is rather poor unfortunately, but searching the web for "h67m-ge igpu voltage" yields several hits with descriptions of the BIOS update:
"Modify default setting for IGPU voltage."

As the bug seems to appear only with RC6 enabled there might be a connection here.
If there's no new BIOS available for other boards, manually changing IGPU in the BIOS/UEFI might also work around the issue maybe.

Maybe that helps someone.

Regards,
Stefan

That sounds similar to:

commit 31643d54a739382626c27c0f2a12b3bbc22d1a38
Author: Ben Widawsky <ben@bwidawsk.net>
Date: Wed Sep 26 10:34:01 2012 -0700

    drm/i915: Workaround to bump rc6 voltage to 450

    BIOS should be setting the minimum voltage for rc6 to be 450mV. Old or buggy BIOSen may not be doing this, so we correct it for them. Ideally customers should update the BIOS as only it would know the optimal values for the platform, so we leave that fact as a DRM_ERROR for the user to see.

in 3.8. The people we asked to test that patch continued to report issues afterwards, so maybe we need to boost that value a little more, or maybe we didn't have the same root cause. What would be useful would be a debugfs entry for us to see what values were changed in the BIOS.

(In reply to comment #30)
> Hello all, I used to see very similar behaviour on my Sandy Bridge box.
> Interestingly I'm also using an Asrock Board (H67M-GE). That indicates that
> it might be a hardware issue. I flashed the latest BIOS version provided by
> Asrock (V 2.10, used to have 1.40 or 1.50 before, don't remember exactly)
> and the issue went away for me, hasn't ever re-appeared since.
>
> If I remember correctly, the BIOS changelog mentioned something about fixing
> the voltage settings of the GPU for the change to version 1.60. Asrock's
> documentation is rather poor unfortunately, but searching the web for
> "h67m-ge igpu voltage" yields several hits with descriptions of the BIOS
> update:
> "Modify default setting for IGPU voltage."
>
> As the bug seems to appear only with RC6 enabled there might be a connection
> here.
> If there's no new BIOS available for other boards, manually changing IGPU in
> the BIOS/UEFI might also work around the issue maybe.
>
> Maybe that helps someone.
>
> Regards,
> Stefan

Good call! ASRock's website doesn't appear to have direct links to older BIOS versions, but if you guess the links, it turns up these versions:

http://download.asrock.com/bios/1155/H67M(1.60)WIN.zip
http://download.asrock.com/bios/1155/H67M(1.70)WIN.zip
http://download.asrock.com/bios/1155/H67M(1.80)WIN.zip

So if someone has an older revision B2 chipset they can try these versions.

I RMA'd my board (given that it /was/ recalled because of the SATA issues) and got a new B3 chipset that supports Ivy Bridge and has BIOS v2.10. No more rc6 failures with this board.

Ok, sounds like the original issue was a combination of busted hw/bios, and we're robbed of any chances to do further debug :( Thanks for reporting, I'll close this now.

Really closing. |
Created attachment 62426 [details]
i915_error_state
kernel commit - 829f51dbd825256197fb2a89705d42ad83f958ef
This started happening sometime after 3.3.