Created attachment 92776 [details] GPU hang log Hi all, Playing movies on my HTPC causes hangs at random intervals. Checking dmesg, I see the request to file a bug report here along with the GPU hang file. I'm running an i3-3240 on Ubuntu 13.10 x64, playing back in XBMC alpha 10 with SOFTWARE playback. Regards, Bjoern
It appears to have hung trying to execute a pageflip.

0x00007eb0: 0x0a000001: MI_DISPLAY_BUFFER_INFO
0x00007eb4: 0x00001e01: dword 1
0x00007eb8: 0x06446000: dword 2
0x00007ebc: 0x00000000: MI_NOOP

which looks consistent with

fence[14] = 6c3d03b06446001
  valid, x-tiled, pitch: 7680, start: 0x06446000, size: 8355840

and

Pinned [33]:
...
06446000 8355840 41 00 0 0 P X dirty uncached (name: 49) (fence: 14)

Can you please attach the full dmesg leading to the hang? Does this happen frequently or was this a one-off event?
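The consistency check above can be reproduced by hand. This is a minimal sketch, assuming the gen6/gen7 fence register layout: valid flag in bit 0, Y-tiling flag in bit 1, start page in bits 31:12, pitch in 128-byte units minus one in bits 43:32, and the last page of the region in bits 63:44. That layout is my reading of the i915 driver and is not stated in this report; `decode_fence` is a hypothetical helper name.

```python
def decode_fence(val):
    """Split a 64-bit fence register value into fields (assumed gen6/gen7 layout)."""
    low = val & 0xFFFFFFFF          # lower dword: start page plus flag bits
    high = val >> 32                # upper dword: last page plus pitch field
    start = low & 0xFFFFF000
    last_page = high & 0xFFFFF000   # address of the final 4 KiB page in the region
    pitch = ((high & 0xFFF) + 1) * 128
    return {
        "valid": bool(low & 1),
        "y_tiled": bool(low & 2),
        "start": start,
        "pitch": pitch,
        "size": last_page + 4096 - start,
    }


# Value taken from the error state quoted above.
fence14 = decode_fence(0x6C3D03B06446001)
print(hex(fence14["start"]), fence14["pitch"], fence14["size"])
```

Under that assumed layout the decoded start, pitch and size agree with the values printed next to fence[14], and the start address equals dword 2 of the flip and the offset of the pinned object.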
Created attachment 92780 [details] dmesg log
Find attached the dmesg log file. I can reproduce this quite consistently. I noticed it yesterday when I watched a movie: a "freeze" came every few minutes (just the video, HDMI audio was fine). It still happens today, even after a reboot.
Theories for why the GPU may be upset:
1. Multiple render response messages
2. Page flip with flip outstanding
3. Forcewake is required
4. The hardware hates us
I need to do a check here on my end. After going to 3.13.0 I see some errors in dmesg (e.g. "factorial" or "conftest") popping up, and even running a 3.11.10 kernel doesn't change this. Not sure, maybe something is wrong with Linux now or with my hardware. It worked perfectly until I went to 3.13.0... I'll run a memtest etc. and will update here asap.
One wicked theory I have is that the introduction of the working SRM is breaking the flips... Can you please test:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 5b7ce3f09681..de70260e50f3 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -8593,7 +8593,7 @@ static int intel_gen7_queue_flip(struct drm_device *dev,
 	len = 4;
 	if (ring->id == RCS)
-		len += 6;
+		len += 4;
 	ret = intel_ring_begin(ring, len);
 	if (ret)
@@ -8614,10 +8614,7 @@ static int intel_gen7_queue_flip(struct drm_device *dev,
 		intel_ring_emit(ring, ~(DERRMR_PIPEA_PRI_FLIP_DONE |
 					DERRMR_PIPEB_PRI_FLIP_DONE |
 					DERRMR_PIPEC_PRI_FLIP_DONE));
-		intel_ring_emit(ring, MI_STORE_REGISTER_MEM(1) |
-				MI_SRM_LRM_GLOBAL_GTT);
-		intel_ring_emit(ring, DERRMR);
-		intel_ring_emit(ring, ring->scratch.gtt_offset + 256);
+		intel_ring_emit(ring, MI_NOOP);
 	}
Can you please 'cat /sys/kernel/debug/dri/0/i915_fbc_status'
Created attachment 93135 [details] [review] One potential idea

I looked through our gen7 page flip code. Everything looks to be according to spec, except that we allow the MI_DISPLAY_FLIP to straddle two cachelines. This patch fixes that. Worth a shot I suppose, even though the hanging flip in the error state didn't hit this. There were flips in the ring that would have hit this though.
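To make the straddling condition concrete, here is a minimal sketch of the check. It assumes 64-byte cachelines and a three-dword flip packet (header plus two dwords, as in the error state decode earlier in this report); `straddles_cacheline` is a hypothetical helper, not a driver function.

```python
CACHELINE_BYTES = 64  # assumed cacheline size
DWORD_BYTES = 4


def straddles_cacheline(offset, num_dwords):
    """Return True if a packet of num_dwords starting at byte offset crosses a cacheline boundary."""
    first_line = offset // CACHELINE_BYTES
    last_line = (offset + num_dwords * DWORD_BYTES - 1) // CACHELINE_BYTES
    return first_line != last_line


# The flip in the error state starts at ring offset 0x7eb0 and fits in one cacheline.
print(straddles_cacheline(0x7EB0, 3))
# The same packet emitted two dwords later would cross into the next cacheline.
print(straddles_cacheline(0x7EB8, 3))
```

This matches the observation that the hanging flip itself did not straddle, while a flip at a different ring offset could.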
Chris, Ville: Neither patch helped. Around once a minute 20-30 frames are dropped in one go. I'm on git 3.13.0 and just applied your changes to that one; compared to my code, the lines you are changing are around 300 lines off... I'll do a new git pull and then try again. "FBC unsupported on this chipset" is what I get for i915_fbc_status.
I did more testing with a clean install of Ubuntu x64 13.10, the usual updates and an XBMC nightly. Conclusion: XBMC runs fine without any issues as long as I don't use "bitstream" audio. Once bitstreaming is enabled and ALSA is used directly, the skipped frames occur. So what I will do now is take Ville's patch suggestion first and see where that brings me. If I don't get this error again then I'll test again without it. Then I'll take up Chris's suggestion if the issue is not resolved. Once I know something in this regard I'll update you.
Starting to see multiple sightings in Ubuntu. We'll have to queue up a revert of RCS flips unless we can find the answer.
*** Bug 74569 has been marked as a duplicate of this bug. ***
Created attachment 93513 [details] [review] Frob FORCEWAKE around RCS flips When in doubt, tell the GPU not to go to sleep.
Created attachment 93514 [details] yet another crash dump

I'm affected as well. Running gentoo, happened with kernel 3.11, 3.12, 3.13 for sure, video-intel 2.21.15, 2.99.907, 2.99.909 (all with SNA enabled), mesa 9.2.5

dmesg:
[ 763.125484] [drm] stuck on render ring
[ 763.125485] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 763.125485] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 763.125486] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 763.125486] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 763.125486] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.

~ # cat /sys/kernel/debug/dri/0/i915_fbc_status
FBC disabled: multiple pipes are enabled
One question for everybody: Do you only see this in multi-monitor setups?
It happens for me on a multi-monitor setup, yes.
The multiple simultaneous render response messages theory is flawed; the original crash dump hung with only a single active pipe.
Created attachment 93534 [details] crash dump when using patch from comment #13

Tried the patch from comment #13. Apparently it didn't solve the issue, but there is a notable change: usually I was able to spot the hang because everything freezes and only the mouse moves. Often, for some very weird reason, Firefox fonts also get corrupted: some Latin letter is replaced with a non-Latin one or some other symbol, and the only fix is to restart Firefox. This time only the latter happened. Luckily I checked dmesg and saw the "stuck at render ring" notification.
(In reply to comment #18)
> Created attachment 93534 [details]
> crash dump when using patch from comment #13
>
> Tried patch from comment #13 . Apparently it didn't solved the issue,

That's actually reassuring. The crash dump shows that the patch is working and the GPU is awake, so forcewake is definitely not an issue here.

> but
> there is some notable change: usually I was able to spot the hang because
> everything freeze, only the mouse moves. Eventually and often, for some very
> weird reason, firefox fonts are corrupted, some latin letter is replaced
> with non latin one or anyway some other symbol, the only fix is to restart
> firefox. This time just the latter happened. Luckly I checked dmesg and I
> saw the "stuck at render ring" notification.

That just sounds like the usual dangers with a hung gpu and discarding work before resetting. (We have plans to fix it.)
Enrico, if you have the opportunity can you try Ville's patch from comment 8?
(In reply to comment #20)
> Enrico, if you have the opportunity can you try Ville's patch from comment 8?

Hi Chris, I compiled and ran the 3.13.1 kernel with the named patch applied. For now everything is ok, but I won't declare victory yet. This bug is fairly hard to reproduce in my case. Usually it never happens more than once or twice a day, so a single day without a crash can be a simple statistical fluctuation. I'll report back on Monday. As I said, it happens mostly (only?) when I use my dual monitor setup, and during the weekend I don't use it. Cross your fingers!
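The "statistical fluctuation" point can be put in rough numbers. This is only a sketch, assuming hangs arrive independently at a constant average rate (a Poisson model, which is my assumption and not something measured in this report); the rate of 1.5 per day is just the midpoint of "once or twice a day".

```python
import math


def prob_no_hang(hangs_per_day, days):
    """Probability of seeing zero hangs over the given number of days under a Poisson model."""
    return math.exp(-hangs_per_day * days)


def days_for_confidence(hangs_per_day, max_false_pass=0.05):
    """Hang-free days needed before a clean run is unlikely to be chance alone."""
    return -math.log(max_false_pass) / hangs_per_day


print(round(prob_no_hang(1.5, 1), 3))
print(round(days_for_confidence(1.5), 2))
```

Under these assumptions an unpatched kernel would still get through a single day without a hang roughly one time in five, so about two full days of dual-monitor use are needed before a clean run says much about the patch.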
Well, no hangs so far! I downloaded the 3.13.2 kernel, applied the patch again and compiled it. From now on I'll use this instead of 3.13.1. There is one possible difference compared to running without the patch, though I'm not sure at all, and if it exists it is quite small: Firefox rendering (or something else inside the app) sometimes freezes for a fraction of a second (about half a second). Before the patch, the hang would occasionally happen during such short, temporary freezes. As I said, there is no GPU hang now and dmesg is always 100% clear of drm and i915 messages, but those micro-freezes might be a little more frequent with the patch. Again, I want to stress that this is very hard to quantify, so feel free to simply ignore it. It might just be that I'm paying a lot more attention than usual to rendering times in order to spot a hang.
Assigning to Ville so he can submit the patch. To avoid bikeshedding this to death: I prefer if we add a new intel_ring_begin_cacheline_safe or so which encapsulates the logic. And obviously puts a WARN_ON if the requested length is bigger than 1 cacheline ;-)
commit f66fab8e1cd6b3127ba4c5c0d11539fbe1de1e36
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date: Tue Feb 11 19:52:06 2014 +0200

    drm/i915: Prevent MI_DISPLAY_FLIP straddling two cachelines on IVB

    According to BSpec the entire MI_DISPLAY_FLIP packet must be contained in a single cacheline. Make sure that happens.

    v2: Use intel_ring_begin_cacheline_safe()
    v3: Use intel_ring_cacheline_align() (Chris)

    Cc: Bjoern C <lkml@call-home.ch>
    Cc: Alexandru DAMIAN <alexandru.damian@intel.com>
    Cc: Enrico Tagliavini <enrico.tagliavini@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=74053
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
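One way to guarantee that a packet sits inside a single cacheline is to pad the ring with MI_NOOPs up to the next cacheline boundary before emitting it. The sketch below illustrates that arithmetic only; it is not the code from the commit. It assumes 64-byte cachelines, a dword-aligned tail, and a hypothetical helper name `pad_to_cacheline`.

```python
CACHELINE_BYTES = 64  # assumed cacheline size
DWORD_BYTES = 4


def pad_to_cacheline(tail):
    """Return (noop_dwords, new_tail) so that new_tail is cacheline aligned.

    tail is the current ring write offset in bytes and must be dword aligned.
    """
    if tail % DWORD_BYTES:
        raise ValueError("tail must be dword aligned")
    pad_bytes = -tail % CACHELINE_BYTES  # 0 when tail is already aligned
    return pad_bytes // DWORD_BYTES, tail + pad_bytes


# A tail at 0x7eb8 needs two NOOP dwords to reach the boundary at 0x7ec0.
noops, new_tail = pad_to_cacheline(0x7EB8)
print(noops, hex(new_tail))
```

Any packet no longer than one cacheline that starts at the aligned tail cannot straddle, which is why a length check against one cacheline pairs naturally with this padding.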
Thank you very much, the help is much appreciated. Can't wait for the next release! Best regards Enrico
*** Bug 73437 has been marked as a duplicate of this bug. ***
Created attachment 95815 [details] Crash dump with kernel 3.13.6 (including the final patch) Hi There. Unfortunately this doesn't look solved. Had 2 hangs this week. This is better than before, but ultimately the issue doesn't look solved :( Attached you can find my last crash dump. Kind regards Enrico
*** Bug 76229 has been marked as a duplicate of this bug. ***
So it looks like this bug should be reopened?
Why? We are tracking the continuing saga in #77104.