Bug 74053 - [ivb] hang on rcs pageflip
Summary: [ivb] hang on rcs pageflip
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high normal
Assignee: Ville Syrjala
QA Contact: Intel GFX Bugs mailing list
Duplicates: 73437 74569 76229
Reported: 2014-01-25 17:16 UTC by Bjoern C
Modified: 2017-07-24 22:56 UTC

Attachments
GPU hang log (322.05 KB, application/x-gzip)
2014-01-25 17:16 UTC, Bjoern C
dmesg log (16.00 KB, application/x-gzip)
2014-01-25 19:32 UTC, Bjoern C
One potential idea (2.13 KB, patch)
2014-01-31 12:49 UTC, Ville Syrjala
Frob FORCEWAKE around RCS flips (2.32 KB, patch)
2014-02-06 09:39 UTC, Chris Wilson
yet another crash dump (2.00 MB, text/plain)
2014-02-06 09:46 UTC, Enrico Tagliavini
crash dump when using patch from comment #13 (2.04 MB, text/plain)
2014-02-06 15:00 UTC, Enrico Tagliavini
Crash dump with kernel 3.13.6 (including the final patch) (2.17 MB, text/plain)
2014-03-14 16:02 UTC, Enrico Tagliavini

Description Bjoern C 2014-01-25 17:16:54 UTC
Created attachment 92776 [details]
GPU hang log

Hi all,

Playing movies on my HTPC causes hangs at random intervals. Checking dmesg, I see the request to file a bug report here along with the GPU hang file.

Running an i3-3240 on Ubuntu 13.10 x64, playing back in XBMC alpha 10 with SOFTWARE playback.

Regards,
Bjoern
Comment 1 Chris Wilson 2014-01-25 17:24:08 UTC
It appears to have hung trying to execute a pageflip.

0x00007eb0:      0x0a000001: MI_DISPLAY_BUFFER_INFO
0x00007eb4:      0x00001e01:    dword 1
0x00007eb8:      0x06446000:    dword 2
0x00007ebc:      0x00000000: MI_NOOP

which looks consistent with

  fence[14] = 6c3d03b06446001
    valid, x-tiled, pitch: 7680, start: 0x06446000, size: 8355840

and

Pinned [33]:
...
  06446000  8355840 41 00 0 0 P X dirty uncached (name: 49) (fence: 14)

Can you please attach the full dmesg leading to the hang? Does this happen frequently or was this a one-off event?
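
The consistency check in the comment above can be reproduced by hand. The following is a minimal sketch, assuming the gen6+/gen7 fence register layout used by the i915 driver of that era (valid flag in bit 0, Y-tiling flag in bit 1, start page in bits 31:12, pitch in 128-byte units minus one in the low bits of the upper dword, last page in bits 63:44) and assuming the second flip dword carries the pitch ORed with the tiling mode. The struct and function names are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a fence register; layout is an assumption, see lead-in. */
struct fence_info {
    bool valid;
    bool y_tiled;
    uint32_t pitch; /* bytes */
    uint32_t start; /* GTT offset of the first page */
    uint32_t size;  /* bytes covered by the fence */
};

static struct fence_info decode_fence(uint64_t val)
{
    struct fence_info f;
    uint32_t lo = (uint32_t)val;
    uint32_t hi = (uint32_t)(val >> 32);
    uint32_t last_page = hi & 0xfffff000u;

    f.valid = (lo & 1u) != 0;
    f.y_tiled = (lo & 2u) != 0;
    f.start = lo & 0xfffff000u;
    f.pitch = ((hi & 0xfffu) + 1u) * 128u;
    f.size = last_page - f.start + 4096u;
    return f;
}

/* Compare flip dwords 1 and 2 (pitch|tiling, base address) with a fence. */
static bool flip_matches_fence(uint32_t dword1, uint32_t dword2,
                               struct fence_info f)
{
    uint32_t flip_pitch = dword1 & ~0x7u;      /* low bits: tiling mode */
    uint32_t flip_base = dword2 & 0xfffff000u; /* page-aligned address */

    return f.valid && flip_pitch == f.pitch && flip_base == f.start;
}

/* Self-check against the values quoted in the comment above. */
static void check_thread_values(void)
{
    struct fence_info f = decode_fence(0x06c3d03b06446001ULL);

    assert(f.valid && !f.y_tiled);
    assert(f.pitch == 7680);
    assert(f.start == 0x06446000u);
    assert(f.size == 8355840);
    assert(flip_matches_fence(0x00001e01u, 0x06446000u, f));
}
```

Calling `check_thread_values()` exercises the decode on fence[14] and the two flip dwords from the error state. Under the assumed layout, the pitch, start address, and size all agree with the decoded line in the dump, which is what "looks consistent" refers to.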
Comment 2 Bjoern C 2014-01-25 19:32:08 UTC
Created attachment 92780 [details]
dmesg log
Comment 3 Bjoern C 2014-01-25 19:33:26 UTC
Find attached the dmesg log file.

I can reproduce this quite consistently. I noticed it yesterday when I watched a movie: a "freeze" came every few minutes (just the video, HDMI audio was fine). It still happens today, even after a reboot.
Comment 4 Chris Wilson 2014-01-26 10:00:33 UTC
Theories for why the GPU may be upset:

1. Multiple render response messages
2. Page flip with flip outstanding
3. Forcewake is required
4. The hardware hates us
Comment 5 Bjoern C 2014-01-26 19:56:21 UTC
I need to do a check here on my end. After going to 3.13.0 I see some errors in dmesg (e.g. "factorial" or "conftest") popping up, and even running a 3.11.10 kernel doesn't change this. Not sure, maybe something is wrong with Linux now or with my hardware. It worked perfectly till I went to 3.13.0...

I'll run a memtest etc. and will update here asap.
Comment 6 Chris Wilson 2014-01-31 10:29:39 UTC
One wicked theory I have is that the introduction of the working SRM is breaking the flips...

Can you please test:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 5b7ce3f09681..de70260e50f3 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -8593,7 +8593,7 @@ static int intel_gen7_queue_flip(struct drm_device *dev,
 
        len = 4;
        if (ring->id == RCS)
-               len += 6;
+               len += 4;
 
        ret = intel_ring_begin(ring, len);
        if (ret)
@@ -8614,10 +8614,7 @@ static int intel_gen7_queue_flip(struct drm_device *dev,
                intel_ring_emit(ring, ~(DERRMR_PIPEA_PRI_FLIP_DONE |
                                        DERRMR_PIPEB_PRI_FLIP_DONE |
                                        DERRMR_PIPEC_PRI_FLIP_DONE));
-               intel_ring_emit(ring, MI_STORE_REGISTER_MEM(1) |
-                               MI_SRM_LRM_GLOBAL_GTT);
-               intel_ring_emit(ring, DERRMR);
-               intel_ring_emit(ring, ring->scratch.gtt_offset + 256);
+               intel_ring_emit(ring, MI_NOOP);
        }
Comment 7 Chris Wilson 2014-01-31 12:32:56 UTC
Can you please 'cat /sys/kernel/debug/dri/0/i915_fbc_status'
Comment 8 Ville Syrjala 2014-01-31 12:49:47 UTC
Created attachment 93135 [details] [review]
One potential idea

I tried to look through our gen7 page flip code. Looks like everything's according to spec, except we allow the MI_DISPLAY_FLIP to straddle two cachelines. This patch fixes that. Worth a shot I suppose even if the hanging flip in the error state didn't hit this. There were flips in the ring that would have hit this though.
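
To make the straddling condition concrete, here is a minimal sketch, assuming 64-byte cachelines and dword-aligned ring offsets. It tests whether a packet of a given length crosses a cacheline boundary and how many MI_NOOP dwords would pad the tail up to the next boundary. The helper names are hypothetical; this is not the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE_BYTES 64u /* assumed cacheline size */

/* True if a packet of ndwords starting at byte offset tail spans two
 * cachelines. */
static bool packet_straddles(uint32_t tail, uint32_t ndwords)
{
    uint32_t first_line, last_line;

    assert(ndwords > 0 && (tail % 4u) == 0);
    first_line = tail / CACHELINE_BYTES;
    last_line = (tail + ndwords * 4u - 1u) / CACHELINE_BYTES;
    return first_line != last_line;
}

/* Number of NOOP dwords needed to move tail to the next cacheline start;
 * zero if tail is already aligned. */
static uint32_t noops_to_next_cacheline(uint32_t tail)
{
    uint32_t rem = tail % CACHELINE_BYTES;

    return rem ? (CACHELINE_BYTES - rem) / 4u : 0u;
}
```

For the hung flip in the error state, which starts at 0x7eb0 and has three dwords before the trailing MI_NOOP, `packet_straddles(0x7eb0, 3)` is false, matching the remark that this particular flip did not hit the problem. A flip starting two dwords later, at 0x7eb8, would straddle, and `noops_to_next_cacheline(0x7eb8)` gives 2.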
Comment 9 Bjoern C 2014-02-04 00:14:35 UTC
Chris, Ville: Neither patch helped. About once a minute, 20-30 frames are dropped in one go. I'm on git 3.13.0 and just applied your changes to that tree; compared to my code, your patches are around 300 lines off from where the changes land. I'll do a new git pull and then try again.

"FBC unsupported on this chipset" is what I get for i915_fbc_status.
Comment 10 Bjoern C 2014-02-04 18:53:01 UTC
I did more testing with a clean install of Ubuntu x64 13.10, the usual updates, and an XBMC nightly. Conclusion: XBMC runs fine without any issues as long as I don't use "bitstream" audio. Once bitstreaming is enabled and ALSA is used directly, the skipped frames occur.

So what I will do now is take Ville's patch suggestion first and see where that brings me. If I don't get this error again, I'll test again without it. Then I'll take up Chris's suggestion if the issue is not resolved. Once I know something in this regard I'll update you.
Comment 11 Chris Wilson 2014-02-05 07:31:23 UTC
Starting to see multiple sightings in Ubuntu. We'll have to queue up a revert of RCS flips unless we can find the answer.
Comment 12 Chris Wilson 2014-02-05 14:06:19 UTC
*** Bug 74569 has been marked as a duplicate of this bug. ***
Comment 13 Chris Wilson 2014-02-06 09:39:46 UTC
Created attachment 93513 [details] [review]
Frob FORCEWAKE around RCS flips

When in doubt, tell the GPU not to go to sleep.
Comment 14 Enrico Tagliavini 2014-02-06 09:46:00 UTC
Created attachment 93514 [details]
yet another crash dump

I'm affected as well. Running Gentoo; it happened with kernels 3.11, 3.12, and 3.13 for sure, video-intel 2.21.15, 2.99.907, 2.99.909 (all with SNA enabled), mesa 9.2.5

dmesg:

[  763.125484] [drm] stuck on render ring
[  763.125485] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  763.125485] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  763.125486] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  763.125486] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  763.125486] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.

~ # cat /sys/kernel/debug/dri/0/i915_fbc_status
FBC disabled: multiple pipes are enabled
Comment 15 Chris Wilson 2014-02-06 10:04:49 UTC
One question for everybody: Do you only see this in multi-monitor setups?
Comment 16 Alexandru DAMIAN 2014-02-06 10:10:27 UTC
It happens for me on a multi-monitor setup, yes.
Comment 17 Chris Wilson 2014-02-06 11:38:58 UTC
The multiple simultaneous render response messages theory is flawed; the original crash dump hung with only a single active pipe.
Comment 18 Enrico Tagliavini 2014-02-06 15:00:58 UTC
Created attachment 93534 [details]
crash dump when using patch from comment #13

Tried the patch from comment #13. Apparently it didn't solve the issue, but there is a notable change: usually I was able to spot the hang because everything freezes and only the mouse moves. Often, for some very weird reason, Firefox fonts also end up corrupted: some Latin letters are replaced with non-Latin ones or some other symbol, and the only fix is to restart Firefox. This time just the latter happened. Luckily I checked dmesg and saw the "stuck on render ring" notification.
Comment 19 Chris Wilson 2014-02-06 15:05:54 UTC
(In reply to comment #18)
> Created attachment 93534 [details]
> crash dump when using patch from comment #13
> 
> Tried patch from comment #13 . Apparently it didn't solve the issue,

That's actually reassuring. The crash dump shows that the patch is working and the GPU is awake, so forcewake is definitely not an issue here.

> but
> there is some notable change: usually I was able to spot the hang because
> everything freezes, only the mouse moves. Eventually and often, for some very
> weird reason, firefox fonts are corrupted, some latin letter is replaced
> with non latin one or anyway some other symbol, the only fix is to restart
> firefox. This time just the latter happened. Luckily I checked dmesg and I
> saw the "stuck on render ring" notification.

That just sounds like the usual dangers with a hung GPU and discarding work before resetting. (We have plans to fix it.)
Comment 20 Chris Wilson 2014-02-06 15:07:12 UTC
Enrico, if you have the opportunity can you try Ville's patch from comment 8?
Comment 21 Enrico Tagliavini 2014-02-07 17:29:38 UTC
(In reply to comment #20)
> Enrico, if you have the opportunity can you try Ville's patch from comment 8?

Hi Chris, I compiled and ran the 3.13.1 kernel with the named patch applied. For now everything is OK, but I won't claim victory yet. This bug is fairly hard to reproduce in my case. Usually it never happens more than once or twice a day, so a single day without a crash can be a simple statistical fluctuation. I'll report back on Monday. As I said, it happens mostly (only?) when I use my dual monitor setup, and during the weekend I don't use it.

Cross your fingers!
Comment 22 Enrico Tagliavini 2014-02-10 17:42:53 UTC
Well no hangs for now! I downloaded the 3.13.2 kernel, applied the patch again and compiled it. From now on I'll start using this instead of the 3.13.1.

There is one difference I might be experiencing compared to running without the patch, though I'm not sure at all, and if there is a difference it is quite small: Firefox rendering (or something else inside the app) sometimes freezes for a fraction of a second (half a second or so). It was during such short, temporary freezes that the hang would, rarely, occur. As I said, there is no GPU hang now and dmesg is always 100% clear of drm and i915 messages, but those micro freezes might be a little more frequent with the patch. Again, I want to stress that this is very hard to quantify, so feel free to simply ignore it. It might just be that I'm paying a lot more attention than usual to rendering times in order to spot a hang.
Comment 23 Daniel Vetter 2014-02-11 10:23:57 UTC
Assigning to Ville so he can submit the patch. To avoid bikeshedding this to death: I'd prefer that we add a new intel_ring_begin_cacheline_safe or so which encapsulates the logic. And obviously puts a WARN_ON if the requested length is bigger than 1 cacheline ;-)
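
The shape of such a helper can be sketched on a toy ring buffer. This is only an illustration under stated assumptions: 64-byte cachelines, MI_NOOP encoded as 0 (as in the ring dump earlier in this report), no wrap-around handling, and a stderr message standing in for WARN_ON. The `toy_` names are hypothetical, and the committed fix ended up using intel_ring_cacheline_align() rather than this interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define CACHELINE_DWORDS 16u /* 64 bytes / 4, assumed cacheline size */
#define TOY_RING_DWORDS 256u
#define TOY_NOOP 0u          /* MI_NOOP is all zeroes in the dump */

struct toy_ring {
    uint32_t buf[TOY_RING_DWORDS];
    uint32_t tail; /* next free slot, in dwords */
};

/* Pad with NOOPs so that the next ndwords land in a single cacheline.
 * Returns 0 on success, -1 if the request cannot be satisfied. */
static int toy_ring_begin_cacheline_safe(struct toy_ring *ring,
                                         uint32_t ndwords)
{
    uint32_t used = ring->tail % CACHELINE_DWORDS;
    uint32_t pad = 0;
    uint32_t i;

    if (ndwords == 0 || ndwords > CACHELINE_DWORDS) {
        fprintf(stderr, "request of %u dwords cannot fit one cacheline\n",
                (unsigned)ndwords);
        return -1;
    }
    if (used + ndwords > CACHELINE_DWORDS)
        pad = CACHELINE_DWORDS - used; /* skip to next cacheline start */
    if (ring->tail + pad + ndwords > TOY_RING_DWORDS)
        return -1; /* the toy ring does not wrap */

    for (i = 0; i < pad; i++)
        ring->buf[ring->tail++] = TOY_NOOP;
    return 0;
}

static void toy_ring_emit(struct toy_ring *ring, uint32_t dword)
{
    assert(ring->tail < TOY_RING_DWORDS);
    ring->buf[ring->tail++] = dword;
}
```

With the tail at dword 14, a 4-dword request would cross into the next cacheline, so the sketch first emits two NOOPs and the packet then starts at dword 16. A request that already fits is left unpadded, which is the main difference from an unconditional align-to-cacheline helper.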
Comment 24 Chris Wilson 2014-02-12 18:21:46 UTC
commit f66fab8e1cd6b3127ba4c5c0d11539fbe1de1e36
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Tue Feb 11 19:52:06 2014 +0200

    drm/i915: Prevent MI_DISPLAY_FLIP straddling two cachelines on IVB
    
    According to BSpec the entire MI_DISPLAY_FLIP packet must be contained
    in a single cacheline. Make sure that happens.
    
    v2: Use intel_ring_begin_cacheline_safe()
    v3: Use intel_ring_cacheline_align() (Chris)
    
    Cc: Bjoern C <lkml@call-home.ch>
    Cc: Alexandru DAMIAN <alexandru.damian@intel.com>
    Cc: Enrico Tagliavini <enrico.tagliavini@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=74053
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Comment 25 Enrico Tagliavini 2014-02-13 09:15:13 UTC
Thank you very much, the help is much appreciated. Can't wait for the next release!

Best regards
Enrico
Comment 26 Chris Wilson 2014-02-28 12:06:10 UTC
*** Bug 73437 has been marked as a duplicate of this bug. ***
Comment 27 Enrico Tagliavini 2014-03-14 16:02:58 UTC
Created attachment 95815 [details]
Crash dump with kernel 3.13.6 (including the final patch)

Hi there. Unfortunately this doesn't look solved. I had 2 hangs this week. This is better than before, but ultimately the issue is still there :(

Attached you can find my last crash dump.

Kind regards
Enrico
Comment 28 Chris Wilson 2014-03-16 13:48:09 UTC
*** Bug 76229 has been marked as a duplicate of this bug. ***
Comment 29 Gordon Jin 2014-07-07 08:15:42 UTC
So it looks like this bug should be reopened?
Comment 30 Chris Wilson 2014-07-07 08:22:53 UTC
Why? We are tracking the continuing saga in #77104.

