Bug 91894 - [snb] GPU Hangs since Kernel-4.2 upgrade (global/aliasing ppgtt issue?)
Summary: [snb] GPU Hangs since Kernel-4.2 upgrade (global/aliasing ppgtt issue?)
Status: CLOSED INVALID
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-09-06 13:41 UTC by Armin K
Modified: 2017-07-03 11:01 UTC
CC List: 2 users

See Also:
i915 platform: SNB
i915 features: GPU hang


Attachments
Kernel oops when Epiphany hung Weston while playing youtube video (4.32 KB, text/plain)
2015-09-06 13:43 UTC, Armin K

Description Armin K 2015-09-06 13:41:46 UTC
It hangs as soon as Firefox starts playing any HTML5 video. ickle pointed out that it might be an issue with vaapi, which I can confirm Firefox is using. As a side note, the same thing happens on weston when playing video in epiphany, which also uses gstreamer and gstreamer-vaapi.

Relevant info from dmesg:

Sep 04 07:16:56 krejzi kernel: [drm] stuck on render ring
Sep 04 07:16:56 krejzi kernel: [drm] stuck on blitter ring
Sep 04 07:16:56 krejzi kernel: [drm] GPU HANG: ecode 6:0:0xf4e9fffe, in Xorg [400], reason: Ring hung, action: reset
Sep 04 07:16:56 krejzi kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 04 07:16:56 krejzi kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 04 07:16:56 krejzi kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 04 07:16:56 krejzi kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Sep 04 07:16:56 krejzi kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Sep 04 07:16:56 krejzi kernel: [drm:i915_set_reset_status] *ERROR* gpu hanging too fast, banning!
Sep 04 07:16:56 krejzi kernel: drm/i915: Resetting chip after gpu hang
Sep 04 07:17:20 krejzi kernel: [drm] stuck on render ring
Sep 04 07:17:20 krejzi kernel: [drm] GPU HANG: ecode 6:0:0x87e8fffd, in kwin_x11 [560], reason: Ring hung, action: reset
Sep 04 07:17:20 krejzi kernel: drm/i915: Resetting chip after gpu hang
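
When comparing several hangs it can help to pull the fields out of the "GPU HANG" lines above. The following is a minimal sketch that assumes the line layout seen in this report; the group names `gen` and `ring` are an interpretation of the first two ecode numbers (the leading 6 is consistent with the Gen6 hardware mentioned in Comment 2), not something the log states.

```python
import re

# Pattern for the i915 "GPU HANG" summary line as it appears in this report.
# The names "gen" and "ring" are assumptions about the first two ecode fields.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<process>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)

def parse_gpu_hang(line):
    """Return a dict of fields from a GPU HANG line, or None if it does not match."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("gen", "ring", "pid"):
        fields[key] = int(fields[key])
    return fields

line = ("Sep 04 07:16:56 krejzi kernel: [drm] GPU HANG: ecode 6:0:0xf4e9fffe, "
        "in Xorg [400], reason: Ring hung, action: reset")
hang = parse_gpu_hang(line)
print(hang["process"], hang["pid"], hang["ecode"])
```

The same pattern also matches the plain dmesg form with a `[  298.521713]` timestamp, because it searches for the "GPU HANG" text anywhere in the line.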

/sys/class/drm/card0/error, from two different hangs:

http://www.linuxfromscratch.org/~krejzi/error.log
http://www.linuxfromscratch.org/~krejzi/error2.log
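
Since the kernel log asks for the crash dump to be attached, here is a minimal sketch of copying it out of sysfs into a regular file. The function name and the default output file name are hypothetical, and reading the sysfs path may require root.

```python
from pathlib import Path

def save_error_state(src="/sys/class/drm/card0/error", dst="gpu-error.log"):
    """Copy the i915 error state to a regular file and return its size in bytes.

    Both default paths are illustrative; pass the card path that dmesg reports.
    """
    data = Path(src).read_bytes()   # read the whole dump in one go
    Path(dst).write_bytes(data)
    return len(data)
```

Calling `save_error_state()` right after a hang keeps a copy that survives a reboot and can be attached to the bug.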

Linux-4.2, libdrm-2.4.64, mesa-11.0.0-rc2, xorg-server-1.17.99.901, xf86-video-intel-2.99.917 (git from today, with the UXA acceleration backend; I've tried SNA too, with no difference), libva-1.6.0, libva-intel-driver-1.6.0, gstreamer-1.5.90, gstreamer-vaapi-0.6.0
Comment 1 Armin K 2015-09-06 13:43:48 UTC
Created attachment 118100
Kernel oops when Epiphany hung Weston while playing youtube video

This is what happened when I tried to play an HTML5 video in epiphany on weston.
Comment 2 Armin K 2015-09-06 13:54:25 UTC
I forgot to mention that my system is a laptop with Intel HD 3000 (Sandybridge, Gen6) graphics, which also has a muxless AMD Radeon 6470M GPU. I doubt the secondary GPU is causing any issues, but I thought it was worth mentioning.
Comment 3 Chris Wilson 2015-09-07 08:38:11 UTC
libva did have a bug where it forgot to mark render targets as written, and one of the 4.2 changes is a read-read optimisation that could expose it. A hack like

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index a953d4975b8c..e4786eeca38f 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1033,7 +1033,7 @@ i915_gem_execbuffer_move_to_active(struct list_head *vmas,
                u32 old_write = obj->base.write_domain;
 
                obj->dirty = 1; /* be paranoid  */
-               obj->base.write_domain = obj->base.pending_write_domain;
+               obj->base.write_domain = I915_GEM_DOMAIN_RENDER;
                if (obj->base.write_domain == 0)
                        obj->base.pending_read_domains |= obj->base.read_domains;
                obj->base.read_domains = obj->base.pending_read_domains;

would disable the optimisation and hide the libva bug. Does that fix your hang?
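
To make the effect of that one-line hack concrete, here is a toy model of only the lines visible in the diff. It is a sketch, not the kernel function: the `force_render_write` switch is hypothetical, and the rest of `i915_gem_execbuffer_move_to_active` is not modelled. The domain values are taken from the i915 uAPI header.

```python
# Domain bits as defined in the i915 uAPI header (i915_drm.h).
I915_GEM_DOMAIN_RENDER = 0x00000002
I915_GEM_DOMAIN_SAMPLER = 0x00000004

def move_to_active(read_domains, pending_read_domains, pending_write_domain,
                   force_render_write=False):
    """Toy version of the domain update touched by the diff above.

    Returns the object's new (read_domains, write_domain).
    """
    if force_render_write:
        write_domain = I915_GEM_DOMAIN_RENDER   # the hack: always claim a render write
    else:
        write_domain = pending_write_domain     # stock behaviour: trust userspace
    if write_domain == 0:
        # Read-only use: keep the read domains accumulated so far.
        pending_read_domains |= read_domains
    return pending_read_domains, write_domain

# A buffer that userspace submits without a write domain, e.g. an unmarked render target.
stock = move_to_active(I915_GEM_DOMAIN_SAMPLER, I915_GEM_DOMAIN_RENDER, 0)
hacked = move_to_active(I915_GEM_DOMAIN_SAMPLER, I915_GEM_DOMAIN_RENDER, 0,
                        force_render_write=True)
print(stock, hacked)
```

In the stock case the object ends up with no write domain, so it is treated as read-only; with the hack every object is recorded as written by the render domain, which is what takes the read-read path out of play.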
Comment 4 Armin K 2015-09-07 10:38:53 UTC
Upgrading to libva/libva-intel-driver 1.6.1.pre1 seems to have fixed the issue. Keep the bug open for some time so I can do some more testing.
Comment 5 Armin K 2015-09-09 09:58:08 UTC
It was as I feared. The libva/libva-intel-driver updates didn't fix the problem; it was just luck that it worked at the time. Not only that, but the patch from Comment 3 (backported to 4.2, although I'm not sure I did it correctly) also didn't fix the issue.
Comment 6 Armin K 2015-09-14 12:24:52 UTC
The issue is still present in linux-4.3-rc1.
Comment 7 Armin K 2015-09-14 12:27:46 UTC
[  298.520654] [drm] stuck on render ring
[  298.521713] [drm] GPU HANG: ecode 6:0:0x87e8fffd, in MediaPl~back #4 [1039], reason: Ring hung, action: reset
[  298.521714] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  298.521715] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  298.521716] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  298.521717] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  298.521718] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  298.523448] drm/i915: Resetting chip after gpu hang



http://www.linuxfromscratch.org/~krejzi/error3.log
Comment 8 Chris Wilson 2015-09-14 15:52:57 UTC
Since my guess was wrong, the best option is to do a bisection between 4.1 and 4.2 (which takes about 12 steps). Is that something you could do?
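
The step count follows from bisection halving the candidate range on every build-and-test round. A small sketch of the arithmetic, assuming a linear history; the figure 4096 is an illustrative round number, not the actual commit count between 4.1 and 4.2.

```python
def bisect_steps(candidates):
    """Worst-case number of build-and-test rounds needed to isolate one commit."""
    if candidates < 1:
        raise ValueError("need at least one candidate commit")
    # ceil(log2(candidates)), computed with integers only.
    return (candidates - 1).bit_length()

print(bisect_steps(4096))   # 12
```

Real kernel history contains merges, so the estimate git prints during a bisect can differ slightly from this figure.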
Comment 9 Armin K 2015-09-15 08:02:26 UTC
Now, this is confusing. git bisect says the following:



0875546c5318c85c13d07014af5350e9000bc9e9 is the first bad commit
commit 0875546c5318c85c13d07014af5350e9000bc9e9
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Apr 20 09:04:05 2015 -0700

    drm/i915: Fix up the vma aliasing ppgtt binding
    
    Currently we have the problem that the decision whether ptes need to
    be (re)written is splattered all over the codebase. Move all that into
    i915_vma_bind. This needs a few changes:
    - Just reuse the PIN_* flags for i915_vma_bind and do the conversion
      to vma->bound in there to avoid duplicating the conversion code all
      over.
    - We need to make binding for EXECBUF (i.e. pick aliasing ppgtt if
      around) explicit, add PIN_USER for that.
    - Two callers want to update ptes, give them a PIN_UPDATE for that.
    
    Of course we still want to avoid double-binding, but that should be
    taken care of:
    - A ppgtt vma will only ever see PIN_USER, so no issue with
      double-binding.
    - A ggtt vma with aliasing ppgtt needs both types of binding, and we
      track that properly now.
    - A ggtt vma without aliasing ppgtt could be bound twice. In the
      lower-level ->bind_vma functions hence unconditionally set
      GLOBAL_BIND when writing the ggtt ptes.
    
    There's still a bit room for cleanup, but that's for follow-up
    patches.
    
    v2: Fixup fumbles.
    
    v3: s/PIN_EXECBUF/PIN_USER/ for clearer meaning, suggested by Chris.
    
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
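
The binding rules described in the commit message can be sketched as a small flag conversion. This is a toy model written from the message text, not the kernel source: the bit values are made up for the example, `LOCAL_BIND` is assumed as the name of the ppgtt counterpart to `GLOBAL_BIND`, and the last bullet (ggtt vma without aliasing ppgtt) is not modelled.

```python
# Hypothetical bit values, chosen only for this sketch.
PIN_GLOBAL, PIN_USER, PIN_UPDATE = 0x1, 0x2, 0x4
GLOBAL_BIND, LOCAL_BIND = 0x1, 0x2

def vma_bind(bound, pin_flags):
    """Return (ptes_to_write, new_bound) for one bind request on a vma."""
    wanted = 0
    if pin_flags & PIN_GLOBAL:
        wanted |= GLOBAL_BIND
    if pin_flags & PIN_USER:
        wanted |= LOCAL_BIND
    if pin_flags & PIN_UPDATE:
        wanted |= bound      # caller asked for the existing ptes to be rewritten
    else:
        wanted &= ~bound     # skip bindings that already exist (no double-binding)
    return wanted, bound | wanted

first = vma_bind(0, PIN_USER)                        # first execbuf use
again = vma_bind(first[1], PIN_USER)                 # same request again
both = vma_bind(again[1], PIN_USER | PIN_GLOBAL)     # global mapping added later
update = vma_bind(both[1], PIN_UPDATE)               # rewrite everything bound so far
print(first, again, both, update)
```

The second request writes nothing because the local binding is already tracked, which is the double-binding protection the message refers to.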



However, when I check out that revision and build it, all is fine. It is the two commits after that revision that introduced the problem.

The first one, fa42331b4cd961cecb3f6919116d2e6efeb2334b, didn't introduce a real problem, but a hang happened when I closed the tab where the video was playing, not during playback.

The second one, 4755265977159be0261972da2ba54917765b18ed, introduced the real problem, i.e. hangs all over the place whenever a video was playing and gst-vaapi was in use.
Comment 10 Chris Wilson 2015-09-15 08:26:32 UTC
Can you please try:

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index a953d4975b8c..bbf7d35ca906 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -585,7 +585,7 @@ i915_gem_execbuffer_reserve_vma(struct i915_vma *vma,
        uint64_t flags;
        int ret;
 
-       flags = PIN_USER;
+       flags = PIN_USER | PIN_GLOBAL;
        if (entry->flags & EXEC_OBJECT_NEEDS_GTT)
                flags |= PIN_GLOBAL;
 

on a recent kernel?
Comment 11 Armin K 2015-09-15 10:01:17 UTC
Kernel 4.3-rc1 with the patch from Comment 10 applied still has the issue.
Comment 12 Chris Wilson 2015-09-15 10:14:49 UTC
Hmm. Can you verify that running with 0875546c5318c85c13d07014af5350e9000bc9e9^ (i.e. the commit before the bisect result) is stable (say over the course of a few days)?

The patch you just tested suggests that it is not a result of the lack of aliased GGTT entries, which is perhaps the most obvious effect of the bisected commit.

I spotted one fumble in the bisected commit, but that was fixed in

commit 5e562f1dddfa3242cede5ec49888260a856a9da2
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Thu Apr 30 11:02:31 2015 +0300

    drm/i915: Clear vma->bound on unbinding

in v4.2-rc1.

Do you suspend/resume the system before the hang appears?
Comment 13 Armin K 2015-09-15 12:18:28 UTC
As I already said, even the bisected commit 0875546c5318c85c13d07014af5350e9000bc9e9 gives a stable system. It's two commits after that one that introduced the problem.

I don't suspend or hibernate. It's 100% reproducible on a fresh system. I use KDE Plasma, if that matters, but VAAPI would even hang weston with epiphany running on top of GTK+ Wayland.

The problem happens as soon as I utilize vaapi, no matter how, be it from firefox through gstreamer/gstreamer-vaapi or epiphany through webkit via gstreamer/gstreamer-vaapi.

I can work around the problem by removing gstreamer-vaapi, which confirms that vaapi is the problem.
Comment 14 Chris Wilson 2015-09-15 12:47:21 UTC
(In reply to Armin K from comment #13)
> As I already said, even the bisected commit
> 0875546c5318c85c13d07014af5350e9000bc9e9 gives a stable system. It's two
> commits after that one that introduced the problem.

I know. I just wanted to be absolutely certain that this is the commit we pick apart. Given that you are only sure that hangs start two commits later, that raises the possibility that the issue simply isn't easily reproduced on earlier commits and that the bisect is not definitive.
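
The concern about a non-definitive bisect can be illustrated with a toy binary search in which one genuinely bad commit fails to show the hang during testing. Everything here is hypothetical: a linear history of 16 numbered commits, with the true culprit placed at commit 5.

```python
def bisect_first_bad(commits, looks_bad):
    """Binary-search a linear history whose last commit is known to be bad.

    looks_bad(commit) is the tester's verdict, which may be wrong.
    Returns (reported_first_bad, tests_run).
    """
    lo, hi = 0, len(commits) - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if looks_bad(commits[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is after mid
    return commits[lo], tests

commits = list(range(16))     # hypothetical history, oldest first
TRUE_FIRST_BAD = 5

def reliable(commit):
    return commit >= TRUE_FIRST_BAD

def missed_once(commit):
    # The hang exists at commit 7 but does not show up during that test run.
    return commit >= TRUE_FIRST_BAD and commit != 7

print(bisect_first_bad(commits, reliable))      # (5, 4)
print(bisect_first_bad(commits, missed_once))   # (8, 4)
```

A single missed reproduction moves the reported commit to a later one than the real culprit, which is why a hang that is hard to trigger on earlier commits would undermine the result.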
Comment 15 Armin K 2015-09-15 15:13:54 UTC
As I said, the issue is either always there or not there at all, depending on which revision is picked. I'm not going to revert to an older kernel snapshot for a few days to verify what I've already verified last night and this morning while bisecting.
Comment 16 Jonas Jelten 2015-11-08 12:58:18 UTC
Probably related to bug #92814?
Comment 17 Armin K 2016-08-09 20:38:24 UTC
I don't have the hw to test this anymore and it doesn't seem that anyone else is hitting the issue.
Comment 18 Jari Tahvanainen 2017-07-03 11:01:11 UTC
Closing almost one year old resolved+invalid.

