Bug 80709

Summary: [snb] semaphores deadlock -- testing improved deadlock breaker
Product: DRI
Reporter: Stefan Huber <shuber>
Component: DRM/Intel
Assignee: Rodrigo Vivi <rodrigo.vivi>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
  /sys/class/drm/card0/error (flags: none)
  /sys/class/drm/card0/error from Jul 18 (flags: none)

Description Stefan Huber 2014-06-30 11:15:24 UTC
Created attachment 102009 [details]
/sys/class/drm/card0/error

I am following the instructions printed in the kernel log below:

Jun 30 13:03:30 euklid kernel: [145541.504245] [drm] stuck on render ring
Jun 30 13:03:30 euklid kernel: [145541.504245] [drm] stuck on blitter ring
Jun 30 13:03:30 euklid kernel: [145541.504781] [drm] GPU HANG: ecode 0:0xf4e9fffe, in X [19246], reason: Ring hung, action: reset
Jun 30 13:03:30 euklid kernel: [145541.504781] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jun 30 13:03:30 euklid kernel: [145541.504782] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jun 30 13:03:30 euklid kernel: [145541.504782] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jun 30 13:03:30 euklid kernel: [145541.504782] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jun 30 13:03:30 euklid kernel: [145541.504783] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jun 30 13:03:30 euklid kernel: [145541.504850] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
Jun 30 13:03:31 euklid kernel: [145543.504837] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off


I am running linux kernel 3.15.1 (gentoo-sources-3.15.1) with the following patch applied:

From 4be173813e57c7298103a83155c2391b5b167b4c Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri, 06 Jun 2014 09:22:29 +0000
Subject: drm/i915: Reorder semaphore deadlock check
Comment 1 Chris Wilson 2014-06-30 16:33:57 UTC
Try:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 0edc97f..9e5a295 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2852,7 +2852,7 @@ static int semaphore_passed(struct intel_engine_cs *ring)
 {
        struct drm_i915_private *dev_priv = ring->dev->dev_private;
        struct intel_engine_cs *signaller;
-       u32 seqno, ctl;
+       u32 seqno;
 
        ring->hangcheck.deadlock++;
 
@@ -2860,19 +2860,20 @@ static int semaphore_passed(struct intel_engine_cs *ring)
        if (signaller == NULL)
                return -1;
 
+       printk("%s waiting on %s [recursion depth %d], seqno 0x%x [current 0x%x]\n",
+              ring->name, signaller->name, signaller->hangcheck.deadlock,
+              seqno, signaller->get_seqno(signaller, false));
+
        /* Prevent pathological recursion due to driver bugs */
        if (signaller->hangcheck.deadlock >= I915_NUM_RINGS)
                return -1;
 
-       /* cursory check for an unkickable deadlock */
-       ctl = I915_READ_CTL(signaller);
-       if (ctl & RING_WAIT_SEMAPHORE && semaphore_passed(signaller) < 0)
-               return -1;
-
        if (i915_seqno_passed(signaller->get_seqno(signaller, false), seqno))
                return 1;
 
-       if (signaller->hangcheck.deadlock)
+       /* cursory check for an unkickable deadlock */
+       if (I915_READ_CTL(signaller) & RING_WAIT_SEMAPHORE &&
+           semaphore_passed(signaller) < 0)
                return -1;
 
        return 0;
Comment 2 Stefan Huber 2014-06-30 18:25:32 UTC
(In reply to comment #1)
I have upgraded to 3.15.2 and applied the second patch too. I will ping you when/if the error occurs next. (According to my logs I had GPU crashes on Feb  5, Feb 20, Apr  4, Apr  9, Apr 21, May 13, May 16, Jun  3, Jun 23, Jun 30.)
Comment 3 Stefan Huber 2014-07-19 11:16:41 UTC
So far so good, no crashes with the proposed patch until now.
Comment 4 Chris Wilson 2014-07-19 11:22:00 UTC
It should emit a warning when it fires, could you check your logs to see if you have had such an event?
Comment 5 Stefan Huber 2014-07-19 11:28:25 UTC
(In reply to comment #4)
> It should emit a warning when it fires, could you check your logs to see if
> you have had such an event?

# zcat messages-* | cat - messages | grep "waiting on" -A 8
Jul 18 16:33:09 euklid kernel: [ 9291.280145] render ring waiting on blitter ring [recursion depth 0], seqno 0x801bd [current 0x801bd]
Jul 18 16:33:09 euklid kernel: [ 9291.280639] [drm] GPU HANG: ecode -1:0x00000000, reason: Kicking stuck semaphore on render ring, action: continue
Jul 18 16:33:09 euklid kernel: [ 9291.280640] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jul 18 16:33:09 euklid kernel: [ 9291.280641] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jul 18 16:33:09 euklid kernel: [ 9291.280642] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jul 18 16:33:09 euklid kernel: [ 9291.280642] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jul 18 16:33:09 euklid kernel: [ 9291.280643] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jul 18 16:33:09 euklid kernel: [ 9291.280669] blitter ring waiting on render ring [recursion depth 0], seqno 0x801c1 [current 0x801bf]

Interesting, I cannot remember that there was a crash yesterday.
Comment 6 Stefan Huber 2014-07-19 11:29:41 UTC
Created attachment 103100 [details]
/sys/class/drm/card0/error from Jul 18
Comment 7 Chris Wilson 2014-07-19 11:38:27 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > It should emit a warning when it fires, could you check your logs to see if
> > you have had such an event?
> 
> # zcat messages-* | cat - messages | grep "waiting on" -A 8
> Jul 18 16:33:09 euklid kernel: [ 9291.280145] render ring waiting on blitter
> ring [recursion depth 0], seqno 0x801bd [current 0x801bd]
> Jul 18 16:33:09 euklid kernel: [ 9291.280669] blitter ring waiting on render
> ring [recursion depth 0], seqno 0x801c1 [current 0x801bf]
> 
> Interesting, I cannot remember that that there was a crash yesterday.

You weren't meant to! Thanks, that shows that the patch did the trick.
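
For readers following the two "waiting on" lines: each one pairs the seqno the waiter needs with the signaller's current seqno. A minimal sketch of that comparison is below, assuming the usual wraparound-safe signed-difference test on 32-bit seqnos. The name `seqno_passed` is a stand-in chosen for illustration, not the driver's code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative stand-in for the driver's seqno comparison: true when
 * `current` has reached or gone beyond `target`, treating seqnos as a
 * wrapping 32-bit counter.
 */
static bool seqno_passed(uint32_t current, uint32_t target)
{
	return (int32_t)(current - target) >= 0;
}
```

With the values from the log, `seqno_passed(0x801bd, 0x801bd)` is true (the blitter ring had already reached what the render ring was waiting for), while `seqno_passed(0x801bf, 0x801c1)` is false (the render ring had not yet reached what the blitter ring was waiting for).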
Comment 8 Chris Wilson 2014-07-22 09:35:53 UTC
commit a0d036b074b4a5a933e37fcb9bdd6b3cc80a0387
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jul 19 12:40:42 2014 +0100

    drm/i915: Reorder the semaphore deadlock check, again
    
    commit 4be173813e57c7298103a83155c2391b5b167b4c
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Fri Jun 6 10:22:29 2014 +0100
    
        drm/i915: Reorder semaphore deadlock check
    
    did the majority of the work, but it missed one crucial detail:
    
    The check for the unkickable deadlock on this ring must come after the
    check whether the ring that we are waiting on has already passed its
    target seqno.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=80709
    Tested-by: Stefan Huber <shuber@sthu.org>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Cc: Jani Nikula <jani.nikula@intel.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
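
The ordering described in the commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not the driver code: `struct ring`, its fields, and `NUM_RINGS` are hypothetical stand-ins for `intel_engine_cs`, the `RING_WAIT_SEMAPHORE` bit, `hangcheck.deadlock`, and `I915_NUM_RINGS`. In this model a return of 1 means the awaited seqno has already passed, 0 means the wait is still legitimate, and -1 means deadlock.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_RINGS 2 /* hypothetical; stands in for I915_NUM_RINGS */

/* Toy ring state; field names are illustrative, not the driver's. */
struct ring {
	const char *name;
	uint32_t seqno;         /* last seqno this ring has completed */
	int waiting;            /* models RING_WAIT_SEMAPHORE being set */
	struct ring *signaller; /* ring this one is waiting on, if any */
	uint32_t wait_seqno;    /* seqno it is waiting for */
	int deadlock;           /* models hangcheck.deadlock */
};

/* Wraparound-safe "current has reached target" test. */
static int seqno_passed(uint32_t current, uint32_t target)
{
	return (int32_t)(current - target) >= 0;
}

/* Order before the patch from comment 1: recurse first, test seqno second. */
static int semaphore_passed_old(struct ring *ring)
{
	struct ring *signaller = ring->signaller;

	ring->deadlock++;
	if (signaller == NULL)
		return -1;
	if (signaller->deadlock >= NUM_RINGS)
		return -1; /* recursion guard */
	if (signaller->waiting && semaphore_passed_old(signaller) < 0)
		return -1;
	if (seqno_passed(signaller->seqno, ring->wait_seqno))
		return 1;
	if (signaller->deadlock)
		return -1;
	return 0;
}

/* Reordered: test seqno first, recurse only if the wait is still pending. */
static int semaphore_passed(struct ring *ring)
{
	struct ring *signaller = ring->signaller;

	ring->deadlock++;
	if (signaller == NULL)
		return -1;
	if (signaller->deadlock >= NUM_RINGS)
		return -1; /* recursion guard */
	if (seqno_passed(signaller->seqno, ring->wait_seqno))
		return 1;
	if (signaller->waiting && semaphore_passed(signaller) < 0)
		return -1;
	return 0;
}
```

Fed the values from the Jul 18 log (render waiting on blitter for 0x801bd with blitter at 0x801bd, blitter waiting on render for 0x801c1 with render at 0x801bf, both rings stalled on a semaphore), the old order recurses around the render/blitter cycle until the guard trips and returns -1 for the render ring. The reordered version returns 1 for the render ring and 0 for the blitter ring. The `deadlock` counters must be reset to zero between calls, as they accumulate across a check.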
