| Summary | [CI][SHARDS] igt@gem_exec_await@wide-contexts - fail/dmesg-fail - Failed assertion: !"GPU hung" | | |
|---|---|---|---|
| Product | DRI | Reporter | Martin Peres <martin.peres> |
| Component | DRM/Intel | Assignee | Tvrtko Ursulin <tvrtko.ursulin> |
| Status | CLOSED WORKSFORME | QA Contact | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Severity | normal | | |
| Priority | medium | CC | intel-gfx-bugs |
| Version | XOrg git | | |
| Hardware | Other | | |
| OS | All | | |
| Whiteboard | ReadyForDev | | |
| i915 platform | BSW/CHT, BXT, CFL, CNL, GLK, KBL | i915 features | GEM/Other |
Description
Martin Peres
2018-05-28 09:53:59 UTC
The lockdep splat was fixed by:

commit 0c591a40af1b369cc11dce4d558dd71bebbdc090
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat May 12 09:49:57 2018 +0100

    drm/i915: Mark up nested spinlocks

    When we process the outstanding requests upon banning a context, we
    need to acquire both the engine and the client's timeline, nesting the
    locks. This requires explicit markup as the two timelines are now of
    the same class, since commit a89d1f921c15 ("drm/i915: Split
    i915_gem_timeline into individual timelines").

    Testcase: igt/gem_eio/banned
    Fixes: a89d1f921c15 ("drm/i915: Split i915_gem_timeline into individual timelines")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180512084957.9829-1-chris@chris-wilson.co.uk

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 89bf5d67cb74..0a2070112b66 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3119,7 +3119,7 @@ static void engine_skip_context(struct i915_request *request)
 	GEM_BUG_ON(timeline == &engine->timeline);
 	spin_lock_irqsave(&engine->timeline.lock, flags);
-	spin_lock(&timeline->lock);
+	spin_lock_nested(&timeline->lock, SINGLE_DEPTH_NESTING);
 	list_for_each_entry_continue(request, &engine->timeline.requests, link)
 		if (request->ctx == hung_ctx)
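For readers unfamiliar with the lockdep annotation used in that fix, here is a minimal sketch of the pattern: two spinlocks belonging to the same lock class are held at the same time, so the inner acquisition has to be marked with spin_lock_nested() to avoid a false-positive deadlock report. The struct and function names below are illustrative placeholders, not the actual i915 code.

```c
/*
 * Illustrative sketch only -- not the i915 implementation. Two locks of
 * the same lock class are nested, so lockdep needs an explicit annotation
 * on the inner acquisition.
 */
#include <linux/spinlock.h>

struct demo_timeline {			/* hypothetical stand-in for a timeline */
	spinlock_t lock;
};

static void demo_skip_requests(struct demo_timeline *engine_tl,
			       struct demo_timeline *client_tl)
{
	unsigned long flags;

	/* Outer lock: the engine's timeline. */
	spin_lock_irqsave(&engine_tl->lock, flags);

	/*
	 * Inner lock: the client's timeline. It shares the engine
	 * timeline's lock class, hence SINGLE_DEPTH_NESTING.
	 */
	spin_lock_nested(&client_tl->lock, SINGLE_DEPTH_NESTING);

	/* ... walk the engine timeline and cancel the hung context ... */

	spin_unlock(&client_tl->lock);
	spin_unlock_irqrestore(&engine_tl->lock, flags);
}
```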
I've run gem_exec_await/wide-contexts a few thousand times trying to replicate the stall, nothing. What is odd in the logs is that we are clearly waiting inside gem_execbuf, but we are not waiting on any specific request. So some other mutex; my current guess is that we hit reclaim and that hangs.

*** Bug 107833 has been marked as a duplicate of this bug. ***

Also seen on SKL, KBL and GLK:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_103/fi-skl-6700k2/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4766/shard-glk6/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4766/shard-apl7/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:2613) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:2613) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_107/fi-bsw-n3050/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:1535) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1535) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed
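For context on the `Failed assertion: !"GPU hung"` lines in these logs: IGT runs a hang detector alongside the test, and when a hang is reported the test process is signalled and fails itself through an always-false assertion, so the failure is attributed to the GPU hang rather than to whichever ioctl the test happened to be blocked in. Below is a rough userspace sketch of that handler pattern, not the actual lib/igt_aux.c source; the signal number and helper name are assumptions.

```c
/*
 * Rough sketch of the hang-detector handler pattern. Assumed details,
 * not the real lib/igt_aux.c code (which uses igt_assert()).
 */
#include <assert.h>
#include <signal.h>

static void sig_abort(int sig)
{
	(void)sig;
	/*
	 * !"GPU hung" is always 0, so the assertion always fires and the
	 * message embedded in the failure output names the real cause.
	 */
	assert(!"GPU hung");
}

static void install_hang_detector_handler(void)
{
	/* Assumed signal choice; the detector process sends it on a hang. */
	signal(SIGIO, sig_abort);
}
```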
Finally it dawned on me how we might wait forever in the middle of a run. If i915_request_alloc fails we do a wait_for_idle... If that occurs in the middle of our setup, the only way out is via a GPU hang.

commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 14 09:00:15 2018 +0100

    drm/i915: Limit the backpressure for i915_request allocation

    If we try and fail to allocate a i915_request, we apply some
    backpressure on the clients to throttle the memory allocations coming
    from i915.ko. Currently, we wait until completely idle, but this is far
    too heavy and leads to some situations where the only escape is to
    declare a client hung and reset the GPU. The intent is to only ratelimit
    the allocation requests and to allow ourselves to recycle requests and
    memory from any long queues built up by a client hog.

    Although the system memory is inherently a global resources, we don't
    want to overly penalize an unlucky client to pay the price of reaping a
    hog. To reduce the influence of one client on another, we can instead of
    waiting for the entire GPU to idle, impose a barrier on the local client.
    (One end goal for request allocation is for scalability to many
    concurrent allocators; simultaneous execbufs.)

    To prevent ourselves from getting caught out by long running requests
    (requests that may never finish without userspace intervention, whom we
    are blocking) we need to impose a finite timeout, ideally shorter than
    hangcheck. A long time ago Paul McKenney suggested that RCU users should
    ratelimit themselves using judicious use of cond_synchronize_rcu(). This
    gives us the opportunity to reduce our indefinite wait for the GPU to
    idle to a wait for the RCU grace period of the previous allocation along
    this timeline to expire, satisfying both the local and finite properties
    we desire for our ratelimiting.

    There are still a few global steps (reclaim not least amongst those!)
    when we exhaust the immediate slab pool, at least now the wait is itself
    decoupled from struct_mutex for our glorious highly parallel future!

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk

Pretty confident!
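As a rough illustration of the ratelimiting idea in that commit message: record an RCU grace-period cookie when a request is allocated, and on a later allocation failure wait only for that previously observed grace period to expire (so memory freed via RCU since then becomes reusable) instead of waiting for the whole GPU to idle. The sketch below uses the real cond_synchronize_rcu()/get_state_synchronize_rcu() helpers, but the surrounding structure and names are hypothetical, not the actual i915 implementation.

```c
/*
 * Hypothetical sketch of per-timeline allocation backpressure using RCU
 * grace-period cookies; not the actual i915 code.
 */
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_timeline {
	/* Grace-period cookie recorded at the previous allocation. */
	unsigned long rcustate;
};

struct demo_request {
	struct rcu_head rcu;	/* freed via kfree_rcu() elsewhere */
};

static struct demo_request *demo_request_alloc(struct demo_timeline *tl)
{
	struct demo_request *rq;

	rq = kmalloc(sizeof(*rq), GFP_KERNEL | __GFP_NOWARN);
	if (!rq) {
		/*
		 * Local, finite backpressure: wait only until the grace
		 * period observed at the previous allocation on this
		 * timeline has elapsed, then retry, rather than waiting
		 * for the entire GPU to go idle.
		 */
		cond_synchronize_rcu(tl->rcustate);

		rq = kmalloc(sizeof(*rq), GFP_KERNEL);
		if (!rq)
			return NULL;
	}

	/* Record the current grace period for the next caller to wait on. */
	tl->rcustate = get_state_synchronize_rcu();

	return rq;
}
```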
(In reply to Chris Wilson from comment #7)
> commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Sep 14 09:00:15 2018 +0100
>
>     drm/i915: Limit the backpressure for i915_request allocation
>
>     If we try and fail to allocate a i915_request, we apply some
>     backpressure on the clients to throttle the memory allocations coming
>     from i915.ko. Currently, we wait until completely idle, but this is far
>     too heavy and leads to some situations where the only escape is to
>     declare a client hung and reset the GPU. The intent is to only ratelimit
>     the allocation requests and to allow ourselves to recycle requests and
>     memory from any long queues built up by a client hog.
>
>     Although the system memory is inherently a global resources, we don't
>     want to overly penalize an unlucky client to pay the price of reaping a
>     hog. To reduce the influence of one client on another, we can instead of
>     waiting for the entire GPU to idle, impose a barrier on the local client.
>     (One end goal for request allocation is for scalability to many
>     concurrent allocators; simultaneous execbufs.)
>
>     To prevent ourselves from getting caught out by long running requests
>     (requests that may never finish without userspace intervention, whom we
>     are blocking) we need to impose a finite timeout, ideally shorter than
>     hangcheck. A long time ago Paul McKenney suggested that RCU users should
>     ratelimit themselves using judicious use of cond_synchronize_rcu(). This
>     gives us the opportunity to reduce our indefinite wait for the GPU to
>     idle to a wait for the RCU grace period of the previous allocation along
>     this timeline to expire, satisfying both the local and finite properties
>     we desire for our ratelimiting.
>
>     There are still a few global steps (reclaim not least amongst those!)
>     when we exhaust the immediate slab pool, at least now the wait is itself
>     decoupled from struct_mutex for our glorious highly parallel future!
>
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>     Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk
>
> Pretty confident!

Unfortunately, this is still not fixed, as it is happening at the same rate as before:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4948/shard-kbl2/igt@gem_exec_await@wide-contexts.html

Starting subtest: wide-contexts
(gem_exec_await:1079) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1079) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

And we have also seen issues on WHL and CNL:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_124/fi-cnl-u/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_121/fi-whl-u/igt@gem_exec_await@wide-contexts.html

This one is interesting, though:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4969/shard-glk7/igt@gem_exec_await@wide-contexts.html

Starting subtest: wide-contexts
(gem_exec_await:2214) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:2214) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.
<3> [1053.522972] [drm:i915_reset [i915]] *ERROR* Failed to initialise HW following reset (-5)

I can't reproduce this locally by running the test in question in a loop for a couple of hours. And the history on shards also looks clean:

https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_exec_await@wide-contexts.html

What was the frequency of the failures in the past? Would we expect to see some failures in the ten-day window?

As far as I can tell this used to happen once a day. On what HW did you test?

(In reply to Tvrtko Ursulin from comment #10)
> I can't reproduce this locally by running the test in question in a loop
> for a couple of hours. And the history on shards also looks clean:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_exec_await@wide-contexts.html
>
> What was the frequency of the failures in the past? Would we expect to see
> some failures in the ten-day window?

Last seen this failure in IGT_4694_full (1 week, 4 days / 139 runs ago).

Logs from the last failure:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4694/shard-kbl1/igt@gem_exec_await@wide-contexts.html

Until then this issue used to appear very frequently (~every day).

From the data it sounds like we can close this. Just a pity we cannot tag it with a commit which fixed it.

P.S. I tested on SKL.

(In reply to Tvrtko Ursulin from comment #13)
> From the data it sounds like we can close this. Just a pity we cannot tag it
> with a commit which fixed it.
>
> P.S. I tested on SKL.

I would not be so quick... the issue is quite intermittent and has been so for 5 months... We'll need to wait a couple of months to be sure, at the current rate.

I left the test running in a tight loop for two more hours yesterday and no luck. Was the frequency across platforms similar, or were some platforms more susceptible?

You need to hit an allocation failure while allocating the request... The problem is resolved down to an unbounded RCU wait on the current grace period.

Moving to medium while we wait for it to happen again.

Not seen in two and a half months, closing.

(In reply to Francesco Balestrieri from comment #18)
> Not seen in two and a half months, closing.

Now last seen 4 months ago, which definitely looks pretty promising for it being fixed! I'll archive the bug.

The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.