https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_47/fi-bxt-dsi/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:1739) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1739) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4171/shard-apl6/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:1448) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1448) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

[ 132.716731] ============================================
[ 132.716737] WARNING: possible recursive locking detected
[ 132.716744] 4.17.0-rc4-CI-CI_DRM_4171+ #1 Tainted: G U
[ 132.716751] --------------------------------------------
[ 132.716758] gem_exec_await/1448 is trying to acquire lock:
[ 132.716766] 0000000022398cdc (&(&timeline->lock)->rlock){-.-.}, at: i915_gem_reset_engine+0x26c/0x370 [i915]
[ 132.716850] but task is already holding lock:
[ 132.716857] 000000006f6d9267 (&(&timeline->lock)->rlock){-.-.}, at: i915_gem_reset_engine+0x260/0x370 [i915]
[ 132.716943] other info that might help us debug this:
[ 132.716950]  Possible unsafe locking scenario:
[ 132.716957]        CPU0
[ 132.716961]        ----
[ 132.716965]   lock(&(&timeline->lock)->rlock);
[ 132.716974]   lock(&(&timeline->lock)->rlock);
[ 132.716981]  *** DEADLOCK ***
[ 132.716989]  May be due to missing lock nesting notation
[ 132.716998] 4 locks held by gem_exec_await/1448:
[ 132.717004]  #0: 00000000aab018e1 (sb_writers#11){.+.+}, at: vfs_write+0x188/0x1a0
[ 132.717022]  #1: 00000000a3763ca8 (&attr->mutex){+.+.}, at: simple_attr_write+0x36/0xd0
[ 132.717038]  #2: 000000009f5491cd (&dev->struct_mutex){+.+.}, at: i915_drop_caches_set+0x3f/0x1a0 [i915]
[ 132.717125]  #3: 000000006f6d9267 (&(&timeline->lock)->rlock){-.-.}, at: i915_gem_reset_engine+0x260/0x370 [i915]
[ 132.717211] stack backtrace:
[ 132.717221] CPU: 2 PID: 1448 Comm: gem_exec_await Tainted: G U 4.17.0-rc4-CI-CI_DRM_4171+ #1
[ 132.717231] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0047.2018.0108.1419 01/08/2018
[ 132.717240] Call Trace:
[ 132.717251]  dump_stack+0x67/0x9b
[ 132.717260]  __lock_acquire+0xc67/0x1b50
[ 132.717338]  ? i915_gem_object_pin_map+0x2d/0x2a0 [i915]
[ 132.717418]  ? i915_request_wait+0x130/0x8a0 [i915]
[ 132.717431]  ? printk+0x4d/0x69
[ 132.717438]  ? lock_acquire+0xa6/0x210
[ 132.717444]  lock_acquire+0xa6/0x210
[ 132.717519]  ? i915_gem_reset_engine+0x26c/0x370 [i915]
[ 132.717532]  _raw_spin_lock+0x2a/0x40
[ 132.717606]  ? i915_gem_reset_engine+0x26c/0x370 [i915]
[ 132.717682]  i915_gem_reset_engine+0x26c/0x370 [i915]
[ 132.717762]  ? i915_request_wait+0x130/0x8a0 [i915]
[ 132.717842]  ? i915_request_wait+0x130/0x8a0 [i915]
[ 132.717918]  i915_gem_reset+0x5b/0x100 [i915]
[ 132.717986]  i915_reset+0x22d/0x290 [i915]
[ 132.718070]  __i915_wait_request_check_and_reset.isra.8+0x44/0x50 [i915]
[ 132.718151]  i915_request_wait+0x4ea/0x8a0 [i915]
[ 132.718166]  ? wake_up_q+0x70/0x70
[ 132.718174]  ? wake_up_q+0x70/0x70
[ 132.718247]  wait_for_timeline+0x11c/0x2a0 [i915]
[ 132.718324]  i915_gem_wait_for_idle+0x8d/0x180 [i915]
[ 132.718398]  i915_drop_caches_set+0x18b/0x1a0 [i915]
[ 132.718413]  simple_attr_write+0xb0/0xd0
[ 132.718423]  full_proxy_write+0x51/0x80
[ 132.718431]  __vfs_write+0x31/0x160
[ 132.718442]  ? rcu_read_lock_sched_held+0x6f/0x80
[ 132.718450]  ? rcu_sync_lockdep_assert+0x29/0x50
[ 132.718457]  ? __sb_start_write+0x152/0x1f0
[ 132.718466]  ? __sb_start_write+0x168/0x1f0
[ 132.718473]  vfs_write+0xbd/0x1a0
[ 132.718480]  ksys_write+0x50/0xc0
[ 132.718488]  do_syscall_64+0x55/0x190
[ 132.718496]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 132.718504] RIP: 0033:0x7fe11cec0154
[ 132.718511] RSP: 002b:00007ffdf018b488 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 132.718521] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe11cec0154
[ 132.718531] RDX: 0000000000000004 RSI: 00005601671326a0 RDI: 0000000000000005
[ 132.718539] RBP: 00005601671326a0 R08: 00007fe11da22980 R09: 0000000000000000
[ 132.718546] R10: 0000000000000000 R11: 0000000000000246 R12: 00005601671173f0
[ 132.718555] R13: 0000000000000004 R14: 00007fe11d1982a0 R15: 00007fe11d197760
The lockdep splat was fixed by:

commit 0c591a40af1b369cc11dce4d558dd71bebbdc090
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat May 12 09:49:57 2018 +0100

    drm/i915: Mark up nested spinlocks

    When we process the outstanding requests upon banning a context, we
    need to acquire both the engine and the client's timeline, nesting the
    locks. This requires explicit markup as the two timelines are now of
    the same class, since commit a89d1f921c15 ("drm/i915: Split
    i915_gem_timeline into individual timelines").

    Testcase: igt/gem_eio/banned
    Fixes: a89d1f921c15 ("drm/i915: Split i915_gem_timeline into individual timelines")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180512084957.9829-1-chris@chris-wilson.co.uk

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 89bf5d67cb74..0a2070112b66 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3119,7 +3119,7 @@ static void engine_skip_context(struct i915_request *request)
 	GEM_BUG_ON(timeline == &engine->timeline);
 	spin_lock_irqsave(&engine->timeline.lock, flags);
-	spin_lock(&timeline->lock);
+	spin_lock_nested(&timeline->lock, SINGLE_DEPTH_NESTING);
 	list_for_each_entry_continue(request, &engine->timeline.requests, link)
 		if (request->ctx == hung_ctx)
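As background on that markup, here is a minimal sketch of the pattern (illustrative names, not the actual engine_skip_context() code): two spinlocks initialised from the same call site share a lockdep class, so nesting them looks like recursive locking unless the inner acquisition is annotated with a subclass.

#include <linux/spinlock.h>
#include <linux/lockdep.h>

struct timeline {
	spinlock_t lock;
	/* ... */
};

/*
 * Both timelines' locks fall into one lockdep class, so taking them
 * nested without annotation triggers the "possible recursive locking"
 * warning above.  spin_lock_nested() tells lockdep the inner lock sits
 * one level below the outer one, which is exactly what the fix does.
 */
static void transfer_requests(struct timeline *engine_tl,
			      struct timeline *client_tl)
{
	unsigned long flags;

	spin_lock_irqsave(&engine_tl->lock, flags);
	spin_lock_nested(&client_tl->lock, SINGLE_DEPTH_NESTING);

	/* ... walk/skip requests while holding both timelines ... */

	spin_unlock(&client_tl->lock);
	spin_unlock_irqrestore(&engine_tl->lock, flags);
}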
I've run gem_exec_await/wide-contexts a few thousand times trying to replicate the stall, and got nothing. What is odd in the logs is that we are clearly waiting inside gem_execbuf, but we are not waiting on any specific request. So we are blocked on some other mutex; my current guess is that we hit reclaim and that hangs.
*** Bug 107833 has been marked as a duplicate of this bug. ***
Also seen on SKL, KBL and GLK:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_103/fi-skl-6700k2/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4766/shard-glk6/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4766/shard-apl7/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:2613) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:2613) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_107/fi-bsw-n3050/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:1535) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1535) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.
Finally it dawned on me how we might wait forever in the middle of a run. If i915_request_alloc fails, we do a wait_for_idle... If that occurs in the middle of our setup, the only way out is via a GPU hang.
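To make the failure mode concrete, here is a rough sketch of that path (hypothetical structure and helper names, not the actual i915 functions): the allocation failure falls back to waiting for the whole GPU to idle, which can never complete if the caller still has its own setup work queued.

#include <linux/slab.h>
#include <linux/err.h>

struct request;

struct gpu {
	struct kmem_cache *request_slab;
	/* ... */
};

/* Hypothetical helper: blocks until every outstanding request retires. */
int wait_for_gpu_idle(struct gpu *gpu);

/*
 * Sketch of the pre-fix backpressure: if the request allocation fails,
 * block until the *whole* GPU goes idle before retrying.  If the caller
 * is still in the middle of queueing its own work, "idle" never arrives
 * and the only way out is hangcheck declaring a GPU hang.
 */
static struct request *request_alloc_waitidle(struct gpu *gpu)
{
	struct request *rq;

	rq = kmem_cache_alloc(gpu->request_slab, GFP_KERNEL | __GFP_NOWARN);
	if (rq)
		return rq;

	/* Heavy-handed: wait for all outstanding work to complete. */
	if (wait_for_gpu_idle(gpu))
		return ERR_PTR(-EINTR);

	rq = kmem_cache_alloc(gpu->request_slab, GFP_KERNEL);
	return rq ?: ERR_PTR(-ENOMEM);
}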
commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 14 09:00:15 2018 +0100

    drm/i915: Limit the backpressure for i915_request allocation

    If we try and fail to allocate a i915_request, we apply some
    backpressure on the clients to throttle the memory allocations coming
    from i915.ko. Currently, we wait until completely idle, but this is far
    too heavy and leads to some situations where the only escape is to
    declare a client hung and reset the GPU. The intent is to only ratelimit
    the allocation requests and to allow ourselves to recycle requests and
    memory from any long queues built up by a client hog.

    Although the system memory is inherently a global resources, we don't
    want to overly penalize an unlucky client to pay the price of reaping a
    hog. To reduce the influence of one client on another, we can instead of
    waiting for the entire GPU to idle, impose a barrier on the local client.
    (One end goal for request allocation is for scalability to many
    concurrent allocators; simultaneous execbufs.)

    To prevent ourselves from getting caught out by long running requests
    (requests that may never finish without userspace intervention, whom we
    are blocking) we need to impose a finite timeout, ideally shorter than
    hangcheck. A long time ago Paul McKenney suggested that RCU users should
    ratelimit themselves using judicious use of cond_synchronize_rcu(). This
    gives us the opportunity to reduce our indefinite wait for the GPU to
    idle to a wait for the RCU grace period of the previous allocation along
    this timeline to expire, satisfying both the local and finite properties
    we desire for our ratelimiting.

    There are still a few global steps (reclaim not least amongst those!)
    when we exhaust the immediate slab pool, at least now the wait is itself
    decoupled from struct_mutex for our glorious highly parallel future!

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk

Pretty confident!
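The cond_synchronize_rcu() ratelimiting the commit message describes looks roughly like this (field, struct and helper names here are illustrative, not the actual i915_request.c code): each request records the RCU grace-period cookie current at allocation time, and a later failed allocation waits only for the grace period of the oldest request on the same client timeline before retrying.

#include <linux/slab.h>
#include <linux/rcupdate.h>
#include <linux/list.h>
#include <linux/err.h>

struct request {
	unsigned long rcustate;	/* cookie from get_state_synchronize_rcu() */
	struct list_head link;
};

struct timeline {
	struct list_head requests;	/* oldest request first */
};

static struct kmem_cache *request_slab;

/*
 * Sketch of the new backpressure: on allocation failure, instead of
 * waiting for the whole GPU to idle, wait (if still necessary) for the
 * RCU grace period recorded by this client's oldest outstanding request,
 * then retry.  The wait is both local to the client and finite.
 */
static struct request *request_alloc_ratelimited(struct timeline *tl)
{
	struct request *rq;

	rq = kmem_cache_alloc(request_slab, GFP_KERNEL | __GFP_NOWARN);
	if (!rq) {
		struct request *old;

		/* Throttle against our own oldest request only. */
		old = list_first_entry_or_null(&tl->requests,
					       struct request, link);
		if (old)
			cond_synchronize_rcu(old->rcustate);

		rq = kmem_cache_alloc(request_slab, GFP_KERNEL);
		if (!rq)
			return ERR_PTR(-ENOMEM);
	}

	rq->rcustate = get_state_synchronize_rcu();
	list_add_tail(&rq->link, &tl->requests);
	return rq;
}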
(In reply to Chris Wilson from comment #7)
> commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Fri Sep 14 09:00:15 2018 +0100
>
> drm/i915: Limit the backpressure for i915_request allocation
>
> If we try and fail to allocate a i915_request, we apply some
> backpressure on the clients to throttle the memory allocations coming
> from i915.ko. Currently, we wait until completely idle, but this is far
> too heavy and leads to some situations where the only escape is to
> declare a client hung and reset the GPU. The intent is to only ratelimit
> the allocation requests and to allow ourselves to recycle requests and
> memory from any long queues built up by a client hog.
>
> Although the system memory is inherently a global resources, we don't
> want to overly penalize an unlucky client to pay the price of reaping a
> hog. To reduce the influence of one client on another, we can instead of
> waiting for the entire GPU to idle, impose a barrier on the local client.
> (One end goal for request allocation is for scalability to many
> concurrent allocators; simultaneous execbufs.)
>
> To prevent ourselves from getting caught out by long running requests
> (requests that may never finish without userspace intervention, whom we
> are blocking) we need to impose a finite timeout, ideally shorter than
> hangcheck. A long time ago Paul McKenney suggested that RCU users should
> ratelimit themselves using judicious use of cond_synchronize_rcu(). This
> gives us the opportunity to reduce our indefinite wait for the GPU to
> idle to a wait for the RCU grace period of the previous allocation along
> this timeline to expire, satisfying both the local and finite properties
> we desire for our ratelimiting.
>
> There are still a few global steps (reclaim not least amongst those!)
> when we exhaust the immediate slab pool, at least now the wait is itself
> decoupled from struct_mutex for our glorious highly parallel future!
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk
>
> Pretty confident!

Unfortunately, this is still not fixed, as it is happening at the same rate as before:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4948/shard-kbl2/igt@gem_exec_await@wide-contexts.html

Starting subtest: wide-contexts
(gem_exec_await:1079) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1079) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

And we have also seen issues on WHL and CNL:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_124/fi-cnl-u/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_121/fi-whl-u/igt@gem_exec_await@wide-contexts.html
This one is interesting though:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4969/shard-glk7/igt@gem_exec_await@wide-contexts.html

Starting subtest: wide-contexts
(gem_exec_await:2214) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:2214) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

<3> [1053.522972] [drm:i915_reset [i915]] *ERROR* Failed to initialise HW following reset (-5)
I can't reproduce this locally with running the test in question in a loop for a couple hours. And the history on shards also looks clean: https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_exec_await@wide-contexts.html What was the frequency of the failures in the past? Would we expect to see some failures in the ten day window?
As far as I can tell this used to happen once a day. On what HW did you test?
(In reply to Tvrtko Ursulin from comment #10)
> I can't reproduce this locally with running the test in question in a loop
> for a couple hours. And the history on shards also looks clean:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_exec_await@wide-contexts.html
>
> What was the frequency of the failures in the past? Would we expect to see
> some failures in the ten day window?

Last seen this failure in IGT_4694_full (1 week, 4 days / 139 runs ago).

Logs from the last failure:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4694/shard-kbl1/igt@gem_exec_await@wide-contexts.html

Until then this issue used to appear very frequently (~every day).
From the data it sounds like we can close this. Just a pity we cannot tag it with a commit which fixed it.

P.S. I tested on SKL.
(In reply to Tvrtko Ursulin from comment #13)
> From the data it sounds like we can close this. Just a pity we cannot tag it
> with a commit which fixed it.
>
> P.S. I tested on SKL.

I would not be so quick... the issue is quite intermittent and has been so for 5 months. We'll need to wait a couple of months to be sure, at the current rate.
I left the test running in a tight loop for two more hours yesterday with no luck. Was the frequency similar across platforms, or were there more susceptible platforms?
You need to hit an allocation failure while allocating the request... The problem comes down to an unbounded RCU wait on the current grace period.
Moving to medium while we wait for it to happen again.
Not seen in two months and a half, closing.
(In reply to Francesco Balestrieri from comment #18)
> Not seen in two months and a half, closing.

Now last seen 4 months ago, which definitely looks pretty promising for it being fixed! I'll archive the bug.
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.