| Summary | [CI][SHARDS] igt@gem_exec_await@wide-contexts - fail/dmesg-fail - Failed assertion: !"GPU hung" | | |
|---|---|---|---|
| Product | DRI | Reporter | Martin Peres <martin.peres> |
| Component | DRM/Intel | Assignee | Tvrtko Ursulin <tvrtko.ursulin> |
| Status | CLOSED WORKSFORME | QA Contact | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Severity | normal | | |
| Priority | medium | CC | intel-gfx-bugs |
| Version | XOrg git | | |
| Hardware | Other | | |
| OS | All | | |
| Whiteboard | ReadyForDev | | |
| i915 platform | BSW/CHT, BXT, CFL, CNL, GLK, KBL | i915 features | GEM/Other |
Description
Martin Peres
2018-05-28 09:53:59 UTC
The lockdep splat was fixed by:

commit 0c591a40af1b369cc11dce4d558dd71bebbdc090
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat May 12 09:49:57 2018 +0100

    drm/i915: Mark up nested spinlocks

    When we process the outstanding requests upon banning a context, we
    need to acquire both the engine and the client's timeline, nesting the
    locks. This requires explicit markup as the two timelines are now of
    the same class, since commit a89d1f921c15 ("drm/i915: Split
    i915_gem_timeline into individual timelines").

    Testcase: igt/gem_eio/banned
    Fixes: a89d1f921c15 ("drm/i915: Split i915_gem_timeline into individual timelines")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180512084957.9829-1-chris@chris-wilson.co.uk

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 89bf5d67cb74..0a2070112b66 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3119,7 +3119,7 @@ static void engine_skip_context(struct i915_request *request)
 	GEM_BUG_ON(timeline == &engine->timeline);
 	spin_lock_irqsave(&engine->timeline.lock, flags);
-	spin_lock(&timeline->lock);
+	spin_lock_nested(&timeline->lock, SINGLE_DEPTH_NESTING);
 	list_for_each_entry_continue(request, &engine->timeline.requests, link)
 		if (request->ctx == hung_ctx)
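For readers unfamiliar with the lockdep annotation used in that fix, here is a minimal sketch of the pattern: two spinlocks belonging to the same lock class are held at the same time, so the inner acquisition has to be marked with spin_lock_nested() to avoid a false-positive deadlock report. The struct and function names below are illustrative placeholders, not the actual i915 code.

```c
/*
 * Illustrative sketch only -- not the i915 implementation. Two locks of
 * the same lock class are nested, so lockdep needs an explicit annotation
 * on the inner acquisition.
 */
#include <linux/spinlock.h>

struct demo_timeline {			/* hypothetical stand-in for a timeline */
	spinlock_t lock;
};

static void demo_skip_requests(struct demo_timeline *engine_tl,
			       struct demo_timeline *client_tl)
{
	unsigned long flags;

	/* Outer lock: the engine's timeline. */
	spin_lock_irqsave(&engine_tl->lock, flags);

	/*
	 * Inner lock: the client's timeline. It shares the engine
	 * timeline's lock class, hence SINGLE_DEPTH_NESTING.
	 */
	spin_lock_nested(&client_tl->lock, SINGLE_DEPTH_NESTING);

	/* ... walk the engine timeline and cancel the hung context ... */

	spin_unlock(&client_tl->lock);
	spin_unlock_irqrestore(&engine_tl->lock, flags);
}
```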
I've run gem_exec_await/wide-contexts a few thousand times trying to replicate the stall, nothing. What is odd in the logs is that we are clearly waiting inside gem_execbuf, but we are not waiting on any specific request. So some other mutex; my current guess is that we hit reclaim and that hangs.

*** Bug 107833 has been marked as a duplicate of this bug. ***

Also seen on SKL, KBL and GLK:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_103/fi-skl-6700k2/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4766/shard-glk6/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4766/shard-apl7/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:2613) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:2613) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_107/fi-bsw-n3050/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:1535) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1535) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed
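For context on the `Failed assertion: !"GPU hung"` lines in these logs: IGT runs a hang detector alongside the test, and when a hang is reported the test process is signalled and fails itself through an always-false assertion, so the failure is attributed to the GPU hang rather than to whichever ioctl the test happened to be blocked in. Below is a rough userspace sketch of that handler pattern, not the actual lib/igt_aux.c source; the signal number and helper name are assumptions.

```c
/*
 * Rough sketch of the hang-detector handler pattern. Assumed details,
 * not the real lib/igt_aux.c code (which uses igt_assert()).
 */
#include <assert.h>
#include <signal.h>

static void sig_abort(int sig)
{
	(void)sig;
	/*
	 * !"GPU hung" is always 0, so the assertion always fires and the
	 * message embedded in the failure output names the real cause.
	 */
	assert(!"GPU hung");
}

static void install_hang_detector_handler(void)
{
	/* Assumed signal choice; the detector process sends it on a hang. */
	signal(SIGIO, sig_abort);
}
```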
Finally it dawned on me how we might wait forever in the middle of a run. If i915_request_alloc fails we do a wait_for_idle... If that occurs in the middle of our setup, the only way out is via a GPU hang.

commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 14 09:00:15 2018 +0100

    drm/i915: Limit the backpressure for i915_request allocation

    If we try and fail to allocate a i915_request, we apply some
    backpressure on the clients to throttle the memory allocations coming
    from i915.ko. Currently, we wait until completely idle, but this is far
    too heavy and leads to some situations where the only escape is to
    declare a client hung and reset the GPU. The intent is to only ratelimit
    the allocation requests and to allow ourselves to recycle requests and
    memory from any long queues built up by a client hog.

    Although the system memory is inherently a global resources, we don't
    want to overly penalize an unlucky client to pay the price of reaping a
    hog. To reduce the influence of one client on another, we can instead of
    waiting for the entire GPU to idle, impose a barrier on the local client.
    (One end goal for request allocation is for scalability to many
    concurrent allocators; simultaneous execbufs.)

    To prevent ourselves from getting caught out by long running requests
    (requests that may never finish without userspace intervention, whom we
    are blocking) we need to impose a finite timeout, ideally shorter than
    hangcheck. A long time ago Paul McKenney suggested that RCU users should
    ratelimit themselves using judicious use of cond_synchronize_rcu(). This
    gives us the opportunity to reduce our indefinite wait for the GPU to
    idle to a wait for the RCU grace period of the previous allocation along
    this timeline to expire, satisfying both the local and finite properties
    we desire for our ratelimiting.

    There are still a few global steps (reclaim not least amongst those!)
    when we exhaust the immediate slab pool, at least now the wait is itself
    decoupled from struct_mutex for our glorious highly parallel future!

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk

Pretty confident!
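As a rough illustration of the ratelimiting idea in that commit message: record an RCU grace-period cookie when a request is allocated, and on a later allocation failure wait only for that previously observed grace period to expire (so memory freed via RCU since then becomes reusable) instead of waiting for the whole GPU to idle. The sketch below uses the real cond_synchronize_rcu()/get_state_synchronize_rcu() helpers, but the surrounding structure and names are hypothetical, not the actual i915 implementation.

```c
/*
 * Hypothetical sketch of per-timeline allocation backpressure using RCU
 * grace-period cookies; not the actual i915 code.
 */
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_timeline {
	/* Grace-period cookie recorded at the previous allocation. */
	unsigned long rcustate;
};

struct demo_request {
	struct rcu_head rcu;	/* freed via kfree_rcu() elsewhere */
};

static struct demo_request *demo_request_alloc(struct demo_timeline *tl)
{
	struct demo_request *rq;

	rq = kmalloc(sizeof(*rq), GFP_KERNEL | __GFP_NOWARN);
	if (!rq) {
		/*
		 * Local, finite backpressure: wait only until the grace
		 * period observed at the previous allocation on this
		 * timeline has elapsed, then retry, rather than waiting
		 * for the entire GPU to go idle.
		 */
		cond_synchronize_rcu(tl->rcustate);

		rq = kmalloc(sizeof(*rq), GFP_KERNEL);
		if (!rq)
			return NULL;
	}

	/* Record the current grace period for the next caller to wait on. */
	tl->rcustate = get_state_synchronize_rcu();

	return rq;
}
```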
(In reply to Chris Wilson from comment #7)
> commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Sep 14 09:00:15 2018 +0100
>
>     drm/i915: Limit the backpressure for i915_request allocation
>
>     If we try and fail to allocate a i915_request, we apply some
>     backpressure on the clients to throttle the memory allocations coming
>     from i915.ko. Currently, we wait until completely idle, but this is far
>     too heavy and leads to some situations where the only escape is to
>     declare a client hung and reset the GPU. The intent is to only ratelimit
>     the allocation requests and to allow ourselves to recycle requests and
>     memory from any long queues built up by a client hog.
>
>     Although the system memory is inherently a global resources, we don't
>     want to overly penalize an unlucky client to pay the price of reaping a
>     hog. To reduce the influence of one client on another, we can instead of
>     waiting for the entire GPU to idle, impose a barrier on the local client.
>     (One end goal for request allocation is for scalability to many
>     concurrent allocators; simultaneous execbufs.)
>
>     To prevent ourselves from getting caught out by long running requests
>     (requests that may never finish without userspace intervention, whom we
>     are blocking) we need to impose a finite timeout, ideally shorter than
>     hangcheck. A long time ago Paul McKenney suggested that RCU users should
>     ratelimit themselves using judicious use of cond_synchronize_rcu(). This
>     gives us the opportunity to reduce our indefinite wait for the GPU to
>     idle to a wait for the RCU grace period of the previous allocation along
>     this timeline to expire, satisfying both the local and finite properties
>     we desire for our ratelimiting.
>
>     There are still a few global steps (reclaim not least amongst those!)
>     when we exhaust the immediate slab pool, at least now the wait is itself
>     decoupled from struct_mutex for our glorious highly parallel future!
>
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>     Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk
>
> Pretty confident!

Unfortunately, this is still not fixed, as it is happening at the same rate as before:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4948/shard-kbl2/igt@gem_exec_await@wide-contexts.html

Starting subtest: wide-contexts
(gem_exec_await:1079) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1079) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

And we have also seen issues on WHL and CNL:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_124/fi-cnl-u/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_121/fi-whl-u/igt@gem_exec_await@wide-contexts.html

This one is interesting, though:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4969/shard-glk7/igt@gem_exec_await@wide-contexts.html

Starting subtest: wide-contexts
(gem_exec_await:2214) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:2214) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.
<3> [1053.522972] [drm:i915_reset [i915]] *ERROR* Failed to initialise HW following reset (-5)

I can't reproduce this locally by running the test in question in a loop for a couple of hours. And the history on shards also looks clean:

https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_exec_await@wide-contexts.html

What was the frequency of the failures in the past? Would we expect to see some failures in the ten-day window?

As far as I can tell this used to happen once a day. On what HW did you test?

(In reply to Tvrtko Ursulin from comment #10)
> I can't reproduce this locally by running the test in question in a loop
> for a couple of hours. And the history on shards also looks clean:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_exec_await@wide-contexts.html
>
> What was the frequency of the failures in the past? Would we expect to see
> some failures in the ten-day window?

Last seen this failure in IGT_4694_full (1 week, 4 days / 139 runs ago).

Logs from the last failure:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4694/shard-kbl1/igt@gem_exec_await@wide-contexts.html

Until then this issue used to appear very frequently (~every day).

From the data it sounds like we can close this. Just a pity we cannot tag it with a commit which fixed it.

P.S. I tested on SKL.

(In reply to Tvrtko Ursulin from comment #13)
> From the data it sounds like we can close this. Just a pity we cannot tag it
> with a commit which fixed it.
>
> P.S. I tested on SKL.

I would not be so quick... the issue is quite intermittent and has been so for 5 months... We'll need to wait a couple of months to be sure, at the current rate.

I left the test running in a tight loop for two more hours yesterday and no luck. Was the frequency across platforms similar, or were some platforms more susceptible?

You need to hit an allocation failure while allocating the request... The problem is resolved down to an unbounded RCU wait on the current grace period.

Moving to medium while we wait for it to happen again.

Not seen in two and a half months, closing.

(In reply to Francesco Balestrieri from comment #18)
> Not seen in two and a half months, closing.

Now last seen 4 months ago, which definitely looks pretty promising for it being fixed! I'll archive the bug.

The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.