Summary: | [CI][BAT] igt@drv_selftest@live_hangcheck - igt_reset_engines failed with error | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | high | CC: | intel-gfx-bugs |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BXT, KBL | i915 features: | GEM/Other |
Description
Martin Peres
2018-09-07 16:12:29 UTC
This bug is a continuation of bug https://bugs.freedesktop.org/show_bug.cgi?id=106560, which had been closed because some of the issues were fixed. This bug is mostly visible on KBL, but APL has also had one hit since the fix from 106560: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4702/shard-apl6/igt@drv_selftest@live_hangcheck.html

Bumping the priority, as wedging a GPU is quite problematic.

(In reply to Martin Peres from comment #0)
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/igt@drv_selftest@live_hangcheck.html

Before the fix.

> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4714/fi-kbl-7567u/igt@drv_selftest@live_hangcheck.html

Before the fix.

> https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4611/shard-kbl7/igt@drv_selftest@live_hangcheck.html

== DRM_4715 which I guess is before the fix as well.

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4611/shard-kbl7/igt@drv_selftest@live_hangcheck.html

(In reply to Chris Wilson from comment #3)
> > https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4611/shard-kbl7/igt@drv_selftest@live_hangcheck.html
>
> == DRM_4715 which I guess is before the fix as well.

CI_DRM_4715 was posted on Aug. 28, 2018, 1:11 p.m. This was way after you pushed your commit (2018-08-15 10:15:28 +0100): https://cgit.freedesktop.org/drm-tip/commit/?id=a99b32a6fff7e482a267c72e565c8c410ce793d7

So, I'm re-opening. But please tell me where the error in my logic is if you still think this is fixed :)

The last fix for live_hangcheck was

commit 9e4fa01221b3230320135072ad31ea809ca31147
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 28 16:27:02 2018 +0100

    drm/i915/execlists: Flush tasklet directly from reset-finish

    On finishing the reset, the intention is to restart the GPU before we
    relinquish the forcewake taken to handle the reset - the goal being the
    GPU reloads a context before it is allowed to sleep. For this purpose,
    we used tasklet_flush() which although it accomplished the goal of
    restarting the GPU, carried with it a sting in its tail: it cleared the
    TASKLET_STATE_SCHED bit. This meant that if another CPU queued a new
    request to this engine, we would clear the flag and later attempt to
    requeue the tasklet on the local CPU, breaking the per-cpu softirq
    lists.

    Remove the dangerous tasklet_kill() and just run the tasklet func
    directly as we know it is safe to do so (the tasklets are internally
    locked to allow mixed usage from direct submission).

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180828152702.27536-1-chris@chris-wilson.co.uk

(In reply to Chris Wilson from comment #5)
> The last fix for live_hangcheck was
>
> commit 9e4fa01221b3230320135072ad31ea809ca31147
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Aug 28 16:27:02 2018 +0100
>
>     drm/i915/execlists: Flush tasklet directly from reset-finish

ACK! Thanks for documenting it :)
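The commit message quoted in this thread hinges on one invariant: the TASKLET_STATE_SCHED bit records that a tasklet is already linked on a per-CPU softirq list, so clearing it while the tasklet is still queued allows a second schedule to link it again. The sketch below is a simplified, hypothetical Python model of that invariant, not kernel or i915 code. `Tasklet`, `SoftirqLists`, `reset_finish_buggy`, `reset_finish_fixed` and `simulate` are illustrative names only. Real tasklets sit on per-CPU singly linked lists and are driven by atomic bit operations, and this model ignores both.

```python
# Hypothetical, simplified model of the TASKLET_STATE_SCHED invariant.
# NOT kernel code: real tasklets live on per-CPU singly linked lists and
# use atomic bit operations; here each CPU's list is a plain Python list.


class Tasklet:
    def __init__(self, func):
        self.func = func
        self.sched = False  # stands in for the TASKLET_STATE_SCHED bit


class SoftirqLists:
    def __init__(self, ncpus):
        self.per_cpu = [[] for _ in range(ncpus)]

    def schedule(self, t, cpu):
        # Same idea as tasklet_schedule(): enqueue only if the SCHED bit
        # was clear, so a tasklet is normally linked at most once.
        if not t.sched:
            t.sched = True
            self.per_cpu[cpu].append(t)

    def run(self, cpu):
        # Softirq side (the legitimate place to clear SCHED): detach this
        # CPU's list, clear the bit, then run each tasklet function.
        pending, self.per_cpu[cpu] = self.per_cpu[cpu], []
        for t in pending:
            t.sched = False
            t.func()

    def queued_count(self, t):
        # A result above 1 means the tasklet is linked on several lists,
        # which would corrupt real singly linked per-CPU lists.
        return sum(lst.count(t) for lst in self.per_cpu)


def reset_finish_buggy(t):
    # Old behaviour as the commit describes it: restart the GPU, but also
    # clear SCHED although the tasklet may still be queued elsewhere.
    t.func()
    t.sched = False


def reset_finish_fixed(t):
    # New behaviour as the commit describes it: call the tasklet function
    # directly and leave the SCHED bit untouched.
    t.func()


def simulate(reset_finish):
    lists = SoftirqLists(ncpus=2)
    t = Tasklet(lambda: None)  # placeholder for the submission tasklet
    lists.schedule(t, cpu=1)   # another CPU queues a new request
    reset_finish(t)            # reset-finish runs on CPU 0
    lists.schedule(t, cpu=0)   # later requeue attempt on the local CPU
    return lists.queued_count(t)


print(simulate(reset_finish_buggy))  # 2: linked on two per-CPU lists
print(simulate(reset_finish_fixed))  # 1: still queued exactly once
```

The fixed variant also relies on a point the commit makes that this model leaves out: the execlists tasklet takes its own internal lock, so running its function directly while a queued softirq instance is still pending is safe.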