Summary: | [CI] igt@drv_module_reload@basic-reload - dmesg-fail - Failed assertion: __gem_execbuf(fd, execbuf) == 0 && Failed to idle engines, declaring wedged! | |
---|---|---|---
Product: | DRI | Reporter: | Martin Peres <martin.peres>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | normal | |
Priority: | medium | CC: | intel-gfx-bugs
Version: | XOrg git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | ReadyForDev | |
i915 platform: | BSW/CHT, GLK | i915 features: | power/Other

Description
Martin Peres
2018-05-03 12:03:05 UTC
Created attachment 139306: Kick softirqs harder

In both cases it is the 200+ms latency caused by ksoftirqd, falling afoul of our 200ms timeout before declaring bankruptcy. Our loop uses usleep and schedules, so it is not like ksoftirqd doesn't have the opportunity to run. I've been very tempted to execute the tasklet directly, but that doesn't solve the issue that we may encounter the 200ms delay while trying to execute things on the GPU. This is one of the reasons why we need patches like https://patchwork.freedesktop.org/patch/219353/

What I am using at the moment is attached.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4129/shard-glk1/igt@drv_module_reload@basic-reload-inject.html

Stderr:
(drv_module_reload:7296) ioctl_wrappers-CRITICAL: Test assertion failure function gem_execbuf, file ../lib/ioctl_wrappers.c:604:
(drv_module_reload:7296) ioctl_wrappers-CRITICAL: Failed assertion: __gem_execbuf(fd, execbuf) == 0
(drv_module_reload:7296) ioctl_wrappers-CRITICAL: error: -5 != 0
Test drv_module_reload failed.

Dmesg:
[ 1513.231729] i915 0000:00:02.0: Failed to idle engines, declaring wedged!

Then, the following test produced the following assert.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4129/shard-glk1/igt@gem_render_copy_redux@flink-interruptible.html

gem_render_copy_redux: ../lib/rendercopy_gen9.c:143: gen6_render_flush: Assertion `ret == 0' failed.
Received signal SIGABRT.

This is similar to https://bugs.freedesktop.org/show_bug.cgi?id=106064.

Also seen on igt@drv_module_reload@basic-no-display:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4134/fi-bsw-n3050/igt@drv_module_reload@basic-no-display.html

Should be improved by

commit dd0cf235d81f24c1ba80c4a000bafc9f2dce3840
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun May 6 18:13:28 2018 +0100

drm/i915: Speed up idle detection by kicking the tasklets

We rely on ksoftirqd to run in a timely fashion in order to drain the execlists queue. Quite frequently, it does not. In some cases we may see latencies of over 200ms triggering our idle timeouts and forcing us to declare the driver wedged!

Thus we can speed up idle detection by bypassing ksoftirqd in these cases and flush our tasklet to confirm if we are indeed still waiting for the ELSP to drain.

v2: Put the execlists.first check back; it is required for handling reset!

References: https://bugs.freedesktop.org/show_bug.cgi?id=106373
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180506171328.30034-1-chris@chris-wilson.co.uk

(In reply to Chris Wilson from comment #4)
> Should be improved by
>
> commit dd0cf235d81f24c1ba80c4a000bafc9f2dce3840
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Sun May 6 18:13:28 2018 +0100
>
> drm/i915: Speed up idle detection by kicking the tasklets
>
> We rely on ksoftirqd to run in a timely fashion in order to drain the
> execlists queue. Quite frequently, it does not. In some cases we may see
> latencies of over 200ms triggering our idle timeouts and forcing us to
> declare the driver wedged!
>
> Thus we can speed up idle detection by bypassing ksoftirqd in these
> cases and flush our tasklet to confirm if we are indeed still waiting
> for the ELSP to drain.
>
> v2: Put the execlists.first check back; it is required for handling
> reset!
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=106373
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Link: https://patchwork.freedesktop.org/patch/msgid/20180506171328.30034-1-chris@chris-wilson.co.uk

Seems like it did the trick. We used to reproduce the issue every 5-20 runs, and we have not hit it in 200 runs now. Thanks!
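
For a rough sense of how strong the "200 clean runs" evidence is, here is a back-of-the-envelope sketch. It assumes (this is not stated in the report) that each CI run fails independently with a fixed probability somewhere between 1 in 20 and 1 in 5, matching the "every 5-20 runs" figure. The helper name `prob_all_clean` is made up for illustration.

```python
def prob_all_clean(per_run_failure_rate: float, runs: int) -> float:
    """Chance of seeing `runs` consecutive passes if the bug were still present.

    Assumes independent runs with a constant failure rate (a simplification).
    """
    return (1.0 - per_run_failure_rate) ** runs


# The two ends of the "every 5-20 runs" range quoted in the report.
for label, rate in (("1 failure in 20 runs", 1 / 20), ("1 failure in 5 runs", 1 / 5)):
    print(f"{label}: P(200 clean runs) = {prob_all_clean(rate, 200):.2e}")
```

Under these assumptions, even the most optimistic old failure rate (1 in 20) would give 200 consecutive clean runs with a probability on the order of a few in 100,000, so the clean streak is unlikely to be luck.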
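
The idea in the commit message quoted in this thread (flush the tasklet from the idle-wait loop instead of waiting for ksoftirqd) can be illustrated with a toy simulation. This is only a sketch with a simulated clock, not the i915 code: `SimulatedEngine`, `wait_for_idle`, the queue depth, the 250 ms latency and the 10 ms poll interval are all hypothetical. Only the 200 ms timeout comes from the report.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_MS = 200  # timeout quoted in the report


@dataclass
class SimulatedEngine:
    """Toy model: queued work drains only when the deferred handler runs."""

    pending: int = 3               # hypothetical queue depth
    softirq_latency_ms: int = 250  # hypothetical ksoftirqd delay, above the timeout
    now_ms: int = 0                # simulated clock, no real sleeping

    def run_tasklet(self) -> None:
        # Stands in for the tasklet that drains the submission queue.
        self.pending = 0

    def sleep(self, ms: int) -> None:
        self.now_ms += ms
        # The background worker only gets to the handler after its latency.
        if self.now_ms >= self.softirq_latency_ms:
            self.run_tasklet()

    def is_idle(self) -> bool:
        return self.pending == 0


def wait_for_idle(engine: SimulatedEngine, kick: bool, poll_ms: int = 10) -> bool:
    """Poll until idle or until the timeout expires.

    With kick=True the handler is run directly on every poll, bypassing the
    (simulated) background worker. Returns False where the driver would
    declare itself wedged.
    """
    deadline = engine.now_ms + IDLE_TIMEOUT_MS
    while True:
        if kick:
            engine.run_tasklet()
        if engine.is_idle():
            return True
        if engine.now_ms >= deadline:
            return False
        engine.sleep(poll_ms)
```

In this model, `wait_for_idle(SimulatedEngine(), kick=False)` times out because the simulated worker is slower than the timeout, while `kick=True` reports idle on the first poll. As noted earlier in the thread, kicking the tasklet does not help if the delay is hit while actually executing work on the GPU.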