https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-byt-n2820/igt@gem_eio@unwedge-stress.html

(gem_eio:1302) CRITICAL: Test assertion failure function trigger_reset, file ../tests/gem_eio.c:81:
(gem_eio:1302) CRITICAL: Failed assertion: igt_seconds_elapsed(&ts) < 2
Subtest unwedge-stress failed.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_23/fi-cfl-u/igt@gem_eio@in-flight-suspend.html

(gem_eio:1624) CRITICAL: Test assertion failure function trigger_reset, file ../tests/gem_eio.c:81:
(gem_eio:1624) CRITICAL: Failed assertion: igt_seconds_elapsed(&ts) < 2
Subtest in-flight-suspend failed.
commit 89ae332745e31a075747a63ac5acc5baccf75769
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 11 18:58:59 2018 +0100

    tests/gem_eio: Only wait-for-idle inside trigger_reset()

    trigger_reset() imposes a tight time constraint (2s) so that we verify
    that the reset itself completes quickly. In the middle of this check, we
    call gem_quiescent_gpu() which may invoke an rcu_barrier() or two to
    clear out the freed memory (DROP_FREED). Those barriers may have
    unbounded latency pushing beyond the 2s timeout, so restrict the
    operation to only wait-for-idle (DROP_ACTIVE).

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105957
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Optimistically marking as fixed to see what happens. It's doubtful that the rcu_barrier alone is causing the grief, so I suspect there might be an outside timing influence -- as far as I can tell, the driver is doing the right thing and isn't causing the delay itself.
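To illustrate why extra work inside the timed window matters, here is a minimal sketch in plain C of a wall-clock bound around an operation. It is not the gem_eio.c code: seconds_since(), completes_within(), wait_for_idle_only() and wait_and_free() are hypothetical names, and the 50 ms sleep merely stands in for a barrier of unpredictable latency.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Seconds elapsed on the monotonic clock since *start. */
static double seconds_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (double)(now.tv_sec - start->tv_sec) +
	       (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/* Run op() and report whether the whole call finished inside limit_s. */
static bool completes_within(void (*op)(void), double limit_s)
{
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);
	op();
	return seconds_since(&start) < limit_s;
}

/* Hypothetical stand-in: only the work we actually want to time. */
static void wait_for_idle_only(void)
{
}

/* Hypothetical stand-in: the same work plus an unrelated slow step. */
static void wait_and_free(void)
{
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000 };

	nanosleep(&delay, NULL); /* models a barrier with its own latency */
}
```

Everything executed between the two clock reads counts against the limit, so any unrelated slow step placed inside the window can fail the check even when the operation under test was fast. That is the reasoning the commit gives for restricting the in-window work to wait-for-idle.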
It was definitely not fixed: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_46/fi-kbl-7567u/igt@gem_eio@in-flight-suspend.html
(In reply to Martin Peres from comment #3)
> It was definitely not fixed:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_46/fi-kbl-7567u/igt@gem_eio@in-flight-suspend.html

But that isn't the same bug. In that case there was an unexpected GPU hangcheck after resume.
This issue is occurring regularly.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_115/fi-bsw-n3050/igt@gem_eio@in-flight-1us.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_113/fi-bsw-n3050/igt@gem_eio@in-flight-external.html
Update: Last seen CI_DRM_4943_full (1 week, 6 days / 164 runs ago).
This has a very low failure rate: I have been running the test list from IGT_4727 for over 48 hours (236 runs so far) without a single failure. Is it OK to lower the "importance" of this bug to "lowest"?
Last seen 2 weeks ago on GLK:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5207/shard-glk7/igt@gem_eio@wait-wedge-immediate.html

Before that, it occurred weekly.
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 12 20:40:34 2019 +0000

    i915/gem_eio: Check average reset times

    As we have moved to rcu/srcu to serialise the resets, individual resets
    are subject to small variations in system grace periods. Allow for this
    by only expecting the median reset time to be within our target, thereby
    excluding noisy outliers from perturbing our results (but keep the
    maximum capped to prevent horrid failures!)

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
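The median-plus-cap idea from the commit message can be sketched as follows. This is an illustration in plain C, not the code that landed in gem_eio.c: the function names are hypothetical, the 2000 ms target mirrors the 2s figure quoted in this thread, and the 10000 ms cap is an assumed value chosen only for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Assumed thresholds in milliseconds; the cap is illustrative only. */
#define MEDIAN_TARGET_MS 2000.0
#define MAX_CAP_MS 10000.0

/* qsort() comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Sort the samples in place (ascending) and return their median. */
static double median_ms(double *samples, size_t n)
{
	assert(n > 0);
	qsort(samples, n, sizeof(*samples), cmp_double);
	if (n % 2)
		return samples[n / 2];
	return (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

/*
 * Pass when the median is inside the target and no single sample
 * exceeds the hard cap. A lone slow outlier below the cap is tolerated.
 */
static bool reset_times_ok(double *samples, size_t n)
{
	double median = median_ms(samples, n);
	double max = samples[n - 1]; /* array is sorted by median_ms() */

	return median < MEDIAN_TARGET_MS && max < MAX_CAP_MS;
}
```

With samples {100, 150, 3000, 120, 130} the check passes, because the median is 130 ms and the single 3000 ms outlier stays under the cap. Replacing that outlier with 12000 ms fails the check on the cap alone, which is the "horrid failure" case the commit wants to keep catching.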
A CI Bug Log filter associated to this bug has been updated:

{- all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 -}
{+ all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5729/shard-glk4/igt@gem_eio@context-create.html
This bug is reopened due to this failure:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5729/shard-glk4/igt@gem_eio@context-create.html

Starting subtest: context-create
(gem_eio:2574) CRITICAL: Test assertion failure function trigger_reset, file ../tests/i915/gem_eio.c:82:
(gem_eio:2574) CRITICAL: Failed assertion: igt_seconds_elapsed(&ts) < 2
Subtest context-create failed.
**** DEBUG ****
(gem_eio:2574) i915/gem_context-DEBUG: Test requirement passed: gem_has_contexts(fd)
(gem_eio:2574) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_eio:2574) DEBUG: Disabling GPU reset
(gem_eio:2574) DEBUG: Test requirement passed: fd >= 0
(gem_eio:2574) DEBUG: Test requirement passed: i915_reset_control(false)
(gem_eio:2574) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_eio:2574) DEBUG: Enabling GPU reset
(gem_eio:2574) DEBUG: Test requirement passed: fd >= 0
(gem_eio:2574) igt_gt-DEBUG: Triggering GPU reset
(gem_eio:2574) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_eio:2574) DEBUG: Checking that the GPU recovered
(gem_eio:2574) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_eio:2574) CRITICAL: Test assertion failure function trigger_reset, file ../tests/i915/gem_eio.c:82:
(gem_eio:2574) CRITICAL: Failed assertion: igt_seconds_elapsed(&ts) < 2
(gem_eio:2574) igt_core-INFO: Stack trace:
(gem_eio:2574) igt_core-INFO: #0 ../lib/igt_core.c:1474 __igt_fail_assert()
(gem_eio:2574) igt_core-INFO: #1 ../tests/i915/gem_eio.c:83 trigger_reset()
(gem_eio:2574) igt_core-INFO: #2 ../tests/i915/gem_eio.c:132 test_context_create()
(gem_eio:2574) igt_core-INFO: #3 ../tests/i915/gem_eio.c:835 __real_main814()
(gem_eio:2574) igt_core-INFO: #4 ../tests/i915/gem_eio.c:814 main()
(gem_eio:2574) igt_core-INFO: #5 ../csu/libc-start.c:344 __libc_start_main()
(gem_eio:2574) igt_core-INFO: #6 [_start+0x2a]
**** END ****
Latest occurrence from two weeks ago: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6087/shard-glk1/igt@gem_eio@wait-10ms.html
Continues to occur every 1-2 weeks, e.g.: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6402/shard-glk6/igt@gem_eio@in-flight-10ms.html
A CI Bug Log filter associated to this bug has been updated:

{- all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 -}
{+ all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_eio@in-flight-internal-immediate.html
A CI Bug Log filter associated to this bug has been updated:

{- all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 -}
{+ all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6976/shard-snb1/igt@gem_eio@kms.html
Note to self: with the upcoming heartbeat change for snb/ivb, it's likely we need to bump this to allow for the heartbeat timeout, say 5s.
Give up.

commit 74f55119f9920b65996535210a09147997804136 (HEAD, upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Oct 2 12:17:07 2019 +0100

    i915/gem_eio: Relax timeout for forced resets

    It appears we cannot consistently hit our self-imposed QoS target of 2s
    for performing the reset (my theory is that is some RCU scheduling
    quirk), so relax the assertion to a measly 10s.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105957
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>