A lot of igt@gem_exec_schedule subtests started to fail with a GPU hang on CI_DRM_3514:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl1/igt@gem_exec_schedule@preempt-contexts-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb3/igt@gem_exec_schedule@preempt-contexts-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-contexts-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl4/igt@gem_exec_schedule@preempt-vebox.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb2/igt@gem_exec_schedule@preempt-vebox.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl4/igt@gem_exec_schedule@preempt-vebox.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl2/igt@gem_exec_schedule@preempt-render.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb2/igt@gem_exec_schedule@preempt-render.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl6/igt@gem_exec_schedule@preempt-render.html

(gem_exec_schedule:4286) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_schedule:4286) igt-aux-CRITICAL: Failed assertion: !"GPU hung"

This is possibly related to bug 104259, which also started on CI_DRM_3514 on some BAT machines.
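For context, the "GPU hung" abort above comes from IGT's hang detector rather than from a check inside the subtest itself. A minimal sketch of that pattern (illustrative only, not taken from gem_exec_schedule.c) looks like this: a detector process is forked before the subtests run, and if the kernel reports a GPU reset while a subtest executes, the signal handler (sig_abort in igt_aux.c) trips the assertion !"GPU hung" and aborts the test.

/* Illustrative sketch of the hang-detector pattern, not the real test. */
#include "igt.h"

igt_main
{
	int fd = -1;

	igt_fixture {
		fd = drm_open_driver(DRIVER_INTEL);
		/* fork a child that watches for GPU reset events and
		 * signals us; the handler aborts with !"GPU hung" */
		igt_fork_hang_detector(fd);
	}

	igt_subtest("example")
		; /* queue and preempt batches here; any reset aborts via sig_abort */

	igt_fixture {
		igt_stop_hang_detector();
		close(fd);
	}
}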
There are more of these, but they are already covered by bug 102848:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-bsd2.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl2/igt@gem_exec_schedule@preempt-other-blt.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-contexts-bsd.html

GLK is green on the above.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-self-bsd2.html

The ones below do not fail with a GPU hang; instead:

(gem_exec_schedule:2729) CRITICAL: Test assertion failure function preemptive_hang, file gem_exec_schedule.c:545:
(gem_exec_schedule:2729) CRITICAL: Failed assertion: gem_bo_busy(fd, spin[n]->handle)
Subtest preemptive-hang-vebox failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl5/igt@gem_exec_schedule@preemptive-hang-vebox.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-self-bsd2.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb3/igt@gem_exec_schedule@preemptive-hang-vebox.html
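For reference, a rough sketch of what that failing check exercises (simplified, with illustrative sizes and engine choice; not the actual gem_exec_schedule.c code, and the old-style spinner API is assumed): spinning batches are queued on separate contexts, a hang on a higher-priority context is injected and recovered, and the test then expects each innocent spinner to still be busy on the ring.

/* Hedged sketch of the preemptive-hang check, not the real subtest. */
#include "igt.h"

#define NSPIN 4 /* illustrative; the real test sizes this to the ELSP queue depth */

igt_simple_main
{
	igt_spin_t *spin[NSPIN];
	int fd = drm_open_driver_master(DRIVER_INTEL);

	for (int n = 0; n < NSPIN; n++) {
		uint32_t ctx = gem_context_create(fd);
		spin[n] = igt_spin_batch_new(fd, ctx, I915_EXEC_BSD, 0);
		gem_context_destroy(fd, ctx);
	}

	/* ... inject a hang on a higher-priority context and wait for the reset ... */

	for (int n = 0; n < NSPIN; n++) {
		/* the assertion that fails above: the preempted spinner
		 * must still be executing after the reset */
		igt_assert(gem_bo_busy(fd, spin[n]->handle));
		igt_spin_batch_free(fd, spin[n]);
	}

	close(fd);
}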
The frequency is certainly troubling; it looks like it fails on the first execution on each machine? So I'm wondering if we have some leftover state that we are now not flushing. The previous failures have been more sporadic, where the tests were running much slower and timed out. The ones I've looked at here barely make it to the second iteration.
I believe I was wrong in calling out CI_DRM_3514 as the first occurrence; it was actually already present on IGT_4063, which is based on CI_DRM_3513. Here is one example:

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4063/shard-apl8/igt@gem_exec_schedule@preempt-contexts-bsd.html
So far, every failure I've looked at has the hallmarks of the spin batch not ending. However, it is refusing to fail on my machines, suggesting that it is something to do with the pre-existing state. Grr.
I am not sure this is actually related to this issue; it only happened once on the GLK shards, whereas the other failures reproduce on every run from CI_DRM_3514 up to CI_DRM_3521 so far.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3517/shard-glkb2/igt@gem_exec_schedule@preemptive-hang-bsd.html

(gem_exec_schedule:2682) CRITICAL: Test assertion failure function preemptive_hang, file gem_exec_schedule.c:545:
(gem_exec_schedule:2682) CRITICAL: Failed assertion: gem_bo_busy(fd, spin[n]->handle)
Subtest preemptive-hang-bsd failed.
With a bit of luck, the following commit fixes this:

commit 7b6da818d86fddfc88ddb523d6539c1bf7fc6302
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Dec 16 00:03:34 2017 +0000

    drm/i915: Restore the kernel context after a GPU reset on an idle engine

    As part of the system requirement for powersaving is that we always
    have a context loaded. Upon boot and resume, we load the kernel_context
    to ensure that some valid state is set before powersaving kicks in, we
    should do so after a full GPU reset as well. We only need to do so for
    an idle engine, as any active engines will restart by executing the
    stuck request, loading its context. For the idle engine, we create a
    new request to load the kernel_context instead.

    For whatever reason, performing a dummy execute on the idle engine
    after reset papers over a subsequent GPU hang in rare circumstances,
    even on machines not using contexts (e.g. Pineview).

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104259
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104261
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171216000334.8197-1-chris@chris-wilson.co.uk
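To spell out the idea in the commit message, here is a rough paraphrase in code. This is not the verbatim patch; the idle check is named only for illustration. After a full GPU reset, an engine that is left idle gets a dummy request that loads the kernel context, so valid state is resident before powersaving kicks in, while busy engines reload their context by replaying the stuck request.

/* Rough sketch of the approach described above, not the actual patch. */
static void restore_kernel_context_after_reset(struct drm_i915_private *dev_priv)
{
	struct intel_engine_cs *engine;
	enum intel_engine_id id;

	for_each_engine(engine, dev_priv, id) {
		struct drm_i915_gem_request *rq;

		/*
		 * An active engine restarts by re-executing the stuck
		 * request, which reloads its context; only an idle engine
		 * needs a fresh request to pull in the kernel_context.
		 */
		if (engine_has_pending_request(engine)) /* illustrative check */
			continue;

		rq = i915_gem_request_alloc(engine, dev_priv->kernel_context);
		if (!IS_ERR(rq))
			__i915_add_request(rq, false);
	}
}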
The patch was integrated in CI_DRM_3526 and these tests are green again, so I will close this.
*** Bug 104315 has been marked as a duplicate of this bug. ***
*** Bug 104314 has been marked as a duplicate of this bug. ***