Summary: | [CI][BAT bsw] igt@gem_* - fail - Failed assertion: !"GPU hung" | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | Chris Wilson <chris> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs |
Version: | DRI git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BSW/CHT | i915 features: | GPU hang |
Description
Martin Peres
2018-11-05 11:05:32 UTC
Looks like a genuine TLB or other cache failure as it ran off the end of the batchbuffer. And on the next run, the other two bsw failed similarly. Suspicious.

*** Bug 108666 has been marked as a duplicate of this bug. ***

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-bsw-kefka/igt@gem_exec_schedule@deep-vebox.html

Starting subtest: deep-vebox
(gem_exec_schedule:1391) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:1391) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest deep-vebox failed.

*** Bug 108681 has been marked as a duplicate of this bug. ***

Ever the optimist:

commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 8 08:17:38 2018 +0000

    drm/i915/execlists: Force write serialisation into context image vs execution

    Ensure that the writes into the context image are completed prior to the register mmio to trigger execution. Although previously we were assured by the SDM that all writes are flushed before an uncached memory transaction (our mmio write to submit the context to HW for execution), we have empirical evidence to believe that this is not actually the case.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181108081740.25615-1-chris@chris-wilson.co.uk
    Cc: stable@vger.kernel.org

My bsw hung again,

batch: [0x00000000_03e02000, 0x00000000_03e03000]
BBADDR: 0x00000000_03e07001

so not the magic bullet.
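The "ran off the end of the batchbuffer" reading can be checked with simple arithmetic on the addresses quoted in this thread. The following is a minimal sketch, not part of any i915 or IGT tooling: the helper names are hypothetical, and treating the batch range as half-open (start inclusive, end exclusive) is an assumption. The conclusion does not depend on that choice, since BBADDR lies well past either bound.

```python
def parse_addr(text):
    """Parse an address printed as 0xHHHHHHHH_LLLLLLLL into an integer."""
    return int(text.replace("_", ""), 16)


def bytes_past_end(end, addr):
    """Return how many bytes addr lies beyond end (0 if it is not beyond)."""
    return max(0, addr - end)


# Values copied from the hang report quoted in this bug.
batch_start = parse_addr("0x00000000_03e02000")
batch_end = parse_addr("0x00000000_03e03000")
bbaddr = parse_addr("0x00000000_03e07001")

# Assumption: the batch occupies [batch_start, batch_end).
inside_batch = batch_start <= bbaddr < batch_end
overrun = bytes_past_end(batch_end, bbaddr)

print("batch size:", hex(batch_end - batch_start))  # 0x1000
print("BBADDR inside batch:", inside_batch)         # False
print("bytes past end:", hex(overrun))              # 0x4001
```

With a 0x1000-byte batch, the reported BBADDR is about four pages beyond the end of the batch, which is consistent with the GPU executing from pages it was never given.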
(In reply to Chris Wilson from comment #6)
> Ever the optimist:
>
> commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD ->
> drm-intel-next-queued, drm-intel/for-linux-next,
> drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Thu Nov 8 08:17:38 2018 +0000
>
> drm/i915/execlists: Force write serialisation into context image vs
> execution

And for confirmation that no, this sadly wasn't it:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5110/fi-bsw-kefka/igt@gem_ctx_create@basic-files.html

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4718/fi-bsw-kefka/igt@i915_selftest@live_contexts.html

<7> [597.917656] [drm:intel_gpu_reset [i915]] vecs0: timed out on STOP_RING

*** Bug 108798 has been marked as a duplicate of this bug. ***

*** Bug 108802 has been marked as a duplicate of this bug. ***

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_146/fi-bsw-kefka/igt@gem_exec_schedule@deep-bsd.html

Starting subtest: deep-bsd
(gem_exec_schedule:986) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:986) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest deep-bsd failed.
**** DEBUG ****
(gem_exec_schedule:986) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_exec_schedule:986) igt_dummyload-DEBUG: Test requirement passed: vgem_has_fences(cork->vgem.device)
(gem_exec_schedule:986) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_schedule:986) INFO: Using 2046 requests (prio range 2046)
(gem_exec_schedule:986) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_exec_schedule:986) igt_dummyload-DEBUG: Test requirement passed: vgem_has_fences(cork->vgem.device)
(gem_exec_schedule:986) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:986) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_exec_schedule:986) igt_core-INFO: Stack trace:
(gem_exec_schedule:986) igt_core-INFO: #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_exec_schedule:986) igt_core-INFO: #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_exec_schedule:986) igt_core-INFO: #2 [killpg+0x40]
(gem_exec_schedule:986) igt_core-INFO: #3 ../sysdeps/unix/syscall-template.S:78 ioctl()
(gem_exec_schedule:986) igt_core-INFO: #4 /home/cidrm/libdrm/xf86drm.c:189 drmIoctl()
(gem_exec_schedule:986) igt_core-INFO: #5 ../lib/ioctl_wrappers.c:589 __gem_execbuf()
(gem_exec_schedule:986) igt_core-INFO: #6 ../lib/ioctl_wrappers.c:605 gem_execbuf()
(gem_exec_schedule:986) igt_core-INFO: #7 ../tests/i915/gem_exec_schedule.c:850 deep()
(gem_exec_schedule:986) igt_core-INFO: #8 ../tests/i915/gem_exec_schedule.c:1314 __real_main1194()
(gem_exec_schedule:986) igt_core-INFO: #9 ../tests/i915/gem_exec_schedule.c:1194 main()
(gem_exec_schedule:986) igt_core-INFO: #10 ../csu/libc-start.c:344 __libc_start_main()
(gem_exec_schedule:986) igt_core-INFO: #11 [_start+0x2a]
**** END ****
Subtest deep-bsd: FAIL (9.107s)

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5192/fi-bsw-kefka/igt@gem_ctx_create@basic-files.html

Starting subtest: basic-files
(gem_ctx_create:1105) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_ctx_create:1105) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-files failed.
**** DEBUG ****
(gem_ctx_create:1105) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_ctx_create:1105) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_ctx_create:1105) igt_core-INFO: Stack trace:
(gem_ctx_create:1105) igt_core-INFO: #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_ctx_create:1105) igt_core-INFO: #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_ctx_create:1105) igt_core-INFO: #2 [killpg+0x40]
(gem_ctx_create:1105) igt_core-INFO: #3 ../sysdeps/unix/sysv/linux/wait.c:29 wait()
(gem_ctx_create:1105) igt_core-INFO: #4 ../lib/igt_core.c:1768 __igt_waitchildren()
(gem_ctx_create:1105) igt_core-INFO: #5 ../lib/igt_core.c:1819 igt_waitchildren()
(gem_ctx_create:1105) igt_core-INFO: #6 ../tests/i915/gem_ctx_create.c:102 files()
(gem_ctx_create:1105) igt_core-INFO: #7 ../tests/i915/gem_ctx_create.c:360 __real_main311()
(gem_ctx_create:1105) igt_core-INFO: #8 ../tests/i915/gem_ctx_create.c:311 main()
(gem_ctx_create:1105) igt_core-INFO: #9 ../csu/libc-start.c:344 __libc_start_main()
(gem_ctx_create:1105) igt_core-INFO: #10 [_start+0x2a]
**** END ****
Subtest basic-files: FAIL (8.358s)

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_148/fi-bsw-kefka/igt@gem_exec_parallel@default-contexts.html

Starting subtest: default-contexts
(gem_exec_parallel:1795) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_parallel:1795) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest default-contexts failed.
**** DEBUG ****
(gem_exec_parallel:1795) i915/gem_context-DEBUG: Test requirement passed: gem_has_contexts(fd)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: gem_can_store_dword(fd, engine)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: nengine
(gem_exec_parallel:1795) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_parallel:1795) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_parallel:1795) DEBUG: Verifying result (pass=0, handle=1)
(gem_exec_parallel:1795) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_parallel:1795) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_exec_parallel:1795) igt_core-INFO: Stack trace:
(gem_exec_parallel:1795) igt_core-INFO: #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_exec_parallel:1795) igt_core-INFO: #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_exec_parallel:1795) igt_core-INFO: #2 [killpg+0x40]
(gem_exec_parallel:1795) igt_core-INFO: #3 ../sysdeps/unix/syscall-template.S:78 ioctl()
(gem_exec_parallel:1795) igt_core-INFO: #4 /home/cidrm/libdrm/xf86drm.c:191 drmIoctl()
(gem_exec_parallel:1795) igt_core-INFO: #5 ../lib/ioctl_wrappers.c:402 __gem_set_domain()
(gem_exec_parallel:1795) igt_core-INFO: #6 ../lib/ioctl_wrappers.c:422 gem_set_domain()
(gem_exec_parallel:1795) igt_core-INFO: #7 ../tests/i915/gem_exec_parallel.c:53 check_bo()
(gem_exec_parallel:1795) igt_core-INFO: #8 ../tests/i915/gem_exec_parallel.c:221 all()
(gem_exec_parallel:1795) igt_core-INFO: #9 ../tests/i915/gem_exec_parallel.c:255 __real_main228()
(gem_exec_parallel:1795) igt_core-INFO: #10 ../tests/i915/gem_exec_parallel.c:228 main()
(gem_exec_parallel:1795) igt_core-INFO: #11 ../csu/libc-start.c:344 __libc_start_main()
(gem_exec_parallel:1795) igt_core-INFO: #12 [_start+0x2a]
**** END ****
Subtest default-contexts: FAIL (9.456s)
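When comparing several of these CI logs side by side, the per-subtest result lines are the quickest thing to extract. The sketch below is illustrative only: the regular expression is inferred from the three "Subtest NAME: RESULT (N.NNNs)" lines quoted in this bug rather than from any IGT format specification, and the names `SUBTEST_RESULT` and `subtest_results` are hypothetical.

```python
import re

# Pattern inferred from the result lines quoted in this thread (an assumption,
# not a documented IGT output format).
SUBTEST_RESULT = re.compile(
    r"Subtest (?P<name>[\w-]+): (?P<result>[A-Z]+) \((?P<seconds>[\d.]+)s\)"
)


def subtest_results(log_text):
    """Return (name, result, seconds) for every subtest result line found."""
    return [
        (m["name"], m["result"], float(m["seconds"]))
        for m in SUBTEST_RESULT.finditer(log_text)
    ]


# Result lines copied from the logs above; the "Subtest ... failed." line is
# included to show that it is not mistaken for a result line.
sample = """\
Subtest deep-bsd failed.
Subtest deep-bsd: FAIL (9.107s)
Subtest basic-files: FAIL (8.358s)
Subtest default-contexts: FAIL (9.456s)
"""

results = subtest_results(sample)
for name, result, seconds in results:
    print(f"{name}: {result} after {seconds}s")
```

All three quoted failures took roughly 8 to 9.5 seconds before the hang detector fired.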
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4734/fi-bsw-kefka/igt@i915_selftest@live_hangcheck.html

<3> [565.196923] [drm:gen8_reset_engines [i915]] *ERROR* vecs0: reset request timeout
<3> [565.200917] [drm:gen8_reset_engines [i915]] *ERROR* vecs0: reset request timeout
<3> [565.213506] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5

Step 1:

commit 490b8c65b9db45896769e1095e78725775f47b3e (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Dec 6 08:44:31 2018 +0000

    drm/i915/execlists: Apply a full mb before execution for Braswell

    Braswell is really picky about having our writes posted to memory before we execute or else the GPU may see stale values. A wmb() is insufficient as it only ensures the writes are visible to other cores, we need a full mb() to ensure the writes are in memory and visible to the GPU. The most frequent failure in flushing before execution is that we see stale PTE values and execute the wrong pages.

    References: 987abd5c62f9 ("drm/i915/execlists: Force write serialisation into context image vs execution")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Step 2:

commit e8894267cc3325901073e8adf0a63e2dc53b6242 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Dec 7 09:02:13 2018 +0000

    drm/i915: Pipeline PDP updates for Braswell

    Currently we face a severe problem on Braswell that manifests as invalid ppGTT accesses. The code tries to maintain the PDP (page directory pointers) inside the context in two ways, direct write into the context and a pipelined LRI update. The direct write into the context is fundamentally racy as it is unserialised with any access (read or write) the GPU is doing. By asserting that Braswell is not used with vGPU (currently an unsupported platform) we can eliminate the dangerous direct write into the context image and solely use the pipelined update.

    However, the LRI of the PDP fouls up the GPU, causing it to freeze and take out the machine with "forcewake ack timeouts". This seems possible to workaround by preventing the GPU from sleeping (via means of disabling the power-state management interface, i.e. forcing each ring to remain awake) around the update. Equally, it seems an EMIT_INVALIDATE before the LRI is sufficient to prevent the forcewake errors.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108714
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181207090213.14352-3-chris@chris-wilson.co.uk

Step 3: Hope for the best.

Note that the root cause for this bug is believed to be invalid TLB / vm. Something like https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5282/fi-bsw-cyan/igt@gem_ctx_create@basic-files.html has different symptoms (the GPU just froze). It may be worthwhile to treat that as a different bug if it persists.

(In reply to Chris Wilson from comment #17)
> Step 3: Hope for the best.

A step too far.
A CI Bug Log filter associated to this bug has been updated:

{- BSW: random tests - fail | dmesg-fail - Failed assertion: !"GPU hung" -}
{+ BSW: random tests - fail | dmesg-fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_159/fi-bsw-n3050/igt@gem_exec_parallel@render-contexts.html

A CI Bug Log filter associated to this bug has been updated:

{- BSW: random tests - fail | dmesg-fail - Failed assertion: !"GPU hung" -}
{+ BSW: random tests - fail | dmesg-fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_173/fi-bsw-kefka/igt@gem_exec_parallel@blt-fds.html
* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_173/fi-bsw-n3050/igt@gem_exec_schedule@deep-render.html

A new hope:

commit 8cd999181f8c744c87fb64e7b3600876ec3428b2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 14 21:17:27 2019 +0000

    drm/i915: Prevent concurrent GGTT update and use on Braswell (again)

    On Braswell, under heavy stress, if we update the GGTT while simultaneously accessing another region inside the GTT, we are returned the wrong values. To prevent this we stop the machine to update the GGTT entries so that no memory traffic can occur at the same time.

    This was first spotted in

    commit 5bab6f60cb4d1417ad7c599166bcfec87529c1a2
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Fri Oct 23 18:43:32 2015 +0100

        drm/i915: Serialise updates to GGTT with access through GGTT on Braswell

    but removed again in forlorn hope with

    commit 4509276ee824bb967885c095c610767e42345c36
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Mon Feb 20 12:47:18 2017 +0000

        drm/i915: Remove Braswell GGTT update w/a

    However, gem_concurrent_blit is once again only stable with the patch applied and CI is detecting the odd failure in forked gem_mmap_gtt tests (which smell like the same issue). Fwiw, a wide variety of CPU memory barriers (around GGTT flushing, fence updates, PTE updates) and GPU flushes/invalidates (between requests, after PTE updates) were tried as part of the investigation to find an alternate cause, nothing comes close to serialised GGTT updates.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105591
    Testcase: igt/gem_concurrent_blit
    Testcase: igt/gem_mmap_gtt/*forked*
    References: 5bab6f60cb4d ("drm/i915: Serialise updates to GGTT with access through GGTT on Braswell")
    References: 4509276ee824 ("drm/i915: Remove Braswell GGTT update w/a")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190114211729.30352-1-chris@chris-wilson.co.uk

Last seen this issue drmtip_173 (2 months / 1083 runs ago). Not seen after the fix. Closing the bug.

The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.