Bug 108656 - [CI][BAT bsw] igt@gem_* - fail - Failed assertion: !"GPU hung"
Summary: [CI][BAT bsw] igt@gem_* - fail - Failed assertion: !"GPU hung"
Status: REOPENED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 108666 108681 108798 108802
Depends on:
Blocks:
 
Reported: 2018-11-05 11:05 UTC by Martin Peres
Modified: 2018-12-08 10:11 UTC (History)

See Also:
i915 platform: BSW/CHT
i915 features: GPU hang


Description Martin Peres 2018-11-05 11:05:32 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5085/fi-bsw-cyan/igt@gem_ctx_create@basic-files.html

Starting subtest: basic-files
(gem_ctx_create:980) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_ctx_create:980) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-files failed.
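
For triage across the many CI results collected in this report, the CRITICAL lines can be pulled out of an IGT log mechanically. The following is a minimal sketch, assuming the line layout seen in the logs here, "(test:pid) domain-LEVEL: message". The names IGT_LINE and failed_assertions are hypothetical helpers for illustration and are not part of IGT.

```python
import re

# Line layout assumed from the logs quoted in this report, for example:
#   (gem_ctx_create:980) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
# The "domain-" prefix is optional (some lines read just "INFO:" or "DEBUG:").
IGT_LINE = re.compile(
    r'^\((?P<test>[^:()]+):(?P<pid>\d+)\)\s+'
    r'(?:(?P<domain>[\w/]+)-)?(?P<level>[A-Z]+):\s+(?P<msg>.*)$'
)

PREFIX = "Failed assertion: "


def failed_assertions(log_text):
    """Return a (test, pid, assertion) tuple for each failed-assertion line."""
    found = []
    for line in log_text.splitlines():
        m = IGT_LINE.match(line.strip())
        if m is None:
            continue  # not an IGT-prefixed line, e.g. "Starting subtest: ..."
        msg = m.group("msg")
        if m.group("level") == "CRITICAL" and msg.startswith(PREFIX):
            found.append((m.group("test"), int(m.group("pid")), msg[len(PREFIX):]))
    return found
```

Feeding it the four log lines above yields a single entry for gem_ctx_create with pid 980.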
Comment 1 Chris Wilson 2018-11-05 12:07:34 UTC
Looks like a genuine TLB or other cache failure as it ran off the end of the batchbuffer.
Comment 2 Chris Wilson 2018-11-05 13:19:10 UTC
And on the next run, the other two bsw failed similarly. Suspicious.
Comment 3 Chris Wilson 2018-11-05 14:18:56 UTC
*** Bug 108666 has been marked as a duplicate of this bug. ***
Comment 4 Martin Peres 2018-11-06 09:24:46 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-bsw-kefka/igt@gem_exec_schedule@deep-vebox.html

Starting subtest: deep-vebox
(gem_exec_schedule:1391) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:1391) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest deep-vebox failed.
Comment 5 Chris Wilson 2018-11-06 12:33:13 UTC
*** Bug 108681 has been marked as a duplicate of this bug. ***
Comment 6 Chris Wilson 2018-11-08 12:24:32 UTC
Ever the optimist:

commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 8 08:17:38 2018 +0000

    drm/i915/execlists: Force write serialisation into context image vs execution
    
    Ensure that the writes into the context image are completed prior to the
    register mmio to trigger execution. Although previously we were assured
    by the SDM that all writes are flushed before an uncached memory
    transaction (our mmio write to submit the context to HW for execution),
    we have empirical evidence to believe that this is not actually the
    case.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181108081740.25615-1-chris@chris-wilson.co.uk
    Cc: stable@vger.kernel.org
Comment 7 Chris Wilson 2018-11-08 13:39:31 UTC
My bsw hung again,

  batch: [0x00000000_03e02000, 0x00000000_03e03000]
  BBADDR: 0x00000000_03e07001

so not the magic bullet.
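
The two values above can be compared mechanically to confirm that execution left the batch. This is only a sketch: it assumes the batch range is half-open, [start, end), and it compares BBADDR as printed without masking any low bits. The helpers parse_addr and classify_bbaddr are hypothetical names, not i915 or IGT functions.

```python
def parse_addr(text):
    """Parse an address printed as '0x00000000_03e02000' (underscore between halves)."""
    return int(text.replace("_", ""), 16)


def classify_bbaddr(batch_start, batch_end, bbaddr):
    """Say where bbaddr lies relative to the half-open range [batch_start, batch_end)."""
    if bbaddr < batch_start:
        return "before"
    if bbaddr >= batch_end:
        return "after"
    return "inside"


start = parse_addr("0x00000000_03e02000")
end = parse_addr("0x00000000_03e03000")
bbaddr = parse_addr("0x00000000_03e07001")

where = classify_bbaddr(start, end, bbaddr)
overrun = bbaddr - end  # distance past the end of the batch, in bytes
print(where, hex(overrun))
```

With the values from this hang, BBADDR lies 0x4001 bytes past the end of the 4 KiB batch.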
Comment 8 Chris Wilson 2018-11-09 12:28:57 UTC
(In reply to Chris Wilson from comment #6)
> Ever the optimist:
> 
> commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD ->
> drm-intel-next-queued, drm-intel/for-linux-next,
> drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Nov 8 08:17:38 2018 +0000
> 
>     drm/i915/execlists: Force write serialisation into context image vs
> execution 

And for confirmation that no, this sadly wasn't it:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5110/fi-bsw-kefka/igt@gem_ctx_create@basic-files.html
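
To see which runs, machines and subtests are involved, the CI result links posted in this report can be split into their parts. The sketch below assumes the path layout seen in these links, /tree/<tree>/<run>/<machine>/igt@<test>@<subtest>.html. The function name split_ci_url is hypothetical and the layout is inferred from the URLs here, not from any documented CI API.

```python
from urllib.parse import urlparse


def split_ci_url(url):
    """Split a CI result URL into tree, run, machine, test and subtest."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected: ["tree", <tree>, <run>, <machine>, "igt@<test>@<subtest>.html"]
    if len(parts) != 5 or parts[0] != "tree":
        raise ValueError("unexpected CI URL layout: " + url)
    page = parts[4]
    if page.endswith(".html"):
        page = page[:-len(".html")]
    fields = page.split("@")
    if len(fields) != 3 or fields[0] != "igt":
        raise ValueError("unexpected test page name: " + page)
    return {
        "tree": parts[1],
        "run": parts[2],
        "machine": parts[3],
        "test": fields[1],
        "subtest": fields[2],
    }
```

Applied to the link above, it reports run CI_DRM_5110 on fi-bsw-kefka, test gem_ctx_create, subtest basic-files.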
Comment 9 Martin Peres 2018-11-16 15:58:01 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4718/fi-bsw-kefka/igt@i915_selftest@live_contexts.html

<7> [597.917656] [drm:intel_gpu_reset [i915]] vecs0: timed out on STOP_RING
Comment 10 Chris Wilson 2018-11-19 16:35:53 UTC
*** Bug 108798 has been marked as a duplicate of this bug. ***
Comment 11 Chris Wilson 2018-11-19 17:00:44 UTC
*** Bug 108802 has been marked as a duplicate of this bug. ***
Comment 12 Lakshmi 2018-11-22 09:56:02 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_146/fi-bsw-kefka/igt@gem_exec_schedule@deep-bsd.html

Starting subtest: deep-bsd
(gem_exec_schedule:986) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:986) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest deep-bsd failed.
**** DEBUG ****
(gem_exec_schedule:986) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_exec_schedule:986) igt_dummyload-DEBUG: Test requirement passed: vgem_has_fences(cork->vgem.device)
(gem_exec_schedule:986) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_schedule:986) INFO: Using 2046 requests (prio range 2046)
(gem_exec_schedule:986) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_exec_schedule:986) igt_dummyload-DEBUG: Test requirement passed: vgem_has_fences(cork->vgem.device)
(gem_exec_schedule:986) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:986) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_exec_schedule:986) igt_core-INFO: Stack trace:
(gem_exec_schedule:986) igt_core-INFO:   #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_exec_schedule:986) igt_core-INFO:   #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_exec_schedule:986) igt_core-INFO:   #2 [killpg+0x40]
(gem_exec_schedule:986) igt_core-INFO:   #3 ../sysdeps/unix/syscall-template.S:78 ioctl()
(gem_exec_schedule:986) igt_core-INFO:   #4 /home/cidrm/libdrm/xf86drm.c:189 drmIoctl()
(gem_exec_schedule:986) igt_core-INFO:   #5 ../lib/ioctl_wrappers.c:589 __gem_execbuf()
(gem_exec_schedule:986) igt_core-INFO:   #6 ../lib/ioctl_wrappers.c:605 gem_execbuf()
(gem_exec_schedule:986) igt_core-INFO:   #7 ../tests/i915/gem_exec_schedule.c:850 deep()
(gem_exec_schedule:986) igt_core-INFO:   #8 ../tests/i915/gem_exec_schedule.c:1314 __real_main1194()
(gem_exec_schedule:986) igt_core-INFO:   #9 ../tests/i915/gem_exec_schedule.c:1194 main()
(gem_exec_schedule:986) igt_core-INFO:   #10 ../csu/libc-start.c:344 __libc_start_main()
(gem_exec_schedule:986) igt_core-INFO:   #11 [_start+0x2a]
****  END  ****
Subtest deep-bsd: FAIL (9.107s)
Comment 13 Lakshmi 2018-11-25 18:56:22 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5192/fi-bsw-kefka/igt@gem_ctx_create@basic-files.html

Starting subtest: basic-files
(gem_ctx_create:1105) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_ctx_create:1105) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-files failed.
**** DEBUG ****
(gem_ctx_create:1105) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_ctx_create:1105) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_ctx_create:1105) igt_core-INFO: Stack trace:
(gem_ctx_create:1105) igt_core-INFO:   #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_ctx_create:1105) igt_core-INFO:   #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_ctx_create:1105) igt_core-INFO:   #2 [killpg+0x40]
(gem_ctx_create:1105) igt_core-INFO:   #3 ../sysdeps/unix/sysv/linux/wait.c:29 wait()
(gem_ctx_create:1105) igt_core-INFO:   #4 ../lib/igt_core.c:1768 __igt_waitchildren()
(gem_ctx_create:1105) igt_core-INFO:   #5 ../lib/igt_core.c:1819 igt_waitchildren()
(gem_ctx_create:1105) igt_core-INFO:   #6 ../tests/i915/gem_ctx_create.c:102 files()
(gem_ctx_create:1105) igt_core-INFO:   #7 ../tests/i915/gem_ctx_create.c:360 __real_main311()
(gem_ctx_create:1105) igt_core-INFO:   #8 ../tests/i915/gem_ctx_create.c:311 main()
(gem_ctx_create:1105) igt_core-INFO:   #9 ../csu/libc-start.c:344 __libc_start_main()
(gem_ctx_create:1105) igt_core-INFO:   #10 [_start+0x2a]
****  END  ****
Subtest basic-files: FAIL (8.358s)
Comment 14 Lakshmi 2018-11-25 19:39:49 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_148/fi-bsw-kefka/igt@gem_exec_parallel@default-contexts.html

Starting subtest: default-contexts
(gem_exec_parallel:1795) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_parallel:1795) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest default-contexts failed.
**** DEBUG ****
(gem_exec_parallel:1795) i915/gem_context-DEBUG: Test requirement passed: gem_has_contexts(fd)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: gem_can_store_dword(fd, engine)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: nengine
(gem_exec_parallel:1795) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_parallel:1795) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_parallel:1795) DEBUG: Verifying result (pass=0, handle=1)
(gem_exec_parallel:1795) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_parallel:1795) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_exec_parallel:1795) igt_core-INFO: Stack trace:
(gem_exec_parallel:1795) igt_core-INFO:   #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_exec_parallel:1795) igt_core-INFO:   #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_exec_parallel:1795) igt_core-INFO:   #2 [killpg+0x40]
(gem_exec_parallel:1795) igt_core-INFO:   #3 ../sysdeps/unix/syscall-template.S:78 ioctl()
(gem_exec_parallel:1795) igt_core-INFO:   #4 /home/cidrm/libdrm/xf86drm.c:191 drmIoctl()
(gem_exec_parallel:1795) igt_core-INFO:   #5 ../lib/ioctl_wrappers.c:402 __gem_set_domain()
(gem_exec_parallel:1795) igt_core-INFO:   #6 ../lib/ioctl_wrappers.c:422 gem_set_domain()
(gem_exec_parallel:1795) igt_core-INFO:   #7 ../tests/i915/gem_exec_parallel.c:53 check_bo()
(gem_exec_parallel:1795) igt_core-INFO:   #8 ../tests/i915/gem_exec_parallel.c:221 all()
(gem_exec_parallel:1795) igt_core-INFO:   #9 ../tests/i915/gem_exec_parallel.c:255 __real_main228()
(gem_exec_parallel:1795) igt_core-INFO:   #10 ../tests/i915/gem_exec_parallel.c:228 main()
(gem_exec_parallel:1795) igt_core-INFO:   #11 ../csu/libc-start.c:344 __libc_start_main()
(gem_exec_parallel:1795) igt_core-INFO:   #12 [_start+0x2a]
****  END  ****
Subtest default-contexts: FAIL (9.456s)
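
The stack traces in the last three comments show the hang detector firing while each test was blocked in a different place (execbuf, waiting on children, set-domain). A rough way to extract that is to find the first frame that belongs to the test source rather than to the library. The sketch below assumes the frame layout "#N file:line func()" seen in these logs and treats any path containing "/tests/" as test code. FRAME and first_test_frame are hypothetical names.

```python
import re

# Frame layout assumed from the traces above, for example:
#   #7 ../tests/i915/gem_exec_schedule.c:850 deep()
# Frames without file information, such as "#2 [killpg+0x40]", do not match.
FRAME = re.compile(r'#(?P<idx>\d+)\s+(?P<file>\S+):(?P<line>\d+)\s+(?P<func>\w+)\(\)')


def first_test_frame(log_text):
    """Return (index, file, line, function) of the first frame under /tests/, or None."""
    for line in log_text.splitlines():
        m = FRAME.search(line)
        if m is not None and "/tests/" in m.group("file"):
            return (int(m.group("idx")), m.group("file"),
                    int(m.group("line")), m.group("func"))
    return None
```

For the trace in the comment directly above, this picks frame #7, check_bo() at gem_exec_parallel.c:53.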
Comment 15 Martin Peres 2018-11-28 15:01:47 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4734/fi-bsw-kefka/igt@i915_selftest@live_hangcheck.html

<3> [565.196923] [drm:gen8_reset_engines [i915]] *ERROR* vecs0: reset request timeout
<3> [565.200917] [drm:gen8_reset_engines [i915]] *ERROR* vecs0: reset request timeout
<3> [565.213506] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
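
The kernel log lines above carry a priority in angle brackets (3 is the error level, 7 is debug) followed by a timestamp in seconds. A minimal sketch for splitting such lines, assuming the "<level> [timestamp] message" layout shown in this report. DMESG_LINE and parse_dmesg are hypothetical names.

```python
import re

# Standard kernel log levels, 0 (most severe) through 7 (debug).
KERN_LEVELS = {
    0: "emerg", 1: "alert", 2: "crit", 3: "err",
    4: "warning", 5: "notice", 6: "info", 7: "debug",
}

# Layout assumed from the lines quoted in this report, for example:
#   <3> [565.196923] [drm:gen8_reset_engines [i915]] *ERROR* vecs0: reset request timeout
DMESG_LINE = re.compile(r'^<(?P<level>[0-7])>\s*\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$')


def parse_dmesg(line):
    """Return (level_name, timestamp_seconds, message), or None if the line does not match."""
    m = DMESG_LINE.match(line.strip())
    if m is None:
        return None
    return (KERN_LEVELS[int(m.group("level"))], float(m.group("ts")), m.group("msg"))
```

The "error -5" in the last line corresponds to -EIO on Linux.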
Comment 16 Chris Wilson 2018-12-07 10:28:16 UTC
Step 1:

commit 490b8c65b9db45896769e1095e78725775f47b3e (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Dec 6 08:44:31 2018 +0000

    drm/i915/execlists: Apply a full mb before execution for Braswell
    
    Braswell is really picky about having our writes posted to memory before
    we execute or else the GPU may see stale values. A wmb() is insufficient
    as it only ensures the writes are visible to other cores, we need a full
    mb() to ensure the writes are in memory and visible to the GPU.
    
    The most frequent failure in flushing before execution is that we see
    stale PTE values and execute the wrong pages.
    
    References: 987abd5c62f9 ("drm/i915/execlists: Force write serialisation into context image vs execution")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Comment 17 Chris Wilson 2018-12-07 13:31:27 UTC
Step 2:

commit e8894267cc3325901073e8adf0a63e2dc53b6242 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Dec 7 09:02:13 2018 +0000

    drm/i915: Pipeline PDP updates for Braswell
    
    Currently we face a severe problem on Braswell that manifests as invalid
    ppGTT accesses. The code tries to maintain the PDP (page directory
    pointers) inside the context in two ways, direct write into the context
    and a pipelined LRI update. The direct write into the context is
    fundamentally racy as it is unserialised with any access (read or write)
    the GPU is doing. By asserting that Braswell is not used with vGPU
    (currently an unsupported platform) we can eliminate the dangerous
    direct write into the context image and solely use the pipelined update.
    
    However, the LRI of the PDP fouls up the GPU, causing it to freeze and
    take out the machine with "forcewake ack timeouts". This seems possible
    to workaround by preventing the GPU from sleeping (via means of
    disabling the power-state management interface, i.e. forcing each ring
    to remain awake) around the update. Equally, it seems an EMIT_INVALIDATE
    before the LRI is sufficient to prevent the forcewake errors.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108714
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181207090213.14352-3-chris@chris-wilson.co.uk


Step 3: Hope for the best.
Comment 18 Chris Wilson 2018-12-07 13:32:42 UTC
Note that the root cause for this bug is believed to be invalid TLB / vm.

Something like https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5282/fi-bsw-cyan/igt@gem_ctx_create@basic-files.html has different symptoms (the GPU just froze). It may be worthwhile to treat that as a different bug if it persists.
Comment 19 Chris Wilson 2018-12-08 10:11:04 UTC
(In reply to Chris Wilson from comment #17)
> Step 3: Hope for the best.

A step too far.

