Bug 108656 - [CI][BAT bsw] igt@gem_* - fail - Failed assertion: !"GPU hung"
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 108666 108681 108798 108802
Depends on:
Blocks:
 
Reported: 2018-11-05 11:05 UTC by Martin Peres
Modified: 2019-02-19 08:47 UTC (History)

See Also:
i915 platform: BSW/CHT
i915 features: GPU hang



Description Martin Peres 2018-11-05 11:05:32 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5085/fi-bsw-cyan/igt@gem_ctx_create@basic-files.html

Starting subtest: basic-files
(gem_ctx_create:980) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_ctx_create:980) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-files failed.
Comment 1 Chris Wilson 2018-11-05 12:07:34 UTC
Looks like a genuine TLB or other cache failure as it ran off the end of the batchbuffer.
Comment 2 Chris Wilson 2018-11-05 13:19:10 UTC
And on the next run, the other two bsw failed similarly. Suspicious.
Comment 3 Chris Wilson 2018-11-05 14:18:56 UTC
*** Bug 108666 has been marked as a duplicate of this bug. ***
Comment 4 Martin Peres 2018-11-06 09:24:46 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-bsw-kefka/igt@gem_exec_schedule@deep-vebox.html

Starting subtest: deep-vebox
(gem_exec_schedule:1391) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:1391) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest deep-vebox failed.
Comment 5 Chris Wilson 2018-11-06 12:33:13 UTC
*** Bug 108681 has been marked as a duplicate of this bug. ***
Comment 6 Chris Wilson 2018-11-08 12:24:32 UTC
Ever the optimist:

commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 8 08:17:38 2018 +0000

    drm/i915/execlists: Force write serialisation into context image vs execution
    
    Ensure that the writes into the context image are completed prior to the
    register mmio to trigger execution. Although previously we were assured
    by the SDM that all writes are flushed before an uncached memory
    transaction (our mmio write to submit the context to HW for execution),
    we have empirical evidence to believe that this is not actually the
    case.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181108081740.25615-1-chris@chris-wilson.co.uk
    Cc: stable@vger.kernel.org
Comment 7 Chris Wilson 2018-11-08 13:39:31 UTC
My bsw hung again,

  batch: [0x00000000_03e02000, 0x00000000_03e03000]
  BBADDR: 0x00000000_03e07001

so not the magic bullet.
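
The two lines above can be checked mechanically: the head pointer reported in BBADDR lies beyond the end of the batch buffer, which matches the "ran off the end of the batchbuffer" diagnosis in comment 1. A minimal sketch follows. The helper names `parse_addr` and `batch_overrun` are hypothetical, and clearing the two low bits of BBADDR before comparing is an assumption (they are treated here as status flags rather than address bits).

```python
def parse_addr(text: str) -> int:
    """Parse an address printed in the error-state style 0xHHHHHHHH_LLLLLLLL."""
    return int(text.replace("_", ""), 16)


def batch_overrun(batch_start: str, batch_end: str, bbaddr: str,
                  flag_mask: int = 0x3) -> int:
    """Signed distance of the head pointer from the batch range [start, end).

    Returns 0 when the head is inside the batch, a positive byte count when
    it is at or past the end, and a negative byte count when it is before
    the start.

    flag_mask is an assumption: the low bits of BBADDR are cleared before
    the comparison because they are treated as flags, not address bits.
    """
    start = parse_addr(batch_start)
    end = parse_addr(batch_end)
    head = parse_addr(bbaddr) & ~flag_mask
    if start <= head < end:
        return 0
    if head >= end:
        return head - end
    return head - start


if __name__ == "__main__":
    overrun = batch_overrun("0x00000000_03e02000",
                            "0x00000000_03e03000",
                            "0x00000000_03e07001")
    print(f"head is {overrun:#x} bytes past the end of the batch")
```

With the values from this comment the batch is a single 4 KiB page and the head sits 0x4000 bytes past its end, so the GPU was fetching from pages that were never part of the submitted batch.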
Comment 8 Chris Wilson 2018-11-09 12:28:57 UTC
(In reply to Chris Wilson from comment #6)
> Ever the optimist:
> 
> commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD ->
> drm-intel-next-queued, drm-intel/for-linux-next,
> drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Nov 8 08:17:38 2018 +0000
> 
>     drm/i915/execlists: Force write serialisation into context image vs
> execution 

And for confirmation that no, this sadly wasn't it:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5110/fi-bsw-kefka/igt@gem_ctx_create@basic-files.html
Comment 9 Martin Peres 2018-11-16 15:58:01 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4718/fi-bsw-kefka/igt@i915_selftest@live_contexts.html

<7> [597.917656] [drm:intel_gpu_reset [i915]] vecs0: timed out on STOP_RING
Comment 10 Chris Wilson 2018-11-19 16:35:53 UTC
*** Bug 108798 has been marked as a duplicate of this bug. ***
Comment 11 Chris Wilson 2018-11-19 17:00:44 UTC
*** Bug 108802 has been marked as a duplicate of this bug. ***
Comment 12 Lakshmi 2018-11-22 09:56:02 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_146/fi-bsw-kefka/igt@gem_exec_schedule@deep-bsd.html

Starting subtest: deep-bsd
(gem_exec_schedule:986) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:986) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest deep-bsd failed.
**** DEBUG ****
(gem_exec_schedule:986) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_exec_schedule:986) igt_dummyload-DEBUG: Test requirement passed: vgem_has_fences(cork->vgem.device)
(gem_exec_schedule:986) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_schedule:986) INFO: Using 2046 requests (prio range 2046)
(gem_exec_schedule:986) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_exec_schedule:986) igt_dummyload-DEBUG: Test requirement passed: vgem_has_fences(cork->vgem.device)
(gem_exec_schedule:986) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_schedule:986) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_exec_schedule:986) igt_core-INFO: Stack trace:
(gem_exec_schedule:986) igt_core-INFO:   #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_exec_schedule:986) igt_core-INFO:   #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_exec_schedule:986) igt_core-INFO:   #2 [killpg+0x40]
(gem_exec_schedule:986) igt_core-INFO:   #3 ../sysdeps/unix/syscall-template.S:78 ioctl()
(gem_exec_schedule:986) igt_core-INFO:   #4 /home/cidrm/libdrm/xf86drm.c:189 drmIoctl()
(gem_exec_schedule:986) igt_core-INFO:   #5 ../lib/ioctl_wrappers.c:589 __gem_execbuf()
(gem_exec_schedule:986) igt_core-INFO:   #6 ../lib/ioctl_wrappers.c:605 gem_execbuf()
(gem_exec_schedule:986) igt_core-INFO:   #7 ../tests/i915/gem_exec_schedule.c:850 deep()
(gem_exec_schedule:986) igt_core-INFO:   #8 ../tests/i915/gem_exec_schedule.c:1314 __real_main1194()
(gem_exec_schedule:986) igt_core-INFO:   #9 ../tests/i915/gem_exec_schedule.c:1194 main()
(gem_exec_schedule:986) igt_core-INFO:   #10 ../csu/libc-start.c:344 __libc_start_main()
(gem_exec_schedule:986) igt_core-INFO:   #11 [_start+0x2a]
****  END  ****
Subtest deep-bsd: FAIL (9.107s)
Comment 13 Lakshmi 2018-11-25 18:56:22 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5192/fi-bsw-kefka/igt@gem_ctx_create@basic-files.html

Starting subtest: basic-files
(gem_ctx_create:1105) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_ctx_create:1105) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-files failed.
**** DEBUG ****
(gem_ctx_create:1105) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_ctx_create:1105) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_ctx_create:1105) igt_core-INFO: Stack trace:
(gem_ctx_create:1105) igt_core-INFO:   #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_ctx_create:1105) igt_core-INFO:   #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_ctx_create:1105) igt_core-INFO:   #2 [killpg+0x40]
(gem_ctx_create:1105) igt_core-INFO:   #3 ../sysdeps/unix/sysv/linux/wait.c:29 wait()
(gem_ctx_create:1105) igt_core-INFO:   #4 ../lib/igt_core.c:1768 __igt_waitchildren()
(gem_ctx_create:1105) igt_core-INFO:   #5 ../lib/igt_core.c:1819 igt_waitchildren()
(gem_ctx_create:1105) igt_core-INFO:   #6 ../tests/i915/gem_ctx_create.c:102 files()
(gem_ctx_create:1105) igt_core-INFO:   #7 ../tests/i915/gem_ctx_create.c:360 __real_main311()
(gem_ctx_create:1105) igt_core-INFO:   #8 ../tests/i915/gem_ctx_create.c:311 main()
(gem_ctx_create:1105) igt_core-INFO:   #9 ../csu/libc-start.c:344 __libc_start_main()
(gem_ctx_create:1105) igt_core-INFO:   #10 [_start+0x2a]
****  END  ****
Subtest basic-files: FAIL (8.358s)
Comment 14 Lakshmi 2018-11-25 19:39:49 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_148/fi-bsw-kefka/igt@gem_exec_parallel@default-contexts.html

Starting subtest: default-contexts
(gem_exec_parallel:1795) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_parallel:1795) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest default-contexts failed.
**** DEBUG ****
(gem_exec_parallel:1795) i915/gem_context-DEBUG: Test requirement passed: gem_has_contexts(fd)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: gem_can_store_dword(fd, engine)
(gem_exec_parallel:1795) DEBUG: Test requirement passed: nengine
(gem_exec_parallel:1795) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_parallel:1795) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_exec_parallel:1795) DEBUG: Verifying result (pass=0, handle=1)
(gem_exec_parallel:1795) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_parallel:1795) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_exec_parallel:1795) igt_core-INFO: Stack trace:
(gem_exec_parallel:1795) igt_core-INFO:   #0 ../lib/igt_core.c:1467 __igt_fail_assert()
(gem_exec_parallel:1795) igt_core-INFO:   #1 ../lib/igt_aux.c:504 igt_fork_hang_detector()
(gem_exec_parallel:1795) igt_core-INFO:   #2 [killpg+0x40]
(gem_exec_parallel:1795) igt_core-INFO:   #3 ../sysdeps/unix/syscall-template.S:78 ioctl()
(gem_exec_parallel:1795) igt_core-INFO:   #4 /home/cidrm/libdrm/xf86drm.c:191 drmIoctl()
(gem_exec_parallel:1795) igt_core-INFO:   #5 ../lib/ioctl_wrappers.c:402 __gem_set_domain()
(gem_exec_parallel:1795) igt_core-INFO:   #6 ../lib/ioctl_wrappers.c:422 gem_set_domain()
(gem_exec_parallel:1795) igt_core-INFO:   #7 ../tests/i915/gem_exec_parallel.c:53 check_bo()
(gem_exec_parallel:1795) igt_core-INFO:   #8 ../tests/i915/gem_exec_parallel.c:221 all()
(gem_exec_parallel:1795) igt_core-INFO:   #9 ../tests/i915/gem_exec_parallel.c:255 __real_main228()
(gem_exec_parallel:1795) igt_core-INFO:   #10 ../tests/i915/gem_exec_parallel.c:228 main()
(gem_exec_parallel:1795) igt_core-INFO:   #11 ../csu/libc-start.c:344 __libc_start_main()
(gem_exec_parallel:1795) igt_core-INFO:   #12 [_start+0x2a]
****  END  ****
Subtest default-contexts: FAIL (9.456s)
Comment 15 Martin Peres 2018-11-28 15:01:47 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4734/fi-bsw-kefka/igt@i915_selftest@live_hangcheck.html

<3> [565.196923] [drm:gen8_reset_engines [i915]] *ERROR* vecs0: reset request timeout
<3> [565.200917] [drm:gen8_reset_engines [i915]] *ERROR* vecs0: reset request timeout
<3> [565.213506] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
Comment 16 Chris Wilson 2018-12-07 10:28:16 UTC
Step 1:

commit 490b8c65b9db45896769e1095e78725775f47b3e (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Dec 6 08:44:31 2018 +0000

    drm/i915/execlists: Apply a full mb before execution for Braswell
    
    Braswell is really picky about having our writes posted to memory before
    we execute or else the GPU may see stale values. A wmb() is insufficient
    as it only ensures the writes are visible to other cores, we need a full
    mb() to ensure the writes are in memory and visible to the GPU.
    
    The most frequent failure in flushing before execution is that we see
    stale PTE values and execute the wrong pages.
    
    References: 987abd5c62f9 ("drm/i915/execlists: Force write serialisation into context image vs execution")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Comment 17 Chris Wilson 2018-12-07 13:31:27 UTC
Step 2:

commit e8894267cc3325901073e8adf0a63e2dc53b6242 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Dec 7 09:02:13 2018 +0000

    drm/i915: Pipeline PDP updates for Braswell
    
    Currently we face a severe problem on Braswell that manifests as invalid
    ppGTT accesses. The code tries to maintain the PDP (page directory
    pointers) inside the context in two ways, direct write into the context
    and a pipelined LRI update. The direct write into the context is
    fundamentally racy as it is unserialised with any access (read or write)
    the GPU is doing. By asserting that Braswell is not used with vGPU
    (currently an unsupported platform) we can eliminate the dangerous
    direct write into the context image and solely use the pipelined update.
    
    However, the LRI of the PDP fouls up the GPU, causing it to freeze and
    take out the machine with "forcewake ack timeouts". This seems possible
    to workaround by preventing the GPU from sleeping (via means of
    disabling the power-state management interface, i.e. forcing each ring
    to remain awake) around the update. Equally, it seems an EMIT_INVALIDATE
    before the LRI is sufficient to prevent the forcewake errors.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108714
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181207090213.14352-3-chris@chris-wilson.co.uk


Step 3: Hope for the best.
Comment 18 Chris Wilson 2018-12-07 13:32:42 UTC
Note that the root cause for this bug is believed to be invalid TLB / vm.

Something like https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5282/fi-bsw-cyan/igt@gem_ctx_create@basic-files.html has different symptoms (the GPU just froze). It may be worthwhile to treat that as a different bug if it persists.
Comment 19 Chris Wilson 2018-12-08 10:11:04 UTC
(In reply to Chris Wilson from comment #17)
> Step 3: Hope for the best.

A step too far.
Comment 20 CI Bug Log 2018-12-31 12:07:48 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BSW: random tests - fail | dmesg-fail - Failed assertion: !"GPU hung" -}
{+ BSW: random tests - fail | dmesg-fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_159/fi-bsw-n3050/igt@gem_exec_parallel@render-contexts.html
Comment 21 CI Bug Log 2018-12-31 14:43:05 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BSW: random tests - fail | dmesg-fail - Failed assertion: !"GPU hung" -}
{+ BSW: random tests - fail | dmesg-fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_173/fi-bsw-kefka/igt@gem_exec_parallel@blt-fds.html
* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_173/fi-bsw-n3050/igt@gem_exec_schedule@deep-render.html
Comment 22 Chris Wilson 2019-01-15 17:48:05 UTC
A new hope:

commit 8cd999181f8c744c87fb64e7b3600876ec3428b2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 14 21:17:27 2019 +0000

    drm/i915: Prevent concurrent GGTT update and use on Braswell (again)
    
    On Braswell, under heavy stress, if we update the GGTT while
    simultaneously accessing another region inside the GTT, we are returned
    the wrong values. To prevent this we stop the machine to update the GGTT
    entries so that no memory traffic can occur at the same time.
    
    This was first spotted in
    
    commit 5bab6f60cb4d1417ad7c599166bcfec87529c1a2
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Fri Oct 23 18:43:32 2015 +0100
    
        drm/i915: Serialise updates to GGTT with access through GGTT on Braswell
    
    but removed again in forlorn hope with
    
    commit 4509276ee824bb967885c095c610767e42345c36
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Mon Feb 20 12:47:18 2017 +0000
    
        drm/i915: Remove Braswell GGTT update w/a
    
    However, gem_concurrent_blit is once again only stable with the patch
    applied and CI is detecting the odd failure in forked gem_mmap_gtt tests
    (which smell like the same issue). Fwiw, a wide variety of CPU memory
    barriers (around GGTT flushing, fence updates, PTE updates) and GPU
    flushes/invalidates (between requests, after PTE updates) were tried as
    part of the investigation to find an alternate cause, nothing comes
    close to serialised GGTT updates.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105591
    Testcase: igt/gem_concurrent_blit
    Testcase: igt/gem_mmap_gtt/*forked*
    References: 5bab6f60cb4d ("drm/i915: Serialise updates to GGTT with access through GGTT on Braswell")
    References: 4509276ee824 ("drm/i915: Remove Braswell GGTT update w/a")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190114211729.30352-1-chris@chris-wilson.co.uk
Comment 23 Lakshmi 2019-02-19 08:40:35 UTC
This issue was last seen in drmtip_173 (2 months / 1083 runs ago) and has not been seen since the fix. Closing the bug.
Comment 24 CI Bug Log 2019-02-19 08:47:35 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

