Bug 111937

Summary: [CI][BAT] igt@i915_selftest@live_execlists - incomplete - GEM_BUG_ON(i915_active_is_idle(&ce->active))
Product: DRI Reporter: Lakshmi <lakshminarayana.vudum>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED WORKSFORME QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: ICL i915 features: GEM/Other

Description Lakshmi 2019-10-09 14:17:00 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5219/fi-icl-u2/igt@i915_selftest@live_execlists.html
<2> [696.738175] kernel BUG at drivers/gpu/drm/i915/gt/intel_lrc.c:2778!
<4> [696.738187] CPU: 4 PID: 5787 Comm: i915_selftest Tainted: G     U            5.4.0-rc2-CI-CI_DRM_7040+ #1
<4> [696.738445]  __intel_gt_set_wedged.part.18+0xb2/0x180 [i915]
<4> [696.738450]  ? __drm_printfn_info+0x20/0x20
<4> [696.738493]  intel_gt_set_wedged+0x60/0x70 [i915]
<4> [696.738493]  __i915_subtests+0xb8/0x210 [i915]
<4> [696.738493]  ? __i915_nop_teardown+0x10/0x10 [i915]
<4> [696.738493]  ? __i915_live_setup+0x10/0x10 [i915]
<4> [696.738493]  i915_live_selftests+0x2c/0x60 [i915]
<4> [696.738493]  i915_pci_probe+0x93/0x1b0 [i915]
<4> [696.738493] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec e1000e snd_hwdep ghash_clmulni_intel cdc_ether snd_hda_core usbnet mii snd_pcm ptp pps_core mei_me thunderbolt mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915]
<3> [697.118367] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 5787, name: i915_selftest
<4> [697.118421] CPU: 4 PID: 5787 Comm: i915_selftest Tainted: G     UD           5.4.0-rc2-CI-CI_DRM_7040+ #1
Comment 1 CI Bug Log 2019-10-09 14:17:45 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@i915_selftest@live_execlists - incomplete - GEM_BUG_ON(i915_active_is_idle(&amp;ce-&gt;active))
  (No new failures associated)
Comment 2 Chris Wilson 2019-10-09 14:32:04 UTC
This is bizarre; it looks quite straightforward but then unravels quickly as you start pulling on threads.

<7> [696.700847] __intel_gt_set_wedged vcs0
<7> [696.700851] __intel_gt_set_wedged 	Awake? 1
<7> [696.700854] __intel_gt_set_wedged 	Hangcheck: 5864 ms ago
<7> [696.700856] __intel_gt_set_wedged 	Reset count: 0 (global 0)
<7> [696.700859] __intel_gt_set_wedged 	Requests:
<7> [696.702510] __intel_gt_set_wedged 	MMIO base:  0x001c0000
<7> [696.703362] __intel_gt_set_wedged 	RING_START: 0x0000a000
<7> [696.704157] __intel_gt_set_wedged 	RING_HEAD:  0x00002038
<7> [696.704184] __intel_gt_set_wedged 	RING_TAIL:  0x00002038
<7> [696.704223] __intel_gt_set_wedged 	RING_CTL:   0x00003401 [waiting]
<7> [696.705925] __intel_gt_set_wedged 	RING_MODE:  0x00000200 [idle]
<7> [696.706787] __intel_gt_set_wedged 	RING_IMR: 00000000
<7> [696.709285] __intel_gt_set_wedged 	ACTHD:  0x00000000_00002038
<7> [696.711045] __intel_gt_set_wedged 	BBADDR: 0x00000000_00000000
<7> [696.711915] __intel_gt_set_wedged 	DMA_FADDR: 0x00000000_0000c038
<7> [696.712812] __intel_gt_set_wedged 	IPEIR: 0x00000000
<7> [696.713579] __intel_gt_set_wedged 	IPEHR: 0x0e40c002
<7> [696.714443] __intel_gt_set_wedged 	Execlist status: 0x00002098 20000040, entries 12
<7> [696.714446] __intel_gt_set_wedged 	Execlist CSB read 6, write 7, tasklet queued? no (enabled)
<7> [696.714449] __intel_gt_set_wedged 	Execlist CSB[7]: 0x00000002, context: 536870944
<7> [696.714472] __intel_gt_set_wedged 		Active[0]: ring:{start:00006000, hwsp:ffff9140, seqno:00000001}, rq:  1b146:2*  prio=3 @ 8240ms: [i915]
<7> [696.714487] __intel_gt_set_wedged 		Pending[0] ring:{start:0000a000, hwsp:ffff9180, seqno:00000002}, rq:  1b147:2!+  prio=4097 @ 8240ms: signaled
<7> [696.714492] __intel_gt_set_wedged 		Pending[1] ring:{start:00006000, hwsp:ffff9140, seqno:00000001}, rq:  1b146:4-  prio=3 @ 8240ms: [i915]
<7> [696.714509] __intel_gt_set_wedged 		E  1b146:2*  prio=3 @ 8240ms: [i915]
<7> [696.714512] __intel_gt_set_wedged 		E  1b146:4-  prio=3 @ 8240ms: [i915]
<7> [696.714515] __intel_gt_set_wedged 		Queue priority hint: 3

<0> [696.673408] i915_sel-5787    5.... 740573711us : __intel_context_do_pin: vcs0 context:1b146 pin ring:{head:0000, tail:0000}
<0> [696.673408] i915_sel-5787    5.... 740574064us : __intel_context_do_pin: vcs0 context:1b147 pin ring:{head:0000, tail:0000}
<0> [696.673408] i915_sel-5787    5.... 740574078us : __engine_unpark: vcs0
<0> [696.673408] i915_sel-5787    5.... 740574084us : __gt_unpark: 
<0> [696.673408] i915_sel-5787    5.... 740574655us : __i915_request_commit: vcs0 fence 1b146:2
<0> [696.673408] i915_sel-5787    5d..1 740574662us : __i915_request_submit: vcs0 fence 1b146:2, current 0
<0> [696.673408] i915_sel-5787    5d..1 740574663us : __execlists_submission_tasklet: vcs0: queue_priority_hint:-2147483648, submit:yes
<0> [696.673408] i915_sel-5787    5d..1 740574665us : trace_ports: vcs0: submit { 1b146:2, 0:0 }
<0> [696.673408] i915_sel-5787    5.... 740574723us : __i915_request_commit: vcs0 fence 1b147:2
<0> [696.673408] i915_sel-5787    5.... 740574754us : __i915_request_commit: vcs0 fence 1b146:4
<0> [696.673408]   <idle>-0       2..s1 740574757us : process_csb: vcs0 cs-irq head=5, tail=6
<0> [696.673408]   <idle>-0       2..s1 740574758us : process_csb: vcs0 csb[6]: status=0x00000001:0x20000000
<0> [696.673408]   <idle>-0       2..s1 740574760us : trace_ports: vcs0: promote { 1b146:2*, 0:0 }
<0> [696.673408]   <idle>-0       2d.s2 740574784us : __execlists_submission_tasklet: vcs0: preempting last=1b146:2, prio=3, hint=4097
<0> [696.673408]   <idle>-0       2d.s2 740574786us : __i915_request_unsubmit: vcs0 fence 1b146:2, current 1
<0> [696.673408]   <idle>-0       2d.s2 740574788us : __i915_request_submit: vcs0 fence 1b147:2, current 0
<0> [696.673408]   <idle>-0       2d.s2 740574798us : __i915_request_submit: vcs0 fence 1b146:2, current 1
<0> [696.673408]   <idle>-0       2d.s2 740574800us : __i915_request_submit: vcs0 fence 1b146:4, current 1
<0> [696.673408]   <idle>-0       2d.s2 740574801us : __execlists_submission_tasklet: vcs0: queue_priority_hint:-2147483648, submit:yes
<0> [696.673408]   <idle>-0       2d.s2 740574802us : trace_ports: vcs0: submit { 1b147:2, 1b146:4 }
<0> [696.673408] i915_sel-5787    5.... 740574910us : i915_request_retire: vcs0 fence 1b147:2, current 2
<0> [696.673408] i915_sel-5787    5.... 740574912us : intel_context_unpin: vcs0 context:1b147 retire
<0> [696.673408] i915_sel-5787    5.... 740574916us : __intel_context_retire: vcs0 context:1b147 retire

So the HW froze, there is a CS event in the queue, but we never saw the interrupt. (So the HW died? We just missed an interrupt? The latter is nice and scary.)

During reset, the context idled. Which also shouldn't have happened -- I think the engine parked, but we are after the set-wedged (and after the GEM_TRACE) so it is retired immediately. Hmm. Seems possible.
Comment 3 Francesco Balestrieri 2019-10-10 06:00:38 UTC
I don't see any record of this in CI buglog (the entry says 0 occurrences) but given that it happened in BAT we need to keep an eye on it.
Comment 4 Chris Wilson 2019-10-10 08:19:22 UTC
(In reply to Chris Wilson from comment #2)
> During reset, the context idled. Which also shouldn't have happened -- I
> think the engine parked, but we are after the set-wedged (and after the
> GEM_TRACE) so it is retired immediately. Hmm. Seems possible.

intel_gt_set_wedged() takes the intel_engine_pm before marking it as wedged, so we should not be able to discard the idle barrier so easily. Except for

switch_to_kernel_context:
        rq = __i915_request_create(engine->kernel_context, GFP_NOWAIT);
        if (IS_ERR(rq))
                /* Context switch failed, hope for the best! Maybe reset? */
                goto out_unlock;

Otherwise there should have been an idle request in flight keeping the retired context active.

It might interesting to stick an intel_engine_pm_is_awake() into the reset code, but first need to think how that upsets the replay or note.
Comment 5 Chris Wilson 2019-10-10 13:16:53 UTC
I hope that

commit bd9bec5b6a09a3a7656f096e3ff0ca6709f89770
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Oct 10 09:32:42 2019 +0100

    drm/i915/execlists: Mark up expected state during reset
    
    Move the BUG_ON around slightly and add some explanations for each to
    try and capture the expected state more carefully. We want to compare
    the expected active state of our bookkeeping as compared to the tracked
    HW state.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=111937
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191010083242.1387-1-chris@chris-wilson.co.uk

makes the BUG_ON go away. Doesn't answer how we got into that state, but maybe that will go away all by itself.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.