https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5219/fi-icl-u2/igt@i915_selftest@live_execlists.html

<2> [696.738175] kernel BUG at drivers/gpu/drm/i915/gt/intel_lrc.c:2778!
<4> [696.738187] CPU: 4 PID: 5787 Comm: i915_selftest Tainted: G U 5.4.0-rc2-CI-CI_DRM_7040+ #1
<4> [696.738445] __intel_gt_set_wedged.part.18+0xb2/0x180 [i915]
<4> [696.738450] ? __drm_printfn_info+0x20/0x20
<4> [696.738493] intel_gt_set_wedged+0x60/0x70 [i915]
<4> [696.738493] __i915_subtests+0xb8/0x210 [i915]
<4> [696.738493] ? __i915_nop_teardown+0x10/0x10 [i915]
<4> [696.738493] ? __i915_live_setup+0x10/0x10 [i915]
<4> [696.738493] i915_live_selftests+0x2c/0x60 [i915]
<4> [696.738493] i915_pci_probe+0x93/0x1b0 [i915]
<4> [696.738493] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec e1000e snd_hwdep ghash_clmulni_intel cdc_ether snd_hda_core usbnet mii snd_pcm ptp pps_core mei_me thunderbolt mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915]
<3> [697.118367] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 5787, name: i915_selftest
<4> [697.118421] CPU: 4 PID: 5787 Comm: i915_selftest Tainted: G UD 5.4.0-rc2-CI-CI_DRM_7040+ #1
The CI Bug Log issue associated to this bug has been updated.

### New filters associated
* ICL: igt@i915_selftest@live_execlists - incomplete - GEM_BUG_ON(i915_active_is_idle(&ce->active))

(No new failures associated)
This is bizarre; it looks quite straightforward but then unravels quickly as you start pulling on threads.

<7> [696.700847] __intel_gt_set_wedged vcs0
<7> [696.700851] __intel_gt_set_wedged 	Awake? 1
<7> [696.700854] __intel_gt_set_wedged 	Hangcheck: 5864 ms ago
<7> [696.700856] __intel_gt_set_wedged 	Reset count: 0 (global 0)
<7> [696.700859] __intel_gt_set_wedged 	Requests:
<7> [696.702510] __intel_gt_set_wedged 	MMIO base: 0x001c0000
<7> [696.703362] __intel_gt_set_wedged 	RING_START: 0x0000a000
<7> [696.704157] __intel_gt_set_wedged 	RING_HEAD: 0x00002038
<7> [696.704184] __intel_gt_set_wedged 	RING_TAIL: 0x00002038
<7> [696.704223] __intel_gt_set_wedged 	RING_CTL: 0x00003401 [waiting]
<7> [696.705925] __intel_gt_set_wedged 	RING_MODE: 0x00000200 [idle]
<7> [696.706787] __intel_gt_set_wedged 	RING_IMR: 00000000
<7> [696.709285] __intel_gt_set_wedged 	ACTHD: 0x00000000_00002038
<7> [696.711045] __intel_gt_set_wedged 	BBADDR: 0x00000000_00000000
<7> [696.711915] __intel_gt_set_wedged 	DMA_FADDR: 0x00000000_0000c038
<7> [696.712812] __intel_gt_set_wedged 	IPEIR: 0x00000000
<7> [696.713579] __intel_gt_set_wedged 	IPEHR: 0x0e40c002
<7> [696.714443] __intel_gt_set_wedged 	Execlist status: 0x00002098 20000040, entries 12
<7> [696.714446] __intel_gt_set_wedged 	Execlist CSB read 6, write 7, tasklet queued? no (enabled)
<7> [696.714449] __intel_gt_set_wedged 	Execlist CSB[7]: 0x00000002, context: 536870944
<7> [696.714472] __intel_gt_set_wedged 		Active[0]: ring:{start:00006000, hwsp:ffff9140, seqno:00000001}, rq: 1b146:2* prio=3 @ 8240ms: [i915]
<7> [696.714487] __intel_gt_set_wedged 		Pending[0] ring:{start:0000a000, hwsp:ffff9180, seqno:00000002}, rq: 1b147:2!+ prio=4097 @ 8240ms: signaled
<7> [696.714492] __intel_gt_set_wedged 		Pending[1] ring:{start:00006000, hwsp:ffff9140, seqno:00000001}, rq: 1b146:4- prio=3 @ 8240ms: [i915]
<7> [696.714509] __intel_gt_set_wedged 		E 1b146:2* prio=3 @ 8240ms: [i915]
<7> [696.714512] __intel_gt_set_wedged 		E 1b146:4- prio=3 @ 8240ms: [i915]
<7> [696.714515] __intel_gt_set_wedged 	Queue priority hint: 3

<0> [696.673408] i915_sel-5787 5.... 740573711us : __intel_context_do_pin: vcs0 context:1b146 pin ring:{head:0000, tail:0000}
<0> [696.673408] i915_sel-5787 5.... 740574064us : __intel_context_do_pin: vcs0 context:1b147 pin ring:{head:0000, tail:0000}
<0> [696.673408] i915_sel-5787 5.... 740574078us : __engine_unpark: vcs0
<0> [696.673408] i915_sel-5787 5.... 740574084us : __gt_unpark:
<0> [696.673408] i915_sel-5787 5.... 740574655us : __i915_request_commit: vcs0 fence 1b146:2
<0> [696.673408] i915_sel-5787 5d..1 740574662us : __i915_request_submit: vcs0 fence 1b146:2, current 0
<0> [696.673408] i915_sel-5787 5d..1 740574663us : __execlists_submission_tasklet: vcs0: queue_priority_hint:-2147483648, submit:yes
<0> [696.673408] i915_sel-5787 5d..1 740574665us : trace_ports: vcs0: submit { 1b146:2, 0:0 }
<0> [696.673408] i915_sel-5787 5.... 740574723us : __i915_request_commit: vcs0 fence 1b147:2
<0> [696.673408] i915_sel-5787 5.... 740574754us : __i915_request_commit: vcs0 fence 1b146:4
<0> [696.673408] <idle>-0 2..s1 740574757us : process_csb: vcs0 cs-irq head=5, tail=6
<0> [696.673408] <idle>-0 2..s1 740574758us : process_csb: vcs0 csb[6]: status=0x00000001:0x20000000
<0> [696.673408] <idle>-0 2..s1 740574760us : trace_ports: vcs0: promote { 1b146:2*, 0:0 }
<0> [696.673408] <idle>-0 2d.s2 740574784us : __execlists_submission_tasklet: vcs0: preempting last=1b146:2, prio=3, hint=4097
<0> [696.673408] <idle>-0 2d.s2 740574786us : __i915_request_unsubmit: vcs0 fence 1b146:2, current 1
<0> [696.673408] <idle>-0 2d.s2 740574788us : __i915_request_submit: vcs0 fence 1b147:2, current 0
<0> [696.673408] <idle>-0 2d.s2 740574798us : __i915_request_submit: vcs0 fence 1b146:2, current 1
<0> [696.673408] <idle>-0 2d.s2 740574800us : __i915_request_submit: vcs0 fence 1b146:4, current 1
<0> [696.673408] <idle>-0 2d.s2 740574801us : __execlists_submission_tasklet: vcs0: queue_priority_hint:-2147483648, submit:yes
<0> [696.673408] <idle>-0 2d.s2 740574802us : trace_ports: vcs0: submit { 1b147:2, 1b146:4 }
<0> [696.673408] i915_sel-5787 5.... 740574910us : i915_request_retire: vcs0 fence 1b147:2, current 2
<0> [696.673408] i915_sel-5787 5.... 740574912us : intel_context_unpin: vcs0 context:1b147 retire
<0> [696.673408] i915_sel-5787 5.... 740574916us : __intel_context_retire: vcs0 context:1b147 retire

So the HW froze, there is a CS event in the queue, but we never saw the interrupt. (So the HW died? We just missed an interrupt? The latter is nice and scary.)

During reset, the context idled. Which also shouldn't have happened -- I think the engine parked, but we are after the set-wedged (and after the GEM_TRACE) so it is retired immediately. Hmm. Seems possible.
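For anyone reading along, a minimal sketch of why "Execlist CSB read 6, write 7, tasklet queued? no" implies a missed interrupt. This is not the driver source; all types and names here (csb_sketch, handle_event, CSB_ENTRIES) are hypothetical, it only illustrates the read/write pointer handshake:

	/*
	 * The HW advances the CSB write pointer for every context-switch
	 * event; the driver only consumes entries when the CS interrupt
	 * schedules the submission tasklet, which walks read..write.
	 * No interrupt -> no tasklet -> the event sits unprocessed, which
	 * is exactly what the engine dump above shows.
	 */
	#define CSB_ENTRIES 12			/* matches "entries 12" in the dump */

	struct csb_sketch {
		unsigned int read;		/* last entry the driver consumed */
		unsigned int write;		/* latest entry written by the HW */
		unsigned int status[CSB_ENTRIES];
	};

	static void process_csb_sketch(struct csb_sketch *csb,
				       void (*handle_event)(unsigned int status))
	{
		/* Normally run from the tasklet after a CS interrupt. */
		while (csb->read != csb->write) {
			csb->read = (csb->read + 1) % CSB_ENTRIES;
			handle_event(csb->status[csb->read]);	/* promote/complete ports */
		}
	}

With read=6 and write=7 at wedge time and the tasklet not queued, entry 7 was never consumed, which matches the "CS event in the queue but no interrupt" reading above.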
I don't see any record of this in CI buglog (the entry says 0 occurrences) but given that it happened in BAT we need to keep an eye on it.
(In reply to Chris Wilson from comment #2)
> During reset, the context idled. Which also shouldn't have happened -- I
> think the engine parked, but we are after the set-wedged (and after the
> GEM_TRACE) so it is retired immediately. Hmm. Seems possible.

intel_gt_set_wedged() takes the intel_engine_pm before marking it as wedged, so we should not be able to discard the idle barrier so easily. Except for switch_to_kernel_context:

	rq = __i915_request_create(engine->kernel_context, GFP_NOWAIT);
	if (IS_ERR(rq))
		/* Context switch failed, hope for the best! Maybe reset? */
		goto out_unlock;

Otherwise there should have been an idle request in flight keeping the retired context active.

It might be interesting to stick an intel_engine_pm_is_awake() check into the reset code, but first I need to think about how that upsets the replay, or not.
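As a rough sketch of the suggested check (intel_engine_pm_is_awake() and GEM_BUG_ON do exist in the driver, but the helper below, its name and its placement in the reset path are only my assumptions, not a patch from this bug):

	/*
	 * Illustrative only: assert early in the reset path that the engine
	 * pm bookkeeping still thinks we are awake. If this fired while
	 * requests remained to be unwound, the idle barrier was dropped too
	 * early -- the inconsistency suspected in this comment.
	 */
	static void assert_awake_before_reset(struct intel_engine_cs *engine)
	{
		GEM_BUG_ON(!intel_engine_pm_is_awake(engine));
	}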
I hope that

commit bd9bec5b6a09a3a7656f096e3ff0ca6709f89770
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Oct 10 09:32:42 2019 +0100

    drm/i915/execlists: Mark up expected state during reset

    Move the BUG_ON around slightly and add some explanations for each to
    try and capture the expected state more carefully. We want to compare
    the expected active state of our bookkeeping as compared to the tracked
    HW state.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=111937
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191010083242.1387-1-chris@chris-wilson.co.uk

makes the BUG_ON go away. Doesn't answer how we got into that state, but maybe that will go away all by itself.
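For illustration only (this is not the commit's diff), the kind of check the message describes -- comparing our bookkeeping against the context the HW was tracked as running -- might look roughly like:

	/*
	 * Hypothetical sketch: during reset, only the context the HW was
	 * actually executing is expected to still be active in our
	 * bookkeeping; if it has already idled, the two views disagree and
	 * the assertion fires.
	 */
	static void check_context_still_active(struct intel_context *ce)
	{
		GEM_BUG_ON(i915_active_is_idle(&ce->active));
	}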