Bug 107207

Summary: [BAT] igt@drv_selftest@live_hangcheck - incomplete - kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1040
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: GLK, KBL
i915 features: GEM/execlists

Description Martin Peres 2018-07-12 13:21:20 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4460/fi-kbl-7500u/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4456/fi-kbl-7560u/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4465/fi-kbl-7567u/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4472/fi-glk-dsi/igt@drv_selftest@live_hangcheck.html

<4>[  639.552375] ------------[ cut here ]------------
<2>[  639.552379] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1040!
<4>[  639.552419] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4>[  639.552434] CPU: 3 PID: 31 Comm: ksoftirqd/3 Tainted: G     U            4.18.0-rc4-CI-CI_DRM_4472+ #1
<4>[  639.552454] Hardware name: Intel Corp. Geminilake/GLK RVP2 LP4SD (07), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017
<4>[  639.552600] RIP: 0010:process_csb+0x4b3/0x770 [i915]
<4>[  639.552614] Code: bc 09 c5 e0 48 8b 35 4c b8 19 00 49 c7 c0 00 c8 5a a0 b9 10 04 00 00 48 c7 c2 50 4f 57 a0 48 c7 c7 de 99 4a a0 e8 8d 9a cb e0 <0f> 0b 48 8b 75 d0 4c 8d a6 30 16 00 00 4c 89 e7 e8 28 48 49 e1 48 
<4>[  639.552739] RSP: 0018:ffffc9000016bd38 EFLAGS: 00010082
<4>[  639.552752] RAX: 000000000000000d RBX: ffff88016f26c2a8 RCX: 0000000000000000
<4>[  639.552768] RDX: 0000000000000000 RSI: 000000000000004c RDI: 0000000000000000
<4>[  639.552782] RBP: ffffc9000016bda0 R08: ffffffffa05ac800 R09: 0000000000000001
<4>[  639.552798] R10: ffffc9000016bd90 R11: 0000000000000000 R12: ffff88016e72605c
<4>[  639.552813] R13: 0000000000000003 R14: ffff88016e726058 R15: ffff88016e726040
<4>[  639.552829] FS:  0000000000000000(0000) GS:ffff88017fd80000(0000) knlGS:0000000000000000
<4>[  639.552846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  639.552860] CR2: 000055abb55db5e8 CR3: 000000016943c000 CR4: 0000000000340ee0
<4>[  639.552875] Call Trace:
<4>[  639.552968]  __execlists_submission_tasklet+0x32/0xc00 [i915]
<4>[  639.553066]  execlists_submission_tasklet+0x55/0x70 [i915]
<4>[  639.553088]  tasklet_action_common.isra.5+0x47/0xb0
<4>[  639.553102]  ? smpboot_thread_fn+0x6b/0x280
<4>[  639.553117]  __do_softirq+0xd9/0x505
<4>[  639.553129]  ? smpboot_thread_fn+0x23/0x280
<4>[  639.553142]  ? smpboot_thread_fn+0x6b/0x280
<4>[  639.553153]  run_ksoftirqd+0x29/0x50
<4>[  639.553164]  smpboot_thread_fn+0x1d3/0x280
<4>[  639.553176]  ? sort_range+0x20/0x20
<4>[  639.553187]  kthread+0x119/0x130
<4>[  639.553199]  ? kthread_flush_work_fn+0x10/0x10
<4>[  639.553213]  ret_from_fork+0x3a/0x50
Comment 1 Chris Wilson 2018-07-12 13:35:20 UTC
I think this is the same bug as 106560 but with different symptoms. Should all be cleared up real soon now (tm).
Comment 2 Chris Wilson 2018-08-15 09:21:04 UTC
Memory is hazy, but I do think we closed this BUG loophole. Hmm, iirc, it was a double wedge. Ok, not quite closed yet as I still have a patch to prevent double wedges.
Comment 3 Chris Wilson 2018-08-15 09:39:43 UTC
Hmm, we have

commit 3970c65c2b47c450f917bc8a29c5849563a95dfe
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 23 15:53:35 2018 +0100

    drm/i915: Skip repeated calls to i915_gem_set_wedged()
    
    If we already wedged, i915_gem_set_wedged() becomes a complicated no-op.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=107343
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180723145335.24579-1-chris@chris-wilson.co.uk

+

commit f1a498fa549e8e86895cda37e3fca867aae955b7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 16 09:03:30 2018 +0100

    drm/i915/execlists: Disable submission tasklet upon wedging
    
    If we declare the driver wedged before the GPU truly is, then we may see
    the GPU complete some CS events following our cancellation. This leaves
    us quite confused as we deleted all the bookkeeping and thus complain
    about the inconsistent state.
    
    We can just ignore the remaining events and let the GPU idle by not
    feeding it, and so avoid trying to racily overwrite shared state. We
    rely on there being a full GPU reset before unwedging, giving us the
    opportunity to reset the shared state.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180716080332.32283-4-chris@chris-wilson.co.uk

I think that accounts for it.
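
For readers unfamiliar with the driver, here is a rough user-space sketch of the two ideas described in the commit messages above: making a repeated wedge a no-op, and ignoring completion events that arrive after wedging instead of checking them against bookkeeping that has already been thrown away. This is a toy model only, not the i915 code; `struct toy_engine` and every `toy_*` function are hypothetical names invented for illustration, and the `assert` merely stands in for the kind of consistency check that fired in the log above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in i915. */
struct toy_engine {
    bool wedged;        /* set once the driver has given up on the GPU */
    int inflight;       /* requests we believe the GPU is still executing */
    int wedge_count;    /* how many times the full wedge path actually ran */
    int ignored_events; /* completion events dropped after wedging */
};

/* Returns true only if this call actually performed the wedge. */
static bool toy_set_wedged(struct toy_engine *e)
{
    if (e->wedged)
        return false;   /* already wedged: skip the repeated call */

    e->wedged = true;
    e->inflight = 0;    /* cancel everything: the bookkeeping is discarded */
    e->wedge_count++;
    return true;
}

/* Handle one "request completed" event from the (simulated) GPU. */
static void toy_process_event(struct toy_engine *e)
{
    if (e->wedged) {
        /* Late event after cancellation: ignore it rather than complain. */
        e->ignored_events++;
        return;
    }

    /* Stand-in for a driver consistency check on tracked requests. */
    assert(e->inflight > 0);
    e->inflight--;
}

/* Assumption: a full reset is the only way back from the wedged state. */
static void toy_reset(struct toy_engine *e)
{
    e->wedged = false;
    e->inflight = 0;
}
```

In this sketch, calling `toy_set_wedged()` twice runs the wedge path once, and an event delivered afterwards is counted in `ignored_events` instead of reaching the assertion with `inflight` already zero.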
Comment 4 Lakshmi 2018-08-24 06:52:13 UTC
Closed as this was last seen 1 month ago.
Comment 5 Lakshmi 2018-08-28 06:26:13 UTC
This bug used to occur every 2-23 rounds. It has not been seen in the last 221 rounds. Closing this issue.
