Bug 111671

Summary: [CI][RESUME] igt@gem_exec_schedule@deep-* - incomplete - NMI watchdog
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/Intel Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: not set    
Priority: high CC: intel-gfx-bugs
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: TGL i915 features: GEM/Other

Description Martin Peres 2019-09-12 10:24:59 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_363/fi-tgl-u/igt@gem_exec_schedule@deep-render.html

<0>[   84.718343] ---------------------------------
<0>[   84.718704] Kernel Offset: 0x16000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
<4>[   84.718704] CPU: 3 PID: 1009 Comm: gem_exec_schedu Tainted: G     U            5.3.0-rc7-ga1769d05ffa7-drmtip_363+ #1
<4>[   84.718704] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.2321.A01.1908052106 08/05/2019
<4>[   84.718705] Call Trace:
<4>[   84.718705]  <NMI>
<4>[   84.718705]  dump_stack+0x67/0x9b
<4>[   84.718705]  panic+0x12b/0x2cd
<4>[   84.718705]  nmi_panic+0x30/0x30
<4>[   84.718706]  watchdog_overflow_callback+0x167/0x1a0
<4>[   84.718706]  __perf_event_overflow+0x52/0xf0
<4>[   84.718706]  handle_pmi_common+0x1e6/0x290
<4>[   84.718706]  ? intel_pmu_handle_irq+0xcd/0x1b0
<4>[   84.718706]  intel_pmu_handle_irq+0xcd/0x1b0
<4>[   84.718706]  perf_event_nmi_handler+0x28/0x40
<4>[   84.718707]  nmi_handle+0xce/0x2b0
<4>[   84.718707]  default_do_nmi+0x72/0x170
<4>[   84.718707]  do_nmi+0x11b/0x170
<4>[   84.718707]  end_repeat_nmi+0x16/0x50
<4>[   84.718707] RIP: 0010:__i915_schedule+0x23a/0x9d0 [i915]
<4>[   84.718708] Code: e8 4b 23 f7 d6 48 c7 c2 10 32 1d c0 be 01 00 00 00 48 c7 c7 e0 b4 24 98 e8 d3 88 f4 d6 45 39 fe 0f 89 48 02 00 00 49 8b 04 24 <49> 39 c4 48 8d 50 f0 0f 84 37 02 00 00 4c 39 ea 0f 84 46 05 00 00
<4>[   84.718708] RSP: 0018:ffffbbaa804e3938 EFLAGS: 00000097
<4>[   84.718708] RAX: ffff9de2a8e0d050 RBX: ffffbbaa804e3960 RCX: 00000000a751a396
<4>[   84.718709] RDX: ffff9de2d55908c8 RSI: 00000000d248aed4 RDI: 00000000ffffffff
<4>[   84.718709] RBP: ffffbbaa804e39e8 R08: ffff9de2d5590918 R09: 00000000fffffffe
<4>[   84.718709] R10: 00000000277b8e45 R11: 0000000063404fd3 R12: ffff9de2a8585770
<4>[   84.718709] R13: ffff9de2a8e0ee40 R14: 0000000000000003 R15: 0000000000000004
<4>[   84.718710]  ? __i915_schedule+0x23a/0x9d0 [i915]
<4>[   84.718710]  ? __i915_schedule+0x23a/0x9d0 [i915]
<4>[   84.718710]  </NMI>
<4>[   84.718710]  ? i915_schedule+0x14/0x40 [i915]
<4>[   84.718710]  i915_schedule+0x23/0x40 [i915]
<4>[   84.718711]  __i915_request_queue+0x37/0x50 [i915]
<4>[   84.718711]  i915_request_add+0xc3/0x330 [i915]
<4>[   84.718711]  i915_gem_do_execbuffer+0x10d2/0x22f0 [i915]
<4>[   84.718711]  ? stack_trace_save+0x46/0x70
<4>[   84.718711]  ? init_object+0x66/0x80
<4>[   84.718712]  ? __lock_acquire+0x4ac/0x1e90
<4>[   84.718712]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4>[   84.718712]  i915_gem_execbuffer2_ioctl+0x11b/0x460 [i915]
<4>[   84.718712]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4>[   84.718712]  drm_ioctl_kernel+0x83/0xf0
<4>[   84.718713]  drm_ioctl+0x2f3/0x3b0
<4>[   84.718713]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4>[   84.718713]  ? trace_hardirqs_on_thunk+0x1a/0x20
<4>[   84.718713]  ? lockdep_hardirqs_on+0xe3/0x1c0
<4>[   84.718713]  ? trace_hardirqs_on_thunk+0x1a/0x20
<4>[   84.718714]  do_vfs_ioctl+0xa0/0x6f0
<4>[   84.718714]  ? retint_kernel+0x2b/0x2b
<4>[   84.718714]  ksys_ioctl+0x35/0x60
<4>[   84.718714]  __x64_sys_ioctl+0x11/0x20
<4>[   84.718714]  do_syscall_64+0x55/0x1c0
<4>[   84.718715]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[   84.718715] RIP: 0033:0x7fb7b7f755d7
<4>[   84.718715] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
<4>[   84.718715] RSP: 002b:00007ffedce1d4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4>[   84.718716] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fb7b7f755d7
<4>[   84.718716] RDX: 00007ffedce1d530 RSI: 0000000040406469 RDI: 0000000000000005
<4>[   84.718716] RBP: 00007ffedce1d530 R08: 0000000000000040 R09: 0000000000000161
<4>[   84.718716] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000040406469
<4>[   84.718717] R13: 0000000000000005 R14: 0000000000000580 R15: 0000000000000161

Comment 2 Chris Wilson 2019-09-12 10:48:31 UTC
So it looks like it got caught in a loop in __i915_schedule() (the NMI watchdog fired at __i915_schedule+0x23a). Given the GEM_BUG_ON(!node_signaled(dep->signaler)) we've seen in bug 111660, it's quite possible that both have the same root cause. Something is going wrong with the execlists execution.

Comment 3 Chris Wilson 2019-09-17 16:44:18 UTC
Presuming

commit c210e85b8f3371168ce78c8da00b913839a84ec7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Sep 17 13:30:55 2019 +0100

    drm/i915/tgl: Extend MI_SEMAPHORE_WAIT
    
    On Tigerlake, MI_SEMAPHORE_WAIT grew an extra dword, so be sure to
    update the length field and emit that extra parameter and any padding
    noop as required.
    
    v2: Define the token shift while we are adding the updated MI_SEMAPHORE_WAIT
    v3: Use int instead of bool in the addition so that readers are not left
    wondering about the intricacies of the C spec. Now they just have to
    worry what the integer value of a boolean operation is...
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
    Cc: Michal Winiarski <michal.winiarski@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190917123055.28965-1-chris@chris-wilson.co.uk

is causing trouble here as well.
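
For illustration only, here is a rough sketch of what the commit message above describes: on Tigerlake (Gen12) the semaphore wait carries one more dword, so the length field is bumped by one and a noop pads the packet. This is not the code from the patch. The function name, the SKETCH_* macros and their encodings, and the assumption that padding is there to keep the packet an even number of dwords are all placeholders chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder encodings for this sketch; the real ones live in the i915 headers. */
#define SKETCH_MI_NOOP            0x00000000u
#define SKETCH_MI_SEMAPHORE_WAIT  ((0x1cu << 23) | 2u) /* opcode | length field */

/*
 * Write one semaphore wait into the command stream at cs and return the
 * pointer just past the last dword written. Hypothetical helper, not an
 * i915 function.
 */
static uint32_t *sketch_emit_semaphore_wait(uint32_t *cs, int gen,
                                            uint32_t seqno, uint64_t addr)
{
    /* int rather than bool, so the addition below is plainly 0 or 1 */
    int has_token = gen >= 12;

    assert(cs != NULL);

    *cs++ = SKETCH_MI_SEMAPHORE_WAIT + has_token; /* length field grows by one */
    *cs++ = seqno;                   /* value to compare against */
    *cs++ = (uint32_t)addr;          /* semaphore address, low 32 bits */
    *cs++ = (uint32_t)(addr >> 32);  /* semaphore address, high 32 bits */
    if (has_token) {
        *cs++ = 0;                   /* the extra dword, left as zero here */
        *cs++ = SKETCH_MI_NOOP;      /* assumed padding to an even dword count */
    }
    return cs;
}
```

With these assumptions the helper writes 4 dwords before Gen12 and 6 dwords on Gen12. If the length field and the emitted dwords disagree, the command streamer would misparse whatever follows the wait, which is the kind of breakage the commit message is guarding against.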
