Bug 106693

Summary: [CI] igt@* - incomplete - intel_engine_unpin_breadcrumbs_irq:226 GEM_BUG_ON(!b->irq_enabled)
Product: DRI
Component: DRM/Intel
Version: XOrg git
Hardware: Other
OS: All
Status: CLOSED FIXED
Severity: normal
Priority: medium
Whiteboard: ReadyForDev
i915 platform: CFL, KBL
i915 features: firmware/guc
Reporter: Martin Peres <martin.peres>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: chris, intel-gfx-bugs, marta.lofstedt

Description Martin Peres 2018-05-28 14:56:51 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_42/fi-cfl-guc/igt@gem_eio@throttle.html

<4>[   54.621438] ------------[ cut here ]------------
<2>[   54.621440] kernel BUG at drivers/gpu/drm/i915/intel_breadcrumbs.c:226!
<4>[   54.621463] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<0>[   54.621469] Dumping ftrace buffer:
<0>[   54.621475]    (ftrace buffer empty)
<4>[   54.621479] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul snd_hda_codec snd_hwdep ghash_clmulni_intel snd_hda_core e1000e snd_pcm mei_me prime_numbers mei
<4>[   54.621513] CPU: 11 PID: 2451 Comm: kworker/u24:35 Tainted: G     U            4.17.0-rc5-g5e68b6c9b2d2-drmtip_42+ #1
<4>[   54.621522] Hardware name: Micro-Star International Co., Ltd. MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
<4>[   54.621558] Workqueue: i915 i915_gem_idle_work_handler [i915]
<4>[   54.621591] RIP: 0010:intel_engine_unpin_breadcrumbs_irq+0x8c/0x90 [i915]
<4>[   54.621597] RSP: 0018:ffffa87340d27db8 EFLAGS: 00010086
<4>[   54.621603] RAX: 000000000000000e RBX: ffff895413d3c2a8 RCX: 0000000000000000
<4>[   54.621609] RDX: 0000000000000000 RSI: 0000000000000050 RDI: 0000000000000000
<4>[   54.621616] RBP: ffff895413d3c3b0 R08: ffffffffc063d71b R09: 0000000000000001
<4>[   54.621622] R10: ffffa87340d27d38 R11: 0000000000000000 R12: ffff895415f87678
<4>[   54.621628] R13: 0000000cb8078d14 R14: 00000000ffffffff R15: ffff895415f80000
<4>[   54.621635] FS:  0000000000000000(0000) GS:ffff8954264c0000(0000) knlGS:0000000000000000
<4>[   54.621642] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   54.621648] CR2: 00005627176a0210 CR3: 000000001a210003 CR4: 00000000003606e0
<4>[   54.621654] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[   54.621661] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[   54.621667] Call Trace:
<4>[   54.621697]  intel_engines_park+0x122/0x1b0 [i915]
<4>[   54.621705]  ? synchronize_irq+0x3e/0xb0
<4>[   54.621734]  i915_gem_idle_work_handler+0x207/0x3c0 [i915]
<4>[   54.621741]  process_one_work+0x229/0x6a0
<4>[   54.621748]  worker_thread+0x35/0x380
<4>[   54.621753]  ? process_one_work+0x6a0/0x6a0
<4>[   54.621758]  kthread+0x119/0x130
<4>[   54.621763]  ? _kthread_create_on_node+0x60/0x60
<4>[   54.621769]  ret_from_fork+0x3a/0x50
<4>[   54.621776] Code: e8 a3 ee ba ef 48 8b 35 fb 22 1a 00 49 c7 c0 1b d7 63 c0 b9 e2 00 00 00 48 c7 c2 20 43 62 c0 48 c7 c7 e7 e4 54 c0 e8 d4 58 c1 ef <0f> 0b 66 90 f6 87 38 02 00 00 01 75 02 f3 c3 41 55 41 54 55 53 
<1>[   54.621855] RIP: intel_engine_unpin_breadcrumbs_irq+0x8c/0x90 [i915] RSP: ffffa87340d27db8
<4>[   54.621864] ---[ end trace 570180a713d22aa1 ]---


https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_42/fi-kbl-guc/igt@gem_eio@execbuf.html

<4>[   97.275205] ------------[ cut here ]------------
<2>[   97.275206] kernel BUG at drivers/gpu/drm/i915/intel_breadcrumbs.c:226!
<4>[   97.275222] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<0>[   97.275226] Dumping ftrace buffer:
<0>[   97.275231]    (ftrace buffer empty)
<4>[   97.275234] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core mei_me e1000e snd_pcm prime_numbers mei
<4>[   97.275258] CPU: 1 PID: 127 Comm: kworker/u16:1 Tainted: G     U            4.17.0-rc5-g5e68b6c9b2d2-drmtip_42+ #1
<4>[   97.275265] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4>[   97.275301] Workqueue: i915 i915_gem_idle_work_handler [i915]
<4>[   97.275325] RIP: 0010:intel_engine_unpin_breadcrumbs_irq+0x8c/0x90 [i915]
<4>[   97.275330] RSP: 0018:ffffa2f94032fdb8 EFLAGS: 00010086
<4>[   97.275334] RAX: 000000000000000e RBX: ffff9b1d27a9a158 RCX: 0000000000000000
<4>[   97.275339] RDX: 0000000000000000 RSI: 0000000000000050 RDI: 0000000000000000
<4>[   97.275343] RBP: ffff9b1d27a9a260 R08: ffffffffc061471b R09: 0000000000000001
<4>[   97.275348] R10: ffffa2f94032fd38 R11: 0000000000000000 R12: ffff9b1d1b017678
<4>[   97.275352] R13: 00000016a6a52b82 R14: 00000000ffffffff R15: ffff9b1d1b010000
<4>[   97.275357] FS:  0000000000000000(0000) GS:ffff9b1d36c40000(0000) knlGS:0000000000000000
<4>[   97.275362] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   97.275366] CR2: 00007f4b858cd010 CR3: 000000001b210002 CR4: 00000000003606e0
<4>[   97.275371] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[   97.275375] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[   97.275380] Call Trace:
<4>[   97.275402]  intel_engines_park+0x122/0x1b0 [i915]
<4>[   97.275408]  ? synchronize_irq+0x3e/0xb0
<4>[   97.275429]  i915_gem_idle_work_handler+0x207/0x3c0 [i915]
<4>[   97.275435]  process_one_work+0x229/0x6a0
<4>[   97.275439]  worker_thread+0x35/0x380
<4>[   97.275443]  ? process_one_work+0x6a0/0x6a0
<4>[   97.275447]  kthread+0x119/0x130
<4>[   97.275450]  ? _kthread_create_on_node+0x60/0x60
<4>[   97.275455]  ret_from_fork+0x3a/0x50
<4>[   97.275460] Code: e8 a3 7e bd d4 48 8b 35 fb 22 1a 00 49 c7 c0 1b 47 61 c0 b9 e2 00 00 00 48 c7 c2 20 b3 5f c0 48 c7 c7 e7 54 52 c0 e8 d4 e8 c3 d4 <0f> 0b 66 90 f6 87 38 02 00 00 01 75 02 f3 c3 41 55 41 54 55 53 
<1>[   97.275518] RIP: intel_engine_unpin_breadcrumbs_irq+0x8c/0x90 [i915] RSP: ffffa2f94032fdb8
<4>[   97.275524] ---[ end trace c201766107365ac8 ]---
Comment 1 Chris Wilson 2018-06-08 21:51:23 UTC
*** Bug 105864 has been marked as a duplicate of this bug. ***
Comment 2 Chris Wilson 2018-09-06 19:42:33 UTC
commit 209b7955e59e361fe8ba1911fac68f46355ac0cf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jul 17 21:29:32 2018 +0100

    drm/i915/guc: Keep guc submission permanently engaged
    
    We make a decision at module load whether to use the GuC backend or not,
    but lose that setup across set-wedge. Currently, the guc doesn't
    override the engine->set_default_submission hook letting execlists sneak
    back in temporarily on unwedging leading to an unbalanced park/unpark.
    
    v2: Remove comment about switching back temporarily to execlists on
    guc_submission_disable(). We currently only call disable on shutdown,
    and plan to also call disable before suspend and reset, in which case we
    will either restore guc submission or mark the driver as wedged, making
    the reset back to execlists pointless.
    v3: Move reset.prepare across
    
    Fixes: 63572937cebf ("drm/i915/execlists: Flush pending preemption events during reset")
    Testcase: igt/drv_module_reload/basic-reload-inject
    Testcase: igt/gem_eio
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
    Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180717202932.1423-1-chris@chris-wilson.co.uk
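
Editorial illustration (not part of the commit): the unbalanced park/unpark described in the commit message above can be pictured with a small toy model. This is only a sketch under stated assumptions. The class and function names are hypothetical stand-ins rather than the driver's code, and the exact ordering of events around unwedging is assumed for the sake of the example.

```python
class GemBugOn(Exception):
    """Stands in for the kernel's GEM_BUG_ON assertion."""


def noop(engine):
    """Hypothetical execlists park/unpark: takes no breadcrumbs irq pin."""


def pin_irq(engine):
    """Hypothetical GuC unpark: pins the breadcrumbs interrupt."""
    engine.irq_count += 1
    engine.irq_enabled = True


def unpin_irq(engine):
    """Hypothetical GuC park: drops the pin taken by the matching unpark."""
    if not engine.irq_enabled:
        raise GemBugOn("GEM_BUG_ON(!b->irq_enabled)")
    engine.irq_count -= 1
    if engine.irq_count == 0:
        engine.irq_enabled = False


def execlists_set_default_submission(engine):
    engine.unpark = noop
    engine.park = noop


def guc_set_default_submission(engine):
    engine.unpark = pin_irq
    engine.park = unpin_irq


class Engine:
    def __init__(self):
        self.irq_count = 0
        self.irq_enabled = False
        self.unpark = noop
        self.park = noop
        # Before the fix, the default hook always restores execlists.
        self.set_default_submission = execlists_set_default_submission


def simulate(guc_overrides_default):
    """Run one wedge/unwedge cycle; report 'balanced' or the assertion hit."""
    engine = Engine()
    guc_set_default_submission(engine)      # module load selects the GuC backend
    if guc_overrides_default:               # the fix: GuC owns the default hook
        engine.set_default_submission = guc_set_default_submission
    try:
        engine.unpark(engine)               # normal, balanced cycle
        engine.park(engine)
        engine.set_default_submission(engine)  # unwedge resets the backend
        engine.unpark(engine)               # no pin taken if execlists sneaked in
        guc_set_default_submission(engine)  # GuC hooks come back (assumed order)
        engine.park(engine)                 # GuC park expects a pin
    except GemBugOn as err:
        return str(err)
    return "balanced"
```

In this model, `simulate(False)` ends on the same assertion text as the report, while `simulate(True)` stays balanced because the default-submission hook reinstalls the GuC park/unpark pair.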
Comment 3 Martin Peres 2018-09-14 13:07:31 UTC
(In reply to Chris Wilson from comment #2)
> commit 209b7955e59e361fe8ba1911fac68f46355ac0cf
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Jul 17 21:29:32 2018 +0100
> 
>     drm/i915/guc: Keep guc submission permanently engaged
>     
>     We make a decision at module load whether to use the GuC backend or not,
>     but lose that setup across set-wedge. Currently, the guc doesn't
>     override the engine->set_default_submission hook letting execlists sneak
>     back in temporarily on unwedging leading to an unbalanced park/unpark.
>     
>     v2: Remove comment about switching back temporarily to execlists on
>     guc_submission_disable(). We currently only call disable on shutdown,
>     and plan to also call disable before suspend and reset, in which case we
>     will either restore guc submission or mark the driver as wedged, making
>     the reset back to execlists pointless.
>     v3: Move reset.prepare across
>     
>     Fixes: 63572937cebf ("drm/i915/execlists: Flush pending preemption events during reset")
>     Testcase: igt/drv_module_reload/basic-reload-inject
>     Testcase: igt/gem_eio
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Michał Winiarski <michal.winiarski@intel.com>
>     Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>     Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20180717202932.1423-1-chris@chris-wilson.co.uk

I would like to believe this solved it, but the evidence has started piling up... It looks like this commit did not change anything :s

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_110/fi-cfl-guc/igt@prime_busy@hang-default.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_110/fi-kbl-guc/igt@prime_busy@wait-hang-default.html
Comment 4 Chris Wilson 2018-09-14 13:12:48 UTC
Nah, this GEM_BUG_ON is definitely solved. There's no way the guc can trigger it anymore. Those, they are another matter :-p
Comment 5 Lakshmi 2018-10-16 08:27:32 UTC
(In reply to Chris Wilson from comment #4)
> Nah, this GEM_BUG_ON is definitely solved. There's no way the guc can
> trigger it anymore. Those, they are another matter :-p

Meaning we need a separate bug for the ongoing failures?
I see that this is still happening:

Call Trace:
<4> [479.329919]  <IRQ>
<4> [479.329931]  ? lock_acquire+0xa6/0x1c0
<4> [479.329937]  ? handle_irq_event+0x3a/0x50
<4> [479.329947]  tasklet_action_common.isra.5+0x47/0xb0
<4> [479.329957]  __do_softirq+0xd8/0x483
<4> [479.329964]  ? _raw_spin_unlock+0x29/0x40
<4> [479.329973]  irq_exit+0xa9/0xc0
<4> [479.329977]  do_IRQ+0x9a/0x120
<4> [479.329985]  common_interrupt+0xf/0xf
<4> [479.329989]  </IRQ>
<4> [479.329996] RIP: 0010:cpuidle_enter_state+0xab/0x340
<4> [479.330000] Code: 44 00 00 31 ff e8 25 88 94 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 70 02 00 00 31 ff e8 0e 2b 9b ff e8 39 fd 9e ff fb 4c 29 fb <48> ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
<4> [479.330003] RSP: 0018:ffffc900000a7e90 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffde
<4> [479.330010] RAX: ffff88027623ce40 RBX: 00000000000345c1 RCX: 0000000000000000
<4> [479.330013] RDX: 0000000000000046 RSI: ffffffff82124e7a RDI: ffffffff820d38bf
<4> [479.330017] RBP: 0000000000000004 R08: 0000000000000001 R09: 0000000000000000
<4> [479.330020] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880277bac980
<4> [479.330024] R13: ffffffff82298578 R14: 0000000000000000 R15: 0000006f99a3e023
<4> [479.330045]  do_idle+0x1f3/0x260
<4> [479.330054]  cpu_startup_entry+0x6a/0x70
<4> [479.330061]  start_secondary+0x19d/0x1f0
<4> [479.330068]  secondary_startup_64+0xa4/0xb0
<4> [479.330084] irq event stamp: 8264263
<4> [479.330089] hardirqs last  enabled at (8264262): [<ffffffff8108c8d9>] tasklet_action_common.isra.5+0x29/0xb0
<4> [479.330094] hardirqs last disabled at (8264263): [<ffffffff8194113d>] _raw_spin_lock_irqsave+0xd/0x50
<4> [479.330098] softirqs last  enabled at (8264258): [<ffffffff8108c488>] irq_enter+0x58/0x60
<4> [479.330103] softirqs last disabled at (8264259): [<ffffffff8108c539>] irq_exit+0xa9/0xc0
<4> [479.330190] WARNING: CPU: 3 PID: 0 at drivers/gpu/drm/i915/intel_guc_submission.c:638 guc_submission_tasklet+0x7db/0x960 [i915]
<4> [479.330196] ---[ end trace ef18452cc0701dee ]---
<3> [479.330205] __i915_request_submit:445 GEM_BUG_ON(intel_engine_signaled(engine, seqno))
<4> [479.330404] ------------[ cut here ]------------
<2> [479.330408] kernel BUG at drivers/gpu/drm/i915/i915_request.c:445!
<4> [479.330421] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
<4> [479.330426] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G     U  W         4.19.0-rc8-CI-CI_DRM_4984+ #1
<4> [479.330430] Hardware name: Intel corporation NUC6CAYS/NUC6CAYB, BIOS AYAPLCEL.86A.0056.2018.0926.1100 09/26/2018
<4> [479.330518] RIP: 0010:__i915_request_submit+0x271/0x280 [i915]
<4> [479.330522] Code: 2e 4b f9 e0 48 8b 35 9e fd 1b 00 49 c7 c0 28 79 28 a0 b9 bd 01 00 00 48 c7 c2 20 38 25 a0 48 c7 c7 1c 55 16 a0 e8 ff da ff e0 <0f> 0b 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 48 89 fd 53
<4> [479.330526] RSP: 0018:ffff880277b83e70 EFLAGS: 00010082
<4> [479.330531] RAX: 0000000000000011 RBX: ffff8801886c0940 RCX: 0000000000000000
<4> [479.330534] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff880276991a98
<4> [479.330538] RBP: ffff880277b83e98 R08: 000000000009903e R09: ffff8802762c5000
<4> [479.330541] R10: ffff880277b83e88 R11: ffff880276991a98 R12: 0000000000000005
<4> [479.330544] R13: ffff8802373b4730 R14: ffff8802373b42a8 R15: ffff8801886c0b38
<4> [479.330548] FS:  0000000000000000(0000) GS:ffff880277b80000(0000) knlGS:0000000000000000
<4> [479.330551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [479.330555] CR2: 00005652ab31c4a8 CR3: 0000000005210000 CR4: 00000000003406e0
<4> [479.330558] Call Trace:
<4> [479.330561]  <IRQ>
<4> [479.330648]  guc_submission_tasklet+0x33e/0x960 [i915]
<4> [479.330659]  tasklet_action_common.isra.5+0x47/0xb0
<4> [479.330666]  __do_softirq+0xd8/0x483
<4> [479.330671]  ? _raw_spin_unlock+0x29/0x40
<4> [479.330677]  irq_exit+0xa9/0xc0
<4> [479.330682]  do_IRQ+0x9a/0x120
<4> [479.330687]  common_interrupt+0xf/0xf
<4> [479.330691]  </IRQ>
<4> [479.330695] RIP: 0010:cpuidle_enter_state+0xab/0x340
<4> [479.330699] Code: 44 00 00 31 ff e8 25 88 94 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 70 02 00 00 31 ff e8 0e 2b 9b ff e8 39 fd 9e ff fb 4c 29 fb <48> ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
<4> [479.330703] RSP: 0018:ffffc900000a7e90 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffde
<4> [479.330708] RAX: ffff88027623ce40 RBX: 00000000000345c1 RCX: 0000000000000000
<4> [479.330711] RDX: 0000000000000046 RSI: ffffffff82124e7a RDI: ffffffff820d38bf
<4> [479.330714] RBP: 0000000000000004 R08: 0000000000000001 R09: 0000000000000000
<4> [479.330718] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880277bac980
<4> [479.330721] R13: ffffffff82298578 R14: 0000000000000000 R15: 0000006f99a3e023
<4> [479.330734]  do_idle+0x1f3/0x260
<4> [479.330740]  cpu_startup_entry+0x6a/0x70
<4> [479.330746]  start_secondary+0x19d/0x1f0
<4> [479.330751]  secondary_startup_64+0xa4/0xb0
<4> [479.330760] Modules linked in: i915(+) amdgpu chash gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btrtl x86_pkg_temp_thermal coretemp btbcm crct10dif_pclmul btintel crc32_pclmul bluetooth ghash_clmulni_intel ecdh_generic lpc_ich r8169 snd_hda_codec snd_hwdep snd_hda_core snd_pcm mei_me pinctrl_broxton pinctrl_intel mei prime_numbers [last unloaded: i915]
<0> [479.330813] Dumping ftrace buffer:
<0> [479.330818] ---------------------------------
Comment 6 Martin Peres 2018-11-13 16:15:57 UTC
Filed https://bugs.freedesktop.org/show_bug.cgi?id=108732 for GEM_BUG_ON(intel_engine_signaled(engine, seqno)), which is the only one happening in BAT.

Closing and archiving this bug now. I will file new bugs when issues show up in drmtip.
