Bug 112069 - [CI][BAT] : igt@i915_selftest@live_hangcheck - incomplete - GEM_BUG_ON(!assert_pending_valid(execlists, "promote"))
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2019-10-18 20:10 UTC by Lakshmi
Modified: 2019-11-07 08:58 UTC

i915 platform: CFL
i915 features: GEM/Other


Description Lakshmi 2019-10-18 20:10:12 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7130/fi-cfl-guc/igt@i915_selftest@live_hangcheck.html
<3> [464.309546] process_csb:1940 GEM_BUG_ON(!assert_pending_valid(execlists, "promote"))
<4> [464.309558] ------------[ cut here ]------------
<2> [464.309559] kernel BUG at drivers/gpu/drm/i915/gt/intel_lrc.c:1940!
<4> [464.309564] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [464.309565] CPU: 1 PID: 6925 Comm: i915_selftest Tainted: G     U            5.4.0-rc3-CI-CI_DRM_7130+ #1
<4> [464.309567] Hardware name: Micro-Star International Co., Ltd. MS-7B54/Z370M MORTAR (MS-7B54), BIOS 1.10 12/28/2017
<4> [464.309602] RIP: 0010:process_csb+0xad3/0xb60 [i915]
<4> [464.309603] Code: 4c 13 78 e0 48 8b 35 d4 88 24 00 49 c7 c0 80 c4 b1 a0 b9 94 07 00 00 48 c7 c2 f0 3c af a0 48 c7 c7 4e 9c 9a a0 e8 1d 0f 7f e0 <0f> 0b e8 e6 3d 7a e0 49 2b 86 88 17 00 00 49 01 86 98 17 00 00 e9
<4> [464.309604] RSP: 0018:ffffc900005cb7a0 EFLAGS: 00010086
<4> [464.309606] RAX: 000000000000000f RBX: ffff88822c672000 RCX: 0000000000000000
<4> [464.309607] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff888265579c00
<4> [464.309608] RBP: ffffffffa09a06d0 R08: 00000000003f1ccd R09: ffff88824074f000
<4> [464.309608] R10: ffff88824074ff04 R11: ffff888265579c00 R12: ffff888248a255f8
<4> [464.309609] R13: 0000000000000004 R14: ffff8881817034c0 R15: 0000000000000000
<4> [464.309610] FS:  00007f0fa5d55e40(0000) GS:ffff888266680000(0000) knlGS:0000000000000000
<4> [464.309611] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [464.309612] CR2: 00007f1040ff1d80 CR3: 000000019b2e0005 CR4: 00000000003606e0
<4> [464.309613] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [464.309614] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [464.309615] Call Trace:
<4> [464.309648]  __execlists_reset+0x30/0xb30 [i915]
<4> [464.309678]  execlists_reset+0x3d/0x50 [i915]
<4> [464.309708]  intel_engine_reset+0xdf/0x230 [i915]
<4> [464.309736]  __igt_atomic_reset_engine+0x4b/0x90 [i915]
<4> [464.309762]  igt_atomic_reset_engine+0xd2/0x260 [i915]
<4> [464.309791]  ? intel_gt_set_wedged+0x70/0x70 [i915]
<4> [464.309794]  ? queue_work_node+0x70/0x70
<4> [464.309821]  igt_reset_engines_atomic+0x70/0xb0 [i915]
<4> [464.309858]  __i915_subtests+0xb8/0x210 [i915]
<4> [464.309894]  ? __i915_live_teardown+0x50/0x50 [i915]
<4> [464.309927]  ? __intel_gt_live_setup+0x10/0x10 [i915]
<4> [464.309957]  intel_hangcheck_live_selftests+0xb7/0x140 [i915]
<4> [464.309992]  __run_selftests+0x112/0x170 [i915]
<4> [464.310027]  i915_live_selftests+0x2c/0x60 [i915]
<4> [464.310054]  i915_pci_probe+0x93/0x1b0 [i915]
<4> [464.310057]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4> [464.310059]  pci_device_probe+0x9e/0x120
<4> [464.310062]  really_probe+0xea/0x420
<4> [464.310064]  driver_probe_device+0x10b/0x120
<4> [464.310066]  device_driver_attach+0x4a/0x50
<4> [464.310068]  __driver_attach+0x97/0x130
<4> [464.310069]  ? device_driver_attach+0x50/0x50
<4> [464.310071]  bus_for_each_dev+0x74/0xc0
<4> [464.310073]  bus_add_driver+0x142/0x220
<4> [464.310074]  ? 0xffffffffa01af000
<4> [464.310076]  driver_register+0x56/0xf0
<4> [464.310077]  ? 0xffffffffa01af000
<4> [464.310078]  do_one_initcall+0x58/0x2ff
<4> [464.310081]  ? rcu_read_lock_sched_held+0x4d/0x80
<4> [464.310083]  ? kmem_cache_alloc_trace+0x290/0x2c0
<4> [464.310086]  do_init_module+0x56/0x1f8
<4> [464.310088]  load_module+0x243e/0x29f0
<4> [464.310092]  ? __do_sys_finit_module+0xe9/0x110
<4> [464.310094]  __do_sys_finit_module+0xe9/0x110
<4> [464.310097]  do_syscall_64+0x4f/0x210
<4> [464.310099]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [464.310100] RIP: 0033:0x7f0fa540c839
<4> [464.310102] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
<4> [464.310103] RSP: 002b:00007ffedf6b6918 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4> [464.310104] RAX: ffffffffffffffda RBX: 000055da61c4cfc0 RCX: 00007f0fa540c839
<4> [464.310105] RDX: 0000000000000000 RSI: 000055da61c45f90 RDI: 0000000000000006
<4> [464.310106] RBP: 000055da61c45f90 R08: 0000000000000004 R09: 000055da61c46150
<4> [464.310107] R10: 00007ffedf6b6a60 R11: 0000000000000246 R12: 0000000000000000
<4> [464.310108] R13: 000055da61c3f000 R14: 0000000000000020 R15: 0000000000000048
<4> [464.310110] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp mei_hdcp crct10dif_pclmul snd_intel_nhlt crc32_pclmul snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel e1000e snd_pcm mei_me ptp mei pps_core prime_numbers [last unloaded: i915]
<0> [464.310118] Dumping ftrace buffer:
<0> [464.310125] ---------------------------------
Comment 1 CI Bug Log 2019-10-18 20:10:55 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* CFL: igt@i915_selftest@live_hangcheck - incomplete - GEM_BUG_ON(!assert_pending_valid(execlists, "promote"))
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7130/fi-cfl-guc/igt@i915_selftest@live_hangcheck.html
Comment 2 Chris Wilson 2019-10-18 20:37:17 UTC
Still the same incomplete bug as before. We need the full trace.
Comment 3 CI Bug Log 2019-10-23 12:32:17 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* CFL:  igt@i915_selftest@live_hangcheck - incomplete
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7153/fi-cfl-8700k/igt@i915_selftest@live_hangcheck.html
Comment 4 Chris Wilson 2019-10-23 23:24:12 UTC
Found the missing tell-tale on bsw! So simple! Just a missing sync in the selftest before starting the manual reset.
Comment 5 Chris Wilson 2019-10-24 08:30:25 UTC
Going from the hit on bsw and assuming this is one and the same bug,

commit 93100fdeb4de5b13a7f9113ede93cd062ba779f1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Oct 24 00:24:43 2019 +0100

    drm/i915/selftests: Flush interrupts before disabling tasklets
    
    When setting up the system to perform the atomic reset, we need to
    serialise with any ongoing interrupt tasklet or else:
    
    <0> [472.951428] i915_sel-4442    0d..1 466527056us : __i915_request_submit: rcs0 fence 11659:2, current 0
    <0> [472.951554] i915_sel-4442    0d..1 466527059us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:yes
    <0> [472.951681] i915_sel-4442    0d..1 466527061us : trace_ports: rcs0: submit { 11659:2, 0:0 }
    <0> [472.951805] i915_sel-4442    0.... 466527114us : __igt_atomic_reset_engine: i915_reset_engine(rcs0:active) under hardirq
    <0> [472.951932] i915_sel-4442    0d... 466527115us : intel_engine_reset: rcs0 flags=11d
    <0> [472.952056] i915_sel-4442    0d... 466527117us : execlists_reset_prepare: rcs0: depth<-1
    <0> [472.952179] i915_sel-4442    0d... 466527119us : intel_engine_stop_cs: rcs0
    <0> [472.952305]   <idle>-0       1..s1 466527119us : process_csb: rcs0 cs-irq head=3, tail=4
    <0> [472.952431] i915_sel-4442    0d... 466527122us : __intel_gt_reset: engine_mask=1
    <0> [472.952557]   <idle>-0       1..s1 466527124us : process_csb: rcs0 csb[4]: status=0x00000001:0x00000000
    <0> [472.952683]   <idle>-0       1..s1 466527130us : trace_ports: rcs0: promote { 11659:2*, 0:0 }
    <0> [472.952808] i915_sel-4442    0d... 466527131us : execlists_reset: rcs0
    <0> [472.952933] i915_sel-4442    0d..1 466527133us : process_csb: rcs0 cs-irq head=3, tail=4
    <0> [472.953059] i915_sel-4442    0d..1 466527134us : process_csb: rcs0 csb[4]: status=0x00000001:0x00000000
    <0> [472.953185] i915_sel-4442    0d..1 466527136us : trace_ports: rcs0: preempted { 11659:2*, 0:0 }
    <0> [472.953310] i915_sel-4442    0d..1 466527150us : assert_pending_valid: Nothing pending for promotion!
    <0> [472.953436] i915_sel-4442    0d..1 466527158us : process_csb: process_csb:1930 GEM_BUG_ON(!assert_pending_valid(execlists, "promote"))
    
    We have the same CSB events being seen by process_csb() on two different
    processors. One being issued by the reset in the test, the other by the
    interrupt; this scenario is supposed to be prevented by flushing the
    interrupt tasklet with tasklet_disable() before we enter the atomic
    reset.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112069
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191023232443.17450-1-chris@chris-wilson.co.uk
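
The double consumption described in the commit message can be illustrated with a small, deterministic userspace model. This is only a sketch under stated assumptions: the `toy_*` names and the struct are hypothetical and are not the i915 data structures, and the two "CPUs" are modelled as two sequential readers that either share a stale head snapshot (unserialised) or sample the head one after the other (serialised). The request id and the head/tail values are taken from the trace above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the execlists state. */
struct toy_execlists {
	int pending;   /* request id waiting for promotion, 0 = none */
	int active;    /* request id currently executing, 0 = none */
	unsigned head; /* last CSB entry consumed by software */
	unsigned tail; /* last CSB entry written by the hardware */
};

/* Handle one "promote" event; false marks where the real driver would BUG. */
static bool toy_promote(struct toy_execlists *el)
{
	if (!el->pending)
		return false; /* "Nothing pending for promotion!" */
	el->active = el->pending;
	el->pending = 0;
	return true;
}

/* Consume every event between a previously sampled head and the tail. */
static bool toy_process_csb(struct toy_execlists *el, unsigned head)
{
	bool ok = true;

	while (head != el->tail) {
		head++;
		if (!toy_promote(el))
			ok = false;
	}
	el->head = head;
	return ok;
}

/* Unserialised: both readers sample head before either writes it back. */
bool toy_race(void)
{
	struct toy_execlists el = { .pending = 11659, .head = 3, .tail = 4 };
	unsigned irq_head = el.head;   /* interrupt tasklet */
	unsigned reset_head = el.head; /* reset path in the selftest */
	bool irq_ok = toy_process_csb(&el, irq_head);
	bool reset_ok = toy_process_csb(&el, reset_head);

	return irq_ok && reset_ok;
}

/* Serialised: the tasklet finishes before the reset path samples head. */
bool toy_serialised(void)
{
	struct toy_execlists el = { .pending = 11659, .head = 3, .tail = 4 };
	bool irq_ok = toy_process_csb(&el, el.head);
	bool reset_ok = toy_process_csb(&el, el.head);

	return irq_ok && reset_ok;
}
```

In this model `toy_race()` reports a failure because the same event (head=3, tail=4) is promoted twice, while `toy_serialised()` succeeds because the second reader finds head already equal to tail. It says nothing about how the actual fix flushes the tasklet; it only shows why the two readers must not overlap.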
Comment 6 CI Bug Log 2019-11-07 08:43:07 UTC
A CI Bug Log filter associated to this bug has been updated:

{- CFL: igt@i915_selftest@live_hangcheck - incomplete - GEM_BUG_ON(!assert_pending_valid(execlists, "promote")) -}
{+ CFL: igt@i915_selftest@live_hangcheck - incomplete - GEM_BUG_ON(!assert_pending_valid(execlists, "promote")) +}


  No new failures caught with the new filter
Comment 7 CI Bug Log 2019-11-07 08:43:51 UTC
A CI Bug Log filter associated to this bug has been updated:

{- CFL: igt@i915_selftest@live_hangcheck - incomplete - GEM_BUG_ON(!assert_pending_valid(execlists, "promote")) -}
{+ CFL TGL: igt@i915_selftest@live_hangcheck - incomplete - GEM_BUG_ON(!assert_pending_valid(execlists, "promote")) +}


  No new failures caught with the new filter
Comment 8 Lakshmi 2019-11-07 08:58:46 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7129/fi-tgl-u2/igt@i915_selftest@live_execlists.html
Adding the TGL failure for reference. 

The reproduction rate of this issue is once in 8 runs. It was last seen in CI_DRM_7153 (2 weeks, 1 day ago), and the current run is 7277.

Closing and archiving this issue.
Comment 9 CI Bug Log 2019-11-07 08:58:59 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

