On CI_DRM_2915, the machine fi-bxt-j4205 produced the following kernel BUG when running igt@kms_pipe_crc_basic@nonblocking-crc-pipe-b:

<4>[ 412.528644] ------------[ cut here ]------------
<2>[ 412.528664] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:604!
<4>[ 412.528680] invalid opcode: 0000 [#1] PREEMPT SMP
<4>[ 412.528691] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm r8169 mei_me mii mei lpc_ich prime_numbers i2c_hid pinctrl_broxton pinctrl_intel
<4>[ 412.529242] CPU: 2 PID: 4271 Comm: kms_pipe_crc_ba Tainted: G U W 4.13.0-rc3-CI-CI_DRM_2915+ #1
<4>[ 412.529262] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.10 09/29/2016
<4>[ 412.529283] task: ffff88027663ce40 task.stack: ffffc90000b5c000
<4>[ 412.529371] RIP: 0010:intel_lrc_irq_handler+0x25c/0x500 [i915]
<4>[ 412.529383] RSP: 0018:ffff88027fd03ea0 EFLAGS: 00010202
<4>[ 412.529412] RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffff880263ccb1c0
<4>[ 412.529426] RDX: 0000000000000006 RSI: ffffc900008223a0 RDI: 00000000ffffffff
<4>[ 412.529440] RBP: ffff88027fd03f00 R08: ffff88026f6e0000 R09: 0000000000000000
<4>[ 412.529454] R10: ffff88026cae8058 R11: 0000000000000000 R12: ffff88026cae8008
<4>[ 412.529468] R13: 0000000000008002 R14: 0000000000000003 R15: ffffc90000822370
<4>[ 412.529483] FS: 00007f2e25787a40(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
<4>[ 412.529501] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 412.529513] CR2: 000000401903a088 CR3: 00000002187f0000 CR4: 00000000003406e0
<4>[ 412.529526] Call Trace:
<4>[ 412.529534] <IRQ>
<4>[ 412.529547] tasklet_hi_action+0x93/0x120
<4>[ 412.529558] __do_softirq+0xbb/0x4b0
<4>[ 412.529570] irq_exit+0xa9/0xc0
<4>[ 412.529581] do_IRQ+0x6c/0x130
<4>[ 412.529592] common_interrupt+0x90/0x90
<4>[ 412.529602] RIP: 0010:_raw_spin_unlock_irqrestore+0x54/0x60
<4>[ 412.529614] RSP: 0018:ffffc90000b5fa48 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff6d
<4>[ 412.529634] RAX: ffffffffffffffff RBX: 0000000000000206 RCX: 0000000000000000
<4>[ 412.529648] RDX: ffffffff8146d625 RSI: 0000000000000001 RDI: ffffffff8188f292
<4>[ 412.529662] RBP: ffffc90000b5fa58 R08: 0000000000000001 R09: 0000000000000000
<4>[ 412.529676] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82da4028
<4>[ 412.529690] R13: ffffffffa025ae60 R14: ffff880273e1a008 R15: ffff880263cc9348
<4>[ 412.529704] </IRQ>
<4>[ 412.529714] ? __debug_object_init+0x215/0x420
<4>[ 412.529725] ? _raw_spin_unlock_irqrestore+0x52/0x60
<4>[ 412.529738] __debug_object_init+0x215/0x420
<4>[ 412.529782] ? reset_all_global_seqno.part.7+0x108/0x108 [i915]
<4>[ 412.529796] debug_object_init+0x16/0x20
<4>[ 412.529834] __i915_sw_fence_init+0x2e/0x60 [i915]
<4>[ 412.529879] i915_gem_request_alloc+0x191/0x400 [i915]
<4>[ 412.529922] i915_gem_do_execbuffer+0x6ba/0x12b0 [i915]
<4>[ 412.529941] ? lock_acquire+0xb0/0x200
<4>[ 412.529953] ? __might_fault+0x39/0x90
<4>[ 412.529995] i915_gem_execbuffer2+0x9e/0x1a0 [i915]
<4>[ 412.530038] ? i915_gem_execbuffer+0x2b0/0x2b0 [i915]
<4>[ 412.530052] drm_ioctl_kernel+0x64/0xb0
<4>[ 412.530063] drm_ioctl+0x2f4/0x3d0
<4>[ 412.530107] ? i915_gem_execbuffer+0x2b0/0x2b0 [i915]
<4>[ 412.530125] ? mntput_no_expire+0x7f/0x3d0
<4>[ 412.530136] ? mntput+0x1f/0x30
<4>[ 412.530147] do_vfs_ioctl+0x8f/0x660
<4>[ 412.530158] ? task_work_run+0x8f/0xb0
<4>[ 412.530169] SyS_ioctl+0x3c/0x70
<4>[ 412.530180] entry_SYSCALL_64_fastpath+0x1c/0xb1
<4>[ 412.530191] RIP: 0033:0x7f2e2397cf07
<4>[ 412.530200] RSP: 002b:00007ffe0be94258 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4>[ 412.530218] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f2e2397cf07
<4>[ 412.530232] RDX: 00007ffe0be942f0 RSI: 0000000040406469 RDI: 0000000000000004
<4>[ 412.530245] RBP: 0000000000000000 R08: 0000000000000008 R09: 0000000000000000
<4>[ 412.530259] R10: 000000000000001f R11: 0000000000000246 R12: 0000000000000001
<4>[ 412.530273] R13: 00007f2e23c45c40 R14: 0000000000000000 R15: 0000000000000000
<4>[ 412.530289] Code: ff 41 83 e5 08 0f 85 d8 fe ff ff 0f 0b 0f 0b 0f 0b 48 89 cf 4c 89 45 c0 4c 89 55 c8 e8 be 4c 47 e1 4c 8b 45 c0 4c 8b 55 c8 eb 92 <0f> 0b 0f 0b 49 8d 84 24 40 03 00 00 48 83 e2 fc 48 89 45 a8 74
<1>[ 412.530434] RIP: intel_lrc_irq_handler+0x25c/0x500 [i915] RSP: ffff88027fd03ea0
<4>[ 412.530481] ---[ end trace 8d25a06d708c0561 ]---
<0>[ 412.740935] Kernel panic - not syncing: Fatal exception in interrupt
<0>[ 412.740975] Kernel Offset: disabled

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2915/fi-bxt-j4205/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-b.html
The HW replied with one more context-switch than we submitted. It is most disturbing. gem_concurrent_blit can typically hit this within a few minutes.
Hitting the same assertion on bxt/gem_concurrent_blit always leaves a trail of breadcrumbs like:

[ 61.964007] bcs0(3): 0+ rq.seqno=2275 [ctx=91], count=1
[ 61.964050] bcs0(1): [3/3] status=1
[ 61.964096] bcs0(1): 0+ rq.seqno=2276 [ctx=91], count=2
[ 61.964180] bcs0(1): [4/4] status=8002
[ 61.964189] bcs0(1): - rq.seqno=2276 [ctx=91], status=8002, count=2
[ 61.977615] bcs0(1): [5/5] status=18
[ 61.977640] bcs0(1): - rq.seqno=2276 [ctx=91], status=18, count=1
[ 61.995341] bcs0(2): 0+ rq.seqno=2277 [ctx=9b], count=1
[ 61.995427] bcs0(2): 0+ rq.seqno=2278 [ctx=9b], count=2
[ 61.995623] bcs0(1): [0/1] status=1
[ 61.995786] bcs0(1): [1/1] status=8002
[ 61.995802] bcs0(1): - rq.seqno=2278 [ctx=9b], status=8002, count=2
[ 62.008366] bcs0(1): [2/2] status=18
[ 62.008391] bcs0(1): - rq.seqno=2278 [ctx=9b], status=18, count=1
[ 62.020052] bcs0(0): 0+ rq.seqno=2279 [ctx=8c], count=1
[ 62.020142] bcs0(0): 0+ rq.seqno=2280 [ctx=8c], count=2
[ 62.020207] bcs0(1): [3/4] status=12
[ 62.020219] bcs0(1): - rq.seqno=2280 [ctx=8c], status=12, count=2

^ This is the bogus event. 0x12 == COMPLETE | PREEMPTED, but it is following the ACTIVE->IDLE notification, where we always expect the IDLE->ACTIVE (status=1) afterwards (see examples above). Following the bogus 0x12 wakeup, the context-switch notification never "catches up", i.e. we never see the final context-switch of ELEMENT_SWITCH or IDLE. At that point only a reset seems to cure it.

[ 62.020228] bcs0(1): [4/4] status=8002
[ 62.020235] bcs0(1): - rq.seqno=2280 [ctx=8c], status=8002, count=1
[ 62.024699] bcs0(3): 1+ rq.seqno=2281 [ctx=96], count=1
[ 62.024724] bcs0(3): 0+ rq.seqno=2280 [ctx=8c], count=2
[ 62.024753] bcs0(1): [5/5] status=8002
[ 62.024774] bcs0(1): - rq.seqno=2280 [ctx=8c], status=8002, count=2

Now, answers on a postcard as to what we are meant to do to prevent heading into this cul-de-sac.
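For anyone reading the breadcrumbs, here is a minimal sketch of how the status words decode and how the bogus event could be flagged. The bit values are assumed from the GEN8_CTX_STATUS_* definitions in intel_lrc.c of this era (they agree with 0x12 == COMPLETE | PREEMPTED above); the helper names csb_describe() and csb_is_bogus() are hypothetical and are not part of the driver.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* CSB status bits (assumed values, mirroring GEN8_CTX_STATUS_* in intel_lrc.c) */
#define CTX_STATUS_IDLE_ACTIVE    (1u << 0)
#define CTX_STATUS_PREEMPTED      (1u << 1)
#define CTX_STATUS_ELEMENT_SWITCH (1u << 2)
#define CTX_STATUS_ACTIVE_IDLE    (1u << 3)
#define CTX_STATUS_COMPLETE       (1u << 4)
#define CTX_STATUS_LITE_RESTORE   (1u << 15)

/* Hypothetical helper: write the names of the set bits into buf (len > 0). */
static void csb_describe(uint32_t status, char *buf, size_t len)
{
	static const struct { uint32_t bit; const char *name; } names[] = {
		{ CTX_STATUS_IDLE_ACTIVE,    "IDLE_ACTIVE" },
		{ CTX_STATUS_PREEMPTED,      "PREEMPTED" },
		{ CTX_STATUS_ELEMENT_SWITCH, "ELEMENT_SWITCH" },
		{ CTX_STATUS_ACTIVE_IDLE,    "ACTIVE_IDLE" },
		{ CTX_STATUS_COMPLETE,       "COMPLETE" },
		{ CTX_STATUS_LITE_RESTORE,   "LITE_RESTORE" },
	};
	size_t used = 0;

	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		int n;

		if (!(status & names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " | " : "", names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break; /* output truncated, stop here */
		used += (size_t)n;
	}
}

/*
 * Hypothetical check for the pattern in the trace: the previous event said
 * the engine went ACTIVE->IDLE, yet the next one claims COMPLETE | PREEMPTED
 * instead of the expected IDLE->ACTIVE.
 */
static bool csb_is_bogus(uint32_t prev, uint32_t status)
{
	const uint32_t bad = CTX_STATUS_COMPLETE | CTX_STATUS_PREEMPTED;

	return (prev & CTX_STATUS_ACTIVE_IDLE) && (status & bad) == bad;
}
```

With these assumptions, status=18 followed by status=1 is the healthy sequence, while status=18 followed by status=12 is the one that trips the assertion.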
Based on CI data, the failure rate is 1 in 105 runs (about 1 %). Dropping priority.
*** Bug 102705 has been marked as a duplicate of this bug. ***
Here we go again: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3112/fi-bxt-j4205/igt@kms_busy@basic-flip-c.html
*** Bug 103190 has been marked as a duplicate of this bug. ***
*** Bug 103178 has been marked as a duplicate of this bug. ***
Glk? I thought glk followed the cnl pattern (random context id reported) rather than this pattern of impossible lite-restore following complete | idle.
(In reply to Chris Wilson from comment #8)
> Glk? I thought glk followed the cnl pattern (random context id reported)
> rather than this pattern of impossible lite-restore following complete |
> idle.

Nope, see https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5993/fi-glk-1/dmesg-1507731622_Panic_2.log
Definitely glk shares the bxt failure.
Also, https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3209/shard-apl5/igt@kms_busy@extended-pageflip-hang-oldfb-render-B.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3209/shard-apl5/dmesg-1507719682_Oops_1.log https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3209/shard-apl5/dmesg-1507719682_Panic_2.log
Also on: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3228/shard-apl7/igt@kms_properties@connector-properties-legacy.html

Note there is an issue: when we get multiple pstore logs on one of the shared machines, you have to compare with dmesg to figure out which one belongs to which shard. In this case, this is the relevant pstore backtrace: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3228/shard-apl7/dmesg-1507893658_Panic_2.log
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3237/shard-apl2/igt@kms_cursor_legacy@cursorA-vs-flipA-atomic.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3238/shard-apl2/igt@kms_draw_crc@draw-method-xrgb8888-render-xtiled.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3242/shard-apl2/igt@kms_vblank@query-forked.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3242/shard-apl4/igt@kms_universal_plane@cursor-fb-leak-pipe-B.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3248/shard-apl3/igt@kms_flip@flip-vs-modeset-vs-hang.html
reproduced: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3251/shard-apl1/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-cpu-ytiled.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3252/shard-apl7/igt@kms_draw_crc@draw-method-xrgb8888-mmap-cpu-untiled.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3253/shard-apl4/igt@kms_cursor_legacy@basic-flip-before-cursor-legacy.html
Reproduced on GLK-shards https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3252/shard-glkb2/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-move.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3252/shard-glkb2/dmesg-1508246195_Oops_1.log
Also, https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3254/shard-glkb1/igt@kms_frontbuffer_tracking@psr-1p-offscren-pri-shrfb-draw-render.html and: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3254/shard-glkb1/igt@kms_setmode@basic.html Note that both pstore logs from shard-glkb1 show the same issue.
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3256/shard-glkb3/igt@kms_frontbuffer_tracking@fbcpsr-2p-shrfb-fliptrack.html
GLK-shards again: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3257/shard-glkb3/igt@kms_frontbuffer_tracking@psr-2p-primscrn-pri-shrfb-draw-mmap-gtt.html
Also, CI_DRM_3266 fi-skl-6260u: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3266/fi-skl-6260u/igt@kms_cursor_legacy@basic-flip-after-cursor-varying-size.html

<4>[ 284.776232] ------------[ cut here ]------------
<2>[ 284.776235] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:782!
<4>[ 284.776260] invalid opcode: 0000 [#1] PREEMPT SMP
<4>[ 284.776268] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp snd_hda_intel snd_hda_codec coretemp snd_hwdep crct10dif_pclmul snd_hda_core crc32_pclmul ghash_clmulni_intel snd_pcm e1000e ptp pps_core mei_me mei prime_numbers pinctrl_sunrisepoint pinctrl_intel i2c_hid
<4>[ 284.776320] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G U 4.14.0-rc5-CI-CI_DRM_3266+ #1
<4>[ 284.776329] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0057.2017.0119.1758 01/19/2017
<4>[ 284.776338] task: ffff8802652f50c0 task.stack: ffffc900000ac000
<4>[ 284.776369] RIP: 0010:intel_lrc_irq_handler+0x304/0x8a0 [i915]
<4>[ 284.776375] RSP: 0018:ffff88026ed03ec0 EFLAGS: 00010246
<4>[ 284.776382] RAX: ffff8802525545f0 RBX: ffff880252554590 RCX: 0000000000000000
<4>[ 284.776389] RDX: 0000000000000000 RSI: ffffffff81d0d974 RDI: ffff8802525542a8
<4>[ 284.776396] RBP: ffff88026ed03f18 R08: 0000000000000000 R09: 0000000000000001
<4>[ 284.776403] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880259990000
<4>[ 284.776409] R13: 0000000000000000 R14: ffffffff81d1849f R15: 0000000000000000
<4>[ 284.776417] FS: 0000000000000000(0000) GS:ffff88026ed00000(0000) knlGS:0000000000000000
<4>[ 284.776425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 284.776431] CR2: 00007f6cef3eb000 CR3: 0000000003e0f001 CR4: 00000000003606e0
<4>[ 284.776438] Call Trace:
<4>[ 284.776442] <IRQ>
<4>[ 284.776447] ? tasklet_hi_action+0x71/0x120
<4>[ 284.776454] ? __this_cpu_preempt_check+0x13/0x20
<4>[ 284.776461] tasklet_hi_action+0x98/0x120
<4>[ 284.776467] __do_softirq+0xc0/0x4ae
<4>[ 284.776473] irq_exit+0xae/0xc0
<4>[ 284.776478] smp_apic_timer_interrupt+0x9e/0x2e0
<4>[ 284.776484] apic_timer_interrupt+0x9a/0xa0
<4>[ 284.776489] </IRQ>
<4>[ 284.776493] RIP: 0010:cpuidle_enter_state+0x136/0x370
<4>[ 284.776498] RSP: 0018:ffffc900000afe80 EFLAGS: 00000216 ORIG_RAX: ffffffffffffff10
<4>[ 284.776507] RAX: ffff8802652f50c0 RBX: 0000000000020f9c RCX: 0000000000000001
<4>[ 284.776514] RDX: 0000000000000000 RSI: ffffffff81d0d974 RDI: ffffffff81cc18d6
<4>[ 284.776521] RBP: ffffc900000afeb8 R08: 000000000000071d R09: 0000000000000018
<4>[ 284.776528] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
<4>[ 284.776536] R13: 0000000000000002 R14: ffff88026ed25430 R15: 000000424df9c186
<4>[ 284.776546] cpuidle_enter+0x17/0x20
<4>[ 284.776551] call_cpuidle+0x23/0x40
<4>[ 284.776557] do_idle+0x192/0x1e0
<4>[ 284.776563] cpu_startup_entry+0x1d/0x20
<4>[ 284.776568] start_secondary+0x11c/0x140
<4>[ 284.776574] secondary_startup_64+0xa5/0xa5
<4>[ 284.776581] Code: 4d c8 05 a0 03 00 00 48 03 81 a8 0b 00 00 44 8b 28 44 89 eb 41 c1 ed 08 41 83 e5 07 83 e3 07 45 89 af 84 03 00 00 e9 af fd ff ff <0f> 0b 0f 0b 41 80 bf 68 03 00 00 00 4c 8b 65 c8 74 1e 41 8b b7
<1>[ 284.776664] RIP: intel_lrc_irq_handler+0x304/0x8a0 [i915] RSP: ffff88026ed03ec0
<4>[ 284.776673] ---[ end trace 75789ff91a01554d ]---
CI_DRM_3267 fi-skl-6700hq

<14>[ 384.567935] [IGT] kms_flip: exiting, ret=0
<4>[ 384.570958] ------------[ cut here ]------------
<2>[ 384.570961] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:782!

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3267/fi-skl-6700hq/igt@kms_flip@basic-flip-vs-wf_vblank.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3268/shard-glkb4/igt@kms_frontbuffer_tracking@psr-1p-offscren-pri-indfb-draw-pwrite.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3268/shard-glkb5/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-indfb-msflip-blt.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3266/shard-glkb4/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-pri-indfb-draw-blt.html
BAT-machine fi-skl-6700hq hit this while running: igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence: kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:782! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3269/fi-skl-6700hq/igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence.html
(In reply to Marta Löfstedt from comment #27)
> BAT-machine fi-skl-6700hq hit this while running:
> igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence:
>
> kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:782!
>
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3269/fi-skl-6700hq/
> igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence.html

That's not connected to this bug. This bug is very specifically about the hw reporting COMPLETED | PREEMPTED, and not about the interrupt being received after we have finished parsing the CSB after parking the engines.
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3271/fi-cnl-y/igt@gem_exec_flush@basic-batch-kernel-default-uc.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3273/shard-glkb4/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-pri-indfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3271/shard-glkb1/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-indfb-pgflip-blt.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3271/shard-glkb1/dmesg-1508538544_Oops_1.log
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3273/shard-glkb3/dmesg-1508602166_Oops_1.log https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3273/shard-glkb3/igt@kms_flip@blocking-wf_vblank.html
The "kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:782!" ones are now handled in bug 103410.
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3274/shard-glkb4/igt@kms_flip@flip-vs-modeset-interruptible.html <2>[ 322.306569] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:879!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3274/shard-glkb3/igt@kms_busy@extended-modeset-hang-oldfb-render-C.html <2>[ 3379.405906] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:879!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3275/shard-glkb2/igt@kms_atomic@plane_cursor_legacy.html <2>[ 41.938315] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:879!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3275/shard-apl2/igt@kms_frontbuffer_tracking@fbc-1p-primscrn-cur-indfb-draw-mmap-cpu.html <2>[ 1325.066224] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:879!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3275/shard-glkb2/igt@gem_exec_create@forked.html <2>[ 41.938315] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:879!
Talking point:

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 6b10a01dd371..f5be85a01f83 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -905,6 +905,8 @@ static void intel_lrc_irq_handler(unsigned long data)
 			/* After the final element, the hw should be idle */
 			GEM_BUG_ON(port_count(port) == 0 &&
 				   !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
+			if (status & GEN8_CTX_STATUS_ACTIVE_IDLE)
+				mdelay(2);
 		}
 
 		if (head != execlists->csb_head) {

prevents the COMPLETED | PREEMPT occurring after the COMPLETED | ACTIVE_IDLE. (Or just significantly increased the mtbf to mask the issue.)
Raised priority
More complete would be

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index eeb3622803a8..1df66aeca3ba 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -599,7 +599,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		 * the driver is unable to keep up the supply of new
 		 * work).
 		 */
-		if (port_count(&port[1]))
+		if (port_count(&port[0]))
 			goto unlock;
 
 		/* WaIdleLiteRestore:bdw,skl

Needs a w/a assigned.
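To spell out what that one-character change does, here is a toy model of the two gating conditions. It assumes, as I read the diff, that a non-zero port_count() means the port still has a submission in flight; struct toy_ports and both helpers are hypothetical and only illustrate the difference.

```c
#include <stdbool.h>

/* Hypothetical toy model: outstanding submission count for each ELSP port. */
struct toy_ports {
	unsigned int count[2];
};

/*
 * Original check: bail out only when the second port is occupied, so new
 * work may still be submitted while port 0 is active.
 */
static bool may_dequeue_original(const struct toy_ports *p)
{
	return p->count[1] == 0;
}

/*
 * Proposed check: bail out whenever the first port is occupied, so new
 * work is only submitted once nothing is in flight.
 */
static bool may_dequeue_proposed(const struct toy_ports *p)
{
	return p->count[0] == 0;
}
```

In this model the two checks only disagree when port 0 is busy and port 1 is empty, which is the case where a resubmission on top of a running context would otherwise be attempted.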
Reminds me of

commit 70962fbe5c75e785d250c04db4d01c18b7316c13
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jan 23 13:05:56 2017 +0000

    drm/i915: Remove disable_lite_restore_wa

    This w/a (WaEnableForceRestoreInCtxtDescForVCS) was only used for
    preproduction hw, which is no longer in use. Remove the workaround to
    simplify the code.

but that is only mentioned for VCS, whereas we see the PREEMPT | COMPLETED on, at least, bcs.
<2>[ 366.551741] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:888! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3279/shard-glkb5/igt@kms_flip_event_leak.html
CI_DRM_3283 fi-bxt-j4205 igt@kms_busy@basic-flip-c <2>[ 364.185707] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:888! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3283/fi-bxt-j4205/igt@kms_busy@basic-flip-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3285/shard-glkb1/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-move.html <2>[ 1515.047901] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:888!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3283/shard-apl6/igt@kms_color@ctm-0-5-pipe1.html <2>[ 77.455586] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:888!
CI_DRM_3288 shard-glkb1 igt@kms_flip@flip-vs-panning-vs-hang-interruptible <2>[ 261.125313] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3288/shard-glkb1/dmesg-1509061688_Panic_2.log https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3288/shard-glkb1/igt@kms_flip@flip-vs-panning-vs-hang-interruptible.html Note new line again! GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
CI_DRM_3288 shard-glkb1 igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-pwrite <2>[ 904.259553] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3288/shard-glkb1/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-pwrite.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3292/shard-glkb2/igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-pri-shrfb-draw-mmap-cpu.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3292/shard-glkb3/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-mmap-cpu.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3293/shard-glkb1/igt@kms_cursor_legacy@short-flip-after-cursor-atomic-transitions.html <2>[ 4502.680683] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
<2>[ 2776.149999] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3300/shard-glkb4/igt@kms_flip@flip-vs-panning-interruptible.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3300/shard-glkb2/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-onoff.html <2>[ 1043.960206] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
CI_DRM_3304 fi-glk-1 igt@pm_backlight@basic-brightness <2>[ 523.218649] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3304/fi-glk-1/igt@pm_backlight@basic-brightness.html
CI_DRM_3302 shard-apl8 igt@kms_rotation_crc@bad-tiling <2>[ 703.522499] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3302/shard-apl8/igt@kms_rotation_crc@bad-tiling.html
CI_DRM_3301 shard-glkb4 igt@kms_busy@extended-modeset-hang-newfb-render-C <2>[ 3461.438418] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891! https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3301/shard-glkb4/igt@kms_busy@extended-modeset-hang-newfb-render-C.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3307/shard-glkb3/igt@kms_flip@wf_vblank-interruptible.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3307/shard-glkb3/dmesg-1509580929_Panic_2.log kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3311/shard-apl4/igt@kms_busy@extended-modeset-hang-newfb-render-B.html <2>[ 75.821455] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3313/shard-glkb2/igt@gem_exec_schedule@smoketest-blt.html <2>[ 1099.391030] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3317/shard-glkb3/igt@kms_flip@flip-vs-rmfb-interruptible.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3317/shard-glkb3/dmesg-1509998109_Panic_2.log kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/shard-glkb2/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-pri-shrfb-draw-mmap-gtt.html <2>[ 1839.277970] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/shard-glkb1/igt@kms_cursor_legacy@cursora-vs-flipa-atomic-transitions.html <2>[ 90.044653] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3319/shard-glkb4/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-indfb-msflip-blt.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3319/shard-glkb4/dmesg-1510073961_Panic_2.log <2>[ 3148.546706] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3319/shard-glkb2/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-cur-indfb-draw-pwrite.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3319/shard-glkb2/dmesg-1510073893_Oops_1.log <2>[ 704.851043] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
Also, https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_479/shard-apl4/igt@kms_flip_event_leak.html
Reference from Chris: https://patchwork.freedesktop.org/series/33965/
another reference: https://patchwork.freedesktop.org/series/34081/
commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date: Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write

    The hardware needs some time to process the information received in the
    ExecList Submission Port, and expects us to not write anything more until
    it has 'acknowledged' this new submission by sending an IDLE_ACTIVE or
    PREEMPTED CSB event.

    If we do not follow this, the driver could write new data into the ELSP
    before HW had finishing fetching the previous one, putting us in
    'undefined behaviour' space.

    This seems to be the problem causing the spurious PREEMPTED & COMPLETE
    events after a COMPLETE like the one below:

    [] vcs0: sw rd pointer = 2, hw wr pointer = 0, current 'head' = 3.
    [] vcs0: Execlist CSB[0]: 0x00000018 _ 0x00000007
    [] vcs0: Execlist CSB[1]: 0x00000001 _ 0x00000000
    [] vcs0: Execlist CSB[2]: 0x00000018 _ 0x00000007 <<< COMPLETE
    [] vcs0: Execlist CSB[3]: 0x00000012 _ 0x00000007 <<< PREEMPTED & COMPLETE
    [] vcs0: Execlist CSB[4]: 0x00008002 _ 0x00000006
    [] vcs0: Execlist CSB[5]: 0x00000014 _ 0x00000006

    The ELSP writes that lead to this CSB sequence show that the HW hadn't
    started executing the previous execlist (the one with only ctx 0x6) by the
    time the new one was submitted; this is a bit more clear in the data
    show in the EXECLIST_STATUS register at the time of the ELSP write.

    [] vcs0: ELSP[0] = 0x0_0 [execlist1] - status_reg = 0x0_302
    [] vcs0: ELSP[1] = 0x6_fedb2119 [execlist0] - status_reg = 0x0_8302
    [] vcs0: ELSP[2] = 0x7_fedaf119 [execlist1] - status_reg = 0x0_8308
    [] vcs0: ELSP[3] = 0x6_fedb2119 [execlist0] - status_reg = 0x7_8308

    Note that having to wait for this ack does not disable lite-restores,
    although it may reduce their numbers.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102035
    Signed-off-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/<20171118003038.7935-1-michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120123458.23242-4-chris@chris-wilson.co.uk
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
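As an illustration of the rule in that commit message (no new ELSP write until the hardware has acknowledged the previous one with an IDLE_ACTIVE or PREEMPTED event), here is a minimal toy model. It is a sketch only: struct toy_engine, toy_submit() and toy_process_csb() are hypothetical names, the bit values are assumed from GEN8_CTX_STATUS_* in intel_lrc.c, and none of this is the code from the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed CSB status bits that count as an acknowledgement of an ELSP write */
#define CTX_STATUS_IDLE_ACTIVE (1u << 0)
#define CTX_STATUS_PREEMPTED   (1u << 1)

/* Hypothetical toy model of one engine; not the driver's real structures. */
struct toy_engine {
	bool elsp_ack_pending;    /* ELSP written, no ack seen from HW yet */
	unsigned int elsp_writes; /* number of submissions actually made */
};

/* Try to submit; refuse while the previous write is still unacknowledged. */
static bool toy_submit(struct toy_engine *e)
{
	if (e->elsp_ack_pending)
		return false;

	e->elsp_writes++;
	e->elsp_ack_pending = true;
	return true;
}

/* Process one CSB status word; only IDLE_ACTIVE or PREEMPTED clears the flag. */
static void toy_process_csb(struct toy_engine *e, uint32_t status)
{
	if (status & (CTX_STATUS_IDLE_ACTIVE | CTX_STATUS_PREEMPTED))
		e->elsp_ack_pending = false;
}
```

In this model a second toy_submit() straight after the first is refused, a COMPLETE | ACTIVE_IDLE event (0x18) does not unblock it, and an IDLE_ACTIVE (0x1) or lite-restore PREEMPTED (0x8002) event does.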
The patch was integrated to CI_DRM_3364. Let's close and archive this issue!
*** Bug 102393 has been marked as a duplicate of this bug. ***