Bug 108315

Summary: [CI][BAT] icl incomplete - GEM_BUG_ON(!i915_request_completed(rq)) / GEM_BUG_ON(buf[2 * head + 1] != port->context_id) HSDES#:1806554093
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Mika Kuoppala <mika.kuoppala>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: high CC: imre.deak, intel-gfx-bugs
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: ICL i915 features: GEM/Other

Description Martin Peres 2018-10-10 13:27:19 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4960/fi-icl-u2/igt@drv_selftest@live_hangcheck.html

<3> [584.387917] process_csb:977 GEM_BUG_ON(!i915_request_completed(rq))
<4> [584.388022] ------------[ cut here ]------------
<2> [584.388024] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:977!
<4> [584.388031] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [584.388035] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G     U  W         4.19.0-rc7-CI-CI_DRM_4960+ #1
<4> [584.388037] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.2352.A01.1808281852 08/28/2018
<4> [584.388096] RIP: 0010:process_csb+0x24b/0x790 [i915]
<4> [584.388100] Code: 74 69 e1 e0 48 8b 35 64 db 19 00 49 c7 c0 8e ab 3c a0 b9 d1 03 00 00 48 c7 c2 f0 0b 3b a0 48 c7 c7 c6 36 2e a0 e8 c5 f8 e7 e0 <0f> 0b 49 8b 96 a8 00 00 00 48 8b 92 c0 02 00 00 8b 92 c0 00 00 00
<4> [584.388103] RSP: 0018:ffff8804b0703e20 EFLAGS: 00010082
<4> [584.388108] RAX: 000000000000000b RBX: ffff8804805a0008 RCX: 0000000000000000
<4> [584.388111] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff8804aea40aa8
<4> [584.388114] RBP: ffff8804b0703e88 R08: 00000000006ae0d9 R09: ffff8804aeb98000
<4> [584.388117] R10: 0000000000000000 R11: ffff8804aea40aa8 R12: ffff8804a51b104c
<4> [584.388120] R13: 0000000000000001 R14: ffff8804229d70c0 R15: ffff8804a51b1040
<4> [584.388124] FS:  0000000000000000(0000) GS:ffff8804b0700000(0000) knlGS:0000000000000000
<4> [584.388127] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [584.388130] CR2: 0000561022285a30 CR3: 0000000005210002 CR4: 0000000000760ee0
<4> [584.388133] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [584.388136] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [584.388139] PKRU: 55555554
<4> [584.388141] Call Trace:
<4> [584.388144]  <IRQ>
<4> [584.388198]  __execlists_submission_tasklet+0x2c/0xc20 [i915]
<4> [584.388253]  execlists_submission_tasklet+0x46/0x60 [i915]
<4> [584.388260]  tasklet_action_common.isra.5+0x47/0xb0
<4> [584.388264]  __do_softirq+0xd8/0x483
<4> [584.388268]  ? _raw_spin_unlock+0x29/0x40
<4> [584.388271]  irq_exit+0xa9/0xc0
<4> [584.388274]  do_IRQ+0x9a/0x120
<4> [584.388278]  common_interrupt+0xf/0xf
<4> [584.388280]  </IRQ>
<4> [584.388284] RIP: 0010:cpuidle_enter_state+0xab/0x340
<4> [584.388287] Code: 44 00 00 31 ff e8 45 88 94 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 70 02 00 00 31 ff e8 1e 2b 9b ff e8 c9 fc 9e ff fb 4c 29 fb <48> ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
<4> [584.388290] RSP: 0018:ffffc9000010fe90 EFLAGS: 00000216 ORIG_RAX: ffffffffffffffda
<4> [584.388295] RAX: ffff8804aeb0c040 RBX: 000000000005db6a RCX: 0000000000000000
<4> [584.388297] RDX: 0000000000000046 RSI: ffffffff82124e2a RDI: ffffffff820d386f
<4> [584.388300] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
<4> [584.388302] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8804ab720008
<4> [584.388303] R13: ffffffff8229bc18 R14: 0000000000000000 R15: 000000881035ced7
Comment 1 Chris Wilson 2018-10-10 13:33:14 UTC
Note the bogons started in live_contexts. The gpu never seemed to recover from its lost confidence. My gut says this is a hw peculiarity.
Comment 2 Francesco Balestrieri 2018-10-11 07:57:21 UTC
Please make sure the bug has a i915 feature assigned before setting to ReadyForDev. Assigned to GEM/Other for lack of better understanding.
Comment 3 Lakshmi 2018-10-18 06:52:34 UTC
Chris, Any update here? Should I report to HW team?
Comment 4 Chris Wilson 2018-10-18 06:58:02 UTC
Still waiting for feedback on the live_context improvement.
Comment 5 Chris Wilson 2018-10-24 14:14:52 UTC
The problem with icl not handling the read-only flag is incidental to this death.
Comment 6 Chris Wilson 2018-10-24 14:15:01 UTC
*** Bug 108535 has been marked as a duplicate of this bug. ***
Comment 7 Chris Wilson 2018-10-24 14:17:46 UTC
Merging these 2 bugs as both stem from the HW behaving unusually following the execlists submission. In the latter case (bug 108535) we have a clean sequence where the CS event following the submission did not happen (the context being resubmitted was close to completion). Similar bad CS events would explain a lot of related bugs...

So start by looking for common patterns.
Comment 8 Chris Wilson 2018-10-26 12:49:46 UTC
*** Bug 108564 has been marked as a duplicate of this bug. ***
Comment 9 Chris Wilson 2018-10-26 20:08:40 UTC
(In reply to Chris Wilson from comment #1)
> Note the bogons started in live_contexts. The gpu never seemed to recover
> from its lost confidence. My gut says this is a hw peculiarity.

See bug 108569 for the GPU hangs icl encounters with read-only pages.
Comment 10 Lakshmi 2018-10-29 15:45:14 UTC
Update: Last seen this issue IGT_4682 (1 week, 5 days / 147 runs ago).
Comment 11 Francesco Balestrieri 2018-10-30 09:52:32 UTC
Investigating a possible reason with https://patchwork.freedesktop.org/series/51703/
Comment 12 Chris Wilson 2018-10-30 11:25:56 UTC
@@ -910,6 +910,9 @@  static void process_csb(struct intel_engine_cs *engine)
 			  execlists->active);
 
 		status = buf[2 * head];
+		GEM_BUG_ON(!status);
+		GEM_DEBUG_EXEC(WRITE_ONCE(*(u32 *)(buf + 2 * head), 0));

->

<0> [48.247809]   <idle>-0       2..s1 48257761us : execlists_submission_tasklet: rcs0 awake?=1, active=5
<0> [48.247866]   <idle>-0       2d.s2 48257762us : process_csb: rcs0 cs-irq head=1, tail=2
<0> [48.247922]   <idle>-0       2d.s2 48257763us : process_csb: rcs0 csb[2]: status=0x10000014:0x000001c0, active=0x5
<0> [48.247982]   <idle>-0       2d.s2 48257764us : process_csb: rcs0 out[0]: ctx=448.1, global=60303 (fence 3cda:943) (current 60304), prio=0
<0> [48.248043]   <idle>-0       2d.s2 48257765us : process_csb: rcs0 completed ctx=448
<0> [48.248099]   <idle>-0       2d.s2 48257765us : __i915_request_submit: rcs0 fence 3cdc:943 -> global=60305, current 60304
<0> [48.248159]   <idle>-0       2d.s2 48257768us : __execlists_submission_tasklet: rcs0 in[1]:  ctx=512.1, global=60305 (fence 3cdc:943) (current 60304), prio=0
<0> [48.248223]   <idle>-0       2d.s2 48257769us : __execlists_submission_tasklet: rcs0 in[0]:  ctx=480.2, global=60304 (fence 3cdb:943) (current 60304), prio=0
<0> [48.248287] gem_ctx_-1086    6.... 48257775us : i915_request_add: rcs0 fence 3cf4:1080
<0> [48.248344]   <idle>-0       2..s1 48257781us : execlists_submission_tasklet: rcs0 awake?=1, active=1
<0> [48.248402]   <idle>-0       2d.s2 48257800us : process_csb: rcs0 cs-irq head=2, tail=4
<0> [48.248459]   <idle>-0       2d.s2 48257801us : process_csb: rcs0 csb[3]: status=0x10000018:0x000001e0, active=0x1
<0> [48.248518]   <idle>-0       2d.s2 48257802us : process_csb: rcs0 out[0]: ctx=480.2, global=60304 (fence 3cdb:943) (current 60304), prio=0
<0> [48.248580]   <idle>-0       2d.s2 48257803us : process_csb: rcs0 csb[4]: status=0x10000001:0x00000000, active=0x1
<0> [48.248640] gem_ctx_-1086    6.... 48257810us : i915_request_retire: rcs0 fence 3cf5:942, global=60266, current 60304
<0> [48.248700] gem_ctx_-1086    6.... 48257810us : i915_request_retire: __retire_engine_request(rcs0) fence 3cf5:942, global=60266, current 60304
<0> [48.248762]   <idle>-0       2..s1 48257817us : execlists_submission_tasklet: rcs0 awake?=1, active=5
<0> [48.248819]   <idle>-0       2d.s2 48257818us : process_csb: rcs0 cs-irq head=4, tail=4
<0> [48.248875]   <idle>-0       2..s1 48257839us : execlists_submission_tasklet: rcs0 awake?=1, active=5
<0> [48.248933]   <idle>-0       2d.s2 48257840us : process_csb: rcs0 cs-irq head=4, tail=5
<0> [48.248989]   <idle>-0       2d.s2 48257841us : process_csb: rcs0 csb[5]: status=0x10000014:0x000001e0, active=0x5
<0> [48.249049] gem_ctx_-1086    6.... 48257842us : i915_request_add: rcs0 fence 3cf5:1080
<0> [48.249106]   <idle>-0       2d.s2 48257842us : process_csb: rcs0 out[0]: ctx=480.1, global=60304 (fence 3cdb:943) (current 60305), prio=0
<0> [48.249167]   <idle>-0       2d.s2 48257843us : process_csb: rcs0 completed ctx=480
<0> [48.249223]   <idle>-0       2d.s2 48257844us : __i915_request_submit: rcs0 fence 3cdd:943 -> global=60306, current 60305
<0> [48.249284]   <idle>-0       2d.s2 48257847us : __execlists_submission_tasklet: rcs0 in[1]:  ctx=544.1, global=60306 (fence 3cdd:943) (current 60305), prio=0
<0> [48.249348]   <idle>-0       2d.s2 48257848us : __execlists_submission_tasklet: rcs0 in[0]:  ctx=512.2, global=60305 (fence 3cdc:943) (current 60305), prio=0
<0> [48.249411]   <idle>-0       2..s1 48257856us : execlists_submission_tasklet: rcs0 awake?=1, active=1
<0> [48.249468]   <idle>-0       2d.s2 48257857us : process_csb: rcs0 cs-irq head=5, tail=1
<0> [48.249528]   <idle>-0       2d.s2 48257858us : process_csb: rcs0 csb[0]: status=0x10000018:0x00000200, active=0x1
<0> [48.249587]   <idle>-0       2d.s2 48257859us : process_csb: rcs0 out[0]: ctx=512.2, global=60305 (fence 3cdc:943) (current 60305), prio=0
<0> [48.249648]   <idle>-0       2d.s2 48257860us : process_csb: rcs0 csb[1]: status=0x10000001:0x00000000, active=0x1
<0> [48.249708] gem_ctx_-1086    6.... 48257863us : i915_request_retire: rcs0 fence 3cf6:942, global=60267, current 60305
<0> [48.249768] gem_ctx_-1086    6.... 48257881us : i915_request_retire: __retire_engine_request(rcs0) fence 3cf6:942, global=60267, current 60305
<0> [48.249831]   <idle>-0       2..s1 48257894us : execlists_submission_tasklet: rcs0 awake?=1, active=5
<0> [48.249889]   <idle>-0       2d.s2 48257896us : process_csb: rcs0 cs-irq head=1, tail=2
<0> [48.249946]   <idle>-0       2d.s2 48257897us : process_csb: rcs0 csb[2]: status=0x00000000:0x000001c0, active=0x5
<0> [48.250005] gem_ctx_-1086    6.... 48257930us : i915_request_add: rcs0 fence 3cf6:1080
<0> [48.250061]   <idle>-0       2d.s2 48258061us : process_csb: process_csb:913 GEM_BUG_ON(!status)
Comment 13 Martin Peres 2018-10-30 14:35:39 UTC
Also seen on live_context in pre-merge: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2018/fi-icl-u/igt@drv_selftest@live_contexts.html

<5> [534.212171] i915 0000:00:02.0: Resetting bcs0 for hang on bcs0
<3> [534.213048] [drm:gen8_reset_engines [i915]] *ERROR* bcs0: reset request timeout
<7> [534.213404] [drm:i915_reset_engine [i915]] Failed to reset bcs0, ret=-110
<7> [534.214786] [drm:i915_reset_device [i915]] resetting chip
<5> [534.216220] i915 0000:00:02.0: Resetting chip for hang on bcs0
<3> [534.216386] process_csb:977 GEM_BUG_ON(!i915_request_completed(rq))
<4> [534.216719] ------------[ cut here ]------------
<2> [534.216725] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:977!
<4> [534.216745] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [534.216755] CPU: 7 PID: 4747 Comm: drv_selftest Tainted: G     U  W         4.19.0-CI-CI_DRM_5054+ #1
<4> [534.216761] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018
<4> [534.216906] RIP: 0010:process_csb+0x24b/0x790 [i915]
<4> [534.216914] Code: 74 53 90 e0 48 8b 35 04 07 1a 00 49 c7 c0 c0 de 8d a0 b9 d1 03 00 00 48 c7 c2 d0 3d 8c a0 48 c7 c7 e6 4c 7f a0 e8 75 e3 96 e0 <0f> 0b 49 8b 96 a8 00 00 00 48 8b 92 c0 02 00 00 8b 92 c0 00 00 00
<4> [534.216921] RSP: 0018:ffffc9000080f760 EFLAGS: 00010086
<4> [534.216930] RAX: 000000000000000b RBX: ffff8804a4c58008 RCX: 0000000000000000
<4> [534.216935] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8804ae252538
<4> [534.216941] RBP: ffffc9000080f7c8 R08: 0000000000077fb6 R09: ffff8804ad910000
<4> [534.216947] R10: 0000000000000000 R11: ffff8804ae252538 R12: ffff88049681f04c
<4> [534.216952] R13: 0000000000000001 R14: ffff8804214d3640 R15: ffff88049681f040
<4> [534.216959] FS:  00007f9ae929f980(0000) GS:ffff8804affc0000(0000) knlGS:0000000000000000
<4> [534.216965] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [534.216971] CR2: 0000560b357eb7b0 CR3: 0000000470868002 CR4: 0000000000760ee0
<4> [534.216976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [534.216981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [534.216986] PKRU: 55555554
<4> [534.216991] Call Trace:
<4> [534.217137]  execlists_reset_prepare+0x54/0x150 [i915]
<4> [534.217268]  i915_gem_reset_prepare_engine+0x20/0x40 [i915]
<4> [534.217381]  i915_gem_reset_prepare+0x2c/0x70 [i915]
<4> [534.217504]  ? i915_request_wait+0x16d/0x840 [i915]
<4> [534.217612]  i915_reset+0x117/0x280 [i915]
<4> [534.217731]  __i915_wait_request_check_and_reset.isra.8+0x44/0x50 [i915]
<4> [534.217843]  i915_request_wait+0x4fd/0x840 [i915]
<4> [534.217857]  ? lock_acquire+0xa6/0x1c0
<4> [534.217867]  ? wake_up_q+0x70/0x70
<4> [534.217875]  ? wake_up_q+0x70/0x70
<4> [534.217992]  i915_gem_object_wait_fence+0x8a/0x110 [i915]
<4> [534.218105]  i915_gem_object_wait+0x2dc/0x510 [i915]
<4> [534.218217]  i915_gem_object_set_to_cpu_domain+0x35/0x130 [i915]
<4> [534.218328]  igt_vm_isolation+0xb1f/0xfe0 [i915]
<4> [534.218481]  __i915_subtests+0x5e/0xf0 [i915]
<4> [534.218625]  __run_selftests+0x10b/0x190 [i915]
<4> [534.218789]  i915_live_selftests+0x2c/0x60 [i915]
<4> [534.218948]  i915_pci_probe+0x50/0xa0 [i915]
<4> [534.218967]  pci_device_probe+0xa1/0x130
<4> [534.218983]  really_probe+0x25d/0x3c0
<4> [534.218996]  driver_probe_device+0x10a/0x120
<4> [534.219008]  __driver_attach+0xdb/0x100
<4> [534.219020]  ? driver_probe_device+0x120/0x120
<4> [534.219030]  bus_for_each_dev+0x74/0xc0
<4> [534.219043]  bus_add_driver+0x15f/0x250
<4> [534.219054]  ? 0xffffffffa09fe000
<4> [534.219065]  driver_register+0x56/0xe0
<4> [534.219075]  ? 0xffffffffa09fe000
<4> [534.219086]  do_one_initcall+0x58/0x2e0
<4> [534.219098]  ? rcu_lockdep_current_cpu_online+0x8f/0xd0
<4> [534.219108]  ? do_init_module+0x1d/0x1ea
<4> [534.219120]  ? rcu_read_lock_sched_held+0x6f/0x80
<4> [534.219132]  ? kmem_cache_alloc_trace+0x264/0x290
<4> [534.219144]  do_init_module+0x56/0x1ea
<4> [534.219158]  load_module+0x26f5/0x29d0
<4> [534.219177]  ? vfs_read+0x122/0x140
<4> [534.219201]  ? __se_sys_finit_module+0xd3/0xf0
<4> [534.219212]  __se_sys_finit_module+0xd3/0xf0
<4> [534.219232]  do_syscall_64+0x55/0x190
<4> [534.219246]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [534.219255] RIP: 0033:0x7f9ae8b68839
<4> [534.219265] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
<4> [534.219273] RSP: 002b:00007ffc0c2b8ad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4> [534.219286] RAX: ffffffffffffffda RBX: 000055800bb9bd40 RCX: 00007f9ae8b68839
<4> [534.219293] RDX: 0000000000000000 RSI: 000055800bb9cb30 RDI: 0000000000000006
<4> [534.219301] RBP: 000055800bb9cb30 R08: 0000000000000004 R09: 0000000000000000
<4> [534.219309] R10: 00007ffc0c2b8c50 R11: 0000000000000246 R12: 0000000000000000
<4> [534.219316] R13: 000055800bb96410 R14: 0000000000000020 R15: 000000000000003c
<4> [534.219334] Modules linked in: i915(+) amdgpu chash gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ax88179_178a usbnet mii x86_pkg_temp_thermal snd_hda_codec coretemp crct10dif_pclmul crc32_pclmul snd_hwdep snd_hda_core ghash_clmulni_intel snd_pcm e1000e prime_numbers [last unloaded: i915]
Comment 15 Lakshmi 2018-11-05 16:27:00 UTC
Dropping the priority as this bug fix doesn't depend on the drm/i915 Intel GFX Driver.
Comment 16 Francesco Balestrieri 2018-11-06 08:07:01 UTC
Somehow this hasn't been reported in BAT for 3 days. Was a workaround merged, or some other fix?
Comment 17 Francesco Balestrieri 2018-11-06 08:15:53 UTC
Or more simply, it looks like the last BAT run was on Nov 2nd...
Comment 18 Chris Wilson 2018-11-06 11:09:25 UTC
*** Bug 108679 has been marked as a duplicate of this bug. ***
Comment 19 Jani Saarinen 2018-11-06 12:25:32 UTC
We need to get systems back to grid, I suppose.
Comment 20 Martin Peres 2018-11-07 15:30:55 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4712/fi-icl-u/igt@drv_selftest@live_execlists.html

<0>[  549.241934] drv_self-4908    7d..1 549274550us : process_csb: process_csb:977 GEM_BUG_ON(!i915_request_completed(rq))
Comment 21 Chris Wilson 2018-11-08 12:24:52 UTC
Further postulation:

commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 8 08:17:38 2018 +0000

    drm/i915/execlists: Force write serialisation into context image vs execution
    
    Ensure that the writes into the context image are completed prior to the
    register mmio to trigger execution. Although previously we were assured
    by the SDM that all writes are flushed before an uncached memory
    transaction (our mmio write to submit the context to HW for execution),
    we have empirical evidence to believe that this is not actually the
    case.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181108081740.25615-1-chris@chris-wilson.co.uk
    Cc: stable@vger.kernel.org
Comment 22 Chris Wilson 2018-11-08 17:03:02 UTC
*** Bug 108697 has been marked as a duplicate of this bug. ***
Comment 23 Martin Peres 2018-11-09 13:18:42 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10758/fi-icl-u/igt@gem_exec_nop@basic-parallel.html

<6> [106.068733] [IGT] gem_exec_nop: starting subtest basic-parallel
<3> [106.599535] execlists_submit_ports:434 GEM_BUG_ON(count > !n)
<4> [106.599652] ------------[ cut here ]------------
<2> [106.599655] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:434!
<4> [106.599681] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [106.599690] CPU: 2 PID: 1221 Comm: gem_exec_nop Tainted: G     U  W         4.20.0-rc1-CI-Patchwork_10758+ #1
<4> [106.599702] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018
<4> [106.599768] RIP: 0010:__execlists_submission_tasklet+0xa72/0xc20 [i915]
<4> [106.599778] Code: bd b1 ef e0 48 8b 35 75 00 1a 00 49 c7 c0 f3 99 2e a0 b9 b2 01 00 00 48 c7 c2 10 f7 2c a0 48 c7 c7 cd 0f 20 a0 e8 0e 38 f6 e0 <0f> 0b 48 c7 c1 30 7c 30 a0 ba f3 01 00 00 48 c7 c6 50 f7 2c a0 48
<4> [106.599798] RSP: 0018:ffffc9000088f948 EFLAGS: 00010086
<4> [106.599806] RAX: 000000000000000a RBX: ffff8804a5e00480 RCX: 0000000000000000
<4> [106.599815] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8804ae26f4e8
<4> [106.599824] RBP: ffffc9000088f9a8 R08: 00000000002497c4 R09: ffff8804ae3fb000
<4> [106.599833] R10: 0000000000000000 R11: ffff8804ae26f4e8 R12: ffff8803e82b70c0
<4> [106.599842] R13: 0000000000000000 R14: ffff8804a5e00480 R15: ffff8804a5e00008
<4> [106.599851] FS:  00007f0e444d2980(0000) GS:ffff8804afe80000(0000) knlGS:0000000000000000
<4> [106.599861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [106.599869] CR2: 0000557f2fa10920 CR3: 00000004aa93a002 CR4: 0000000000760ee0
<4> [106.599878] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [106.599887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [106.599896] PKRU: 55555554
<4> [106.599901] Call Trace:
<4> [106.599947]  execlists_submit_request+0x181/0x190 [i915]
<4> [106.599991]  submit_notify+0x8d/0x124 [i915]
<4> [106.600023]  __i915_sw_fence_complete+0x81/0x250 [i915]
<4> [106.600065]  i915_request_add+0x36b/0x7e0 [i915]
<4> [106.600103]  i915_gem_do_execbuffer+0x94f/0x1580 [i915]
<4> [106.600114]  ? deactivate_slab.isra.26+0x74b/0x7a0
<4> [106.600126]  ? ___slab_alloc.constprop.34+0x21c/0x380
<4> [106.600134]  ? ___slab_alloc.constprop.34+0x21c/0x380
<4> [106.600168]  ? i915_gem_execbuffer2_ioctl+0xc4/0x3f0 [i915]
<4> [106.600185]  ? lock_acquire+0xa6/0x1c0
<4> [106.600193]  ? __might_fault+0x38/0x90
<4> [106.600228]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [106.600262]  i915_gem_execbuffer2_ioctl+0x21b/0x3f0 [i915]
<4> [106.600296]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [106.600307]  drm_ioctl_kernel+0x81/0xf0
<4> [106.600315]  drm_ioctl+0x2de/0x390
<4> [106.600347]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [106.600358]  ? trace_hardirqs_on_thunk+0x1a/0x1c
<4> [106.600365]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [106.600373]  ? trace_hardirqs_on_thunk+0x1a/0x1c
<4> [106.600382]  do_vfs_ioctl+0xa0/0x6e0
<4> [106.600390]  ? retint_kernel+0x2d/0x2d
<4> [106.600399]  ksys_ioctl+0x35/0x60
<4> [106.600406]  __x64_sys_ioctl+0x11/0x20
<4> [106.600412]  do_syscall_64+0x55/0x190
<4> [106.600420]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [106.600428] RIP: 0033:0x7f0e4375e5d7
<4> [106.600434] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
<4> [106.600455] RSP: 002b:00007ffd0caf68c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4> [106.600465] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0e4375e5d7
<4> [106.600474] RDX: 00007ffd0caf6a00 RSI: 0000000040406469 RDI: 0000000000000005
<4> [106.600483] RBP: 00007ffd0caf6a00 R08: 00007ffd0cb090b0 R09: 00007ffd0cb09080
<4> [106.600492] R10: 000000000002116c R11: 0000000000000246 R12: 0000000040406469
<4> [106.600501] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000
<4> [106.600513] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal ax88179_178a coretemp usbnet mii crct10dif_pclmul snd_hda_intel snd_hda_codec crc32_pclmul snd_hwdep snd_hda_core ghash_clmulni_intel snd_pcm e1000e prime_numbers
Comment 24 Francesco Balestrieri 2018-11-13 10:48:34 UTC
Did these logs reveal anything new?
Comment 25 Lakshmi 2018-11-21 17:50:48 UTC
Francesco, this issue is seen again, this needs to be investigated by appropriate team.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_144/fi-icl-u/igt@gem_exec_schedule@wide-render.html

Dmesg-Warnings	
<2> [271.860952] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:965!

Dmesg	
<6> [266.023718] Console: switching to colour dummy device 80x25
<6> [266.023776] [IGT] gem_exec_schedule: executing
<5> [266.032416] Setting dangerous option reset - tainting kernel
<5> [266.040641] Setting dangerous option reset - tainting kernel
<5> [266.040945] Setting dangerous option reset - tainting kernel
<6> [266.044643] [IGT] gem_exec_schedule: starting subtest wide-render
<7> [266.045729] [drm:vgem_gem_dumb_create [vgem]] Created object of size 1
<7> [266.103938] [drm:vgem_gem_dumb_create [vgem]] Created object of size 1
<3> [271.860814] process_csb:965 GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
<4> [271.860950] ------------[ cut here ]------------
<2> [271.860952] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:965!
<4> [271.860972] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [271.860979] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G     U            4.20.0-rc2-g2ba924da1cb3-drmtip_144+ #1
<4> [271.860989] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018
<4> [271.861040] RIP: 0010:process_csb+0x5c6/0x790 [i915]
<4> [271.861047] Code: d9 19 c7 d6 48 8b 35 21 2c 1a 00 49 c7 c0 00 7d 59 c0 b9 c5 03 00 00 48 c7 c2 b0 f8 55 c0 48 c7 c7 a1 f2 48 c0 e8 7a a0 cd d6 <0f> 0b 48 c7 c1 b8 9a 57 c0 ba c7 03 00 00 48 c7 c6 b0 f8 55 c0 48
<4> [271.861062] RSP: 0018:ffff9b34efe03e20 EFLAGS: 00010082
<4> [271.861069] RAX: 000000000000000e RBX: ffff9b34da5042a8 RCX: 0000000000000000
Comment 26 Francesco Balestrieri 2018-11-21 18:16:45 UTC
Mika is trying to reproduce this on his C0.
Comment 27 Mika Kuoppala 2018-11-22 14:48:31 UTC
It is there with C1:

[20269.450705] [IGT] gem_exec_nop: starting subtest basic-parallel
[20273.153995] process_csb:965 GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
Comment 28 Chris Wilson 2018-12-04 20:24:45 UTC
*** Bug 108801 has been marked as a duplicate of this bug. ***
Comment 29 Martin Peres 2018-12-05 13:14:41 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5262/shard-iclb6/igt@gem_exec_nop@basic-series.html

<3> [1216.645106] execlists_submit_ports:446 GEM_BUG_ON(count > !n)
<4> [1216.645233] ------------[ cut here ]------------
<2> [1216.645235] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:446!
<4> [1216.645257] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [1216.645265] CPU: 5 PID: 3628 Comm: gem_exec_nop Tainted: G     U            4.20.0-rc5-CI-CI_DRM_5262+ #1
<4> [1216.645276] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018
<4> [1216.645331] RIP: 0010:__execlists_submission_tasklet+0xa75/0xc20 [i915]
<4> [1216.645340] Code: 6a fc f1 e0 48 8b 35 ca 9d 1a 00 49 c7 c0 38 c4 2c a0 b9 be 01 00 00 48 c7 c2 10 19 2b a0 48 c7 c7 10 d2 1d a0 e8 db 82 f8 e0 <0f> 0b 48 c7 c1 48 ac 2e a0 ba ff 01 00 00 48 c7 c6 50 19 2b a0 48
<4> [1216.645358] RSP: 0018:ffffc9000066f950 EFLAGS: 00010082
<4> [1216.645365] RAX: 000000000000000a RBX: ffff8884a0ac04c0 RCX: 0000000000000000
<4> [1216.645373] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8884ae26e4f8
<4> [1216.645381] RBP: ffffc9000066f9b0 R08: 000000000023dd30 R09: ffff8884ad893000
<4> [1216.645389] R10: 0000000000000000 R11: ffff8884ae26e4f8 R12: ffff88846bd0c3c0
<4> [1216.645396] R13: 0000000000000000 R14: ffff8884a0ac04c0 R15: ffff8884a0ac0008
<4> [1216.645405] FS:  00007fc3f2769980(0000) GS:ffff8884aff40000(0000) knlGS:0000000000000000
<4> [1216.645414] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [1216.645421] CR2: 000055a9e32b8050 CR3: 000000044dabe003 CR4: 0000000000760ee0
<4> [1216.645428] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [1216.645436] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [1216.645444] PKRU: 55555554
<4> [1216.645448] Call Trace:
<4> [1216.645489]  execlists_submit_request+0x181/0x190 [i915]
<4> [1216.645528]  submit_notify+0x8d/0x124 [i915]
<4> [1216.645557]  __i915_sw_fence_complete+0x81/0x250 [i915]
<4> [1216.645595]  i915_request_add+0x36b/0x7e0 [i915]
<4> [1216.645629]  i915_gem_do_execbuffer+0xe45/0x1820 [i915]
<4> [1216.645642]  ? ___slab_alloc.constprop.34+0x21c/0x380
<4> [1216.645649]  ? ___slab_alloc.constprop.34+0x21c/0x380
<4> [1216.645680]  ? i915_gem_execbuffer2_ioctl+0xc4/0x3f0 [i915]
<4> [1216.645690]  ? lock_acquire+0xa6/0x1c0
<4> [1216.645698]  ? __might_fault+0x38/0x90
<4> [1216.645733]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [1216.645764]  i915_gem_execbuffer2_ioctl+0x21b/0x3f0 [i915]
<4> [1216.645796]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [1216.645806]  drm_ioctl_kernel+0x81/0xf0
<4> [1216.645813]  drm_ioctl+0x2de/0x390
<4> [1216.645843]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [1216.645852]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [1216.645859]  ? trace_hardirqs_on_thunk+0x1a/0x1c
<4> [1216.645868]  do_vfs_ioctl+0xa0/0x6e0
<4> [1216.645875]  ksys_ioctl+0x35/0x60
<4> [1216.645882]  __x64_sys_ioctl+0x11/0x20
<4> [1216.645887]  do_syscall_64+0x55/0x190
<4> [1216.645895]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1216.645902] RIP: 0033:0x7fc3f19f35d7
<4> [1216.645908] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
<4> [1216.645926] RSP: 002b:00007fff23e22438 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4> [1216.645935] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3f19f35d7
<4> [1216.645943] RDX: 00007fff23e22570 RSI: 0000000040406469 RDI: 0000000000000005
<4> [1216.645951] RBP: 00007fff23e22570 R08: 00007fff23e550b0 R09: 00007fff23e55080
<4> [1216.645959] R10: 000000000017d692 R11: 0000000000000246 R12: 0000000040406469
<4> [1216.645967] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000
Comment 30 Mika Kuoppala 2018-12-07 12:42:48 UTC
Optimistically but with a grain of confidence: 
commit d8f505311717cfc2033c44a8b26ba96c66187303
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Wed Dec 5 15:46:12 2018 +0200

    drm/i915/icl: Forcibly evict stale csb entries
    
    Gen11 fails to deliver wrt global observation point on
    tail/entry updates and we sometimes see old entry.
    
    Use clflush to forcibly evict our possibly stale copy
    of the cacheline in hopes that we get fresh one from gpu.
    Obviously there is something amiss in the coherency protocol so
    this can be consired as a workaround until real cause
    is found.
    
    The working hardware will do the evict without our cue anyways,
    so the cost in there should be ameliorated by that fact.
    
    v2: for next pass, s/flush/evict, add reset (Chris)
    
    References: https://bugzilla.freedesktop.org/show_bug.cgi?id=108315
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181205134612.24822-1-mika.kuoppala@linux.intel.com
Comment 31 Francesco Balestrieri 2018-12-11 05:19:04 UTC
CI buglog seems to report another occurrence:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5293/shard-iclb7/igt@runner@aborted.html

<3>[ 2834.229599] process_csb:994 GEM_BUG_ON(!i915_request_completed(rq))
<4>[ 2834.229748] ------------[ cut here ]------------
<2>[ 2834.229750] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:994!
<4>[ 2834.229774] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4>[ 2834.229782] CPU: 2 PID: 13552 Comm: i915_selftest Tainted: G     U            4.20.0-rc5-CI-CI_DRM_5293+ #1
<4>[ 2834.229793] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018
<4>[ 2834.229859] RIP: 0010:process_csb+0x248/0x7a0 [i915]
<4>[ 2834.229868] Code: 17 fd ee e0 48 8b 35 bf ac 1a 00 49 c7 c0 c9 d3 2f a0 b9 e2 03 00 00 48 c7 c2 e0 29 2e a0 48 c7 c7 63 d1 20 a0 e8 88 83 f5 e0 <0f> 0b 49 8b 96 a8 00 00 00 48 8b 92 c0 02 00 00 8b 92 c0 00 00 00
<4>[ 2834.229890] RSP: 0018:ffffc9000037f760 EFLAGS: 00010086
<4>[ 2834.229899] RAX: 000000000000000b RBX: ffff8884954b2158 RCX: 0000000000000000
<4>[ 2834.229910] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8884ae26f4e8
<4>[ 2834.229919] RBP: ffffc9000037f7c8 R08: 00000000006fd9b9 R09: ffff8884ae3fa000
<4>[ 2834.229927] R10: 0000000000000000 R11: ffff8884ae26f4e8 R12: ffff8884a1d1906c
<4>[ 2834.229936] R13: 0000000000000005 R14: ffff88839fd31b40 R15: ffff8884a1d19040
<4>[ 2834.229945] FS:  00007f45b47cb980(0000) GS:ffff8884afe80000(0000) knlGS:0000000000000000
<4>[ 2834.229956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 2834.229963] CR2: 00007fe9badb6100 CR3: 00000003e90ae001 CR4: 0000000000760ee0
<4>[ 2834.229972] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 2834.229981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 2834.229989] PKRU: 55555554
<4>[ 2834.229994] Call Trace:
<4>[ 2834.230045]  execlists_reset_prepare+0x54/0x150 [i915]
<4>[ 2834.230098]  i915_gem_reset_prepare_engine+0x20/0x40 [i915]
<4>[ 2834.230144]  i915_gem_reset_prepare+0x2c/0x70 [i915]
<4>[ 2834.230194]  ? i915_request_wait+0x16d/0x840 [i915]
<4>[ 2834.230241]  i915_reset+0x117/0x280 [i915]
<4>[ 2834.230286]  __i915_wait_request_check_and_reset.isra.8+0x44/0x50 [i915]
<4>[ 2834.230329]  i915_request_wait+0x4fd/0x840 [i915]
<4>[ 2834.230340]  ? lock_acquire+0xa6/0x1c0
<4>[ 2834.230347]  ? wake_up_q+0x70/0x70
<4>[ 2834.230354]  ? wake_up_q+0x70/0x70
<4>[ 2834.230399]  i915_gem_object_wait_fence+0x8a/0x110 [i915]
<4>[ 2834.230440]  i915_gem_object_wait+0x2d0/0x500 [i915]
<4>[ 2834.230482]  i915_gem_object_set_to_cpu_domain+0x35/0x130 [i915]
<4>[ 2834.230528]  igt_vm_isolation+0xb1f/0xfe0 [i915]
<4>[ 2834.230591]  __i915_subtests+0x5e/0xf0 [i915]
<4>[ 2834.230647]  __run_selftests+0x10b/0x190 [i915]
<4>[ 2834.230696]  i915_live_selftests+0x2c/0x60 [i915]
<4>[ 2834.230743]  i915_pci_probe+0x50/0xa0 [i915]
<4>[ 2834.230752]  pci_device_probe+0xa1/0x130
<4>[ 2834.230761]  really_probe+0xf3/0x3e0
<4>[ 2834.230769]  driver_probe_device+0x10a/0x120
<4>[ 2834.230777]  __driver_attach+0xdb/0x100
<4>[ 2834.230784]  ? driver_probe_device+0x120/0x120
<4>[ 2834.230791]  ? driver_probe_device+0x120/0x120
<4>[ 2834.230799]  bus_for_each_dev+0x74/0xc0
<4>[ 2834.230806]  bus_add_driver+0x15f/0x250
<4>[ 2834.230813]  ? 0xffffffffa06d1000
<4>[ 2834.230819]  driver_register+0x56/0xe0
<4>[ 2834.230826]  ? 0xffffffffa06d1000
<4>[ 2834.230833]  do_one_initcall+0x58/0x2e0
<4>[ 2834.230841]  ? do_init_module+0x1d/0x1ea
<4>[ 2834.230849]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[ 2834.230857]  ? kmem_cache_alloc_trace+0x264/0x290
<4>[ 2834.230866]  do_init_module+0x56/0x1ea
<4>[ 2834.230874]  load_module+0x227a/0x29c0
<4>[ 2834.230887]  ? __se_sys_finit_module+0xd3/0xf0
<4>[ 2834.230895]  __se_sys_finit_module+0xd3/0xf0
<4>[ 2834.230908]  do_syscall_64+0x55/0x190
<4>[ 2834.230929]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[ 2834.230939] RIP: 0033:0x7f45b4093839
<4>[ 2834.230946] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
<4>[ 2834.230966] RSP: 002b:00007ffc71a34958 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4>[ 2834.230979] RAX: ffffffffffffffda RBX: 0000555ebf178bb0 RCX: 00007f45b4093839
<4>[ 2834.230987] RDX: 0000000000000000 RSI: 0000555ebf179950 RDI: 0000000000000006
<4>[ 2834.230995] RBP: 0000555ebf179950 R08: 0000000000000004 R09: 0000000000000000
<4>[ 2834.231004] R10: 00007ffc71a34ad0 R11: 0000000000000246 R12: 0000000000000000
<4>[ 2834.231012] R13: 0000555ebf16fc00 R14: 0000000000000020 R15: 000000000000003c
Comment 32 Francesco Balestrieri 2018-12-28 09:11:18 UTC
The logs in my previous comment are about a different issue. Mika confirmed that the fix was successful, closing.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.