Bug 104945

Summary:	[CI] igt@gem_* - incomplete - GEM_BUG_ON(buf[2 * head + 1] != port->context_id) - kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:869!
Product:	DRI	Reporter:	Marta Löfstedt <marta.lofstedt>
Component:	DRM/Intel	Assignee:	Marta Löfstedt <marta.lofstedt>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	medium	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
See Also:	https://bugs.freedesktop.org/show_bug.cgi?id=105341
Whiteboard:	ReadyForDev
i915 platform:	BXT, GLK, KBL	i915 features:	GEM/Other

Description Marta Löfstedt 2018-02-05 07:22:46 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3718/shard-kbl6/igt@gem_exec_capture@capture-render.html
and
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4213/shard-kbl6/igt@gem_exec_capture@capture-render.html

From pstore:
<3>[   69.992810] GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
<4>[   69.992889] ------------[ cut here ]------------
<2>[   69.992892] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:869!
<4>[   69.992917] invalid opcode: 0000 [#1] PREEMPT SMP PTI
...
ftrace
...
<4>[   69.431753] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel crct10dif_pclmul snd_hda_codec crc32_pclmul snd_hwdep snd_hda_core ghash_clmulni_intel e1000e snd_pcm mei_me ptp mei pps_core prime_numbers
<4>[   69.431796] CPU: 0 PID: 1503 Comm: gem_exec_captur Tainted: G     U  W        4.15.0-CI-CI_DRM_3717+ #1
<4>[   69.431806] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017
<4>[   69.431837] RIP: 0010:execlists_submission_tasklet+0x471/0xde0 [i915]
<4>[   69.431844] RSP: 0018:ffff88027ec03ea8 EFLAGS: 00010296
<4>[   69.431851] RAX: 0000000000000031 RBX: 0000000000000018 RCX: 0000000000000102
<4>[   69.431859] RDX: 0000000000000102 RSI: ffffffff8211476f RDI: 00000000ffffffff
<4>[   69.431867] RBP: ffff88027ec03f20 R08: ffff8802695c0940 R09: 00000000f389b231
<4>[   69.431874] R10: 000000000000000c R11: 0000000000000000 R12: ffff880269a57040
<4>[   69.431882] R13: ffff88026a090010 R14: ffff880269a5704c R15: ffff88026a090008
<4>[   69.431890] FS:  00007f2ede1ae8c0(0000) GS:ffff88027ec00000(0000) knlGS:0000000000000000
<4>[   69.431899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   69.431906] CR2: 0000563cac5275b0 CR3: 0000000269f9c002 CR4: 00000000003606f0
<4>[   69.431914] Call Trace:
<4>[   69.431918]  <IRQ>
<4>[   69.431928]  tasklet_hi_action+0x89/0x110
<4>[   69.431935]  __do_softirq+0xc1/0x4aa
<4>[   69.431943]  irq_exit+0xa4/0xb0
<4>[   69.431948]  do_IRQ+0x67/0x120
<4>[   69.431954]  common_interrupt+0x9f/0x9f
<4>[   69.431960]  </IRQ>
<4>[   69.431964] RIP: 0010:_raw_spin_unlock_irqrestore+0x4e/0x60
<4>[   69.431971] RSP: 0018:ffffc900009afb58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd6
<4>[   69.431980] RAX: ffff8802695c0040 RBX: 0000000000000246 RCX: 0000000000000006
<4>[   69.431988] RDX: 00000000000013ba RSI: ffffffff821087d9 RDI: ffffffff820b9aa6
<4>[   69.431996] RBP: ffff88025ff26050 R08: ffff8802695c0940 R09: 00000000f389b231
<4>[   69.432003] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0284d40
<4>[   69.432011] R13: ffff880268eea228 R14: ffff88026ab32558 R15: ffff880268eea228
<4>[   69.432024]  __irq_put_desc_unlock+0x13/0x40
<4>[   69.432030]  enable_irq+0x42/0x60
<4>[   69.432055]  i915_reset+0xcd/0x270 [i915]
<4>[   69.432080]  i915_reset_device+0x1cb/0x230 [i915]
<4>[   69.432106]  ? gen8_gt_irq_ack+0x160/0x160 [i915]
<4>[   69.432115]  ? work_on_cpu_safe+0x50/0x50
<4>[   69.432141]  i915_handle_error+0x2d3/0x430 [i915]
<4>[   69.432152]  ? __might_fault+0x39/0x90
<4>[   69.432158]  ? __might_fault+0x39/0x90
<4>[   69.432186]  i915_wedged_set+0x79/0xc0 [i915]
<4>[   69.432193]  simple_attr_write+0xab/0xc0
<4>[   69.432200]  full_proxy_write+0x4b/0x70
<4>[   69.432207]  __vfs_write+0x1e/0x130
<4>[   69.432213]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[   69.432220]  ? rcu_sync_lockdep_assert+0x25/0x50
<4>[   69.432226]  ? __sb_start_write+0xd9/0x1f0
<4>[   69.432232]  ? __sb_start_write+0xf3/0x1f0
<4>[   69.432238]  vfs_write+0xbd/0x1b0
<4>[   69.432244]  SyS_write+0x40/0xa0
<4>[   69.432251]  entry_SYSCALL_64_fastpath+0x22/0x8f
<4>[   69.432257] RIP: 0033:0x7f2edc994670
<4>[   69.432262] RSP: 002b:00007fff734c5cc8 EFLAGS: 00000246
<4>[   69.432266] Code: fd ff ff 48 c7 c6 20 6c 2a a0 48 c7 c7 cb b0 28 a0 e8 09 ba f2 e0 0f 0b 48 c7 c6 48 6c 2a a0 48 c7 c7 cb b0 28 a0 e8 f4 b9 f2 e0 <0f> 0b 49 8d 9e 60 15 00 00 4c 89 5d b0 48 89 df e8 1a 01 70 e1 
<1>[   69.432359] RIP: execlists_submission_tasklet+0x471/0xde0 [i915] RSP: ffff88027ec03ea8
<4>[   69.432386] ---[ end trace 168600b65ef640ff ]---

Comment 1 Chris Wilson 2018-02-05 08:44:41 UTC

It's the same problem as the bogus tail pointer in live_hangcheck, the original bug in 104262

Comment 2 Marta Löfstedt 2018-02-21 14:06:01 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4277/shard-apl8/igt@gem_eio@in-flight-contexts.html

from pstore:
...
<3>[   69.741148] GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
<4>[   69.741237] ------------[ cut here ]------------
<2>[   69.741241] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:876!
<4>[   69.741281] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<0>[   69.741296] Dumping ftrace buffer:
...
<4>[   69.750150] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep lpc_ich snd_hda_core snd_pcm r8169 mei_me mii mei prime_numbers pinctrl_broxton pinctrl_intel
<4>[   69.750239] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G     U           4.16.0-rc2-CI-CI_DRM_3814+ #1
<4>[   69.750257] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[   69.750323] RIP: 0010:execlists_submission_tasklet+0x477/0xe50 [i915]
<4>[   69.750338] RSP: 0018:ffff88027fc83ea8 EFLAGS: 00010296
<4>[   69.750351] RAX: 0000000000000031 RBX: 0000000000000014 RCX: 0000000000000102
<4>[   69.750366] RDX: 0000000080000102 RSI: ffffffff8211c1af RDI: 00000000ffffffff
<4>[   69.750381] RBP: ffff88027fc83f20 R08: 0000000000000001 R09: 0000000000000000
<4>[   69.750395] R10: 0000000000000005 R11: 0000000000000000 R12: ffff880261d49040
<4>[   69.750410] R13: ffff880270fea160 R14: ffff880261d49064 R15: ffff880270fea158
<4>[   69.750425] FS:  0000000000000000(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
<4>[   69.750442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   69.750454] CR2: 0000558cd3fc6818 CR3: 0000000005210000 CR4: 00000000003406e0
<4>[   69.750469] Call Trace:
<4>[   69.750478]  <IRQ>
<4>[   69.750493]  tasklet_hi_action+0x89/0x110
<4>[   69.750509]  __do_softirq+0xc1/0x4aa
<4>[   69.750524]  irq_exit+0xa4/0xb0
<4>[   69.750533]  do_IRQ+0x67/0x120
<4>[   69.750546]  common_interrupt+0x84/0x84
<4>[   69.750556]  </IRQ>
<4>[   69.750565] RIP: 0010:cpuidle_enter_state+0xaa/0x350
<4>[   69.750576] RSP: 0018:ffffc90000087eb8 EFLAGS: 00000216 ORIG_RAX: ffffffffffffffda
<4>[   69.750594] RAX: ffff880276008040 RBX: 00000000000c9a06 RCX: 0000000000000001
<4>[   69.750608] RDX: 0000000000000000 RSI: ffffffff8210fb59 RDI: ffffffff820c02e7
<4>[   69.750623] RBP: 0000000000000003 R08: 00000000000004d0 R09: 0000000000000002
<4>[   69.750637] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
<4>[   69.750652] R13: ffff880275a489f8 R14: 000000103cd532be R15: ffffffff82297ba0
<4>[   69.750674]  ? cpuidle_enter_state+0xa6/0x350
<4>[   69.750689]  do_idle+0x188/0x1d0
<4>[   69.750700]  cpu_startup_entry+0x14/0x20
<4>[   69.750712]  start_secondary+0x129/0x160
<4>[   69.750725]  secondary_startup_64+0xa5/0xb0
<4>[   69.750741] Code: fd ff ff 48 c7 c6 90 9d 27 a0 48 c7 c7 66 e1 25 a0 e8 73 a7 f5 e0 0f 0b 48 c7 c6 b8 9d 27 a0 48 c7 c7 66 e1 25 a0 e8 5e a7 f5 e0 <0f> 0b 49 8d 9e c0 14 00 00 4c 89 5d b0 48 89 df e8 c4 26 74 e1 
<1>[   69.750913] RIP: execlists_submission_tasklet+0x477/0xe50 [i915] RSP: ffff88027fc83ea8
<4>[   69.750976] ---[ end trace 964f2a4712d380ea ]---
<0>[   70.053877] Kernel panic - not syncing: Fatal exception in interrupt
<0>[   70.053915] Dumping ftrace buffer:
<0>[   70.053928]    (ftrace buffer empty)

Comment 3 Marta Löfstedt 2018-02-22 06:55:18 UTC

Here is another one with working pstore:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3816/shard-apl6/igt@gem_eio@in-flight.html

Comment 4 Marta Löfstedt 2018-02-23 06:57:04 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3826/shard-glkb6/igt@gem_eio@in-flight.html

This is all in pstore. But it is very similar to Comment #1

<4>[  414.627527] RIP: 0010:_raw_spin_unlock_irqrestore+0x4e/0x60
<4>[  414.627538] RSP: 0018:ffffc90000bbfac8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
<4>[  414.627553] RAX: ffff88014446a740 RBX: 0000000000000246 RCX: 0000000000000006
<4>[  414.627566] RDX: 000000000000136f RSI: ffffffff8210fb89 RDI: ffffffff820c0317
<4>[  414.627580] RBP: ffff8801749e9b40 R08: ffff88014446b040 R09: 00000000d7f092b6
<4>[  414.627593] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa025cf4a
<4>[  414.627606] R13: ffff880176e6a298 R14: ffff88016e0a9c08 R15: ffff880176e6a298
<4>[  414.627626]  __irq_put_desc_unlock+0x13/0x40
<4>[  414.627637]  enable_irq+0x42/0x60
<4>[  414.627680]  i915_reset+0xcd/0x270 [i915]
<4>[  414.627724]  i915_reset_device+0x1cb/0x230 [i915]
<4>[  414.627771]  ? __intel_get_crtc_scanline+0x1f0/0x1f0 [i915]
<4>[  414.627787]  ? work_on_cpu_safe+0x50/0x50
<4>[  414.627833]  i915_handle_error+0x2d3/0x430 [i915]
<4>[  414.627851]  ? scnprintf+0x3a/0x70
<4>[  414.627899]  hangcheck_declare_hang+0xc0/0xd0 [i915]
<4>[  414.627949]  ? fwtable_read32+0x21e/0x290 [i915]
<4>[  414.627998]  i915_hangcheck_elapsed+0x232/0x310 [i915]
<4>[  414.628014]  process_one_work+0x215/0x620
<4>[  414.628028]  worker_thread+0x48/0x3a0
<4>[  414.628040]  kthread+0xfb/0x130
<4>[  414.628048]  ? process_one_work+0x620/0x620
<4>[  414.628058]  ? _kthread_create_on_node+0x30/0x30
<4>[  414.628069]  ret_from_fork+0x3a/0x50
<4>[  414.628083] Code: fd ff ff 48 c7 c6 30 f1 27 a0 48 c7 c7 bb 34 26 a0 e8 18 53 f5 e0 0f 0b 48 c7 c6 58 f1 27 a0 48 c7 c7 bb 34 26 a0 e8 03 53 f5 e0 <0f> 0b 49 8d 9c 24 c0 14 00 00 4c 89 55 b0 48 89 df e8 68 d2 73 
<1>[  414.628225] RIP: execlists_submission_tasklet+0x4a2/0xe30 [i915] RSP: ffff88017fc83ea8
<4>[  414.628273] ---[ end trace d28955b8aee33f2c ]---

Comment 5 Marta Löfstedt 2018-02-23 06:59:05 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4284/shard-kbl7/igt@gem_eio@in-flight-external.html

Although pstore backtrace is partially overwritten by ftrace I believe this is the same issue.

<4>[   78.803859] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel snd_hda_codec x86_pkg_temp_thermal intel_powerclamp coretemp snd_hwdep crct10dif_pclmul snd_hda_core crc32_pclmul ghash_clmulni_intel snd_pcm e1000e mei_me mei prime_numbers
<4>[   78.803910] CPU: 0 PID: 3 Comm: kworker/0:0 Tainted: G     U           4.16.0-rc2-CI-CI_DRM_3821+ #1
<4>[   78.803922] Hardware name: Intel Corporation NUC7i5BNH/NUC7i5BNB, BIOS BNKBL357.86A.0060.2017.1214.2013 12/14/2017
<4>[   78.803955] Workqueue: events_long i915_hangcheck_elapsed [i915]
<4>[   78.803989] RIP: 0010:execlists_submission_tasklet+0x4a2/0xe30 [i915]
<4>[   78.803999] RSP: 0018:ffff88027ec03ea8 EFLAGS: 00010296
<4>[   78.804007] RAX: 0000000000000031 RBX: 0000000000000018 RCX: 0000000000000102
<4>[   78.804017] RDX: 0000000080000102 RSI: ffffffff8211c1af RDI: 00000000ffffffff
<4>[   78.804026] RBP: ffff88027ec03f20 R08: ffff8802751b3140 R09: 00000000beeb13b4
<4>[   78.804036] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88025d32205c
<4>[   78.804045] R13: ffff88025d322040 R14: ffff880269100010 R15: ffff880269100008
<4>[   78.804055] FS:  0000000000000000(0000) GS:ffff88027ec00000(0000) knlGS:0000000000000000
<4>[   78.804065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   78.804074] CR2: 000055b5d01b9dd0 CR3: 0000000005210005 CR4: 00000000003606f0
<4>[   78.804083] Call Trace:
<4>[   78.804089]  <IRQ>
<4>[   78.804098]  tasklet_hi_action+0x89/0x110
<4>[   78.804106]  __do_softirq+0xc1/0x4aa
<4>[   78.804116]  irq_exit+0xa4/0xb0
<4>[   78.804122]  do_IRQ+0x67/0x120
<4>[   78.804129]  common_interrupt+0x84/0x84
<4>[   78.804135]  </IRQ>
<4>[   78.804141] RIP: 0010:_raw_spin_unlock_irqrestore+0x4e/0x60
<4>[   78.804149] RSP: 0018:ffffc90000037ac8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdb
<4>[   78.804160] RAX: ffff8802751b2840 RBX: 0000000000000246 RCX: 0000000000000006
<4>[   78.804170] RDX: 000000000000128e RSI: ffffffff8210fb59 RDI: ffffffff820c02e7
<4>[   78.804179] RBP: ffff8802744d35d0 R08: ffff8802751b3140 R09: 00000000beeb13b4
<4>[   78.804189] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0287f2a
<4>[   78.804198] R13: ffff88026955a298 R14: ffff880268a28968 R15: ffff88026955a298
<4>[   78.804214]  __irq_put_desc_unlock+0x13/0x40
<4>[   78.804222]  enable_irq+0x42/0x60
<4>[   78.804252]  i915_reset+0xcd/0x270 [i915]
<4>[   78.804283]  i915_reset_device+0x1cb/0x230 [i915]
<4>[   78.804313]  ? __intel_get_crtc_scanline+0x1f0/0x1f0 [i915]
<4>[   78.804324]  ? work_on_cpu_safe+0x50/0x50
<4>[   78.804356]  i915_handle_error+0x2d3/0x430 [i915]
<4>[   78.804369]  ? scnprintf+0x3a/0x70
<4>[   78.804402]  hangcheck_declare_hang+0xc0/0xd0 [i915]
<4>[   78.804439]  ? fwtable_read32+0x21e/0x290 [i915]
<4>[   78.804476]  i915_hangcheck_elapsed+0x232/0x310 [i915]
<4>[   78.804488]  process_one_work+0x215/0x620
<4>[   78.804498]  worker_thread+0x48/0x3a0
<4>[   78.804506]  kthread+0xfb/0x130
<4>[   78.804513]  ? process_one_work+0x620/0x620
<4>[   78.804520]  ? _kthread_create_on_node+0x30/0x30
<4>[   78.804529]  ret_from_fork+0x3a/0x50
<4>[   78.804539] Code: fd ff ff 48 c7 c6 48 a1 2a a0 48 c7 c7 9b e4 28 a0 e8 08 a4 f2 e0 0f 0b 48 c7 c6 70 a1 2a a0 48 c7 c7 9b e4 28 a0 e8 f3 a3 f2 e0 <0f> 0b 49 8d 9c 24 c0 14 00 00 4c 89 55 b0 48 89 df e8 58 23 71 
<1>[   78.804642] RIP: execlists_submission_tasklet+0x4a2/0xe30 [i915] RSP: ffff88027ec03ea8
<4>[   78.804677] ---[ end trace 7b2c642a76185f25 ]---

Comment 6 Marta Löfstedt 2018-02-23 07:15:10 UTC

(In reply to Chris Wilson from comment #1)
> It's the same problem as the bogus tail pointer in live_hangcheck, the
> original bug in 104262

Chris would you prefer to re-open bug 104262?

Also looking at history of the test
https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_eio@in-flight-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_eio@in-flight.html

 it looks like something may have happened around ~CI_DRM_3815

Also, the last occurrences on KBL-shards indicates a hang check rather than this thing.

Comment 7 Marta Löfstedt 2018-02-23 11:04:06 UTC

Speculation: is there a correlation between fix for bug 104676 and increase in APL, KBL incompletes on the same test?

Comment 8 Chris Wilson 2018-02-23 11:11:23 UTC

My working theory for the incompletes as a result of bug 104676 depend on a reset running concurrently to the tasklet. This bug is the HW reporting the wrong tail (or rather the tail not being updated before the interrupt is handled).

Comment 9 Chris Wilson 2018-02-23 11:13:23 UTC

Having said that, if the tasklet is running concurrently to a reset, everything is undefined.

Comment 10 Chris Wilson 2018-03-05 10:21:49 UTC

Found a related bug,

commit 963ddd63c314e9b5d9cd999873d473a93aed5380
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 2 11:33:24 2018 +0000

    drm/i915: Suspend submission tasklets around wedging

where the port->context_id was being reset with tasklet in flight on wedging. Which then uncovered the problem in bug 105341 (previously seen in the kasan runs). I _think_ this GEM_BUG_ON(CSB[] != port->context_id) is resolved. Just the can of worms got larger.

Comment 11 Marta Löfstedt 2018-03-05 11:44:47 UTC

Fix integrate to CI_DRM_3868 I will monitor for some runs and close if OK

Comment 12 Marta Löfstedt 2018-03-06 07:09:18 UTC

It seem all hits are now on bug 105341. I will close and archive this

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.