Summary: | [CI] igt@gem_* - incomplete - GEM_BUG_ON(buf[2 * head + 1] != port->context_id) - kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:869! | ||
---|---|---|---|
Product: | DRI | Reporter: | Marta Löfstedt <marta.lofstedt> |
Component: | DRM/Intel | Assignee: | Marta Löfstedt <marta.lofstedt> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs |
Version: | DRI git | ||
Hardware: | Other | ||
OS: | All | ||
See Also: | https://bugs.freedesktop.org/show_bug.cgi?id=105341 | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BXT, GLK, KBL | i915 features: | GEM/Other |
Description
Marta Löfstedt
2018-02-05 07:22:46 UTC
It's the same problem as the bogus tail pointer in live_hangcheck, the original bug in 104262 https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4277/shard-apl8/igt@gem_eio@in-flight-contexts.html from pstore: ... <3>[ 69.741148] GEM_BUG_ON(buf[2 * head + 1] != port->context_id) <4>[ 69.741237] ------------[ cut here ]------------ <2>[ 69.741241] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:876! <4>[ 69.741281] invalid opcode: 0000 [#1] PREEMPT SMP PTI <0>[ 69.741296] Dumping ftrace buffer: ... <4>[ 69.750150] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep lpc_ich snd_hda_core snd_pcm r8169 mei_me mii mei prime_numbers pinctrl_broxton pinctrl_intel <4>[ 69.750239] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G U 4.16.0-rc2-CI-CI_DRM_3814+ #1 <4>[ 69.750257] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017 <4>[ 69.750323] RIP: 0010:execlists_submission_tasklet+0x477/0xe50 [i915] <4>[ 69.750338] RSP: 0018:ffff88027fc83ea8 EFLAGS: 00010296 <4>[ 69.750351] RAX: 0000000000000031 RBX: 0000000000000014 RCX: 0000000000000102 <4>[ 69.750366] RDX: 0000000080000102 RSI: ffffffff8211c1af RDI: 00000000ffffffff <4>[ 69.750381] RBP: ffff88027fc83f20 R08: 0000000000000001 R09: 0000000000000000 <4>[ 69.750395] R10: 0000000000000005 R11: 0000000000000000 R12: ffff880261d49040 <4>[ 69.750410] R13: ffff880270fea160 R14: ffff880261d49064 R15: ffff880270fea158 <4>[ 69.750425] FS: 0000000000000000(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000 <4>[ 69.750442] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 69.750454] CR2: 0000558cd3fc6818 CR3: 0000000005210000 CR4: 00000000003406e0 <4>[ 69.750469] Call Trace: <4>[ 69.750478] <IRQ> <4>[ 69.750493] tasklet_hi_action+0x89/0x110 <4>[ 69.750509] __do_softirq+0xc1/0x4aa <4>[ 69.750524] irq_exit+0xa4/0xb0 <4>[ 69.750533] do_IRQ+0x67/0x120 <4>[ 69.750546] common_interrupt+0x84/0x84 <4>[ 69.750556] </IRQ> <4>[ 69.750565] RIP: 0010:cpuidle_enter_state+0xaa/0x350 <4>[ 69.750576] RSP: 0018:ffffc90000087eb8 EFLAGS: 00000216 ORIG_RAX: ffffffffffffffda <4>[ 69.750594] RAX: ffff880276008040 RBX: 00000000000c9a06 RCX: 0000000000000001 <4>[ 69.750608] RDX: 0000000000000000 RSI: ffffffff8210fb59 RDI: ffffffff820c02e7 <4>[ 69.750623] RBP: 0000000000000003 R08: 00000000000004d0 R09: 0000000000000002 <4>[ 69.750637] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003 <4>[ 69.750652] R13: ffff880275a489f8 R14: 000000103cd532be R15: ffffffff82297ba0 <4>[ 69.750674] ? cpuidle_enter_state+0xa6/0x350 <4>[ 69.750689] do_idle+0x188/0x1d0 <4>[ 69.750700] cpu_startup_entry+0x14/0x20 <4>[ 69.750712] start_secondary+0x129/0x160 <4>[ 69.750725] secondary_startup_64+0xa5/0xb0 <4>[ 69.750741] Code: fd ff ff 48 c7 c6 90 9d 27 a0 48 c7 c7 66 e1 25 a0 e8 73 a7 f5 e0 0f 0b 48 c7 c6 b8 9d 27 a0 48 c7 c7 66 e1 25 a0 e8 5e a7 f5 e0 <0f> 0b 49 8d 9e c0 14 00 00 4c 89 5d b0 48 89 df e8 c4 26 74 e1 <1>[ 69.750913] RIP: execlists_submission_tasklet+0x477/0xe50 [i915] RSP: ffff88027fc83ea8 <4>[ 69.750976] ---[ end trace 964f2a4712d380ea ]--- <0>[ 70.053877] Kernel panic - not syncing: Fatal exception in interrupt <0>[ 70.053915] Dumping ftrace buffer: <0>[ 70.053928] (ftrace buffer empty) Here is another one with working pstore: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3816/shard-apl6/igt@gem_eio@in-flight.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3826/shard-glkb6/igt@gem_eio@in-flight.html This is all in pstore. But it is very similar to Comment #1 <4>[ 414.627527] RIP: 0010:_raw_spin_unlock_irqrestore+0x4e/0x60 <4>[ 414.627538] RSP: 0018:ffffc90000bbfac8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc <4>[ 414.627553] RAX: ffff88014446a740 RBX: 0000000000000246 RCX: 0000000000000006 <4>[ 414.627566] RDX: 000000000000136f RSI: ffffffff8210fb89 RDI: ffffffff820c0317 <4>[ 414.627580] RBP: ffff8801749e9b40 R08: ffff88014446b040 R09: 00000000d7f092b6 <4>[ 414.627593] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa025cf4a <4>[ 414.627606] R13: ffff880176e6a298 R14: ffff88016e0a9c08 R15: ffff880176e6a298 <4>[ 414.627626] __irq_put_desc_unlock+0x13/0x40 <4>[ 414.627637] enable_irq+0x42/0x60 <4>[ 414.627680] i915_reset+0xcd/0x270 [i915] <4>[ 414.627724] i915_reset_device+0x1cb/0x230 [i915] <4>[ 414.627771] ? __intel_get_crtc_scanline+0x1f0/0x1f0 [i915] <4>[ 414.627787] ? work_on_cpu_safe+0x50/0x50 <4>[ 414.627833] i915_handle_error+0x2d3/0x430 [i915] <4>[ 414.627851] ? scnprintf+0x3a/0x70 <4>[ 414.627899] hangcheck_declare_hang+0xc0/0xd0 [i915] <4>[ 414.627949] ? fwtable_read32+0x21e/0x290 [i915] <4>[ 414.627998] i915_hangcheck_elapsed+0x232/0x310 [i915] <4>[ 414.628014] process_one_work+0x215/0x620 <4>[ 414.628028] worker_thread+0x48/0x3a0 <4>[ 414.628040] kthread+0xfb/0x130 <4>[ 414.628048] ? process_one_work+0x620/0x620 <4>[ 414.628058] ? _kthread_create_on_node+0x30/0x30 <4>[ 414.628069] ret_from_fork+0x3a/0x50 <4>[ 414.628083] Code: fd ff ff 48 c7 c6 30 f1 27 a0 48 c7 c7 bb 34 26 a0 e8 18 53 f5 e0 0f 0b 48 c7 c6 58 f1 27 a0 48 c7 c7 bb 34 26 a0 e8 03 53 f5 e0 <0f> 0b 49 8d 9c 24 c0 14 00 00 4c 89 55 b0 48 89 df e8 68 d2 73 <1>[ 414.628225] RIP: execlists_submission_tasklet+0x4a2/0xe30 [i915] RSP: ffff88017fc83ea8 <4>[ 414.628273] ---[ end trace d28955b8aee33f2c ]--- https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4284/shard-kbl7/igt@gem_eio@in-flight-external.html Although pstore backtrace is partially overwritten by ftrace I believe this is the same issue. <4>[ 78.803859] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel snd_hda_codec x86_pkg_temp_thermal intel_powerclamp coretemp snd_hwdep crct10dif_pclmul snd_hda_core crc32_pclmul ghash_clmulni_intel snd_pcm e1000e mei_me mei prime_numbers <4>[ 78.803910] CPU: 0 PID: 3 Comm: kworker/0:0 Tainted: G U 4.16.0-rc2-CI-CI_DRM_3821+ #1 <4>[ 78.803922] Hardware name: Intel Corporation NUC7i5BNH/NUC7i5BNB, BIOS BNKBL357.86A.0060.2017.1214.2013 12/14/2017 <4>[ 78.803955] Workqueue: events_long i915_hangcheck_elapsed [i915] <4>[ 78.803989] RIP: 0010:execlists_submission_tasklet+0x4a2/0xe30 [i915] <4>[ 78.803999] RSP: 0018:ffff88027ec03ea8 EFLAGS: 00010296 <4>[ 78.804007] RAX: 0000000000000031 RBX: 0000000000000018 RCX: 0000000000000102 <4>[ 78.804017] RDX: 0000000080000102 RSI: ffffffff8211c1af RDI: 00000000ffffffff <4>[ 78.804026] RBP: ffff88027ec03f20 R08: ffff8802751b3140 R09: 00000000beeb13b4 <4>[ 78.804036] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88025d32205c <4>[ 78.804045] R13: ffff88025d322040 R14: ffff880269100010 R15: ffff880269100008 <4>[ 78.804055] FS: 0000000000000000(0000) GS:ffff88027ec00000(0000) knlGS:0000000000000000 <4>[ 78.804065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 78.804074] CR2: 000055b5d01b9dd0 CR3: 0000000005210005 CR4: 00000000003606f0 <4>[ 78.804083] Call Trace: <4>[ 78.804089] <IRQ> <4>[ 78.804098] tasklet_hi_action+0x89/0x110 <4>[ 78.804106] __do_softirq+0xc1/0x4aa <4>[ 78.804116] irq_exit+0xa4/0xb0 <4>[ 78.804122] do_IRQ+0x67/0x120 <4>[ 78.804129] common_interrupt+0x84/0x84 <4>[ 78.804135] </IRQ> <4>[ 78.804141] RIP: 0010:_raw_spin_unlock_irqrestore+0x4e/0x60 <4>[ 78.804149] RSP: 0018:ffffc90000037ac8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdb <4>[ 78.804160] RAX: ffff8802751b2840 RBX: 0000000000000246 RCX: 0000000000000006 <4>[ 78.804170] RDX: 000000000000128e RSI: ffffffff8210fb59 RDI: ffffffff820c02e7 <4>[ 78.804179] RBP: ffff8802744d35d0 R08: ffff8802751b3140 R09: 00000000beeb13b4 <4>[ 78.804189] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0287f2a <4>[ 78.804198] R13: ffff88026955a298 R14: ffff880268a28968 R15: ffff88026955a298 <4>[ 78.804214] __irq_put_desc_unlock+0x13/0x40 <4>[ 78.804222] enable_irq+0x42/0x60 <4>[ 78.804252] i915_reset+0xcd/0x270 [i915] <4>[ 78.804283] i915_reset_device+0x1cb/0x230 [i915] <4>[ 78.804313] ? __intel_get_crtc_scanline+0x1f0/0x1f0 [i915] <4>[ 78.804324] ? work_on_cpu_safe+0x50/0x50 <4>[ 78.804356] i915_handle_error+0x2d3/0x430 [i915] <4>[ 78.804369] ? scnprintf+0x3a/0x70 <4>[ 78.804402] hangcheck_declare_hang+0xc0/0xd0 [i915] <4>[ 78.804439] ? fwtable_read32+0x21e/0x290 [i915] <4>[ 78.804476] i915_hangcheck_elapsed+0x232/0x310 [i915] <4>[ 78.804488] process_one_work+0x215/0x620 <4>[ 78.804498] worker_thread+0x48/0x3a0 <4>[ 78.804506] kthread+0xfb/0x130 <4>[ 78.804513] ? process_one_work+0x620/0x620 <4>[ 78.804520] ? _kthread_create_on_node+0x30/0x30 <4>[ 78.804529] ret_from_fork+0x3a/0x50 <4>[ 78.804539] Code: fd ff ff 48 c7 c6 48 a1 2a a0 48 c7 c7 9b e4 28 a0 e8 08 a4 f2 e0 0f 0b 48 c7 c6 70 a1 2a a0 48 c7 c7 9b e4 28 a0 e8 f3 a3 f2 e0 <0f> 0b 49 8d 9c 24 c0 14 00 00 4c 89 55 b0 48 89 df e8 58 23 71 <1>[ 78.804642] RIP: execlists_submission_tasklet+0x4a2/0xe30 [i915] RSP: ffff88027ec03ea8 <4>[ 78.804677] ---[ end trace 7b2c642a76185f25 ]--- (In reply to Chris Wilson from comment #1) > It's the same problem as the bogus tail pointer in live_hangcheck, the > original bug in 104262 Chris would you prefer to re-open bug 104262? Also looking at history of the test https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_eio@in-flight-contexts.html https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_eio@in-flight.html it looks like something may have happened around ~CI_DRM_3815 Also, the last occurrences on KBL-shards indicates a hang check rather than this thing. Speculation: is there a correlation between fix for bug 104676 and increase in APL, KBL incompletes on the same test? My working theory for the incompletes as a result of bug 104676 depend on a reset running concurrently to the tasklet. This bug is the HW reporting the wrong tail (or rather the tail not being updated before the interrupt is handled). Having said that, if the tasklet is running concurrently to a reset, everything is undefined. Found a related bug, commit 963ddd63c314e9b5d9cd999873d473a93aed5380 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Mar 2 11:33:24 2018 +0000 drm/i915: Suspend submission tasklets around wedging where the port->context_id was being reset with tasklet in flight on wedging. Which then uncovered the problem in bug 105341 (previously seen in the kasan runs). I _think_ this GEM_BUG_ON(CSB[] != port->context_id) is resolved. Just the can of worms got larger. Fix integrate to CI_DRM_3868 I will monitor for some runs and close if OK It seem all hits are now on bug 105341. I will close and archive this |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.