Bug 107424 - BSW - igt@drv_selftest@live_* - dmesg-warn - WARN_ON(intel_gpu_reset(i915, (~0)))

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high priority, normal severity
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

Whiteboard:	ReadyForDev

Duplicates (1):	107497

Reported:	2018-07-30 08:52 UTC by Tomi Sarvela
Modified:	2018-10-19 12:55 UTC
CC List:	2 users

i915 platform:	BSW/CHT
i915 features:	power/Other

Description Tomi Sarvela 2018-07-30 08:52:02 UTC

This happened on the CI_DRM_4577 build during the BSW drv_selftest runs; it has since been fixed or has disappeared.

[  707.360533] Setting dangerous option live_selftests - tainting kernel
[  707.433259] [drm:gen8_reset_engines [i915]] *ERROR* bcs0: reset request timeout
[  707.433330] ------------[ cut here ]------------
[  707.433336] WARN_ON(intel_gpu_reset(i915, (~0)))
[  707.433469] WARNING: CPU: 1 PID: 8575 at drivers/gpu/drm/i915/i915_gem.c:5038 i915_gem_sanitize+0xbb/0xc0 [i915]
[  707.433475] Modules linked in: i915(+) vgem btusb btrtl btbcm btintel bluetooth ecdh_generic coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel r8169 mii lpc_ich snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec snd_hwdep snd_hda_core pinctrl_cherryview snd_pcm prime_numbers [last unloaded: i915]
[  707.433593] CPU: 1 PID: 8575 Comm: drv_selftest Tainted: G     U  W         4.18.0-rc6-CI-CI_DRM_4577+ #1
[  707.433598] Hardware name:  /NUC5CPYB, BIOS PYBSWCEL.86A.0058.2016.1102.1842 11/02/2016
[  707.433696] RIP: 0010:i915_gem_sanitize+0xbb/0xc0 [i915]
[  707.433701] Code: c0 75 14 48 89 df e8 64 4f 02 00 eb b2 48 89 df e8 ea 7e ff ff eb 9f 48 c7 c6 38 c1 23 a0 48 c7 c7 cc 3b 22 a0 e8 25 55 f7 e0 <0f> 0b eb 91 90 55 53 48 89 fd 48 c7 c7 b0 f6 10 a0 48 8d 5d 68 48 
[  707.433969] RSP: 0018:ffffc90000247b18 EFLAGS: 00010286
[  707.433978] RAX: 0000000000000000 RBX: ffff8801288f0000 RCX: 0000000000000001
[  707.433983] RDX: 0000000080000001 RSI: ffffffff820c6e04 RDI: 00000000ffffffff
[  707.433988] RBP: ffff8801288f0068 R08: 000000007933894f R09: 0000000000000000
[  707.433993] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  707.433998] R13: 0000000000000000 R14: ffff8801288f0d58 R15: 0000000000000048
[  707.434003] FS:  00007ff08e460980(0000) GS:ffff88017fd00000(0000) knlGS:0000000000000000
[  707.434008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  707.434013] CR2: 00007f8c23034078 CR3: 0000000177730000 CR4: 00000000001006e0
[  707.434017] Call Trace:
[  707.434116]  i915_driver_load+0x714/0x10a0 [i915]
[  707.434129]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
[  707.434139]  ? trace_hardirqs_on_caller+0xe0/0x1b0
[  707.434238]  i915_pci_probe+0x29/0xa0 [i915]
[  707.434249]  pci_device_probe+0xa1/0x130
[  707.434262]  driver_probe_device+0x306/0x480
[  707.434273]  __driver_attach+0xdb/0x100
[  707.434279]  ? driver_probe_device+0x480/0x480
[  707.434286]  ? driver_probe_device+0x480/0x480
[  707.434294]  bus_for_each_dev+0x74/0xc0
[  707.434307]  bus_add_driver+0x15f/0x250
[  707.434315]  ? 0xffffffffa0701000
[  707.434322]  driver_register+0x56/0xe0
[  707.434330]  ? 0xffffffffa0701000
[  707.434336]  do_one_initcall+0x58/0x370
[  707.434347]  ? do_init_module+0x1d/0x1ea
[  707.434354]  ? rcu_read_lock_sched_held+0x6f/0x80
[  707.434360]  ? kmem_cache_alloc_trace+0x282/0x2e0
[  707.434374]  do_init_module+0x56/0x1ea
[  707.434383]  load_module+0x2435/0x2b20
[  707.434427]  ? __se_sys_finit_module+0xd3/0xf0
[  707.434434]  __se_sys_finit_module+0xd3/0xf0
[  707.434460]  do_syscall_64+0x55/0x190
[  707.434469]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  707.434475] RIP: 0033:0x7ff08dd34839
[  707.434479] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48 
[  707.434747] RSP: 002b:00007ffe145da768 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  707.434757] RAX: ffffffffffffffda RBX: 0000555798d80a70 RCX: 00007ff08dd34839
[  707.434762] RDX: 0000000000000000 RSI: 0000555798d817c0 RDI: 0000000000000004
[  707.434766] RBP: 0000555798d817c0 R08: 0000000000000004 R09: 0000000000000000
[  707.434771] R10: 00007ffe145da8e0 R11: 0000000000000246 R12: 0000000000000000
[  707.434776] R13: 0000555798d79ea0 R14: 0000000000000020 R15: 0000000000000039
[  707.434801] irq event stamp: 232520
[  707.434808] hardirqs last  enabled at (232519): [<ffffffff810f8b7c>] console_unlock+0x3fc/0x600
[  707.434815] hardirqs last disabled at (232520): [<ffffffff81a0111c>] error_entry+0x7c/0x100
[  707.434821] softirqs last  enabled at (232492): [<ffffffff81c0034f>] __do_softirq+0x34f/0x505
[  707.434828] softirqs last disabled at (232485): [<ffffffff8108c825>] irq_exit+0xa5/0xc0
[  707.434926] WARNING: CPU: 1 PID: 8575 at drivers/gpu/drm/i915/i915_gem.c:5038 i915_gem_sanitize+0xbb/0xc0 [i915]
[  707.434931] ---[ end trace 5b1357ee092a3605 ]---
[  707.701804] [drm:gen8_reset_engines [i915]] *ERROR* bcs0: reset request timeout
[  707.719394] [drm:gen8_reset_engines [i915]] *ERROR* bcs0: reset request timeout
[  707.719920] ------------[ cut here ]------------

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4577/fi-bsw-n3050/igt@drv_selftest@live_coherency.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4577/fi-bsw-n3050/igt@drv_selftest@live_evict.html

Comment 1 Chris Wilson 2018-07-30 09:43:09 UTC

This is what happens if that "reset request timeout" problem strikes and we then reload the module. As the GPU is toast and doesn't recover, we warn. Unfortunately, it's not easy to rearrange the code to avoid the WARN, so I think we'll keep the risk of it until we need to rearrange the init routines.

Comment 2 Simon Lee 2018-08-04 15:31:01 UTC

Hi Tomi,

Given the information provided by Chris, are you OK with the WARN continuing to appear?

Comment 3 Tomi Sarvela 2018-08-06 07:37:00 UTC

If the WARN is considered useful information, then by all means keep it and mark this issue as WONTFIX.

CIbuglog can filter out this particular error if it happens only on a limited set of tests.

Comment 4 Chris Wilson 2018-08-06 08:08:51 UTC

The WARN is overkill; we just need an error noting that we're wedging the driver to keep the system alive. Ultimately, we should never hit it in the first place. We have the underlying bug in a couple of places, e.g. bug 106683.
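A minimal, self-contained sketch of the alternative suggested above: instead of wrapping the reset call in a WARN_ON (which, as the log shows for i915_gem_sanitize, dumps a full backtrace), print a single error line and mark the device wedged so the rest of the system keeps running. Everything here (struct fake_gpu, fake_gpu_reset(), fake_sanitize()) is hypothetical and simulates the reset outcome; it is not the real i915 code or its API.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's device state (not i915's struct). */
struct fake_gpu {
    bool reset_ok; /* simulated outcome of a reset attempt */
    bool wedged;   /* set once we give up on the GPU */
};

/* Hypothetical reset hook: 0 on success, -1 on failure, loosely
 * modelling the "reset request timeout" seen in the log. */
static int fake_gpu_reset(struct fake_gpu *gpu)
{
    return gpu->reset_ok ? 0 : -1;
}

/* Hypothetical init-time sanitize step: on reset failure, emit one
 * error line (no backtrace) and mark the device wedged rather than
 * warning. */
static void fake_sanitize(struct fake_gpu *gpu)
{
    if (fake_gpu_reset(gpu) != 0) {
        fprintf(stderr, "GPU reset failed during init; wedging device\n");
        gpu->wedged = true;
    }
}
```

For example, a device whose simulated reset fails ends up with wedged set to true and one error line on stderr, while a successful reset leaves it untouched.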

Comment 5 Chris Wilson 2018-08-06 14:10:45 UTC

*** Bug 107497 has been marked as a duplicate of this bug. ***

Comment 6 Lakshmi 2018-10-19 12:55:43 UTC

This issue occurred only once, in CI_DRM_4577 (2 months, 3 weeks / 1094 runs ago), and caused a few tests to fail. It has not been seen since, so I assume it is fixed.
Closing this bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.