[ 349.327569] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
[ 355.104919] usb usb3: root hub lost power or was reset
[ 355.104920] usb usb1: root hub lost power or was reset
[ 355.104921] usb usb4: root hub lost power or was reset
[ 355.104965] usb usb2: root hub lost power or was reset
[ 355.108210] sd 4:0:0:0: [sda] Starting disk
[ 355.121935] ------------[ cut here ]------------
[ 355.121938] vblank wait timed out on crtc 0
[ 355.121953] WARNING: CPU: 0 PID: 2657 at drivers/gpu/drm/drm_vblank.c:1084 drm_wait_one_vblank+0x174/0x180
[ 355.121955] Modules linked in: vgem snd_hda_codec_hdmi i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mei_me mii mei prime_numbers lpc_ich
[ 355.121996] CPU: 0 PID: 2657 Comm: kworker/u16:63 Tainted: G U W 4.17.0-rc1-CI-CI_DRM_4055+ #1
[ 355.121998] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
[ 355.122000] Workqueue: events_unbound async_run_entry_fn
[ 355.122003] RIP: 0010:drm_wait_one_vblank+0x174/0x180
[ 355.122004] RSP: 0018:ffffc90001027a28 EFLAGS: 00010286
[ 355.122006] RAX: 0000000000000000 RBX: ffff8803f6f80000 RCX: 0000000000000001
[ 355.122007] RDX: 0000000080000001 RSI: ffffffff821209d1 RDI: 00000000ffffffff
[ 355.122009] RBP: 0000000000000000 R08: ffff8803f9c83098 R09: 000000001a3175a3
[ 355.122010] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 355.122011] R13: 0000000000003339 R14: ffff8803f71a12b0 R15: ffff8803f7d59bf8
[ 355.122012] FS: 0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
[ 355.122013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 355.122015] CR2: 00007f8f3ee67458 CR3: 0000000005210002 CR4: 00000000001606f0
[ 355.122016] Call Trace:
[ 355.122019] ? wait_woken+0x90/0x90
[ 355.122051] hsw_enable_crt+0x38/0x80 [i915]
[ 355.122086] intel_encoders_enable.isra.61+0x5e/0x70 [i915]
[ 355.122114] haswell_crtc_enable+0x33b/0x7c0 [i915]
[ 355.122139] intel_update_crtc+0x3f/0x80 [i915]
[ 355.122162] intel_update_crtcs+0x52/0x70 [i915]
[ 355.122191] intel_atomic_commit_tail+0x339/0xd40 [i915]
[ 355.122230] intel_atomic_commit+0x24f/0x330 [i915]
[ 355.122234] ? pci_pm_suspend_noirq+0x1e0/0x1e0
[ 355.122256] __intel_display_resume+0x76/0xc0 [i915]
[ 355.122278] intel_display_resume+0xd3/0x100 [i915]
[ 355.122298] i915_pm_restore+0xc4/0x130 [i915]
[ 355.122301] dpm_run_callback+0x5f/0x310
[ 355.122304] device_resume+0xb7/0x1e0
[ 355.122307] ? dpm_watchdog_set+0x60/0x60
[ 355.122311] async_resume+0x14/0x40
[ 355.122313] async_run_entry_fn+0x2e/0x160
[ 355.122316] process_one_work+0x21a/0x640
[ 355.122321] worker_thread+0x48/0x3a0
[ 355.122326] kthread+0xfb/0x130
[ 355.122327] ? process_one_work+0x640/0x640
[ 355.122329] ? _kthread_create_on_node+0x60/0x60
[ 355.122333] ret_from_fork+0x3a/0x50
[ 355.122340] Code: 2d ff ff ff e8 0e 31 a7 ff 48 89 e6 4c 89 f7 e8 e3 43 ac ff 45 85 e4 0f 85 0a ff ff ff 89 ee 48 c7 c7 b0 88 0f 82 e8 8c 2e a7 ff <0f> 0b e9 f5 fe ff ff 0f 1f 44 00 00 8b b7 f8 00 00 00 48 8b 3f
[ 355.122399] WARNING: CPU: 0 PID: 2657 at drivers/gpu/drm/drm_vblank.c:1084 drm_wait_one_vblank+0x174/0x180
[ 355.122401] irq event stamp: 5084
[ 355.122403] hardirqs last enabled at (5083): [<ffffffff810f826d>] vprintk_emit+0x4cd/0x4f0
[ 355.122405] hardirqs last disabled at (5084): [<ffffffff81a0111c>] error_entry+0x7c/0x100
[ 355.122406] softirqs last enabled at (4294): [<ffffffff81c003a1>] __do_softirq+0x3a1/0x4aa
[ 355.122409] softirqs last disabled at (4287): [<ffffffff8108b0f4>] irq_exit+0xa4/0xb0
[ 355.122410] ---[ end trace 2406a7e8191d8b62 ]---
[ 355.589003] Setting dangerous option reset - tainting kernel
[ 355.589388] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
[ 355.603027] Setting dangerous option reset - tainting kernel
[ 355.603548] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
Increasing the priority as it is likely a regression from Linux 4.17-rc1.
Also seen on suspend tests: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8764/shard-hsw2/igt@firstname.lastname@example.org
Also seen on non-shard HSW: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_24/fi-hsw-4770/igt@email@example.com
Also seen on SNB: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4446/shard-snb5/igt@firstname.lastname@example.org
Also seen on ILK: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_25/fi-ilk-650/igt@email@example.com
I was able to reproduce this on HSW. Trying to bisect.
Petri, any further findings?
Bisect landed on commit 9eb31227cbccd3a37da0f42604f1ab5fc556bc53 with some certainty. The issue does not reproduce on its parents with 200-something runs.
The results are suspect though because that commit is a merge commit, and it merges changes in crypto.
I did a piecemeal merge and bisected that, and the latter bisect landed on a commit merging changes for crypto/ux500/hash/. Again, with quite a bit of certainty.
I applied a revert of the merged commit into the merge commit the original bisect landed on and ran some 20 rounds on the result. The issue does not reproduce.
Before you can celebrate finding the culprit, I'd like to point out that the commit in question changes code that isn't even built with my .config.
More test rounds to follow.
Reverting the revert changes nothing as expected, bisecting is suspect. I'm tabling this bisecting effort for now.
Looks drm_wait_one_vblank fails because it is being called from intel_display_suspend/resume, when vblank interrupts are not going to happen because system is in the suspended state. This causes a time out if were unlucky not to have any vblank events accumulated in the queue.
I guess the fix should be that we check if we are not in suspended state in intel_pre_plane_update, otherwise it makes no sense to wait for it.
I have done a small fix, so I just need to verify it now.
Is this already fixed?
(In reply to Jani Saarinen from comment #12)
> Is this already fixed?
I can't tell yet using the CI results, because the chances of it happening were slim. Maybe Stan has landed a fix?
Stan, can you comment here?
I have investigated this issue and figured out that drm_wait_one_vblank was always timing out when going to suspend or resume. My idea was that this happens because it sometimes waits for vblank when vblank interrupts aren't turned on yet. However after some source code investigation it turned out that it at least tries to turn on the interrupts before intel_display_suspend/resume is called.
This seems to be not reproducible even in CI at the moment, so my guess was that something has been fixed already or the chances of reproducing it got even more slim.
Based on the investigation, I assume this issue has been fixed. Last seen this issue 1 month 1 week ago (or) 24 rounds of drmtip execution. This issue used to appear regularly.
Closing this bug as resolved/fixed.