106086 – [CI] igt@*(suspend|hibernate)* | igt@kms_cursor_crc@* - vblank wait timed out on crtc 0

Bug 106086 - [CI] igt@*(suspend|hibernate)* | igt@kms_cursor_crc@* - vblank wait timed out on crtc 0

Summary: [CI] igt@*(suspend|hibernate)* | igt@kms_cursor_crc@* - vblank wait timed out...

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Stanislav Lisovskiy
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:	106319
	Show dependency tree / graph

Reported:	2018-04-16 18:14 UTC by Martin Peres
Modified:	2018-09-25 12:28 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:	ALL
i915 features:	display/Other

Attachments

Description Martin Peres 2018-04-16 18:14:32 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4055/shard-hsw3/igt@gem_eio@hibernate.html

[  349.327569] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
[  355.104919] usb usb3: root hub lost power or was reset
[  355.104920] usb usb1: root hub lost power or was reset
[  355.104921] usb usb4: root hub lost power or was reset
[  355.104965] usb usb2: root hub lost power or was reset
[  355.108210] sd 4:0:0:0: [sda] Starting disk
[  355.121935] ------------[ cut here ]------------
[  355.121938] vblank wait timed out on crtc 0
[  355.121953] WARNING: CPU: 0 PID: 2657 at drivers/gpu/drm/drm_vblank.c:1084 drm_wait_one_vblank+0x174/0x180
[  355.121955] Modules linked in: vgem snd_hda_codec_hdmi i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mei_me mii mei prime_numbers lpc_ich
[  355.121996] CPU: 0 PID: 2657 Comm: kworker/u16:63 Tainted: G     U  W         4.17.0-rc1-CI-CI_DRM_4055+ #1
[  355.121998] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
[  355.122000] Workqueue: events_unbound async_run_entry_fn
[  355.122003] RIP: 0010:drm_wait_one_vblank+0x174/0x180
[  355.122004] RSP: 0018:ffffc90001027a28 EFLAGS: 00010286
[  355.122006] RAX: 0000000000000000 RBX: ffff8803f6f80000 RCX: 0000000000000001
[  355.122007] RDX: 0000000080000001 RSI: ffffffff821209d1 RDI: 00000000ffffffff
[  355.122009] RBP: 0000000000000000 R08: ffff8803f9c83098 R09: 000000001a3175a3
[  355.122010] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  355.122011] R13: 0000000000003339 R14: ffff8803f71a12b0 R15: ffff8803f7d59bf8
[  355.122012] FS:  0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
[  355.122013] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  355.122015] CR2: 00007f8f3ee67458 CR3: 0000000005210002 CR4: 00000000001606f0
[  355.122016] Call Trace:
[  355.122019]  ? wait_woken+0x90/0x90
[  355.122051]  hsw_enable_crt+0x38/0x80 [i915]
[  355.122086]  intel_encoders_enable.isra.61+0x5e/0x70 [i915]
[  355.122114]  haswell_crtc_enable+0x33b/0x7c0 [i915]
[  355.122139]  intel_update_crtc+0x3f/0x80 [i915]
[  355.122162]  intel_update_crtcs+0x52/0x70 [i915]
[  355.122191]  intel_atomic_commit_tail+0x339/0xd40 [i915]
[  355.122230]  intel_atomic_commit+0x24f/0x330 [i915]
[  355.122234]  ? pci_pm_suspend_noirq+0x1e0/0x1e0
[  355.122256]  __intel_display_resume+0x76/0xc0 [i915]
[  355.122278]  intel_display_resume+0xd3/0x100 [i915]
[  355.122298]  i915_pm_restore+0xc4/0x130 [i915]
[  355.122301]  dpm_run_callback+0x5f/0x310
[  355.122304]  device_resume+0xb7/0x1e0
[  355.122307]  ? dpm_watchdog_set+0x60/0x60
[  355.122311]  async_resume+0x14/0x40
[  355.122313]  async_run_entry_fn+0x2e/0x160
[  355.122316]  process_one_work+0x21a/0x640
[  355.122321]  worker_thread+0x48/0x3a0
[  355.122326]  kthread+0xfb/0x130
[  355.122327]  ? process_one_work+0x640/0x640
[  355.122329]  ? _kthread_create_on_node+0x60/0x60
[  355.122333]  ret_from_fork+0x3a/0x50
[  355.122340] Code: 2d ff ff ff e8 0e 31 a7 ff 48 89 e6 4c 89 f7 e8 e3 43 ac ff 45 85 e4 0f 85 0a ff ff ff 89 ee 48 c7 c7 b0 88 0f 82 e8 8c 2e a7 ff <0f> 0b e9 f5 fe ff ff 0f 1f 44 00 00 8b b7 f8 00 00 00 48 8b 3f 
[  355.122399] WARNING: CPU: 0 PID: 2657 at drivers/gpu/drm/drm_vblank.c:1084 drm_wait_one_vblank+0x174/0x180
[  355.122401] irq event stamp: 5084
[  355.122403] hardirqs last  enabled at (5083): [<ffffffff810f826d>] vprintk_emit+0x4cd/0x4f0
[  355.122405] hardirqs last disabled at (5084): [<ffffffff81a0111c>] error_entry+0x7c/0x100
[  355.122406] softirqs last  enabled at (4294): [<ffffffff81c003a1>] __do_softirq+0x3a1/0x4aa
[  355.122409] softirqs last disabled at (4287): [<ffffffff8108b0f4>] irq_exit+0xa4/0xb0
[  355.122410] ---[ end trace 2406a7e8191d8b62 ]---
[  355.589003] Setting dangerous option reset - tainting kernel
[  355.589388] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
[  355.603027] Setting dangerous option reset - tainting kernel
[  355.603548] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff

Comment 1 Martin Peres 2018-04-16 18:15:43 UTC

Increasing the priority as it is likely a regression from Linux 4.17-rc1.

Comment 2 Martin Peres 2018-04-24 12:20:34 UTC

Also seen on suspend tests: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8764/shard-hsw2/igt@gem_exec_suspend@basic-s3-devices.html

Comment 3 Martin Peres 2018-04-24 13:59:48 UTC

Also seen on non-shard HSW: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_24/fi-hsw-4770/igt@gem_eio@suspend.html

Comment 4 Martin Peres 2018-04-24 15:11:19 UTC

Also seen on SNB: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4446/shard-snb5/igt@gem_exec_suspend@basic-s3-devices.html

Comment 5 Martin Peres 2018-04-24 15:42:55 UTC

Also seen on ILK: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_25/fi-ilk-650/igt@gem_exec_suspend@basic-s3-devices.html

Comment 6 Martin Peres 2018-04-24 15:44:54 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_26/fi-elk-e7500/igt@kms_cursor_crc@cursor-128x128-onscreen.html

Comment 7 Petri Latvala 2018-05-14 10:35:00 UTC

I was able to reproduce this on HSW. Trying to bisect.

Comment 8 Francesco Balestrieri 2018-05-29 12:30:36 UTC

Petri, any further findings?

Comment 9 Petri Latvala 2018-06-05 13:37:48 UTC

Bisect landed on commit 9eb31227cbccd3a37da0f42604f1ab5fc556bc53 with some certainty. The issue does not reproduce on its parents with 200-something runs.

The results are suspect though because that commit is a merge commit, and it merges changes in crypto.

I did a piecemeal merge and bisected that, and the latter bisect landed on a commit merging changes for crypto/ux500/hash/. Again, with quite a bit of certainty.

I applied a revert of the merged commit into the merge commit the original bisect landed on and ran some 20 rounds on the result. The issue does not reproduce.

Before you can celebrate finding the culprit, I'd like to point out that the commit in question changes code that isn't even built with my .config.

More test rounds to follow.

Comment 10 Petri Latvala 2018-06-08 08:07:07 UTC

Reverting the revert changes nothing as expected, bisecting is suspect. I'm tabling this bisecting effort for now.

Lowering prio.

Comment 11 Stanislav Lisovskiy 2018-08-30 11:25:12 UTC

Looks drm_wait_one_vblank fails because it is being called from intel_display_suspend/resume, when vblank interrupts are not going to happen because system is in the suspended state. This causes a time out if were unlucky not to have any vblank events accumulated in the queue. 
I guess the fix should be that we check if we are not in suspended state in intel_pre_plane_update, otherwise it makes no sense to wait for it.

I have done a small fix, so I just need to verify it now.

Comment 12 Jani Saarinen 2018-09-18 11:15:52 UTC

Is this already fixed?

Comment 13 Martin Peres 2018-09-20 17:25:17 UTC

(In reply to Jani Saarinen from comment #12)
> Is this already fixed?

I can't tell yet using the CI results, because the chances of it happening were slim. Maybe Stan has landed a fix?

Comment 14 Jani Saarinen 2018-09-24 06:20:25 UTC

Stan, can you comment here?

Comment 15 Stanislav Lisovskiy 2018-09-24 07:19:29 UTC

I have investigated this issue and figured out that drm_wait_one_vblank was always timing out when going to suspend or resume. My idea was that this happens because it sometimes waits for vblank when vblank interrupts aren't turned on yet. However after some source code investigation it turned out that it at least tries to turn on the interrupts before intel_display_suspend/resume is called.

This seems to be not reproducible even in CI at the moment, so my guess was that something has been fixed already or the chances of reproducing it got even more slim.

Comment 16 Lakshmi 2018-09-25 12:27:52 UTC

Based on the investigation, I assume this issue has been fixed. Last seen this issue 1 month 1 week ago (or) 24 rounds of drmtip execution. This issue used to appear regularly.

Closing this bug as resolved/fixed.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.