Bug 105428 - Kernel 4.14.20, i915: pipe B vblank wait timed out
Summary: Kernel 4.14.20, i915: pipe B vblank wait timed out
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-03-10 17:38 UTC by Peter Klotz
Modified: 2018-06-04 11:06 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features: display/Other


Attachments
journalctl output (with drm.debug=0xe parameter set) showing the error (89.47 KB, application/gzip)
2018-04-24 18:07 UTC, Peter Klotz
no flags Details
Journal (3.79 MB, text/plain)
2018-05-14 06:36 UTC, Jani Saarinen
no flags Details

Description Peter Klotz 2018-03-10 17:38:35 UTC
Errors of this kind occur sporadically when using Arch Linux LTS kernels >= 4.14.20:

[77056.280048] pipe B vblank wait timed out
[77056.280074] ------------[ cut here ]------------
[77056.280142] WARNING: CPU: 1 PID: 55132 at drivers/gpu/drm/i915/intel_display.c:12202 intel_atomic_commit_tail+0xed8/0xee0 [i915]
[77056.280143] Modules linked in: tun iTCO_wdt iTCO_vendor_support coretemp kvm_intel i915 kvm arc4 irqbypass iwldvm mac80211 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core psmouse videodev pcspkr media btusb btrtl iwlwifi btbcm btintel bluetooth joydev mousedev evdev input_leds mac_hid ecdh_generic snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core i2c_algo_bit cfg80211 snd_hwdep r592 drm_kms_helper snd_pcm memstick lpc_ich drm snd_timer snd r8169 thermal mii soundcore syscopyarea intel_agp sysfillrect sysimgblt shpchp fb_sys_fops intel_gtt agpgart asus_laptop acpi_cpufreq sparse_keymap ac rfkill battery input_polldev video button sch_fq_codel vboxnetadp(O) pci_stub vboxpci(O) fuse vboxnetflt(O) vboxdrv(O) loop sg crypto_user
[77056.280200]  ip_tables x_tables ext4 crc16 mbcache jbd2 fscrypto dm_crypt cbc algif_skcipher af_alg hid_cherry hid_generic usbhid hid dm_mod sr_mod cdrom sd_mod serio_raw atkbd libps2 uhci_hcd ahci libahci libata scsi_mod sdhci_pci sdhci led_class firewire_ohci mmc_core firewire_core crc_itu_t i8042 serio ehci_pci ehci_hcd usbcore usb_common
[77056.280231] CPU: 1 PID: 55132 Comm: kworker/u4:3 Tainted: G        W  O    4.14.24-1-lts #1
[77056.280232] Hardware name: ASUSTeK Computer Inc.         B50A                /B50A      , BIOS 212     10/09/2009
[77056.280263] Workqueue: events_unbound intel_atomic_commit_work [i915]
[77056.280265] task: ffff9ebad92cea00 task.stack: ffffb01804904000
[77056.280294] RIP: 0010:intel_atomic_commit_tail+0xed8/0xee0 [i915]
[77056.280296] RSP: 0018:ffffb01804907de0 EFLAGS: 00010282
[77056.280297] RAX: 000000000000001c RBX: 0000000000000001 RCX: 0000000000000000
[77056.280299] RDX: 0000000000000000 RSI: ffff9ebbffd16598 RDI: ffff9ebbffd16598
[77056.280300] RBP: 00000000001f4b0a R08: 000000000001d2aa R09: 0000000000000004
[77056.280301] R10: ffffb01804907de0 R11: 0000000000000001 R12: ffff9ebbfcf00000
[77056.280302] R13: ffff9ebbfbe38800 R14: 0000000000000002 R15: 0000000000000000
[77056.280304] FS:  0000000000000000(0000) GS:ffff9ebbffd00000(0000) knlGS:0000000000000000
[77056.280305] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[77056.280307] CR2: 00007fe8722a6000 CR3: 000000023b00a000 CR4: 00000000000006e0
[77056.280308] Call Trace:
[77056.280317]  ? pick_next_task_fair+0x2ce/0x560
[77056.280320]  ? wait_woken+0x80/0x80
[77056.280324]  process_one_work+0x1da/0x3d0
[77056.280327]  worker_thread+0x2b/0x3f0
[77056.280329]  ? process_one_work+0x3d0/0x3d0
[77056.280332]  kthread+0x11a/0x130
[77056.280334]  ? kthread_create_on_node+0x70/0x70
[77056.280337]  ? SyS_exit+0x13/0x20
[77056.280341]  ret_from_fork+0x35/0x40
[77056.280343] Code: 24 88 01 00 00 48 8d 74 24 30 48 83 c7 08 e8 90 1f 4d c8 4d 85 ff 0f 85 da fa ff ff 8d 73 41 48 c7 c7 00 82 c8 c0 e8 c3 ca 4e c8 <0f> 0b e9 c4 fa ff ff 90 66 66 66 66 90 41 57 41 56 41 55 41 54
[77056.280378] ---[ end trace 9bf5213050c0390c ]---


00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. Mobile 4 Series Chipset Integrated Graphics Controller
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fd000000 (64-bit, non-prefetchable) [size=4M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at bc00 [size=8]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
        Capabilities: [d0] Power Management version 3
        Kernel driver in use: i915
        Kernel modules: i915


Once this problem has happened the machine needs a reboot because it is in some sort of "stutter" mode. All graphics operations are delayed. The problem seems to occur more likely if the system is under load like doing some I/O operation and watching a video in parallel. I currently have no scenario to reliably reproduce it.

What I can definitely say is that kernels 4.14.16 up to 4.14.19 did not show this error. Kernel 4.14.19 ran (according to my logs) for about a week without problems. This should have been enough time for the problem to show up.
Comment 1 Elizabeth 2018-03-12 16:11:06 UTC
Hello, could you please attach full dmesg with the error and preferable with drm.debug=0xe parameter on grub? Is it possible that you bisect this issue? Thank you.
Comment 2 Peter Klotz 2018-03-20 21:15:31 UTC
Sorry, but no luck so far.

It seems the drm.debug kernel parameter reduces the likelihood of the error significantly.
Comment 3 Jani Saarinen 2018-03-29 07:10:55 UTC
First of all. Sorry about spam.
This is mass update for our bugs. 

Sorry if you feel this annoying but with this trying to understand if bug still valid or not.
If bug investigation still in progress, please ignore this and I apologize!

If you think this is not anymore valid, please comment to the bug that can be closed.
If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.
Comment 4 Jani Saarinen 2018-04-20 15:57:29 UTC
Closing, please re-open if still occurs.
Comment 5 Peter Klotz 2018-04-24 05:11:23 UTC
The problem just occurred again in 4.14.34, so it is definitely still there.

What about a different approach finding the issue?
It must be some (graphics related) change that was made in 4.14.20 since 4.14.19 worked fine.
Comment 6 Jani Saarinen 2018-04-24 06:35:36 UTC
Can you try with latest drm-tip: https://cgit.freedesktop.org/drm-tip?
Comment 7 Peter Klotz 2018-04-24 18:07:25 UTC
Created attachment 139068 [details]
journalctl output (with drm.debug=0xe parameter set) showing the error

The problem occurred again. This time with kernel 4.14.35 and parameter drm.debug=0xe set.
The journalctl output contains all kernel messages starting from system boot.

The error occurs at "18:59:07" for the first time and is then repeated a few hundred times.

Hope this helps, Peter.
Comment 8 Peter Klotz 2018-05-11 19:53:43 UTC
The problem still occurs in LTS kernel 4.14.39. Will now give 4.16.x a try but my hopes aren't too high.

If the problem persists I am facing 3 options:

 * Stay at kernel 4.14.19 forever
 * Buy new hardware
 * Switch the operating system

Was the output I pasted any good?
Comment 9 Jani Saarinen 2018-05-11 20:15:56 UTC
HI, 
Can you try using https://cgit.freedesktop.org/drm-tip and send dmesg with drm.debug=0x1e log_buf_len=4M? and do not grep any logs but attact from start to error.
Comment 10 Peter Klotz 2018-05-11 20:43:07 UTC
Hi Jani

You realize, how hard it is to reproduce this issue, especially with debugging turned on?

Didn't the first output give you any hints were the problem might be located?

What do you mean by "not grep any logs". The first output should contain all the information from start to the point where the problem occurred. Is there a relevant difference between dmesg and journald messages?

Btw: The first output is approx. 4MB and the error happened 47 minutes after the reboot. Normally it takes longer (a few hours), so a 4MB buffer size will not be able to hold all messages from boot.

Regards, Peter.
Comment 11 Jani Saarinen 2018-05-14 06:36:53 UTC
Created attachment 139552 [details]
Journal

Pasting journal as plain text.
Comment 12 Jani Nikula 2018-05-17 10:27:34 UTC
Please try v4.16, v4.17-rc5, or drm-tip.
Comment 13 Peter Klotz 2018-05-17 21:04:40 UTC
Hi Jani

It seems that in 4.16.8 the original "pipe B vblank wait timed out" error is no longer reproducible. I am not 100% sure but I tried the scenarios that would previously sometimes trigger the bug.

However I got this (never seen before) warning in the dmesg output:

[434226.016578] ------------[ cut here ]------------
[434226.016603] WARN_ON(i915_gem_object_has_pinned_pages(obj))
[434226.016681] WARNING: CPU: 0 PID: 16958 at drivers/gpu/drm/i915/i915_gem.c:4668 __i915_gem_free_objects+0x2ad/0x2c0 [i915]
[434226.016682] Modules linked in: tun iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm irqbypass i915 arc4 psmouse pcspkr uvcvideo iwldvm videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 mac80211 btusb btrtl btbcm btintel videobuf2_common joydev bluetooth mousedev videodev input_leds iwlwifi i2c_algo_bit snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic media ecdh_generic drm_kms_helper r852 snd_hda_intel cfg80211 sm_common nand snd_hda_codec nand_ecc lpc_ich nand_bch r592 bch mtd memstick snd_hda_core r8169 mii snd_hwdep drm shpchp snd_pcm snd_timer snd rtc_cmos intel_agp asus_laptop intel_gtt syscopyarea sparse_keymap sysfillrect rfkill sysimgblt agpgart soundcore battery fb_sys_fops ac evdev input_polldev mac_hid acpi_cpufreq vboxnetadp(O) vboxpci(O) fuse vboxnetflt(O) vboxdrv(O)
[434226.016739]  loop sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_crypt cbc algif_skcipher af_alg hid_cherry hid_generic usbhid hid dm_mod sr_mod cdrom sd_mod serio_raw atkbd libps2 ahci libahci sdhci_pci uhci_hcd cqhci sdhci libata led_class firewire_ohci scsi_mod mmc_core firewire_core crc_itu_t ehci_pci ehci_hcd usbcore usb_common i8042 serio
[434226.016776] CPU: 0 PID: 16958 Comm: kworker/0:1 Tainted: G           O     4.16.8-1-ARCH #1
[434226.016777] Hardware name: ASUSTeK Computer Inc.         B50A                /B50A      , BIOS 212     10/09/2009
[434226.016805] Workqueue: events __i915_gem_free_work [i915]
[434226.016832] RIP: 0010:__i915_gem_free_objects+0x2ad/0x2c0 [i915]
[434226.016834] RSP: 0018:ffffa81e8405fe30 EFLAGS: 00010286
[434226.016836] RAX: 0000000000000000 RBX: ffff913eeb341e00 RCX: 0000000000000001
[434226.016837] RDX: 0000000080000001 RSI: 0000000000000092 RDI: 00000000ffffffff
[434226.016838] RBP: ffff913eeb341ef8 R08: ffffffffffcd35ac R09: 0000000000000593
[434226.016839] R10: ffffffffbf5dc720 R11: 0000000000000000 R12: ffff913fb5520000
[434226.016841] R13: ffff913e82374f00 R14: ffff913fb5524098 R15: ffff913fb5520070
[434226.016842] FS:  0000000000000000(0000) GS:ffff913fbfc00000(0000) knlGS:0000000000000000
[434226.016844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[434226.016845] CR2: 00003676fc9c6000 CR3: 0000000005f02000 CR4: 00000000000006f0
[434226.016846] Call Trace:
[434226.016878]  __i915_gem_free_work+0x62/0x90 [i915]
[434226.016884]  process_one_work+0x1d1/0x3b0
[434226.016887]  worker_thread+0x2b/0x3d0
[434226.016889]  ? process_one_work+0x3b0/0x3b0
[434226.016891]  kthread+0x112/0x130
[434226.016893]  ? kthread_create_on_node+0x60/0x60
[434226.016896]  ? do_syscall_64+0x74/0x190
[434226.016899]  ? SyS_exit+0x13/0x20
[434226.016901]  ret_from_fork+0x35/0x40
[434226.016904] Code: e7 65 ff 0d e6 29 16 3f 0f 85 d3 fd ff ff e8 a0 fc 14 fd e9 c9 fd ff ff 48 c7 c6 d0 af f9 c0 48 c7 c7 09 bd f8 c0 e8 7d 84 1c fd <0f> 0b c7 83 c8 01 00 00 00 00 00 00 e9 7b fe ff ff 66 90 66 66 
[434226.016942] ---[ end trace 41cebd42bf16ce99 ]---

Looks like Bug #105360.

Is there a possibility to backport the fix for the original issue to LTS kernel 4.14?

Regards, Peter
Comment 14 Jani Nikula 2018-05-22 15:47:29 UTC
(In reply to Peter Klotz from comment #13)
> Is there a possibility to backport the fix for the original issue to LTS
> kernel 4.14?

Perhaps... if we figure out what the fix was. The changes between v4.14.19 and v4.14.20 are

$ git log --oneline v4.14.19..v4.14.20 -- drivers/gpu/drm
a51421b4cb09 drm/i915: Avoid PPS HW/SW state mismatch due to rounding
050b86b5bf20 drm/i915: Fix deadlock in i830_disable_pipe()
50018d09843c drm/i915: Redo plane sanitation during readout
19d8e5122fef drm/i915: Add .get_hw_state() method for planes

which don't strike me as likely culprits. As you say, merely adding more debug outputs makes the problem less likely to occur. It could be something unrelated to the driver.

I think the best bet would be to do a reverse git bisect to find the commit that fixed this. Or bisect which commit introduced the problem in stable.

Anyway, seeing that the bug is fixed upstream, I'm closing the bug. Sorry. I suggest updating to a newer kernel.
Comment 15 Jani Saarinen 2018-06-04 11:06:01 UTC
Closing


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.