Bug 94927 - [drm] GPU HANG: ecode 9:0:0x85dffffb, in compiz [1755], reason: Ring hung, action: reset
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 94464 95167
Depends on:
Blocks:
 
Reported: 2016-04-14 00:23 UTC by john.stultz
Modified: 2017-07-24 23:15 UTC

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
full dmesg (69.33 KB, text/plain)
2016-04-14 00:24 UTC, john.stultz
/sys/class/drm/card0/error log (3.47 MB, text/plain)
2016-04-14 00:25 UTC, john.stultz
cat /sys/class/drm/card0/error (559.28 KB, text/plain)
2016-08-13 01:05 UTC, Sverd Johnsen

Description john.stultz 2016-04-14 00:23:57 UTC
Possible duplicate of #94464, but filing just in case.

After logging into my Ubuntu 15.10 system, I get some blocky graphics glitches, then a long stall (during which the following message is printed to dmesg). It will sometimes recover for a few seconds and then stall again with more messages, over and over. Eventually the system crashes hard, and I've not been able to capture that log.

Warning in dmesg:

[   18.898605] [drm] stuck on render ring
[   18.899840] [drm] GPU HANG: ecode 9:0:0x85dffffb, in compiz [1755], reason: Ring hung, action: reset
[   18.899847] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   18.899851] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   18.899855] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   18.899858] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   18.899862] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   18.900022] ------------[ cut here ]------------
[   18.900117] WARNING: CPU: 1 PID: 928 at /build/linux-HVWSXI/linux-4.2.0/drivers/gpu/drm/drm_crtc.c:5357 drm_mode_page_flip_ioctl+0x25a/0x340 [drm]()
[   18.900126] Modules linked in: rfcomm xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 bridge stp llc ebtable_filter ebtables snd_hda_codec_hdmi bnep arc4 input_leds joydev snd_hda_codec_realtek snd_hda_codec_generic hid_generic 8250_dw snd_hda_intel iwlmvm mac80211 snd_hda_codec intel_rapl snd_hda_core x86_pkg_temp_thermal intel_powerclamp uvcvideo coretemp videobuf2_vmalloc videobuf2_memops snd_usb_audio videobuf2_core kvm_intel v4l2_common snd_usbmidi_lib snd_hwdep usbhid videodev media kvm crct10dif_pclmul crc32_pclmul snd_pcm snd_seq_midi iwlwifi snd_seq_midi_event aesni_intel snd_rawmidi aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_seq cfg80211 snd_seq_device idma64 snd_timer virt_dma snd ir_xmp_decoder intel_lpss_pci shpchp soundcore ir_lirc_codec
[   18.900353]  lirc_dev ir_mce_kbd_decoder ir_sharp_decoder btusb ir_sanyo_decoder btrtl ir_sony_decoder btbcm btintel ir_jvc_decoder ir_rc6_decoder ir_rc5_decoder bluetooth ir_nec_decoder mei_me mei rc_rc6_mce ite_cir rc_core ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 intel_lpss_acpi intel_lpss acpi_als kfifo_buf xt_hl acpi_pad mac_hid industrialio ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter ip_tables parport_pc x_tables ppdev lp parport autofs4 i915 e1000e i2c_algo_bit ptp drm_kms_helper pps_core sdhci_pci sdhci drm ahci libahci video
[   18.900543]  i2c_hid pinctrl_sunrisepoint hid pinctrl_intel
[   18.900565] CPU: 1 PID: 928 Comm: Xorg Not tainted 4.2.0-35-generic #40-Ubuntu
[   18.900574] Hardware name:                  /NUC6i5SYB, BIOS SYSKLi35.86A.0042.2016.0409.1246 04/09/2016
[   18.900582]  0000000000000286 00000000e19f05e8 ffff88008559bcb8 ffffffff817f1d7e
[   18.900597]  0000000000000000 ffffffffc00ab5a8 ffff88008559bcf8 ffffffff8107cb46
[   18.900610]  ffff88045c3ac000 ffff88008559bdc0 ffff88045bc09060 0000000000000000
[   18.900630] Call Trace:
[   18.900652]  [<ffffffff817f1d7e>] dump_stack+0x63/0x81
[   18.900670]  [<ffffffff8107cb46>] warn_slowpath_common+0x86/0xc0
[   18.900684]  [<ffffffff8107cc7a>] warn_slowpath_null+0x1a/0x20
[   18.900754]  [<ffffffffc008a0ea>] drm_mode_page_flip_ioctl+0x25a/0x340 [drm]
[   18.900775]  [<ffffffff810a77b0>] ? wake_up_q+0x70/0x70
[   18.900830]  [<ffffffffc0079505>] drm_ioctl+0x125/0x610 [drm]
[   18.900897]  [<ffffffffc0089e90>] ? drm_mode_gamma_get_ioctl+0xe0/0xe0 [drm]
[   18.900918]  [<ffffffff813d3ab9>] ? timerqueue_add+0x59/0xb0
[   18.900933]  [<ffffffff810e9c41>] ? enqueue_hrtimer+0x41/0x90
[   18.900946]  [<ffffffff810ea79d>] ? hrtimer_start_range_ns+0x1cd/0x3d0
[   18.900965]  [<ffffffff81213cd5>] do_vfs_ioctl+0x295/0x480
[   18.900980]  [<ffffffff816d031c>] ? __sys_recvmsg+0x8c/0xa0
[   18.900994]  [<ffffffff81213f39>] SyS_ioctl+0x79/0x90
[   18.901012]  [<ffffffff817f8c72>] entry_SYSCALL_64_fastpath+0x16/0x75
[   18.901021] ---[ end trace 8e70d1f7eba62526 ]---
[   18.902209] drm/i915: Resetting chip after gpu hang
[   20.874816] [drm] RC6 on
Comment 1 john.stultz 2016-04-14 00:24:21 UTC
Created attachment 122906 [details]
full dmesg
Comment 2 john.stultz 2016-04-14 00:25:15 UTC
Created attachment 122907 [details]
/sys/class/drm/card0/error log
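
Since the error state is several megabytes, it may help to compress it before attaching. A minimal sketch of one way to do that, assuming the dump is at the path dmesg reported and that the script can read it (typically this needs root). The function name save_error_state and the output file name are illustrative only.

```python
import gzip
import shutil


def save_error_state(src="/sys/class/drm/card0/error", dst="i915-error.txt.gz"):
    """Copy the i915 error state into a gzip file and return the destination path."""
    # Stream the contents in chunks rather than relying on the file size,
    # since the size reported for sysfs files may not match their contents.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling save_error_state() with no arguments right after a hang, before rebooting, should leave a compressed copy in the current directory.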
Comment 3 john.stultz 2016-04-14 00:34:25 UTC
Oh, this is on my NUC6i5SYH with the latest (v42) BIOS. However, it started happening with the v28 BIOS that the system came with.

The odd part is that this only cropped up around the start of the month. Prior to that, the system had worked well for months.

I've tried jumping back to an older kernel, but that didn't help.

I don't see any userspace updates that would have affected the system between the time it was working fine and when it wasn't.

If I don't log in, and switch to the console VT, I can use the system w/o issue. But since I need the system for graphical use, it's sort of a brick now.
Comment 4 john.stultz 2016-04-14 00:38:13 UTC
Also, using the 15.10 live-cd, I've triggered the same problem. Which seems odd, as I don't recall seeing it when I originally installed the machine.

I've also run memcheck and didn't find any issues there, so it doesn't seem like the memory in the system has suddenly gone bad.

Using i915.enable_rc6=0 doesn't seem to solve the issue. Nor intel_pstate=disable.

Sometimes those allow the system to run for longer, w/o the graphical corruption, but when a hang occurs it's a hard hang, and I can't recover.
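
To rule out the parameters simply not being applied, the running kernel's command line can be checked. A minimal sketch, assuming the command line is read from /proc/cmdline and passed in as a string. parse_cmdline is a hypothetical helper, and it does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    # Example string only; on a live system, read the contents of /proc/cmdline.
    sample = "ro quiet splash i915.enable_rc6=0 intel_pstate=disable"
    params = parse_cmdline(sample)
    for name in ("i915.enable_rc6", "intel_pstate"):
        print(name, "=", params.get(name, "<not set>"))
```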
Comment 5 john.stultz 2016-04-14 01:59:09 UTC
Tried the drm-nightly kernel here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2016-04-13-wily/

No real change. Still hangs quickly after logging in.
Comment 6 john.stultz 2016-04-22 21:31:42 UTC
Just tried booting the Ubuntu 16.04 live-usb installation media. After logging in I see the same blocky graphic glitches and long stalls.

The warning in dmesg looks the same as well there.
Comment 7 john.stultz 2016-04-22 22:00:46 UTC
I did try the latest nightly, as it looks like it has some recent fixes for Skylake GPU hangs.

http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2016-04-22-wily/

But I still saw the blocky graphic noise after logging in, and then it hung. I rebooted and hopped over to the VT console after logging in, and there it hit some "general protection fault: 0000 [#1] SMP" errors that seem to trigger from generic_permission().
Comment 8 Chris Wilson 2016-04-22 22:48:07 UTC
(In reply to john.stultz from comment #7)
> I did try the latest nightly, as it looks like it has some recent fixes for
> skylake gpu hangs.
> 
> http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2016-04-22-wily/
> 
> But I still saw the blocky graphic noise after logging in, and then it hung.
> I rebooted and hopped over to the VT console after logging in and there it
> hit some "general protection fault: 0000 [#1] SMP" errors that seem to
> trigger from generic_permission().

GPFs are a far more serious problem. Did anything make it to the syslog, or can you grab a photograph of whatever error remains on screen?
Comment 9 john.stultz 2016-04-22 22:52:56 UTC
Oh joy.

Hopefully this isn't the Linux equivalent of the WHEA error issue these NUCs are seeing in Windows. :/

I'll try to capture a picture here shortly. I do appreciate the feedback!
Comment 10 john.stultz 2016-04-22 23:09:09 UTC
Hrm. So the GPFs didn't reproduce the next few boots.

I did see the same blocky graphics noise, and this time I saw GPU HANG messages very similar to those in the original report (though w/o the backtrace now).

I dunno what else to do. I'm doing another round of memtest (it made it through successful runs previously after I started having this issue) just to make sure.
Comment 11 yann 2016-05-18 15:24:45 UTC
*** Bug 94464 has been marked as a duplicate of this bug. ***
Comment 12 yann 2016-05-18 15:26:34 UTC
*** Bug 95167 has been marked as a duplicate of this bug. ***
Comment 13 john.stultz 2016-06-08 23:01:22 UTC
Just FYI here, it seems this is caused by a hardware failure (at least in my case).

After reading about the similar, suddenly appearing WHEA errors folks were seeing on Windows with the Skylake NUCs, I went through the RMA process, and the replacement NUC (with new BIOS and new -503 model number) does not show the problem with the exact same disk drive.
Comment 14 Abby 2016-06-09 09:45:19 UTC
FYI my hangs completely stopped after adding kernel parameter:

i915.enable_rc6=0

https://wiki.ubuntu.com/Kernel/KernelBootParameters
Comment 15 Abby 2016-06-09 09:48:01 UTC
(In reply to Abby from comment #14)
> FYI my hangs completely stopped after adding kernel parameter:
> 
> i915.enable_rc6=0
> 
> https://wiki.ubuntu.com/Kernel/KernelBootParameters

Some extra details: I'm running Ubuntu 16.04 on a Dell XPS 13 9350, with a second screen connected via a Dell USB-C port hub.
Comment 16 Sverd Johnsen 2016-08-13 01:05:57 UTC
Created attachment 125760 [details]
cat /sys/class/drm/card0/error
Comment 17 Jari Tahvanainen 2017-04-10 14:18:33 UTC
Marking this resolved+fixed based on comment 13 and comment 15.
Sverd - if your problem still exists on the latest kernels (preferably drm-tip from git://anongit.freedesktop.org/git/drm-tip), please create a new bug (or reuse this one if the failure is exactly the same as in the description) with attachments (see https://01.org/linuxgraphics/documentation/how-report-bugs) and a description of the use case you are exercising.

