Bug 97869

Summary: [drm] GPU HANG: ecode 9:0:0x85dfbfff, in gnome-shell [10783], reason: Ring hung, action: reset
Product: DRI
Reporter: brian.chapman1
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: medium
CC: intel-gfx-bugs, jozef.chudy
Version: XOrg git
Hardware: Other
OS: Linux (All)
Whiteboard:
i915 platform: SKL
i915 features: GPU hang
Attachments:
/var/spool/abrt/oops-date/backtrace (flags: none)
/var/spool/abrt/oops-date/dmesg (flags: none)
dmesg_error (flags: none)

Description brian.chapman1 2016-09-20 01:24:03 UTC
[   32.233465] [drm] stuck on render ring
[   32.233765] [drm] GPU HANG: ecode 9:0:0x85dfbfff, in gnome-shell [10783], reason: Ring hung, action: reset
[   32.233767] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   32.233768] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   32.233769] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   32.233770] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   32.233770] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   32.235871] drm/i915: Resetting chip after gpu hang
[   34.234003] [drm] RC6 on


[  448.405627] Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter bnep snd_hda_codec_hdmi arc4 intel_powerclamp coretemp snd_hda_codec_realtek intel_rapl iwlmvm kvm_intel snd_hda_codec_generic kvm mac80211 crc32_pclmul ghash_clmulni_intel ir_lirc_codec lirc_dev ir_mce_kbd_decoder ir_sanyo_decoder ir_sony_decoder snd_hda_intel iwlwifi snd_hda_codec snd_hda_core snd_hwdep ir_jvc_decoder snd_seq ir_rc6_decoder
[  448.405699]  snd_seq_device ir_rc5_decoder aesni_intel lrw gf128mul glue_helper snd_pcm ablk_helper cryptd cfg80211 ir_nec_decoder btusb sg bluetooth snd_timer pcspkr snd sdhci_pci rfkill sdhci shpchp soundcore i2c_i801 mmc_core rc_rc6_mce ite_cir rc_core acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 i2c_algo_bit drm_kms_helper crct10dif_pclmul crct10dif_common ahci libahci e1000e crc32c_intel ptp pps_core libata drm video i2c_hid i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  448.405781] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W      ------------   3.10.0-327.36.1.el7.x86_64 #1
[  448.405787] Hardware name:                  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
[  448.405792]  ffff88046ec83dd8 cbf661ff1e0a5065 ffff88046ec83d90 ffffffff81636301
[  448.405803]  ffff88046ec83dc8 ffffffff8107b260 ffff8804581ee000 ffff8804533a6000
[  448.405812]  ffff8804533a61a8 0000000000000001 ffff8804527a0000 ffff88046ec83e30
[  448.405821] Call Trace:
[  448.405824]  <IRQ>  [<ffffffff81636301>] dump_stack+0x19/0x1b
[  448.405846]  [<ffffffff8107b260>] warn_slowpath_common+0x70/0xb0
[  448.405853]  [<ffffffff8107b2fc>] warn_slowpath_fmt+0x5c/0x80
[  448.405918]  [<ffffffffa022e542>] ? __intel_pageflip_stall_check+0xc2/0x110 [i915]
[  448.405972]  [<ffffffffa023ead8>] intel_check_page_flip+0xd8/0xf0 [i915]
[  448.406020]  [<ffffffffa0207fee>] gen8_irq_handler+0x39e/0x4a0 [i915]
[  448.406028]  [<ffffffff8111c78e>] handle_irq_event_percpu+0x3e/0x1e0
[  448.406034]  [<ffffffff8111c96d>] handle_irq_event+0x3d/0x60
[  448.406041]  [<ffffffff8111f607>] handle_edge_irq+0x77/0x130
[  448.406050]  [<ffffffff81016ecf>] handle_irq+0xbf/0x150
[  448.406058]  [<ffffffff810e15ba>] ? tick_check_idle+0x8a/0xd0
[  448.406066]  [<ffffffff8164243a>] ? atomic_notifier_call_chain+0x1a/0x20
[  448.406073]  [<ffffffff81648eaf>] do_IRQ+0x4f/0xf0
[  448.406082]  [<ffffffff8163e1ed>] common_interrupt+0x6d/0x6d
[  448.406085]  <EOI>  [<ffffffff814d4ac2>] ? cpuidle_enter_state+0x52/0xc0
[  448.406098]  [<ffffffff814d4c09>] cpuidle_idle_call+0xd9/0x210
[  448.406107]  [<ffffffff8101e4ee>] arch_cpu_idle+0xe/0x30
[  448.406114]  [<ffffffff810d64a5>] cpu_startup_entry+0x245/0x290
[  448.406122]  [<ffffffff8104768a>] start_secondary+0x1ba/0x230
[  448.406127] ---[ end trace 4ad207009d862dc5 ]---
[  454.309934] [drm] stuck on render ring
[  454.310879] [drm] GPU HANG: ecode 9:0:0x87f99ff9, in gnome-shell [13221], reason: Ring hung, action: reset
[  454.313234] drm/i915: Resetting chip after gpu hang
[  456.310579] [drm] RC6 on
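
For triage, the "GPU HANG" lines in a log like the one above can be pulled out mechanically. The following is a minimal sketch, assuming the line layout seen in this report; the regular expression and the function name `parse_gpu_hangs` are illustrative and not part of any i915 tooling.

```python
import re

# Pattern modelled on the "GPU HANG" lines quoted in this report (an assumption:
# other kernel versions may format the message differently).
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<ecode>[0-9a-fx:]+), in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_gpu_hangs(dmesg_text):
    """Return one dict per GPU HANG line found in the given dmesg text."""
    hangs = []
    for match in HANG_RE.finditer(dmesg_text):
        hangs.append(
            {
                "time": float(match.group("time")),  # seconds since boot
                "ecode": match.group("ecode"),
                "comm": match.group("comm"),         # process name
                "pid": int(match.group("pid")),
                "reason": match.group("reason"),
                "action": match.group("action"),
            }
        )
    return hangs


if __name__ == "__main__":
    sample = (
        "[   32.233765] [drm] GPU HANG: ecode 9:0:0x85dfbfff, in gnome-shell "
        "[10783], reason: Ring hung, action: reset\n"
    )
    for hang in parse_gpu_hangs(sample):
        print(hang)
```

Comparing the extracted ecode and PID across hangs shows whether the same process and error code recur; in this report both hangs come from gnome-shell but with different ecodes.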
Comment 1 yann 2016-09-20 07:55:41 UTC
Can you attach gpu crash dump located at /sys/class/drm/card0/error ?
Comment 2 brian.chapman1 2016-09-20 22:57:27 UTC
Created attachment 126680 [details]
/var/spool/abrt/oops-date/backtrace

Error file was empty, but I've attached other files.

I've also installed http://elrepo.org/linux/kernel/el7/x86_64/RPMS/kernel-ml-4.7.4-1.el7.elrepo.x86_64.rpm and things have been more stable.

Thanks
Comment 3 brian.chapman1 2016-09-20 22:58:16 UTC
Created attachment 126681 [details]
/var/spool/abrt/oops-date/dmesg
Comment 4 Jani Nikula 2016-09-21 07:20:47 UTC
(In reply to brian.chapman1 from comment #2)
> Error file was empty, but I've attached other files.

Did you try to cp the file, or did you just look at the size? The file in debugfs is generated on the fly, so the fs reports 0 size.
Comment 5 Jozef Chudy 2016-10-06 10:14:48 UTC
Created attachment 127049 [details]
dmesg_error

/sys/class/drm/card0/error: not present (0 KB in size).
Comment 6 Jozef Chudy 2016-10-06 10:16:12 UTC
(In reply to yann from comment #1)
> Can you attach gpu crash dump located at /sys/class/drm/card0/error ?

There is no error log at the path mentioned.
Comment 7 yann 2016-10-06 10:21:16 UTC
(In reply to Jozef Chudy from comment #6)
> (In reply to yann from comment #1)
> > Can you attach gpu crash dump located at /sys/class/drm/card0/error ?
> 
> There is no error log at the path mentioned.

A size of 0 KB is expected because this dump is generated dynamically when the file is read. So execute
"cat /sys/class/drm/card0/error | gzip > error.gz" and you should get it :)
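
The same capture can be scripted. The sketch below is an illustration under assumptions, not an official tool: the function name `save_error_state` and the default output name are hypothetical. It reads the file's content instead of trusting the reported size, compresses it, and flags the "no error state collected" placeholder text reported later in this thread.

```python
import gzip

# Placeholder text the kernel returns when no dump is available
# (wording taken from comment 8 of this report).
NO_STATE_MARKER = b"no error state collected"


def save_error_state(src="/sys/class/drm/card0/error", dst="error.gz"):
    """Read the error state, gzip it to dst, and return True if a dump was present.

    The file is read in full because its reported size is 0 even when
    content is available; the content only exists once it is read.
    """
    with open(src, "rb") as source:
        data = source.read()
    with gzip.open(dst, "wb") as target:
        target.write(data)
    return NO_STATE_MARKER not in data


if __name__ == "__main__":
    if save_error_state():
        print("Error state saved to error.gz")
    else:
        print("No error state collected; nothing useful to attach")
```

Reading the file usually requires root privileges, and a return value of False means the resulting archive holds only the placeholder message.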
Comment 8 Jozef Chudy 2016-10-06 10:29:41 UTC
Tried that. error.gz was created, but it contains only "no error state collected".
Comment 9 yann 2016-10-06 12:12:05 UTC
(In reply to Jozef Chudy from comment #8)
> Tried that. error.gz was created, but it contains only "no error state collected".

I would recommend updating your kernel and mesa to the latest versions, so that you benefit from all the work done on the GPU recovery code, and then re-testing.
In parallel, please attach a non-truncated kernel log (starting from boot up to the GPU hang message).
Comment 10 Ricardo 2017-02-22 16:31:42 UTC
Brian, maybe you missed the last comment on your bug; please review Yann's last comment and reply.

If the problem still exists with the newest configuration (kernel/mesa), update the bug with logs and set it to REOPENED.

If the problem is no longer there, please set the bug to RESOLVED.

If no action is taken within 30 days, the bug will be closed due to lack of activity.
Comment 11 Jari Tahvanainen 2017-04-10 09:24:55 UTC
Timeout - assuming this is fixed. If the problem still persists on the latest kernels (preferably drm-tip from git://anongit.freedesktop.org/git/drm-tip), please change the status to REOPENED, attach proper details (see https://01.org/linuxgraphics/documentation/how-report-bugs), and describe the use case you exercised that ended in the failure.
