53186 – [snb] GPU hung after some time of usage - gen6_gt_check_fifodbg -rc6 related ?

Status:	CLOSED DUPLICATE of bug 50619

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Daniel Vetter

Reported:	2012-08-06 19:50 UTC by igaldino
Modified:	2017-07-24 23:00 UTC
CC List:	5 users

Attachments
3rd dmesg (494.62 KB, text/plain) 2012-08-06 19:50 UTC, igaldino
3rd i915_error_state (2.05 MB, text/x-log) 2012-08-06 19:52 UTC, igaldino
4th dmesg (197.25 KB, text/x-log) 2012-08-08 12:52 UTC, igaldino
This is my current sysctl (404 bytes, text/plain) 2012-08-08 13:14 UTC, igaldino
5th Xorg (83.61 KB, text/plain) 2012-08-08 13:16 UTC, igaldino
5th dmesg (494.36 KB, text/plain) 2012-08-08 13:16 UTC, igaldino
6th dmesg (114.94 KB, text/x-log) 2012-08-13 10:56 UTC, igaldino

Description igaldino 2012-08-06 19:50:40 UTC

Created attachment 65204
3rd dmesg

After some time using X, the display freezes. Only the mouse works, and I need to ssh in to restart the system.

Checking the logs, I can see 200 of the following WARNING messages:
[89248.305606] ------------[ cut here ]------------
[89248.305624] WARNING: at drivers/gpu/drm/i915/i915_drv.c:398 gen6_gt_check_fifodbg.isra.3+0x40/0x50 [i915]()
[89248.305628] Hardware name:         
[89248.305630] MMIO read or write has been dropped 3
[89248.305633] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek mei(C) snd_hda_intel snd_hda_codec snd_hwdep snd_pcm i915 snd_page_alloc snd_timer snd video i2c_algo_bit microcode ghash_clmulni_intel iTCO_wdt cryptd soundcore drm_kms_helper coretemp acpi_cpufreq joydev mperf psmouse serio_raw processor pcspkr button evdev iTCO_vendor_support drm shpchp pci_hotplug i2c_i801 i2c_core intel_agp intel_gtt e1000e crc32c_intel ext4 crc16 jbd2 mbcache usbhid hid sd_mod ahci xhci_hcd libahci libata scsi_mod ehci_hcd usbcore usb_common
[89248.305694] Pid: 469, comm: X Tainted: G        WC   3.4.7-1-ARCH #1
[89248.305698] Call Trace:
[89248.305707]  [<ffffffff810515bf>] warn_slowpath_common+0x7f/0xc0
[89248.305713]  [<ffffffff810516b6>] warn_slowpath_fmt+0x46/0x50
[89248.305723]  [<ffffffffa02c0490>] gen6_gt_check_fifodbg.isra.3+0x40/0x50 [i915]
[89248.305732]  [<ffffffffa02c081e>] __gen6_gt_force_wake_put+0x1e/0x20 [i915]
[89248.305742]  [<ffffffffa02c0d11>] i915_read32+0x131/0x150 [i915]
[89248.305755]  [<ffffffffa02ffe90>] intel_ring_get_active_head+0x30/0x40 [i915]
[89248.305766]  [<ffffffffa02ffee5>] gen6_ring_get_seqno+0x45/0x50 [i915]
[89248.305779]  [<ffffffffa02d5fba>] i915_gem_throttle_ioctl+0xba/0x240 [i915]
[89248.305786]  [<ffffffff811818a0>] ? __pollwait+0xf0/0xf0
[89248.305797]  [<ffffffffa01cf483>] drm_ioctl+0x4c3/0x570 [drm]
[89248.305809]  [<ffffffffa02d5f00>] ? i915_gem_busy_ioctl+0x170/0x170 [i915]
[89248.305817]  [<ffffffff81246864>] ? timerqueue_del+0x34/0x90
[89248.305824]  [<ffffffff81076f20>] ? __remove_hrtimer+0x60/0xc0
[89248.305830]  [<ffffffff81180ab7>] do_vfs_ioctl+0x97/0x530
[89248.305836]  [<ffffffff8105700c>] ? do_setitimer+0x1cc/0x260
[89248.305841]  [<ffffffff81180fe9>] sys_ioctl+0x99/0xa0
[89248.305848]  [<ffffffff8146aaa9>] system_call_fastpath+0x16/0x1b
[89248.305852] ---[ end trace 2e392e332536dc75 ]---
[89248.306807] ------------[ cut here ]------------

and finally, when it freezes, I get:
[90387.362163] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[90387.362167] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[90387.365284] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000000 tail 00000000 start 00000000
[90401.649006] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[90401.649226] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000000 tail 00000000 start 00000000
[90454.707303] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

Bug 53169 seems to be related, but I can't tell for sure.

I'm attaching logs from 3 different times that this issue happened.
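
A tally like the one above can be reproduced from a saved dmesg with a short script. This is only a minimal sketch and not part of the original report: the function name `summarize_i915_log` is hypothetical, and the two match strings are copied from the log excerpts above.

```python
from collections import Counter

# Substrings copied from the dmesg excerpts in this report.
DROPPED_MMIO = "MMIO read or write has been dropped"
GPU_HUNG = "Hangcheck timer elapsed... GPU hung"


def summarize_i915_log(text):
    """Count dropped-MMIO warnings and hangcheck errors in dmesg text."""
    counts = Counter()
    for line in text.splitlines():
        if DROPPED_MMIO in line:
            counts["dropped_mmio"] += 1
        elif GPU_HUNG in line:
            counts["gpu_hung"] += 1
    return counts


# Three lines taken from the excerpts above, used as sample input.
sample = """\
[89248.305630] MMIO read or write has been dropped 3
[90387.362163] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[90401.649006] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
"""
print(summarize_i915_log(sample))
```

To run it on a real log, pass the contents of a saved dmesg file as `text` instead of `sample`.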

Comment 1 igaldino 2012-08-06 19:52:48 UTC

Created attachment 65206
3rd i915_error_state

Comment 2 igaldino 2012-08-06 19:54:32 UTC

Ok, added logs just for the last occurrence of the issue. Please let me know if you need more details.

Comment 3 Chris Wilson 2012-08-06 21:18:25 UTC

Looks to be a similar bug to bug 50545. Does the issue no longer manifest if you disable rc6 by adding i915.i915_enable_rc6=0 to your kernel parameters?
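
After rebooting, it may help to confirm that the option actually reached the kernel command line. The following is a rough sketch under stated assumptions: the helper `module_param_from_cmdline` is hypothetical, it only handles plain `module.param=value` tokens (no quoting, no dash/underscore aliasing), and it keeps the last occurrence if the option appears more than once.

```python
import os


def module_param_from_cmdline(cmdline, module, param):
    """Return the value of module.param on a kernel command line, or None."""
    key = f"{module}.{param}="
    value = None
    for token in cmdline.split():
        if token.startswith(key):
            value = token[len(key):]  # this sketch keeps the last occurrence
    return value


# Illustrative command lines (hypothetical, not taken from the reporter's box).
with_param = "root=/dev/sda1 ro i915.i915_enable_rc6=0"
without_param = "root=/dev/sda1 ro"
print(module_param_from_cmdline(with_param, "i915", "i915_enable_rc6"))
print(module_param_from_cmdline(without_param, "i915", "i915_enable_rc6"))

# On a running Linux system the same check can be made against /proc/cmdline.
if os.path.exists("/proc/cmdline"):
    with open("/proc/cmdline") as f:
        print(module_param_from_cmdline(f.read(), "i915", "i915_enable_rc6"))
```

A result of `None` means the option is not on the command line at all, so the driver would be using its default.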

Comment 4 igaldino 2012-08-06 21:29:52 UTC

The dmesg for bug 50545 seems different.
Anyway, I'll try i915.i915_enable_rc6=0 and I'll let you know the results.

Thanks for the fast response.

Comment 5 Chris Wilson 2012-08-06 21:33:29 UTC

Hmm, I did a quick grep for the rc6+dropped mmio bug... The precise bug I was looking for is no longer in that list (and you are right, it is not bug 50545). Daniel is probably hiding it from me again...

Comment 6 Chris Wilson 2012-08-07 20:44:51 UTC

The actual bug I intended to reference was bug 50619.

Comment 7 igaldino 2012-08-08 12:42:12 UTC

I have added i915.i915_enable_rc6=0 to my kernel parameters and so far it seems the issue is gone:

[root@odin ~]# uptime
 09:35:55 up 22:08,  2 users,  load average: 3,10, 2,63, 2,60

Although I'm running a desktop, I understand this parameter will help to save some power, so I would like to know if/when this issue will be fixed.

I'm attaching the current dmesg and this is the current i915_error_state:

[root@odin etc]# cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected

Thanks.

Comment 8 igaldino 2012-08-08 12:52:27 UTC

Created attachment 65280
4th dmesg

Comment 9 igaldino 2012-08-08 13:12:49 UTC

It seems that I spoke too soon.

My box just hung. I'm attaching the logs.

Comment 10 igaldino 2012-08-08 13:14:11 UTC

Created attachment 65282
This is my current sysctl

Comment 11 igaldino 2012-08-08 13:16:09 UTC

Created attachment 65283
5th Xorg

Comment 12 igaldino 2012-08-08 13:16:54 UTC

Created attachment 65285
5th dmesg

Comment 13 igaldino 2012-08-08 13:25:52 UTC

I've attached my sysctl.conf so you can check whether anything in it could lead to this.

Afterwards I checked my grub setup and found two options I added some time ago because of an MTRR-related error I was receiving:

   enable_mtrr_cleanup
   mtrr_spare_reg_nr=1

Should I try to run without them?

Xorg version:
=============
X.Org X Server 1.12.3
Release Date: 2012-07-09
X Protocol Version 11, Revision 0
Build Operating System: Linux 3.4.4-3-ARCH x86_64 
Current Operating System: Linux odin 3.4.7-1-ARCH #1 SMP PREEMPT Sun Jul 29 22:02:56 CEST 2012 x86_64
Kernel command line: root=/dev/disk/by-uuid/8e2f80f0-c4ee-44b2-a446-39f0de4ff9a6 ro vga=773 enable_mtrr_cleanup mtrr_spare_reg_nr=1
Build Date: 09 July 2012  03:59:39PM
 
Current version of pixman: 0.26.2
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.

xorg-bdftopcf 1.0.3-2
xorg-font-util 1.3.0-1
xorg-font-utils 7.6-3
xorg-fonts-alias 1.0.2-2
xorg-fonts-encodings 1.0.4-3
xorg-fonts-misc 1.0.1-2
xorg-iceauth 1.0.5-1
xorg-mkfontdir 1.0.7-1
xorg-mkfontscale 1.1.0-1
xorg-server 1.12.3-1
xorg-server-common 1.12.3-1
xorg-server-utils 7.6-3
xorg-sessreg 1.0.7-1
xorg-setxkbmap 1.3.0-1
xorg-utils 7.6-8
xorg-xauth 1.0.7-1
xorg-xbacklight 1.1.2-3
xorg-xcmsdb 1.0.4-1
xorg-xdpyinfo 1.3.0-1
xorg-xdriinfo 1.0.4-3
xorg-xev 1.2.0-1
xorg-xgamma 1.0.5-1
xorg-xhost 1.0.5-1
xorg-xinit 1.3.2-1
xorg-xinput 1.6.0-1
xorg-xkbcomp 1.2.4-1
xorg-xlsatoms 1.1.1-1
xorg-xlsclients 1.1.2-2
xorg-xmessage 1.0.3-2
xorg-xmodmap 1.0.7-1
xorg-xprop 1.2.1-1
xorg-xrandr 1.3.5-1
xorg-xrdb 1.0.9-2
xorg-xrefresh 1.0.4-3
xorg-xset 1.2.2-1
xorg-xsetroot 1.1.0-3
xorg-xvinfo 1.1.1-3
xorg-xwininfo 1.1.2-1
intel-dri 8.0.4-2
libva-driver-intel 1.0.18-1
xf86-video-intel 2.20.2-2

i915_error_state:
=================
no error state collected

uptime:
=======
 10:00:04 up 22:32,  3 users,  load average: 1,22, 1,09, 1,77

Please let me know if you need anything else.

Thanks again.

Comment 14 Daniel Vetter 2012-08-08 13:29:06 UTC

Hm, I don't know whether you can change that with sysctl; i915_enable_rc6 is a module option ... Maybe double-check in /sys/module/i915/parameters/i915_enable_rc6 whether it works?

Comment 15 igaldino 2012-08-08 13:37:43 UTC

And, guess what? You are right :-S

I've changed my grub setup; let's see what happens.

Comment 16 igaldino 2012-08-08 13:38:16 UTC

I forgot to add:
[root@odin ~]# cat /sys/module/i915/parameters/i915_enable_rc6
-1

This is what it was before.

Comment 17 igaldino 2012-08-08 13:51:00 UTC

And this is after setting the kernel parameter:

[root@odin ~]# cat /sys/module/i915/parameters/i915_enable_rc6
0
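
The before/after check can also be scripted. The following is a sketch only: the function names are hypothetical, and the reading of the values (0 disables rc6, a negative value leaves the choice to the driver's per-chip default, a positive value is a bitmask of enabled rc6 states) is an assumption based on how this module parameter was described in kernels of this era, so verify it against your own kernel before relying on it.

```python
from pathlib import Path

# Path as shown in the shell transcript above.
PARAM = Path("/sys/module/i915/parameters/i915_enable_rc6")


def describe_rc6(raw):
    """Turn the raw parameter text into a short description (assumed meanings)."""
    value = int(raw.strip())
    if value == 0:
        return "rc6 disabled"
    if value < 0:
        return "driver default (per-chip)"
    return f"rc6 enabled, bitmask {value}"


def read_rc6(path=PARAM):
    """Return a description, or None if the file is missing or unreadable."""
    try:
        return describe_rc6(path.read_text())
    except OSError:
        return None


print(describe_rc6("-1\n"))  # the value seen before the grub change
print(describe_rc6("0\n"))   # the value seen after the grub change
print(read_rc6())            # None where the parameter file is absent
```

Reading the file may require root, as in the transcript above; the sketch returns `None` rather than raising in that case.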

Comment 18 Chris Wilson 2012-08-08 19:17:24 UTC

Well that just sunk my best theory!

Comment 19 igaldino 2012-08-08 19:48:27 UTC

Chris, I've changed the parameter correctly this time and so far, after 6 hours, there is not a single WARNING message in dmesg.

So, the boat is still floating.

Comment 20 Chris Wilson 2012-08-11 13:16:05 UTC

(In reply to comment #19)
> Chris, I've changed the parameter correctly this time and so far, after 6
> hours, there is not a single WARNING message in dmesg.
> 
> So, the boat is still floating.

Can you please confirm that you do not see any hangs whilst rc6 is disabled?

Comment 21 Ben Widawsky 2012-08-12 06:59:26 UTC

With some odd exceptions, the error state seems to indicate the GPU was idle when the hangcheck elapsed. Or is it just me?

Comment 22 Chris Wilson 2012-08-12 08:03:38 UTC

Considering that some of the missed writes were to update the ring tail pointer, trying to guess what state the GPU is in seems fraught.

Comment 23 igaldino 2012-08-13 10:55:46 UTC

I confirm no errors since the driver parameter change:

isaque@odin:~$ uptime
 07:52:03 up 4 days,  7:22,  2 users,  load average: 0,62, 1,19, 1,57

Now, what do I gain/lose by having this parameter disabled?

Thx.

Comment 24 igaldino 2012-08-13 10:56:54 UTC

Created attachment 65494
6th dmesg

Just for your record, in case you want to check anything else.

Comment 25 Chris Wilson 2012-08-13 17:55:16 UTC

Sounds like we can safely coalesce this bug report into the original "rc6 explodes randomly".

*** This bug has been marked as a duplicate of bug 50619 ***
