60437 – [snb] rc6 causes system hang once a few days ending with corrupted screen

Summary: [snb] rc6 causes system hang once a few days ending with corrupted screen

Status:	CLOSED NOTABUG

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-02-07 18:42 UTC by Jakub Luzny
Modified:	2017-07-24 22:58 UTC
CC List:	1 user

See Also:
i915 platform:
i915 features:

Attachments
Screenshot of hung system (737.24 KB, image/jpeg) 2013-02-07 18:42 UTC, Jakub Luzny	no flags
Xorg log (109.16 KB, text/plain) 2013-02-08 12:38 UTC, Jakub Luzny	no flags
Dmesg log (61.61 KB, text/plain) 2013-02-08 12:39 UTC, Jakub Luzny	no flags
drm/i915: init hardware to a known state on resume v3 (58.81 KB, patch) 2013-05-06 14:10 UTC, Mika Kuoppala	no flags

Description Jakub Luzny 2013-02-07 18:42:30 UTC

Created attachment 74372
Screenshot of hung system

Hello,

I have problems with my Sandy Bridge GPU. The system hangs randomly, roughly every 2-3 days. With RC6 disabled, I had 20 days of uptime without a problem. It happens on 3.6, 3.7 and even 3.8-rc6. The system is totally frozen when this happens.

I'm using an MSI CR640 laptop with an i3-2310M and
00:02.0 VGA compatible controller [0300]: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller [8086:0116] (rev 09)

Comment 1 Chris Wilson 2013-02-07 21:18:47 UTC

Can you please attach your dmesg and Xorg.0.log?

Comment 2 Jakub Luzny 2013-02-08 12:38:23 UTC

Created attachment 74425
Xorg log

Comment 3 Jakub Luzny 2013-02-08 12:39:28 UTC

Created attachment 74426
Dmesg log

Comment 4 Ben Widawsky 2013-02-09 18:18:39 UTC

Does removing pcie_aspm=force have any effect?

Comment 5 Jakub Luzny 2013-02-11 17:14:30 UTC

Removing pcie_aspm=force disables ASPM, at least according to dmesg. Tried running with ASPM off, crashed again.
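
For anyone repeating this check, the active ASPM policy can also be read at runtime rather than inferred from dmesg. A minimal sketch, assuming the kernel exposes /sys/module/pcie_aspm/parameters/policy and marks the active policy with square brackets; the helper name active_aspm_policy is made up for illustration:

```shell
# Print the active ASPM policy from a policy string such as
# "default performance [powersave]" (the active entry is in brackets).
# active_aspm_policy is a hypothetical helper name, not an existing tool.
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage on a running system (assumes the sysfs file is present):
#   active_aspm_policy "$(cat /sys/module/pcie_aspm/parameters/policy)"
```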

Comment 6 Ben Widawsky 2013-02-13 03:58:42 UTC

A bit of a long shot, but can you try disabling ppgtt instead of rc6, and see if it hangs?
from grub:
i915.i915_enable_ppgtt=0 
or cmdline:
modprobe i915 i915_enable_ppgtt=0

Also perhaps try turning off the IOMMU if it's on.
From grub, it's something like:
intel_iommu=off
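
Before waiting days for a hang, it may be worth confirming that the parameters actually reached the kernel. A minimal sketch, assuming the options were passed on the kernel command line as key=value words; cmdline_value is a hypothetical helper name used only here:

```shell
# Print the value of a key=value kernel parameter found in a command-line
# string, or nothing if it is absent. cmdline_value is a hypothetical
# helper name used only for this illustration.
cmdline_value() {
    # $1 = command line text, $2 = parameter name (exact match)
    for word in $1; do
        case "$word" in
            "$2"=*) printf '%s\n' "${word#*=}" ;;
        esac
    done
}

# Usage on a running system:
#   cmdline_value "$(cat /proc/cmdline)" i915.i915_enable_ppgtt
#   cmdline_value "$(cat /proc/cmdline)" intel_iommu
```

If the option was given to modprobe instead, it will not appear in /proc/cmdline; module parameters are normally visible under /sys/module/i915/parameters/.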

Comment 7 Jakub Luzny 2013-02-13 13:59:22 UTC

Tried disabling both the IOMMU and ppgtt; crashed again. Also tried netconsole, but nothing was sent during the crash :(
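
For reference, a netconsole setup of the kind mentioned here is usually given as a single kernel parameter. The sketch below only assembles that string, following the documented form netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]; every address, port and interface in the usage line is a placeholder, and netconsole_param is a hypothetical helper name:

```shell
# Build a netconsole= kernel parameter from its parts.
# netconsole_param is a hypothetical helper; it does not configure anything.
netconsole_param() {
    # $1 = local port, $2 = local IP, $3 = local interface,
    # $4 = remote port, $5 = remote IP, $6 = remote MAC address
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Usage (all values are placeholders, not taken from this report):
#   netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55
```

Since netconsole depends on the kernel still being able to send packets, silence during a hard hang does not by itself rule out a kernel-side fault.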

Comment 8 Ben Widawsky 2013-02-14 05:43:24 UTC

Again a long shot, but recently a patch went in to drm-intel-next-queued which has some correlation:
http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=551a618f709768e373fc72bd5dde091632e2b695

Can you please try either that patch, or the branch itself. Whichever you are most comfortable with.

Comment 9 Jakub Luzny 2013-02-14 15:51:15 UTC

Applied this patch to 3.8rc6, crashed again.

A very looong shot, but is there any way to bump RC6 voltage to more than 450mV?

Comment 10 Ben Widawsky 2013-02-17 00:09:14 UTC

(In reply to comment #9)
> Applied this patch to 3.8rc6, crashed again.
> 
> A very looong shot, but is there any way to bump RC6 voltage to more than
> 450mV?

Yes sure. First you can verify the setting was correct with
cat /sys/kernel/debug/dri/0/i915_drpc_info

Since we're entering unsafe territory, I'll attach this as a patch, but inline. Up to you to try it...

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 82b68fe..9126b7f 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -2664,6 +2664,10 @@ static void gen6_enable_rps(struct drm_device *dev)
                        DRM_ERROR("Couldn't fix incorrect rc6 voltage\n");
        }
 
+#define RC6_VOLTAGE 500
+       rc6vids &= 0xffff00;
+       rc6vids |= GEN6_ENCODE_RC6_VID(RC6_VOLTAGE);
+       BUG_ON(sandybridge_pcode_write(dev_priv, GEN6_PCODE_WRITE_RC6VIDS, rc6vids));
        gen6_gt_force_wake_put(dev_priv);
 }
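
To see what the patch above writes, the encoding can be worked through by hand. This is a sketch under two assumptions that should be checked against the kernel tree in use: that GEN6_ENCODE_RC6_VID(mv) is ((mv) - 245) / 5, and that the RC6 voltage occupies the low byte of rc6vids (which is what the 0xffff00 mask suggests). The function names are made up for illustration:

```shell
# Encode a millivolt value the way GEN6_ENCODE_RC6_VID is assumed to.
# ASSUMPTION: the macro is ((mv) - 245) / 5; verify in i915_reg.h.
encode_rc6_vid() {
    echo $(( ($1 - 245) / 5 ))
}

# Replace the low byte of an rc6vids value and keep the two upper bytes,
# mirroring "rc6vids &= 0xffff00" followed by "rc6vids |= ..." in the patch.
set_rc6_vid() {
    # $1 = current rc6vids value, $2 = target voltage in mV
    printf '0x%06x\n' $(( ($1 & 0xffff00) | $(encode_rc6_vid "$2") ))
}

# Usage (0x1a1a29 is an invented register value, not one from this bug):
#   encode_rc6_vid 450
#   set_rc6_vid 0x1a1a29 500
```

Under those assumptions, 450 mV and 500 mV differ by ten encoding steps of 5 mV each.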

Comment 11 Jakub Luzny 2013-02-22 09:25:43 UTC

No difference on 500mV, crashes too.

Comment 12 Jakub Luzny 2013-02-28 19:44:35 UTC

Okay, today it crashed with RC6 off. So it's much more likely to crash with RC6 on, but it looks like the problem isn't in RC6 itself. Do you have any suggestions for how I could track down the problem? Thanks

Comment 13 Ben Widawsky 2013-03-06 02:34:19 UTC

(In reply to comment #12)
> Okay, today it crashed with RC6 off. So it's much more likely to crash with
> RC6 on, but it looks like the problem isn't in RC6 itself. Do you have any
> suggestions for how I could track down the problem? Thanks

And it's still a hard hang, no way to get error state through ssh?
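
If the machine does stay reachable over ssh after a hang, the GPU error state could be saved before rebooting. A sketch, assuming debugfs is mounted at /sys/kernel/debug and the driver exposes i915_error_state there; the "no error state collected" marker is recalled from the driver and should be treated as an assumption, and has_error_state is a hypothetical helper name:

```shell
# Succeed if the given file holds a captured error state, i.e. it is
# non-empty and does not contain the "no error state collected" marker.
# has_error_state is a hypothetical helper; the marker text is an assumption.
has_error_state() {
    [ -s "$1" ] && ! grep -q 'no error state collected' "$1"
}

# Usage from another machine (host name "laptop" is a placeholder):
#   ssh root@laptop cat /sys/kernel/debug/dri/0/i915_error_state > error_state.txt
#   has_error_state error_state.txt && echo "attach error_state.txt to the bug"
```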

Comment 14 Jesse Barnes 2013-04-24 16:50:21 UTC

Random patch to try:

http://lists.freedesktop.org/archives/intel-gfx/2013-April/027140.html

Comment 15 Jakub Luzny 2013-04-24 17:01:33 UTC

Is it included in 3.9? Or are there any possible fixes in 3.9? I'm considering trying it.

Comment 16 Mika Kuoppala 2013-05-06 14:10:35 UTC

Created attachment 78928
drm/i915: init hardware to a known state on resume v3

Comment 17 Mika Kuoppala 2013-05-06 14:11:48 UTC

(In reply to comment #15)
> Is it included in 3.9? Or are there any possible fixes in 3.9? I'm
> considering trying it.

It is not included in any kernel tree as far as I know.

Please grab a kernel from
http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next-queued

and apply the patch on top of it and test if it makes a difference.

Comment 18 Chris Wilson 2013-06-12 09:31:33 UTC

Can you please try with this patch: https://patchwork.kernel.org/patch/2707341/ as it claims to fix some instability with rc6 on SandyBridge?

Comment 19 Daniel Vetter 2013-10-28 18:21:56 UTC

Please test Ken's snb blorp fixes from

http://cgit.freedesktop.org/~kwg/mesa/log/?h=snbfixes

Note that this is a mesa series, not kernel patches. But it could be that a gpu hang caused by mesa results in your gpu taking down the entire system ...

Comment 20 Ben Widawsky 2014-01-10 20:49:51 UTC

Ping OP to test latest everything. Will close if no update.

Comment 21 Jakub Luzny 2014-01-11 10:17:31 UTC

Well, I think the problem was caused by faulty memory. Memtest didn't find any problem even after 2 days, but I swapped the memory modules and haven't had this problem for a long time since. Sorry for the bug report.