Created attachment 74372 [details]
Screenshot of hung system

Hello,

I have a problem with my Sandy Bridge GPU: it hangs randomly, roughly every 2-3 days. With RC6 disabled, I had 20 days of uptime without a problem. It happens on 3.6, 3.7 and even 3.8-rc6. The system is completely frozen when this happens.

I'm using an MSI CR640 laptop with an i3-2310M and:

00:02.0 VGA compatible controller [0300]: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller [8086:0116] (rev 09)
Can you please attach your dmesg and Xorg.0.log?
Created attachment 74425 [details] Xorg log
Created attachment 74426 [details] Dmesg log
Does removing pcie_aspm=force have any effect?
Removing pcie_aspm=force disables ASPM, at least according to dmesg. Tried running with ASPM off, crashed again.
A bit of a long shot, but can you try disabling ppgtt instead of rc6, and see if it hangs?

from grub:
  i915.i915_enable_ppgtt=0
or from the command line:
  modprobe i915 i915_enable_ppgtt=0

Also, perhaps try turning off the IOMMU if it's on. From grub, it's something like:
  intel_iommu=off
Tried both disabling iommu and ppgtt, crashed again. Also tried netconsole, but nothing was sent during the crash :(
Again a long shot, but recently a patch went in to drm-intel-next-queued which has some correlation: http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=551a618f709768e373fc72bd5dde091632e2b695 Can you please try either that patch, or the branch itself. Whichever you are most comfortable with.
Applied this patch to 3.8rc6, crashed again. A very looong shot, but is there any way to bump RC6 voltage to more than 450mV?
(In reply to comment #9)
> Applied this patch to 3.8rc6, crashed again.
>
> A very looong shot, but is there any way to bump RC6 voltage to more than
> 450mV?

Yes, sure. First you can verify the setting was correct with:

  cat /sys/kernel/debug/dri/0/i915_drpc_info

Since we're entering unsafe territory, I'll attach this as a patch, but inline. Up to you to try it...

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 82b68fe..9126b7f 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -2664,6 +2664,10 @@ static void gen6_enable_rps(struct drm_device *dev)
 			DRM_ERROR("Couldn't fix incorrect rc6 voltage\n");
 	}

+#define RC6_VOLTAGE 500
+	rc6vids &= 0xffff00;
+	rc6vids |= GEN6_ENCODE_RC6_VID(RC6_VOLTAGE);
+	BUG_ON(sandybridge_pcode_write(dev_priv, GEN6_PCODE_WRITE_RC6VIDS, rc6vids));
 	gen6_gt_force_wake_put(dev_priv);
 }
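For anyone reading the patch above, here is a minimal standalone sketch of what the two `rc6vids` lines appear to do: clear the low byte of the value and put a newly encoded RC6 voltage there, leaving the upper two bytes untouched. The encoding formula (5 mV steps starting at 245 mV) is an assumption about how `GEN6_ENCODE_RC6_VID` is defined and should be checked against `i915_reg.h` in your tree. The helper names `rc6_encode_vid`, `rc6_decode_vid` and `rc6_set_voltage` are hypothetical and do not exist in the kernel, and the extra `& 0xff` on the encoded value is a safety mask added here that the patch does not have.

```c
#include <stdint.h>

/*
 * ASSUMPTION: the VID field counts 5 mV steps above a 245 mV base.
 * Verify against GEN6_ENCODE_RC6_VID / GEN6_DECODE_RC6_VID in i915_reg.h.
 * Only meaningful for mv >= 245.
 */
static inline uint32_t rc6_encode_vid(uint32_t mv)
{
	return (mv - 245) / 5;
}

static inline uint32_t rc6_decode_vid(uint32_t vid)
{
	return vid * 5 + 245;
}

/*
 * Hypothetical helper mirroring the patch: keep bits 8..23 of rc6vids,
 * replace bits 0..7 with the encoded RC6 voltage.
 */
static inline uint32_t rc6_set_voltage(uint32_t rc6vids, uint32_t mv)
{
	rc6vids &= 0xffff00;                  /* clear the low byte only */
	rc6vids |= rc6_encode_vid(mv) & 0xff; /* insert the new VID */
	return rc6vids;
}
```

Under that assumed encoding, 450 mV corresponds to a VID of 41 and 500 mV to a VID of 51, so the patch would only change the low byte of the value written via `GEN6_PCODE_WRITE_RC6VIDS`.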
No difference on 500mV, crashes too.
Okay, today it crashed with RC6 off. So it's much more likely to crash with RC6 enabled, but it looks like the problem isn't in RC6 itself. Do you have any suggestions for how I could track down the problem? Thanks
(In reply to comment #12)
> Okay, today it crashed with RC6 off. So it's much more likely to crash with
> RC6, but it looks like the problem isn't in RC6. Do you have any suggestions
> how could I figure out the problem? Thanks

And it's still a hard hang, no way to get error state through ssh?
Random patch to try: http://lists.freedesktop.org/archives/intel-gfx/2013-April/027140.html
Is it included in 3.9? Or are there any possible fixes in 3.9? I'm considering trying it.
Created attachment 78928 [details] [review] drm/i915: init hardware to a known state on resume v3
(In reply to comment #15)
> Is it included in 3.9? Or are any possible fixes in 3.9? I'm considering to
> try it.

It is not included in any kernel tree as far as I know. Please grab a kernel from http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next-queued, apply the patch on top of it, and test whether it makes a difference.
Can you please try with this patch: https://patchwork.kernel.org/patch/2707341/ as it claims to fix some instability with rc6 on SandyBridge?
Please test Ken's snb blorp fixes from http://cgit.freedesktop.org/~kwg/mesa/log/?h=snbfixes Note that this is a mesa series, not kernel patches. But it could be that a gpu hang caused by mesa results in your gpu taking down the entire system ...
Ping OP to test latest everything. Will close if no update.
Well, I think the problem was caused by faulty memory. Memtest didn't find any problem even after 2 days, but since I swapped the memory modules I haven't had this problem for a long time. Sorry for the bug report.