Created attachment 42952 [details]
i915-error-state after a hang

00:02.0 VGA compatible controller: Intel Corporation Sandy Bridge Integrated Graphics Controller (rev 09)

I ran a Sandy Bridge system with 2.6.37 and a single DP monitor without problems. Then I switched to 2.6.38-rc3 and added a second DP monitor. Since then I get regular hangs:

[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

It usually recovers after some time, but not always (then I have to restart X).

I updated to drm-intel-fixes

commit 71a77e07d0e33b57d4a50c173e5ce4fabceddbec
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 2 12:13:49 2011 +0000

    drm/i915: Invalidate TLB caches on SNB BLT/BSD rings

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org

but that didn't help (and it caused some other regressions, like breaking the Fedora boot screen and drawing a black bar over the bottom GNOME toolbar).
Created attachment 42953 [details]
Xorg.0.log
Does 2.6.37 work well with dual head? I want to know if this is a regression.
I ran dual head for a few hours with .37 now and so far no GPU hangs. I'll keep watching it (I don't have a procedure for triggering the hangs other than normal use). But so far it does look like a regression.
Update: no GPU hangs on .37 so far, but the second monitor just went into low-power mode (with the primary one still running and me typing, etc.). I haven't figured out how to wake it up again. There are no messages in the kernel log, and xrandr still thinks it's there.
I also tried 2.6.38-rc5, but it hung the X server with a GPU hang already at the login screen:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 6476 at 6475, next 6477)

The only stable configuration I have found on this system so far is 2.6.37 with a single monitor.
The two major changes for SNB were power and performance: enabling GPU semaphores and render P-states (along with enabling low power watermarks). One or the other of these patches may help:

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index d2f445e..05b309e 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -773,7 +773,7 @@ i915_gem_execbuffer_sync_rings(struct drm_i915_gem_object *obj,
 		return 0;
 
 	/* XXX gpu semaphores are currently causing hard hangs on SNB mobile */
-	if (INTEL_INFO(obj->base.dev)->gen < 6 || IS_MOBILE(obj->base.dev))
+	if (1)
 		return i915_gem_object_wait_rendering(obj, true);
 
 	idx = intel_ring_sync_index(from, to);

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index dcb8217..540ed10 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6196,6 +6196,9 @@ void gen6_enable_rps(struct drm_i915_private *dev_priv)
 	int cur_freq, min_freq, max_freq;
 	int i;
 
+	if (!i915_enable_rc6)
+		return;
+
 	/* Here begins a magic sequence of register writes to enable
 	 * auto-downclocking.
 	 *
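For context: the first hunk forces the pre-gen6 fallback, so inter-ring dependencies are resolved by a CPU wait instead of a GPU-side semaphore. A conceptual sketch of that trade-off (the names below are hypothetical, for illustration only, not the i915 API):

/* Conceptual sketch only -- ring_sync(), emit_semaphore_wait() and
 * cpu_wait_for_seqno() are hypothetical names. */
static int ring_sync(struct ring *to, struct ring *from, u32 seqno)
{
	if (use_gpu_semaphores)
		/* GPU-side wait: 'to' stalls in hardware until 'from'
		 * signals seqno; fast, but if the signal never arrives
		 * the ring wedges. */
		return emit_semaphore_wait(to, from, seqno);

	/* CPU-side wait: serialises submission (slower), but a stuck
	 * wait stays visible to and recoverable by the kernel. */
	return cpu_wait_for_seqno(from, seqno);
}

The second hunk simply skips the RPS/RC6 enable sequence, so the GPU never enters the new power states.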
With that patch I have 38-rc5 running with a single monitor. Works so far for a few hours in normal usage. Do you want me to try one hunk over the other? I can try dual head later too.
(In reply to comment #7)
> With that patch I have 38-rc5 running with a single monitor.
> Works so far for a few hours in normal usage.
>
> Do you want me to try one hunk over the other?

Please, they are quite different in cause and effect, so knowing which path is at fault is vital.
2.6.38-rc6 single head with just

@@ -6196,6 +6196,9 @@ void gen6_enable_rps(struct drm_i915_private *dev_priv)
 	int cur_freq, min_freq, max_freq;
 	int i;
 
+	if (!i915_enable_rc6)
+		return;
+

gives

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring

I'll try the other hunk later.
Got another hang with the same hunk; haven't tried the other one yet.

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
Didn't see any single head hangs for a day with only this hunk applied:

 	/* XXX gpu semaphores are currently causing hard hangs on SNB mobile */
-	if (INTEL_INFO(obj->base.dev)->gen < 6 || IS_MOBILE(obj->base.dev))
+	if (1)
 		return i915_gem_object_wait_rendering(obj, true);

I'll try dual head again next.
First result is that hotplug for dual head still doesn't work. No messages in the kernel when I plug in the other monitor.
I did some testing with multi head now too. With semaphores disabled it works well so far, no hangs. I'll keep watching it.
So be it...

commit 4cd5a1efff70f54b70ef598efca878a143a5f9d5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 4 18:48:03 2011 +0000

    drm/i915: Disable GPU semaphores by default

    Andi Kleen narrowed his hard hangs on his Sugar Bay (SNB desktop) rev 09
    down to the use of GPU semaphores, and we already know that they were
    broken up to Huron River (mobile) rev 08. However, use of semaphores is
    a massive performance improvement... Only as long as the system remains
    stable. Enable at your peril.

    Reported-by: Andi Kleen <andi-fd@firstfloor.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=33921
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
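The diff itself isn't quoted in this thread; going by the commit summary, it presumably replaces the hard-coded if (1) test from comment 7 with a module parameter defaulting to off, roughly along these lines (the parameter name is an assumption for illustration, not the verbatim commit):

#include <linux/module.h>

/* Assumed shape of the fix: a module parameter gating semaphore use. */
unsigned int i915_semaphores = 0;
module_param_named(semaphores, i915_semaphores, int, 0600);
MODULE_PARM_DESC(semaphores,
		 "Use GPU semaphores for inter-ring sync (default: false)");

/* ... and in i915_gem_execbuffer_sync_rings(), fall back to the CPU
 * wait unless semaphores were explicitly requested: */
	if (INTEL_INFO(obj->base.dev)->gen < 6 || !i915_semaphores)
		return i915_gem_object_wait_rendering(obj, true);

That way adventurous users can still turn semaphores back on at boot without rebuilding the kernel.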
Thanks. Small correction for the commit log: I got GPU hangs, not hard (system?) hangs. Other than that it sounds good.

-Andi
Hopefully this is actually another issue that is merely papered over by the regular stalls. I'm hoping it turns out to be the excessive GT register writes during rc6...
Created attachment 44138 [details] [review]
Poll the FIFO for free entries before writing the register

Hopefully this addresses the real issue.
Thanks, I'll try it. But doesn't the loop need a timeout?
Hang if you do and hang if you don't...
It's a GPU hang versus a CPU hang, isn't it? A GPU hang seems less severe.
Already made the change.
Can you please attach the updated patch? Thanks.
Created attachment 44140 [details] [review]
Poll the FIFO for free entries before writing the register
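The attachment body isn't inlined here, but a bounded FIFO poll of this kind would look roughly as follows; the register name and threshold follow the public i915 code of that era, and the exact values in the attached patch may differ:

#define GT_FIFO_FREE_ENTRIES		0x120008
#define GT_FIFO_NUM_RESERVED_ENTRIES	20

/* Sketch: before posting a GT register write, briefly wait until the
 * hardware FIFO has free entries, so writes are not dropped while the
 * GT is in a low-power state. */
static void gen6_gt_wait_for_fifo(struct drm_i915_private *dev_priv)
{
	int loop = 500;	/* bounded: give up instead of spinning forever */
	u32 fifo = I915_READ_NOTRACE(GT_FIFO_FREE_ENTRIES);

	while (fifo < GT_FIFO_NUM_RESERVED_ENTRIES && loop--) {
		udelay(10);
		fifo = I915_READ_NOTRACE(GT_FIFO_FREE_ENTRIES);
	}
}

Bounding the loop settles the hang-vs-hang question above: if the FIFO never drains, the driver eventually proceeds and at worst gets a recoverable GPU hang, rather than hanging the CPU in an endless poll.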
It seems the "Poll the FIFO for free entries before writing the register" patch does the trick, and that's with GPU semaphores enabled (clean 2.6.38-rc7 with just this patch applied). This is on an i5 2400.
Either way, it is fixed in the upstream kernel. Hopefully we will be able to verify that the FIFO fix is sufficient for 2.6.38.1.
I have the new version running now on my workstation, but it'll take some time to verify.
I've run it for a few days now with the FIFO fix only and didn't have a GPU hang. (I had one libdrm_intel segfault in compiz and one triple fault of the whole system under load, but I assume those are both something else.) So for me it's fine to re-enable GPU semaphores for .1.