33921 – hangs on sandy bridge with dual head

Summary: hangs on sandy bridge with dual head

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-02-04 15:04 UTC by Andi Kleen
Modified:	2011-03-15 14:07 UTC
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
i915-error-state after a hang (676.86 KB, application/octet-stream) 2011-02-04 15:04 UTC, Andi Kleen
Xorg.0.log (194.96 KB, text/x-log) 2011-02-04 15:04 UTC, Andi Kleen
Poll the FIFO for free entries before writing the register (4.32 KB, patch) 2011-03-04 11:35 UTC, Chris Wilson
Poll the FIFO for free entries before writing the register (8.21 KB, patch) 2011-03-04 13:36 UTC, Chris Wilson

Description Andi Kleen 2011-02-04 15:04:13 UTC

Created attachment 42952
i915-error-state after a hang

00:02.0 VGA compatible controller: Intel Corporation Sandy Bridge Integrated Graphics Controller (rev 09)

I ran a Sandy Bridge system with 2.6.37 and a single DP monitor without problems.
Then I switched to 2.6.38-rc3 and added a second DP monitor.

Since then I have regular hangs:


[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

It usually recovers after some time, but not always (sometimes I have to restart X).

I updated to drm-intel-fixes at this commit:

commit 71a77e07d0e33b57d4a50c173e5ce4fabceddbec
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 2 12:13:49 2011 +0000

    drm/i915: Invalidate TLB caches on SNB BLT/BSD rings
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org

but that didn't help (and it caused some other regressions, like
breaking the Fedora boot screen and showing a black bar over the bottom
GNOME toolbar).

Comment 1 Andi Kleen 2011-02-04 15:04:44 UTC

Created attachment 42953
Xorg.0.log

Comment 2 Gordon Jin 2011-02-08 17:26:20 UTC

Does 2.6.37 work well with dual head? I want to know if this is a regression.

Comment 3 Andi Kleen 2011-02-15 10:58:57 UTC

I ran dual-head for a few hours with .37 now and so far no GPU hangs.
I'll watch it further (I don't have a procedure for triggering them
except for using it). But so far it looks like a regression indeed.

Comment 4 Andi Kleen 2011-02-15 13:11:22 UTC

Update: no GPU hangs on .37 so far, but the other monitor just went into low power mode (with the primary one still running and me typing etc.).
I haven't figured out how to wake it up again. There are no messages in the kernel log
and xrandr still thinks it's there.

Comment 5 Andi Kleen 2011-02-16 16:56:09 UTC

I also tried 2.6.38-rc5, but it hung the X server with a GPU hang already during the login screen:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 6476 at 6475, next 6477)

The only stable configuration I have found on this system so far is 2.6.37 with only a single monitor.

Comment 6 Chris Wilson 2011-02-17 02:21:38 UTC

The two major changes for SNB were power and performance: enabling GPU semaphores and render P-states (along with enabling low power watermarks). One or the other of these patches may help:

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i
index d2f445e..05b309e 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -773,7 +773,7 @@ i915_gem_execbuffer_sync_rings(struct drm_i915_gem_object *o
                return 0;
 
        /* XXX gpu semaphores are currently causing hard hangs on SNB mobile */
-       if (INTEL_INFO(obj->base.dev)->gen < 6 || IS_MOBILE(obj->base.dev))
+       if (1)
                return i915_gem_object_wait_rendering(obj, true);
 
        idx = intel_ring_sync_index(from, to);

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index dcb8217..540ed10 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6196,6 +6196,9 @@ void gen6_enable_rps(struct drm_i915_private *dev_priv)
        int cur_freq, min_freq, max_freq;
        int i;
 
+       if (!i915_enable_rc6)
+               return;
+
        /* Here begins a magic sequence of register writes to enable
         * auto-downclocking.
         *

Comment 7 Andi Kleen 2011-02-17 13:21:01 UTC

With that patch I have 38-rc5 running with a single monitor.
Works so far for a few hours in normal usage.

Do you want me to try one hunk over the other?

I can try dual head later too.

Comment 8 Chris Wilson 2011-02-17 13:25:38 UTC

(In reply to comment #7)
> With that patch I have 38-rc5 running with a single monitor.
> Works so far for a few hours in normal usage.
> 
> Do you want me to try one hunk over the other?

Please, they are quite different in cause and effect, so knowing which path is at fault is vital.

Comment 9 Andi Kleen 2011-02-22 15:44:06 UTC

2.6.38-rc6 single head with just

@@ -6196,6 +6196,9 @@ void gen6_enable_rps(struct drm_i915_private *dev_priv)
        int cur_freq, min_freq, max_freq;
        int i;
 
+       if (!i915_enable_rc6)
+               return;
+


gives 

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring

I'll try the other hunk later.

Comment 10 Andi Kleen 2011-03-01 15:03:39 UTC

Got another hang with the same hunk; I haven't tried the other one yet.

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring

Comment 11 Andi Kleen 2011-03-03 13:32:45 UTC

Didn't see any single head hangs for a day with only this hunk applied:

        /* XXX gpu semaphores are currently causing hard hangs on SNB mobile */
-       if (INTEL_INFO(obj->base.dev)->gen < 6 || IS_MOBILE(obj->base.dev))
+       if (1)
                return i915_gem_object_wait_rendering(obj, true);


I'll try dual head again next.

Comment 12 Andi Kleen 2011-03-03 13:34:30 UTC

First result is that hotplug for dual head still doesn't work. No messages in the kernel when I plug in the other monitor.

Comment 13 Andi Kleen 2011-03-03 16:42:27 UTC

Did some testing with multi head now too.
With the semaphores disabled it works well so far, no hangs.
I'll watch it further.

Comment 14 Chris Wilson 2011-03-04 11:00:46 UTC

So be it...

commit 4cd5a1efff70f54b70ef598efca878a143a5f9d5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 4 18:48:03 2011 +0000

    drm/i915: Disable GPU semaphores by default
    
    Andi Kleen narrowed his hard hangs on his Sugar Bay (SNB desktop) rev 09
    down to the use of GPU semaphores, and we already know that they were
    broken up to Huron River (mobile) rev 08.
    
    However, use of semaphores is a massive performance improvement... Only
    as long as the system remains stable. Enable at your peril.
    
    Reported-by: Andi Kleen <andi-fd@firstfloor.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=33921
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 15 Andi Kleen 2011-03-04 11:03:16 UTC

Thanks.

Small correction for the commit log:

I got GPU hangs, not hard (system?) hangs.

Other than that it sounds good.

-Andi

Comment 16 Chris Wilson 2011-03-04 11:04:32 UTC

Hopefully, this is actually another issue which is papered over by the regular stalls. I'm hoping this actually turns out to be the excessive GT writes during rc6...

Comment 17 Chris Wilson 2011-03-04 11:35:43 UTC

Created attachment 44138
Poll the FIFO for free entries before writing the register

Hopefully this is the real issue.

Comment 18 Andi Kleen 2011-03-04 12:37:28 UTC

Thanks, I'll try. But doesn't the loop need a timeout?

Comment 19 Chris Wilson 2011-03-04 12:40:12 UTC

Hang if you do and hang if you don't...

Comment 20 Andi Kleen 2011-03-04 12:47:40 UTC

It's a GPU hang versus a CPU hang, isn't it?
A GPU hang seems less severe.
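
To illustrate the trade-off raised here, the following is a minimal userspace sketch of a bounded poll before a register write. It is not the attached patch: `fake_fifo`, `read_free_entries`, `wait_for_fifo`, `write_reg` and the `max_polls` limit are all hypothetical names invented for this example, and the "register" is an ordinary variable. Skipping the write on timeout is a choice made only for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a hardware FIFO status register. */
struct fake_fifo {
    uint32_t free_entries; /* free slots currently reported */
    uint32_t drain_after;  /* a slot frees up at this poll count; 0 = never */
    uint32_t polls;        /* number of status reads performed so far */
};

/* Simulated status read: one slot becomes free once enough polls happened. */
static uint32_t read_free_entries(struct fake_fifo *f)
{
    f->polls++;
    if (f->drain_after && f->polls >= f->drain_after && f->free_entries == 0)
        f->free_entries = 1;
    return f->free_entries;
}

/* Poll at most max_polls times; true if a free slot was seen. */
static bool wait_for_fifo(struct fake_fifo *f, uint32_t max_polls)
{
    while (max_polls--) {
        if (read_free_entries(f) > 0)
            return true;
    }
    return false; /* gave up: the caller is not stuck in the loop forever */
}

/* Write val to *reg only if a FIFO slot became free within the poll budget. */
static bool write_reg(struct fake_fifo *f, uint32_t *reg, uint32_t val,
                      uint32_t max_polls)
{
    if (!wait_for_fifo(f, max_polls))
        return false; /* sketch-only policy: skip the write on timeout */
    f->free_entries--; /* the write consumes one slot */
    *reg = val;
    return true;
}
```

Without the `max_polls` bound, a FIFO that never drains would spin the CPU in `wait_for_fifo` indefinitely; with it, the failure is reported to the caller, which can then decide how to handle it.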

Comment 21 Chris Wilson 2011-03-04 12:50:06 UTC

Already made the change.

Comment 22 Andi Kleen 2011-03-04 13:27:16 UTC

Can you please attach the updated patch?
Thanks

Comment 23 Chris Wilson 2011-03-04 13:36:09 UTC

Created attachment 44140
Poll the FIFO for free entries before writing the register

Comment 24 Ivan Bulatovic 2011-03-07 18:24:36 UTC

It seems that the "Poll the FIFO for free entries before writing the register" patch does the trick, and that's with GPU semaphores enabled (clean 2.6.38-rc7 with just this patch applied).

i5 2400 here.

Comment 25 Chris Wilson 2011-03-10 10:55:47 UTC

Either way, it is fixed in the upstream kernel. Hopefully we will be able to verify that the FIFO fix is sufficient for 2.6.38.1.

Comment 26 Andi Kleen 2011-03-10 15:13:30 UTC

I have the new version running now on my workstation, but it'll take some
time to verify.

Comment 27 Andi Kleen 2011-03-15 14:07:12 UTC

I ran it for a few days now with only the FIFO fix and didn't have a GPU hang.
(I had one libdrm_intel segfault in compiz and one triple fault of the whole system under load, but I assume both are something else.)

So for me it's fine to re-enable GPU semaphores for 2.6.38.1.
