Bug 32752 - [SNB] GPU semaphores disabled on HuronRiver (due to hard hangs in urbanterror)
Status: VERIFIED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86 (IA32) Linux (All)
Importance: medium major
Assigned To: Chris Wilson
Reported: 2010-12-31 00:35 UTC by meng
Modified: 2011-03-08 00:49 UTC (History)


Attachments
a patch (1.46 KB, patch)
2011-01-10 22:56 UTC, meng
dmesg about stop-go (42.21 KB, text/plain)
2011-01-17 02:13 UTC, meng
i915_gem_interrupts information (2.10 KB, text/plain)
2011-01-25 23:22 UTC, meng

Description meng 2010-12-31 00:35:53 UTC
System Environment:
--------------------------
Arch:              i386
Platform:          Huronriver
Libdrm:            (master) 2.4.23-4
Mesa:              (master) 8d79765feb8fa003e629d4c5890af636324def9f
Xserver:     (master)xorg-server-1.9.99.901-49-gefcb63d0ce43f96d0ac02b6f4a480dfd2374fc84
Xf86_video_intel:   (master)2.13.902-11-g7667ad8432c032aec3a2aa004fc4dfc1877971b3
Kernel: (drm-intel-next) 608ca70d22c0ea0d52aa71f52b8e326055c274d1

Bug detailed description:
-------------------------
Start X. After running urbanterror several times, the PC hangs. It is a kernel regression. The known good commit is Kernel: (drm-intel-next) b9e68670cc3a13166b389ce847af19b0d0d33c67


Reproduce steps:
----------------
1. start X
2. run urbanterror several times
Comment 1 Chris Wilson 2011-01-03 05:09:27 UTC
I've been using pts to run urbanterror (and ut2004-demo) in a loop over various resolutions without failure. Is the hang reproducible on your systems with pts or do you need manual interaction?
Comment 2 Chris Wilson 2011-01-03 05:10:22 UTC
And some information on the nature of the hang? Is the machine pingable? Anything strange in any of the logs, etc?
Comment 3 Chris Wilson 2011-01-03 12:41:22 UTC
Letting it run on, I've hit an EBUSY and a bug where it spins inside _XReply. Do either of those match the symptoms you've encountered?
Comment 4 Gordon Jin 2011-01-03 21:18:52 UTC
Demoting to P2 to unblock 2010Q4 release, since this happens on drm-intel-next.
Comment 5 Chris Wilson 2011-01-04 14:59:19 UTC
Attacked the problems I saw:

commit 7892e65a596d93c2fad8781214dd8bfff9735d76
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 4 17:34:02 2011 +0000

    drm/i915: Handle ringbuffer stalls when flushing
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 9932e0bb5ec8f8a36401d6b491db6ec53b71f381
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 4 17:35:21 2011 +0000

    drm/i915: Mask USER interrupts on gen6 (until required)
    
    Otherwise we may consume 20% of the CPU just handling IRQs whilst
    rendering. Ouch.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 8274c8f68dbbd5c2e273567c7be214491801be33
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 4 22:22:17 2011 +0000

    drm/i915/debugfs: Show the per-ring IMR
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 74fe9bb76464289b2f994bcd188d906f30912326
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 4 22:22:56 2011 +0000

    drm/i915/ringbuffer: Simplify the ring irq refcounting
    
    ... and move it under the spinlock to gain the appropriate memory
    barriers.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32752
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


Hopefully this clears up your symptoms as well.
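
As a rough illustration of the idea behind the last two commit subjects above (keep the USER interrupt masked until a waiter needs it, and refcount the enable), here is a minimal user-space sketch. The struct, the function names and the convention that a set bit in irq_mask means "masked" are assumptions for illustration and are not copied from the driver. The spinlock that the commit message mentions is deliberately left out, so this model is single-threaded only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one ring's interrupt state. */
struct ring_model {
    unsigned int irq_refcount; /* number of active waiters */
    uint32_t irq_mask;         /* models the IMR: bit set = interrupt masked */
};

/* First waiter unmasks the interrupt; later waiters only bump the count.
 * In a real driver this would run under a lock. */
void ring_get_irq(struct ring_model *ring, uint32_t flag)
{
    if (ring->irq_refcount++ == 0)
        ring->irq_mask &= ~flag;
}

/* Last waiter to leave masks the interrupt again. */
void ring_put_irq(struct ring_model *ring, uint32_t flag)
{
    assert(ring->irq_refcount > 0);
    if (--ring->irq_refcount == 0)
        ring->irq_mask |= flag;
}
```

With this shape the interrupt stays masked whenever nobody is waiting, which is the behaviour the "Mask USER interrupts on gen6 (until required)" subject describes.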
Comment 6 meng 2011-01-05 03:28:01 UTC
Tested with the following environment; this issue still exists. The system hangs and cannot be reached remotely.

System Environment:
--------------------------
Libdrm:         (master)2.4.23-4-gbad5242a59aa8e31cf10749e2ac69b3c66ef7da0
Mesa:           (master)90b7a4cc1a9ec6560fba337fb86be2a574498acb
Xserver:                (master)xorg-server-1.9.99.901-76-g261d0d16af797bb52d4c778e220296d7f2b28e14
Xf86_video_intel:               (master)2.13.902-22-ga7c7a9108f76aa312f3d5efa466052b914c81484
Kernel_unstable:                (drm-intel-next)a9ac4ef59da8cb3c2990bb0bd4fcbaa9094c6ac6
Comment 7 Chris Wilson 2011-01-05 03:59:04 UTC
Oops, you might need this as well:

commit 7fd1179aa58aeb239bddb669ddfe1326497e7012
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 5 10:32:24 2011 +0000

    drm/i915: Make the ring IMR handling private
    
    As the IMR for the USER interrupts are not modified elsewhere, we can
    separate the spinlock used for these from that of hpd and pipestats.
    Those two IMR are manipulated under an IRQ and so need heavier locking.
    
    Reported-and-tested-by: Alexey Fisher <bug-track@fisher-privat.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 8 zhao jian 2011-01-07 07:11:21 UTC
Hi Chris, even with the newest commit (7fd1179aa58aeb239bddb669ddfe1326497e7012) on the -next branch, running the urbanterror demo 2~5 times still makes the system hang. We also tested with the urbanterror obtained from pts, using pts1.dm_68. The command I used is as follows; can you reproduce it?

vblank_mode=0 ./urbanterror +timedemo 1 +set demodone 'quit' +set demoloop1 'demo pts1; set nextdemo vstr demodone' +vstr demoloop1 +set r_customwidth 1024 +set r_customheight 768
Comment 9 Chris Wilson 2011-01-07 07:31:30 UTC
I'm looking at where 'x11perf -d :0 -copywinwin10 -copywinwin500' causes a hang currently.
Comment 10 Chris Wilson 2011-01-07 13:41:58 UTC
The demo runs happily in a loop here. But I found the cause of my copywinwin regression:

commit 1398261a2e84c537c409259cfe9db3d0abcd9f99
Author: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Date:   Wed Dec 15 15:42:31 2010 +0800

    drm/i915: Add self-refresh support on Sandybridge
    
    Add the support of memory self-refresh on Sandybridge, which is now
    support 3 levels of watermarks and the source of the latency values
    for watermarks has changed.
    
    On Sandybridge, the LP0 WM value is not hardcoded any more. All the
    latency value is now should be extracted from MCHBAR SSKPD register.
    And the MCHBAR base address is changed, too.
    
    For the WM values, if any calculated watermark values is larger than
    the maximum value that can be programmed into the associated watermark
    register, that watermark must be disabled.
    
    Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
    [ickle: remove duplicate compute routines and fixup for checkpatch]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

This is sufficient to make my system stable again for x11perf:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index 365b47c..17ffb55 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6554,7 +6554,7 @@ static void intel_init_display(struct drm_device *dev)
                                              "Disable CxSR\n");
                                dev_priv->display.update_wm = NULL;
                        }
-               } else if (IS_GEN6(dev)) {
+               } else if (IS_GEN6(dev) && 0) {
                        if (SNB_READ_WM0_LATENCY()) {
                                dev_priv->display.update_wm = sandybridge_update
                        } else {

Can you try that to see if we have the same bug?
Comment 11 Chris Wilson 2011-01-07 15:19:40 UTC
And this is the magic that worked:

@@ -3654,7 +3662,9 @@ static void sandybridge_update_wm(struct drm_device *dev,
 				 &sandybridge_cursor_wm_info, latency,
 				 &plane_wm, &cursor_wm)) {
 		I915_WRITE(WM0_PIPEA_ILK,
-			   (plane_wm << WM0_PIPE_PLANE_SHIFT) | cursor_wm);
+			   ((sandybridge_display_wm_info.fifo_size - plane_wm) << WM0_PIPE_PLANE_SHIFT) |
+			   (2 << WM0_PIPE_SPRITE_SHIFT) |
+			   cursor_wm);
 		DRM_DEBUG_KMS("FIFO watermarks For pipe A -"
 			      " plane %d, " "cursor: %d\n",
 			      plane_wm, cursor_wm);
@@ -3666,7 +3676,9 @@ static void sandybridge_update_wm(struct drm_device *dev,
 				 &sandybridge_cursor_wm_info, latency,
 				 &plane_wm, &cursor_wm)) {
 		I915_WRITE(WM0_PIPEB_ILK,
-			   (plane_wm << WM0_PIPE_PLANE_SHIFT) | cursor_wm);
+			   ((sandybridge_display_wm_info.fifo_size - plane_wm) << WM0_PIPE_PLANE_SHIFT) |
+			   (2 << WM0_PIPE_SPRITE_SHIFT) |
+			   cursor_wm);
 		DRM_DEBUG_KMS("FIFO watermarks For pipe B -"
 			      " plane %d, cursor: %d\n",
 			      plane_wm, cursor_wm);
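
To make the arithmetic in the hunk above easier to follow, here is a minimal user-space sketch of how the old and the experimental WM0 register values are packed. The shift values (16 for the plane field, 8 for the sprite field) and the sample numbers are assumptions for illustration, and wm0_pack, wm0_old and wm0_experimental are hypothetical helper names rather than driver functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field positions, for illustration only. */
#define WM0_PIPE_PLANE_SHIFT  16
#define WM0_PIPE_SPRITE_SHIFT 8

/* Pack plane, sprite and cursor watermarks into one 32-bit value. */
uint32_t wm0_pack(uint32_t plane_wm, uint32_t sprite_wm, uint32_t cursor_wm)
{
    return (plane_wm << WM0_PIPE_PLANE_SHIFT) |
           (sprite_wm << WM0_PIPE_SPRITE_SHIFT) |
           cursor_wm;
}

/* Value written before the change: plane and cursor only. */
uint32_t wm0_old(uint32_t plane_wm, uint32_t cursor_wm)
{
    return wm0_pack(plane_wm, 0, cursor_wm);
}

/* Value written by the experimental hunk: the plane field becomes
 * fifo_size - plane_wm and the sprite field is set to 2. */
uint32_t wm0_experimental(uint32_t fifo_size, uint32_t plane_wm,
                          uint32_t cursor_wm)
{
    assert(fifo_size >= plane_wm);
    return wm0_pack(fifo_size - plane_wm, 2, cursor_wm);
}
```

With the sample inputs fifo_size = 128, plane_wm = 40 and cursor_wm = 6, the old expression yields 0x00280006 and the experimental one 0x00580206.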
Comment 12 meng 2011-01-10 02:07:11 UTC
Hi Chris, with commit fbf4b94f7d01550799ff7669c65ebb0c053e015c plus the patch that you gave on 7 Jan, the system still hangs.
Comment 13 Chris Wilson 2011-01-10 02:25:17 UTC
It appears we are seeing completely different bugs (I only ever saw GPU hangs and not system hangs) and I don't seem to be able to reproduce your bug. You may need to do some investigation as well. Bisection is tricky since we have a few nasty bugs that cause random hangs, but I would first disable sandybridge_update_wm, fbc and rc6 since they are the most likely suspects.
Comment 14 meng 2011-01-10 02:32:52 UTC
With commit fbf4b94f7d01550799ff7669c65ebb0c053e015c without the patch, the system hangs too.
Comment 15 meng 2011-01-10 22:56:18 UTC
Created attachment 41869
a patch
Comment 16 meng 2011-01-10 22:59:51 UTC
Hi, with Kernel: (drm-intel-next) 34da1327c3814781925396fa10c42f596588ff76 and the patch that you gave applied, the system hangs too.
-----------------------------------------------------------
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5ca0663..8a81c62 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -168,7 +168,7 @@ static const struct intel_device_info intel_sandybridge_d_info = {
 static const struct intel_device_info intel_sandybridge_m_info = {
 	.gen = 6, .is_mobile = 1,
 	.need_gfx_hws = 1, .has_hotplug = 1,
-	.has_fbc = 1,
+	.has_fbc = 0,
 	.has_bsd_ring = 1,
 	.has_blt_ring = 1,
 };
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 14ac352..db3be33 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6561,7 +6561,7 @@ static void intel_init_display(struct drm_device *dev)
 					      "Disable CxSR\n");
 				dev_priv->display.update_wm = NULL;
 			}
-		} else if (IS_GEN6(dev)) {
+		} else if (IS_GEN6(dev) && 0) {
 			if (SNB_READ_WM0_LATENCY()) {
 				dev_priv->display.update_wm = sandybridge_update_wm;
 			} else {
@@ -6739,7 +6739,7 @@ void intel_modeset_init(struct drm_device *dev)
 		intel_init_emon(dev);
 	}
 
-	if (IS_GEN6(dev))
+	if (IS_GEN6(dev) && 0)
 		gen6_enable_rps(dev_priv);
 
 	if (IS_IRONLAKE_M(dev)) {
@@ -6790,7 +6790,7 @@ void intel_modeset_cleanup(struct drm_device *dev)
 
 	if (IS_IRONLAKE_M(dev))
 		ironlake_disable_drps(dev);
-	if (IS_GEN6(dev))
+	if (IS_GEN6(dev) && 0)
 		gen6_disable_rps(dev);
 
 	if (IS_IRONLAKE_M(dev))
Comment 17 meng 2011-01-11 00:44:54 UTC
Oh, sorry. The patch was given by Liu, Yuanhan.
Comment 18 Chris Wilson 2011-01-11 16:10:50 UTC
I've pushed the tree that I'm currently running on my SNB to drm-intel-staging (basically it is drm-intel-fixes + snb wm patch). Can you please try that kernel?
Comment 19 meng 2011-01-11 22:17:32 UTC
Hi Chris, I tried Kernel: (drm-intel-staging) 403c5307152ee5be73800d699902a327976f5d1d and the system hangs too.
Comment 20 Chris Wilson 2011-01-12 02:48:04 UTC
I am at a complete loss, I can't seem to reproduce this (aside from the erratic wm fixed in -staging) but as the bug is now upstream, this needs to be P1.

Is it possible for you to narrow the range of known good/bad kernels?
Comment 21 Chris Wilson 2011-01-12 03:02:57 UTC
Ah, I've just realised I'm on a desktop (SugarBay) vs your mobile (HuronRiver).

Can you confirm that -staging works on SugarBay? (Just for my sanity!)
Comment 22 zhao jian 2011-01-12 06:08:04 UTC
(In reply to comment #21)
> Ah, I've just realised I'm on a desktop (SugarBay) vs your mobile (HuronRiver).
> Can you confirm that -staging works on SugarBay? (Just for my sanity!)

Yes, Chris. This issue only happens on our HuronRiver Rev 08. On our SugarBay Rev 09, there is no such issue with drm-intel-fixes or drm-intel-next. I guess the -staging branch also works well there; I will try it tomorrow.
We have bisected down to only 2 remaining commits, but I forget which commits they are right now; I think we can give you an update tomorrow. BTW, when bisecting, if a kernel can run the demo 15 times consecutively we take it as a pass, otherwise we take it as a fail. So a bad commit may be mistaken for a good one if it happens to run 15 times or more before hanging.
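
The risk of a false "good" verdict described above can be estimated with a small calculation. The sketch below assumes that each demo run hangs independently with a fixed probability, which is a simplification, and the function names are hypothetical. The per-run hang probability used in the note after the code (0.2, roughly one hang in five runs) is an illustrative assumption, not a measured value.

```c
#include <assert.h>

/* Probability that a bad commit, whose runs each hang with probability
 * p_hang (assumed independent), still completes n_runs clean runs. */
double false_pass_probability(double p_hang, unsigned int n_runs)
{
    double p_clean = 1.0;
    unsigned int i;

    assert(p_hang >= 0.0 && p_hang <= 1.0);
    for (i = 0; i < n_runs; i++)
        p_clean *= 1.0 - p_hang;
    return p_clean;
}

/* Smallest number of consecutive clean runs after which a bad commit
 * slips through with probability at most max_false_pass. */
unsigned int runs_needed(double p_hang, double max_false_pass)
{
    double p_clean = 1.0;
    unsigned int n = 0;

    assert(p_hang > 0.0 && p_hang <= 1.0);
    assert(max_false_pass > 0.0 && max_false_pass < 1.0);
    while (p_clean > max_false_pass) {
        p_clean *= 1.0 - p_hang;
        n++;
    }
    return n;
}
```

Under these assumptions a commit that hangs on one run in five would pass the 15-run test about 3.5% of the time, and about 21 clean runs would be needed to push that below 1%.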
Comment 23 meng 2011-01-12 19:39:12 UTC
Kernel: 1ec14ad3132702694f2e1a90b30641cf111183b9 is the first bad commit.
Its parent commit is 340479aac697bc73e225c122a9753d4964eeda3f.
When bisecting, a commit that can run the demo 15 times consecutively is taken as a pass, otherwise as a fail. Running the demo on this bad commit 2-4 times hangs the system.
Running the demo on the good commit (340479aac697bc73e225c122a9753d4964eeda3f) many times (15 runs, then after a while another 15 runs), the system is OK.

----------------------------------------------------------------------------
commit 1ec14ad3132702694f2e1a90b30641cf111183b9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Dec 4 11:30:53 2010 +0000

    drm/i915: Implement GPU semaphores for inter-ring synchronisation on SNB

    The bulk of the change is to convert the growing list of rings into an
    array so that the relationship between the rings and the semaphore sync
    registers can be easily computed.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 24 Chris Wilson 2011-01-13 02:49:09 UTC
Presumably then

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i
index e698343..28bdfe0 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -770,7 +770,7 @@ i915_gem_execbuffer_sync_rings(struct drm_i915_gem_object *o
        if (from == NULL || to == from)
                return 0;
 
-       if (INTEL_INFO(obj->base.dev)->gen < 6)
+       if (INTEL_INFO(obj->base.dev)->gen < 6 || IS_MOBILE(obj->base.dev))
                return i915_gem_object_wait_rendering(obj, true);
 
        idx = intel_ring_sync_index(from, to);

is a sufficient patch for us to run stably on both desktop + mobile until somebody has a chance to debug the semaphore implementation.
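
The decision that the hunk above changes can be modelled in user space as follows. This is only a sketch: struct device_model, enum sync_method and choose_sync are hypothetical names, and a negative from_ring stands in for the `from == NULL` case in the driver code.

```c
#include <stdbool.h>

/* How a buffer last used on one ring is made safe to use on another. */
enum sync_method {
    SYNC_NONE,          /* no previous ring, or same ring: nothing to do */
    SYNC_CPU_WAIT,      /* CPU waits for rendering to finish */
    SYNC_GPU_SEMAPHORE  /* rings synchronise among themselves */
};

/* Hypothetical stand-in for the device info the driver consults. */
struct device_model {
    int gen;
    bool is_mobile;
};

enum sync_method choose_sync(const struct device_model *dev,
                             int from_ring, int to_ring)
{
    if (from_ring < 0 || to_ring == from_ring)
        return SYNC_NONE;

    /* With the workaround, gen6 mobile parts take the same CPU-wait
     * path as pre-gen6 hardware instead of using semaphores. */
    if (dev->gen < 6 || dev->is_mobile)
        return SYNC_CPU_WAIT;

    return SYNC_GPU_SEMAPHORE;
}
```

In this model a gen6 desktop part still gets semaphores, while a gen6 mobile part falls back to waiting on the CPU, which is slower but avoids the code path under suspicion.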
Comment 25 meng 2011-01-13 22:46:54 UTC
Trying 3-5 times, it still fails on Kernel: (drm-intel-next) 6fe4f14044f181e146cdc15485428f95fa541ce8. But with your patch on this commit, it works well (ran 18 times).
Comment 26 Chris Wilson 2011-01-14 01:54:27 UTC
I've applied the workaround to -fixes. Hopefully I can get a HuronRiver machine in the near future, or someone else may find some time to debug the implementation.

commit 1591192d3a17adeebd03be0ce5888b88bddfaf89
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 14 09:46:38 2011 +0000

    drm/i915: Disable GPU semaphores on SandyBridge mobile
    
    Hopefully, this is a temporary measure whilst the root cause is
    understood. At the moment, we experience a hard hang whilst looping
    urbanterror that has been identified as a result of the use of
    semaphores, but so far only on SNB mobile.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32752
    Tested-by: mengmeng.meng@intel.com
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 27 meng 2011-01-17 00:08:48 UTC
The problem only exists in drm-intel-next. It works fine in drm-intel-fixes. But on HuronRiver with commit (drm-intel-next) 22ab70d3262ddb6e69b3c246a34e2967ba5eb1e8, the screen stops for a while and then goes on running, although it can run 15 times continuously.
Comment 28 meng 2011-01-17 00:49:48 UTC
Sorry, there is a serious mistake above. It's commit (drm-intel-fixes) 22ab70d3262ddb6e69b3c246a34e2967ba5eb1e8, not "drm-intel-next".
Comment 29 Chris Wilson 2011-01-17 01:09:36 UTC
Anything written to dmesg at the time of the hang? Is the CPU busy? Is the GPU?
Comment 30 meng 2011-01-17 02:12:56 UTC
The screen will stop-go about every 12s, only on gnome-desktop, with commit (drm-intel-fixes) 22ab70d3262ddb6e69b3c246a34e2967ba5eb1e8. CPU usage is about 42%-92% while running the demo, and goes up to 92% when the screen stops. I'm sorry, I don't know whether the GPU is busy or not. Could you tell me how to view GPU usage? Please see the dmesg in the attachment. It's worth noting that it works well without gnome-desktop.
Comment 31 meng 2011-01-17 02:13:40 UTC
Created attachment 42117
dmesg about stop-go
Comment 32 Chris Wilson 2011-01-17 02:22:31 UTC
In intel-gpu-tools, there is a little tool called intel_gpu_top which queries the ring status and INSTDONE.

Also can you enable CONFIG_PRINTK_TIME in your kernel builds (for all machines)?

It looks like we are missing interrupts. So watch /sys/kernel/debug/dri/0/i915_gem_interrupts to check whether we are indeed raising BLT interrupts.
Comment 33 meng 2011-01-21 00:44:01 UTC
With the Kernel: (drm-intel-fixes) 4efe070896e1f7373c98a13713e659d1f5dee52a, it works fine.
Comment 34 meng 2011-01-23 21:30:36 UTC
Testing in (drm-intel-next) fe4402931e43e81a4129eba41d05cf8907603af5:
1. System hang (solved)
   The system is OK (ran the demo 25 times).
2. But the stop-go still exists
   It's a compiz problem. In gnome without compiz, it works fine.
Comment 35 meng 2011-01-23 22:53:00 UTC
Verified with Kernel: (drm-intel-next) fe4402931e43e81a4129eba41d05cf8907603af5.
Please see bug 33394 about the screen stuttering when running 3D game demos with compiz enabled.
Comment 36 Chris Wilson 2011-01-24 02:12:47 UTC
The bug isn't fixed. We are incurring a 2-3x glyph performance penalty working around the issue.
Comment 37 meng 2011-01-25 23:22:43 UTC
Created attachment 42511
i915_gem_interrupts information

i915_gem_interrupts information in gnome with compiz, before and after running urbanterror.
Comment 38 meng 2011-01-25 23:53:04 UTC
Testing on Huronriver with (drm-intel-next) 5d6135012e9a7aa8a9128145ed9315eb916feea2,
running urbanterror with compiz:
1. dmesg:
   [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... blt ring idle [waiting on 605867, at 605867], missed IRQ?
   [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... blt ring idle [waiting on 606790, at 606790], missed IRQ?
2. intel_gpu_top:
                      running normally    screen stuttering
   render    busy:    12-20%              35-45%
   bitstream busy:    0%                  0%
   blitter   busy:    2-3%                5-12%
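
For anyone reading the dmesg lines above: "waiting on 605867, at 605867" means the ring has already reached the sequence number a waiter was sleeping on, which is why the driver suspects a missed IRQ rather than a stuck GPU. A minimal sketch of that check follows. The wrap-safe comparison is modelled on what I understand the driver's seqno comparison to be, so treat it as an assumption; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if 'current' is at or past 'target', allowing for 32-bit
 * wrap-around of the sequence counter. */
bool seqno_passed(uint32_t current, uint32_t target)
{
    return (int32_t)(current - target) >= 0;
}

/* The ring has completed the awaited request, yet someone is still
 * asleep waiting for it: the wake-up interrupt was probably lost. */
bool looks_like_missed_irq(uint32_t waiting_on, uint32_t at,
                           bool waiter_still_asleep)
{
    return waiter_still_asleep && seqno_passed(at, waiting_on);
}
```

Both logged pairs (605867/605867 and 606790/606790) satisfy this condition, which matches the "missed IRQ?" wording in the message.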
Comment 39 meng 2011-02-14 00:35:03 UTC
On our new Huronriver (Intel Corporation Device 0116 (rev 09)), the screen still stutters when running 3D game demos (openarena, urbanterror) with compiz enabled, with (drm-intel-next) commit 9db4a9c7b2a3bd5b4952846bc0c2f58daa80ddd7.
Comment 40 Zou Nan hai 2011-02-14 23:27:02 UTC
Is it a hang or just a black screen on the new system?
Comment 41 meng 2011-02-14 23:59:58 UTC
(In reply to comment #40)
> Is it a hang or just a black screen on the new system?

Neither a hang nor a black screen. The screen just stutters (Bug 33394).
Comment 42 Gordon Jin 2011-02-15 00:55:30 UTC
Meng, if you want to check if this problem exists on the new Huron River machine (with rev09), you'd better use a kernel without the patch mentioned in https://bugs.freedesktop.org/show_bug.cgi?id=32752#c26, e.g. you may use the kernel mentioned in comment 23 or 25.

If you want to track the stuttered issue, you should track in Bug#33394.
Comment 43 Gordon Jin 2011-02-28 23:25:08 UTC
Chris, is the workaround patch (mentioned in comment#26) still in upstream? Do you need us to test on the Huron River rev09 which we got recently? (Previously we were testing with Huron River rev08.)
Comment 44 Chris Wilson 2011-03-01 01:51:53 UTC
The workaround is still in upstream. I haven't had long enough to be sure that my rev09 HuronRiver is stable. I've not experienced any of the umpteen OOPS we have filed against SNB, so it might well be... I just hesitate to reintroduce another potential random hang whilst we have random hangs unresolved!
Comment 45 Chris Wilson 2011-03-02 15:18:34 UTC
The semaphores are enabled again in -next, and we are tracking the stuttering in another bug #33394.

So close this pending and hope that the hangs related to GPU semaphores are truly fixed...
Comment 46 meng 2011-03-08 00:49:13 UTC
Verified with commit 467cffba85791cdfce38c124d75bd578f4bb8625; it works fine when running the urbanterror demo (16 times) on a Huron River (0126 rev08).