Summary: | [SNB] GPU semaphores disabled on HuronRiver (due to hard hangs in urbanterror) | |
---|---|---|---
Product: | DRI | Reporter: | meng <mengmeng.meng>
Component: | DRM/Intel | Assignee: | Chris Wilson <chris>
Status: | CLOSED FIXED | QA Contact: |
Severity: | major | |
Priority: | medium | CC: | brian, jbarnes, xunx.fang
Version: | XOrg git | |
Hardware: | x86 (IA32) | |
OS: | Linux (All) | |
Whiteboard: | | |
i915 platform: | | i915 features: |
Attachments: |
|
Description
meng
2010-12-31 00:35:53 UTC
I've been using pts to run urbanterror (and ut2004-demo) in a loop over various resolutions without failure. Is the hang reproducible on your systems with pts, or do you need manual interaction? Can you give some information on the nature of the hang? Is the machine pingable? Anything strange in any of the logs, etc.?

Letting it run on, I've hit an EBUSY and a bug where it spins inside _XReply. Do either of those match the symptoms you've encountered?

Demoting to P2 to unblock 2010Q4 release, since this happens on drm-intel-next.

Attacked the problems I saw:

commit 7892e65a596d93c2fad8781214dd8bfff9735d76
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Jan 4 17:34:02 2011 +0000

    drm/i915: Handle ringbuffer stalls when flushing

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 9932e0bb5ec8f8a36401d6b491db6ec53b71f381
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Jan 4 17:35:21 2011 +0000

    drm/i915: Mask USER interrupts on gen6 (until required)

    Otherwise we may consume 20% of the CPU just handling IRQs whilst rendering. Ouch.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 8274c8f68dbbd5c2e273567c7be214491801be33
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Jan 4 22:22:17 2011 +0000

    drm/i915/debugfs: Show the per-ring IMR

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 74fe9bb76464289b2f994bcd188d906f30912326
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Jan 4 22:22:56 2011 +0000

    drm/i915/ringbuffer: Simplify the ring irq refcounting

    ... and move it under the spinlock to gain the appropriate memory barriers.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32752
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Hopefully this clears up your symptoms as well.

Tested with the following environment; this issue still exists. The system hangs and cannot be controlled remotely.
System Environment:
--------------------------
Libdrm: (master)2.4.23-4-gbad5242a59aa8e31cf10749e2ac69b3c66ef7da0
Mesa: (master)90b7a4cc1a9ec6560fba337fb86be2a574498acb
Xserver: (master)xorg-server-1.9.99.901-76-g261d0d16af797bb52d4c778e220296d7f2b28e14
Xf86_video_intel: (master)2.13.902-22-ga7c7a9108f76aa312f3d5efa466052b914c81484
Kernel_unstable: (drm-intel-next)a9ac4ef59da8cb3c2990bb0bd4fcbaa9094c6ac6

Oops, you might need this as well:

commit 7fd1179aa58aeb239bddb669ddfe1326497e7012
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Jan 5 10:32:24 2011 +0000

    drm/i915: Make the ring IMR handling private

    As the IMR for the USER interrupts are not modified elsewhere, we can separate the spinlock used for these from that of hpd and pipestats. Those two IMR are manipulated under an IRQ and so need heavier locking.

    Reported-and-tested-by: Alexey Fisher <bug-track@fisher-privat.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Hi Chris, even with the newest commit (7fd1179aa58aeb239bddb669ddfe1326497e7012) on the -next branch, running the urbanterror demo 2~5 times still makes the system hang. We also tested with the urbanterror obtained from pts, with pts1.dm_68. The command I used is as follows; can you reproduce it?

vblank_mode=0 ./urbanterror +timedemo 1 +set demodone 'quit' +set demoloop1 'demo pts1; set nextdemo vstr demodone' +vstr demoloop1 +set r_customwidth 1024 +set r_customheight 768

I'm currently looking at where 'x11perf -d :0 -copywinwin10 -copywinwin500' causes a hang. The demo runs happily in a loop here.

But I found the cause of my copywinwin regression:

commit 1398261a2e84c537c409259cfe9db3d0abcd9f99
Author: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Date: Wed Dec 15 15:42:31 2010 +0800

    drm/i915: Add self-refresh support on Sandybridge

    Add the support of memory self-refresh on Sandybridge, which is now support 3 levels of watermarks and the source of the latency values for watermarks has changed.
    On Sandybridge, the LP0 WM value is not hardcoded any more. All the latency value is now should be extracted from MCHBAR SSKPD register. And the MCHBAR base address is changed, too.

    For the WM values, if any calculated watermark values is larger than the maximum value that can be programmed into the associated watermark register, that watermark must be disabled.

    Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
    [ickle: remove duplicate compute routines and fixup for checkpatch]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

This is sufficient to make my system stable again for x11perf:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index 365b47c..17ffb55 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6554,7 +6554,7 @@ static void intel_init_display(struct drm_device *dev)
 					      "Disable CxSR\n");
 			dev_priv->display.update_wm = NULL;
 		}
-	} else if (IS_GEN6(dev)) {
+	} else if (IS_GEN6(dev) && 0) {
 		if (SNB_READ_WM0_LATENCY()) {
 			dev_priv->display.update_wm = sandybridge_update
 		} else {

Can you try that to see if we have the same bug?
And this is the magic that worked:

@@ -3654,7 +3662,9 @@ static void sandybridge_update_wm(struct drm_device *dev,
 			       &sandybridge_cursor_wm_info, latency,
 			       &plane_wm, &cursor_wm)) {
 		I915_WRITE(WM0_PIPEA_ILK,
-			   (plane_wm << WM0_PIPE_PLANE_SHIFT) | cursor_wm);
+			   ((sandybridge_display_wm_info.fifo_size - plane_wm) << WM0_PIPE_PLANE_SHIFT) |
+			   (2 << WM0_PIPE_SPRITE_SHIFT) |
+			   cursor_wm);
 		DRM_DEBUG_KMS("FIFO watermarks For pipe A -"
 			      " plane %d, " "cursor: %d\n",
 			      plane_wm, cursor_wm);
@@ -3666,7 +3676,9 @@ static void sandybridge_update_wm(struct drm_device *dev,
 			       &sandybridge_cursor_wm_info, latency,
 			       &plane_wm, &cursor_wm)) {
 		I915_WRITE(WM0_PIPEB_ILK,
-			   (plane_wm << WM0_PIPE_PLANE_SHIFT) | cursor_wm);
+			   ((sandybridge_display_wm_info.fifo_size - plane_wm) << WM0_PIPE_PLANE_SHIFT) |
+			   (2 << WM0_PIPE_SPRITE_SHIFT) |
+			   cursor_wm);
 		DRM_DEBUG_KMS("FIFO watermarks For pipe B -"
 			      " plane %d, cursor: %d\n",
 			      plane_wm, cursor_wm);

Hi Chris, with commit fbf4b94f7d01550799ff7669c65ebb0c053e015c plus the patch that you gave on 7 Jan, the system hangs too.

It appears we are seeing completely different bugs (I only ever saw GPU hangs and not system hangs) and I don't seem to be able to reproduce your bug. You may need to do some investigation as well. Bisection is tricky since we have a few nasty bugs that cause random hangs, but I would first disable sandybridge_update_wm, fbc and rc6, since they are the most likely suspects.

With commit fbf4b94f7d01550799ff7669c65ebb0c053e015c without the patch, the system hangs too.

Created attachment 41869
a patch

Hi, with Kernel: (drm-intel-next)34da1327c3814781925396fa10c42f596588ff76, after applying the patch that you gave, the system hangs too.
-----------------------------------------------------------
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5ca0663..8a81c62 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -168,7 +168,7 @@ static const struct intel_device_info intel_sandybridge_d_info = {
 static const struct intel_device_info intel_sandybridge_m_info = {
 	.gen = 6, .is_mobile = 1,
 	.need_gfx_hws = 1, .has_hotplug = 1,
-	.has_fbc = 1,
+	.has_fbc = 0,
 	.has_bsd_ring = 1,
 	.has_blt_ring = 1,
 };
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 14ac352..db3be33 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6561,7 +6561,7 @@ static void intel_init_display(struct drm_device *dev)
 					      "Disable CxSR\n");
 			dev_priv->display.update_wm = NULL;
 		}
-	} else if (IS_GEN6(dev)) {
+	} else if (IS_GEN6(dev) && 0) {
 		if (SNB_READ_WM0_LATENCY()) {
 			dev_priv->display.update_wm = sandybridge_update_wm;
 		} else {
@@ -6739,7 +6739,7 @@ void intel_modeset_init(struct drm_device *dev)
 		intel_init_emon(dev);
 	}

-	if (IS_GEN6(dev))
+	if (IS_GEN6(dev) && 0)
 		gen6_enable_rps(dev_priv);

 	if (IS_IRONLAKE_M(dev)) {
@@ -6790,7 +6790,7 @@ void intel_modeset_cleanup(struct drm_device *dev)
 	if (IS_IRONLAKE_M(dev))
 		ironlake_disable_drps(dev);

-	if (IS_GEN6(dev))
+	if (IS_GEN6(dev) && 0)
 		gen6_disable_rps(dev);

 	if (IS_IRONLAKE_M(dev))

Oh, sorry. The patch was given by Liu, Yuanhan.

I've pushed the tree that I'm currently running on my SNB to drm-intel-staging (basically it is drm-intel-fixes + snb wm patch). Can you please try that kernel?

Hi Chris, I tried Kernel: (drm-intel-staging)403c5307152ee5be73800d699902a327976f5d1d; the system hangs too.

I am at a complete loss. I can't seem to reproduce this (aside from the erratic wm fixed in -staging), but as the bug is now upstream, this needs to be P1. Is it possible for you to narrow the range of known good/bad kernels?
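For anyone following the watermark change quoted earlier in this thread ("the magic that worked"), here is a minimal Python sketch of how the original and patched expressions pack the value written to WM0_PIPEA_ILK / WM0_PIPEB_ILK. The shift values (16 for the plane field, 8 for the sprite field) are assumptions recalled from i915_reg.h and are not quoted anywhere in this thread; the FIFO size and watermark numbers in the usage note are purely illustrative.

```python
# Assumed field positions; not taken from this bug report.
WM0_PIPE_PLANE_SHIFT = 16
WM0_PIPE_SPRITE_SHIFT = 8


def wm0_original(plane_wm, cursor_wm):
    """Value written before the patch: plane watermark and cursor only."""
    return (plane_wm << WM0_PIPE_PLANE_SHIFT) | cursor_wm


def wm0_patched(fifo_size, plane_wm, cursor_wm, sprite_wm=2):
    """Value written after the patch.

    The plane field becomes (fifo_size - plane_wm) and the sprite field is
    set to the constant 2 used in the quoted diff.
    """
    return (((fifo_size - plane_wm) << WM0_PIPE_PLANE_SHIFT)
            | (sprite_wm << WM0_PIPE_SPRITE_SHIFT)
            | cursor_wm)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to make the fields easy to read in hex.
    print(hex(wm0_original(plane_wm=20, cursor_wm=6)))
    print(hex(wm0_patched(fifo_size=128, plane_wm=20, cursor_wm=6)))
```

With these made-up inputs the plane field changes from 20 to 108 and the sprite field from 0 to 2, which is the whole effect of the quoted hunk on the register value.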
Ah, I've just realised I'm on a desktop (SugarBay) vs your mobile (HuronRiver). Can you confirm that -staging works on SugarBay? (Just for my sanity!)

(In reply to comment #21)
> Ah, I've just realised I'm on a desktop (SugarBay) vs your mobile (HuronRiver).
> Can you confirm that -staging works on SugarBay? (Just for my sanity!)

Yes, Chris. This issue only happens on our HuronRiver Rev 08. On our SugarBay Rev 09, there is no such issue with drm-intel-fixes or drm-intel-next. I guess the -staging branch also works well; I will try the -staging branch tomorrow. We have bisected down to only 2 remaining commits, but I forget which commits they are right now; I think we can give you an update tomorrow. BTW, when bisecting, if a kernel can run the demo 15 times consecutively we take it as a pass, otherwise as a fail. So a bad commit may be mistaken for a good one if it happens to survive 15 runs.

Kernel: 1ec14ad3132702694f2e1a90b30641cf111183b9 is the first bad commit, and its parent commit is 340479aac697bc73e225c122a9753d4964eeda3f. When bisecting, if a kernel can run the demo 15 times consecutively we take it as a pass, otherwise as a fail. Running the demo on this bad commit 2-4 times, the system hangs. Running the demo on the good commit (340479aac697bc73e225c122a9753d4964eeda3f) many times (run 15 times, wait a while, then run another 15 times), the system is OK.

----------------------------------------------------------------------------
commit 1ec14ad3132702694f2e1a90b30641cf111183b9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sat Dec 4 11:30:53 2010 +0000

    drm/i915: Implement GPU semaphores for inter-ring synchronisation on SNB

    The bulk of the change is to convert the growing list of rings into an array so that the relationship between the rings and the semaphore sync registers can be easily computed.
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Presumably then

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i
index e698343..28bdfe0 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -770,7 +770,7 @@ i915_gem_execbuffer_sync_rings(struct drm_i915_gem_object *o
 	if (from == NULL || to == from)
 		return 0;

-	if (INTEL_INFO(obj->base.dev)->gen < 6)
+	if (INTEL_INFO(obj->base.dev)->gen < 6 || IS_MOBILE(obj->base.dev))
 		return i915_gem_object_wait_rendering(obj, true);

 	idx = intel_ring_sync_index(from, to);

is a sufficient patch for us to run stably on both desktop + mobile until somebody has a chance to debug the semaphore implementation.

Trying 3-5 times, it still fails on Kernel: (drm-intel-next)6fe4f14044f181e146cdc15485428f95fa541ce8. But with your patch on top of this commit, it works well (ran 18 times).

I've applied the workaround to -fixes. Hopefully I can get a HuronRiver machine in the near future, or someone else may find some time to debug the implementation.

commit 1591192d3a17adeebd03be0ce5888b88bddfaf89
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Jan 14 09:46:38 2011 +0000

    drm/i915: Disable GPU semaphores on SandyBridge mobile

    Hopefully, this is a temporary measure whilst the root cause is understood. At the moment, we experience a hard hang whilst looping urbanterror that has been identified as a result of the use of semaphores, but so far only on SNB mobile.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32752
    Tested-by: mengmeng.meng@intel.com
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

The problem only exists in drm-intel-next. It worked fine in drm-intel-fixes. But on HuronRiver with commit (drm-intel-next)22ab70d3262ddb6e69b3c246a34e2967ba5eb1e8, the screen stops for a while and then carries on running, although it can run 15 times continuously.
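To make the effect of the one-line workaround in i915_gem_execbuffer_sync_rings() easier to see, here is a small Python model of the decision it changes. This is only a sketch of the branch visible in the quoted hunk, not the driver code: the function names and the string return values are hypothetical, and everything after the `idx = intel_ring_sync_index(from, to)` line is summarised as "semaphore".

```python
def needs_cpu_wait(gen, is_mobile):
    """Mirror of the patched condition: gen < 6 || IS_MOBILE(dev)."""
    return gen < 6 or is_mobile


def sync_strategy(from_ring, to_ring, gen, is_mobile):
    """Return which synchronisation path the quoted hunk would take.

    "none"      : no previous ring, or same ring, so nothing to synchronise.
    "cpu-wait"  : fall back to i915_gem_object_wait_rendering().
    "semaphore" : continue to the inter-ring semaphore path.
    """
    if from_ring is None or to_ring == from_ring:
        return "none"
    if needs_cpu_wait(gen, is_mobile):
        return "cpu-wait"
    return "semaphore"


if __name__ == "__main__":
    # SNB desktop (SugarBay) keeps semaphores; SNB mobile (HuronRiver) does not.
    print(sync_strategy("render", "blt", gen=6, is_mobile=False))
    print(sync_strategy("render", "blt", gen=6, is_mobile=True))
```

Before the patch only the `gen < 6` half of the condition existed, so both SNB variants took the semaphore path.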
Sorry, there is a serious mistake above. It is commit (drm-intel-fixes)22ab70d3262ddb6e69b3c246a34e2967ba5eb1e8, not "drm-intel-next".

Anything written to dmesg at the time of the hang? Is the CPU busy? Is the GPU?

The screen stops and resumes about every 12s, only on gnome-desktop, with commit (drm-intel-fixes)22ab70d3262ddb6e69b3c246a34e2967ba5eb1e8. CPU usage is about 42%-92% while running the demo, rising to 92% when the screen stops. I'm sorry, I don't know whether the GPU is busy or not. Could you tell me how to view GPU usage? Please see the dmesg in the attachment. It's worth noting that it works well without gnome-desktop.

Created attachment 42117
dmesg about stop-go
In intel-gpu-tools, there is a little tool called intel_gpu_top which queries the ring status and INSTDONE. Also, can you enable CONFIG_PRINTK_TIME in your kernel builds (for all machines)? It looks like we are missing interrupts, so watch /sys/kernel/debug/dri/0/i915_gem_interrupts to check whether we are indeed raising BLT interrupts.

With Kernel: (drm-intel-fixes)4efe070896e1f7373c98a13713e659d1f5dee52a, it works fine.

Testing in (drm-intel-next)fe4402931e43e81a4129eba41d05cf8907603af5:
1. System hang (solved). The system is OK (ran the demo 25 times).
2. But the stop-go still exists. It's a compiz problem. In gnome without compiz, it works fine.

Verified with Kernel: (drm-intel-next)fe4402931e43e81a4129eba41d05cf8907603af5. Please also see bug 33394 about the screen stuttering when running the demos of 3D games with compiz enabled.

The bug isn't fixed. We are incurring a 2-3x glyph performance penalty working around the issue.

Created attachment 42511
i915_gem_interrupts information
In gnome with compiz, i915_gem_interrupts information before and after running urbanterror
Testing on HuronRiver in (drm-intel-next)5d6135012e9a7aa8a9128145ed9315eb916feea2, running urbanterror with compiz:

1. dmesg:
[drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... blt ring idle [waiting on 605867, at 605867], missed IRQ?
[drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... blt ring idle [waiting on 606790, at 606790], missed IRQ?

2. intel_gpu_top:

 | running normally | screen stuttered
---|---|---
render busy | 12-20% | 35-45%
bitstream busy | 0% | 0%
blitter busy | 2-3 | 5-12%

On our new HuronRiver (Intel Corporation Device 0116 (rev 09)), the screen still stutters when running the demos of 3D games (openarena, urbanterror) with compiz enabled, with (drm-intel-next) commit 9db4a9c7b2a3bd5b4952846bc0c2f58daa80ddd7.

Is it a hang or just a black screen on the new system?

(In reply to comment #40)
> Is it a hang or just a black screen on the new system?

Not a hang or a black screen. The screen just stutters (Bug 33394).

Meng, if you want to check whether this problem exists on the new Huron River machine (with rev09), you'd better use a kernel without the patch mentioned in https://bugs.freedesktop.org/show_bug.cgi?id=32752#c26, e.g. you may use the kernel mentioned in comment 23 or 25. If you want to track the stuttering issue, you should track it in Bug#33394.

Chris, is the workaround patch (mentioned in comment#26) still upstream? Do you need us to test the Huron River rev09 which we got recently? (Previously we were testing with Huron River rev08.)

The workaround is still upstream. I haven't had long enough to be sure that my rev09 HuronRiver is stable. I've not experienced any of the umpteen OOPS we have filed against SNB, so it might well be... I just hesitate to reintroduce another potential random hang whilst we have random hangs unresolved!

The semaphores are enabled again in -next, and we are tracking the stuttering in another bug #33394. So close this pending and hope that the hangs related to GPU semaphores are truly fixed...
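When collecting many hangcheck lines like the two quoted in the compiz test above, a small script can pull out the ring name and sequence numbers. The sketch below is an illustration only: the regular expression is written against the exact wording of those two dmesg lines and may not match other kernel versions, the function name is hypothetical, and sequence-number wraparound is ignored. A line where "at" has reached "waiting on" fits the message's own "missed IRQ?" hint, i.e. the ring finished but the waiter was apparently never woken.

```python
import re

# Pattern derived only from the two dmesg lines quoted in this thread.
HANGCHECK_RE = re.compile(
    r"Hangcheck timer elapsed\.\.\. (\w+) ring idle "
    r"\[waiting on (\d+), at (\d+)\], missed IRQ\?"
)


def parse_hangcheck(line):
    """Return the ring name and seqnos from a hangcheck line, or None."""
    match = HANGCHECK_RE.search(line)
    if match is None:
        return None
    waiting_on = int(match.group(2))
    at = int(match.group(3))
    return {
        "ring": match.group(1),
        "waiting_on": waiting_on,
        "at": at,
        # True when the ring already reached the awaited seqno
        # (wraparound is deliberately not handled in this sketch).
        "seqno_reached": at >= waiting_on,
    }


if __name__ == "__main__":
    sample = ("[drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer "
              "elapsed... blt ring idle [waiting on 605867, at 605867], "
              "missed IRQ?")
    print(parse_hangcheck(sample))
```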
Verified with commit 467cffba85791cdfce38c124d75bd578f4bb8625: it works fine when running the urbanterror demo (16 times) on a Huron River (0126 rev08).

Closing old verified+fixed. |
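A note on the pass criterion used in this thread (15 clean runs during bisection, 16 for the final verification): for an intermittent hang, a fixed number of clean runs can still pass a bad kernel by chance. The Python sketch below estimates that chance under the simplifying assumption that each demo run hangs independently with the same probability; that independence, the function names, and the example probabilities are all assumptions for illustration, not measurements from this bug.

```python
import math


def false_pass_probability(p_hang, runs):
    """Chance that a bad kernel survives `runs` consecutive demo runs."""
    if not 0.0 < p_hang < 1.0:
        raise ValueError("p_hang must be strictly between 0 and 1")
    return (1.0 - p_hang) ** runs


def runs_needed(p_hang, max_false_pass):
    """Smallest run count keeping the false-pass chance at or below the limit."""
    if not 0.0 < p_hang < 1.0 or not 0.0 < max_false_pass < 1.0:
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_pass) / math.log(1.0 - p_hang))


if __name__ == "__main__":
    # Hypothetical per-run hang probabilities, for illustration only.
    for p in (0.5, 0.25, 0.1):
        print(p, false_pass_probability(p, 15), runs_needed(p, 0.01))
```

The rarer the hang per run, the more runs are needed before a "good" verdict can be trusted, which is the caveat raised earlier about a bad commit being mistaken for a good one.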