Summary: | [SNB] drm:i915_hangcheck_ring_idle introduced with xf86-video-intel commit after Jan 18. | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | nkalkhof | ||||||||||
Component: | DRM/Intel | Assignee: | Daniel Vetter <daniel> | ||||||||||
Status: | CLOSED FIXED | QA Contact: | |||||||||||
Severity: | critical | ||||||||||||
Priority: | high | CC: | ben, chris, daniel, jbarnes | ||||||||||
Version: | unspecified | ||||||||||||
Hardware: | x86-64 (AMD64) | ||||||||||||
OS: | Linux (All) | ||||||||||||
Whiteboard: | |||||||||||||
i915 platform: | i915 features: | ||||||||||||
Attachments: |
|
Bumping to drm until proven otherwise as this is a very worrisome message that needs wider discussion. Is this reproducible? Can we see the /sys/kernel/debug/dri/0/i915_error_state? Can you please try out the my-next branch available at: http://cgit.freedesktop.org/~danvet/drm/log/?h=my-next That one contains a workaround that helped greatly on similar issue for ivb. Fyi, the i915_error_state file will only contain interesting stuff if you use the kernel branch I've pointed you at. @Chris: To reproduce the issue use the latest mplayer vaapi-patch (git://gitorious.org/vaapi/mplayer.git) together with the latest vaapi/intel-driver git: libva: VA-API version 0.32.0 libva: va_getDriverName() returns 0 libva: Trying to open /usr/local/lib/dri/i965_drv_video.so libva: va_openDriver() returns 0 vainfo: VA-API version: 0.32 (libva 1.0.16.pre1) vainfo: Driver version: Intel i965 driver - 1.0.16.pre1 vainfo: Supported profile and entrypoints VAProfileMPEG2Simple : VAEntrypointVLD VAProfileMPEG2Main : VAEntrypointVLD VAProfileH264Baseline : VAEntrypointVLD VAProfileH264Baseline : VAEntrypointEncSlice VAProfileH264Main : VAEntrypointVLD VAProfileH264Main : VAEntrypointEncSlice VAProfileH264High : VAEntrypointVLD VAProfileH264High : VAEntrypointEncSlice VAProfileVC1Simple : VAEntrypointVLD VAProfileVC1Main : VAEntrypointVLD VAProfileVC1Advanced : VAEntrypointVLD Play an x264 stream with high bitrate (i.e. bluray) with the following command: mplayer -va vaapi -vo vaapi -cache 8192 stream.mkv then hop through the stream a couple of times, that'll do it - at least on my box. @Daniel: Ok I've used your latest my-next branch to get a detailed i915_error_state (see attachment). Again: The issue was introduced with commits to xf86-video-intel between Jan 18 and now. I'm not saying that there is a bug in the xf86 driver but maybe now it forces an existing drm issue to occur more often. btw: my kernel params are: root=/dev/sda7 idle=mwait pcie_aspm=force i915.powersave=1 i915.i915_enable_rc6=1 Hope this helps somehow. Regards Nic Created attachment 56108 [details]
i915_error_state packed
Just to check: Do you still get the missed IRQ warnings with the new kernel? Yepp: Jan 24 19:47:22 gonzo kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung Jan 24 19:47:22 gonzo kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state Jan 24 19:47:22 gonzo kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6963 at 6955, next 6968) > --- Comment #8 from nkalkhof@web.de 2012-01-24 11:17:46 UTC ---
> Yepp:
> Jan 24 19:47:22 gonzo kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck
> timer elapsed... GPU hung
> Jan 24 19:47:22 gonzo kernel: [drm] capturing error event; look for more
> information in /debug/dri/0/i915_error_state
> Jan 24 19:47:22 gonzo kernel: [drm:i915_wait_request] *ERROR* i915_wait_request
> returns -11 (awaiting 6963 at 6955, next 6968)
Actually no, that's a bog-standard gpu hang. While I try to find out what
went wrong, can you try to reproduce with rc6 disabled?
It's not that bog standard, since the GPU is showing the signs of the semaphore hang (blt waiting upon a completed render seqno, render waiting for outstanding blt). Ok Sorry for the IRQ confusion. Disabling RC6 does NOT keep the gpu from hanging - so we're probably not looking at another rc6 issue here. Regards Nic Please test the for-nkalkhof branch from the same repository: http://cgit.freedesktop.org/~danvet/drm/log/?h=for-nkalkhof It's based on an 3.3 somewhere in the middle of the merge window, if that blows up due to unrelated breakage I can backport the patches to 3.2 stable. Daniel, thanks for the patches. Looks pretty good so far. I've experienced a brief hang during some tests playing 4 vaapi streams parallel - normally I'd never do this ;) anyway. the gpu resumes normal operation however. the kernel seems a little messed up though. sensors and wlan seems broken. Is a backport to 3.2 stable possible? Any chance the patch will be accepted in kernel 3.3rc2 since the merge window has already being closed? Regards Nic Your enthusiasm is great, but this was just a quick test patch to check whether my theory is right. Patches still need to be properly submitted and merged into official trees. We close issues only when the solution is merged in Linus' master tree. For the patches, can you supply your tested-by? I.e. Tested-by: <name> mail@address.com Hi Daniel, I'm aware that your patch was just sort of a quick hack to find the "culprit". It fixes the issue I've described thats why I closed this ticket. Hope to see it submitted/merged into the official tree soon :). As for the Tested-by: Nicolas Kalkhof nkalkhof()at()web.de If you need me for further testing, please let me know. Cheers Nic Thanks for testing, currently there's nothing new to be tested. But _really_ don't frob around with the status of the bug, I rely on that to keep track of things. If you close it before the fix is properly merged, it might get lost somewhere (yep, bad, but bad things sometimes just happen with all the stuff going on here). Understood :-) Cheers Nic Hi Daniel, according to kernel 3.3-rc4 logs the fix hit the mainstream kernel tree. While your fix in the for-nkalkhof branch works fine for me, the 3.3-rc4 kernel release freezes my system again with: .... Feb 19 22:48:41 gonzo kernel: [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 391686, at 391685], missed IRQ? .... Any clues? Regards Nic We have another report that a snb machine regressed with new 'missed IRQ' issues, see bug #46145. You can also try the patch on that bug report. I'll attach a few other things to play around with here shortly. Created attachment 57315 [details] [review] enable readback w/a from ivb on gen6, too Another patch for you to try. Created attachment 57317 [details] [review] patches from for-nkalkhof branch Yet another thing to try is to check where this issue got introduced. This patch here contains the 4 patches from the for-nkalkhof branch. It looks like somewhere between v3.2 and now another issue got introduced that makes these no longer effective. It would be very interesting to find out where exactly. You can do this by bisecting - you just have to manually apply this patch every time when testing a kernel version which doesn't yet have them (and remove them again before doing git bisect good|bad). This way we can find out where exactly these 4 patches stopped working. I suggest you start with 3.2 and check whether these patches work there, then start the git bisect run between v3.2 your current kernel. Hi Daniel, your patch (attachment 57315 [details] [review]) has no effect. I took a closer look at the changes in drivers/gpu/drm/i915/i915_irq.c and drivers/gpu/drm/i915/intel_ringbuffer.c between your for-nkalkhof branch and the current 3.3-rc4 release. I'm not sure about your patch from the for-nkalkhof branch made it into 3.3-rc4. For example 3.3-rc4 shows for a/drivers/gpu/drm/i915/intel_ringbuffer.c: @@ -795,6 +813,12 @@ gen6_ring_get_irq(struct intel_ring_buffer *ring, u32 gflag, u32 rflag) if (!dev->irq_enabled) return false; + /* It looks like we need to prevent the gt from suspending while waiting + * for an notifiy irq, otherwise irqs seem to get lost on at least the + * blt/bsd rings on ivb. */ + if (IS_GEN7(dev)) + gen6_gt_force_wake_get(dev_priv); + spin_lock(&ring->irq_lock); if (ring->irq_refcount++ == 0) { ring->irq_mask &= ~rflag; @@ -819,6 +843,9 @@ gen6_ring_put_irq(struct intel_ring_buffer *ring, u32 gflag, u32 rflag) ironlake_disable_irq(dev_priv, gflag); } spin_unlock(&ring->irq_lock); + + if (IS_GEN7(dev)) + gen6_gt_force_wake_put(dev_priv); } while your patch in the for-nkalkhkof branch explicitly calls gen6_gt_force_wake_put/get without the if(IS_GEN7(dev)) check: +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c @@ -790,8 +790,7 @@ gen6_ring_get_irq(struct intel_ring_buffer *ring, u32 gflag, u32 rflag) /* It looks like we need to prevent the gt from suspending while waiting * for an notifiy irq, otherwise irqs seem to get lost on at least the * blt/bsd rings on ivb. */ - if (IS_GEN7(dev)) - gen6_gt_force_wake_get(dev_priv); + gen6_gt_force_wake_get(dev_priv); spin_lock(&ring->irq_lock); if (ring->irq_refcount++ == 0) { @@ -818,8 +817,7 @@ gen6_ring_put_irq(struct intel_ring_buffer *ring, u32 gflag, u32 rflag) } spin_unlock(&ring->irq_lock); - if (IS_GEN7(dev)) - gen6_gt_force_wake_put(dev_priv); + gen6_gt_force_wake_put(dev_priv); } To me, it looks like the patch didn't makt it into the current kernel after all or the patch has been rewritten. Am I mistaken here or what? *headscratch* :( Regards, Nic Oops, total confusion on my part. 3.3-rc4 contains my patch, but _only_ enabled for ivb (the IS_GEN7 checks). We figured it's a bit too risky to frob around with that kind of stuff on sandybridge this late the in -rc cycle, so the voodoo patch for snb has been merged to drm-intel-next branch at http://cgit.freedesktop.org/~danvet/drm-intel We plan to backport this when it's merged upstream (into 3.4 kernels) and doesn't blow up there. Hence closing as fixed again, please reopen if it -next doesn't work for you. A patch referencing this bug report has been merged in Linux v3.4-rc1: commit 99ffa1629d737295e569267cf5940758139f9ddb Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Jan 25 14:04:00 2012 +0100 drm/i915: enable forcewake voodoo also for gen6 A patch referencing this bug report has been merged in Linux v3.4-rc1: commit 53d227f282eb9fa4c7cdbfd691fa372b7ca8c4c3 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Jan 25 16:32:49 2012 +0100 drm/i915: fixup seqno allocation logic for lazy_request |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 56088 [details] dmesg log Hello, some commit in xf86-video-intel causes my system to stall without gpu recovery producing syslog traffic with repeating message: Jan 24 16:57:50 gonzo kernel: [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 8181, at 8180], missed IRQ? The issue seems to be triggered when I play a vaapi accelerated stream. Although this looks like a typical drm issue, this was introduced with a xf86-video-intel commit after Jan 18. No kernel/drm/vaapi updates on my system since then. Git from Jan 18. runs just fine without any hangs. dmesg and xorg.log attached. Any Ideas??? Regards Nic