Bug 45181

Summary: [SNB] drm:i915_hangcheck_ring_idle introduced with xf86-video-intel commit after Jan 18.
Product: DRI Reporter: nkalkhof
Component: DRM/IntelAssignee: Daniel Vetter <daniel>
Status: CLOSED FIXED QA Contact:
Severity: critical    
Priority: high CC: ben, chris, daniel, jbarnes
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
dmesg log
none
i915_error_state packed
none
enable readback w/a from ivb on gen6, too
none
patches from for-nkalkhof branch none

Description nkalkhof 2012-01-24 08:11:13 UTC
Created attachment 56088 [details]
dmesg log

Hello,

some commit in xf86-video-intel causes my system to stall without gpu recovery producing syslog traffic with repeating message:

Jan 24 16:57:50 gonzo kernel: [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 8181, at 8180], missed IRQ?

The issue seems to be triggered when I play a vaapi accelerated stream. Although this looks like a typical drm issue, this was introduced with a xf86-video-intel commit after Jan 18. No kernel/drm/vaapi updates on my system since then. Git from Jan 18. runs just fine without any hangs.

dmesg and xorg.log attached.

Any Ideas???

Regards
Nic
Comment 1 Chris Wilson 2012-01-24 08:45:19 UTC
Bumping to drm until proven otherwise as this is a very worrisome message that needs wider discussion.
Comment 2 Chris Wilson 2012-01-24 08:46:47 UTC
Is this reproducible? Can we see the /sys/kernel/debug/dri/0/i915_error_state?
Comment 3 Daniel Vetter 2012-01-24 08:51:33 UTC
Can you please try out the my-next branch available at:

http://cgit.freedesktop.org/~danvet/drm/log/?h=my-next

That one contains a workaround that helped greatly on similar issue for ivb.
Comment 4 Daniel Vetter 2012-01-24 08:53:11 UTC
Fyi, the i915_error_state file will only contain interesting stuff if you use the kernel branch I've pointed you at.
Comment 5 nkalkhof 2012-01-24 10:58:45 UTC
@Chris:
To reproduce the issue use the latest mplayer vaapi-patch (git://gitorious.org/vaapi/mplayer.git) together with the latest vaapi/intel-driver git:

libva: VA-API version 0.32.0
libva: va_getDriverName() returns 0
libva: Trying to open /usr/local/lib/dri/i965_drv_video.so
libva: va_openDriver() returns 0
vainfo: VA-API version: 0.32 (libva 1.0.16.pre1)
vainfo: Driver version: Intel i965 driver - 1.0.16.pre1
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264Baseline           : VAEntrypointVLD
      VAProfileH264Baseline           : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD

Play an x264 stream with high bitrate (i.e. bluray) with the following command:
mplayer -va vaapi -vo vaapi -cache 8192 stream.mkv
then hop through the stream a couple of times, that'll do it - at least on my box.

@Daniel: Ok I've used your latest my-next branch to get a detailed i915_error_state (see attachment).

Again: The issue was introduced with commits to xf86-video-intel between Jan 18 and now. I'm not saying that there is a bug in the xf86 driver but maybe now it forces an existing drm issue to occur more often. 

btw: my kernel params are: root=/dev/sda7 idle=mwait pcie_aspm=force i915.powersave=1 i915.i915_enable_rc6=1

Hope this helps somehow.

Regards
Nic
Comment 6 nkalkhof 2012-01-24 10:59:23 UTC
Created attachment 56108 [details]
i915_error_state packed
Comment 7 Daniel Vetter 2012-01-24 11:14:08 UTC
Just to check: Do you still get the missed IRQ warnings with the new kernel?
Comment 8 nkalkhof 2012-01-24 11:17:46 UTC
Yepp:
Jan 24 19:47:22 gonzo kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jan 24 19:47:22 gonzo kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Jan 24 19:47:22 gonzo kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6963 at 6955, next 6968)
Comment 9 Daniel Vetter 2012-01-24 11:22:22 UTC
> --- Comment #8 from nkalkhof@web.de 2012-01-24 11:17:46 UTC ---
> Yepp:
> Jan 24 19:47:22 gonzo kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck
> timer elapsed... GPU hung
> Jan 24 19:47:22 gonzo kernel: [drm] capturing error event; look for more
> information in /debug/dri/0/i915_error_state
> Jan 24 19:47:22 gonzo kernel: [drm:i915_wait_request] *ERROR* i915_wait_request
> returns -11 (awaiting 6963 at 6955, next 6968)

Actually no, that's a bog-standard gpu hang. While I try to find out what
went wrong, can you try to reproduce with rc6 disabled?
Comment 10 Chris Wilson 2012-01-24 11:30:16 UTC
It's not that bog standard, since the GPU is showing the signs of the semaphore hang (blt waiting upon a completed render seqno, render waiting for outstanding blt).
Comment 11 nkalkhof 2012-01-24 13:33:49 UTC
Ok Sorry for the IRQ confusion.
Disabling RC6 does NOT keep the gpu from hanging - so we're probably not looking at another rc6 issue here.

Regards
Nic
Comment 12 Daniel Vetter 2012-01-25 01:53:25 UTC
Please test the for-nkalkhof branch from the same repository:

http://cgit.freedesktop.org/~danvet/drm/log/?h=for-nkalkhof

It's based on an 3.3 somewhere in the middle of the merge window, if that blows up due to unrelated breakage I can backport the patches to 3.2 stable.
Comment 13 nkalkhof 2012-01-25 03:30:09 UTC
Daniel,

thanks for the patches. Looks pretty good so far. I've experienced a brief hang during some tests playing 4 vaapi streams parallel - normally I'd never do this ;) anyway. the gpu resumes normal operation however.
the kernel seems a little messed up though. sensors and wlan seems broken. Is a backport to 3.2 stable possible? Any chance the patch will be accepted in kernel 3.3rc2 since the merge window has already being closed? 

Regards
Nic
Comment 14 Daniel Vetter 2012-01-25 03:49:56 UTC
Your enthusiasm is great, but this was just a quick test patch to check whether my theory is right. Patches still need to be properly submitted and merged into official trees. We close issues only when the solution is merged in Linus' master tree.

For the patches, can you supply your tested-by? I.e.

Tested-by: <name> mail@address.com
Comment 15 nkalkhof 2012-01-25 04:11:56 UTC
Hi Daniel,

I'm aware that your patch was just sort of a quick hack to find the "culprit". It fixes the issue I've described thats why I closed this ticket. Hope to see it submitted/merged into the official tree soon :).
As for the Tested-by: Nicolas Kalkhof nkalkhof()at()web.de
If you need me for further testing, please let me know.

Cheers
Nic
Comment 16 Daniel Vetter 2012-01-25 05:31:28 UTC
Thanks for testing, currently there's nothing new to be tested.

But _really_ don't frob around with the status of the bug, I rely on that to keep track of things. If you close it before the fix is properly merged, it might get lost somewhere (yep, bad, but bad things sometimes just happen with all the stuff going on here).
Comment 17 nkalkhof 2012-01-25 07:39:09 UTC
Understood :-)

Cheers
Nic
Comment 18 nkalkhof 2012-02-19 13:56:02 UTC
Hi Daniel,

according to kernel 3.3-rc4 logs the fix hit the mainstream kernel tree. While your fix in the for-nkalkhof branch works fine for me, the 3.3-rc4 kernel release freezes my system again with: 
....
Feb 19 22:48:41 gonzo kernel: [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 391686, at 391685], missed IRQ?
....

Any clues?

Regards
Nic
Comment 19 Daniel Vetter 2012-02-20 04:53:41 UTC
We have another report that a snb machine regressed with new 'missed IRQ' issues, see bug #46145. You can also try the patch on that bug report. I'll attach a few other things to play around with here shortly.
Comment 20 Daniel Vetter 2012-02-20 05:00:08 UTC
Created attachment 57315 [details] [review]
enable readback w/a from ivb on gen6, too

Another patch for you to try.
Comment 21 Daniel Vetter 2012-02-20 05:07:34 UTC
Created attachment 57317 [details] [review]
patches from for-nkalkhof branch

Yet another thing to try is to check where this issue got introduced. This patch here contains the 4 patches from the for-nkalkhof branch. It looks like somewhere between v3.2 and now another issue got introduced that makes these no longer effective.

It would be very interesting to find out where exactly. You can do this by bisecting - you just have to manually apply this patch every time when testing a kernel version which doesn't yet have them (and remove them again before doing git bisect good|bad). This way we can find out where exactly these 4 patches stopped working. I suggest you start with 3.2 and check whether these patches work there, then start the git bisect run between v3.2 your current kernel.
Comment 22 nkalkhof 2012-02-20 12:13:48 UTC
Hi Daniel,

your patch (attachment 57315 [details] [review]) has no effect. I took a closer look at the changes in drivers/gpu/drm/i915/i915_irq.c and drivers/gpu/drm/i915/intel_ringbuffer.c between your for-nkalkhof branch and the current 3.3-rc4 release. I'm not sure about your patch from the for-nkalkhof branch made it into 3.3-rc4.

For example 3.3-rc4 shows for a/drivers/gpu/drm/i915/intel_ringbuffer.c:

@@ -795,6 +813,12 @@ gen6_ring_get_irq(struct intel_ring_buffer *ring, u32 gflag, u32 rflag)
if (!dev->irq_enabled)
return false;
+ /* It looks like we need to prevent the gt from suspending while waiting
+ * for an notifiy irq, otherwise irqs seem to get lost on at least the
+ * blt/bsd rings on ivb. */
+ if (IS_GEN7(dev))
+ gen6_gt_force_wake_get(dev_priv);
+
spin_lock(&ring->irq_lock);
if (ring->irq_refcount++ == 0) {
ring->irq_mask &= ~rflag;
@@ -819,6 +843,9 @@ gen6_ring_put_irq(struct intel_ring_buffer *ring, u32 gflag, u32 rflag)
ironlake_disable_irq(dev_priv, gflag);
}
spin_unlock(&ring->irq_lock);
+
+ if (IS_GEN7(dev))
+ gen6_gt_force_wake_put(dev_priv);
} 

while your patch in the for-nkalkhkof branch explicitly calls gen6_gt_force_wake_put/get without the if(IS_GEN7(dev)) check:

+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -790,8 +790,7 @@ gen6_ring_get_irq(struct intel_ring_buffer *ring, u32 gflag, u32 rflag)
/* It looks like we need to prevent the gt from suspending while waiting
* for an notifiy irq, otherwise irqs seem to get lost on at least the
* blt/bsd rings on ivb. */
- if (IS_GEN7(dev))
- gen6_gt_force_wake_get(dev_priv);
+ gen6_gt_force_wake_get(dev_priv);
spin_lock(&ring->irq_lock);
if (ring->irq_refcount++ == 0) {
@@ -818,8 +817,7 @@ gen6_ring_put_irq(struct intel_ring_buffer *ring, u32 gflag, u32 rflag)
}
spin_unlock(&ring->irq_lock);
- if (IS_GEN7(dev))
- gen6_gt_force_wake_put(dev_priv);
+ gen6_gt_force_wake_put(dev_priv);
}

To me, it looks like the patch didn't makt it into the current kernel after all or the patch has been rewritten. Am I mistaken here or what? *headscratch* :(

Regards,
Nic
Comment 23 Daniel Vetter 2012-02-20 13:16:45 UTC
Oops, total confusion on my part. 3.3-rc4 contains my patch, but _only_ enabled for ivb (the IS_GEN7 checks). We figured it's a bit too risky to frob around with that kind of stuff on sandybridge this late the in -rc cycle, so the voodoo patch for snb has been merged to drm-intel-next branch at

http://cgit.freedesktop.org/~danvet/drm-intel

We plan to backport this when it's merged upstream (into 3.4 kernels) and doesn't blow up there. Hence closing as fixed again, please reopen if it -next doesn't work for you.
Comment 24 Florian Mickler 2012-04-05 06:49:44 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 99ffa1629d737295e569267cf5940758139f9ddb
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jan 25 14:04:00 2012 +0100

    drm/i915: enable forcewake voodoo also for gen6
Comment 25 Florian Mickler 2012-04-05 06:58:25 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 53d227f282eb9fa4c7cdbfd691fa372b7ca8c4c3
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jan 25 16:32:49 2012 +0100

    drm/i915: fixup seqno allocation logic for lazy_request

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.