Bug 18609

Summary: Recent GEM kernels eventually start dropping interrupts on intel
Product: DRI
Reporter: Ben Gamari <bgamari>
Component: General
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: ascii79, freedesktop-bugzilla, keithp, mailinglists.fredi, pierre, pva, tom, vbraun, wwoods, zdenek.kabelac
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description (Flags)
kern.log part with error and /proc/config.gz (none)
Keith's patch disabling MSI workaround (none)
Keith's patch disabling MSI workaround (none)
Patch on top of Keiths fixing the irq-storm for me (none)
Collection of IRQ Disabled errors from kern.log (none)
Small hack making the stuck irq go away (none)

Description Ben Gamari 2008-11-18 14:37:58 UTC
After a few minutes in Xorg on Eric's for-airlied branch, the kernel produces the following,


irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.27.5-100.fc10.x86_64 #1

Call Trace:
 <IRQ>  [<ffffffff810832a7>] __report_bad_irq+0x38/0x7c
 [<ffffffff810834f3>] note_interrupt+0x208/0x26d
 [<ffffffff81083c20>] handle_fasteoi_irq+0xbb/0xeb
 [<ffffffff8101309e>] do_IRQ+0xf7/0x169
 [<ffffffff81010933>] ret_from_intr+0x0/0x2e
 <EOI>  [<ffffffff81332650>] ? _spin_unlock_irqrestore+0x33/0x3e
 [<ffffffff8105d675>] ? tick_broadcast_oneshot_control+0xf4/0xfd
 [<ffffffff8105cf53>] ? tick_notify+0x22a/0x37b
 [<ffffffff813354e6>] ? notifier_call_chain+0x33/0x5b
 [<ffffffff81058bb0>] ? raw_notifier_call_chain+0xf/0x11
 [<ffffffff8105c95d>] ? clockevents_notify+0x2b/0x63
 [<ffffffff811bc4e2>] ? acpi_state_timer_broadcast+0x41/0x43
 [<ffffffff811bcd1c>] ? acpi_idle_enter_simple+0x197/0x1b4
 [<ffffffff81286103>] ? cpuidle_idle_call+0x95/0xc9
 [<ffffffff8100f279>] ? cpu_idle+0xb2/0x10b
 [<ffffffff8131f3dd>] ? rest_init+0x61/0x63

handlers:
[<ffffffffa03d73d3>] (i915_driver_irq_handler+0x0/0x19d [i915])
Disabling IRQ #16

After this, Xorg performance degrades significantly, in my case requiring mouse movement to force redraws. In other cases, this results in far more severe effects (e.g. file system corruption).
Comment 1 Mateusz Kaduk 2008-11-18 14:51:51 UTC
Created attachment 20430 [details]
kern.log part with error and /proc/config.gz

I also experience a huge performance drop after 15-20 min of work, and fsck runs at every startup, which might be due to file system corruption.

I get the kernel with:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel linux-2.6-gem-patched
cd linux-2.6-gem-patched
git checkout --track -b drm-intel-next origin/drm-intel-next
git checkout -b for-airlied origin/for-airlied

git log --date-order
shows the sha1
commit 81d5e9671c887dd53f6bbbd539efb34022c45e4d

I have a GM965 Lenovo Thinkpad T61.

This patch fixes the problem and has been working for 5 days.

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 0d215e3..ec4509f 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -849,7 +849,8 @@ int i915_driver_load(struct drm_device *dev, unsigned long f
         * According to chipset errata, on the 965GM, MSI interrupts may
         * be lost or delayed
         */
-       if (!IS_I945G(dev) && !IS_I945GM(dev) && !IS_I965GM(dev))
+          //if (!IS_I945G(dev) && !IS_I945GM(dev) && !IS_I965GM(dev))
+          if (!IS_I945G(dev) && !IS_I945GM(dev))
                pci_enable_msi(dev->pdev);
 
        intel_opregion_init(dev);
Comment 2 Ben Gamari 2008-11-18 14:55:16 UTC
Created attachment 20431 [details] [review]
Keith's patch disabling MSI workaround

This patch seems to have worked in my case; however, there have been several reports of continued issues by people in #intel-gfx (e.g. kaduk, Mononoke, et al).
Comment 3 Ben Gamari 2008-11-18 15:12:30 UTC
Comment on attachment 20431 [details] [review]
Keith's patch disabling MSI workaround

I'm an idiot. This patch has nothing to do with this bug. One moment
Comment 4 Ben Gamari 2008-11-18 15:32:45 UTC
Created attachment 20432 [details] [review]
Keith's patch disabling MSI workaround

This is the patch I intended to post earlier.
Comment 5 Will Woods 2008-11-18 17:31:01 UTC
That patch doesn't help on my system, unfortunately.

Background: this is a Cantiga (8086:2a42 (rev 03)) chipset machine, running Fedora kernel-2.6.27.5-119.fc10.x86_64, which includes the patch you listed:
http://cvs.fedoraproject.org/viewvc/rpms/kernel/F-10/drm-next-intel-irq-test.patch

Shortly after starting X, something causes it to freak out (an xrandr mode switch or VT switch seems to trigger it pretty reliably) and start generating a ridiculous number of interrupts. The kernel then has to disable the card's IRQ.

Watching /proc/dri/0/i915_gem_interrupt etc. shows that it's generating something like 60,000 interrupts per second:
[wwoods@test1102 ~]$ while true; do grep 'Interrupts' /proc/dri/0/i915_gem_interrupt ; sleep 1; done
Interrupts received: 14635063
Interrupts received: 14697264
Interrupts received: 14754781
(etc.)
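
The rate can be read off the counter samples above by differencing consecutive lines. The following is a minimal sketch, not part of the original report: the helper name interrupt_rates and the interval_s parameter are made up for illustration, and the one-second interval is only approximate because the shell loop also spends time in grep.

```python
import re

# Matches the counter line printed by the debug file quoted above.
COUNTER_RE = re.compile(r"Interrupts received:\s*(\d+)")


def interrupt_rates(log_text, interval_s=1.0):
    """Return interrupts per second between consecutive counter samples.

    interval_s is the assumed time between samples (the loop uses 'sleep 1').
    """
    counts = [int(m.group(1)) for m in COUNTER_RE.finditer(log_text)]
    return [(later - earlier) / interval_s
            for earlier, later in zip(counts, counts[1:])]


if __name__ == "__main__":
    # The three samples quoted in the comment above.
    sample = (
        "Interrupts received: 14635063\n"
        "Interrupts received: 14697264\n"
        "Interrupts received: 14754781\n"
    )
    for rate in interrupt_rates(sample):
        print(f"{rate:.0f} interrupts/s")
```

For the three samples quoted above this gives 62201 and 57517 interrupts per second, which matches the "something like 60,000" estimate.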

Kernel trace looks like this:

irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.27.5-118.fc10.x86_64 #1

Call Trace:
<IRQ>  [<ffffffff81083207>] __report_bad_irq+0x38/0x7c
[<ffffffff81083453>] note_interrupt+0x208/0x26d
[<ffffffff81083b80>] handle_fasteoi_irq+0xbb/0xeb
[<ffffffff8101309e>] do_IRQ+0xf7/0x169
[<ffffffff81010933>] ret_from_intr+0x0/0x2e
<EOI>  [<ffffffff8105e59d>] ? tick_nohz_stop_sched_tick+0x2ec/0x301
[<ffffffff8100f1f1>] ? cpu_idle+0x2a/0x10b
[<ffffffff8131ed7d>] ? rest_init+0x61/0x63

handlers:
[<ffffffff8123ae36>] (usb_hcd_irq+0x0/0xb3)
[<ffffffff8123ae36>] (usb_hcd_irq+0x0/0xb3)
[<ffffffffa0090334>] (e1000_intr+0x0/0x13b [e1000e])
Disabling IRQ #16
irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.27.5-118.fc10.x86_64 #1

Call Trace:
<IRQ>  [<ffffffff81083207>] __report_bad_irq+0x38/0x7c
[<ffffffff81083453>] note_interrupt+0x208/0x26d
[<ffffffff81083b80>] handle_fasteoi_irq+0xbb/0xeb
[<ffffffff8101309e>] do_IRQ+0xf7/0x169
[<ffffffff81010933>] ret_from_intr+0x0/0x2e
<EOI>  [<ffffffff8103e2ae>] ? finish_task_switch+0x39/0xc9
[<ffffffff8103e2a6>] ? finish_task_switch+0x31/0xc9
[<ffffffff813304a0>] ? thread_return+0x3d/0xd9
[<ffffffff8105e814>] ? tick_nohz_restart_sched_tick+0x171/0x179
[<ffffffff8100f2cd>] ? cpu_idle+0x106/0x10b
[<ffffffff8131ed7d>] ? rest_init+0x61/0x63

handlers:
[<ffffffff8123ae36>] (usb_hcd_irq+0x0/0xb3)
[<ffffffff8123ae36>] (usb_hcd_irq+0x0/0xb3)
[<ffffffffa0090334>] (e1000_intr+0x0/0x13b [e1000e])
[<ffffffffa02ac455>] (i915_driver_irq_handler+0x0/0x1a7 [i915])
Disabling IRQ #16

So, yeah. Not fixed.
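
Since IRQ 16 is shared, it helps to see which handlers were registered each time the kernel gave up on the line (in the first report above the i915 handler is not listed yet, in the second it is). The sketch below is a hypothetical log helper, not an existing tool: parse_irq_reports is an invented name, and the line formats are assumed to match the traces pasted in this bug.

```python
import re

# "[<address>] (symbol+0xoff/0xlen [module])" as printed under "handlers:".
HANDLER_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[\w+\])?\)"
)
NOBODY_CARED_RE = re.compile(r"irq (\d+): nobody cared")


def parse_irq_reports(log_text):
    """Return [(irq_number, [handler symbols])], one entry per report."""
    reports = []
    irq = None        # IRQ number of the report being parsed, if any
    handlers = None   # becomes a list once the "handlers:" line is seen
    for line in log_text.splitlines():
        start = NOBODY_CARED_RE.search(line)
        if start:
            irq, handlers = int(start.group(1)), None
            continue
        if irq is None:
            continue
        if "handlers:" in line:
            handlers = []
            continue
        if handlers is None:
            continue  # still inside the call trace
        match = HANDLER_RE.search(line)
        if match:
            handlers.append(match.group(1))
        elif "Disabling IRQ" in line:
            reports.append((irq, handlers))
            irq, handlers = None, None
    return reports


if __name__ == "__main__":
    sample = (
        'irq 16: nobody cared (try booting with the "irqpoll" option)\n'
        "handlers:\n"
        "[<ffffffff8123ae36>] (usb_hcd_irq+0x0/0xb3)\n"
        "[<ffffffffa0090334>] (e1000_intr+0x0/0x13b [e1000e])\n"
        "[<ffffffffa02ac455>] (i915_driver_irq_handler+0x0/0x1a7 [i915])\n"
        "Disabling IRQ #16\n"
    )
    for irq, names in parse_irq_reports(sample):
        print(irq, ", ".join(names))
```

Because it searches within each line, the same helper should also cope with syslog-prefixed lines, as long as the "handlers:" and "Disabling IRQ" markers are present.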
Comment 6 Will Woods 2008-11-18 17:38:52 UTC
See also these downstream bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=471162 - i945GME, seems fixed now

https://bugzilla.redhat.com/show_bug.cgi?id=471937 - Cantiga / "Integrated Graphics Controller" devices
Comment 7 Eric Anholt 2008-11-25 17:26:11 UTC
Better patches have landed in for-airlied now and should be getting pulled soon.  There was a race condition, but the effects of failure were quite different from what the spec says should have happened in that case.  Also, never managed to reproduce on GM965 here.

commit f560d6b932e4ac067188d071195875e2cb143bfa
Author: Keith Packard <keithp@keithp.com>
Date:   Wed Nov 19 14:03:05 2008 -0800

    drm/i915: Always read pipestat in irq_handler
    
    Because we write pipestat before iir, it's possible that a pipestat
    interrupt will occur between the pipestat write and the iir write. This
    leaves pipestat with an interrupt status not visible in iir. This may cause
    an interrupt flood as we never clear the pipestat event.
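
The ordering problem described in that commit message can be illustrated with a toy model. This is only a sketch under loud assumptions: ToyDisplay, both handler functions and simulate are invented for illustration, the model collapses all pipe events into two flags, and it assumes a latched pipestat event keeps the interrupt line asserted even when IIR reads 0. It is not the i915 driver code and not a description of the real hardware.

```python
class ToyDisplay:
    """Toy register model; NOT the real i915 hardware or driver."""

    def __init__(self):
        self.pipestat_status = False  # event latched in the pipe status register
        self.iir_pipe_event = False   # the same event as summarized in IIR

    def raise_pipe_event(self):
        self.pipestat_status = True
        self.iir_pipe_event = True

    def irq_asserted(self):
        # Assumption: a latched pipestat event keeps the line asserted
        # even when IIR reads back as 0.
        return self.pipestat_status or self.iir_pipe_event


def handler_iir_only(hw, event_between_writes=False):
    """Hypothetical handler that looks at pipestat only if IIR reports an event."""
    if not hw.iir_pipe_event:
        return False                  # interrupt not claimed
    hw.pipestat_status = False        # write pipestat first ...
    if event_between_writes:
        hw.raise_pipe_event()         # ... a new event lands in the gap ...
    hw.iir_pipe_event = False         # ... and the IIR write hides it
    return True


def handler_always_pipestat(hw, event_between_writes=False):
    """Hypothetical handler that checks pipestat regardless of IIR."""
    if not (hw.iir_pipe_event or hw.pipestat_status):
        return False
    hw.pipestat_status = False
    if event_between_writes:
        hw.raise_pipe_event()
    hw.iir_pipe_event = False
    return True


def simulate(handler, max_calls=5):
    """Return (line still asserted, unclaimed calls) after one racing event."""
    hw = ToyDisplay()
    hw.raise_pipe_event()
    handler(hw, event_between_writes=True)  # the race happens in this call
    unhandled = 0
    for _ in range(max_calls):
        if not hw.irq_asserted():
            break
        if not handler(hw):
            unhandled += 1
    return hw.irq_asserted(), unhandled


if __name__ == "__main__":
    print("IIR-only handler:       ", simulate(handler_iir_only))
    print("always-pipestat handler:", simulate(handler_always_pipestat))
```

In this toy model the IIR-only handler leaves the line asserted and never claims the interrupt again, which is the shape of the "nobody cared" reports above, while the handler that always reads pipestat clears the leftover event on its next invocation.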
Comment 8 Pierre Willenbrock 2008-11-26 14:25:43 UTC
I can still reproduce this on i965 using the kde4 OpenGL compositing window manager. It runs for some time at 7-180 irqs/s, only to suddenly jump to >60000 irqs/s.

I am using 2d7748e0c968da5e8ed3dab61d40b9909e3d5c7e of drm-intel (which is currently what for-airlied points to).

On the plus side, the interrupt handler feels responsible for the interrupt, so the kernel does not disable it until the system is under heavy load (e.g. loading a complex website with konqueror leads to the kernel disabling the interrupt).
Comment 9 Pierre Willenbrock 2008-11-28 14:34:46 UTC
Created attachment 20677 [details] [review]
Patch on top of Keiths fixing the irq-storm for me

This is my attempt at fixing the irq storm on my system.

The patch moves the spin_unlock to a later position. At all positions before the one in the patch, I eventually got an irq storm. This may just reduce the probability of the problem, as I don't really understand what triggers it in the first place. My only guess is that both cores of my system simultaneously enter the irq handler and then bad things happen.

I totally missed that this bug was closed at the time of my last posting. Reopening it now.
Comment 10 Pierre Willenbrock 2008-11-28 14:35:48 UTC
As per comment #9: I totally missed that this bug was closed at the time of my last posting. Reopening it now.
Comment 11 Pierre Willenbrock 2008-11-29 05:12:56 UTC
Comment on attachment 20677 [details] [review]
Patch on top of Keiths fixing the irq-storm for me

Moving the spinlock around does not help. I could trigger the irq-storm while all of i915_driver_irq_handler was protected.
Comment 12 Mateusz Kaduk 2008-12-02 11:09:00 UTC
Created attachment 20747 [details]
Collection of IRQ Disabled errors from kern.log

For 2.6.28-rc6-for-airlied
git-log --date-order
commit 728ced8c47f99a2287cdd0d3e77f5ae1a3d410e6

I experience

irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.28-rc6-for-airlied #1
Call Trace:
 <IRQ>  [<ffffffff80276b4e>] __report_bad_irq+0x1e/0x90
 [<ffffffff80276d58>] note_interrupt+0x198/0x1e0
 [<ffffffff80277615>] handle_fasteoi_irq+0xd5/0x100
 [<ffffffff8020e8f4>] do_IRQ+0xc4/0x110
 [<ffffffff8020bf26>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80540090>] menu_reflect+0x0/0x90
 [<ffffffff804178ee>] acpi_idle_enter_simple+0x1c7/0x237
 [<ffffffff804178e4>] acpi_idle_enter_simple+0x1bd/0x237
 [<ffffffff8053f31a>] cpuidle_idle_call+0xba/0x120
 [<ffffffff8020aafe>] cpu_idle+0x5e/0xc0
handlers:
[<ffffffff804e8cb0>] (usb_hcd_irq+0x0/0x70)
Disabling IRQ #16

This happens just after switching to a VT. Switching back to X is then possible, but the system is really slow. It's impossible to work; a restart is needed.

Previously, Keith's patch fixed the "Disabling IRQ" problem:
https://bugs.freedesktop.org/attachment.cgi?id=20432
Now the patch does not apply and the buggy behaviour is back.
Comment 13 Pierre Willenbrock 2008-12-02 13:32:21 UTC
Created attachment 20750 [details] [review]
Small hack making the stuck irq go away

It seems like I found a band-aid for the stuck IRQ problem. If I put 100 into the for-loop, the IRQ does get stuck after some time, but with 1000 loops I have not seen that again.

I tested this loop at these other positions, and it always worked:
* before the spin_unlock
* after the posting read
Making i915_enable_pipestat do nothing had a similar effect.

I could reproduce the stuck IRQ using either APIC or XT-PIC (which annoyingly just hangs in an IRQ loop forever), using a single CPU, and without any GL activity.

My guess is that this is a hardware timing problem, where the IRQ state machine gets into a bad state because two unknown events happen in too rapid succession.
Comment 14 Frederik 2008-12-07 11:31:23 UTC
This happens in the latest git vanilla kernels too (2.6.28-rc7-00167-g24920a7)

Dec  6 17:52:11 kotys irq 16: nobody cared (try booting with the "irqpoll" option)
Dec  6 17:52:11 kotys Pid: 6239, comm: kio_pop3 Not tainted 2.6.28-rc7-00167-g24920a7 #21
Dec  6 17:52:11 kotys Call Trace:
Dec  6 17:52:11 kotys <IRQ>  [<ffffffff80262a99>] __report_bad_irq+0x3d/0x8c
Dec  6 17:52:11 kotys [<ffffffff80262bfb>] note_interrupt+0x113/0x178
Dec  6 17:52:11 kotys [<ffffffff802632f9>] handle_fasteoi_irq+0xa6/0xca
Dec  6 17:52:11 kotys [<ffffffff8020d8ab>] do_IRQ+0x7b/0xec
Dec  6 17:52:11 kotys [<ffffffff8020b9f6>] ret_from_intr+0x0/0xa
Dec  6 17:52:11 kotys <EOI> <3>handlers:
Dec  6 17:52:11 kotys [<ffffffffa0365497>] (i915_driver_irq_handler+0x0/0x1f5 [i915])
Dec  6 17:52:11 kotys Disabling IRQ #16

Booting with the irqpoll option seems to make everything slower. For now I'm using 2.6.28-rc3, which does not have this issue. Unfortunately, it isn't easy to bisect, as the problem happens after some hours of usage ... probably related to system load.
Comment 15 Eric Anholt 2008-12-10 09:55:56 UTC
commit 8ffd652bec134f6961c20329260fd643fb478e1a
Author: Keith Packard <keithp@keithp.com>
Date:   Mon Dec 8 11:12:28 2008 -0800

    drm/i915: Disable the GM965 MSI errata workaround.
    
    Since applying the fix suggested by the errata (disabling MSI), we've had
    issues with interrupts being stuck on despite IIR being 0 on GM965 hardware.
    Most reporters of the issue have confirmed that turning MSI back on fixes
    things, and given the difficulties experienced in getting reliable MSI working
    on Linux, it's believable that the errata was about software issues and not
    actual hardware issues.
Comment 16 Peter 2009-01-03 11:22:23 UTC
Guys, do you have this bug fixed? It seems that all patches/fixes mentioned here entered 2.6.28, but I still have this problem on my Thinkpad X61:

irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 4587, comm: cc1plus Not tainted 2.6.28-gentoo-noswap #1
Call Trace:
 <IRQ>  [<ffffffff802559b8>] __report_bad_irq+0x30/0x7d
 [<ffffffff80255b0a>] note_interrupt+0x105/0x16b
 [<ffffffff8025618e>] handle_fasteoi_irq+0xa6/0xcf
 [<ffffffff8020dcb0>] do_IRQ+0x75/0xe5
 [<ffffffff8020b866>] ret_from_intr+0x0/0xa
 <EOI> <3>handlers:
[<ffffffff80440d68>] (ahci_interrupt+0x0/0x45c)
[<ffffffff803e3672>] (i915_driver_irq_handler+0x0/0x1e2)
Disabling IRQ #16

 $ uname -a
Linux tablet 2.6.28-gentoo-noswap #1 SMP PREEMPT Wed Dec 31 23:23:19 MSK 2008 x86_64 Intel(R) Core(TM)2 Duo CPU L7500 @ 1.60GHz GenuineIntel GNU/Linux

These are not vanilla sources, but none of our patches touch this part of the tree. If you don't believe me, you can review our patches here: http://dev.gentoo.org/~dsd/genpatches/patches-2.6.28-1.htm And if required, I can retest with vanilla.

And my tablet PC has the following video-card:

00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c)

Reopening again since it's still broken here.
Comment 17 Volker Braun 2009-01-08 03:01:51 UTC
I see this problem with a 2.6.28 kernel, too (2.6.28-3.fc11.x86_64 on Fedora 10). After a few days (about one day if I'm using compiz), interrupts start at a rate of about 60kHz:

[root@t61 0]# while true; do grep 'Interrupts' /proc/dri/0/i915_gem_interrupt ; sleep 1; done
Interrupts received: 193091543
Interrupts received: 193159286
Interrupts received: 193229997
Interrupts received: 193295878
[...]

There is a noticeable degradation in performance; mouse pointer is jumpy. Unlike the first reporter, I do not have to force redraws, though.

Eventually I get the kernel trace with "Disabling IRQ #16". I'm seeing this on my thinkpad t61 and x61, both having the intel X3100 graphics. On my x61, irq 16 is shared with ahci and disabling it is, well, not good.

Comment 18 Zdenek Kabelac 2009-03-26 13:43:12 UTC
This may be unrelated, but I'm using a T61 laptop myself and have had some interrupt issues in the past, so maybe you should check whether you have a recent BIOS installed on your machine (http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-67989). It fixed some irq routing issues, though USB related.

 31:     402674     420176   PCI-MSI-edge      i915@pci:0000:00:02.0
Comment 19 Volker Braun 2009-03-26 14:16:06 UTC
#18: Old news. The T61 bluetooth irq storm was fun while it lasted (NOT), but has been fixed since bios 2.09-1.08 (>1 year ago). 

The problem here is another IRQ. This bug is fixed in kernels 2.6.28 if you have MSI enabled. As far as I know there is no reliable way to use GM965 without MSI.

By default, Fedora currently uses 2.6.27 and turns MSI off. However, you can upgrade to a testing kernel and force enable MSI with the "pci=msi" kernel option. For more Fedora discussion, see https://bugzilla.redhat.com/show_bug.cgi?id=474624



Comment 20 Zdenek Kabelac 2009-03-27 03:23:32 UTC
Ahh, OK, thanks for the clarification. I'm using MSI as far as I remember and I'm not having IRQ problems, so I thought there might be some other problem, as from this bugzilla it's not 100% clear to me whether users are complaining that it works with or without MSI :) So let's state it once again: the user should check whether his /proc/interrupts file contains a similar line:

   31:       9865       9862   PCI-MSI-edge      i915@pci:0000:00:02.0

If there is no PCI-MSI-edge - he is going to have problems.
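
That check can be scripted. The sketch below is illustrative only: i915_interrupt_modes and uses_msi are made-up helper names, and the column layout (IRQ label, one counter per CPU, controller type, device names) is assumed from the line quoted above.

```python
def i915_interrupt_modes(proc_interrupts_text):
    """Return the controller/type column of every line that mentions i915."""
    modes = []
    for line in proc_interrupts_text.splitlines():
        if "i915" not in line:
            continue
        fields = line.split()
        # Skip the "<irq>:" label and the numeric per-CPU counters; the
        # first remaining field is assumed to be the controller type.
        rest = [field for field in fields[1:] if not field.isdigit()]
        if rest:
            modes.append(rest[0])
    return modes


def uses_msi(proc_interrupts_text):
    """True if i915 appears and every i915 line is an MSI interrupt."""
    modes = i915_interrupt_modes(proc_interrupts_text)
    return bool(modes) and all("MSI" in mode for mode in modes)


if __name__ == "__main__":
    # Line quoted in the comment above.
    sample = "   31:       9865       9862   PCI-MSI-edge      i915@pci:0000:00:02.0\n"
    print("i915 uses MSI:", uses_msi(sample))
    # On a real system one would pass open("/proc/interrupts").read() instead.
```

A shared legacy line, where i915 appears next to other devices under an IO-APIC entry, would make uses_msi return False.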
