Bug 87244 - [NV94] X hangs, logs show kernel: nouveau E[ PFIFO][0000:01:00.0] still angry after 101 spins, halt followed by an X trace
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) All
Importance: medium critical
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Duplicates: 88822
Reported: 2014-12-11 19:58 UTC by Adam Williamson
Modified: 2015-03-25 16:21 UTC (History)

Attachments
journal extract from the crash (with all the X tracebacks) (34.49 KB, text/plain)
2014-12-11 19:58 UTC, Adam Williamson
Revert-drm-nouveau-fifo-g84-ack-non-stall-interrupt (2.25 KB, patch)
2014-12-26 12:52 UTC, Zlatko Calusic
fifo-nv04-remove-the-loop-from-the-interrupt-handler (4.30 KB, text/plain)
2015-01-27 08:09 UTC, Zlatko Calusic

Description Adam Williamson 2014-12-11 19:58:05 UTC
I just upgraded my desktop to Fedora Rawhide, and since then X has twice crashed with the same symptoms. Oddly, I was using kernel 3.18 from kernel-rawhide-nodebug before upgrading from F21 to Rawhide, and 21 and Rawhide seem to have similar versions of the nouveau driver and Xorg components, so I'm not sure what's changed - libdrm or mesa, perhaps?

Anyway, the non-debug log I have so far is:

Dec 11 11:42:05 adam.happyassassin.net kernel: nouveau E[   PFIFO][0000:01:00.0] still angry after 101 spins, halt
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE)
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) Backtrace:
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 0: /usr/libexec/Xorg.bin (mieqEnqueue+0x24b) [0x5795ab]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 1: /usr/libexec/Xorg.bin (QueuePointerEvents+0x52) [0x450af2]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 2: /usr/lib64/xorg/modules/input/evdev_drv.so (_init+0x2eff) [0x7fa8c403295f]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 3: /usr/lib64/xorg/modules/input/evdev_drv.so (_init+0x3645) [0x7fa8c4033c25]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 4: /usr/libexec/Xorg.bin (DPMSSupported+0xe8) [0x4774c8]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 5: /usr/libexec/Xorg.bin (xf86SerialModemClearBits+0x277) [0x4a1f17]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 6: /lib64/libc.so.6 (__restore_rt+0x0) [0x7fa8cf438e7f]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 7: /lib64/libc.so.6 (ioctl+0x7) [0x7fa8cf4fde07]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 8: /lib64/libdrm.so.2 (drmIoctl+0x28) [0x7fa8d07e96c8]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 9: /lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x7fa8d07ebefb]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 10: /lib64/libdrm_nouveau.so.2 (nouveau_bo_wait+0x99) [0x7fa8c9f6f779]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 11: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (_init+0x2a3b) [0x7fa8ca17e80b]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 12: /usr/lib64/xorg/modules/libexa.so (exaMoveOutPixmap+0x123b) [0x7fa8c993289b]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 13: /usr/lib64/xorg/modules/libexa.so (exaMoveOutPixmap+0x39df) [0x7fa8c993780f]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 14: /usr/lib64/xorg/modules/libexa.so (exaEnableDisableFBAccess+0x493b) [0x7fa8c9941ffb]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 15: /usr/lib64/xorg/modules/libexa.so (exaEnableDisableFBAccess+0x1690) [0x7fa8c993bde0]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 16: /usr/libexec/Xorg.bin (DamageRegionAppend+0x541) [0x51ef81]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 17: /usr/libexec/Xorg.bin (AddTraps+0x4154) [0x518824]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 18: /usr/libexec/Xorg.bin (SendErrorToClient+0x2f7) [0x4391b7]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 19: /usr/libexec/Xorg.bin (remove_fs_handlers+0x416) [0x43d316]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 20: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7fa8cf4240e0]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 21: /usr/libexec/Xorg.bin (_start+0x29) [0x4276f9]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) 22: ? (?+0x29) [0x29]
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE)
Dec 11 11:44:13 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.

Then there's a bunch of similar X traces; I'll attach the whole thing. I'll try to get drm.debug logs and attach those as well.
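For anyone reading traces like the one above, a small parsing sketch may help. It assumes frames look like `(EE) N: /path (symbol+0xOFF) [0xADDR]`, as they do in this journal extract, and it treats the frames listed after the signal trampoline `__restore_rt` as the code that was interrupted when the input handler ran. The helper names are hypothetical, and symbols resolved without debuginfo may be inexact.

```python
import re

# Matches journal lines such as:
#   ... (EE) 10: /lib64/libdrm_nouveau.so.2 (nouveau_bo_wait+0x99) [0x7fa8c9f6f779]
FRAME_RE = re.compile(
    r"\(EE\)\s+(\d+):\s+(\S+)\s+\(([^+)]+)\+0x[0-9a-f]+\)"
)


def parse_frames(lines):
    """Return (index, module, symbol) tuples for every backtrace frame found."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


def interrupted_frames(frames, trampoline="__restore_rt"):
    """Return the frames listed after the signal trampoline (the interrupted code)."""
    for position, (_, _, symbol) in enumerate(frames):
        if symbol == trampoline:
            return frames[position + 1:]
    return []


# A few lines copied from the trace above (timestamp and hostname trimmed).
sample = [
    "gdm-Xorg-:0[1540]: (EE) Backtrace:",
    "gdm-Xorg-:0[1540]: (EE) 0: /usr/libexec/Xorg.bin (mieqEnqueue+0x24b) [0x5795ab]",
    "gdm-Xorg-:0[1540]: (EE) 6: /lib64/libc.so.6 (__restore_rt+0x0) [0x7fa8cf438e7f]",
    "gdm-Xorg-:0[1540]: (EE) 7: /lib64/libc.so.6 (ioctl+0x7) [0x7fa8cf4fde07]",
    "gdm-Xorg-:0[1540]: (EE) 8: /lib64/libdrm.so.2 (drmIoctl+0x28) [0x7fa8d07e96c8]",
    "gdm-Xorg-:0[1540]: (EE) 9: /lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x7fa8d07ebefb]",
    "gdm-Xorg-:0[1540]: (EE) 10: /lib64/libdrm_nouveau.so.2 (nouveau_bo_wait+0x99) [0x7fa8c9f6f779]",
]

for index, module, symbol in interrupted_frames(parse_frames(sample)):
    print(index, module, symbol)
```

On these sample lines the interrupted frames run from `ioctl` down to `nouveau_bo_wait`, which fits the server being stuck waiting on a nouveau buffer object after the PFIFO halt.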

There is one EE line during initial X start:

Dec 11 10:52:16 adam.happyassassin.net gdm-Xorg-:0[1540]: (EE) NOUVEAU(0): [COPY] failed to allocate class.

but that line seems to be present in older boots where I didn't encounter this problem, too.

I'm running GNOME Shell and have dual monitors attached to DVI, in portrait orientation.

xorg-x11-drv-nouveau-1.0.11-1.fc22.x86_64
kernel-3.18.0-1.fc22.x86_64
xorg-x11-server-common-1.16.2.901-1.fc22.x86_64
mesa-dri-drivers-10.5.0-0.devel.3.29c7cf2.fc22.x86_64
libdrm-2.4.58-3.fc22.x86_64

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation G94 [GeForce 9600 GT] [10de:0622] (rev a1)
Comment 1 Adam Williamson 2014-12-11 19:58:59 UTC
Created attachment 110752
journal extract from the crash (with all the X tracebacks)
Comment 2 Adam Williamson 2014-12-23 16:55:12 UTC
Just saw this again. Between occurrences I had a boot running for about three days with drm.debug=15 and it didn't happen; eventually I had to give up because I needed to see something else in my logs :( I'll try another boot with drm.debug=14 and hope to catch it...
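For reference, drm.debug is a bitmask, so 15 and 14 differ only in the lowest bit. The sketch below decodes a value using the category bits as I understand them for kernels of this era (core, driver, kms, prime); treat that table as an assumption and check the drm headers for your kernel.

```python
# Assumed bit assignments for 3.18/3.19-era kernels; verify against your
# kernel's drm headers before relying on them.
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
}


def decode_drm_debug(value):
    """Return the names of the debug categories enabled by a drm.debug value."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if value & bit]


for value in (15, 14):
    print(value, decode_drm_debug(value))
```

Under that assumption, drm.debug=14 keeps the driver, kms and prime messages and drops the core ones, which should make the journal less noisy.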
Comment 3 Zlatko Calusic 2014-12-26 12:52:15 UTC
Created attachment 111369
Revert-drm-nouveau-fifo-g84-ack-non-stall-interrupt

Also G94 here, and the same bug. It works for a while, then the GPU locks up with "still angry after 101 spins". I bisected, and eventually commit 19a10828814aa (drm/nouveau/fifo/g84-: ack non-stall interrupt before handling it) came out as the culprit. It's the last nouveau commit before 3.18 was released.

If you can compile your own kernel, I'm attaching a patch that reverts that commit. It fixes the issue for me completely; no more lockups.
Comment 4 Adam Williamson 2014-12-26 16:53:16 UTC
awesome - I love it when someone else does the hard work. I'll build with that revert and confirm if I can, thanks! (it doesn't happen very often to me, so a bit hard to confirm a negative, but I'll do my best).
Comment 5 Zlatko Calusic 2014-12-26 19:23:10 UTC
Ha ha ha, you made me laugh with the "hard to confirm a negative" comment, been there - done that. :) This time I got a bit lucky, because I ran 3.18-rc-something the whole time (fixing some other kernel bugs I reported), and when 3.18 got released and suddenly became unstable there were only about a dozen nouveau commits to bisect. It was quite easy to find the faulty one.

Also, I was getting lockups pretty fast, about 6-12 hours after reboot, before I reverted that commit. And now my system has been rock solid for 2 weeks, so I'm pretty confident the revert is the right fix.
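To illustrate why a dozen commits is a small search space, here is a minimal sketch of the idea behind bisection. The commit names and the `is_bad` check are hypothetical stand-ins; in practice `git bisect` drives this, and each verdict means building and running a kernel long enough to trust it.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search an ordered commit list for the first bad commit.

    Assumes every commit before the culprit is good, every commit from the
    culprit onward is bad, and the last commit is known to be bad.
    Returns (commit, number_of_tests).
    """
    low, high = 0, len(commits) - 1
    tests = 0
    while low < high:
        mid = (low + high) // 2
        tests += 1
        if is_bad(commits[mid]):
            high = mid          # culprit is at mid or earlier
        else:
            low = mid + 1       # culprit is after mid
    return commits[low], tests


# Hypothetical history: twelve commits, with the last one as the culprit.
commits = ["c%02d" % number for number in range(1, 13)]
culprit = "c12"


def is_bad(commit):
    """Stand-in for 'boot this kernel and wait for a lockup'."""
    return commits.index(commit) >= commits.index(culprit)


print(first_bad_commit(commits, is_bad))
```

With twelve candidates this needs at most four test kernels; the slow part is that a "good" verdict for an intermittent lockup takes many hours of uptime.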
Comment 6 Adam Williamson 2015-01-07 01:44:37 UTC
So I didn't hit this for like two weeks (12-26 till today) but then I hit it twice today. I'm now running a kernel with the reversion applied, so I guess I'll have to see if that survives three weeks? :)
Comment 7 Maxim Britov 2015-01-20 08:01:55 UTC
Nouveau still angry in 3.19-rc5
Filed issue on kernel.org: https://bugzilla.kernel.org/show_bug.cgi?id=91581
Comment 8 Adam Williamson 2015-01-20 17:04:12 UTC
I've been up on the patched kernel for 2 weeks now without seeing the issue.
Comment 9 Ben Skeggs 2015-01-27 05:54:34 UTC
It'd be worth giving this patch a try:

http://cgit.freedesktop.org/~darktama/nouveau/commit/?id=c869d99187a356b886bdecc757caa0038d142844
Comment 10 Zlatko Calusic 2015-01-27 08:09:23 UTC
Created attachment 112875
fifo-nv04-remove-the-loop-from-the-interrupt-handler

The patch would not apply cleanly to 3.19.0-rc6+, so I modified it and attached the version that I currently have applied and am testing. I'd give it at least 2 days before we can be reasonably sure it fixes the issue.
Comment 11 Zlatko Calusic 2015-01-29 20:16:03 UTC
Good news! 2 days and 13 hours later, no lockups, all fine. I'm pretty confident that this patch resolves the issue. It would be great if it could be applied before 3.19 stable comes out.
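As a rough back-of-the-envelope check on that confidence, the sketch below assumes lockups arrive at random with a constant average rate (an exponential model, which is my simplification and not something measured here) and asks how likely 2 days and 13 hours of clean uptime would be if the bug were still present.

```python
import math


def chance_of_surviving(hours_up, mean_hours_between_lockups):
    """P(no lockup within hours_up) under an assumed exponential failure model."""
    return math.exp(-hours_up / mean_hours_between_lockups)


hours_up = 2 * 24 + 13   # "2 days and 13 hours"
for mean in (6, 12):     # lockups used to show up about 6-12 hours after reboot
    probability = chance_of_surviving(hours_up, mean)
    print(f"mean {mean} h between lockups: {probability:.4%}")
```

Even with the more forgiving 12-hour figure, the chance of getting this far by luck comes out well under 1% in this model.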
Comment 12 Adam Williamson 2015-02-11 21:21:28 UTC
So Ben told me on IRC he'd committed a change that should address this, but I guess it didn't make it in time for 3.19 because I just hit it with 3.19 stable :( Ben, any chance it could be backported for 3.19 updates? It's a pretty icky bug.
Comment 13 Aleksander Morgado 2015-02-18 13:30:55 UTC
(In reply to Adam Williamson from comment #12)
> So Ben told me on IRC he'd committed a change that should address this, but
> I guess it didn't make it in time for 3.19 because I just hit it with 3.19
> stable :( Ben, any chance it could be backported for 3.19 updates? It's a
> pretty icky bug.

If it's not yet pulled by Linus, what's the kernel git tree where that was submitted?
Comment 14 Aleksander Morgado 2015-02-18 15:52:23 UTC
Just for reference, I cherry-picked that fix for my 3.18.6 and I no longer get the X stuck. Hoping it hits the stables soon...
Comment 15 K1 2015-02-19 22:57:22 UTC
I have the same problem and it is usually triggered within minutes of using PyCharm IDE. When I work with Eclipse and other things it seems fine but as soon as I have PyCharm running (even in the background while I am not working with it) this happens. I am using 3.18.7-200.fc21.x86_64.
Comment 16 Arif Saleem 2015-02-20 10:39:01 UTC
Hi All,
We're having this issue on a whole bunch of F21 x86_64 machines. I took the 3.18.7-200 F21 source rpm, and applied Zlatko's patch from Comment 10, and rebuilt the kernel. The important rpms are here :
http://arif.easyss.net/stuff/kernel-3.18.7-nouveau/

I haven't actually rebooted any of the machines onto this kernel yet, but this might help some people.
Comment 17 Zlatko Calusic 2015-02-20 12:32:57 UTC
JFTR, I've been running with the same patch applied for 3 weeks now and the system is rock solid. If you use 3.18 or 3.19 and experience lockups, the patch will fix it. Hopefully it finds its way to 3.20^H^H4.0 soon. :P
Comment 18 Adam Williamson 2015-02-20 17:19:56 UTC
The patch is in Ben's personal tree already:

http://cgit.freedesktop.org/~darktama/nouveau/commit/?id=c869d99187a356b886bdecc757caa0038d142844

but hasn't yet made its way from there into the nouveau tree, from which it'd get merged into Linus's:

http://cgit.freedesktop.org/nouveau/linux-2.6/log/?h=linux-3.20

I have, however, prevailed on the Fedora kernel maintainers to put it in at least the F22/Rawhide kernels; the builds of both that are running as I type this will include it. I'm hoping we can get it backported from there to 21 as well.
Comment 19 Pierre Moreau 2015-03-11 19:59:56 UTC
*** Bug 88822 has been marked as a duplicate of this bug. ***
Comment 20 aebenjam 2015-03-18 19:24:57 UTC
This sounds horribly familiar.  Seeing the same behaviour after I hit the 3.18.x kernels with Fedora 20.  I'm using a "NVIDIA Corporation G86 [Quadro NVS 290] (rev a1)".  Not sure if the additional card information helps or if the bug and fix are definitely the one posted.  Here's hoping it's brought into the mainline kernel soon... (but I will likely try recompiling my own kernel with the offered patch sometime soon.)
Comment 21 Adam Williamson 2015-03-18 19:29:00 UTC
The fix is already in mainline, just hasn't been backported.
Comment 22 Sergio Pascual 2015-03-25 16:21:01 UTC
(In reply to Adam Williamson from comment #21)
> The fix is already in mainline, just hasn't been backported.

Is the fix going to appear in F21 any time soon?

