Created attachment 90035 [details]
Frequent GPU lockups, happening anywhere from 10 minutes to 2 hours after booting. I haven't noticed any common cause when it happens. Occasionally I can get out to another tty, but often the system is completely unresponsive. Some system information:
Linux jordans-pc 3.11.9-200.fc19.x86_64 #1 SMP Wed Nov 20 21:22:24 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Attached log of messages from Nouveau from time of boot until lockup.
Created attachment 90036 [details]
Created attachment 90037 [details]
Sometimes, shortly before the lockup, graphics will become highly corrupted.
I have this problem on a GeForce GTX 660 (NVE6, GK106), too. This problem only occurs when desktop effects are turned on in KDE SC. Therefore I think this problem is 3D related, or maybe DRM. Steam (32-bit) is sometimes at fault, too.
Sometimes, before the graphics become unresponsive, some parts of the graphical UI only show flickering white and black rectangles. And sometimes the computer (or perhaps only the graphics) becomes very slow.
After a lockup, I can sometimes use my keyboard to issue commands. Sometimes my computer is completely unresponsive.
I use Arch Linux (x86_64) with the newest Nouveau releases, and this bug has existed for as long as I have used Nouveau (3-4 months).
My personal guess, based on roughly 0 real information, is that the graph firmware is "wrong". Could one of you try to extract the graph firmware from the blob and use it with nouveau to see if it improves things?
(You don't need the video firmware bits.)
Don't forget to add nouveau.config=NvGrUseFw=1 in order for nouveau to actually load the external firmware.
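A quick way to confirm that the option actually reached the kernel is to look at the command line it booted with. The helper below is only a sketch: the function name is made up for illustration, and it matches case-insensitively on the assumption that either spelling works, since both "NvGrUseFw" and "NvGrUseFW" appear in this thread.

```shell
# Hypothetical helper: return 0 if the kernel command line passed as $1
# asks nouveau to load external PGRAPH firmware.
# Assumption: the option name is matched case-insensitively, because both
# "NvGrUseFw" and "NvGrUseFW" are used in this thread.
has_external_fw_opt() {
    printf '%s\n' "$1" | grep -qi 'nouveau\.config=[^[:space:]]*NvGrUseFW=1'
}

# Usage on a running system:
#   has_external_fw_opt "$(cat /proc/cmdline)" && echo "option is set"
```

If the check fails after a reboot, the boot loader configuration most likely was not regenerated or a different boot entry was used.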
Created attachment 90048 [details]
nvidia installer log
I had to build a custom kernel to enable mmiotrace, and I'm having trouble building the nvidia module on that kernel so I can run the trace. I'm using the installer from the nvidia website, version 331.17.
331.20 should have support for all recent kernels.
Okay, that driver worked and I got a trace. Now I'm trying to boot my regular kernel with nouveau, but it fails to start X. After the splash screen it claims there is an error it can't recover from. Upon killing X I see it's complaining about GLX being missing.
you need to switch back to mesa's opengl impl (including, but not limited to, glx)
I'm not sure how I would do such a thing?
(In reply to comment #10)
> I'm not sure how I would do such a thing?
Nevermind, sorted it. The trace seems useless though, none of the values in the "NVC0 Firmware" link appear.
Upload the trace somewhere?
Created attachment 90061 [details]
dmesg nouveau log
Okay I took another one, this time doing some more stuff I thought could have a chance of triggering it. This one came out as expected. The downside is loading it causes nouveau to fail. Attached the log from dmesg with the relevant nouveau bits.
The files need to be in /lib/firmware/nouveau and need to be called nve6_fuc409c (and so on for the other ones) -- is that what you did? If so, is /lib/firmware/nouveau available when the nouveau driver is being loaded? If nouveau is being loaded off an initrd, make sure that the firmware files are in the initrd as well.
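The file check described above can be sketched as a small shell function. The function name is hypothetical, and the file names are passed in as arguments rather than hard-coded, because only nve6_fuc409c is spelled out here.

```shell
# Hypothetical helper: print every firmware file name (arguments 2..n) that
# is missing or empty in the directory given as argument 1.
# Returns 0 only if all of the named files are present and non-empty.
missing_firmware() {
    dir=$1
    shift
    rc=0
    for name in "$@"; do
        if [ ! -s "$dir/$name" ]; then
            printf '%s\n' "$name"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on the running root filesystem (add the other file names as needed):
#   missing_firmware /lib/firmware/nouveau nve6_fuc409c
#
# To see whether the files also made it into the initrd, the listing tools
# shipped by the distribution can be used, for example:
#   lsinitramfs "/boot/initrd.img-$(uname -r)" | grep nouveau/   # Debian/Ubuntu
#   lsinitrd | grep nouveau/                                     # dracut-based
```

An empty result from the first command together with a non-empty initrd listing suggests the files are where nouveau will look for them at load time.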
I had to make a new initramfs with the files, booted correctly this time. Will report back if it cures the lockups or not.
5 hours now without a lockup, not even a peep in dmesg from nouveau. I would be locking up every 30 minutes to an hour on average before. What now?
Sit back and enjoy the lockup-free graphics?
You might also upload a copy of your vbios (/sys/kernel/debug/dri/0/vbios.rom) as well as the files you extracted. Perhaps a clue will lie there.
Created attachment 90180 [details]
firmware and vbios
I suspected I had the same problem and tried to use the original firmware, but failed to do so. I did not try to extract my "own" firmware but tried to use the firmware from the zip file that was attached by Jordan, as we have the same card (both GTX 660).
But if I boot with the nouveau.config="NvGrUseFw=1" kernel option, the kernel stops right at the attempt to load the nouveau fb driver.
I must mention that I use UEFI boot and my kernel resides on the EFI partition (/dev/sda1) at /boot/EFI/Boot/gentoo/bzImage-3.12.5.efi (the mount point is /boot). On the other hand, my firmware is at /lib/firmware/nouveau/ on partition /dev/sda5, which is mounted at /.
Let me guess: I need an initrd file in order to make it work? :-( Then I must figure out how to make this work together with UEFI boot. Until now, I tried to avoid initrd and UEFI.
Whether the firmware needs to be in the initrd or not depends on how you have it set up. It needs to be there when the nouveau code is initialized (if you have a module, that means when the module is loaded; if it's built-in, then it needs to be added to the kernel image itself with the CONFIG_EXTRA_FIRMWARE option).
e.g. the way I set up my initrd is that it does next to nothing, just asks for a password, decrypts the partition, mounts it, and swaps it in as the 'new' root. I never need to touch it (unless I want to make changes to the decryption logic). Modules are loaded with my regular '/' in place.
Most distros prefer the complex route and have their initrds load everything, which in turn means that the firmware needs to be in the initrd, and the initrd needs to be updated for every kernel. [Aside: This makes sense when you're creating something that must work on every hardware combination ever imagined (and not), whereby you don't want to build every driver in, but you do want to support various esoteric devices that may be required for booting, like disk or network. And once you do that, might as well do everything there. But it's very rare that I solder some crazy raid controller into my laptop, so it doesn't really make sense for more tailored setups.]
I went the easy way now. I built nouveau as a module that is loaded later and I see
[ 5.323010] nouveau [ PGRAPH][0000:01:00.0] using external firmware
in my dmesg output :-) I will stay with the "nouveau as a module" solution until I am sure that the external firmware solves the problem. Then I can still figure out how to make UEFI and initrd work together.
I will come back and report my results, but this may take some time, because the crash only occurs once in a while.
As announced in my previous post, I wanted to inform you about my findings. The GPU lockups do not occur with the NVIDIA binary PGRAPH fw.
*** Bug 69882 has been marked as a duplicate of this bug. ***
I suffered from the same bug. I was able to fix it with some help from users on IRC as well as the files presented here. Figured I'd provide explicit instructions for others. (I'm using Ubuntu 13.10).
1. First make sure you have an NVE6 chipset
$ dmesg | grep nouveau | grep Chipset
You should see something that looks like:
[ 2.318701] nouveau [ DEVICE][0000:01:00.0] Chipset: GK106 (NVE6)
2. Download and move the files from Jordan Bass (thanks!) in comment 19
sudo mkdir /lib/firmware/nouveau
sudo cp /path/to/extractedfiles/* /lib/firmware/nouveau
3. Update initramfs
sudo update-initramfs -c -k <YOUR_KERNEL>
4. Edit the GRUB configuration
sudo nano /etc/default/grub
Add nouveau.config=NvGrUseFW=1 to GRUB_CMDLINE_LINUX_DEFAULT so that it looks like:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nouveau.config=NvGrUseFW=1"
5. Update grub
sudo update-grub
6. Restart and verify
sudo shutdown -r now
$ dmesg | grep external
[ 2.484773] nouveau [ PGRAPH][0000:01:00.0] using external firmware
If you see PGRAPH using external firmware, you're done.
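The two dmesg checks in the steps above (step 1 and step 6) can also be scripted. The sketch below assumes the log format quoted in this thread for 3.x kernels; the function names are made up for illustration, and newer kernels print nouveau messages in a different format that these patterns will not match.

```shell
# Both helpers read kernel log text on stdin, so they can be fed from
# `dmesg` or from a saved log file. Function names are hypothetical.

# Print the NVxx chipset code from a "Chipset: GK106 (NVE6)" style line.
nouveau_chipset() {
    sed -n 's/.*nouveau .*Chipset: .*(\(NV[0-9A-F]*\)).*/\1/p'
}

# Return 0 if PGRAPH reported that it is using external firmware.
uses_external_firmware() {
    grep -q 'nouveau .*PGRAPH.*using external firmware'
}

# Usage on a running system:
#   dmesg | nouveau_chipset
#   dmesg | uses_external_firmware && echo "blob firmware is in use"
```

Reading from stdin keeps the helpers usable on logs attached to a bug report as well as on a live system.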
It's been about 5 hours since I rebooted using the blob firmware. No problems to report, it seems to be a work-around for this bug.
Thank you previous commenters!
For me, using the firmware attached is a bit hit-and-miss. Sometimes it works fine, but most of the time I get a lockup just as soon as KDE starts, while the KDE boot animation is playing. The logs have nothing unusual in them, the only line that shows something is this in dmesg:
nouveau E[ DRM] GPU lockup - switching to software fbcon
The Xorg log shows that everything is fine, and there are no further messages. I can switch to another console or kill the X server when that happens (but logging in again results in a lockup). Pretty odd.
I'll try extracting the firmware myself (although I do use a GTX 660 as well) and see if that helps.
Now I tried extracting the firmware myself, but the dump doesn't contain the register addresses that the firmware page states it should. But I think I followed all the steps correctly (start mmiotrace, cat into dump file, modprobe nvidia, Xorg, DISPLAY=":0" xterm, kill everything and stop mmiotrace), so this is rather puzzling. The dump itself is 23 MB in size, and through that xterm I also checked glxinfo to confirm that the NVIDIA drivers were running (they were, including direct rendering).
With some help I extracted the firmware myself, but the result is the same as before, GPU still hangs often, especially when OpenGL compositing is on.
Update: GPU locks up still, but less frequently than before I used the blob. Happens about once every 24-48 hours of uptime.
*** Bug 63165 has been marked as a duplicate of this bug. ***
Is this still an issue with 3.17 or later? i.e. are any issues fixed by using proprietary ctxsw firmware over the nouveau-provided one?
Still the same. I had been using the proprietary PGRAPH firmware until now. Because of your last comment, I gave the nouveau-provided one another try. A few seconds ago I had my first GPU lockup again. Hence, I have to stick with the proprietary PGRAPH firmware.
(In reply to Matthias Nagel from comment #33)
> Still the same. I had been using the proprietary PGRAPH firmware until now.
> Because of your last comment, I gave the nouveau-provided one another try.
> A few seconds ago I had my first GPU lockup again. Hence, I have to stick
> with the proprietary PGRAPH firmware.
Please confirm which kernel version you had this experience with?
Sorry, I am running 3.16.5. This is the latest stable kernel for Gentoo. This kernel still has the bug: proprietary fw works fine, nouveau-provided fw causes random GPU lockups. (Mostly while scrolling in firefox.)
(In reply to Matthias Nagel from comment #35)
> Sorry, I am running 3.16.5. This is the latest stable kernel for Gentoo.
> This kernel still has the bug: proprietary fw works fine, nouveau-provided
> fw causes random GPU lockups. (Mostly while scrolling in firefox.)
The fixes I had in mind only came into 3.17. Specifically 3d9e3921f4d77bcaeea913c48b894d1208f0cb06 and one or two other ones.
Sorry, my fault. I am compiling 3.17.4 right now (from Gentoo unstable branch). Give me some days, because the lockup is random. I will come back and report.
And I am back. I had three GPU lockups during the last 30 minutes. The first occurred while browsing the internet with Firefox, the second happened during video playback with KPlayer, and the last happened when I tried to recompile my kernel (3.17.4) with the nvidia fw included. There was only an X terminal running that showed the build output when nouveau crashed. So yes, the bug still exists in 3.17.4. This is my dmesg output in reverse order:
Dec 01 21:10:22 matthias-pc kernel: nouveau E[ PFIFO][0000:01:00.0] read fault at 0x0000013000 [PTE] from PBDMA0/HOST_CPU on channel 0x023f7b0000 [unknown]
Dec 01 21:10:22 matthias-pc kernel: nouveau E[kwin] failed to idle channel 0xcccc0000 [kwin]
Dec 01 21:10:07 matthias-pc kernel: nouveau E[kwin] failed to idle channel 0xcccc0000 [kwin]
Dec 01 21:09:52 matthias-pc org.kde.kuiserver: kuiserver: Fatal IO error: client killed
Dec 01 21:09:52 matthias-pc kdm: X server for display :0 terminated unexpectedly
Dec 01 21:09:52 matthias-pc kernel: nouveau E[ X] failed to idle channel 0xcccc0000 [X]
Dec 01 21:09:37 matthias-pc kernel: nouveau E[ X] failed to idle channel 0xcccc0000 [X]
Dec 01 21:09:22 matthias-pc kernel: nouveau E[ DRM] GPU lockup - switching to software fbcon
Dec 01 21:07:40 matthias-pc kernel: nouveau E[ PIBUS][0000:01:00.0] GPC0: 0x504610 0x00000402 (0x0502020b)
Dec 01 21:07:40 matthias-pc kernel: nouveau E[ PFIFO][0000:01:00.0] PGRAPH engine fault on channel 2, recovering...
Dec 01 21:07:40 matthias-pc kernel: nouveau E[ PFIFO][0000:01:00.0] write fault at 0x00002d2000 [PTE] from GR/GPC2/GPCCS on channel 0x023f94f000 [X]
I also have these random GPU lockups on my GT660.
I tried to get the proprietary PGRAPH firmware working, but without success.
I'm getting an error at boot:
Failed to loader firmware with error -2
Fallback to user helper
I'm using UEFI on Debian jessie.
I've tried the howto in this forum.
@Cedric: Well, this means that the kernel could not find the firmware. There are several ways to do it, and it depends on whether you have an initramfs or not, whether the kernel has access to /lib/firmware when it tries to load the fw, whether the fw is directly included in the kernel or not, whether you use an additional boot manager, etc. With UEFI: does your UEFI directly load the Linux kernel, or is there a boot manager in between? For example, I use "rEFInd" as boot manager, but I guess Debian uses grub2 by default.
My setup is the following:
(a) Boot manager "rEFInd"
(b) No initramfs
(c) FW directly included in kernel. This requires "CONFIG_FIRMWARE_IN_KERNEL=y" and "CONFIG_EXTRA_FIRMWARE=\"nouveau/...\"" to be set.
In my experience, putting the fw directly into the kernel is the easiest way, because one does not need to bother with correct paths and mount points during boot. Unfortunately, I cannot give you more detailed advice, because I turned my back on Debian a long time ago.
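For a setup like (c), one way to double-check the kernel configuration before building is to grep the config file for a nouveau entry in CONFIG_EXTRA_FIRMWARE. This is only a sketch: the function name and the example path are assumptions, and the list of firmware files itself is left to whatever was extracted.

```shell
# Hypothetical helper: return 0 if the kernel config file given as $1 lists
# at least one nouveau/ firmware file in CONFIG_EXTRA_FIRMWARE.
# CONFIG_EXTRA_FIRMWARE holds a space-separated list of paths, so the
# nouveau/ entry may be preceded by other entries.
has_builtin_nouveau_fw() {
    grep -Eq '^CONFIG_EXTRA_FIRMWARE="([^"]* )?nouveau/[^"]*"' "$1"
}

# Usage (the path is an assumption, adjust to where the kernel source lives):
#   has_builtin_nouveau_fw /usr/src/linux/.config && echo "firmware will be built in"
```

The paths in CONFIG_EXTRA_FIRMWARE are resolved relative to the directory named by CONFIG_EXTRA_FIRMWARE_DIR, so that option is worth checking too if the build cannot find the files.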
Hi, i'm back again
I found the solution here for ubuntu and debian (it's working for BIOS and UEFI)
Could people who still had issues with 3.17 retest with 4.0+? There was another fix in commit 404ba3f79089a01c1ebacccafa08a5db4a4cd2af which among other things fixed an issue on GK110.
I am definitely having this same bug all the way up to kernel version 4.4
Was using proprietary drivers for this reason on Fedora 22, and since the release of 23 I've been attempting diligently to only use nouveau.
With firmware that I extracted from 340 it will run for a while so long as nothing GL intensive is done. One way I can always reproduce the issue is to load Kodi, play a video, then stop the video and just let it sit open.
VDPAU seems to be an aggravating factor here, as that nearly always crashes the driver. Even after removing the firmware from /lib/firmware, though, the problem still happens (even if your player doesn't attempt to use VDPAU).
The following message is from the latest 4.4 kernel in rawhide, running on top of Fedora 23:
[ 288.888692] nouveau 0000:01:00.0: gr: TRAP ch 5 [023f929000 kodi.bin]
[ 288.888704] nouveau 0000:01:00.0: gr: GPC1/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 000d [GPR_OUT_OF_BOUNDS]
[ 288.888710] nouveau 0000:01:00.0: gr: GPC1/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 4000d [GPR_OUT_OF_BOUNDS]
[ 288.888726] nouveau 0000:01:00.0: fifo: write fault at 0003846000 engine 00 [GR] client 0f [GPC1/PROP_0] reason 02 [PTE] on channel 5 [023f929000 kodi.bin]
[ 288.888728] nouveau 0000:01:00.0: fifo: gr engine fault on channel 5, recovering...
While that might not seem related, remember that my initial issues were identical to this aging bug. Without proprietary firmware loading, it usually is some kind of PGRAPH error, with it loaded it is usually some kind of PFIFO error; they vary. The consistent part is that it _always_ crashes with nouveau.
I have a GTX660SC from EVGA, and I would look more closely at EVGA cards if I were you. My cursory scouring of the internet to try and fix this has led me to believe that I'm seeing EVGA cards more often than not, and I'm not sure anyone has made that connection yet. Also, the (SC) stands for super-clock, as in factory overclocked. I suspect this may have something to do with it.
The version of the nouveau module is as follows. I also upgraded 'mesa-*', 'linux-firmware' and 'xorg-x11-drv-nouveau' to the latest rawhide versions too, thinking that might have some additional benefits, but the same issues occur. This is simply NOT fixed in the most recent version of code anywhere.
license: GPL and additional rights
description: nVidia Riva/TNT/GeForce/Quadro/Tesla
author: Nouveau Project
vermagic: 4.4.0-0.rc0.git2.1.fc24.x86_64 SMP mod_unload
mesa versions tried: up to 1.11.1-devel
I will also be attaching my vbios.rom.
Created attachment 119468 [details]
EVGA GTX-660SC vbios.rom
Mine's Gainward and not OC.
(In reply to xenith from comment #43)
> I am definitely having this same bug all the way up to kernel version 4.4
If you're definitely having this same bug, that means that all your problems go away when using blob pgraph context switching firmware. Yes? [Totally unrelated to video decoding fw btw.]
I'm inclined to believe that you've incorrectly understood this bug, and misdiagnosed a kodi + nouveau issue (which is well known to me... use mplayer, works great) as an obscure context switching bug that we fixed long ago.
As per Comment 29, it's not always solved by using the blob firmware. And for me it always locks up, by just idling on the desktop or browsing the web, not playing videos.
[ 877.299847] kactivitymanage: segfault at 7fbe3fdcb7d0 ip 00007fbe4c073731 sp 00007ffec63a7c48 error 4 in libQt5Sql.so.5.5.0[7fbe4c05f000+3f000]
[ 1081.974756] kactivitymanage: segfault at 7f374fb897d0 ip 00007f374fdd2731 sp 00007fff901deeb8 error 4 in libQt5Sql.so.5.5.0[7f374fdbe000+3f000]
[ 1181.195139] nouveau 0000:01:00.0: Direct firmware load for nouveau/nve6_fuc084 failed with error -2
[ 1181.195196] nouveau 0000:01:00.0: Direct firmware load for nouveau/nve6_fuc084d failed with error -2
[ 1181.195200] nouveau 0000:01:00.0: msvld: unable to load firmware data
[ 1181.195203] nouveau 0000:01:00.0: msvld: init failed, -19
[ 1261.448630] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[ 1261.448642] nouveau 0000:01:00.0: fifo: sw engine fault on channel 7, recovering...
[ 1263.448760] nouveau 0000:01:00.0: fifo: runlist 0 update timeout
[ 1265.744229] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[ 1270.039568] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[ 1274.334908] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[ 1278.630248] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[ 1282.925589] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
[ 1287.220929] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
(In reply to Dainius Masiliūnas from comment #47)
> As per Comment 29, it's not always solved by using the blob firmware. And
> for me it always locks up, by just idling on the desktop or browsing the
> web, not playing videos.
So then you don't have this problem either. Look, there are like 1000000 things that can go wrong. This is one of them. If you have one of the other 999999, file a new bug.
>If you're definitely having this same bug, that means that all your problems go away when using blob pgraph context switching firmware. Yes? [Totally unrelated to video decoding fw btw
They used to, but not anymore. The bug has evolved.
(In reply to xenith from comment #50)
> >If you're definitely having this same bug, that means that all your problems go away when using blob pgraph context switching firmware. Yes? [Totally unrelated to video decoding fw btw
> They used to, but not anymore. The bug has evolved.
So then there's some kernel where it all worked fine for you with blob ctxsw firmware? Based on your paste, you don't even *have* the ctxsw firmware, so I'm a little doubtful.
Read the subject of this bug -- does it apply to you? If so, stay here. If not, file a new bug.
(In reply to Ilia Mirkin from comment #49)
> So then you don't have this problem either. Look, there are like 1000000
> things that can go wrong. This is one of them. If you have one of the other
> 999999, file a new bug.
I did. You closed it as duplicate of this one. So feel free to reopen bug #69882 then.
I've been screwing around with it all day, I currently don't have the firmware, no. I did about an hour ago. I'm in the process of extracting it again
I'm trying to tell you that I don't think this title applies to anyone anymore. I don't think the firmware will fix all stability issues for anyone; especially KDE Plasma 5 users like myself.
(In reply to xenith from comment #54)
> I'm trying to tell you that I don't think this title applies to anyone
> anymore. I don't think the firmware will fix all stability issues for
> anyone; especially KDE Plasma 5 users like myself.
Great! That means this bug is fixed/gone?
OK, this bug has grown into an unmanageable monster. I believe the original issue has been finally fixed in kernel 4.1. Our ctxsw is now fairly reliable. Any additional issues you might have should be filed as new bugs. Even if you can demonstrate that your issue is fixable by using blob ctxsw, I don't care -- file a new bug. This one has been corrupted by people saying "me too".