Created attachment 86731 [details]
Sometimes the X server crashes due to a GPU lockup, caused by a page fault. It happens seemingly randomly, at irregular intervals (sometimes it takes several hours, sometimes it crashes in half an hour).
Before that happens, I see a small amount of corruption (noise around the cursor), then everything but the mouse hangs. After a while, the mouse also hangs, the screen becomes black with a "_" symbol in the upper right corner of the screen (but the mouse is still displayed), and after some more time the whole screen becomes corrupt in vertical blocks. If I press Ctrl+Alt+F1 fast enough, I can switch out of X and use the console for a while, otherwise the whole PC hangs and I need to do a hard reboot.
This issue may or may not be related to bug #69029 (the symptoms seem similar, but the errors are different).
I am using a GeForce 660 card on openSUSE 13.1 x86_64 Beta. I also reported the bug downstream.
Created attachment 86732 [details]
Xorg crash log
Attached the Xorg crash log. It seems to be fairly consistent during different crash instances.
Created attachment 86733 [details]
Second kernel log
Attached two kernel logs. The first one happened at the same time as the attached Xorg crash log (if the timestamps are important). The second log is /dev/kmsg during another crash instance, which seems to have caused different errors, but the same outcome.
Created attachment 86735 [details]
Third kernel log
Attached another kernel log. It seems it has elements from both the previous logs.
What version of mesa are you using? Could you try with mesa-git?
Mesa 9.2.0. And I suppose I can try the git version, although I've never tried that before, so I'm not entirely sure if I can get everything working correctly.
Tried the git version of Mesa; the issue is still there, it just triggers less often.
However, I found a reliable way to reproduce the problem, on both 9.2 and git versions of Mesa. On KDE 4.11, setting the KWin compositing method to OpenGL 3.1 causes a lockup every time. With XRender I don't seem to hit this issue at all, and I think on OpenGL 2.0 the lockups happen randomly (but I need to do some more testing to make sure).
Actually, I think the lockups on KWin switch were induced by some openSUSE update. After another update, I could no longer reproduce that behaviour, and it's back to random lockups at any given time, no matter the compositing settings. Though it might still be notable that this issue can also be induced by certain bugs elsewhere in the system.
I have the same problem on Gentoo with the following software components
with a GTX660 card. But it also sounds very similar to bug #72180.
Created attachment 90899 [details]
Kernel log on gentoo 3.12.5
Created attachment 90900 [details]
lspci on gentoo 3.12.5
One quick way to check if you have the same problem as bug 72180 is to use the blob fw. If that works, then you have the same issue. I guess I didn't make the connection originally...
I tried to use the blob firmware, but failed to do so. See my comment at bug #72180 for more.
*** This bug has been marked as a duplicate of bug 72180 ***
Reopened as per bug #72180 suggestions.
To make it clear, this is about random GPU lockups of GTX 660 (mine's Gainward), where using PGRAPH firmware from the blob does not fix the issue.
Interestingly enough, looks like there is an equivalent (albeit also messy) bug opened for Fedora (see See Also), and it appears to be a race condition. So trying the patch in that bug might be a good idea. Alternatively they suggest booting with nouveau.noaccel=1. I'll see if I can test this.
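When testing with nouveau.noaccel=1, it helps to confirm the parameter actually reached the running kernel. A minimal sketch, assuming the command line is read from /proc/cmdline; the helper name `kernel_param` and the sample command line are illustrative only, and quoted values containing spaces are not handled:

```python
def kernel_param(cmdline, name):
    """Return the value of `name` in a kernel command line string.

    Returns None if the parameter is absent and "" if it is present
    without a value. If it appears more than once, the last one wins.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 nouveau.noaccel=1 quiet"
noaccel = kernel_param(sample, "nouveau.noaccel")
```

On a live system the string would come from `open("/proc/cmdline").read()` instead of the sample.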
Please refresh this issue with new information. Make sure you're using at least kernel 4.3 and Mesa 11.0.4. Both have had important fixes which may affect your situation.
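The version floor requested here can be checked mechanically. A minimal sketch, assuming version strings begin with dotted numbers and may carry a distribution suffix such as 4.3.0-1-default; the function names are illustrative and not part of any existing tool:

```python
import re


def version_tuple(text):
    """Parse the leading dotted-number part of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number at start of {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """Return True if `found` is at least `minimum`.

    Shorter versions are padded with zeros, so "4.3" equals "4.3.0".
    """
    a, b = version_tuple(found), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

For example, `meets_minimum("4.3.0-1-default", "4.3")` is true, while `meets_minimum("9.2.0", "11.0.4")` is false.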
Right. I retested now with both kernel 4.3 and Mesa 11.0.4, and... well, it locks up, but with the kernel warning "../include/drm/drm_crtc.h:1577 drm_helper_choose_encoder_dpms" which seems to point to http://lists.freedesktop.org/archives/dri-devel/2015-September/091091.html and isn't actually a nouveau issue.
This prevents me from testing for the nouveau issue until the kernel gets fixed...
Created attachment 119484 [details]
Journal (fifo read fault and drm_crtc.h)
Reading a bit more into the kernel log, I see that the drm_crtc.h warning might have been triggered by nouveau after all, because above that I have:
nouveau 0000:01:00.0: fifo: read fault at 6ff792f000 engine 07 [PBDMA0] client 06 [HOST] reason 00 [PDE] on channel 31 [023e0c9000 xembedsniproxy]
nouveau 0000:01:00.0: fifo: fifo engine fault on channel 31, recovering...
------------[ cut here ]------------
WARNING: CPU: 0 PID: 4 at ../drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.h:73 gk104_fifo_recover_work+0x22a/0x290 [nouveau]()
Attached the systemd journal of this. The warning above is at line 1834. The Xorg.0.log file does not have any errors or warnings at all.
I'm not sure whether this should be yet another bug report.
Testing it a few more times, it is indeed the read fault by nouveau that's causing the lockup in this case. The general DRM error does not appear during all boots, but the nouveau read fault does. When waiting around for a long time, the kernel log also has this:
INFO: task kworker/0:4:956 blocked for more than 480 seconds.
Tainted: G W O 4.3.0-1-default #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/0:4 D 0000000000000000 0 956 2 0x00000080
Workqueue: events gk104_fifo_recover_work [nouveau]
ffff8800d9d8bbc8 0000000000000046 ffff8801fd2b2080 ffff880214b0e040
ffff8800d9d8c000 ffff8800d9d8bd18 ffff8800d9d8bd10 ffff880214b0e040
ffff8802142d8810 ffff8800d9d8bbe0 ffffffff8166a1aa 7fffffffffffffff
[<ffffffffa02b79dd>] gk104_fifo_fini+0x1d/0x50 [nouveau]
[<ffffffffa02b443c>] nvkm_fifo_fini+0x1c/0x30 [nouveau]
[<ffffffffa02546a0>] nvkm_engine_fini+0x20/0x30 [nouveau]
[<ffffffffa0258511>] nvkm_subdev_fini+0x61/0x1e0 [nouveau]
[<ffffffffa02b8d3b>] gk104_fifo_recover_work+0xeb/0x290 [nouveau]
DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
Leftover inexact backtrace:
[<ffffffff81086bb0>] ? kthread_worker_fn+0x170/0x170
I'm still not sure if this should be a separate bug report.
Is xembedsniproxy always involved in the crash? If so, it might be worth running an mmt trace until it crashes and checking what it is actually doing.
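One way to answer that from saved kernel logs is to tally which process owns the faulting channel in each fifo fault line. A rough sketch, assuming the lines have the same shape as the read fault quoted earlier in this report; the regex and the function name are illustrative, and other kernel versions may print the fault differently:

```python
import re
from collections import Counter

# Assumed shape, taken from the fault line quoted in this report:
#   fifo: read fault at <addr> engine <id> [<name>] client <id> [<name>]
#   reason <id> [<name>] on channel <n> [<inst> <process>]
FAULT_RE = re.compile(
    r"fifo: (?P<kind>read|write) fault at (?P<addr>[0-9a-f]+) "
    r"engine (?P<engine_id>[0-9a-f]+) \[(?P<engine>[^\]]*)\] "
    r"client (?P<client_id>[0-9a-f]+) \[(?P<client>[^\]]*)\] "
    r"reason (?P<reason_id>[0-9a-f]+) \[(?P<reason>[^\]]*)\] "
    r"on channel (?P<channel>-?\d+) \[(?P<inst>[0-9a-f]+) (?P<process>[^\]]*)\]"
)


def tally_fault_processes(lines):
    """Count fifo faults per owning process name across log lines."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("process")] += 1
    return counts
```

Feeding it the lines of each saved journal (for instance `tally_fault_processes(open(path))`, with `path` pointing at a saved log) shows whether the same process name appears in every crash.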
Also having random lockups on a GTX 660 Ti (NVE4 according to glxinfo), since kernel 4.1 I guess, using DRI2.
[ 0.267666] nouveau 0000:02:00.0: NVIDIA GK104 (0e4030a2)
[ 0.378583] nouveau 0000:02:00.0: bios: version 80.04.4b.00.1a
[ 0.379302] nouveau 0000:02:00.0: fb: 2048 MiB GDDR5
Now on gentoo ~amd64 using:
Should I make a new entry for this card?
Just tried karolherbst's nouveau reclocking tree: https://github.com/karolherbst/nouveau/tree/stable_reclocking_kepler_v4
Using this module and reclocking to pstate 07 fixed the hangs I was having before. Maybe this fixes 660 hangs too.
Forget what I just said, the hangs still happen. Opened a new bug, #95031.
Still hangs on kernel 4.7.1. This time the journal didn't actually have anything in it concerning the hang... Very odd.
Also, when I set the driver to modesetting in xorg.conf.d, it seems to work without hanging (but on llvmpipe). So it seems to get triggered by something with regards to 3D...