Bug 69882 - [NVE6] GPU lockups
Summary: [NVE6] GPU lockups
Status: REOPENED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-09-27 16:33 UTC by Dainius Masiliūnas
Modified: 2016-08-27 18:41 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
Kernel log (83.25 KB, text/plain)
2013-09-27 16:33 UTC, Dainius Masiliūnas
Xorg crash log (57.15 KB, text/plain)
2013-09-27 16:35 UTC, Dainius Masiliūnas
Second kernel log (100.93 KB, text/plain)
2013-09-27 16:39 UTC, Dainius Masiliūnas
Third kernel log (86.33 KB, text/plain)
2013-09-27 17:12 UTC, Dainius Masiliūnas
Kernel log on gentoo 3.12.5 (1.16 MB, text/plain)
2013-12-17 20:57 UTC, Matthias Nagel
lspci on gentoo 3.12.5 (9.60 KB, text/plain)
2013-12-17 21:00 UTC, Matthias Nagel
Journal (fifo read fault and drm_crtc.h) (219.65 KB, text/plain)
2015-11-08 17:54 UTC, Dainius Masiliūnas

Description Dainius Masiliūnas 2013-09-27 16:33:45 UTC
Created attachment 86731 [details]
Kernel log

Sometimes the X server crashes due to a GPU lockup, caused by a page fault. It happens seemingly randomly, at irregular intervals (sometimes it takes several hours, sometimes it crashes in half an hour).

Before that happens, I see a small amount of corruption (noise around the cursor), then everything but the mouse hangs. After a while, the mouse also hangs, the screen becomes black with a "_" symbol in the upper right corner of the screen (but the mouse is still displayed), and after some more time the whole screen becomes corrupt in vertical blocks. If I press Ctrl+Alt+F1 fast enough, I can switch out of X and use the console for a while, otherwise the whole PC hangs and I need to do a hard reboot.

This issue may or may not be related to bug #69029 (the symptoms seem similar, but the errors are different).

I am using a GeForce 660 card on openSUSE 13.1 x86_64 Beta. I also reported the bug downstream.
Comment 1 Dainius Masiliūnas 2013-09-27 16:35:17 UTC
Created attachment 86732 [details]
Xorg crash log

Attached the Xorg crash log. It seems to be fairly consistent during different crash instances.
Comment 2 Dainius Masiliūnas 2013-09-27 16:39:04 UTC
Created attachment 86733 [details]
Second kernel log

Attached two kernel logs. The first one happened at the same time as the attached Xorg crash log (if the timestamps are important). The second log is /dev/kmsg during another crash instance, which seems to have caused different errors, but the same outcome.
Comment 3 Dainius Masiliūnas 2013-09-27 17:12:34 UTC
Created attachment 86735 [details]
Third kernel log

Attached another kernel log. It seems it has elements from both the previous logs.
Comment 4 Ilia Mirkin 2013-09-27 17:19:58 UTC
What version of mesa are you using? Could you try with mesa-git?
Comment 5 Dainius Masiliūnas 2013-09-27 17:29:16 UTC
Mesa 9.2.0. And I suppose I can try the git version, although I've never tried that before, so I'm not entirely sure if I can get everything working correctly.
Comment 6 Dainius Masiliūnas 2013-10-04 13:19:32 UTC
Tried the git version of Mesa, and the issue is still there; it just triggers less often.

However, I found a reliable way to reproduce the problem, on both 9.2 and git versions of Mesa. On KDE 4.11, setting the KWin compositing method to OpenGL 3.1 causes a lockup every time. With XRender I don't seem to hit this issue at all, and I think on OpenGL 2.0 the lockups happen randomly (but I need to do some more testing to make sure).
Comment 7 Dainius Masiliūnas 2013-10-20 19:02:47 UTC
Actually, I think the lockups on KWin switch were induced by some openSUSE update. After another update, I could no longer reproduce that behaviour, and it's back to random lockups at any given time, no matter the compositing settings. Though it might still be notable that this issue can also be induced by certain bugs elsewhere in the system.
Comment 8 Matthias Nagel 2013-12-17 20:55:30 UTC
I have the same problem on Gentoo with the following software components:

x11-base/xorg-x11-7.4-r2
sys-kernel/gentoo-sources-3.12.5
kde-base/kdelibs-4.11.2-r1

with a GTX660 card. But it also sounds very similar to bug #72180.
Comment 9 Matthias Nagel 2013-12-17 20:57:08 UTC
Created attachment 90899 [details]
Kernel log on gentoo 3.12.5
Comment 10 Matthias Nagel 2013-12-17 21:00:34 UTC
Created attachment 90900 [details]
lspci on gentoo 3.12.5
Comment 11 Ilia Mirkin 2013-12-17 21:11:38 UTC
One quick way to check if you have the same problem as bug 72180 is to use the blob fw. If that works, then you have the same issue. I guess I didn't make the connection originally...
Comment 12 Matthias Nagel 2013-12-17 22:01:42 UTC
I tried to use the blob firmware, but failed to do so. See my comment at bug #72180 for more.
Comment 13 Ilia Mirkin 2014-01-08 05:10:54 UTC

*** This bug has been marked as a duplicate of bug 72180 ***
Comment 14 Dainius Masiliūnas 2015-11-08 11:33:12 UTC
Reopened as per bug #72180 suggestions.

To make it clear, this is about random GPU lockups of GTX 660 (mine's Gainward), where using PGRAPH firmware from the blob does not fix the issue.

Interestingly enough, looks like there is an equivalent (albeit also messy) bug opened for Fedora (see See Also), and it appears to be a race condition. So trying the patch in that bug might be a good idea. Alternatively they suggest booting with nouveau.noaccel=1. I'll see if I can test this.
Comment 15 Ilia Mirkin 2015-11-08 11:36:09 UTC
Please refresh this issue with new information. Make sure you're using at least kernel 4.3 and Mesa 11.0.4. Both have had important fixes which may affect your situation.
Comment 16 Dainius Masiliūnas 2015-11-08 17:35:21 UTC
Right. I retested now with both kernel 4.3 and Mesa 11.0.4, and... well, it locks up, but with the kernel warning "../include/drm/drm_crtc.h:1577 drm_helper_choose_encoder_dpms" which seems to point to http://lists.freedesktop.org/archives/dri-devel/2015-September/091091.html and isn't actually a nouveau issue.

This prevents me from testing for the nouveau issue until the kernel gets fixed...
Comment 17 Dainius Masiliūnas 2015-11-08 17:54:54 UTC
Created attachment 119484 [details]
Journal (fifo read fault and drm_crtc.h)

Reading a bit more into the kernel log, I see that the drm_crtc.h warning might have been triggered by nouveau after all, because above that I have:

nouveau 0000:01:00.0: fifo: read fault at 6ff792f000 engine 07 [PBDMA0] client 06 [HOST] reason 00 [PDE] on channel 31 [023e0c9000 xembedsniproxy[2833]]
nouveau 0000:01:00.0: fifo: fifo engine fault on channel 31, recovering...
------------[ cut here ]------------
WARNING: CPU: 0 PID: 4 at ../drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.h:73 gk104_fifo_recover_work+0x22a/0x290 [nouveau]()

Attached the systemd journal of this. The warning above is at line 1834. The Xorg.0.log file does not have any errors or warnings at all.

I'm not sure whether this should be yet another bug report.
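
For anyone comparing fault lines across journals, here is a minimal Python sketch that splits a fifo fault message like the one quoted above into named fields. The regular expression is modelled only on that single line; the field names (`engine`, `client`, `reason`, `inst` and so on) and the assumption that a "write fault" variant has the same layout are mine, not taken from the nouveau source, so other kernel versions may need a different pattern.

```python
import re

# Pattern modelled on the one fault line quoted in this comment.
# Field names are illustrative; "write" is an assumed variant.
FAULT_RE = re.compile(
    r"fifo: (?P<access>read|write) fault at (?P<addr>[0-9a-f]+) "
    r"engine (?P<engine_id>[0-9a-f]+) \[(?P<engine>[^\]]*)\] "
    r"client (?P<client_id>[0-9a-f]+) \[(?P<client>[^\]]*)\] "
    r"reason (?P<reason_id>[0-9a-f]+) \[(?P<reason>[^\]]*)\] "
    r"on channel (?P<channel>\d+) "
    r"\[(?P<inst>[0-9a-f]+) (?P<process>.+)\[(?P<pid>\d+)\]\]"
)


def parse_fifo_fault(line):
    """Return a dict of fields for a fifo fault line, or None if it does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Convert the numeric fields; the address is printed in hex without a prefix.
    fields["addr"] = int(fields["addr"], 16)
    fields["channel"] = int(fields["channel"])
    fields["pid"] = int(fields["pid"])
    return fields
```

Calling `parse_fifo_fault()` on each journal line and keeping the non-`None` results gives a list of faults that can be compared by process, engine, or reason.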
Comment 18 Dainius Masiliūnas 2015-11-10 16:56:16 UTC
Testing it a few more times, it is indeed the read fault by nouveau that's causing the lockup in this case. The general DRM error does not appear during all boots, but the nouveau read fault does. When waiting around for a long time, the kernel log also has this:

INFO: task kworker/0:4:956 blocked for more than 480 seconds.
      Tainted: G        W  O    4.3.0-1-default #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/0:4     D 0000000000000000     0   956      2 0x00000080
Workqueue: events gk104_fifo_recover_work [nouveau]
ffff8800d9d8bbc8 0000000000000046 ffff8801fd2b2080 ffff880214b0e040
ffff8800d9d8c000 ffff8800d9d8bd18 ffff8800d9d8bd10 ffff880214b0e040
ffff8802142d8810 ffff8800d9d8bbe0 ffffffff8166a1aa 7fffffffffffffff
Call Trace:
[<ffffffff8166a1aa>] schedule+0x3a/0x90
[<ffffffff8166cfb7>] schedule_timeout+0x197/0x260
[<ffffffff8166b526>] wait_for_completion+0x96/0x100
[<ffffffff8108019d>] flush_work+0xed/0x180
[<ffffffffa02b79dd>] gk104_fifo_fini+0x1d/0x50 [nouveau]
[<ffffffffa02b443c>] nvkm_fifo_fini+0x1c/0x30 [nouveau]
[<ffffffffa02546a0>] nvkm_engine_fini+0x20/0x30 [nouveau]
[<ffffffffa0258511>] nvkm_subdev_fini+0x61/0x1e0 [nouveau]
[<ffffffffa02b8d3b>] gk104_fifo_recover_work+0xeb/0x290 [nouveau]
[<ffffffff81080c89>] process_one_work+0x159/0x470
[<ffffffff81080fe8>] worker_thread+0x48/0x4a0
[<ffffffff81086c79>] kthread+0xc9/0xe0
[<ffffffff8166e80f>] ret_from_fork+0x3f/0x70
DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

Leftover inexact backtrace:
[<ffffffff81086bb0>] ? kthread_worker_fn+0x170/0x170

I'm still not sure if this should be a separate bug report.
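
To make traces like the one above easier to compare between boots, the following Python sketch pulls out the symbol and module of each frame. It assumes the `[<address>] symbol+0xoffset/0xsize [module]` layout seen in this log; the function names `frames` and `module_frames` are hypothetical helpers, not part of any existing tool.

```python
import re

# Frame layout as it appears in the trace above:
#   [<address>] symbol+0xoffset/0xsize [module]
# An optional "? " marks inexact frames; the module tag is absent for core kernel code.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\] (?P<inexact>\? )?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def frames(trace_text):
    """Return (symbol, module) pairs in trace order; module is None for core kernel frames."""
    return [(m["symbol"], m["module"]) for m in FRAME_RE.finditer(trace_text)]


def module_frames(trace_text, module="nouveau"):
    """Return only the symbols that belong to the given module."""
    return [sym for sym, mod in frames(trace_text) if mod == module]
```

Running `module_frames()` over the pasted trace lists just the nouveau functions, which makes it quicker to see whether two hangs went through the same driver path.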
Comment 19 Karol Herbst 2016-04-04 20:39:35 UTC
Is xembedsniproxy always involved in the crash? If so, it might be worth running an mmt trace until it crashes and checking what it is actually doing.
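
One way to answer that from saved logs is to count which process name shows up in each fifo fault line. The Python sketch below does this, assuming the fault lines have the same `[<instance> <process>[<pid>]]` suffix as the one in comment 17; the file name `journal.txt` in the usage note is hypothetical.

```python
import re
from collections import Counter

# Captures the process name from the trailing "[<instance> <process>[<pid>]]"
# part of a nouveau fifo fault line. The format is assumed from comment 17.
FAULT_PROC_RE = re.compile(
    r"fifo: \w+ fault at .* on channel \d+ \[[0-9a-f]+ (.+)\[\d+\]\]"
)


def faulting_processes(lines):
    """Count how often each process name appears in fifo fault lines."""
    counts = Counter()
    for line in lines:
        m = FAULT_PROC_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

For example, `faulting_processes(open("journal.txt"))` returns a `Counter` keyed by process name; if only xembedsniproxy appears across several crashes, that would support tracing it specifically.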
Comment 20 Lucas Ribeiro 2016-04-19 21:54:05 UTC
Also having random lockups on a GTX 660 Ti (NVE4 according to glxinfo), since kernel 4.1 I guess, using DRI2.
[    0.267666] nouveau 0000:02:00.0: NVIDIA GK104 (0e4030a2)
[    0.378583] nouveau 0000:02:00.0: bios: version 80.04.4b.00.1a
[    0.379302] nouveau 0000:02:00.0: fb: 2048 MiB GDDR5

Now on gentoo ~amd64 using:

sys-kernel/gentoo-sources-4.5.1
x11-base/xorg-server-1.18.3
x11-drivers/xf86-video-nouveau-1.0.12


Should I make a new entry for this card?
Comment 21 Lucas Ribeiro 2016-04-20 03:33:38 UTC
Just tried karolherbst's nouveau reclocking tree: https://github.com/karolherbst/nouveau/tree/stable_reclocking_kepler_v4

Using this module and reclocking to pstate 07 fixed the hangs I was having before. Maybe this fixes 660 hangs too.
Comment 22 Lucas Ribeiro 2016-04-20 07:58:20 UTC
Forget what I just said; the hangs still happen. Opened a new bug, #95031.
Comment 23 Dainius Masiliūnas 2016-08-27 18:41:27 UTC
Still hangs on kernel 4.7.1. This time the journal didn't actually have anything in it concerning the hang... Very odd.

Also, when I set the driver to modesetting in xorg.conf.d, it seems to work without hanging (but on llvmpipe). So it seems to get triggered by something with regards to 3D...

