Created attachment 72569 [details]
/var/log/kern.log
After upgrading from Ubuntu precise to quantal there are lots of messages like these (attachment has /var/log/kern.log of the time when that happened):
Jan 4 12:18:12 koli kernel: [ 3705.970720] nouveau E[ DRM] fail ttm_validate
Jan 4 12:18:12 koli kernel: [ 3705.970726] nouveau E[ DRM] validate vram_list
Jan 4 12:18:12 koli kernel: [ 3705.970760] nouveau E[ DRM] validate: -12
and there is graphics corruption. After a while it gets really bad so that I have to stop X. After restarting X the graphics adapter is still unusable, graphics remain distorted. By that time the following messages can be seen in /var/log/kern.log:
Jan 4 15:08:23 koli kernel: [13917.531124] nouveau [ PGRAPH][0000:01:00.0] ERROR nsource: LIMIT_COLOR nstatus: PROTECTION_FAULT
Jan 4 15:08:23 koli kernel: [13917.531137] nouveau E[ PGRAPH][0000:01:00.0] ch 3 [0x0013c000] subc 7 class 0x4097 mthd 0x0204 data 0x00180000
Jan 4 15:08:26 koli kernel: [13920.488008] nouveau E[ DRM] reloc wait_idle failed: -16
Jan 4 15:08:26 koli kernel: [13920.488013] nouveau E[ DRM] reloc apply: -16
Jan 4 15:08:26 koli kernel: [13920.492005] [sched_delayed] sched: RT throttling activated
Jan 4 15:08:29 koli kernel: [13923.600790] nouveau E[ DRM] fail ttm_validate
Jan 4 15:08:29 koli kernel: [13923.600796] nouveau E[ DRM] validate vram_list
Jan 4 15:08:29 koli kernel: [13923.600803] nouveau E[ DRM] validate: -16
Jan 4 15:08:35 koli kernel: [13929.600009] nouveau E[ DRM] reloc wait_idle failed: -16
Jan 4 15:08:35 koli kernel: [13929.600014] nouveau E[ DRM] reloc apply: -16
Jan 4 15:09:42 koli kernel: [13993.332007] nouveau E[ 1701] failed to idle channel 0xcccc0001
Jan 4 15:09:42 koli kernel: [13996.332006] nouveau E[ 1701] failed to idle channel 0xcccc0000
Jan 4 15:09:59 koli kernel: [14010.792008] nouveau E[ PGRAPH][0000:01:00.0] idle timed out with status 0x0be80001
Jan 4 15:10:01 koli kernel: [14013.614181] nouveau E[ PGRAPH][0000:01:00.0] idle timed out with status 0x0be80001
Jan 4 15:10:03 koli kernel: [14016.824509] [TTM] Failed to expire sync object before buffer eviction
Created attachment 72570 [details] output of lspci -vv
Created attachment 72571 [details] Xorg.0.log Notice that I have two 24" screens attached both at 1660x1050
(In reply to comment #2)
> Created attachment 72571 [details]
> Xorg.0.log
>
> Notice that I have two 24" screens attached both at 1660x1050
Correction: both screens run at 1680x1050.
mesa version?
Mesa version 9.0 (i.e. on Ubuntu quantal it's called 9.0-0ubuntu1)
Created attachment 72588 [details] [review] better logging Please attach dmesg with this patch applied (on top of 3.7).
*** Bug 56718 has been marked as a duplicate of this bug. ***
Created attachment 72616 [details] dmesg with logging patch applied
Created attachment 72680 [details]
Another dmesg with an early PROTECTION_FAULT but still functional X
No reboot yet today. The X server is still up, but for a moment I thought it got stuck. The last messages are:
[20361.350868] nouveau E[ DRM] fail ttm_validate
[20361.350875] nouveau E[ DRM] validate vram_list, vram_list_size: 183160832, gart_list_size: 4780032, both_list_size: 0
[20361.350959] nouveau E[ DRM] validate: -12 [compiz[5399]]
[20364.892611] nouveau E[ DRM] reloc wait_idle failed: -16
[20364.892618] nouveau E[ DRM] reloc apply: -16 [compiz[5399]]
[20365.091459] nouveau E[ DRM] fail ttm_validate
[20365.091465] nouveau E[ DRM] vram
[20365.091491] nouveau E[ DRM] validate: -16 [compiz[5399]]
[20374.260008] nouveau E[ DRM] reloc wait_idle failed: -16
[20374.260015] nouveau E[ DRM] reloc apply: -16 [Xorg[3542]]
[20377.261006] nouveau E[ DRM] reloc wait_idle failed: -16
[20377.261013] nouveau E[ DRM] reloc apply: -16 [Xorg[3542]]
[20377.262000] [sched_delayed] sched: RT throttling activated
[20380.951008] nouveau E[ DRM] reloc wait_idle failed: -16
[20380.951015] nouveau E[ DRM] reloc apply: -16 [Xorg[3542]]
Compiz (through Mesa) asks for 180MB of VRAM and the card has 256MB. So there seem to be two bugs here:
- the 3D driver asks for too much VRAM (180MB for a compositor?)
- the kernel should handle applications asking for 180MB out of 256MB (pinned buffers should not take 76MB)
Let's figure out what's wrong on the kernel side first.
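To make the arithmetic above concrete, here is a minimal Python sketch. It takes the vram_list_size and gart_list_size values from the dmesg excerpt quoted in this report and compares them with the card's VRAM. It assumes that "256MB" means 256 MiB, and the helper name vram_headroom is made up for illustration; it says nothing about how the kernel actually accounts for pinned buffers.

```python
MIB = 1024 * 1024

def vram_headroom(total_vram_bytes, requested_bytes):
    """Return the bytes left in VRAM once the requested buffers are placed."""
    return total_vram_bytes - requested_bytes

# Assumption: the "256MB" card exposes 256 MiB of VRAM.
total_vram = 256 * MIB
# Values copied from the "validate vram_list" log line in this report.
vram_list_size = 183160832
gart_list_size = 4780032

headroom = vram_headroom(total_vram, vram_list_size)

print(f"requested in VRAM: {vram_list_size / MIB:.1f} MiB")
print(f"requested in GART: {gart_list_size / MIB:.1f} MiB")
print(f"left for everything else in VRAM: {headroom / MIB:.1f} MiB")
```

If the validation fails with -12 even though the request is smaller than the card's total VRAM, then the remaining space must already be occupied by buffers that cannot be evicted, which is the kernel-side question raised above.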
Created attachment 72932 [details] [review] better logging v2 please attach kernel log with this patch applied
(In reply to comment #11)
> Created attachment 72932 [details] [review]
> better logging v2
>
> please attach kernel log with this patch applied
Hi Marcin,
Please be patient: I replaced the video adapter last week. It's the PC at work and I couldn't get much work done with that setup. We do have a few other PCs with that same video adapter, but I'm sure they don't have two 1680x1050 monitors attached, which may influence whether the bug is triggered. I'll see what I can do to help you chase that bug.
I'm experiencing the same bug on a laptop with Nvidia 8400M GS. So I could help with the log, if necessary.
(In reply to comment #13) > I'm experiencing the same bug on a laptop with Nvidia 8400M GS. > So I could help with the log, if necessary. Yes please do, that would be great. Thanks
(In reply to comment #11)
> Created attachment 72932 [details] [review]
> better logging v2
>
> please attach kernel log with this patch applied
Hi Marcin,
I'm on Raring (Ubuntu +1), which is now based on the 3.8.0 kernel. The bug is the same, and it is reproducible there as well. My question is: does your patch apply to 3.8? patch -p1 gives me some "Hunk FAILED" messages.
I get the same warnings/errors when switching the KDE graphics settings in systemsettings. Sometimes the system locks up and I need a hard reset, and sometimes X just segfaults and restarts (but then the GPU is sometimes locked up and I need to reboot). I'm running 00:0d.0 VGA compatible controller: NVIDIA Corporation C61 [GeForce 6150SE nForce 430] (rev a2)
Created attachment 73073 [details] my lspci -vv
Created attachment 73074 [details] nouveau output stored in /var/log/messages I found all (or maybe only most) of the kernel error messages in /var/log/messages. I forgot to add that I'm running 3.6.11-gentoo. If you would like me to test any patches, feel free to ask.
Created attachment 73165 [details] dmesg with logging v2 patch I successfully built v2 patch on top of latest 3.7.0.x available kernel for Raring. This patch seems to cause an XServer crash, whenever I launch the Dash (talking about Unity desktop environment).
Created attachment 73179 [details]
dmesg with loggingv2 patch applied
Here is the dmesg from another system, with the same (G73, NV40) video adapter, with two smaller monitors (1240x1024). Notice that there are a lot of CACHE_ERROR messages, so many that the printk buffer overflowed.
[70448.465702] nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0184 data 0xbeef0201
[70448.465725] nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0188 data 0xbeef0201
[70448.465745] nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0300 data 0x0000000b
...
[70448.475032] nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 7 mthd 0x1dac data 0x00000000
[70448.475053] nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 7 mthd 0x1dac data 0x00000000
[70448.475072] nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 7 mthd 0x1dac data 0x00000000
BTW, the user of this system is not doing much with it, so the logging is preliminary. I'm waiting for more details to show up. In the meantime, this is what we get after the first login.
Created attachment 75140 [details] A trace of this bug
(In reply to comment #21)
> Created attachment 75140 [details]
> A trace of this bug
I can confirm this bug. I get this trace (attachment 75140 [details]) on my HP workstation, usually after 6 or 7 days of nonstop use. Some information about my system:
Debian testing
kernel: 3.7.7 (Vanilla)
xserver-xorg-video-nouveau: 1.0.1-5
40:00.0 VGA compatible controller: NVIDIA Corporation NV44 [Quadro NVS 285] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 0334
Physical Slot: 2
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
Memory at f0000000 (64-bit, prefetchable) [size=128M]
Memory at f9000000 (64-bit, non-prefetchable) [size=16M]
Expansion ROM at <unassigned> [disabled]
Capabilities: [60] Power Management version 2
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [128] Power Budgeting <?>
Kernel driver in use: nouveau
If you want, I can send you some other info or apply a patch to try to solve the issue.
Created attachment 75214 [details] another dmesg with logging v2 patch applied Here is another dmesg with v2 patch built on top of 3.7 kernel.
Hello, I have a similar problem. Everything seems OK until I launch Xbmc (full screen). I then get lots of "ttm_validate" messages in syslog, and then the display goes crazy.
May 6 19:39:16 mercure dbus[725]: [system] Successfully activated service 'org.freedesktop.UDisks'
May 6 19:39:24 mercure kernel: [ 9001.240755] nouveau E[ DRM] fail ttm_validate
May 6 19:39:24 mercure kernel: [ 9001.240765] nouveau E[ DRM] validate vram_list
May 6 19:39:24 mercure kernel: [ 9001.240781] nouveau E[ DRM] validate: -12
May 6 19:39:26 mercure kernel: [ 9004.097548] nouveau E[ DRM] fail ttm_validate
May 6 19:39:26 mercure kernel: [ 9004.097558] nouveau E[ DRM] validate vram_list
May 6 19:39:26 mercure kernel: [ 9004.097577] nouveau E[ DRM] validate: -12
May 6 19:39:30 mercure kernel: [ 9007.536331] nouveau E[ DRM] fail ttm_validate
May 6 19:39:30 mercure kernel: [ 9007.536340] nouveau E[ DRM] validate vram_list
May 6 19:39:30 mercure kernel: [ 9007.536359] nouveau E[ DRM] validate: -12
May 6 19:39:33 mercure kernel: [ 9010.661699] nouveau E[ DRM] fail ttm_validate
May 6 19:39:33 mercure kernel: [ 9010.661710] nouveau E[ DRM] validate vram_list
May 6 19:39:33 mercure kernel: [ 9010.661729] nouveau E[ DRM] validate: -12
...
May 6 19:42:15 mercure kernel: [ 9173.135847] BUG: Bad page map in process xbmc.bin pte:800000002bc20067 pmd:21da7067
May 6 19:42:15 mercure kernel: [ 9173.135862] page:ffffea0000af0800 count:-1 mapcount:-1 mapping: (null) index:0x0
May 6 19:42:15 mercure kernel: [ 9173.135866] page flags: 0x14(referenced|dirty)
May 6 19:42:15 mercure kernel: [ 9173.135878] addr:00007f30f007f000 vm_flags:002000fb anon_vma: (null) mapping:ffff880036790e48 index:80
May 6 19:42:15 mercure kernel: [ 9173.135890] vma->vm_ops->fault: shmem_fault+0x0/0xa0
May 6 19:42:15 mercure kernel: [ 9173.135896] vma->vm_file->f_op->mmap: shmem_mmap+0x0/0x30
May 6 19:42:15 mercure kernel: [ 9173.135903] Pid: 4223, comm: xbmc.bin Tainted: GF 3.8.0-19-generic #29-Ubuntu
May 6 19:42:15 mercure kernel: [ 9173.135907] Call Trace:
May 6 19:42:15 mercure kernel: [ 9173.135922] [<ffffffff8115477d>] print_bad_pte+0x1dd/0x250
May 6 19:42:15 mercure kernel: [ 9173.135930] [<ffffffff81156f62>] unmap_page_range+0x692/0x750
May 6 19:42:15 mercure kernel: [ 9173.135939] [<ffffffff8105fb8a>] ? current_fs_time+0x1a/0x60
May 6 19:42:15 mercure kernel: [ 9173.135947] [<ffffffff81155e73>] ? do_wp_page+0x393/0x7f0
May 6 19:42:15 mercure kernel: [ 9173.135955] [<ffffffff811570aa>] unmap_single_vma+0x8a/0x100
May 6 19:42:15 mercure kernel: [ 9173.135962] [<ffffffff81157909>] unmap_vmas+0x49/0x90
May 6 19:42:15 mercure kernel: [ 9173.135970] [<ffffffff8115c894>] unmap_region+0xa4/0x120
May 6 19:42:15 mercure kernel: [ 9173.135979] [<ffffffff8115ebca>] do_munmap+0x2ba/0x410
May 6 19:42:15 mercure kernel: [ 9173.135987] [<ffffffff8115ed6e>] vm_munmap+0x4e/0x70
May 6 19:42:15 mercure kernel: [ 9173.135994] [<ffffffff8115fc4b>] sys_munmap+0x2b/0x40
May 6 19:42:15 mercure kernel: [ 9173.136005] [<ffffffff816d379d>] system_call_fastpath+0x1a/0x1f
I am using:
Xubuntu 13.04 x86_64
Kernel 3.8.0-19-generic
xserver-xorg 1:7.7+1ubuntu4
00:05.0 VGA compatible controller: NVIDIA Corporation C51PV [GeForce 6150] (rev a2) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. A8N-VM CSM
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 16
Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at fc000000 (64-bit, non-prefetchable) [size=16M]
[virtual] Expansion ROM at 40000000 [disabled] [size=128K]
Capabilities: [48] Power Management version 2
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Kernel driver in use: nouveau
Regards
Anything happening on this front? Are more logs needed? I've been suffering from this bug for years, and with the upgrade to Fedora 19 it has reached a point where the system is almost unusable. I originally reported it to Fedora, but I did not get much feedback there: https://bugzilla.redhat.com/show_bug.cgi?id=699551
Please confirm that this is still an issue with the latest and greatest (i.e. kernel 3.10+, xf86-video-nouveau 1.0.9, mesa git or at the very least, 9.1.6) Also... a summary of what the issue is would be great. There are a lot of comments, and they potentially seem to talk about different things. But perhaps not, it's all a bit unclear. Repro steps (that don't start with step 1: install fedora/ubuntu/whatever) would be fantastic too.
No response to re-test request in over a month. Closing as invalid. Also, this bug appears to have been a hodge-podge of potentially unrelated issues, I can't even really tell what this issue was about. If things persist, open new separate issues, and follow the advice on http://nouveau.freedesktop.org/wiki/Bugs/ for what information to provide.
Yes, I agree
Hi, the bug is alive and kicking on Ubuntu 15.04 (kernel 3.19) with both mesa 10.6.x and 11.0.0devel. It seems to specifically affect GeForce 61xx and GeForce 70xx. Attaching new logs, if possible. Mauro
Created attachment 117543 [details] dmesg kernel log
Created attachment 117544 [details] Xorg log
Created attachment 120530 [details] kern.log from 4.2.0-19-generic ubuntu-15.10 with dual-head NV44 Working nicely until woken from overnight screen blanking, then the classic ttm_validate issue. This is with ppa.launchpad.net/graphics-drivers/'s xserver-xorg-video-nouveau amd64 1:1.0.11-1ubuntu3
My issue was resolved by an update yesterday, to a version which I can't confirm at the moment. The same dual-monitor setup has now twice gone through the once-fatal screen-blank/resume sequence.
(In reply to peter swain from comment #33)
> my issue was resolved by update yesterday,
by xserver-xorg-video-nouveau 1:1.0.12+git1512080732.b18bc0~gd~w from
deb http://ppa.launchpad.net/oibaf/graphics-drivers/ubuntu wily main
Peeking into logs, it looks like this commit was responsible, as the issue looks similar ...
author Mario Kleiner <mario.kleiner.de@gmail.com> 2015-06-28 00:33:49 (GMT)
committer Ben Skeggs <bskeggs@redhat.com> 2015-11-17 05:55:42 (GMT)
commit 6e6d8ac1c7b4ee047a7b40b95dea1e65a7c3211a
"Take shift in crtc positions for ZaphodHeads configs into account"
I too am seeing sporadic "fail ttm_validate" messages, though mine say "validating bo list" (as in Comment 32) rather than "validate vram_list" (as in Comment 0 et al).
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kscreenlocker_g[15847]: fail ttm_validate
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kscreenlocker_g[15847]: validating bo list
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kscreenlocker_g[15847]: validate: -12
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kwin_x11[2268]: fail ttm_validate
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kwin_x11[2268]: validating bo list
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kwin_x11[2268]: validate: -12
01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce 8600 GT] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd G84 [GeForce 8600 GT]
Flags: bus master, fast devsel, latency 0, IRQ 27
Memory at e4000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e2000000 (64-bit, non-prefetchable) [size=32M]
I/O ports at 3000 [size=128]
Expansion ROM at <ignored> [disabled]
Capabilities: <access denied>
Kernel driver in use: nouveau
Linux version 4.5.1-gentoo (root@Crushinator) (gcc version 5.3.0 (Gentoo 5.3.0 p1.0, pie-0.6.5) ) #3 SMP Wed Apr 20 10:39:09 EDT 2016
This is still a problem in Linux 4.5.3. I might add that I also see kernel log lines like this (not new in 4.5.3): [Sun May 8 12:23:15 2016] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 8 [chrome[2561]] subc 0 mthd 0060 data beef0201 The "beef0201" seems suspicious. I have seen other values there, but most often it's "beef0201". Doesn't this seem like a sentinel value?
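One way to check how often a given data value shows up is to tally it across the kernel log. The following Python sketch does that for CACHE_ERROR lines in both formats that appear in this report (with and without the 0x prefix). The regex and function name are illustrative assumptions, and the sample lines are copied from this report. It only counts values and does not say what "beef0201" means.

```python
import re
from collections import Counter

# Matches both "mthd 0060 data beef0201" and "mthd 0x0184 data 0xbeef0201".
CACHE_ERROR_RE = re.compile(
    r"CACHE_ERROR - ch (\d+) \[(.+?)\] subc (\d+) "
    r"mthd (?:0x)?([0-9a-fA-F]+) data (?:0x)?([0-9a-fA-F]+)"
)

def tally_cache_error_data(lines):
    """Count how often each 'data' value occurs in CACHE_ERROR log lines."""
    counts = Counter()
    for line in lines:
        match = CACHE_ERROR_RE.search(line)
        if match:
            counts[match.group(5).lower()] += 1
    return counts

SAMPLE_LINES = [
    "nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 8 [chrome[2561]] subc 0 mthd 0060 data beef0201",
    "nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0184 data 0xbeef0201",
    "nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0300 data 0x0000000b",
]

counts = tally_cache_error_data(SAMPLE_LINES)
for value, count in counts.most_common():
    print(f"{value}: {count}")
```

To run it on a real log, pass an open file (for example the saved output of dmesg) instead of SAMPLE_LINES.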
This is still an issue with linux 4.5.4 as well. I'm using a GeForce 7150m and am getting the same error:
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: fail ttm_validate
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: validating bo list
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: validate: -12
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: fail ttm_validate
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: validating bo list
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: validate: -12
As a side note, I've been seeing these errors since updating from plasma4 to plasma5 (a couple of months now), and thus far have been uncertain whether or not this is a driver issue or a plasma issue.
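The numeric codes in these messages follow the usual kernel convention of returning negated errno values, so -12 corresponds to ENOMEM and the -16 seen earlier in this report to EBUSY. Here is a small Python sketch for decoding them from log lines. The regex and function name are illustrative, the sample lines are copied from this report, and the errno table is the one of the machine running the script (the values match on Linux).

```python
import errno
import os
import re

# Matches the return code at the end of the nouveau messages quoted in this report.
CODE_RE = re.compile(r"(validate|reloc apply|reloc wait_idle failed): (-\d+)")

def decode_kernel_error(code):
    """Map a negative kernel return code to (symbolic name, description)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)

SAMPLE_LINES = [
    "nouveau 0000:00:12.0: plasmashell[8397]: validate: -12",
    "nouveau E[ DRM] reloc wait_idle failed: -16",
]

for line in SAMPLE_LINES:
    match = CODE_RE.search(line)
    if match:
        what, code = match.group(1), int(match.group(2))
        name, text = decode_kernel_error(code)
        print(f"{what}: {code} -> {name} ({text})")
```

ENOMEM on a "validate" line is consistent with the buffers not fitting into memory, while EBUSY on the "reloc" lines suggests the driver gave up waiting for a buffer to become idle.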
Created attachment 123753 [details]
kernel log with ttm_validate, RT_FAULT, ZETA_FAULT, PAGE_NOT_PRESENT
The problems also started in earnest for me around the time I upgraded to Plasma 5. Nouveau was never *stable* before then, but I was able to ignore its errors for the most part. Now I can't go more than a few days without X freezing or even the kernel panicking.
I do not believe the problems are triggered solely by plasmashell. I most frequently see the "fail ttm_validate" message for kscreenlocker_greet while I am away from my computer. I also very frequently see graphical corruption on the lock screen in the border around my avatar.
There are other regressions too. I used to be able to use the XVideo output module in VLC (in fact, it was the only one that was stable). Now, neither XVideo nor OpenGL/GLX will run more than a few frames before the video freezes and "fail ttm_validate" messages spew into the kernel log. The only VLC output module that gives me any stability anymore is VDPAU and only if I disable hardware decoding, but even that will freeze X hard from time to time.
The "fail ttm_validate" messages are just the harbinger of impending doom. If I continue without rebooting, eventually I'll be hit by an onslaught of much more ominous errors. Here's a small sampling:
May 14 02:53:20 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 4 [chrome[21051]] subc 0 mthd 0060 data beef0201
May 14 03:43:11 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 6 [kwin_x11[2304]] subc 0 mthd 0060 data beef0201
May 14 05:06:50 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 1 [DRM] subc 0 mthd 0060 data 80000002
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - 00000040 [RT_FAULT] - Address 00204c7000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - e0c: 00000000, e18: 00000000, e1c: 00000000, e20: 00001100, e24: 00030000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - 00000040 [RT_FAULT] - Address 00204c8000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - e0c: 00000000, e18: 00000000, e1c: 00000010, e20: 00001100, e24: 00030000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: 00200000 [] ch 11 [000eeda000 plasmashell[2665]] subc 3 class 8297 mthd 1904 data 01000404
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: fb: trapped write at 00204c8000 on channel 11 [0eeda000 plasmashell[2665]] engine 00 [PGRAPH] client 0b [PROP] subclient 00 [RT0] reason 00000002 [PAGE_NOT_PRESENT]
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: fb: trapped write at 0020563800 on channel 2 [0fb2f000 X[2086]] engine 00 [PGRAPH] client 0b [PROP] subclient 08 [ZETA] reason 00000002 [PAGE_NOT_PRESENT]
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - 00000020 [ZETA_FAULT] - Address 002054b100
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - e0c: 00000000, e18: 00000000, e1c: 00040000, e20: 00020000, e24: 08030000
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - 00000040 [RT_FAULT] - Address 00204f1b00
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - e0c: 00000000, e18: 00000000, e1c: 006c0110, e20: 00001100, e24: 00030000
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: 00200000 [] ch 11 [000eeda000 plasmashell[2665]] subc 3 class 8297 mthd 1344 data 00004001
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: fb: trapped write at 0020555b00 on channel 11 [0eeda000 plasmashell[2665]] engine 00 [PGRAPH] client 0b [PROP] subclient 08 [ZETA] reason 00000002 [PAGE_NOT_PRESENT]
Attached is the complete error log from this session.
The problems aren't limited to X, though. When nouveau enters a failure state like this, it corrupts memory belonging to other processes. I have several times (at least thrice) seen bitcoind crash at the same time as this storm of nouveau errors, logging an error message like:
2016-05-14 16:26:37 Corruption: block checksum mismatch
2016-05-14 16:26:37 *** System error while flushing: Database corrupted
2016-05-14 16:26:37 Error: Error: A fatal internal error occurred, see debug.log for details
2016-05-14 16:26:37 Shutdown: done
When I started seeing these problems, I suspected bad RAM, so I ran Memtest86+ overnight but found no errors. So my suspicion is that nouveau is writing to pages it shouldn't.
Could someone help me modify my kernel so that, instead of merely printing "fail ttm_validate", nouveau sends a SIGBUS to the active process when this occurs? Then I can run plasmashell in gdb and get a clue as to what's causing this.
"fail ttm_validate" usually means "you tried to use too much vram at once". Unfortunately nouveau's mesa driver isn't particularly good at handling that issue, which causes it to get much worse failures down the line. If you have an IGP, please increase the size of the "VRAM" allocation. Realistically, I doubt the plasma5 use-case fits well with nv30/nv40 hardware. You're taking a 2015 compositor and running it on 2005 hardware. You could (correctly) make the argument that nouveau should do a better job at this -- and you'd be right. However volunteers aren't falling over themselves, rushing to fix these issues.
(In reply to Ilia Mirkin from comment #39)
> "fail ttm_validate" usually means "you tried to use too much vram at once".
> Unfortunately nouveau's mesa driver isn't particularly good at handling that
> issue, which causes it to get much worse failures down the line.
A failure in Mesa should, at worst, merely cause the offending process to crash. It shouldn't be possible for an unprivileged user-mode process to bring the entire system down. If it is, then there's a serious (denial-of-service) kernel bug.
> Realistically, I doubt the plasma5 use-case fits well with nv30/nv40
> hardware. You're taking a 2015 compositor and running it on 2005 hardware.
In my case it's a GeForce 8600 GT (G84, Tesla microarchitecture) on a PCI-E card. I realize this is only slightly better (release date in April 2007), but supposedly Nouveau supports Tesla. Actually, I was digging into the scant documentation in the Nouveau project, and Tesla seems to be the one microarchitecture for which Nvidia have provided some documentation, so I honestly would expect it to be the best supported of all the chipsets.
> You could (correctly) make the argument that nouveau should do a better job
> at this -- and you'd be right. However volunteers aren't falling over
> themselves, rushing to fix these issues.
Okay, fine. I really don't mind buying a newer card. I just need to know what's going to work. Can you tell me what to get? I don't care about gaming. I do use a composited desktop. I want to play with Wayland on DRM (no X server). I just want a card that will give me a stable desktop without needing to run a proprietary driver.
I've been eyeing a GeForce GT 730 (GK208-301-A1, Kepler microarchitecture) in the hope that switching cards would solve my stability problems. Would Kepler be more stable on Nouveau than Tesla? I have the general notion that Nouveau has worse support for the newer cards because they're more complex and less is known about them. Is this true? What would you recommend for someone who prioritizes stability above all else?
Thanks.
Created attachment 123756 [details] KDE Compositor XRender XRender: Initial revision - 2003-11-14 https://cgit.freedesktop.org/xorg/proto/renderproto/commit/?id=bb5a469 https://cgit.freedesktop.org/xorg/xserver/commit/render?id=9508a38 NV30 family (Rankine) GeForce FX / 5 https://nouveau.freedesktop.org/wiki/CodeNames/#NV30 https://en.wikipedia.org/wiki/GeForce_FX_series http://download.opensuse.org/tumbleweed/iso/openSUSE-Tumbleweed-KDE-Live-i686-Snapshot20160512-Media.iso
Created attachment 123757 [details] KDE Compositor GLX GLX: Initial revision - 2003-11-14 https://cgit.freedesktop.org/xorg/proto/glproto/commit/?id=ba28c09 https://cgit.freedesktop.org/xorg/xserver/commit/GL/glx?id=9508a38 NV50 family (Tesla) GeForce 8 / 9 / 100 / 200 / 300 https://nouveau.freedesktop.org/wiki/CodeNames/#NV50 https://en.wikipedia.org/wiki/GeForce_8_series http://download.opensuse.org/tumbleweed/iso/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20160512-Media.iso
(In reply to Ilia Mirkin from comment #39) > "fail ttm_validate" usually means "you tried to use too much vram at once". > Unfortunately nouveau's mesa driver isn't particularly good at handling that > issue, which causes it to get much worse failures down the line. > > If you have an IGP, please increase the size of the "VRAM" allocation. > > Realistically, I doubt the plasma5 use-case fits well with nv30/nv40 > hardware. You're taking a 2015 compositor and running it on 2005 hardware. > Looks like GLX is too much for GPU family <= NV50, i.e. NV40, NV30, ... but can go with XRender - the same situation as with Xfwm4 compositing. > You could (correctly) make the argument that nouveau should do a better job > at this -- and you'd be right. However volunteers aren't falling over > themselves, rushing to fix these issues. Only volunteers there!? :)
(In reply to poma from comment #43) [... ] > Looks like GLX is too much for GPU family <= NV50, i.e. NV40, NV30, ... [... ] Perhaps this is a better expression in relation to the actual situation, Looks like GLX is too much for GPU family ≈ NV50, i.e. some of NV50, following all NV40, NV30, etc.
This bug has been corrupted by too many people adding in their own unrelated issues on their unrelated hardware with totally different versions of things, and yet claiming "oh yeah, it must be the same thing!". So I'm closing this. If your issues persist, feel free to open a fresh bug detailing your problems (one bug per reporter, in case it's not clear). That said, a ton of people have various issues with plasma5 + nouveau. Matt, if you're looking for advice on a GPU to buy, try IRC (#nouveau on freenode). My quick recommendation is: "not NVIDIA". If you're set on NVIDIA, happy to discuss the various trade-offs on IRC.
I discovered several failed capacitors on my motherboard. After replacing them, my system stability issues (including some single-bit I/O errors I was observing fairly frequently) have been resolved.
(In reply to Ilia Mirkin from comment #45)
> Matt, if you're looking for advice on a GPU to buy, try IRC (#nouveau on
> freenode). My quick recommendation is: "not NVIDIA". If you're set on
> NVIDIA, happy to discuss the various trade-offs on IRC.
Thank you for the recommendation. I have switched to a Radeon R7 360 (GCN 1.1 microarchitecture, Bonaire Pro chipset) and could not be happier with the open-source Radeon driver and the resulting desktop graphics performance on my system. Kudos to AMD for releasing enough documentation on their chipsets to allow development of a proper (non-reverse-engineered) driver. What a difference it makes. I'll never go back to Nvidia.