59069 – nouveau E[ DRM] fail ttm_validate

Summary: nouveau E[ DRM] fail ttm_validate

Status:	RESOLVED INVALID

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/nouveau
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium critical
Assignee:	Nouveau Project
QA Contact:	Xorg Project Team

Duplicates (1):	56718

Reported:	2013-01-05 21:43 UTC by Kees Bakker
Modified:	2016-06-03 23:29 UTC
CC List:	6 users

Attachments
/var/log/kern.log (143.37 KB, text/plain) 2013-01-05 21:43 UTC, Kees Bakker
output of lspci -vv (10.86 KB, text/plain) 2013-01-05 21:50 UTC, Kees Bakker
Xorg.0.log (71.56 KB, text/plain) 2013-01-05 21:52 UTC, Kees Bakker
better logging (54.21 KB, patch) 2013-01-06 14:11 UTC, Marcin Slusarz
dmesg with logging patch applied (76.15 KB, text/plain) 2013-01-07 10:40 UTC, Kees Bakker
Another dmesg with an early PROTECTION_FAULT but still functional X (79.67 KB, text/plain) 2013-01-08 15:19 UTC, Kees Bakker
better logging v2 (56.15 KB, patch) 2013-01-12 23:20 UTC, Marcin Slusarz
my lspci -vv (13.03 KB, text/plain) 2013-01-15 11:35 UTC, Alexander Stein
nouveau output stored in /var/log/messages (155.79 KB, text/plain) 2013-01-15 11:42 UTC, Alexander Stein
dmesg with logging v2 patch (66.88 KB, text/plain) 2013-01-16 20:27 UTC, Jacopo Moronato
dmesg with loggingv2 patch applied (110.55 KB, text/plain) 2013-01-17 07:16 UTC, Kees Bakker
A trace of this bug (22.17 KB, text/plain) 2013-02-19 23:32 UTC, gianogli
another dmesg with logging v2 patch applied (66.07 KB, text/plain) 2013-02-21 00:01 UTC, Jacopo Moronato
dmesg kernel log (1.77 MB, text/plain) 2015-08-05 13:52 UTC, Mauro Rossi
Xorg log (53.46 KB, text/plain) 2015-08-05 13:57 UTC, Mauro Rossi
kern.log from 4.2.0-19-generic ubuntu-15.10 with dual-head NV44 (118.39 KB, text/plain) 2015-12-15 16:23 UTC, peter swain
kernel log with ttm_validate, RT_FAULT, ZETA_FAULT, PAGE_NOT_PRESENT (94.31 KB, text/plain) 2016-05-14 21:31 UTC, Matt Whitlock
KDE Compositor XRender (49.83 KB, image/png) 2016-05-15 12:58 UTC, poma
KDE Compositor GLX (53.23 KB, image/png) 2016-05-15 12:59 UTC, poma

Description Kees Bakker 2013-01-05 21:43:22 UTC

Created attachment 72569 [details]
/var/log/kern.log

After upgrading from Ubuntu precise to quantal there are lots of messages like these
(attachment has /var/log/kern.log of the time when that happened)

Jan  4 12:18:12 koli kernel: [ 3705.970720] nouveau E[     DRM] fail ttm_validate
Jan  4 12:18:12 koli kernel: [ 3705.970726] nouveau E[     DRM] validate vram_list
Jan  4 12:18:12 koli kernel: [ 3705.970760] nouveau E[     DRM] validate: -12

and there is graphics corruption. After a while it gets really bad so that I have to stop X. After restarting
X the graphics adapter is still unusable, graphics remain distorted.

By that time the following messages can be seen in /var/log/kern.log

Jan  4 15:08:23 koli kernel: [13917.531124] nouveau  [  PGRAPH][0000:01:00.0]  ERROR nsource: LIMIT_COLOR nstatus: PROTECTION_FAULT
Jan  4 15:08:23 koli kernel: [13917.531137] nouveau E[  PGRAPH][0000:01:00.0] ch 3 [0x0013c000] subc 7 class 0x4097 mthd 0x0204 data 0x00180000
Jan  4 15:08:26 koli kernel: [13920.488008] nouveau E[     DRM] reloc wait_idle failed: -16
Jan  4 15:08:26 koli kernel: [13920.488013] nouveau E[     DRM] reloc apply: -16
Jan  4 15:08:26 koli kernel: [13920.492005] [sched_delayed] sched: RT throttling activated
Jan  4 15:08:29 koli kernel: [13923.600790] nouveau E[     DRM] fail ttm_validate
Jan  4 15:08:29 koli kernel: [13923.600796] nouveau E[     DRM] validate vram_list
Jan  4 15:08:29 koli kernel: [13923.600803] nouveau E[     DRM] validate: -16
Jan  4 15:08:35 koli kernel: [13929.600009] nouveau E[     DRM] reloc wait_idle failed: -16
Jan  4 15:08:35 koli kernel: [13929.600014] nouveau E[     DRM] reloc apply: -16
Jan  4 15:09:42 koli kernel: [13993.332007] nouveau E[    1701] failed to idle channel 0xcccc0001
Jan  4 15:09:42 koli kernel: [13996.332006] nouveau E[    1701] failed to idle channel 0xcccc0000
Jan  4 15:09:59 koli kernel: [14010.792008] nouveau E[  PGRAPH][0000:01:00.0] idle timed out with status 0x0be80001
Jan  4 15:10:01 koli kernel: [14013.614181] nouveau E[  PGRAPH][0000:01:00.0] idle timed out with status 0x0be80001
Jan  4 15:10:03 koli kernel: [14016.824509] [TTM] Failed to expire sync object before buffer eviction
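
The negative numbers in these messages appear to follow the usual Linux kernel convention of returning negated errno codes, in which case -12 is ENOMEM (out of memory) and -16 is EBUSY. A minimal sketch for looking such codes up, assuming a Linux system; the helper name describe_kernel_error is made up for illustration and is not part of nouveau:

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Translate a negated errno value as printed in the kernel log (e.g. -12)."""
    number = abs(code)
    # errno.errorcode maps numeric codes to their symbolic names.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{code}: {name} ({os.strerror(number)})"


# The two codes that show up in the log excerpts above.
for code in (-12, -16):
    print(describe_kernel_error(code))
```

Read this way, the first failures are allocation failures (the buffers could not be placed), and the later ones indicate the GPU was still busy.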

Comment 1 Kees Bakker 2013-01-05 21:50:10 UTC

Created attachment 72570 [details]
output of lspci -vv

Comment 2 Kees Bakker 2013-01-05 21:52:54 UTC

Created attachment 72571 [details]
Xorg.0.log

Notice that I have two 24" screens attached both at 1660x1050

Comment 3 Kees Bakker 2013-01-05 21:57:23 UTC

(In reply to comment #2)
> Created attachment 72571 [details]
> Xorg.0.log
> 
> Notice that I have two 24" screens attached both at 1660x1050

1680x1050

Comment 4 Marcin Slusarz 2013-01-05 22:12:35 UTC

mesa version?

Comment 5 Kees Bakker 2013-01-06 13:27:41 UTC

Mesa version 9.0 (i.e. on Ubuntu quantal it's called 9.0-0ubuntu1)

Comment 6 Marcin Slusarz 2013-01-06 14:11:17 UTC

Created attachment 72588 [details] [review]
better logging

Please attach dmesg with this patch applied (on top of 3.7).

Comment 7 Jacopo Moronato 2013-01-06 18:45:11 UTC

*** Bug 56718 has been marked as a duplicate of this bug. ***

Comment 8 Kees Bakker 2013-01-07 10:40:27 UTC

Created attachment 72616 [details]
dmesg with logging patch applied

Comment 9 Kees Bakker 2013-01-08 15:19:55 UTC

Created attachment 72680 [details]
Another dmesg with an early PROTECTION_FAULT but still functional X

No reboot yet today. The X server is still up, although for a moment I thought it had got stuck. The last messages are:

[20361.350868] nouveau E[     DRM] fail ttm_validate
[20361.350875] nouveau E[     DRM] validate vram_list, vram_list_size: 183160832, gart_list_size: 4780032, both_list_size: 0
[20361.350959] nouveau E[     DRM] validate: -12 [compiz[5399]]
[20364.892611] nouveau E[     DRM] reloc wait_idle failed: -16
[20364.892618] nouveau E[     DRM] reloc apply: -16 [compiz[5399]]
[20365.091459] nouveau E[     DRM] fail ttm_validate
[20365.091465] nouveau E[     DRM] vram
[20365.091491] nouveau E[     DRM] validate: -16 [compiz[5399]]
[20374.260008] nouveau E[     DRM] reloc wait_idle failed: -16
[20374.260015] nouveau E[     DRM] reloc apply: -16 [Xorg[3542]]
[20377.261006] nouveau E[     DRM] reloc wait_idle failed: -16
[20377.261013] nouveau E[     DRM] reloc apply: -16 [Xorg[3542]]
[20377.262000] [sched_delayed] sched: RT throttling activated
[20380.951008] nouveau E[     DRM] reloc wait_idle failed: -16
[20380.951015] nouveau E[     DRM] reloc apply: -16 [Xorg[3542]]
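
With the logging patch applied, each failure line names the process that triggered it, so the failures can be grouped per process. A rough sketch, assuming the line format shown above stays as it is; the regular expression and the function name count_failures are illustrative only:

```python
import re
from collections import Counter

# Matches lines such as:
#   nouveau E[     DRM] validate: -12 [compiz[5399]]
LINE_RE = re.compile(
    r"nouveau E\[\s*DRM\] (?P<stage>validate|reloc apply): (?P<code>-\d+) "
    r"\[(?P<proc>[^\[\]]+)\[(?P<pid>\d+)\]\]"
)

SAMPLE = """\
[20361.350959] nouveau E[     DRM] validate: -12 [compiz[5399]]
[20364.892618] nouveau E[     DRM] reloc apply: -16 [compiz[5399]]
[20365.091491] nouveau E[     DRM] validate: -16 [compiz[5399]]
[20374.260015] nouveau E[     DRM] reloc apply: -16 [Xorg[3542]]
[20377.261013] nouveau E[     DRM] reloc apply: -16 [Xorg[3542]]
[20380.951015] nouveau E[     DRM] reloc apply: -16 [Xorg[3542]]
"""


def count_failures(log_text):
    """Count failures keyed by (process name, stage, error code)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            key = (match["proc"], match["stage"], int(match["code"]))
            counts[key] += 1
    return counts


for key, n in sorted(count_failures(SAMPLE).items()):
    print(key, n)
```

On this excerpt the first failure comes from compiz, and the later "reloc apply" failures come from Xorg.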

Comment 10 Marcin Slusarz 2013-01-12 23:19:14 UTC

Compiz (through Mesa) asks for 180MB of VRAM and the card has 256MB.

So, there seem to be two bugs here:
- 3D driver asks for too much VRAM (180MB for compositor?)
- kernel should handle applications asking for 180MB out of 256MB (pinned buffers should not take 76MB)

Let's figure out what's wrong on the kernel side first.
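
These figures can be checked against the sizes logged in comment 9. A minimal sketch, assuming the logged sizes are in bytes and that the card's "256MB" means 256 MiB (neither is stated explicitly in the log):

```python
MIB = 1024 * 1024

vram_list_size = 183160832  # from the comment 9 log, assumed to be bytes
gart_list_size = 4780032    # from the comment 9 log, assumed to be bytes
total_vram = 256 * MIB      # assumption: "256MB" means 256 MiB

requested_mib = vram_list_size / MIB
gart_mib = gart_list_size / MIB
remaining_mib = (total_vram - vram_list_size) / MIB

print(f"VRAM requested in one submission: {requested_mib:.2f} MiB")
print(f"GART requested in one submission: {gart_mib:.2f} MiB")
print(f"VRAM left for everything else:    {remaining_mib:.2f} MiB")
```

By this estimate the request is about 175 MiB (roughly 183 MB in decimal units, consistent with the "180MB" above), which leaves about 81 MiB for pinned buffers and everything else.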

Comment 11 Marcin Slusarz 2013-01-12 23:20:34 UTC

Created attachment 72932 [details] [review]
better logging v2

please attach kernel log with this patch applied

Comment 12 Kees Bakker 2013-01-13 19:42:50 UTC

(In reply to comment #11)
> Created attachment 72932 [details] [review]
> better logging v2
> 
> please attach kernel log with this patch applied

Hi Marcin,

Please be patient: I replaced the video adapter last week. This is my PC at work, and I couldn't get much work done with that setup. We do have a few other PCs with the same video adapter, but I'm sure they don't have two 1680x1050 monitors attached, and that may influence whether the bug triggers.

I'll see what I can do to help you chase that bug.

Comment 13 Jacopo Moronato 2013-01-13 21:56:10 UTC

I'm experiencing the same bug on a laptop with Nvidia 8400M GS.
So I could help with the log, if necessary.

Comment 14 Kees Bakker 2013-01-14 07:05:42 UTC

(In reply to comment #13)
> I'm experiencing the same bug on a laptop with Nvidia 8400M GS.
> So I could help with the log, if necessary.

Yes please do, that would be great. Thanks

Comment 15 Jacopo Moronato 2013-01-14 16:27:13 UTC

(In reply to comment #11)
> Created attachment 72932 [details] [review]
> better logging v2
> 
> please attach kernel log with this patch applied


Hi Marcin,
I'm on Raring (Ubuntu +1), which is now based on the 3.8.0 kernel. The bug is the same there, and it is still reproducible.
My question is: does your patch apply to 3.8? I ask because patch -p1 gives me some "Hunk FAILED" errors.

Comment 16 Alexander Stein 2013-01-15 11:34:33 UTC

I get the same warnings/errors when switching the KDE graphics settings in systemsettings. Sometimes the system locks up and I need a hard reset, and sometimes X just segfaults and restarts (but then the GPU is sometimes locked up and I need to reboot).
I'm running 00:0d.0 VGA compatible controller: NVIDIA Corporation C61 [GeForce 6150SE nForce 430] (rev a2)

Comment 17 Alexander Stein 2013-01-15 11:35:01 UTC

Created attachment 73073 [details]
my lspci -vv

Comment 18 Alexander Stein 2013-01-15 11:42:12 UTC

Created attachment 73074 [details]
nouveau output stored in /var/log/messages

I found all (or maybe only most) of the kernel error messages in /var/log/messages.
I forgot to add that I'm running 3.6.11-gentoo. If I should test any patches, feel free to ask.

Comment 19 Jacopo Moronato 2013-01-16 20:27:35 UTC

Created attachment 73165 [details]
dmesg with logging v2 patch

I successfully built the v2 patch on top of the latest 3.7.0.x kernel available for Raring.
This patch seems to cause an X server crash whenever I launch the Dash (in the Unity desktop environment).

Comment 20 Kees Bakker 2013-01-17 07:16:18 UTC

Created attachment 73179 [details]
dmesg with loggingv2 patch applied

Here is the dmesg from another system, with the same (G73, NV40) video adapter, with two smaller monitors (1240x1024).

Notice that there are a lot of CACHE_ERROR messages, so many that the printk buffer overflowed.

[70448.465702] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0184 data 0xbeef0201
[70448.465725] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0188 data 0xbeef0201
[70448.465745] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0300 data 0x0000000b
...
[70448.475032] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 7 mthd 0x1dac data 0x00000000
[70448.475053] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 7 mthd 0x1dac data 0x00000000
[70448.475072] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 7 mthd 0x1dac data 0x00000000


BTW, the user of this system is not doing much with it, so this logging is preliminary. I'm waiting for more details to show up. In the meantime, this is what we get after the first login.

Comment 21 gianogli 2013-02-19 23:32:58 UTC

Created attachment 75140 [details]
A trace of this bug

Comment 22 gianogli 2013-02-19 23:46:24 UTC

(In reply to comment #21)
> Created attachment 75140 [details]
> A trace of this bug

I can confirm this bug.

I get this trace (attachment 75140 [details]) on my HP workstation, usually after 6 or 7 days of nonstop use.

Some information about my system:

Debian testing
kernel: 3.7.7 (Vanilla)
xserver-xorg-video-nouveau: 1.0.1-5

40:00.0 VGA compatible controller: NVIDIA Corporation NV44 [Quadro NVS 285] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 0334
        Physical Slot: 2
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
        Memory at f0000000 (64-bit, prefetchable) [size=128M]
        Memory at f9000000 (64-bit, non-prefetchable) [size=16M]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [60] Power Management version 2
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Kernel driver in use: nouveau

If you want, I can send more information or apply a patch to try to solve the issue.

Comment 23 Jacopo Moronato 2013-02-21 00:01:31 UTC

Created attachment 75214 [details]
another dmesg with logging v2 patch applied

Here is another dmesg with v2 patch built on top of 3.7 kernel.

Comment 24 ceenturion 2013-05-14 14:46:11 UTC

Hello,
I have a similar problem.
Everything seems OK until I launch XBMC (full screen).
I then get lots of "ttm_validate" messages in syslog, and then the display goes crazy.

May  6 19:39:16 mercure dbus[725]: [system] Successfully activated service 'org.freedesktop.UDisks'
May  6 19:39:24 mercure kernel: [ 9001.240755] nouveau E[     DRM] fail ttm_validate
May  6 19:39:24 mercure kernel: [ 9001.240765] nouveau E[     DRM] validate vram_list
May  6 19:39:24 mercure kernel: [ 9001.240781] nouveau E[     DRM] validate: -12
May  6 19:39:26 mercure kernel: [ 9004.097548] nouveau E[     DRM] fail ttm_validate
May  6 19:39:26 mercure kernel: [ 9004.097558] nouveau E[     DRM] validate vram_list
May  6 19:39:26 mercure kernel: [ 9004.097577] nouveau E[     DRM] validate: -12
May  6 19:39:30 mercure kernel: [ 9007.536331] nouveau E[     DRM] fail ttm_validate
May  6 19:39:30 mercure kernel: [ 9007.536340] nouveau E[     DRM] validate vram_list
May  6 19:39:30 mercure kernel: [ 9007.536359] nouveau E[     DRM] validate: -12
May  6 19:39:33 mercure kernel: [ 9010.661699] nouveau E[     DRM] fail ttm_validate
May  6 19:39:33 mercure kernel: [ 9010.661710] nouveau E[     DRM] validate vram_list
May  6 19:39:33 mercure kernel: [ 9010.661729] nouveau E[     DRM] validate: -12
...
May  6 19:42:15 mercure kernel: [ 9173.135847] BUG: Bad page map in process xbmc.bin  pte:800000002bc20067 pmd:21da7067
May  6 19:42:15 mercure kernel: [ 9173.135862] page:ffffea0000af0800 count:-1 mapcount:-1 mapping:          (null) index:0x0
May  6 19:42:15 mercure kernel: [ 9173.135866] page flags: 0x14(referenced|dirty)
May  6 19:42:15 mercure kernel: [ 9173.135878] addr:00007f30f007f000 vm_flags:002000fb anon_vma:          (null) mapping:ffff880036790e48 index:80
May  6 19:42:15 mercure kernel: [ 9173.135890] vma->vm_ops->fault: shmem_fault+0x0/0xa0
May  6 19:42:15 mercure kernel: [ 9173.135896] vma->vm_file->f_op->mmap: shmem_mmap+0x0/0x30
May  6 19:42:15 mercure kernel: [ 9173.135903] Pid: 4223, comm: xbmc.bin Tainted: GF            3.8.0-19-generic #29-Ubuntu
May  6 19:42:15 mercure kernel: [ 9173.135907] Call Trace:
May  6 19:42:15 mercure kernel: [ 9173.135922]  [<ffffffff8115477d>] print_bad_pte+0x1dd/0x250
May  6 19:42:15 mercure kernel: [ 9173.135930]  [<ffffffff81156f62>] unmap_page_range+0x692/0x750
May  6 19:42:15 mercure kernel: [ 9173.135939]  [<ffffffff8105fb8a>] ? current_fs_time+0x1a/0x60
May  6 19:42:15 mercure kernel: [ 9173.135947]  [<ffffffff81155e73>] ? do_wp_page+0x393/0x7f0
May  6 19:42:15 mercure kernel: [ 9173.135955]  [<ffffffff811570aa>] unmap_single_vma+0x8a/0x100
May  6 19:42:15 mercure kernel: [ 9173.135962]  [<ffffffff81157909>] unmap_vmas+0x49/0x90
May  6 19:42:15 mercure kernel: [ 9173.135970]  [<ffffffff8115c894>] unmap_region+0xa4/0x120
May  6 19:42:15 mercure kernel: [ 9173.135979]  [<ffffffff8115ebca>] do_munmap+0x2ba/0x410
May  6 19:42:15 mercure kernel: [ 9173.135987]  [<ffffffff8115ed6e>] vm_munmap+0x4e/0x70
May  6 19:42:15 mercure kernel: [ 9173.135994]  [<ffffffff8115fc4b>] sys_munmap+0x2b/0x40
May  6 19:42:15 mercure kernel: [ 9173.136005]  [<ffffffff816d379d>] system_call_fastpath+0x1a/0x1f

I am using:
Xubuntu 13.04  x86_64
Kernel 3.8.0-19-generic
xserver-xorg  1:7.7+1ubuntu4

00:05.0 VGA compatible controller: NVIDIA Corporation C51PV [GeForce 6150] (rev a2) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. A8N-VM CSM
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 16
        Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at fc000000 (64-bit, non-prefetchable) [size=16M]
        [virtual] Expansion ROM at 40000000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Kernel driver in use: nouveau             

Regards

Comment 25 Pierre Ossman 2013-08-13 18:47:39 UTC

Anything happening on this front? Are more logs needed?

I've been suffering from this bug for years, and with the upgrade to Fedora 19 it has reached a point where the system is almost unusable. I originally reported it to Fedora, but I did not get much feedback there:

https://bugzilla.redhat.com/show_bug.cgi?id=699551

Comment 26 Ilia Mirkin 2013-08-13 19:01:47 UTC

Please confirm that this is still an issue with the latest and greatest (i.e. kernel 3.10+, xf86-video-nouveau 1.0.9, mesa git or at the very least, 9.1.6)

Also... a summary of what the issue is would be great. There are a lot of comments, and they potentially seem to talk about different things. But perhaps not, it's all a bit unclear. Repro steps (that don't start with step 1: install fedora/ubuntu/whatever) would be fantastic too.
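
For anyone checking whether a system meets the kernel 3.10+ requirement, here is a minimal sketch that compares a release string against that minimum. The function name kernel_at_least is hypothetical, and only the major and minor numbers are compared:

```python
import re


def kernel_at_least(release, minimum=(3, 10)):
    """Return True if a release string such as '3.8.0-19-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # Tuples compare element by element, so (3, 8) < (3, 10) < (4, 5).
    return (int(match.group(1)), int(match.group(2))) >= minimum


# Release strings reported by various commenters in this thread.
for release in ("3.6.11-gentoo", "3.7.7", "3.8.0-19-generic", "4.5.1-gentoo"):
    print(release, kernel_at_least(release))
```

On a running system the release string could come from `uname -r` or Python's platform.release().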

Comment 27 Ilia Mirkin 2013-10-01 16:15:49 UTC

No response to re-test request in over a month. Closing as invalid.

Also, this bug appears to have been a hodge-podge of potentially unrelated issues; I can't even really tell what this issue was about. If things persist, open new separate issues, and follow the advice on http://nouveau.freedesktop.org/wiki/Bugs/ for what information to provide.

Comment 28 Kees Bakker 2013-10-01 17:31:31 UTC

Yes, I agree

Comment 29 Mauro Rossi 2015-08-05 13:50:57 UTC

Hi, the bug is alive and kicking on Ubuntu 15.04 kernel 3.19 with both mesa 10.6.x and 11.0.0devel.

It seems to specifically affect GeForce 61xx and GeForce 70xx.
I'm attaching new logs, if possible.

Mauro

Comment 30 Mauro Rossi 2015-08-05 13:52:55 UTC

Created attachment 117543 [details]
dmesg kernel log

Comment 31 Mauro Rossi 2015-08-05 13:57:30 UTC

Created attachment 117544 [details]
Xorg log

Comment 32 peter swain 2015-12-15 16:23:23 UTC

Created attachment 120530 [details]
kern.log from 4.2.0-19-generic ubuntu-15.10 with dual-head NV44

Working nicely until woken from overnight screen blanking, then the classic ttm_validate issue.
This is with ppa.launchpad.net/graphics-drivers/'s xserver-xorg-video-nouveau amd64 1:1.0.11-1ubuntu3

Comment 33 peter swain 2015-12-17 03:20:16 UTC

my issue was resolved by update yesterday, to a version which I can't confirm at the moment.
I have twice gone through the once-fatal screen-blank/resume sequence on the same dual-monitor setup.

Comment 34 peter swain 2015-12-17 05:57:18 UTC

(In reply to peter swain from comment #33)
> my issue was resolved by update yesterday,
by xserver-xorg-video-nouveau                                            1:1.0.12+git1512080732.b18bc0~gd~w
from
deb http://ppa.launchpad.net/oibaf/graphics-drivers/ubuntu wily main


Peeking into logs, it looks like this commit was responsible,
as the issue looks similar ...
author	Mario Kleiner <mario.kleiner.de@gmail.com>	2015-06-28 00:33:49 (GMT)
committer	Ben Skeggs <bskeggs@redhat.com>	2015-11-17 05:55:42 (GMT)
commit	6e6d8ac1c7b4ee047a7b40b95dea1e65a7c3211a
  "Take shift in crtc positions for ZaphodHeads configs into account"

Comment 35 Matt Whitlock 2016-04-20 18:34:27 UTC

I too am seeing sporadic "fail ttm_validate" messages, though mine say "validating bo list" (as in Comment 32) rather than "validate vram_list" (as in Comment 0 et al).

[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kscreenlocker_g[15847]: fail ttm_validate
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kscreenlocker_g[15847]: validating bo list
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kscreenlocker_g[15847]: validate: -12
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kwin_x11[2268]: fail ttm_validate
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kwin_x11[2268]: validating bo list
[Wed Apr 20 13:39:58 2016] nouveau 0000:01:00.0: kwin_x11[2268]: validate: -12


01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce 8600 GT] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Gigabyte Technology Co., Ltd G84 [GeForce 8600 GT]
        Flags: bus master, fast devsel, latency 0, IRQ 27
        Memory at e4000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e2000000 (64-bit, non-prefetchable) [size=32M]
        I/O ports at 3000 [size=128]
        Expansion ROM at <ignored> [disabled]
        Capabilities: <access denied>
        Kernel driver in use: nouveau

Linux version 4.5.1-gentoo (root@Crushinator) (gcc version 5.3.0 (Gentoo 5.3.0 p1.0, pie-0.6.5) ) #3 SMP Wed Apr 20 10:39:09 EDT 2016

Comment 36 Matt Whitlock 2016-05-08 18:40:08 UTC

This is still a problem in Linux 4.5.3.

I might add that I also see kernel log lines like this (not new in 4.5.3):

[Sun May  8 12:23:15 2016] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 8 [chrome[2561]] subc 0 mthd 0060 data beef0201

The "beef0201" seems suspicious. I have seen other values there, but most often it's "beef0201". Doesn't this seem like a sentinel value?
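
One way to see how often a given data value recurs is to tally it across CACHE_ERROR lines. A rough sketch, assuming the two line formats seen in this thread (older kernels print a 0x prefix, newer ones do not); the regular expression and the name tally_data_values are illustrative, and the tally says nothing about what the value means:

```python
import re
from collections import Counter

# Captures the hexadecimal "data" field, with or without a 0x prefix.
CACHE_ERROR_RE = re.compile(
    r"CACHE_ERROR - ch (?P<ch>\d+) \[(?P<owner>.+?)\] subc (?P<subc>\d+) "
    r"mthd (?:0x)?(?P<mthd>[0-9a-f]+) data (?:0x)?(?P<data>[0-9a-f]+)"
)

# Sample lines copied from comments 20 and 36.
SAMPLE = """\
[70448.465702] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0184 data 0xbeef0201
[70448.465725] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0188 data 0xbeef0201
[70448.465745] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [unity_support_t[30391]] subc 3 mthd 0x0300 data 0x0000000b
[Sun May  8 12:23:15 2016] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 8 [chrome[2561]] subc 0 mthd 0060 data beef0201
"""


def tally_data_values(log_text):
    """Count how often each 'data' value appears in CACHE_ERROR lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CACHE_ERROR_RE.search(line)
        if match:
            counts[match["data"]] += 1
    return counts


for value, n in tally_data_values(SAMPLE).most_common():
    print(value, n)
```

Running this over a full kernel log would show whether "beef0201" really dominates or merely stands out.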

Comment 37 Jeff Hodd 2016-05-14 20:02:06 UTC

This is still an issue with linux 4.5.4 as well.

I'm using a GeForce 7150m and am getting the same error:

May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: fail ttm_validate
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: validating bo list
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: validate: -12
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: fail ttm_validate
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: validating bo list
May 14 14:16:00 bslxhp64 kernel: nouveau 0000:00:12.0: plasmashell[8397]: validate: -12

As a side note, I've been seeing these errors since updating from plasma4 to plasma5 (a couple of months now), and thus far have been uncertain whether or not this is a driver issue or a plasma issue.

Comment 38 Matt Whitlock 2016-05-14 21:31:04 UTC

Created attachment 123753 [details]
kernel log with ttm_validate, RT_FAULT, ZETA_FAULT, PAGE_NOT_PRESENT

The problems also started in earnest for me around the time I upgraded to Plasma 5. Nouveau was never *stable* before then, but I was able to ignore its errors for the most part. Now I can't go more than a few days without X freezing or even the kernel panicking.

I do not believe the problems are triggered solely by plasmashell. I most frequently see the "fail ttm_validate" message for kscreenlocker_greet while I am away from my computer. I also very frequently see graphical corruption on the lock screen in the border around my avatar.

There are other regressions too. I used to be able to use the XVideo output module in VLC (in fact, it was the only one that was stable). Now, neither XVideo nor OpenGL/GLX will run more than a few frames before the video freezes and "fail ttm_validate" messages spew into the kernel log. The only VLC output module that gives me any stability anymore is VDPAU and only if I disable hardware decoding, but even that will freeze X hard from time to time.

The "fail ttm_validate" messages are just the harbinger of impending doom. If I continue without rebooting, eventually I'll be hit by an onslaught of much more ominous errors. Here's a small sampling:

May 14 02:53:20 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 4 [chrome[21051]] subc 0 mthd 0060 data beef0201
May 14 03:43:11 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 6 [kwin_x11[2304]] subc 0 mthd 0060 data beef0201
May 14 05:06:50 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 1 [DRM] subc 0 mthd 0060 data 80000002

May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - 00000040 [RT_FAULT] - Address 00204c7000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - e0c: 00000000, e18: 00000000, e1c: 00000000, e20: 00001100, e24: 00030000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - 00000040 [RT_FAULT] - Address 00204c8000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - e0c: 00000000, e18: 00000000, e1c: 00000010, e20: 00001100, e24: 00030000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: 00200000 [] ch 11 [000eeda000 plasmashell[2665]] subc 3 class 8297 mthd 1904 data 01000404
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: fb: trapped write at 00204c8000 on channel 11 [0eeda000 plasmashell[2665]] engine 00 [PGRAPH] client 0b [PROP] subclient 00 [RT0] reason 00000002 [PAGE_NOT_PRESENT]
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: fb: trapped write at 0020563800 on channel 2 [0fb2f000 X[2086]] engine 00 [PGRAPH] client 0b [PROP] subclient 08 [ZETA] reason 00000002 [PAGE_NOT_PRESENT]

May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - 00000020 [ZETA_FAULT] - Address 002054b100
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - e0c: 00000000, e18: 00000000, e1c: 00040000, e20: 00020000, e24: 08030000
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - 00000040 [RT_FAULT] - Address 00204f1b00
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - e0c: 00000000, e18: 00000000, e1c: 006c0110, e20: 00001100, e24: 00030000
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: 00200000 [] ch 11 [000eeda000 plasmashell[2665]] subc 3 class 8297 mthd 1344 data 00004001
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: fb: trapped write at 0020555b00 on channel 11 [0eeda000 plasmashell[2665]] engine 00 [PGRAPH] client 0b [PROP] subclient 08 [ZETA] reason 00000002 [PAGE_NOT_PRESENT]

Attached is the complete error log from this session.

The problems aren't limited to X, though. When nouveau enters a failure state like this, it corrupts memory belonging to other processes. I have several times (at least thrice) seen bitcoind crash at the same time as this storm of nouveau errors, logging an error message like:

2016-05-14 16:26:37 Corruption: block checksum mismatch
2016-05-14 16:26:37 *** System error while flushing: Database corrupted
2016-05-14 16:26:37 Error: Error: A fatal internal error occurred, see debug.log for details
2016-05-14 16:26:37 Shutdown: done

When I started seeing these problems, I suspected bad RAM, so I ran Memtest86+ overnight but found no errors. So my suspicion is that nouveau is writing to pages it shouldn't.

Could someone help me modify my kernel so that, instead of merely printing "fail ttm_validate", nouveau sends a SIGBUS to the active process when this occurs? Then I can run plasmashell in gdb and get a clue as to what's causing this.

Comment 39 Ilia Mirkin 2016-05-14 22:23:14 UTC

"fail ttm_validate" usually means "you tried to use too much vram at once". Unfortunately nouveau's mesa driver isn't particularly good at handling that issue, which causes it to get much worse failures down the line.

If you have an IGP, please increase the size of the "VRAM" allocation.

Realistically, I doubt the plasma5 use-case fits well with nv30/nv40 hardware. You're taking a 2015 compositor and running it on 2005 hardware.

You could (correctly) make the argument that nouveau should do a better job at this -- and you'd be right. However volunteers aren't falling over themselves, rushing to fix these issues.

Comment 40 Matt Whitlock 2016-05-15 05:38:46 UTC

(In reply to Ilia Mirkin from comment #39)
> "fail ttm_validate" usually means "you tried to use too much vram at once".
> Unfortunately nouveau's mesa driver isn't particularly good at handling that
> issue, which causes it to get much worse failures down the line.

A failure in Mesa should, at worst, merely cause the offending process to crash. It shouldn't be possible for an unprivileged user-mode process to bring the entire system down. If it is, then there's a serious (denial-of-service) kernel bug.

> Realistically, I doubt the plasma5 use-case fits well with nv30/nv40
> hardware. You're taking a 2015 compositor and running it on 2005 hardware.

In my case it's a GeForce 8600 GT (G84, Tesla microarchitecture) on a PCI-E card. I realize this is only slightly better (release date in April 2007), but supposedly Nouveau supports Tesla. Actually, I was digging into the scant documentation in the Nouveau project, and Tesla seems to be the one microarchitecture for which Nvidia have provided some documentation, so I honestly would expect it to be the best supported of all the chipsets.

> You could (correctly) make the argument that nouveau should do a better job
> at this -- and you'd be right. However volunteers aren't falling over
> themselves, rushing to fix these issues.

Okay, fine. I really don't mind buying a newer card. I just need to know what's going to work. Can you tell me what to get? I don't care about gaming. I do use a composited desktop. I want to play with Wayland on DRM (no X server). I just want a card that will give me a stable desktop without needing to run a proprietary driver. I've been eyeing a GeForce GT 730 (GK208-301-A1, Kepler microarchitecture) in the hope that switching cards would solve my stability problems. Would Kepler be more stable on Nouveau than Tesla? I have the general notion that Nouveau has worse support for the newer cards because they're more complex and less is known about them. Is this true? What would you recommend for someone who prioritizes stability above all else? Thanks.

Comment 41 poma 2016-05-15 12:58:02 UTC

Created attachment 123756 [details]
KDE Compositor XRender

XRender:
Initial revision - 2003-11-14
https://cgit.freedesktop.org/xorg/proto/renderproto/commit/?id=bb5a469
https://cgit.freedesktop.org/xorg/xserver/commit/render?id=9508a38

NV30 family (Rankine) GeForce FX / 5
https://nouveau.freedesktop.org/wiki/CodeNames/#NV30
https://en.wikipedia.org/wiki/GeForce_FX_series

http://download.opensuse.org/tumbleweed/iso/openSUSE-Tumbleweed-KDE-Live-i686-Snapshot20160512-Media.iso

Comment 42 poma 2016-05-15 12:59:16 UTC

Created attachment 123757 [details]
KDE Compositor GLX

GLX:
Initial revision - 2003-11-14
https://cgit.freedesktop.org/xorg/proto/glproto/commit/?id=ba28c09
https://cgit.freedesktop.org/xorg/xserver/commit/GL/glx?id=9508a38

NV50 family (Tesla) GeForce 8 / 9 / 100 / 200 / 300
https://nouveau.freedesktop.org/wiki/CodeNames/#NV50
https://en.wikipedia.org/wiki/GeForce_8_series

http://download.opensuse.org/tumbleweed/iso/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20160512-Media.iso

Comment 43 poma 2016-05-15 13:09:35 UTC

(In reply to Ilia Mirkin from comment #39)
> "fail ttm_validate" usually means "you tried to use too much vram at once".
> Unfortunately nouveau's mesa driver isn't particularly good at handling that
> issue, which causes it to get much worse failures down the line.
> 
> If you have an IGP, please increase the size of the "VRAM" allocation.
> 
> Realistically, I doubt the plasma5 use-case fits well with nv30/nv40
> hardware. You're taking a 2015 compositor and running it on 2005 hardware.
> 

Looks like GLX is too much for GPU family <= NV50, i.e. NV40, NV30, ... 
but can go with XRender - the same situation as with Xfwm4 compositing.


> You could (correctly) make the argument that nouveau should do a better job
> at this -- and you'd be right. However volunteers aren't falling over
> themselves, rushing to fix these issues.

Only volunteers there!? :)

Comment 44 poma 2016-05-15 13:30:04 UTC

(In reply to poma from comment #43)
[... ]
> Looks like GLX is too much for GPU family <= NV50, i.e. NV40, NV30, ... 
[... ]

Perhaps this better describes the actual situation:
Looks like GLX is too much for GPU family ≈ NV50, i.e. some of NV50, followed by all of NV40, NV30, etc.

Comment 45 Ilia Mirkin 2016-05-15 14:46:20 UTC

This bug has been corrupted by too many people adding in their own unrelated issues on their unrelated hardware with totally different versions of things, and yet claiming "oh yeah, it must be the same thing!". So I'm closing this. If your issues persist, feel free to open a fresh bug detailing your problems (one bug per reporter, in case it's not clear).

That said, a ton of people have various issues with plasma5 + nouveau.

Matt, if you're looking for advice on a GPU to buy, try IRC (#nouveau on freenode). My quick recommendation is: "not NVIDIA". If you're set on NVIDIA, happy to discuss the various trade-offs on IRC.

Comment 46 Matt Whitlock 2016-06-03 23:29:18 UTC

I discovered several failed capacitors on my motherboard. After replacing them, my system stability issues (including some single-bit I/O errors I was observing fairly frequently) have been resolved.

(In reply to Ilia Mirkin from comment #45)
> Matt, if you're looking for advice on a GPU to buy, try IRC (#nouveau on
> freenode). My quick recommendation is: "not NVIDIA". If you're set on
> NVIDIA, happy to discuss the various trade-offs on IRC.

Thank you for the recommendation. I have switched to a Radeon R7 360 (GCN 1.1 microarchitecture, Bonaire Pro chipset) and could not be happier with the open-source Radeon driver and the resulting desktop graphics performance on my system. Kudos to AMD for releasing enough documentation on their chipsets to allow development of a proper (non-reverse-engineered) driver. What a difference it makes. I'll never go back to Nvidia.
