Bug 91535

Summary: [NVE7] Chrome can cause a nouveau 'multiple instances of buffer' message when overlaying a menu leading to X lockup
Product: xorg Reporter: Bryan O'Donoghue <pure.logic>
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: critical    
Priority: medium CC: peter
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:

Description Bryan O'Donoghue 2015-08-03 00:44:37 UTC
Chrome Version 44.0.2403.125 (64-bit) on Ubuntu, with libdrm-2.4.60, libdrm-2.4.56, or Peter Hurley's libdrm-2.4.60 with the fix for bug 89842 applied (https://bugs.freedesktop.org/show_bug.cgi?id=89842#c19), can cause the kernel driver to reject validation of a push buffer submitted by the nouveau push buffer logic in nouveau_pushbuf_kick, leading to the following messages:

nouveau E[chrome[2737]] multiple instances of buffer 33 on validation list 
nouveau E[chrome[2737]] validate_init 
nouveau E[chrome[2737]] validate: -22 
nouveau E[chrome[2737]] multiple instances of buffer 18 on validation list 
nouveau E[chrome[2737]] validate_init 
nouveau E[chrome[2737]] validate: -22 
nouveau E[ PFIFO][0000:01:00.0] PFIFO: read fault at 0x0003e21000 [PAGE_NOT_PRESENT] from (unknown enum 0x00000000)/GPC0/(unknown enum 0x0000000f) on channel 0x007f80c000 [unknown]
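
For triage, the messages above can be picked apart mechanically. The following is a minimal Python sketch, not part of nouveau or libdrm: the regular expressions and the helper name `summarize_nouveau_errors` are assumptions based only on the log format shown in this report. It collects the buffer handles reported as duplicated and decodes the `validate: -22` return value, which corresponds to `-EINVAL` on Linux.

```python
import errno
import re

# Patterns are assumptions derived from the log excerpt in this report.
DUPLICATE_RE = re.compile(r"multiple instances of buffer (\d+) on validation list")
VALIDATE_RE = re.compile(r"validate: (-?\d+)")


def summarize_nouveau_errors(log_text):
    """Return (duplicated buffer handles, symbolic names of validate errors)."""
    buffers = []
    errors = []
    for line in log_text.splitlines():
        dup = DUPLICATE_RE.search(line)
        if dup:
            buffers.append(int(dup.group(1)))
            continue
        val = VALIDATE_RE.search(line)
        if val:
            code = int(val.group(1))
            # Kernel return values are negative errno numbers.
            errors.append(errno.errorcode.get(-code, str(code)))
    return buffers, errors


SAMPLE = """\
nouveau E[chrome[2737]] multiple instances of buffer 33 on validation list
nouveau E[chrome[2737]] validate_init
nouveau E[chrome[2737]] validate: -22
nouveau E[chrome[2737]] multiple instances of buffer 18 on validation list
nouveau E[chrome[2737]] validate_init
nouveau E[chrome[2737]] validate: -22
"""

if __name__ == "__main__":
    print(summarize_nouveau_errors(SAMPLE))
```

The `validate_init` lines carry no return code, so the sketch deliberately ignores them.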

The following work-around works for me on Linux 4.2-rc4:

http://www.gossamer-threads.com/lists/linux/kernel/2228405

In that patch, I tell the kernel to 'continue' if it has already mapped the memory specified in a push buffer for a given PID.

With that work-around I still get the 'multiple instances' error message, but it's not treated as fatal, and so far the system has been completely stable for me.
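
To make the difference between the stock behaviour and the work-around concrete, here is a toy Python model. It is a sketch under assumptions, not the kernel's code and not the linked patch: `validate_list`, the `tolerate_duplicates` flag, and the representation of a validation list as a list of integer handles are all hypothetical. The stock path rejects the whole submission with -EINVAL (-22) on the first repeated handle, while the work-around path reports the repeat and continues.

```python
import errno


def validate_list(handles, tolerate_duplicates=False):
    """Toy model of validating a push buffer's buffer list.

    Returns (status, accepted, messages): status is 0 or a negative errno,
    accepted is the list of handles kept, messages are the errors logged.
    """
    seen = set()
    accepted = []
    messages = []
    for handle in handles:
        if handle in seen:
            messages.append(
                "multiple instances of buffer %d on validation list" % handle
            )
            if tolerate_duplicates:
                # Work-around behaviour: the buffer is already on the list,
                # so skip the repeat instead of failing the submission.
                continue
            # Stock behaviour: reject the whole submission.
            return -errno.EINVAL, [], messages
        seen.add(handle)
        accepted.append(handle)
    return 0, accepted, messages


if __name__ == "__main__":
    print(validate_list([18, 33, 18]))
    print(validate_list([18, 33, 18], tolerate_duplicates=True))
```

In the second call the message is still recorded, which matches the observation that the error keeps appearing in the log while no longer being fatal.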

The feedback from LKML was that this is probably a bug in libdrm.

I've downloaded and run the test associated with bug 89842, i.e. libdrm-2.4.60/tests/nouveau/threaded.c, using various versions of libdrm2 as suggested on LKML, and I can confirm that the race condition the test checks for is not present.

With libdrm 2.4.60 from the Launchpad PPA ppa:phurley/libdrm, which has the fix for bug 89842 applied, the threaded test libdrm-2.4.60/tests/nouveau/threaded.c does not fault, but I still get into a 'multiple instances' state on the nouveau push_buf list, again on both the stock Ubuntu kernel and the tip-of-tree 4.2-rcX kernel.

deckard@aineko:~/Development/nouveau/libdrm-2.4.60$ dpkg -s libdrm2
Package: libdrm2
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 106
Maintainer: Debian X Strike Force <debian-x@lists.debian.org>
Architecture: amd64
Multi-Arch: same
Source: libdrm
Version: 2.4.60-2ppa1~trusty1
Depends: libc6 (>= 2.17)
Pre-Depends: multiarch-support
Description: Userspace interface to kernel DRM services -- runtime
 This library implements the userspace interface to the kernel DRM
 services.  DRM stands for "Direct Rendering Manager", which is the
 kernelspace portion of the "Direct Rendering Infrastructure" (DRI).
 The DRI is currently used on Linux to provide hardware-accelerated
 OpenGL drivers.
 .
 This package provides the runtime environment for libdrm.

deckard@aineko:~/Development/nouveau/libdrm-2.4.60$ uname -a
Linux aineko 4.2.0-rc4+ #50 SMP Thu Jul 30 01:22:01 IST 2015 x86_64 x86_64 x86_64 GNU/Linux

Chrome Version 44.0.2403.125 (64-bit)

Steps to replicate:

Run the version of chrome indicated above. Open a number of tabs to different websites. Click on the horizontal bars in the top right to get the drop-down menu, and hover the cursor over bookmarks or recent tabs.

This process is a bit hit and miss and may take an unknown number of tabs and an unknown amount of time to elicit the behaviour. Sorry I can't be more precise at this point.

It's not clear whether the fix for bug 89842 will resolve all or some of the issues reported in this Ubuntu thread: http://tinyurl.com/orvbzf3, but I've verified that the test program developed to debug that race condition doesn't cause a 'multiple instances' message on my machine.

2014 MacBook Pro running Ubuntu 14.04. The issue is present with both the stock Ubuntu kernel and the 4.2-rcX kernel I'm using to debug this issue.

01:00.0 VGA compatible controller: NVIDIA Corporation GK107M [GeForce GT 750M Mac Edition] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Apple Inc. Device 0130
	Flags: bus master, fast devsel, latency 0, IRQ 45
	Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 80000000 (64-bit, prefetchable) [size=256M]
	Memory at 90000000 (64-bit, prefetchable) [size=32M]
	I/O ports at 1000 [size=128]
	Expansion ROM at c1000000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Kernel driver in use: nouveau

01:00.1 Audio device: NVIDIA Corporation GK107 HDMI Audio Controller (rev a1)
	Subsystem: Apple Inc. Device 0130
	Flags: bus master, fast devsel, latency 0, IRQ 17
	Memory at c1080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Kernel driver in use: snd_hda_intel
Comment 1 Ilia Mirkin 2015-08-03 00:52:09 UTC
Can you say a bit more about your desktop environment? I've never seen this, and I use both chrome and nouveau on a regular basis on a GF108 which is fairly similar to kepler, at least wrt how buffers are added to validation lists.

Are you using a compositor, if so which one, and how is it configured? Is there optimus involved, if so are you using DRI2 or DRI3? Are you using GLAMOR + DRI3 in the nouveau DDX?
Comment 2 Bryan O'Donoghue 2015-08-03 01:00:57 UTC
Compositor : I've tried on gnome+metacity, fluxbox, lxde and KDE.

DRI:
Looks like DRI2

[     8.723] (--) Depth 24 pixmap format is 32 bpp
[     8.725] (II) NOUVEAU(0): Opened GPU channel 0
[     8.728] (II) NOUVEAU(0): [DRI2] Setup complete
[     8.728] (II) NOUVEAU(0): [DRI2]   DRI driver: nouveau
[     8.728] (II) NOUVEAU(0): [DRI2]   VDPAU driver: nouveau

GLAMOR:
Since this is DRI2, I'm not using GLAMOR + DRI3.
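
The same check can be scripted against the X server log. This is a small sketch, assuming the nouveau DDX tags its DRI lines as `[DRI2]` (as in the excerpt above) and that a `[DRI3]` tag would look analogous, which is a guess. `dri_versions_in_log` is a hypothetical helper name.

```python
import re

# Matches tags such as "[DRI2]" in lines written by the NOUVEAU DDX.
DRI_TAG_RE = re.compile(r"NOUVEAU\(\d+\): \[(DRI\d)\]")


def dri_versions_in_log(log_text):
    """Return the sorted, de-duplicated DRI tags the nouveau DDX logged."""
    return sorted(set(DRI_TAG_RE.findall(log_text)))


SAMPLE = """\
[     8.725] (II) NOUVEAU(0): Opened GPU channel 0
[     8.728] (II) NOUVEAU(0): [DRI2] Setup complete
[     8.728] (II) NOUVEAU(0): [DRI2]   DRI driver: nouveau
[     8.728] (II) NOUVEAU(0): [DRI2]   VDPAU driver: nouveau
"""

if __name__ == "__main__":
    print(dri_versions_in_log(SAMPLE))
```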
Comment 3 Arjen 2015-08-12 13:27:41 UTC
Also having X lockups, triggered by chromium and/or gnome-shell.

Linux 4.1.4-1-ARCH
libdrm-2.4.62-1
mesa-10.6.3-1

Running Arch Linux, so using all the latest versions.
Using XFCE does not fix the problem, as chromium alone can trigger it.

Aug 12 15:15:53 imac.office.nl kernel: nouveau E[chromium[2045]] fail set_domain
Aug 12 15:15:53 imac.office.nl kernel: nouveau E[chromium[2045]] validating bo list
Aug 12 15:15:53 imac.office.nl kernel: nouveau E[chromium[2045]] validate: -22
Aug 12 15:15:53 imac.office.nl kernel: nouveau E[chromium[2045]] fail set_domain
Aug 12 15:15:53 imac.office.nl kernel: nouveau E[chromium[2045]] validating bo list
Aug 12 15:15:53 imac.office.nl kernel: nouveau E[chromium[2045]] validate: -22

[..]

Aug 12 13:05:24 imac.office.nl kernel: nouveau E[   PFIFO][0000:01:00.0] write fault at 0x0010fc0000 [PTE] from GR/GPC0/PROP_0 on channel 0x001f7d9000 [gnome-shell[906]]
Aug 12 13:05:24 imac.office.nl kernel: nouveau E[   PFIFO][0000:01:00.0] PGRAPH engine fault on channel 8, recovering...
Comment 4 Ilia Mirkin 2015-09-28 21:41:41 UTC
I've recently pushed some patches which solved this issue in Witcher 2. They should end up in Mesa 11.0.2 when that is released, but are already at mesa HEAD:

http://cgit.freedesktop.org/mesa/mesa/commit/?id=d4e650b07bc80075f0d088e7d85df9efa45e11bd
http://cgit.freedesktop.org/mesa/mesa/commit/?id=3a6b9a7830c3df14ffcfbbf57c82ea08bd59ef04
http://cgit.freedesktop.org/mesa/mesa/commit/?id=1d8cba9b51b7a6e7dbf3f0d3f53b5c232fd0b5b2

I have no idea if these will help with the issues you guys see with Chrome, but... they might!
Comment 5 Bryan O'Donoghue 2015-09-28 22:42:00 UTC
Ilia

I'm running kernel 4.3-rcX right now and I don't see this issue any longer. However I do see something else (which I'll post to a separate bug). I guess it would be worthwhile getting a ppa for mesa and trying it out...
Comment 6 Ilia Mirkin 2015-09-28 23:54:05 UTC
Unbeknownst to me as I was writing this, 11.0.2 got released as an "emergency" release and only contained a handful of fixes not including mine. Will have to wait for 11.0.3.
Comment 7 Ilia Mirkin 2015-10-20 18:37:13 UTC
I believe the last of the "multiple instances of buffer" issues are fixed with Mesa 11.0.3. The set_domain issue is unrelated -- see bug 92504 for details. Marking this resolved.
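
A quick way to check whether an installed Mesa is at or past the release named above is to compare version tuples. The sketch below is illustrative only: `mesa_has_fix` is a hypothetical helper, and it assumes version strings of the form `major.minor.patch`, optionally prefixed with `mesa-` and followed by a packaging suffix such as `-1`.

```python
import re

FIXED_IN = (11, 0, 3)  # Mesa release named in this comment


def mesa_has_fix(version_string):
    """Return True if version_string names Mesa 11.0.3 or newer."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("unrecognised version: %r" % version_string)
    # Tuples compare element by element, so (11, 1, 0) >= (11, 0, 3).
    return tuple(int(part) for part in match.groups()) >= FIXED_IN


if __name__ == "__main__":
    for candidate in ("mesa-10.6.3-1", "11.0.2", "11.0.3"):
        print(candidate, mesa_has_fix(candidate))
```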
