Bug 58378 - [NV86] Distorted graphics on NVIDIA GeForce 8400M G after upgrade the kernel to 3.5.x (and RHEL 6.5) or later
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL: https://bugs.gentoo.org/show_bug.cgi?...
Reported: 2012-12-16 23:36 UTC by Henrique Dias
Modified: 2014-02-10 18:59 UTC


Attachments
stack trace (4.24 KB, text/plain)
2012-12-16 23:36 UTC, Henrique Dias
Screenshot (162.99 KB, image/jpeg)
2012-12-16 23:37 UTC, Henrique Dias
my graphics are a mess. (55.88 KB, image/jpeg)
2012-12-17 17:12 UTC, Henrique Dias
Another screenshot (44.31 KB, image/jpeg)
2012-12-17 21:42 UTC, Henrique Dias
Distorted graphics with RHEL6/OL6 showing uname -a kernel 3.12.4 (75.95 KB, image/png)
2013-12-13 12:19 UTC, Andreas Loew
Distorted graphics: Icons (on kernel 3.12.4) (115.42 KB, image/png)
2013-12-13 12:20 UTC, Andreas Loew
dmesg output on 3.13-rc3 while the issue was seen (51.51 KB, text/plain)
2013-12-14 11:21 UTC, Andreas Loew
dmesg output in debug mode (nouveau.debug=debug) on 3.13-rc3 while the issue was seen (125.13 KB, text/plain)
2013-12-14 11:22 UTC, Andreas Loew
/var/log/messages from 3.12.4 start attempt with NV50 fence (38.85 KB, application/x-bzip2)
2013-12-15 23:09 UTC, Andreas Loew
patch to honor disabled engines (2.66 KB, patch)
2014-01-08 22:44 UTC, Ilia Mirkin
patch to honor hw disables after vbios (3.93 KB, patch)
2014-01-09 16:15 UTC, Ilia Mirkin
Complete dmesg output booting 3.12.6 with "hwunits.patch" applied (nouveau.debug=debug) (106.03 KB, text/plain)
2014-01-09 21:56 UTC, Andreas Loew
nouveau-related dmesg output booting 3.12.6 with "hwunits.patch" applied (nouveau.debug=debug) (55.60 KB, text/plain)
2014-01-09 21:57 UTC, Andreas Loew

Description Henrique Dias 2012-12-16 23:36:06 UTC
Created attachment 71610
stack trace

I have an NVIDIA GeForce 8400M G graphics card. I've been using the nouveau driver for a long time without any problems. After upgrading the kernel to version 3.7.0, I have a lot of issues. After logging in and using the system for some time, the graphics become corrupted and show up with mixed colors.
Comment 1 Henrique Dias 2012-12-16 23:37:57 UTC
Created attachment 71612
Screenshot

Screenshot showing the problem.
Comment 2 Henrique Dias 2012-12-17 17:04:44 UTC
Today's messages from dmesg:

[ 4115.879007] nouveau E[  PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 07ff00 warp 0, opcode 00000000 00000000
[ 4115.879007] nouveau  [  PGRAPH][0000:01:00.0]  TRAP
[ 4115.879007] nouveau E[  PGRAPH][0000:01:00.0] ch 5 [0x00077db000] subc 3 class 0x8297 mthd 0x1694 data 0x00010031
Comment 3 Henrique Dias 2012-12-17 17:12:40 UTC
Created attachment 71674
my graphics are a mess.

my graphics are a mess.
Comment 4 Henrique Dias 2012-12-17 19:21:03 UTC
More dmesg messages:

[ 1123.476832] nouveau E[  PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 07ff00 warp 0, opcode ffffffff ffffffff
[ 1123.476839] nouveau  [  PGRAPH][0000:01:00.0]  TRAP
[ 1123.476844] nouveau E[  PGRAPH][0000:01:00.0] ch 6 [0x000765e000] subc 3 class 0x8297 mthd 0x1694 data 0x00010031
Comment 5 Henrique Dias 2012-12-17 21:42:34 UTC
Created attachment 71698
Another screenshot
Comment 6 Henrique Dias 2012-12-17 21:49:41 UTC
# lspci -nnvv

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation G86 [GeForce 8400M G] [10de:0428] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Micro-Star International Co., Ltd. Device [1462:3fe9]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
	Region 5: I/O ports at cc00 [size=128]
	Expansion ROM at fe0e0000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <4us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <4us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM L0s L1 Enabled; RCB 128 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Kernel driver in use: nouveau
Comment 7 Henrique Dias 2012-12-19 10:23:48 UTC
The problem persists with the 3.7.1 kernel.
Comment 8 nemasu 2013-02-14 01:20:55 UTC
I am having the same problems post kernel version 3.7.0 with a GeForce 8800 GTS. Even glxgears will lock up.

I get a ton of these messages:
[   83.399004] nouveau  [   PFIFO][0000:01:00.0] CACHE_ERROR - Ch 2/3 Mthd 0x108c Data 0x2036652f

with the occasional:
[   83.418650] nouveau E[  PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 4 MP 1: INVALID_OPCODE at 07f4d8 warp 2, opcode 0423c788 10000811
[   83.418659] nouveau  [  PGRAPH][0000:01:00.0]  TRAP
[   83.418663] nouveau E[  PGRAPH][0000:01:00.0] ch 4 [0x0027948000] subc 3 class 0x5097 mthd 0x0f04 data 0x00000000
[   83.418672] nouveau E[     PFB][0000:01:00.0] trapped read at 0x0000000000 on channel 0x00027948 PFIFO/PFIFO_READ/SEMAPHORE reason: DMAOBJ_LIMIT
[   83.431368] nouveau E[  PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 4 MP 1: INVALID_OPCODE at 07f4d8 warp 2, opcode 0423c788 10000811
[   83.431376] nouveau  [  PGRAPH][0000:01:00.0]  TRAP
[   83.431379] nouveau E[  PGRAPH][0000:01:00.0] ch 4 [0x0027948000] subc 3 class 0x5097 mthd 0x0f04 data 0x00000000
Comment 9 Carolien 2013-02-26 13:36:05 UTC
Same here with an nVidia GeForce 8400M G video card in an Acer Aspire 7520 G laptop running Ubuntu 12.10 64-bit (AMD64). My first impression was a heat problem due to dust, so I cleaned the laptop fan and refitted the heatsink and heat pipes with new (silver) thermal paste, but the video error recurs. With only two web pages open: no problem. Starting a YouTube video: the screen is a mess, as Henrique Dias reported.

Is there a relation to the reported failure of nVidia GeForce 8 series?? http://news.cnet.com/8301-13924_3-10037632-64.html


Carolien.
Comment 10 Paulo Castro 2013-03-26 14:48:10 UTC
Hi.

I have exactly the same issue.
I seem to be able to trigger it faster by opening firefox on a page with many images.

Current Kernel: 3.8.3-103.fc17.x86_64
Other kernels affected:

kernel-3.7.9-104.fc17.x86_64
kernel-3.7.9-101.fc17.x86_64

01:00.0 VGA compatible controller: nVidia Corporation G86 [GeForce 8300 GS] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: nVidia Corporation Device 0494
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
        Region 5: I/O ports at df00 [size=128]
        [virtual] Expansion ROM at fb000000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <4us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM unknown, Latency L0 <512ns, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Kernel driver in use: nouveau
Comment 11 Paulo Castro 2013-04-24 10:01:53 UTC
After further investigation, this issue only seems to affect applications using the GTK libraries.
In my case at least ...

After triggering the bug, any app that uses the GTK libraries is affected.
It does not seem to affect the rendering of other apps (those not using GTK).

Also, the issue doesn't happen with the NVIDIA drivers, but those are impossible to use because in my case they make the system unusably slow.
Comment 12 Dave Bjork 2013-05-28 20:03:52 UTC
Hello!

New to Ubuntu. I have an old Acer 5520G with the exact same problem you are describing in the comments above. I also thought it was a heat problem and found a lot of dust in the graphics card's fan. My computer completely locks up and I am unable to even log in or open a terminal at the login screen after the first glitch.

Dave
Comment 13 torsten.stocklossa 2013-11-28 11:17:49 UTC
Hi, same here after updating to Ubuntu 12.04.3 

Kernel 3.8.0-33-generic


lspci -nnvv says:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation G86M [GeForce 8400M G] [10de:0428] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Fujitsu Limited. Device [10cf:1422]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at de000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at dc000000 (64-bit, non-prefetchable) [size=32M]
	Region 5: I/O ports at 2000 [size=128]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: <access denied>
	Kernel driver in use: nouveau
	Kernel modules: nouveau, nvidiafb

The graphics become distorted, and once that happens the system is frozen (with some luck I can reach a terminal).

Before it happens, the font color in window frames changes to "white on white", i.e. the same as the background color.
I run an E8410 Lifebook.

BTW: Using the NVIDIA proprietary drivers is not an option; they made the system completely unusable and forced me to reinstall several times.
Comment 14 torsten.stocklossa 2013-11-29 10:48:58 UTC
Hi again, here are some additional error messages:

Nov 29 11:21:47 torsten-LIFEBOOK-E8410 kernel: [   66.215782] nouveau E[  PGRAPH][0000:01:00.0] ch 2 [0x0007b23000] subc 7 class 0x8297 mthd 0x15e0 data 0x00000000
Nov 29 11:21:47 torsten-LIFEBOOK-E8410 kernel: [   66.304180] nouveau E[  PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 000004 warp 10, opcode ffb9c1d8 ffbac2d9
Nov 29 11:21:47 torsten-LIFEBOOK-E8410 kernel: [   66.304188] nouveau E[  PGRAPH][0000:01:00.0]  TRAP
Nov 29 11:21:47 torsten-LIFEBOOK-E8410 kernel: [   66.304193] nouveau E[  PGRAPH][0000:01:00.0] ch 2 [0x0007b23000] subc 7 class 0x8297 mthd 0x15e0 data 0x00000000
Nov 29 11:21:47 torsten-LIFEBOOK-E8410 kernel: [   66.304477] nouveau E[  PGRAPH][0000:01:00.0] TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 000004 warp 10, opcode ffb9c1d8 ffbac2d9
Nov 29 11:21:47 torsten-LIFEBOOK-E8410 kernel: [   66.304483] nouveau E[  PGRAPH][0000:01:00.0]  TRAP
Nov 29 11:21:47 torsten-LIFEBOOK-E8410 kernel: [   66.304487] nouveau E[  PGRAPH][0000:01:00.0] ch 2 [0x0007b23000] subc 7 class 0x8297 mthd 0x15e0 data 0x00000000


and 

Nov 29 11:26:04 torsten-LIFEBOOK-E8410 kernel: [  323.106306] nouveau E[     DRM] GPU lockup - switching to software fbcon
Nov 29 11:27:07 torsten-LIFEBOOK-E8410 kernel: [  386.736037] nouveau E[    3431] failed to idle channel 0xcccc0001
Nov 29 11:27:09 torsten-LIFEBOOK-E8410 kernel: [  388.735098] nouveau E[   PFIFO][0000:01:00.0] channel 3 unload timeout
Nov 29 11:27:12 torsten-LIFEBOOK-E8410 kernel: [  391.732025] nouveau E[    3431] failed to idle channel 0xcccc0000
Nov 29 11:27:14 torsten-LIFEBOOK-E8410 kernel: [  393.731221] nouveau E[   PFIFO][0000:01:00.0] channel 2 unload timeout
Nov 29 11:28:09 torsten-LIFEBOOK-E8410 kernel: [  448.580025] nouveau E[    4056] failed to idle channel 0xcccc0001
Nov 29 11:28:11 torsten-LIFEBOOK-E8410 kernel: [  450.579162] nouveau E[   PFIFO][0000:01:00.0] channel 3 unload timeout
Nov 29 11:28:14 torsten-LIFEBOOK-E8410 kernel: [  453.576022] nouveau E[    4056] failed to idle channel 0xcccc0000
Nov 29 11:28:16 torsten-LIFEBOOK-E8410 kernel: [  455.575198] nouveau E[   PFIFO][0000:01:00.0] channel 2 unload timeout
Nov 29 11:29:17 torsten-LIFEBOOK-E8410 kernel: [  516.552036] nouveau E[    4211] failed to idle channel 0xcccc0001
Nov 29 11:29:19 torsten-LIFEBOOK-E8410 kernel: [  518.553893] nouveau E[   PFIFO][0000:01:00.0] channel 3 unload timeout
Nov 29 11:29:22 torsten-LIFEBOOK-E8410 kernel: [  521.556024] nouveau E[    4211] failed to idle channel 0xcccc0000
Nov 29 11:29:24 torsten-LIFEBOOK-E8410 kernel: [  523.555077] nouveau E[   PFIFO][0000:01:00.0] channel 2 unload timeout


In both cases the session is GNOME. Now, when running GNOME (no effects), it is slightly more stable.

As mentioned, I also tried the NVIDIA drivers, with the effect that the system was completely unusable.

Since the issue seems to be quite old, there should be an appropriate solution by now!

cheers
TS
Comment 15 torsten.stocklossa 2013-12-12 14:00:02 UTC
Hi,
I wonder if this is still alive? Any news on this?

cheers
T
Comment 16 Ilia Mirkin 2013-12-12 14:11:23 UTC
Messing with priority just annoys the developers.

In the meanwhile, try new kernels. I only see up to 3.8 tested. Do a bisect. There was a major driver rewrite in 3.7, but it might have been something else that causes the issue. Make sure you're running an updated DDX.

As you might imagine, none of the devs are seeing this, so you'll have to do the debugging if you want it fixed.
Comment 17 Andreas Loew 2013-12-13 12:19:43 UTC
Created attachment 90715
Distorted graphics with RHEL6/OL6 showing uname -a kernel 3.12.4
Comment 18 Andreas Loew 2013-12-13 12:20:52 UTC
Created attachment 90717
Distorted graphics: Icons (on kernel 3.12.4)
Comment 19 Andreas Loew 2013-12-13 12:45:32 UTC
Hello,

I would like to join discussions in this bug, as I have found myself affected after the recent update from Red Hat Enterprise Linux/Oracle Linux 6.4 (stock RHEL kernel 2.6.32-358.23.2) to RHEL/OL 6.5 (RHEL kernel 2.6.32-431).

My graphics card is NVidia Quadro NVS 130M:
BOOT0  : 0x086a00a2
Chipset: G86 (NV86)
Family : NV50

It seems that RHEL 6.5 kernel 2.6.32-431 has updated its kernel modules for nouveau DRM to a codebase level that matches official Linux kernels 3.7, and therefore introduced this severe graphics distortion issue into mainline RHEL 6.

In order to verify that it is indeed the nouveau DRM kernel module that is responsible for the distortion, I have upgraded my OL6 packages to the following versions:

* mesa-9.2.0.5 (including support for nouveau, which is commented out by default in RHEL6)
* libdrm-2.4.50
* xorg-x11-drv-nouveau-1.0.9

but this does NOT affect the issue at all.

But reverting back to RHEL stock kernel 2.6.32-358.23.2 makes the issue vanish, also when using the above updated library versions.

I then tried Oracle's UEK kernels, and while the current UEK2 kernel (2.6.39-400.211.2) does NOT have the issue, the current UEK3 kernel (3.8.13-16.2.2) also shows it.

I then tried to find out about the exact "versions" (git commit levels?) of the nouveau libdrm modules, and found out the following:

(1) Oracle UEK2 kernel 2.6.39-400.211.2 - NO ISSUE:
[drm] Initialized nouveau 0.0.16 20090420 for 0000:01:00.0 on minor 0

(2) RHEL stock kernel 2.6.32-358.23.2 - NO ISSUE:
[drm] Initialized nouveau 1.0.0 20120316 for 0000:01:00.0 on minor 0

(3) RHEL stock kernel 2.6.32-431 - DOES SHOW THE ISSUE:
[drm] Initialized nouveau 1.1.0 20120801 for 0000:01:00.0 on minor 0

(4) more recent kernels, such as Oracle UEK3 (3.8.13-16.2.2) and the most recent Oracle "playground" kernel from public-yum.oracle.com (3.12.4-3.12.y.20131210) all DO SHOW THE ISSUE:
[drm] Initialized nouveau 1.1.1 20120801 for 0000:01:00.0 on minor 0
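
The correlation in items (1) to (4) can be tabulated mechanically. What follows is only a minimal Python sketch under stated assumptions: the `observations` list is copied by hand from the four dmesg lines above, and the names `DRM_LINE` and `parse_drm_line` are hypothetical helpers, not part of any nouveau tooling. It extracts the driver version from each line and reports the newest version without the issue and the oldest version with it.

```python
import re

# Matches the banner printed when the nouveau DRM driver initialises.
DRM_LINE = re.compile(r"Initialized nouveau (\d+)\.(\d+)\.(\d+) (\d{8})")

def parse_drm_line(line):
    """Return ((major, minor, patch), date) from an 'Initialized nouveau' line."""
    match = DRM_LINE.search(line)
    if match is None:
        raise ValueError("not a nouveau DRM init line: " + line)
    major, minor, patch, date = match.groups()
    return (int(major), int(minor), int(patch)), date

# (kernel, dmesg line, issue seen?) copied from items (1) to (4).
observations = [
    ("2.6.39-400.211.2",
     "[drm] Initialized nouveau 0.0.16 20090420 for 0000:01:00.0 on minor 0", False),
    ("2.6.32-358.23.2",
     "[drm] Initialized nouveau 1.0.0 20120316 for 0000:01:00.0 on minor 0", False),
    ("2.6.32-431",
     "[drm] Initialized nouveau 1.1.0 20120801 for 0000:01:00.0 on minor 0", True),
    ("3.8.13-16.2.2",
     "[drm] Initialized nouveau 1.1.1 20120801 for 0000:01:00.0 on minor 0", True),
]

versions = [(parse_drm_line(line)[0], bad) for _, line, bad in observations]
newest_good = max(v for v, bad in versions if not bad)
oldest_bad = min(v for v, bad in versions if bad)
print("newest driver without the issue:", newest_good)
print("oldest driver with the issue:", oldest_bad)
```

The driver version string is coarse (several kernels report the same value), so it can narrow the search but cannot replace testing individual kernels.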

So to me it now seems as if the issue has been introduced with the massive changes to nouveau/DRM that went into 3.7:

http://www.phoronix.com/scan.php?page=news_item&px=MTE1NDg

and affects ALL subsequent versions since then... :-(

I would be very interested and willing to help in debugging/tracking this down, but I don't have any git background, so you would have to guide me through how to do the "bisect"...

Hope this helps & looking forward to your feedback! :-)

Best regards,
Andreas
Comment 20 Andreas Loew 2013-12-13 13:25:20 UTC
I had left out my "lspci -nnvv" information:

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation G86M [Quadro NVS 130M] [10de:042a] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Toshiba America Info Systems Device [1179:0002]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
	Region 5: I/O ports at cf00 [size=128]
	[virtual] Expansion ROM at fc000000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <4us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <4us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM L0s L1 Enabled; RCB 128 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Kernel driver in use: nouveau
	Kernel modules: nouveau, nvidiafb
Comment 21 Ilia Mirkin 2013-12-13 13:53:47 UTC
(a) Can we see a full boot log (e.g. output of dmesg) with a recent kernel? Ideally it would include the time that the visual issues happen.

(b) This looks like it could be a fencing issue, i.e. we try to draw to a texture, but then instead of waiting, we don't wait. There were some fixes that went into 3.13-rc1, so perhaps trying the latest and greatest (e.g. 3.13-rc3, or the latest Linus HEAD) would be good to test out.

(c) There are many bisection guides on the internet. You will also need to figure out how to make the compiled kernel play nice with your distribution. The basics are simple though:

1. git bisect start v3.7 v3.6 -- drivers/gpu/drm/nouveau
2. build/install/boot/test
3. if it's good, "git bisect good", if it's bad, "git bisect bad"
4. goto 2

At some point running the step 3 command will tell you "first bad commit is xyz". That's when you're done. I suspect it might be the giant mega "rewrite nouveau" commit, in which case we're screwed and this will have been a huge time-waster (apologies in advance if it turns out this way). But it might be one of the many other commits that went into 3.7, which would be nice and indicate an area to focus on.
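
To give a feel for how many build/install/boot/test cycles the loop above costs, here is a minimal Python sketch of the search that bisection performs. It assumes a purely linear history (real kernel history contains merges, which git handles itself), and the commit names and the `is_bad` callback are hypothetical stand-ins for a real boot test.

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit in a linear history.

    commits[0] must be known good and commits[-1] known bad,
    mirroring the good/bad pair handed to "git bisect start".
    Returns (first_bad_commit, number_of_tests_performed).
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1                  # one build/install/boot/test cycle
        if is_bad(commits[mid]):
            bad = mid               # corresponds to "git bisect bad"
        else:
            good = mid              # corresponds to "git bisect good"
    return commits[bad], steps

# Hypothetical history of 100 commits; commit 37 introduces the regression.
history = ["c%02d" % i for i in range(100)]
first_bad, steps = bisect_first_bad(history, lambda c: int(c[1:]) >= 37)
print(first_bad, "found after", steps, "tests")
```

Because the candidate range halves on every test, about 100 commits need at most 7 cycles, which is why a bisect is feasible even when each cycle means a full kernel build.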
Comment 22 Andreas Loew 2013-12-13 16:57:44 UTC
Hello Ilia,

regarding (a) and (b): I am just waiting for a rpmbuild of an OL6 version of 3.13-rc3 to finish and will report back on my findings and include a dmesg output from that version.

Regarding (c):

Wouldn't it make more sense to first rule out the "mega commit", rather than starting with the 3.6 and 3.7 release tags?

Can you give me the git commands (or point me to a doc that tells me how to produce them) for getting "ordinary kernel tarballs" out of the DRM nouveau git just like the ones published on

https://www.kernel.org/pub/linux/kernel/v3.0/testing/

for two points in time in between 3.6 and 3.7:

(1) for the version up to the immediate commit BEFORE the "mega commit"
(2) for the version exactly matching the "mega commit"?

Using these two kernel tarballs, I could then either confirm or rule out the "mega commit" as the root cause for the issue, and in the (unlikely) case the mega commit can indeed be ruled out, I could then concentrate on further narrowing down the commits

* either between 3.6 and the mega commit if build (1) is already broken
* or between the mega commit and 3.7 if build (2) still works, but 3.7 fails?

Sorry, but rather than pulling the whole git on my poor old laptop and starting a huge number of bisection attempts "into the blue", I think that this makes more sense and does not require me to become a git expert in order to try and help tracking this down... ;-)

What do you think?

I will report back shortly with my 3.13-rc3 results...

BR,
Andreas
Comment 23 Ilia Mirkin 2013-12-13 19:03:38 UTC
The mega-commit is ebb945a94bba2ce8dff7b0942ff2b3f2a52a0a69. So you could check out ebb945a94bba^ and see if it works, and then test ebb945a94bba to see if it doesn't. In either case, you could use those as your new "good" or "bad" starting points.

You can do a clone with like --depth 1 or something. Not sure how to do that at a commit. Also I'd recommend against it, it'll just be more downloading later on if things don't pan out. A full git clone of the linux kernel is ~800MB (+ space to actually store the files, but that's all part of the 800MB). In fact, I don't even know if that 818MB is compressed or not -- I'd guess not, so the download is probably much smaller.
Comment 24 Andreas Loew 2013-12-14 11:21:50 UTC
Created attachment 90764
dmesg output on 3.13-rc3 while the issue was seen
Comment 25 Andreas Loew 2013-12-14 11:22:31 UTC
Created attachment 90765
dmesg output in debug mode (nouveau.debug=debug) on 3.13-rc3 while the issue was seen
Comment 26 Andreas Loew 2013-12-14 11:30:05 UTC
Hi again,

sorry, it took longer than needed for me to find my way through compiling recent kernels with rpmbuild and an appropriate spec file.

The result of my testing is negative: The bug is still included in the most recent 3.13-rc3 kernel... :-(

From the attached dmesg output (which in both cases, includes the time when the issue was seen and my screen was completely garbled), it looks to me that there are no signs - not even in debug mode - of anything going wrong, so if I am right with this assumption, I think this supports your theory that the root cause of the severe screen corruption indeed is a "fencing" issue...

In the meantime, I have created a git repository on my machine and produced two 3.6-based tarballs for before and after the "mega patch".

I will now move forward to adapt a 3.6 kernel rpmbuild spec file and then build two kernels for these two snapshots.

I should be able to update you on my progress some time tomorrow...

Thanks & best regards,
Andreas
Comment 27 Andreas Loew 2013-12-14 17:36:05 UTC
Hmm - bad news once again:

I have now compiled and tested a 3.6 kernel matching the commit immediately before the "mega commit", i.e. the kernel tarball was produced by the following command:

$ git archive --format=tar "ebb945a94bba2ce8dff7b0942ff2b3f2a52a0a69^" | bzip2 > ~/Projekte/nouveau_drm/linux-before-mega.tar.bz2

Unfortunately, I am unable to test whether the screen distortion issue occurs with this kernel, because I get a complete hang (system freezes, CPU and GPU fans running full speed) somewhere between some seconds and some minutes after starting GNOME...

Note that I have seen both: either no screen corruption at all or first slight signs of screen corruption (white rectangles around window frames) at the times of the hangs.

The error messages that I find in /var/log/messages that are probably associated with the hangs (sorry, I can't get any messages out of dmesg due to the hang...) seem to be the following:

[drm:drm_mm_takedown] *ERROR* Memory manager not clean. Delaying takedown
[drm:drm_mm_takedown] *ERROR* Memory manager not clean. Delaying takedown
[drm:drm_mm_takedown] *ERROR* Memory manager not clean. Delaying takedown

repeating between 3 and 5 times directly before the hangs (immediately followed by /var/log/messages starting over with my restart after powering off the machine).

Will now move forward to test with the most recent stock kernel from the 3.6 series: 3.6.11-3.6.y.20121225.ol6 from the Oracle public yum playground to see whether this already is affected... :-(

BR,
Andreas
Comment 28 Andreas Loew 2013-12-14 17:58:51 UTC
This gets really interesting now:

Oracle public yum "playground" 3.6.11-3.6.y.20121225.ol6 (should be stock kernel 3.6.11) does NOT show any hangs, but DOES INDEED ALREADY show the graphics corruption issue FOR ME (although it was thought by the original posters here that it started with 3.7.0)...!?

So I will now try and move backwards in kernel versions until I might find one that does not exhibit the corruption bug.

As Oracle's "playground" kernels are only available starting from 3.6, I will probably move to ELRepo "ml" kernels for this job.

I'll report back once I have some idea of where exactly the issue indeed started...

BR,
Andreas
Comment 29 Andreas Loew 2013-12-15 14:09:55 UTC
OK, finally I have some more encouraging news:

It now looks like the issue indeed started much earlier than initially thought, namely already between the 3.4 and 3.5 kernel series!!!

Results from my testing with stock kernels obtained from kernel.org (I've never ever before compiled so many kernels in such a short period of time...):

* 3.4.5 -> NO ISSUE
* 3.4.74 -> NO ISSUE
* 3.5.1 -> ISSUE SEEN
* 3.5.5 -> ISSUE SEEN
* all later versions (3.6 onwards) -> ISSUE SEEN.
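
The bracket implied by these results can be checked with a few lines of Python. This is only a sketch: the `results` dictionary is copied from the list above, the helper names are hypothetical, and it relies on the kernel git convention that release tags are named `v3.4`, `v3.5` and so on. Note that stable releases such as 3.4.74 were published after 3.5.1, so numeric version order is not commit order; the sketch only identifies which two release series enclose the regression.

```python
def version_key(version):
    """Turn '3.4.74' into (3, 4, 74) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

# Test results from the list above; True means the corruption was seen.
results = {"3.4.5": False, "3.4.74": False, "3.5.1": True, "3.5.5": True}

last_good = max((v for v, bad in results.items() if not bad), key=version_key)
first_bad = min((v for v, bad in results.items() if bad), key=version_key)

# The results form a usable bracket only if every good version
# sorts below every bad one.
bracket_is_clean = version_key(last_good) < version_key(first_bad)

def series_tag(version):
    """Map '3.4.74' to the release tag of its series, 'v3.4'."""
    major, minor = version.split(".")[:2]
    return "v%s.%s" % (major, minor)

good_tag, bad_tag = series_tag(last_good), series_tag(first_bad)
print("good:", good_tag, "bad:", bad_tag, "clean bracket:", bracket_is_clean)
```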

So please advise now what next steps I should undertake to track it down more closely:

What new commits have happened between the 3.4 and 3.5 series, and did one of them possibly affect so-called "fencing" on NV86/NV50 chips?

(And - in order to learn some more git - how can I find out the associated commits using git command-line, such that I can produce the respective kernel tarballs for testing out of git?)

Many thanks in advance for your feedback! :-)

Andreas
Comment 30 Andreas Loew 2013-12-15 14:13:11 UTC
In addition, one more request to all the other people who raised this issue here and/or have also seen it before myself:

Can you confirm that for you, too, the issue already started after the 3.4 series as it does for me, i.e. that you never tried a 3.5.x or 3.6.x kernel?

Thanks & BR,
Andreas
Comment 31 Ilia Mirkin 2013-12-15 14:42:27 UTC
You really need to figure out how to do things inside the git tree and not do some sort of crazy export. That will speed things up by an order of magnitude.

To get the list of nouveau changes between 3.4 and 3.5:

git log v3.4..v3.5 -- drivers/gpu/drm/nouveau

To do a bisect between 3.4 and 3.5, same instructions as before, but use v3.5 as the bad tag and v3.4 as the good tag.

Looking through the list of changes, c420b2dc8dc3cdd507214f4df5c5f96f08812cbe stands out as a big one, as does 5e120f6e4b3f35b741c5445dfc755f50128c3c44 which actually introduces the nv84+ fence mechanism.

This had actually previously occurred to me, but a quick thing to try out is to switch to the nv17 fence and see what happens. You can do this by editing the logic in drivers/gpu/drm/nouveau/nouveau_drm.c:nouveau_accel_init, and just replace nv84_fence_create with nv50_fence_create (which will make a nv50+ appropriate nv17 fence impl).
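
To make the effect of that swap concrete, here is a small Python model of the fence selection, written as a sketch rather than as driver code. It mirrors only the branches visible in the diff context quoted in comment 34; any branch before the `chipset < 0x17` test is not shown there and is not modelled. The numeric values for `NV_50` and `NV_C0` are assumptions chosen so the comparisons order correctly, and `force_nv50` is a hypothetical flag standing in for the one-line edit.

```python
# Card family constants; the numeric values are assumed for illustration.
NV_50 = 0x50
NV_C0 = 0xC0

def select_fence(chipset, card_type, force_nv50=False):
    """Return the name of the fence constructor the chain would pick."""
    # Branches above this one lie outside the quoted hunk and are omitted.
    if chipset < 0x17:
        return "nv10_fence_create"
    if card_type < NV_50:
        return "nv17_fence_create"
    if chipset < 0x84:
        return "nv50_fence_create"
    if card_type < NV_C0:
        # This is the branch the suggested experiment edits.
        return "nv50_fence_create" if force_nv50 else "nv84_fence_create"
    return "nvc0_fence_create"

# G86 (NV86) as reported in comment 19: chipset 0x86, family NV50.
print(select_fence(0x86, NV_50))
print(select_fence(0x86, NV_50, force_nv50=True))
```

For the G86 in this bug the unmodified chain lands on the nv84 fence, so the experiment changes exactly which fence implementation the affected hardware uses while leaving older and newer families untouched.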
Comment 32 Andreas Loew 2013-12-15 15:19:35 UTC
Many thanks for your quick reply - even on a Sunday! :-)


Regarding:

"You really need to figure out how to do things inside the git tree and not do
some sort of crazy export. That will speed things up by an order of magnitude."

the main issue is that I need to build a RHEL6/OL6 compliant kernel on my machine, and I simply don't have a spec file which properly builds such a kernel from git, so I need to export the git snapshot to a tarball.

In case you have such an RHEL6/OL6 spec file (or know where to get one from), please let me know...


I'm just in the process of trying whether moving from nv84_fence_create to nv50_fence_create will make a difference with 3.6.11 and will report back later.

BR,
Andreas
Comment 33 Andreas Loew 2013-12-15 23:09:23 UTC
Created attachment 90812
/var/log/messages from 3.12.4 start attempt with NV50 fence
Comment 34 Andreas Loew 2013-12-15 23:16:08 UTC
Bad news once again...

I applied the following single-line patch to a stock 3.12.4 kernel in order to switch to the NV50 fence:

diff -Nrpu linux-3.12.4.orig/drivers/gpu/drm/nouveau/nouveau_drm.c linux-3.12.4/drivers/gpu/drm/nouveau/nouveau_drm.c
--- linux-3.12.4.orig/drivers/gpu/drm/nouveau/nouveau_drm.c	2013-12-08 17:18:58.000000000 +0100
+++ linux-3.12.4/drivers/gpu/drm/nouveau/nouveau_drm.c	2013-12-15 16:37:25.000000000 +0100
@@ -180,7 +180,7 @@ nouveau_accel_init(struct nouveau_drm *d
 	else if (device->chipset   <  0x17) ret = nv10_fence_create(drm);
 	else if (device->card_type < NV_50) ret = nv17_fence_create(drm);
 	else if (device->chipset   <  0x84) ret = nv50_fence_create(drm);
-	else if (device->card_type < NV_C0) ret = nv84_fence_create(drm);
+	else if (device->card_type < NV_C0) ret = nv50_fence_create(drm);
 	else                                ret = nvc0_fence_create(drm);
 	if (ret) {
 		NV_ERROR(drm, "failed to initialise sync subsystem, %d\n", ret);

but the result is that after the GUI login screen (gdm) which works fine, I get a complete hang when GNOME starts up using compiz (cannot even switch to a text vt any more) and lots of the following output:

Dec 15 23:23:05 aloew-lap kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [compiz[4637]] subc 0 mthd 0x0018 data 0x00000002
Dec 15 23:23:05 aloew-lap kernel: nouveau E[     PFB][0000:01:00.0] trapped write at 0x0000000114 on channel 0x0000f949 [unknown] PFIFO/PFIFO_READ/SEMAPHORE reason: PT_NOT_PRESENT
Dec 15 23:23:06 aloew-lap kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [compiz[4637]] subc 2 mthd 0x0860 data 0x6f000000
Dec 15 23:23:06 aloew-lap kernel: nouveau E[     PFB][0000:01:00.0] trapped write at 0x0000000114 on channel 0x0000f949 [unknown] PFIFO/PFIFO_READ/SEMAPHORE reason: PT_NOT_PRESENT
Dec 15 23:23:06 aloew-lap kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [compiz[4637]] subc 2 mthd 0x0860 data 0x72000000
Dec 15 23:23:06 aloew-lap kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [compiz[4637]] subc 2 mthd 0x0860 data 0x76000000
Dec 15 23:23:06 aloew-lap kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [compiz[4637]] subc 2 mthd 0x0860 data 0x74000000
Dec 15 23:23:06 aloew-lap kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [compiz[4637]] subc 2 mthd 0x0860 data 0x6f000000
Dec 15 23:23:06 aloew-lap kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [compiz[4637]] subc 2 mthd 0x0860 data 0x60000000
Dec 15 23:23:06 aloew-lap kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [compiz[4637]] subc 2 mthd 0x0860 data 0x41000000
(...)

(see the attached bz2 for the full log).

So does this mean that your proposal of switching to the nv50_fence won't work for me?

In the meantime, I will continue and try kernel builds based on commits "5e120f6e4b3f35b741c5445dfc755f50128c3c44^" and "5e120f6e4b3f35b741c5445dfc755f50128c3c44" tomorrow...

Thanks & BR,
Andreas
Comment 35 Andreas Loew 2013-12-15 23:44:21 UTC
Follow-up question now that I am digging deeper into git... ;-)

From the commits in the v3.4..v3.5 range, only two of them:

9bd0c15fcfb42f6245447c53347d65ad9e72080b (dated Jun 26, 2012) and
e9bf5f36b09f8ec6c168ef58ee7d4890545ede1c (dated Jun 27)

when looking at the global Makefile:

git show <commit-sha1>:Makefile

have been done on 3.5.0-rc4 version of the kernel.

All other commits in this range had been done on 3.4.0 and less:

35916acedd8dadb361ef6439d05d60fbe8f53032 (dated May 31)

and all earlier commits have been done on 3.4.0 and its rc builds.

As the issue is NOT present in the 3.4.x series anyway, I assume that only the two commits on 3.5.0-rc4 above (if any) from this interval are relevant, and we rather need to look at the subsequent v3.5..v3.6 range!?

Am I correct (and please bear with me in case I got it wrong - I had never used git before looking into this...)?

Thanks & BR,
Andreas
Comment 36 Ilia Mirkin 2013-12-15 23:51:11 UTC
(In reply to comment #35)
> Follow-up question now that I am digging deeper into git... ;-)
> 
> From the commits in the v3.4..v3.5 range, only two of them:
> 
> 9bd0c15fcfb42f6245447c53347d65ad9e72080b (dated Jun 26, 2012) and
> e9bf5f36b09f8ec6c168ef58ee7d4890545ede1c (dated Jun 27)
> 
> when looking at the global Makefile:
> 
> git show <commit-sha1>:Makefile
> 
> have been done on 3.5.0-rc4 version of the kernel.

Most assuredly not the right way to look at it.

> 
> All other commits in this range had been done on 3.4.0 and less:
> 
> 35916acedd8dadb361ef6439d05d60fbe8f53032 (dated May 31)

And of course date of the commit has nothing to do with anything either.

Think about the branched development model. Let's say I do some work, basing my work on, say, 2.6.0. I spend a lot of time on it. Then I send a pull request to Linus (or whoever). He merges it. When looking at my commits, you might think that you're looking at a 2.6.0 kernel, based on the Makefile. And in a large sense you are. But in reality the commits were merged into some much later release. Same with dates.

You can either read about git and fully understand it, or you can kinda trust that the tools aren't lying to you when you ask for a bisect in a range, or a log of commits between two revisions.
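
One way to ask git itself, rather than the Makefile or the commit date, where a commit ended up is "git describe --contains". A minimal sketch, assuming a kernel checkout with the release tags present; the function name is hypothetical.

```shell
# Hypothetical helper: describe a commit relative to a tag that contains it
# (git prints something of the form <tag>~<n>).
tag_containing() {
    git describe --contains "$1"
}
```

For example, "tag_containing 5e120f6e4b3f35b741c5445dfc755f50128c3c44" names a tag whose history includes that commit, regardless of which kernel version the author's tree was based on when the commit was written.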
Comment 37 Andreas Loew 2013-12-16 13:02:11 UTC
Hello again, Ilia,

ok, I see - did some further reading and I think I now fully understand the way it works:

This also means that you are NOT regularly pulling the updates from Linus' central git into the nouveau git, but typically only do this ONCE after Linus released a new version (here: 3.4.0) and then NOT for any minor subsequent release by Linus (3.4.1, 3.4.2 and so on), but ONLY shortly before he opens the "rc" pull window for his next release series (here: 3.5-rc1).

So it indeed looks as if all the local commits on the nouveau git have been made on a 3.4.0 kernel, although they ended up in the official 3.5 version released by Linus.


Based on this, I did further testing:

Both "5e120f6e4b3f35b741c5445dfc755f50128c3c44^" and "5e120f6e4b3f35b741c5445dfc755f50128c3c44" do still run fine, i.e. the commit 5e120f6e4b3f35b741c5445dfc755f50128c3c44 - which actually introduced the nv84_fence - does NOT seem to be causing the distortion issue.

I will now move forward (slowly, as I need to do the tarball-based rpmbuild process), and keep you updated on my findings.


Also, I repeat my question to the other folks who had reported this issue before:

Can you confirm that you also already see the issue when you use any stock 3.5.x or 3.6.x kernels, i.e. the issue did start long before 3.7.0 and the 3.4.x is the most recent release that works fine?

Thanks & BR,
Andreas
Comment 38 Tom Wijsman 2013-12-16 13:19:47 UTC
(In reply to comment #37)
> This also means that you are NOT regularly pulling the updates from Linus'
> central git into the nouveau git, but typically only do this ONCE after
> Linus released a new version (here: 3.4.0)

That usually is done as often as necessary; if one does it more often, it could lead to situations where you have pulled a new broken commit that could slow down Nouveau development. And thus, pulling major releases is efficient.

> and then NOT for any minor
> subsequent release by Linus (3.4.1, 3.4.2 and so on)

Note that those releases happen by Greg KH and consist of backported patches.

> but ONLY shortly
> before he opens the "rc" pull window for his next release series (here:
> 3.5-rc1).

That would be one possible moment where the conditions are ideal enough to pull.

Though I am in doubt whether it matters when this was pulled from Linus. If you don't like to bisect the Nouveau development branch, you can bisect kernel git.

> Based on this, I did further testing:
> 
> Both "5e120f6e4b3f35b741c5445dfc755f50128c3c44^" and
> "5e120f6e4b3f35b741c5445dfc755f50128c3c44" do still run fine, i.e. the
> commit 5e120f6e4b3f35b741c5445dfc755f50128c3c44 - which actually introduced
> the nv84_fence - does NOT seem to be causing the distortion issue.
> 
> I will now move forward (slowly, as I need to do the tarball-based rpmbuild
> process), and keep you updated on my findings.

You really want to be doing a git bisect to do the least amount of work; I don't see what you mean by "move forward" but I really hope that you are testing the commits in a binary tree style.

You can put the tarball-based process in a script so you only need to run a single command after moving further in the bisection.
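
A rough sketch of what such a script could look like for the export part. The function name and the tarball naming are assumptions, and the rpmbuild step is left out because it depends on a spec file that is not shown in this report.

```shell
# Hypothetical helper: export one commit as a gzip'd tarball for a
# tarball-based rpmbuild. Prints the path of the tarball it wrote.
export_snapshot() {
    rev="$1"            # commit id or tag to export
    outdir="${2:-.}"    # directory to write the tarball into
    short=$(git rev-parse --short=12 "$rev") || return 1
    out="$outdir/linux-$short.tar.gz"
    # Plain tar piped through gzip, so this does not rely on
    # "git archive --format=tar.gz" being available in older git versions.
    git archive --format=tar --prefix="linux-$short/" "$rev" | gzip > "$out" || return 1
    echo "$out"
}
```

Called as "export_snapshot <commit> /some/dir" after each bisection step, this leaves only the build and the reboot as manual work.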
Comment 39 Andreas Loew 2013-12-16 14:18:21 UTC
> You really want to be doing a git bisect to do the least amount of work; I
> don't see what you mean by "move forward" but I really hope that you are
> testing the commits in a binary tree style.

Yup. Indeed, I am trying to use a "binary" approach to minimize work, but am not using git bisect; instead I hope to augment cutting the solution tree in half by reading the commit comments and letting my intellect suggest which ones look like more or less likely candidates...

> You can put the tarball-based process in a script so you only need to run a
> single command after moving further in the bisection.

I have indeed already done so (not based on git bisect, but a commit id).

Note that while I'm indeed a complete newbie to git, I am not at all a newbie to Linux/Unix shell scripting. In my main job, though, I am a Java architect/developer/support engineer, so I typically am only a "dummy user" of Linux kernels - unless it very rarely happens that something breaks which is really important for me, so I try and see whether I can help... ;-)

I started trying to drive this forward when I became suddenly affected by this issue because it has indeed been introduced into RHEL6 mainline kernels with the most recent RHEL 6.5 kernel update - so I hope that once we're done and you have been able to successfully fix the issue, you can take care of the fix also being ported into subsequent RHEL6 kernels (working at Red Hat, I hope that Ben Skeggs should hopefully be interested enough in doing so...).

Will report back here as soon as I have been able to track things down to a particular commit... :-)

Thanks & BR,
Andreas
Comment 40 Andreas Loew 2013-12-17 14:51:03 UTC
Hello again,

I need your help on how to proceed:

Using the bisection approach, I have now largely reduced the candidate commits that might have introduced the issue:

4c193d254ee94da02857b9670e815b1765a9579b shows the issue, while
c420b2dc8dc3cdd507214f4df5c5f96f08812cbe does not, so the issue has been introduced between May 2nd and May 4th, 2012.

I now wanted to check 78df3a1c585c8c95fd9a472125f0cd406e8617ce, but this commit does not even compile:

The error message for the above is:

drivers/gpu/drm/nouveau/nouveau_fbcon.c: In function 'nouveau_fbcon_sync':
drivers/gpu/drm/nouveau/nouveau_fbcon.c:166: error: void value not ignored as it ought to be
make[4]: *** [drivers/gpu/drm/nouveau/nouveau_fbcon.o] Error 1
make[3]: *** [drivers/gpu/drm/nouveau] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make: *** [drivers] Error 2

So how should I proceed? Can you tell me how to fix the above compile error, or should I proceed to check both

b355096992e2b4d30bb77173927f45e7f2c12570
(immediately before 78df3a1c585c8c95fd9a472125f0cd406e8617ce) and

d1b167e168bdac0b6af11e7a8c601773639fc419
(immediately after 78df3a1c585c8c95fd9a472125f0cd406e8617ce)?

Please advise how I should move forward!

Thanks & best regards,
Andreas
Comment 41 Ilia Mirkin 2013-12-17 15:09:59 UTC
The fix for that compilation issue is contained in d1b167e168bdac0b6af11e7a8c601773639fc419

Basically you need to make nouveau_channel_idle return an int, and just stick a 'return ret' at the end. And adjust the prototype in nouveau_drv.h.
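
When a bisect step lands on a commit that does not compile, as 78df3a1c does here, an alternative to patching it by hand is "git bisect skip", which lets git pick a nearby commit instead. A minimal sketch, assuming an active "git bisect" session in a configured kernel tree; the function name and the build.log file are made up for illustration.

```shell
# Hypothetical helper for one manual bisect step: try to build the nouveau
# directory, and skip the commit if it does not compile.
build_or_skip() {
    if make drivers/gpu/drm/nouveau/ >build.log 2>&1; then
        echo "build ok: boot and test, then run 'git bisect good' or 'git bisect bad'"
    else
        # A commit that cannot be built is neither good nor bad.
        git bisect skip
    fi
}
```

Skipped commits can leave git unable to name a single first bad commit, in which case it reports the remaining range of candidates.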
Comment 42 Andreas Loew 2013-12-17 15:19:32 UTC
Thanks a million for your super-fast reply! So I'll proceed with 

d1b167e168bdac0b6af11e7a8c601773639fc419

rather than 78df3a1c585c8c95fd9a472125f0cd406e8617ce, and will report back later...

BR,
Andreas
Comment 43 Andreas Loew 2013-12-17 21:17:53 UTC
Hello again,

so it looks like I have now tracked the issue down.

The "offending" commit seems to be:

4c193d254ee94da02857b9670e815b1765a9579b

(the first commit which showed the issue - I tried for more than half an hour with its direct predecessor d1b167e168bdac0b6af11e7a8c601773639fc419, but could not reproduce the issue).


As the change by the offending commit seems really very 'simple':

  "use crypto engine for async buffer copies"

@@ -821,6 +839,7 @@ nouveau_bo_move_init(struct nouveau_channel *chan)
        } _methods[] = {
                {  "COPY", 0xa0b5, nve0_bo_move_copy, nvc0_bo_move_init },
                {  "M2MF", 0x9039, nvc0_bo_move_m2mf, nvc0_bo_move_init },
+               { "CRYPT", 0x74c1, nv84_bo_move_exec, nv50_bo_move_init },
                {  "M2MF", 0x5039, nv50_bo_move_m2mf, nv50_bo_move_init },
                {  "M2MF", 0x0039, nv04_bo_move_m2mf, nv04_bo_move_init },

at the heart of the issue, I think we now have the question whether it indeed is correct that there might be NV84-compatible G86 variants (such as my 8400M-based Quadro NVS130M), for which this "nv84_bo_move_exec" causes issues...!?


One more question regarding verification with current kernels:

In a current kernel, method nouveau_bo_move_init looks similar, but different:

        } _methods[] = {
                {  "COPY", 4, 0xa0b5, nve0_bo_move_copy, nve0_bo_move_init },
                {  "GRCE", 0, 0xa0b5, nve0_bo_move_copy, nvc0_bo_move_init },
                { "COPY1", 5, 0x90b8, nvc0_bo_move_copy, nvc0_bo_move_init },
                { "COPY0", 4, 0x90b5, nvc0_bo_move_copy, nvc0_bo_move_init },
                {  "COPY", 0, 0x85b5, nva3_bo_move_copy, nv50_bo_move_init },
                { "CRYPT", 0, 0x74c1, nv84_bo_move_exec, nv50_bo_move_init },
                {  "M2MF", 0, 0x9039, nvc0_bo_move_m2mf, nvc0_bo_move_init },
                {  "M2MF", 0, 0x5039, nv50_bo_move_m2mf, nv50_bo_move_init },
                {  "M2MF", 0, 0x0039, nv04_bo_move_m2mf, nv04_bo_move_init },
                {},
                { "CRYPT", 0, 0x88b4, nv98_bo_move_exec, nv50_bo_move_init },
        }, *mthd = _methods;

what would be an equivalent change to a current kernel to roll back the effects of the above forward patch?

Looking forward to your feedback...

Thanks a million & best regards,
Andreas
Comment 44 Ilia Mirkin 2013-12-17 21:34:34 UTC
Try commenting the same line out and see what happens... (i.e. the one with 0x74c1)

FWIW I do remember seeing some PCRYPT-related (and PVP/PBSP-related) errors on start in the form of MMIO write failures in your log and thinking it odd:

nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000010 FAULT at 0x10200c

Which of course is an enable of FIFO_ACCESS... probably pretty important. (See https://github.com/envytools/envytools/blob/master/rnndb/vdec/vp2/pcrypt2.xml) But why do you get that error... anyone's guess. If you have the blob installed, would be interested to know if VDPAU hw decode acceleration works for H.264 (i.e. things are actually accelerated), because you get similar errors for the VP/BSP engines:

nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00fd94
nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x103d94

And they're all interconnected, I think. [Sadly none of the other bug commenters were kind enough to leave a kernel log around, so can't easily tell if that was their issue as well.]
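
To compare logs from different reporters, the faulting register addresses can be pulled out of a kernel log with standard tools. A minimal sketch; the function name is made up here, and the pattern assumes the log format of the PBUS lines quoted above.

```shell
# Hypothetical helper: read a kernel log on stdin and print each distinct
# register address for which nouveau's PBUS reported an MMIO fault.
mmio_fault_addrs() {
    grep -E 'nouveau E\[ *PBUS\].*MMIO (read|write) of 0x[0-9a-fA-F]+ FAULT at 0x[0-9a-fA-F]+' |
        sed 's/.*FAULT at \(0x[0-9a-fA-F]*\).*/\1/' |
        sort -u
}
```

Typical use would be "dmesg | mmio_fault_addrs"; seeing 0x10200c in the output would suggest the same PCRYPT write failure.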
Comment 45 Andreas Loew 2013-12-17 22:09:08 UTC
Hmm...

While my card definitely is not at all defective (it works the exact same way it always has), I indeed also have some issues that make it impossible for me to still use the "blob" even on an attic distro like RHEL6:

The most recent version of the NVidia proprietary driver that worked fine for me was their 285.09 (which can no longer be used even on RHEL6 because it relies on an outdated X11 ABI, AFAIK).

Any more recent version of NVidia's driver (and most interestingly, on Linux as well as on Windows 7 x64!) - even though NVidia states that Quadro NVS 130M would still be supported with their latest drivers - has an issue which causes sudden complete hangs every once in a while (between a few seconds to few hours), but completely unpredictably...

Unfortunately, NVidia tech support is unable/unwilling to help with this issue (I tried for several months without any progress...).

I just did some Google research: Do you know that an 8400M (as well as my NVS 130M) does only support VDPAU "feature set A", but most VDPAU software relies on feature sets C or D being implemented by the cards!?

Would it make sense to address the above questions about the feature set implemented by these early G86 cards and how to properly activate these features directly to NVidia (AFAIK, they recently offered some help by answering questions from nouveau developers)?

In order to move forward, I will try to comment out the single line

 { "CRYPT", 0, 0x74c1, nv84_bo_move_exec, nv50_bo_move_init },

in a current 3.12.4 build, and report back on what happens...

Best regards,
Andreas
Comment 46 Ilia Mirkin 2013-12-18 00:58:35 UTC
Well, given that it doesn't work on the blob makes it sound like you have some sort of funkiness in your hardware. One unsubstantiated theory is that the vdec clock is *disabled*, and pcrypt is hooked up to that clock. Or perhaps that clock is somehow broken. It'd be interesting to see whether the old blob version can be made to work, but I wouldn't spend _too_ much time on it. May be easy to do with an older livecd or something.

So, in order to completely disable PCRYPT without patching your system you can boot with "nouveau.config=PCRYPT=0" in your kernel cmdline (since 3.7, I think). It will also disallow userspace from using PCRYPT, which is probably for the best if it's really broken. (Whereas just commenting it out there prevents a very specific use-case of it.)
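
To keep the option across reboots, it can be appended to the kernel line in the boot loader configuration. A sketch only: the kernel file name and root device below are placeholders, and only the nouveau.config=PCRYPT=0 part comes from the paragraph above.

```
kernel /vmlinuz-<version> ro root=<root-device> nouveau.config=PCRYPT=0
```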

With a pre-3.13 kernel, you can also try adding nouveau.perflvl_wr=7777 nouveau.perflvl=1 which will force reclocking to happen on boot (to level '1' which in your case is comparable to what you had been booting to anyways), and just might get PCRYPT going (if the clock theory is right). Or hang your machine. (Or both!) With 3.13 you'll need to apply a patch to enable the reclocking functionality. [Obviously test this theory without the PCRYPT disable stuff.]

BTW, to people who are not Andreas: Please post a full kernel log of a boot with nouveau in a semi-recent kernel, that should reveal whether you guys are all having the same issue or not.
Comment 47 Andreas Loew 2013-12-18 13:20:00 UTC
Hi - it's me again ;-)

> Well, given that it doesn't work on the blob makes it sound like you have some
> sort of funkiness in your hardware. One unsubstantiated theory is that the 
> vdec clock is *disabled*, and pcrypt is hooked up to that clock. Or perhaps 
> that clock is somehow broken. It'd be interesting to see whether the old blob
> version can be made to work, but I wouldn't spend _too_ much time on it. May 
> be easy to do with an older livecd or something.

I could try and do an install of RHEL6.2 (before the ABI change) onto a USB HDD. On this version, 285.09 still ran fine. What exactly would you want me to check with the "blob"?

Is there any diagnostic tool that could check about my "vdec clock" or pcrypt status?

In the meantime, I have verified that with the single line

 { "CRYPT", 0, 0x74c1, nv84_bo_move_exec, nv50_bo_move_init },

commented out, a 3.12.4 kernel works fine - no corruption seen.

> So, in order to completely disable PCRYPT without patching your system you
> can boot with "nouveau.config=PCRYPT=0" in your kernel cmdline (since 3.7, I
> think). It will also disallow userspace from using PCRYPT, which is probably
> for the best if it's really broken. (Whereas just commenting it out there
> prevents a very specific use-case of it.)

Also, booting both a stock 3.12.4 kernel and even the current RHEL 6.5 2.32-431 kernel with kernel command line option "nouveau.config=PCRYPT=0" seems to work fine, which means that my basic issue is already resolved, as with this option, I already seem to be able to make sure that future stock RHEL kernels won't break my screen all the time... ;-)

> With a pre-3.13 kernel, you can also try adding nouveau.perflvl_wr=7777 
> nouveau.perflvl=1 which will force reclocking to happen on boot (to level '1' 
> which in your case is comparable to what you had been booting to anyways), 
> and just might get PCRYPT going (if the clock theory is right). Or hang your 
> machine. (Or both!) With 3.13 you'll need to apply a patch to enable the 
> reclocking functionality. [Obviously test this theory without the PCRYPT 
> disable stuff.]

Can you please provide more details about this?

I tried to pass the below options to stock kernel version 3.13.0-rc4 (as of Dec 16) but got the following message:

Command line: ro root=UUID=034d34cd-a464-4ee3-8db9-d6061a318a16 rd_NO_LUKS LANG=en_US.UTF-8  KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_NO_LVM rd_NO_DM nouveau.perflvl_wr=7777 nouveau.perflvl=1
[...]
nouveau: unknown parameter 'perflvl_wr' ignored
nouveau: unknown parameter 'perflvl' ignored

and then of course got the distorted screen again.

So what exactly do I need to do in order to be able to pass these two parameters and see whether reclocking my "vdec" clock helps to successfully use the pcrypt feature?

Many thanks one more time,
Andreas
Comment 48 Andreas Loew 2013-12-18 13:22:07 UTC
Sorry, typo:
Of course wanted to refer to the current RHEL 6.5 kernel "2.6.32-431" above...
Comment 49 Andreas Loew 2013-12-18 13:39:51 UTC
OK, tried with RHEL 6.5 kernel 2.6.32-431 and the two options:

Command line: ro root=UUID=034d34cd-a464-4ee3-8db9-d6061a318a16 rd_NO_LUKS LANG=en_US.UTF-8  KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_NO_LVM rd_NO_DM nouveau.perflvl_wr=7777 nouveau.perflvl=1 crashkernel=auto

and saw

nouveau 0000:01:00.0: setting latency timer to 64
nouveau 0000:01:00.0: power state changed by ACPI to D0
nouveau 0000:01:00.0: power state changed by ACPI to D0
nouveau 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
nouveau  [  DEVICE][0000:01:00.0] BOOT0  : 0x086a00a2
nouveau  [  DEVICE][0000:01:00.0] Chipset: G86 (NV86)
nouveau  [  DEVICE][0000:01:00.0] Family : NV50
nouveau  [   VBIOS][0000:01:00.0] checking PRAMIN for image...
nouveau  [   VBIOS][0000:01:00.0] ... appears to be valid
nouveau  [   VBIOS][0000:01:00.0] using image from PRAMIN
nouveau  [   VBIOS][0000:01:00.0] BIT signature found
nouveau  [   VBIOS][0000:01:00.0] version 60.86.49.00.27
nouveau  [     PFB][0000:01:00.0] RAM type: DDR2
nouveau  [     PFB][0000:01:00.0] RAM size: 256 MiB
nouveau  [     PFB][0000:01:00.0]    ZCOMP: 646 tags
nouveau  [  PTHERM][0000:01:00.0] FAN control: none / external
nouveau  [  PTHERM][0000:01:00.0] fan management: disabled
nouveau  [  PTHERM][0000:01:00.0] internal sensor: yes
nouveau  [  PTHERM][0000:01:00.0] Programmed thresholds [ 90(3), 95(3), 125(5), 125(5) ]
[TTM] Zone  kernel: Available graphics memory: 2963482 kiB
[TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[TTM] Initializing pool allocator
[TTM] Initializing DMA pool allocator
nouveau  [     DRM] VRAM: 256 MiB
nouveau  [     DRM] GART: 512 MiB
nouveau  [     DRM] TMDS table version 2.0
nouveau  [     DRM] DCB version 4.0
nouveau  [     DRM] DCB outp 00: 010003f3 00010035
nouveau  [     DRM] DCB outp 01: 02811300 00000028
nouveau  [     DRM] DCB outp 02: 02822312 00000030
nouveau  [     DRM] DCB outp 03: 01833320 00000028
nouveau  [     DRM] DCB conn 00: 0040
nouveau  [     DRM] DCB conn 01: 0100
nouveau  [     DRM] DCB conn 02: 1255
nouveau  [     DRM] DCB conn 03: 0351
nouveau  [     DRM] BIOS FP mode: 1680x1050 (119880kHz pixel clock)
nouveau E[  PTHERM][0000:01:00.0] unhandled intr 0x000001e1
Slow work thread pool: Starting up
Slow work thread pool: Ready
nouveau W[     DRM] unknown connector type 55
nouveau W[     DRM] unknown connector type 51
[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] No driver support for vblank timestamp query.
nouveau  [     DRM] ACPI backlight interface available, not registering our own
nouveau  [     DRM] 3 available performance level(s)
nouveau  [     DRM] 0: core 169MHz shader 338MHz memory 100MHz voltage 1150mV fanspeed 100%
nouveau  [     DRM] 1: core 275MHz shader 550MHz memory 200MHz voltage 1150mV fanspeed 100%
nouveau  [     DRM] 2: core 400MHz shader 800MHz memory 400MHz voltage 1200mV fanspeed 100%
nouveau  [     DRM] c: core 275MHz shader 550MHz memory 99MHz voltage 1200mV
nouveau  [     DRM] setting performance level: 1
nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x88888888 FAULT at 0x100844
nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x11111111 FAULT at 0x100764
nouveau  [     DRM] > reclocking took 8299680ns

nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000010 FAULT at 0x10200c
nouveau  [     DRM] MM: using CRYPT for buffer copies
nouveau  [     DRM] allocated 1680x1050 fb: 0x60000, bo ffff8801b3e6d400
fbcon: nouveaufb (fb0) is primary device

So it seems we had some "reclocking" taking place, but we still have the "MMIO write" errors, and the screen is distorted in exactly the same way as before, even after just two minutes...

Any comments?

Thanks,
Andreas
Comment 50 AnAkkk 2013-12-18 15:48:57 UTC
For the reclocking to work on 3.13 you need to apply this patch:

http://cgit.freedesktop.org/~darktama/nouveau/commit/?h=devel-pm&id=74556533b2cc3dd787ba9fc8a346177116d1a68e

And you can change the performance level with /sys/class/drm/card0/device/pstate
(I think the command line options don't do anything anymore)

This new 3.13 reclocking might just hang your system though (it does for mine on the two computers I tested on).
Comment 51 Ilia Mirkin 2013-12-18 16:28:14 UTC
So actually I'm told that pcrypt is on the main clock, so that theory is out. Can you grab envytools (https://github.com/envytools/envytools) and run

nvapeek 10200c
nvapoke 10200c 10
nvapeek 10200c

and see what's in dmesg? Do you see additional MMIO read/write failures, or is it all good? What does the peek return? (I'm wondering if it's an initialization order issue or something.)

What issues are you seeing with the blob driver? I'd also still be interested in knowing whether a previously-known-good version of the blob still works.
Comment 52 Andreas Loew 2013-12-18 18:57:06 UTC
Hello again, Ilia,

> Can you grab envytools (https://github.com/envytools/envytools) and run

bad news (or maybe expected from what we have been seeing earlier): 

[aloew@aloew-lap envytools-master]$ ./nva/nvapeek 10200c
WARN: Can't probe 0000:01:00.0
PCI init failure!

[aloew@aloew-lap envytools-master]$ ./nva/nvapoke 10200c 10
WARN: Can't probe 0000:01:00.0
PCI init failure!

> and see what's in dmesg?

No additional output in dmesg - probably because of the "PCI init failure"...

> Do you see additional MMIO read/write failures, or is it 
> all good?  What does the peek return? (I'm wondering if it's an initialization 
> order issue or something.)

As above - and additionally, during the boot process, I also see the following messages in dmesg:

nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00fd94
nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x103d94
(...)
nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000010 FAULT at 0x10200c

> What issues are you seeing with the blob driver? 

As stated earlier: Every more recent version of NVidia's driver after their 295.09 causes unpredictable complete hangs at some point in time - sooner or later, but consistently (especially on GUI actions that initiate screen changes like closing windows or using the scrollbar). Fan runs at 100% and the only thing I can still do is a hard power-off...

> I'd also still be interested in knowing whether a previously-known-good 
> version of the blob still works.

I am 99.9% certain it does, as my Windows install with NVidia 285.09 driver also still runs fine, while any more recent Windows driver from NVidia hangs with the same symptoms as their Linux "blob" - I had just checked this last week with their latest Windows version 331.82, once again without any luck.

Will try to do a new install of old RHEL 6.1 or 6.2 onto a USB HDD either later today or tomorrow night and report back about this.

Is there anything else that we can try to find out why the above memory addresses seemingly cannot be accessed on my card?

Could this be a motherboard layout issue by Toshiba or some defective chips that NVidia has sold anyway to OEM manufacturers?

Maybe indeed you could ask your new friends/contacts at NVidia about this?

And please let me know if I shall check some other commands using the "envytools" (nice name!)...

Many thanks one more time & best regards,
Andreas
Comment 53 Andreas Loew 2013-12-18 18:59:21 UTC
Oops - typo: was referring to NVidia version 285.09 above (not 295.09).
Comment 54 Ilia Mirkin 2013-12-18 19:01:01 UTC
(In reply to comment #52)
> Hello again, Ilia,
> 
> > Can you grab envytools (https://github.com/envytools/envytools) and run
> 
> bad news (or maybe expected from what we have been seeing earlier): 
> 
> [aloew@aloew-lap envytools-master]$ ./nva/nvapeek 10200c
> WARN: Can't probe 0000:01:00.0
> PCI init failure!
> 
> [aloew@aloew-lap envytools-master]$ ./nva/nvapoke 10200c 10
> WARN: Can't probe 0000:01:00.0
> PCI init failure!

You need to run these as root.

> > I'd also still be interested in knowing whether a previously-known-good 
> > version of the blob still works.
> 
> I am 99.9% certain it does, as my Windows install with NVidia 285.09 driver
> also still runs fine, while any more recent Windows driver from NVidia hangs
> with the same symptoms as their Linux "blob" - I had just checked this last
> week with their latest Windows version 331.82, once again without any luck.

Ah OK, that's probably good enough of a test.

> Maybe indeed you could ask your new friends/contacts at NVidia about this?

I just bugged them about video decoding stuff a few weeks ago, don't want to use up all of my brownie points :)
Comment 55 Andreas Loew 2013-12-18 19:16:45 UTC
> > [aloew@aloew-lap envytools-master]$ ./nva/nvapoke 10200c 10
> > WARN: Can't probe 0000:01:00.0
> > PCI init failure!

> You need to run these as root.

Ouch - sorry - could have indeed had this idea myself... :-(

Here are the results as root:

[aloew@aloew-lap envytools-master]$ sudo ./nva/nvapeek 10200c 10
0010200c: SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS
[aloew@aloew-lap envytools-master]$ sudo ./nva/nvapoke 10200c 10
0010200c: ERR S
[aloew@aloew-lap envytools-master]$ sudo ./nva/nvapeek 10200c 10
0010200c: SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS
[aloew@aloew-lap envytools-master]$ 

And no new messages in "dmesg" output at all. Still not enlightening... :-(

BR,
Andreas
Comment 56 Andreas Loew 2013-12-18 19:27:16 UTC
> > I am 99.9% certain it does, as my Windows install with NVidia 285.09 driver
> > also still runs fine, while any more recent Windows driver from NVidia hangs
> > with the same symptoms as their Linux "blob" - I had just checked this last
> > week with their latest Windows version 331.82, once again without any luck.

> Ah OK, that's probably good enough of a test.

So I don't need to do this any more? That would be great, because I am pretty certain that it won't give any new results other than that the Linux 285.09 driver still works fine.

My card definitely has no new hardware defect. In case it might indeed be defective in some sense, then it has been from the very beginning...

BR,
Andreas
Comment 57 Ilia Mirkin 2013-12-18 19:27:47 UTC
(In reply to comment #55)
> > > [aloew@aloew-lap envytools-master]$ ./nva/nvapoke 10200c 10
> > > WARN: Can't probe 0000:01:00.0
> > > PCI init failure!
> 
> > You need to run these as root.
> 
> Ouch - sorry - could have indeed had this idea myself... :-(
> 
> Here are the results as root:
> 
> [aloew@aloew-lap envytools-master]$ sudo ./nva/nvapeek 10200c 10
> 0010200c: SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS

nvapeek 10200c without the 10. (Not sure what that does.... maybe reads out 0x10 regs)

> [aloew@aloew-lap envytools-master]$ sudo ./nva/nvapoke 10200c 10
> 0010200c: ERR S

Oh well. Some sort of error.

> [aloew@aloew-lap envytools-master]$ sudo ./nva/nvapeek 10200c 10
> 0010200c: SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS
> [aloew@aloew-lap envytools-master]$ 
> 
> And no new messages in "dmesg" output at all. Still not enlightening... :-(

Well, no one's heard of a "missing" PCRYPT before, but it's certainly conceivable that certain blocks were omitted. I'd feel better with that diagnosis if more people chimed in saying that they had the same issue.
Comment 58 Andreas Loew 2013-12-18 19:36:57 UTC
> > [aloew@aloew-lap envytools-master]$ sudo ./nva/nvapeek 10200c 10
> > 0010200c: SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS SS
> 
> nvapeek 10200c without the 10. (Not sure what that does.... maybe reads out
> 0x10 regs)

yes - seems to read 10 registers:

[aloew@aloew-lap envytools-master]$ sudo ./nva/nvapeek 10200c   
0010200c: SS

> Well, no one's heard of a "missing" PCRYPT before, but it's certainly
> conceivable that certain blocks were omitted. I'd feel better with that
> diagnosis if more people chimed in saying that they had the same issue.

From the screenshots of the corrupted graphics, I definitely think that this is the exact same issue.

But I fully agree that it is a pity that none of the folks who had raised this previously and/or commented are reacting now that it has probably been tracked down to its root cause.

And something else seems interesting:

All other people who saw the corruption issue (except nemasu with his/her 8800 GTS, who might have seen a different issue indeed from the dmesg output) also were using early G86 chips, particularly 8400M-based, and mostly the "mobile" variants...

Maybe NVidia omitted part of the 8400 functionality in the mobile variants? This would again make up a nice (and easy) question to them...!? ;-)

Thanks again & BR,
Andreas
Comment 59 Ilia Mirkin 2013-12-19 15:04:56 UTC
There was a bug in nvapeek/poke (it was using the wrong address space by default), can you update your pull and try again? [That explains why you saw 'S' in the output.]
Comment 60 Andreas Loew 2013-12-19 15:20:02 UTC
> There was a bug in nvapeek/poke (it was using the wrong address space by
> default), can you update your pull and try again? [That explains why you saw
> 'S' in the output.]

Of course - here you are:

The version I used is from the "Download ZIP" button in GitHub:
https://github.com/envytools/envytools/archive/master.zip

[aloew@aloew-lap nva]$ sudo ./nvapeek 10200c
...
[aloew@aloew-lap nva]$ sudo ./nvapoke 10200c 10
[aloew@aloew-lap nva]$ sudo ./nvapeek 10200c
...

Maybe now another bug, as we don't seem to get any hex address and/or value output?

Please advise if we need to pass any additional parameters to get hex output...

BR,
Andreas
Comment 61 Andreas Loew 2013-12-19 15:26:34 UTC
Hmm... Looking at the code for nvapeek, I fear that nva_rd(...) still did not return any meaningful data, as it looks like we get s == 0...!?


                int s = 0;
                for (i = j = 0; i < 16 && i < b; i+=rs.regsz, j++) {
                        e[j] = nva_rd(&rs, a+i, &z[j]);
                        if (e[j] || z[j])
                                s = 1;
                }
                if (s) {
                        ls = 1;
                        printf ("%08x:", a);
                        for (i = j = 0; i < 16 && i < b; i+=rs.regsz, j++) {
                                nva_rsprint(&rs, e[j], z[j]);
                        }
                        printf ("\n");
                } else {
                        if (ls) printf ("...\n"), ls = 0;
                }

BR,
Andreas
Comment 62 Andreas Loew 2013-12-19 15:32:45 UTC
But generally, nvapeek seems to work fine now:

[aloew@aloew-lap nva]$ sudo ./nvapeek 0
00000000: 086a00a2

Looking forward to your comments...

BR,
Andreas
Comment 63 Ilia Mirkin 2013-12-19 15:40:11 UTC
Yeah, it prints "..." instead of 0. This makes a lot of sense when you're peeking a large range full of 0's. Anyways, were there any additional messages in dmesg, e.g. MMIO read/write failures as a result?
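For anyone reading along, here is a minimal sketch of that elision behaviour. It is not the envytools code: `dump_regs` is a hypothetical helper, it prints one 32-bit register per line rather than 16 bytes per row, and it assumes the elision flag starts out set (which would match the lone "..." seen for 10200c above).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of envytools: format `count` 32-bit register
 * values starting at address `base` into `out`, one register per line.
 * Each run of all-zero registers is collapsed into a single "..." line. */
size_t dump_regs(char *out, size_t outsz, uint32_t base,
                 const uint32_t *vals, size_t count)
{
    size_t used = 0;
    int elide_next = 1; /* assumption: starts set, so a lone zero prints "..." */

    if (outsz == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        int n;

        if (vals[i] != 0) {
            n = snprintf(out + used, outsz - used, "%08x: %08x\n",
                         (unsigned)(base + 4 * i), (unsigned)vals[i]);
            elide_next = 1; /* the next zero run gets its own "..." */
        } else if (elide_next) {
            n = snprintf(out + used, outsz - used, "...\n");
            elide_next = 0; /* stay quiet for the rest of this zero run */
        } else {
            continue;
        }

        if (n < 0 || (size_t)n >= outsz - used) {
            out[used] = '\0'; /* drop the partially written line */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

With a single zero value this yields just "...", and with the BOOT0 value 0x086a00a2 at address 0 it yields the `00000000: 086a00a2` line from comment 62.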
Comment 64 Andreas Loew 2013-12-19 15:49:42 UTC
Yes, we indeed see the same well-known:

nouveau E[    PBUS][0000:01:00.0] MMIO read of 0x00010000 FAULT at 0x10200c

at any nvapeek read attempt, and

nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000010 FAULT at 0x10200c

at any nvapoke attempt... :-(

BR,
Andreas
Comment 65 Andreas Loew 2013-12-19 15:55:36 UTC
Oops - I just remembered that I am booting my kernel with "nouveau.config=PCRYPT=0" in the meantime...

Does this make any difference, i.e. do I need to retry the nvapeek/nvapoke sequence without this kernel option?

Sorry & thanks,
Andreas
Comment 66 Andreas Loew 2014-01-08 11:43:13 UTC
A Happy New Year to everybody! :-)

Just wondering whether you intend to simply close this issue down with the workaround "solution" for me to set kernel option

nouveau.config=PCRYPT=0

or whether you are still interested in finding out *why* my Quadro NVS 130M and other 8400M-based cards do not seem to support this functionality (or what might need to be done differently in the driver to ensure they do).


Additional interesting information:

I have been informed that folks at NVIDIA have recently succeeded in tracking down a Solaris hang issue in their proprietary Unix drivers ("blob") that affected exactly Quadro NVS 130M cards (AFAIK, NVIDIA IR # 1172500).

I can indeed reproduce these hangs on Solaris 11.1, so this issue probably matches the unpredictable hangs that I have been also seeing with the Linux blob versions > 285.05.09 that made their drivers unusable for me.

AFAIK, their fix is scheduled to be released in the third update to their R331 series in February.


So how would you like to proceed regarding this issue?

Thanks & BR,
Andreas
Comment 67 Ilia Mirkin 2014-01-08 20:51:40 UTC
Could you provide the output of

nvapeek 154c
nvapeek 1540

Those registers specify which engines are there. I think we're ignoring them in nouveau...
Comment 68 Ilia Mirkin 2014-01-08 22:44:18 UTC
Created attachment 91714 [details] [review]
patch to honor disabled engines

Give this a shot (without forcing PCRYPT=0). You should hopefully see a message saying that it and a few other engines are disabled. This needs some more testing on a wider variety of cards before I'll send it upstream, but it may be what you need.
Comment 69 Andreas Loew 2014-01-09 11:23:09 UTC
> Could you provide the output of
> 
> nvapeek 154c
> nvapeek 1540
> 
> Those registers specify which engines are there. I think we're ignoring them
> in nouveau...

OK - that might indeed explain the issues seen...
Here you are:

[aloew@aloew-lap nva]$ sudo ./nvapeek 154c
0000154c: 0000009c

[aloew@aloew-lap nva]$ sudo ./nvapeek 1540
00001540: b1010001

Looking at the patch you provided, if the rusty binary arithmetics chip in my brain is still valid, this means for my case:

vdec = nv_rd32(device, 0x1540) & 0x40000000;

0xb(...) = 1011(...)
0x4(...) = 0100(...)

=> for me, vdec indeed is 0x00000000, i.e. false

and as my chipset is 0x86, furthermore:

MPEG -> disabled
VP -> disabled

and for the dynamic features,

0x9c = 10011100 binary

0x20 = 00100000 binary
0x40 = 01000000 binary

as 0x9c & 0x20 == 0x00, BSP -> disabled
as 0x9c & 0x40 == 0x00, PCRYPT -> disabled

which would probably confirm that for me your patch is correct.
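The arithmetic above can be written down as a small C sketch. This only encodes the reading given in this comment for a 0x86 chipset and is not the patch itself: the `ENG_*` bits and the function name `disabled_engines_nv86` are hypothetical, and nouveau uses its own NVDEV_ENGINE_* indices.

```c
#include <stdint.h>

/* Hypothetical engine bits for illustration only. */
enum {
    ENG_MPEG  = 1u << 0,
    ENG_VP    = 1u << 1,
    ENG_BSP   = 1u << 2,
    ENG_CRYPT = 1u << 3,
};

/* Return a mask of engines to treat as disabled on a 0x86 chipset,
 * given the raw values read from registers 0x1540 and 0x154c. */
unsigned disabled_engines_nv86(uint32_t r1540, uint32_t r154c)
{
    unsigned mask = 0;

    /* "vdec" bit clear: MPEG and VP are treated as disabled. */
    if (!(r1540 & 0x40000000u))
        mask |= ENG_MPEG | ENG_VP;

    /* A clear bit in 0x154c means the engine is disabled. */
    if (!(r154c & 0x20u))
        mask |= ENG_BSP;
    if (!(r154c & 0x40u))
        mask |= ENG_CRYPT;

    return mask;
}
```

Calling `disabled_engines_nv86(0xb1010001, 0x0000009c)` with my register values returns all four bits set, i.e. MPEG, VP, BSP and PCRYPT disabled.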

That said, I will apply this patch to my current stock RHEL6 kernel and report back later today on whether this works fine for me (which it indeed should, based on the above considerations!).

Thanks a million - great work! :-)

Best regards,
Andreas
Comment 70 Ilia Mirkin 2014-01-09 16:15:35 UTC
Created attachment 91765 [details] [review]
patch to honor hw disables after vbios

Unfortunately the first patch runs before VBIOS, so if the manufacturer explicitly disables an engine for some reason (by writing a 0 to those bits) we should probably honor that. This patch does that (actually 2 patches munged into 1). I've tested it on my NV98 and it correctly doesn't disable anything, but would be nice to test it on a card that _does_ disable stuff.

[note, this patch replaces the first patch, not in addition to it]
Comment 71 Andreas Loew 2014-01-09 16:35:26 UTC
Hello Ilia,

hmm - you just caught me with the update five minutes after I had started the rpmbuild with the previous version... ;-)

Unfortunately, while I could make the first patch apply to a current RHEL kernel source with only one change (core/engine/device.c -> core/subdev/device.c), the new patch will need much more rework to make it compile against a RHEL kernel.

I am therefore looking into getting a 3.12 kernel from the Oracle Linux "playground":

http://public-yum.oracle.com/repo/OracleLinux/OL6/playground/latest/x86_64/

Would 3.12.6 be an appropriate version to apply your updated patch to successfully?

Thanks & BR,
Andreas
Comment 72 Ilia Mirkin 2014-01-09 16:43:33 UTC
(In reply to comment #71)
> Would 3.12.6 be an appropriate version to apply your updated patch to
> successfully?

I'm working against, effectively, 3.13-rc8. I'd think it would apply to 3.12, and just about any other semi-recent kernel, but I guess RHEL does something special? Not sure. That subdev -> engine move happened in dded35dee3 which went into 3.10, so I guess you're using something old.
Comment 73 Andreas Loew 2014-01-09 16:50:01 UTC
Yes, definitely, a RHEL6 stock kernel is *very* old (2.6.32.*) - but due to a kernel drm/nouveau module update from 3.x source that they recently did for RHEL 6.5, it also suddenly became new enough to make me see this issue... ;-)

Have just successfully applied the updated patch to 3.12.6, so my rpmbuild is running! :-)

You can expect my results in about two hours or so (will have dinner in between).

Thanks & BR,
Andreas
Comment 74 Andreas Loew 2014-01-09 17:07:40 UTC
Just received

drivers/gpu/drm/nouveau/core/subdev/devinit/nv50.c:164: error: 'NVDEV_ENGINE_VIC' undeclared (first use in this function)

but "fixed" it for me by commenting out the lines for a 0xaf card (I have a 0x86 type anyway, so this code does not apply to me):

+	case 0xaf:
+		/* if (!(r154c & 0x40)) */
+		/*	device->disable_mask |= 1ULL << NVDEV_ENGINE_VIC; */
+		/* fallthrough */

BR,
Andreas
Comment 75 Andreas Loew 2014-01-09 21:56:49 UTC
Created attachment 91786 [details]
Complete dmesg output booting 3.12.6 with "hwunits.patch" applied (nouveau.debug=debug)
Comment 76 Andreas Loew 2014-01-09 21:57:29 UTC
Created attachment 91787 [details]
nouveau-related dmesg output booting 3.12.6 with "hwunits.patch" applied (nouveau.debug=debug)
Comment 77 Andreas Loew 2014-01-09 22:03:57 UTC
Sorry that it took me longer to get back here - I needed an additional rpmbuild run due to running out of disk space for my first attempt...

But I can give an all clear signal - at least for my machine, AFAIK, everything seems to be fine:

Kernel command line: ro root=UUID=034d34cd-a464-4ee3-8db9-d6061a318a16 rd_NO_LUKS LANG=en_US.UTF-8  KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_NO_LVM rd_NO_DM nouveau.debug=debug rhgb quiet

nouveau  [  DEVICE][0000:01:00.0] BOOT0  : 0x086a00a2
nouveau  [  DEVICE][0000:01:00.0] Chipset: G86 (NV86)
nouveau  [  DEVICE][0000:01:00.0] Family : NV50
nouveau  [   VBIOS][0000:01:00.0] checking PRAMIN for image...
nouveau  [   VBIOS][0000:01:00.0] ... appears to be valid
nouveau  [   VBIOS][0000:01:00.0] using image from PRAMIN
nouveau  [   VBIOS][0000:01:00.0] BIT signature found
nouveau  [   VBIOS][0000:01:00.0] version 60.86.49.00.27
(...)
nouveau  [   PMPEG][0000:01:00.0] hardware is marked as disabled
nouveau  [     PVP][0000:01:00.0] hardware is marked as disabled
nouveau  [  PCRYPT][0000:01:00.0] hardware is marked as disabled
nouveau  [    PBSP][0000:01:00.0] hardware is marked as disabled

and also, everything is fine afterwards (as PCRYPT seems to indeed have been properly disabled). :-)

What do you say? Do you agree that everything turned out as expected from my nvapeek results?

Thanks & BR,
Andreas
Comment 78 Ilia Mirkin 2014-01-09 22:42:16 UTC
Great news! I'll update the bug when this makes it upstream (or if we have further questions about your hardware). FWIW I've been going around asking people to report registers 1540/154c to me, and so far everyone except you and one other person having trouble with nouveau has had them listed as everything enabled.

Thanks for tracking down the commit that caused the issue, that was instrumental!
Comment 79 Andreas Loew 2014-01-09 23:06:05 UTC
You're welcome! :-)

I did do this in my very own interest, because the OL6/RHEL6 install on my main work laptop all of a sudden had this distortion issue when RHEL updated the drm/nouveau module to an affected codebase in RHEL 6.5, so I definitely needed a solution for this (other than get a new laptop)...

One final request from my side, as I don't have commercial RHEL6 support (I am using the free OL6 clone):

Hoping that you have pretty good contact/access to Ben Skeggs (who I think officially owns the nouveau modules at Red Hat), can you please approach him and ask him to make sure that Red Hat also applies a (backported) version of this patch to their mainline stock RHEL 6.5 kernels?

That would be great, as this is definitely needed to ensure that all those people with the affected older/low-end NVIDIA notebook chips - such as myself (and all the other now unfortunately silent people who initially created this issue) - will no longer be affected by this issue in the current RHEL 6 kernels (and won't need the explicit workaround using the kernel parameter PCRYPT=0).

Thanks a million for your kind help & best regards from Germany,
Andreas
Comment 80 Thomas 2014-01-15 23:21:26 UTC
Hello,

I have the same NVIDIA GeForce NVS 130M with the disabled functions.
I checked with nvapeek:
0000154c: 0000009c
00001540: b1010001

uname -a delivers
Linux mobuntu 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I do not have any issues with distorted graphics during normal usage but my problem is that resume from suspend mode makes X hang.

I also have these errors in dmesg 

[   18.985158] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00fd94
[   18.986213] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x103d94
[   19.026027] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000010 FAULT at 0x10200c

but also

[   18.984164] nouveau E[  PTHERM][0000:01:00.0] unhandled intr 0x00000161

When I use the kernel option nouveau.config=PCRYPT=0 it doesn't eliminate the errors and X still hangs when resuming.
I was not sure if I have to set the parameter in quotes.
As you can see I'm not a linux specialist ;)
Comment 81 Ilia Mirkin 2014-01-15 23:31:10 UTC
(In reply to comment #80)
> Hello,
> 
> I have the same NVIDIA GeForce NVS 130M with the disabled functions.
> I checked with nvapeek:
> 0000154c: 0000009c
> 00001540: b1010001
> 
> uname -a delivers
> Linux mobuntu 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013
> x86_64 x86_64 x86_64 GNU/Linux
> 
> I do not have any issues with distorted graphics during normal usage but my
> problem is that resume from suspend mode makes X hang.
> 
> I also have these errors in dmesg 
> 
> [   18.985158] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000
> FAULT at 0x00fd94
> [   18.986213] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000
> FAULT at 0x103d94
> [   19.026027] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000010
> FAULT at 0x10200c

These errors should go away with the patch.

> 
> but also
> 
> [   18.984164] nouveau E[  PTHERM][0000:01:00.0] unhandled intr 0x00000161

I believe this is unrelated.

> 
> When I use the kernel option nouveau.config=PCRYPT=0 it doesn't eliminate
> the errors and X still hangs when resuming.

It should eliminate the 10200c error. The others are from PVP and PBSP, you could do like nouveau.config=PCRYPT=0,PVP=0,PBSP=0,PMPEG=0 -- that should have the same effect as my patch for your hardware. (I think.)

> I was not sure if I have to set the parameter in quotes.

Not necessary, but I *think* it'll work with quotes as well. Not sure.

> As you can see I'm not a linux specialist ;)

OK, then you have some different issue. I would recommend filing a fresh issue with all the relevant info.
Comment 82 Andreas Loew 2014-01-15 23:57:27 UTC
Hi Thomas,

> I have the same NVIDIA GeForce NVS 130M with the disabled functions.
> I checked with nvapeek:
> 0000154c: 0000009c
> 00001540: b1010001

great - finally somebody who confirms this issue.

> uname -a delivers
> Linux mobuntu 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013
> x86_64 x86_64 x86_64 GNU/Linux

> I do not have any issues with distorted graphics during normal usage but my
> problem is that resume from suspend mode makes X hang.

> I also have these errors in dmesg 
> 
> [   18.985158] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000
> FAULT at 0x00fd94
> [   18.986213] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000
> FAULT at 0x103d94
> [   19.026027] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000010
> FAULT at 0x10200c

Hmm - your kernel and your nvapeek results clearly suggest you should be affected...

Have you enabled compiz (i.e. OpenGL-based 3D acceleration features)? I assume that so far, you haven't (it does not seem to be active in Ubuntu by default), which most likely is the only reason why you are not seeing the distortion issue (so far).

See e.g.

http://www.howtoforge.com/install-compiz-on-the-unity-desktop-on-ubuntu-12.04-precise-pangolin

(depending on your particular Ubuntu version) on how to enable compiz. I am almost certain that once you have done so, you will also see the distorted graphics, but you now already know the fix... ;-)

> [   18.984164] nouveau E[  PTHERM][0000:01:00.0] unhandled intr 0x00000161

This last "PTHERM" error seems to be a different, unrelated issue.

> When I use the kernel option nouveau.config=PCRYPT=0 it doesn't eliminate
> the errors and X still hangs when resuming.

Hmm - interesting, as I clearly don't have any issues with suspend/resume. Which laptop do you have? Did you already update your BIOS to the latest available version?

> I was not sure if I have to set the parameter in quotes.

No, you don't (and AFAIK, you even must not). Ilia has already proposed the correct workaround for the distortion issue (until your distro of choice has integrated the new fix) - add this:

nouveau.config=PCRYPT=0,PVP=0,PBSP=0,PMPEG=0

to your grub kernel parameters. Having done so, all "MMIO write" errors in dmesg must be gone (they are for me!), otherwise something else is still wrong for you in addition.
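To make the option format explicit (comma-separated NAME=VALUE pairs, no quotes), here is a minimal C sketch of looking up one name in such a string. This is not nouveau's real option parser; `config_lookup` is a hypothetical helper written only to illustrate the format.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical lookup helper, not nouveau's actual parser.
 * Scans "NAME=VALUE,NAME=VALUE,..." and stores VALUE for `name`.
 * Returns 1 if the name was found, 0 otherwise. */
int config_lookup(const char *cfg, const char *name, long *value)
{
    size_t nlen = strlen(name);

    while (*cfg) {
        const char *end = strchr(cfg, ',');
        size_t len = end ? (size_t)(end - cfg) : strlen(cfg);

        /* The entry must start with exactly "NAME=". */
        if (len > nlen && strncmp(cfg, name, nlen) == 0 && cfg[nlen] == '=') {
            *value = strtol(cfg + nlen + 1, NULL, 0);
            return 1;
        }
        if (!end)
            break;
        cfg = end + 1; /* move on to the next entry */
    }
    return 0;
}
```

For the string `PCRYPT=0,PVP=0,PBSP=0,PMPEG=0`, looking up `PBSP` finds the value 0, while looking up a name that is not listed leaves the value untouched and returns 0.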

Hope this helps & best regards,
Andreas


BTW @ Ilia:
Did you already have a chance to contact Ben Skeggs about applying the fix to mainline RHEL 6.5 (and above) kernels?
Comment 83 Ilia Mirkin 2014-01-16 00:01:59 UTC
(In reply to comment #82)
> BTW @ Ilia:
> Did you already have a chance to contact Ben Skeggs about applying the fix
> to mainline RHEL 6.5 (and above) kernels?

That seems a little premature given that it's not even in the mainline kernel. However I would recommend that once it is, you file a redhat issue to make sure it gets backported to the whatever. I have no knowledge of, and do not care about RHEL or any non-mainline kernel. If you do, work with whatever processes they have. I bug Ben about enough stuff already :)
Comment 84 Andreas Loew 2014-01-16 00:18:18 UTC
Hi Ilia,

> That seems a little premature given that it's not even in the mainline
> kernel. However I would recommend that once it is, you file a redhat issue
> to make sure it gets backported to the whatever. I have no knowledge of, and
> do not care about RHEL or any non-mainline kernel. If you do, work with
> whatever processes they have. I bug Ben about enough stuff already :)

ouch - that's a pity... :-(

As stated earlier, as I am using the free (only as in beer...) Oracle Linux version rather than a commercial paid RHEL license, I cannot file any issues with them, so I was hoping about you being able to raise this with him within the nouveau team. It clearly deserves a fix, but I won't be able to drive anything myself here due to the lack of a paid license... :-(

Oh, and one more thing by the way:

Interestingly, I can also confirm that for me, the proprietary NVidia "blob" Unix driver version 331.38 (which has just been released this week):

https://devtalk.nvidia.com/default/topic/672875

indeed has also fixed the long-standing hang issue with their drivers for my Quadro NVS 130M on both Linux and Solaris. Even better news is that the fix will also be integrated into the next R331 release for Windows (it is not yet in Windows versions 331.93 or 332.21)!

But while it took NVidia a little less than two years between introducing their regression bug (all releases since 285.x are affected) and fixing it, you/the nouveau team have tracked down and fixed this issue in just a couple of days... :-)

So thanks again for your great work on nouveau! :-)

BR,
Andreas
Comment 85 Thomas 2014-01-16 18:34:01 UTC
(In reply to comment #82)
> > I also have these errors in dmesg 
> > 
> > [   18.985158] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000
> > FAULT at 0x00fd94
> > [   18.986213] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000000
> > FAULT at 0x103d94
> > [   19.026027] nouveau E[    PBUS][0000:01:00.0] MMIO write of 0x00000010
> > FAULT at 0x10200c
> 
> Hmm - your kernel and your nvapeek results clearly suggest you should be
> affected...
> 
> Have you enabled compiz(...)

No I don't think so.

> > [   18.984164] nouveau E[  PTHERM][0000:01:00.0] unhandled intr 0x00000161
> 
> This last "PTHERM" error seems to be a different, unrelated issue.
> 
> > When I use the kernel option nouveau.config=PCRYPT=0 it doesn't eliminate
> > the errors and X still hangs when resuming.
> 
> Hmm - interesting, as I clearly don't have any issues with suspend/resume.
> Which laptop do you have? Did you already update your BIOS to the latest
> available version?
> 
> > I was not sure if I have to set the parameter in quotes.
> 
> No, you don't (and AFAIK, you even must not). Ilia has already proposed the
> correct workaround for the distortion issue (until your distro of choice has
> integrated the new fix) - add this:
> 
> nouveau.config=PCRYPT=0,PVP=0,PBSP=0,PMPEG=0
> 
> to your grub kernel parameters. Having done so, all "MMIO write" errors in
> dmesg must be gone (they are for me!), otherwise something else is still
> wrong for you in addition.

Ok, now after adding the whole bunch to the kernel opts the three PBUS errors are gone. For the resume failure I will open a new issue.

How can I push the integration of such a fix into another distro?

Many thanks!
-Thomas
Comment 86 Andreas Loew 2014-02-09 16:06:33 UTC
Hello Ilia,

has there been any progress so far in getting this into the mainstream Linux kernel (or mainstream git) for the next official kernel release?

I'd like to make an attempt to get this patch (or rather, a backport of it) into official RHEL 6.x kernels, but I'd like to point to an official kernel patch in order to do so.

Many thanks & best regards,
Andreas
Comment 87 Ilia Mirkin 2014-02-09 21:03:38 UTC
(In reply to comment #86)
> Hello Ilia,
> 
> has there been any progress so far in getting this into the mainstream Linux
> kernel (or mainstream git) for the next official kernel release?

This should be upstream as of commit 4019aaa2b314a5be9886ae1db64ff8c6d3c060ed, available in 3.14-rc1.
Comment 88 Andreas Loew 2014-02-10 18:39:55 UTC
(In reply to comment #87)

> > has there been any progress so far in getting this into the mainstream Linux
> > kernel (or mainstream git) for the next official kernel release?

> This should be upstream as of commit
> 4019aaa2b314a5be9886ae1db64ff8c6d3c060ed, available in 3.14-rc1.

Many thanks, Ilia! :-)

BR,
Andreas

