Bug 40630

Summary: Bad memory clock set on GeForce 9500 GT after resume
Product: xorg Reporter: Diego Viola <diego.viola>
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: diego.viola, maximlevitsky
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments:
Description                                            Flags
NOUVEAU IRC LOG                                        none
kernel log (dmesg)                                     none
picture displaying the problem (graphics corruption)   none
Another picture displaying the problem                 none
one more picture showing the problem                   none
glxgears after suspend/resume                          none
glxgears after suspend/resume maximized                none
kernel .config                                         none
kernel .config                                         none
ctxvals mmiotrace                                      none
ctxprog mmiotrace                                      none
ctxprog prepared for loading                           none
ctxvals prepared for load                              none
patch to allow nouveau load binary ctxprog             none
retrieve kernel log after crash patch                  none
SuperTuxKart crash AFTER suspend dmesg (KERNEL LOG)    none
list of POST register writes                           none
demmio_replay patch                                    none
Temporary hack to fix this                             none
vbios.rom                                              none
lspci -a                                               none
More complete MPLL programming during POST             none

Description Diego Viola 2011-09-04 19:46:44 UTC
Description:

When I suspend my system with pm-suspend and then resume it, my graphics get corrupted.

After that, the only way to get rid of the graphics glitches is to reboot the computer.

I have tried this with Linux 3.0.4 and 3.0.1, and I get the same problem with both.

Hardware:

Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz
2 GB RAM
01:00.0 VGA compatible controller: nVidia Corporation G96 [GeForce 9500 GT] (rev a1)


Software:

Arch Linux x86-64
Linux 3.0.4
KDE 4.7
xorg-server 1.10.3.901-1
xf86-video-nouveau 0.0.16_git20110726-1
Comment 1 Diego Viola 2011-09-04 19:48:25 UTC
How to reproduce the problem:

1- install archlinux x86-64
2- pacman -Syu
3- install xf86-video-nouveau, nouveau-dri
4- install KDE4.
5- pm-suspend as root
6- resume the computer
7- watch the graphics glitches
Comment 2 Diego Viola 2011-09-04 19:50:29 UTC
Please note that I have OpenGL compositing turned on in KWin. I have tried disabling it, but the problem persists. I also tried restarting X, but the problem is still there. The only workaround I have found is to reboot the computer completely, but as soon as I suspend again the graphics get corrupted again.
Comment 3 Diego Viola 2011-09-04 19:53:33 UTC
Created attachment 50904 [details]
NOUVEAU IRC LOG

If you need more detailed information, please see the attached IRC log from #nouveau. Feel free to ask me questions; I will gladly provide any help I can.
Comment 4 Diego Viola 2011-09-04 20:02:57 UTC
Created attachment 50905 [details]
kernel log (dmesg)

Please find the dmesg attached. I captured it right after reproducing the problem.
Comment 5 Diego Viola 2011-09-04 20:22:29 UTC
Created attachment 50906 [details]
picture displaying the problem (graphics corruption)

Attached some pictures displaying the problem after having reproduced it.
Comment 6 Diego Viola 2011-09-04 20:24:14 UTC
Created attachment 50907 [details]
Another picture displaying the problem

Another picture showing the problem.
Comment 7 Diego Viola 2011-09-04 20:26:08 UTC
Created attachment 50908 [details]
one more picture showing the problem

One more picture showing the problem.
Comment 8 Diego Viola 2011-09-04 20:29:46 UTC
In the image that shows Firefox, I had selected some text and dragged it around.
Comment 9 Diego Viola 2011-09-04 20:33:58 UTC
I also sometimes notice that when I call krunner (ALT + F2) in KDE and type something like "konsole", the pixels where krunner is drawn get all messed up too. I don't even need to suspend to reproduce this. It doesn't always happen, but it happens most of the time. When I'm able to reproduce it I will take a picture.

Thanks.
Comment 10 maximlevitsky 2011-09-05 09:18:42 UTC
I had exactly the same problem.
At the time it looked like two bugs. One was in Mesa, which didn't upload relocations for the constant buffer. The other was possibly in the kernel, but I am not very sure about that, and in any case your kernel is new enough that you don't need to worry about it.

So please make sure you have a recent enough Mesa (7.11 should be enough).

Also, to reproduce the problem easily, the debugfs file /sys/kernel/debug/dri/0/evict_vram was added.
Just read that file a few times, like this:

cat /sys/kernel/debug/dri/0/evict_vram

(That assumes you have debugfs mounted on /sys/kernel/debug.)
Comment 11 Diego Viola 2011-09-05 13:31:47 UTC
(In reply to comment #10)
> I had exactly the same problem.
> Then it looked like 2 bugs. One was in mesa and the way it didn't upload
> relocations for constant buffer, and other problem possible was in kernel, but
> I am not very sure about that, and anyway you have new enough kernel not to
> worry about.
> 
> So please make sure you have recent enough mesa (7.11 should be enough).
> 
> Also to reproduce that easily (cat /sys/kernel/debug/dri/0/evict_vram) was
> added.
> Just read that file few times like that:
> 
> cat /sys/kernel/debug/dri/0/evict_vram
> 
> (That assumes you have debugfs mounted on /sys/kernel/debug)

Hello, I have mesa 7.11-2.

[diego@myhost ~]$ pacman -Q mesa
mesa 7.11-2
[diego@myhost ~]$ 

I tried mounting debugfs on /sys/kernel/debug/ and reading /sys/kernel/debug/dri/0/evict_vram a few times with cat, but I cannot reproduce the issue that way.
Comment 12 maximlevitsky 2011-09-05 14:11:51 UTC
Could you try reading that debug file many times, possibly in a loop?
If that doesn't work, then it's a different issue, I guess.

Please try updating to Mesa from git, just in case the issue is fixed there.
Comment 13 maximlevitsky 2011-09-05 14:13:49 UTC
Another thing: try compiz and see if it has the same issue.
Comment 14 Diego Viola 2011-09-05 15:55:36 UTC
(In reply to comment #12)
> could you try reading that debug file many times, possibly in a loop?
> If that doesn't work, then its different issue I guess.
> 
> Please try to update to mesa from git, just in case issue is fixed there.

I've been running "cat /sys/kernel/debug/dri/0/evict_vram" in a while loop, once per second, for more than an hour:

$ while true; do cat /sys/kernel/debug/dri/0/evict_vram; sleep 1; done

but I couldn't reproduce the issue that way.

How can I upgrade mesa from git? Any instructions?
Comment 15 Diego Viola 2011-09-05 15:57:38 UTC
BTW, running this without a delay crashed my machine:

while true; do cat /sys/kernel/debug/dri/0/evict_vram; done

When I added "sleep 1" it didn't crash my machine; I believe I exhausted my hardware.

Still, I couldn't reproduce the issue. Any instructions for upgrading Mesa to git?

Thanks.
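
A bounded version of the read loop from comments 12 to 15 could look like the sketch below. It is only an illustration: the function name, the iteration count and the delay are assumptions, and only the debugfs path comes from comment 10. Keeping a delay between reads follows the observation above that the unthrottled loop crashed the machine.

```python
import time
from pathlib import Path

# Path taken from comment 10; assumes debugfs is mounted on /sys/kernel/debug.
EVICT_VRAM = Path("/sys/kernel/debug/dri/0/evict_vram")


def read_repeatedly(path, iterations=120, delay=1.0):
    """Read `path` a fixed number of times, sleeping between reads.

    Returns the number of completed reads. The defaults are illustrative,
    not values recommended by the nouveau developers.
    """
    done = 0
    for _ in range(iterations):
        path.read_bytes()  # per comment 10, reading the file is the trigger
        done += 1
        time.sleep(delay)
    return done
```

Running `read_repeatedly(EVICT_VRAM)` as root would then read the file once per second for about two minutes, which is in line with the "few minutes" suggested in comment 16.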
Comment 16 maximlevitsky 2011-09-05 16:10:45 UTC
Oh, I didn't mean for you to run that for an hour! A few minutes should have done it.
It's clear you have a different bug.
So the only suggestion left is to update Mesa (and try compiz too).
It's not difficult at all to compile Mesa; I can even guide you if you join IRC now.
Comment 17 Diego Viola 2011-09-05 16:46:04 UTC
(In reply to comment #16)
> Oh, I didn't mean to make you run that for a hour!, a few minutes should have
> done it.
> Its clear you have another bug.
> So only suggestion left is to update mesa (and try compiz too).
> Its not difficult at all to compile mesa, I can even guide you if you join IRC
> now

What's your nick on IRC?
Comment 18 Diego Viola 2011-09-05 18:05:11 UTC
I tried compiz, and changing window managers doesn't help. The graphics corruption is still there after a suspend/resume.
Comment 19 Diego Viola 2011-09-06 14:03:01 UTC
Any ideas?
Comment 20 Diego Viola 2011-09-06 16:34:04 UTC
I tried running openbox, and kwin without OpenGL compositing, and after suspending/resuming they don't exhibit this problem.

The problem only occurs when OpenGL compositing is enabled.
Comment 21 Diego Viola 2011-09-06 17:27:17 UTC
I have compiled libdrm and Mesa from Git, the problem still persists.
Comment 22 Diego Viola 2011-09-06 18:59:13 UTC
Created attachment 50946 [details]
glxgears after suspend/resume

We (MaximLevitsky and I) have found that no 3D app survives suspend/resume.

When I start glxgears after suspend/resume, the graphics look corrupted.

Screenshots of glxgears after suspend/resume are attached (and also logs).
Comment 23 Diego Viola 2011-09-06 18:59:41 UTC
Created attachment 50947 [details]
glxgears after suspend/resume maximized
Comment 24 maximlevitsky 2011-09-06 19:25:15 UTC
So now we have found that after suspend, starting glxgears (and that means pretty much any 3D app) produces the corruption shown in the screenshot in the previous comment.

On the first glxgears invocation after an s2ram cycle, the kernel log gets this:

[  297.625761] EXT4-fs (sda2): re-mounted. Opts: commit=0
[  297.628545] EXT4-fs (sda3): re-mounted. Opts: commit=0
[  306.243318] eth0: no IPv6 routers present
[  347.083807] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_MP - TP0: Unhandled ustatus 0x00020000
[  347.083812] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[  347.083818] [drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x0001b07000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010

According to envytools, this is the MP1 bit of TP0.MPC_TRAP.

The failing PFIFO method isn't always present in the kernel log (e.g. it wasn't there when we tested compiz); it decodes to the QUERY_GET method.

Any ideas?
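
For comparing traps across runs, the "PGRAPH - ch ..." line can be split into its fields with a small parser. This is only a sketch: the regular expression and the function name are assumptions based on the log format quoted above, not part of nouveau or envytools.

```python
import re

# Matches lines such as:
#   PGRAPH - ch 5 (0x0001b07000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010
TRAP_RE = re.compile(
    r"PGRAPH - ch (?P<ch>\d+) \((?P<inst>0x[0-9a-f]+)\) "
    r"subc (?P<subc>\d+) class (?P<cls>0x[0-9a-f]+) "
    r"mthd (?P<mthd>0x[0-9a-f]+) data (?P<data>0x[0-9a-f]+)"
)


def parse_trap(line):
    """Return the trap fields as integers, or None if the line does not match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    return {
        "ch": int(m["ch"]),
        "inst": int(m["inst"], 16),
        "subc": int(m["subc"]),
        "class": int(m["cls"], 16),
        "mthd": int(m["mthd"], 16),
        "data": int(m["data"], 16),
    }


line = ("[  347.083818] [drm] nouveau 0000:01:00.0: PGRAPH - ch 5 "
        "(0x0001b07000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010")
info = parse_trap(line)
print(hex(info["class"]), hex(info["mthd"]))  # → 0x8297 0x1b0c
```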
Comment 25 maximlevitsky 2011-09-08 17:24:42 UTC
OK. After making the kernel handle the TP1 error bit, the error was decoded, and we tried several times. These are the errors we got:

1st try:

[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 9, opcode ffc8c2bf ffc8c2bf
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x0001430000) subc 5 class 0x8297 mthd 0x0f04 data 0x3f4b6dfa

2nd try:

[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 1, opcode ff1f515e ff1f515e
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001cd1000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010

3rd try:

[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 13, opcode ff1f515e ff1f515e
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001930000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010

4th try:

[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07c000 warp 1, opcode ffc7c1bf ffc7c1bf
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x000322c000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010


It was also found that 2D usage after resume from RAM and/or Xvideo playback doesn't trigger that MP fault.

The following suggestions from Marcin Slusarz were also tried:


We tried restarting X after resume from RAM, and once again there were no errors.
We tried running rendercheck -tblend; just 9 out of 18 tests passed, but there was no TP error.

It's somewhat strange: if the nouveau drm module didn't restore some of the TP/MP state, the error should have been triggered by X, as it uses shaders that have no choice but to run on the MPs/TPs.

Running just glxinfo was also attempted. That does all the 3D init but doesn't render anything (I checked this). No errors.

That is ***ing strange, that's all I can say, and I am out of ideas for now.
Comment 26 maximlevitsky 2011-09-08 18:28:52 UTC
Same problem when hibernating, even using 'shutdown' mode.
So neither missing 'magic' help from the BIOS on resume nor a screwup by the BIOS on resume can be the cause.

Another sample:

[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 1, opcode ff1f515e ff1f515e
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001ccd000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010

And another:

[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 12, opcode 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001319000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010

All the errors suggest that it is always MP1 of TP0 that executes random garbage code.
So maybe, maybe......
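
The pattern across the samples in comments 25 and 26 can be checked mechanically by tallying which TP/MP pair and which address each fault reports. The sketch below uses the six PGRAPH_TRAP_MP_EXEC lines quoted in those comments; the regular expression is an assumption derived from that log format.

```python
import re
from collections import Counter

MP_EXEC_RE = re.compile(
    r"PGRAPH_TRAP_MP_EXEC - TP (\d+) MP (\d+): (\w+) at ([0-9a-f]+) "
    r"warp (\d+), opcode ([0-9a-f]+) ([0-9a-f]+)"
)

# The six samples quoted in comments 25 and 26.
SAMPLES = """\
PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 9, opcode ffc8c2bf ffc8c2bf
PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 1, opcode ff1f515e ff1f515e
PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 13, opcode ff1f515e ff1f515e
PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07c000 warp 1, opcode ffc7c1bf ffc7c1bf
PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 1, opcode ff1f515e ff1f515e
PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 12, opcode 00000000 00000000
"""

units = Counter()      # (TP, MP) -> number of faults
addresses = Counter()  # fault address -> number of faults
for line in SAMPLES.splitlines():
    m = MP_EXEC_RE.search(line)
    if m:
        tp, mp, _reason, addr = m.group(1, 2, 3, 4)
        units[(int(tp), int(mp))] += 1
        addresses[int(addr, 16)] += 1

print(dict(units))
print({hex(a): n for a, n in addresses.items()})
```

All six samples land on TP 0 / MP 1, and five of the six report address 0x7bf00, with the remaining one at 0x7c000.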
Comment 27 maximlevitsky 2011-09-08 22:11:45 UTC
So I took a shot in the dark and almost scored.
This is what I found:

The nouveau gallium3d driver reserves 3x512KB buffers for each shader type.
We reduced this to 3x32K buffers and the problem is gone!
(Although one crash with kwin and glxgears was reported; it is not clear yet if it was related.)

It's not easy to say what is going on, but these are the possibilities I have in mind:

1. A memory allocation failure that somehow goes silent, or another similar ttm-related bug. Knowing about all the titanic memory moving, it might be the case after all (the system has 1GB of vram and 2GB of system memory...)

2. Issues with the VM, for example large pages that might be involved and somewhat disabled.

3. Issues with the TLB??

4. ????
Comment 28 Diego Viola 2011-09-09 03:07:35 UTC
Hi,

I've been playing Nexuiz for at least 2 hours at 1680x1050 with effects set to High; I get 30-45 FPS with my card.

I don't get corrupted graphics anymore, and I managed to suspend/hibernate WHILE playing the game. The game didn't crash a single time during suspend/hibernate, which I find impressive.

So the game runs stably, but I also got 2 or 3 crashes. What was the command for getting a trace? I don't get corrupted graphics anymore, which is nice.

Thanks Maxim, I'm very impressed with Nouveau, it rocks.
Comment 29 Diego Viola 2011-09-09 03:19:50 UTC
calim's response to this issue on #nouveau:

06:06 < calim> diegoviola: that has to be some issue with kernel not restoring VRAM properly, the addresses the MP executes at look sane but the RAM contents are obviously garbage

06:11 < calim> reducing the size of some buffer just hides the issue though
06:12 < calim> or it might step over the large/small pages boundary, could be a hint

06:13 < calim> he's already thought about all that

06:13 < calim> s/about/of
06:13 < calim> will probably have to look at the kernel's suspend code next
06:14 < calim> I mean nouveau's suspend code

06:16 < calim> maybe double the size (from 32 KiB up to the original 512 KiB) and check at which point it starts to fail to see if it really is the large-pages allocation boundary
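
calim's suggestion amounts to a short doubling search between the working size and the original one. A minimal sketch follows; the `fails` callback is hypothetical and stands for the manual procedure (rebuild Mesa with the given buffer size, suspend/resume, run glxgears), which cannot be automated from this snippet.

```python
KIB = 1024


def doubling_sizes(start=32 * KIB, stop=512 * KIB):
    """Yield buffer sizes from `start` to `stop`, doubling each step."""
    size = start
    while size <= stop:
        yield size
        size *= 2


def first_failing_size(fails, sizes):
    """Return the first size for which `fails(size)` is true, else None.

    `fails` is a placeholder for the manual rebuild-and-test cycle.
    """
    for size in sizes:
        if fails(size):
            return size
    return None


print([s // KIB for s in doubling_sizes()])  # → [32, 64, 128, 256, 512]
```

That is at most four rebuilds beyond the known-good 32 KiB case to see where the failure starts.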
Comment 30 Diego Viola 2011-09-09 22:21:03 UTC
I've just tried the latest kernel from git, and the error is still there. I get corrupted graphics in glxgears after a suspend/resume.

[diego@myhost ~]$ uname -a
Linux myhost 3.1.0-rc5+ #1 SMP Sat Sep 10 01:15:44 PYT 2011 x86_64 Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz GenuineIntel GNU/Linux
[diego@myhost ~]$
Comment 31 Diego Viola 2011-09-09 22:22:44 UTC
I get this with the latest kernel from git head.

[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_MP - TP0: Unhandled ustatus 0x00020000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001e81000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010


[diego@myhost ~]$ uname -a
Linux myhost 3.1.0-rc5+ #1 SMP Sat Sep 10 01:15:44 PYT 2011 x86_64 Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz GenuineIntel GNU/Linux
[diego@myhost ~]$
Comment 32 Diego Viola 2011-09-09 22:26:20 UTC
Created attachment 51024 [details]
kernel .config

attached my kernel .config file, used for 3.0.4 and 3.1.0-rc5+ (git head)
Comment 33 Diego Viola 2011-09-09 22:30:22 UTC
Created attachment 51025 [details]
kernel .config

reattach kernel config as text file
Comment 34 maximlevitsky 2011-09-10 08:14:46 UTC
Thanks for testing the latest kernel; sadly, as you see, the problem is still there.

This is a summary of the findings so far:

1. The fault always happens in MP1 of TP0, after resume from RAM.

2. It doesn't matter whether the program was running through suspend 
   or was launched after suspend.

3. It doesn't matter whether it was suspend or hibernate, 
   even hibernate without BIOS support (echo shutdown > /sys/power/disk).

4. The actual size of the allocated buffer doesn't matter. However, just pretending that we 
   allocated smaller buffers for each code type (by passing a smaller size to
   nouveau_resource_alloc) fixes the issue. (32K was the largest size that worked.)

5. Buffers that glxgears happens to upload (tested on his system and mine, and they appear 
   not to change):

   7fc00 - 2nd vertex shader - 392 bytes
   7fe00 - vertex shader - 392 bytes
   fff00 - pixel shader - 24 bytes

   However, the fault address is almost always 7bf00 or something very close to it 
   (7bf08, 7c000).

   That address is way outside the areas that were uploaded.

6. Filling the whole code buffer with a pattern (using nouveau_bo_map) made 
   TP1 execute that pattern, once again at 7bf00. Tested many times.
   However, if in addition to doing so the code buffer upload was skipped, all TPs 
   (all 4 of them) faulted in tandem at 7fc00 trying to execute the pattern.
Comment 35 maximlevitsky 2011-09-10 08:41:20 UTC
Small update and corrections:

This is a summary of the findings so far:

1. The fault always happens in MP1 of TP0, after resume from RAM.
   That's of course very strange.

2. It doesn't matter whether the program was running through suspend 
   or was launched after suspend.
   This means that it's not an error in preserving channel/program state
   through suspend. Everything is uploaded anew after the program starts.

3. It doesn't matter whether it was suspend or hibernate, 
   even hibernate without BIOS support (echo shutdown > /sys/power/disk).
   In other words, 'magic init' done on boot but missing after resume,
   or the opposite, a screwup by the BIOS after resume, is ruled out.

4. The actual size of the allocated buffer doesn't matter. However, just 
   pretending that we allocated smaller buffers for each code type 
   (by passing a smaller size to nouveau_resource_alloc) fixes the issue. 
   (32K was the largest size that worked.)
   Support for large pages was also disabled in the kernel, and that didn't help.
   This means that it's not an allocation failure. It could be a memory copy failure
   though, although initially the buffer is allocated in video RAM.

   This also means that the _code offset_ is what matters.
   In addition, note that the code was uploaded by SIFC, but it takes the end 
   address of the area to copy to, so the offset from the start of our buffer 
   really shouldn't matter.

5. Buffers that glxgears happens to upload (tested on his system and mine, and
   they appear not to change):

   7fc00 - 2nd vertex shader - 392 bytes
   7fe00 - vertex shader - 392 bytes
   fff00 - pixel shader - 24 bytes

   However, the fault address is almost always 7bf00 or something very close to it 
   (7bf08, 7c000).

   That address is way outside the areas that were uploaded, which means that
   TP1 really executes undefined code that just wasn't uploaded.

   Filling the whole code buffer with a pattern (using nouveau_bo_map) made 
   TP1 execute that pattern, once again at 7bf00. Tested many times.
   However, if in addition to doing so the code buffer upload was skipped, all
   TPs (all 4 of them) faulted in tandem at 7fe00 (vertex shader) trying to
   execute the pattern.
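
The address arithmetic behind item 5 can be spelled out as follows. This is only a worked check using the offsets and sizes listed above; the variable and function names are illustrative.

```python
# Offsets (hex) and sizes (bytes) as listed in item 5 of the summary.
UPLOADS = {
    "2nd vertex shader": (0x7FC00, 392),
    "vertex shader": (0x7FE00, 392),
    "pixel shader": (0xFFF00, 24),
}
# Fault addresses reported in the kernel log.
FAULTS = [0x7BF00, 0x7BF08, 0x7C000]


def containing_upload(addr):
    """Return the name of the uploaded shader covering `addr`, or None."""
    for name, (start, size) in UPLOADS.items():
        if start <= addr < start + size:
            return name
    return None


lowest = min(start for start, _size in UPLOADS.values())
for addr in FAULTS:
    # Prints the address, the covering upload (if any), and the distance
    # in bytes below the lowest uploaded shader.
    print(hex(addr), containing_upload(addr), lowest - addr)
```

None of the three fault addresses falls inside an uploaded range; 7bf00 lies 0x3d00 (15616) bytes below the lowest uploaded shader at 7fc00.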
Comment 36 maximlevitsky 2011-09-10 19:53:44 UTC
OK, we tested, and the memory is uploaded correctly. We also dumped the pushbuffers, and nouveau does set the address of the code to execute correctly.

Yet, for some damn reason, the GPU still executes from an undefined location.

OK, we mmiotraced the blob, extracted its ctxvals/ctxprog, loaded them, and yes, the problem disappeared. It was also noted that some other graphical corruption disappeared.

So, folks, we really need better coverage of ctxprog and especially ctxvals, because these seem to contain chipset-revision-specific workarounds....

Could we allow loading nvidia's ctxprog/ctxvals, then, so that users could do that more easily than by patching their kernel?
Comment 37 maximlevitsky 2011-09-10 19:55:33 UTC
Created attachment 51036 [details]
ctxvals mmiotrace
Comment 38 maximlevitsky 2011-09-10 19:56:03 UTC
Created attachment 51037 [details]
ctxprog mmiotrace
Comment 39 maximlevitsky 2011-09-10 20:03:20 UTC
Created attachment 51038 [details]
ctxprog prepared for loading
Comment 40 maximlevitsky 2011-09-10 20:04:07 UTC
Created attachment 51039 [details]
ctxvals prepared for load
Comment 41 maximlevitsky 2011-09-10 20:05:42 UTC
Created attachment 51040 [details] [review]
patch to allow nouveau load binary ctxprog
Comment 42 Diego Viola 2011-09-10 20:43:30 UTC
I was playing Nexuiz and got a hang; I had to reboot the computer.

Unfortunately, when running this after the reboot I get "No such file or directory":

[root@myhost ~]# cat /sys/kernel/debug/printk/crash_dmesg | strings
cat: /sys/kernel/debug/printk/crash_dmesg: No such file or directory
[root@myhost ~]#
Comment 43 Diego Viola 2011-09-10 20:59:10 UTC
(In reply to comment #42)
> I was playing Nexuiz, got some hang while playing, had to reboot the computer.
> 
> Unfortunately when running this I get no such file or directory (after reboot)
> 
> [root@myhost ~]# cat /sys/kernel/debug/printk/crash_dmesg | strings
> cat: /sys/kernel/debug/printk/crash_dmesg: No such file or directory
> [root@myhost ~]#

I'm very happy that I'm able to suspend/hibernate and graphics don't get screwed anytime though, the stability seems to have improved a lot. Thanks!
Comment 44 Diego Viola 2011-09-10 21:27:00 UTC
I've been playing Nexuiz at max resolution with effects set to max for something like 2 hours and I only managed to crash it once. I'd say it's working very well. Great work!
Comment 45 Diego Viola 2011-09-10 21:53:39 UTC
I got a crash in SuperTuxKart as well.
Comment 46 Diego Viola 2011-09-10 22:08:40 UTC
s/anytime/anymore/g
Comment 47 maximlevitsky 2011-09-10 22:10:01 UTC
That's because you don't have my blackbox patch applied.
Comment 48 Diego Viola 2011-09-10 22:12:13 UTC
(In reply to comment #47)
> Thats because you don't have my blackbox patch applied

could you please upload your patch here?
Comment 49 maximlevitsky 2011-09-10 22:12:37 UTC
Created attachment 51042 [details] [review]
retrieve kernel log after crash patch

When compiling the kernel, enable CONFIG_HWMEM_PRINTK,
or, as it is called in the description, 'Log printk message buffer into fixed physical address', under Kernel hacking -> Kernel debugging.
Comment 50 Diego Viola 2011-09-10 22:13:51 UTC
(In reply to comment #48)
> (In reply to comment #47)
> > Thats because you don't have my blackbox patch applied
> 
> could you please upload your patch here?

I have your patch applied to my 3.0.4 kernel, but not in 3.1.0-rc5+ (git) I believe.
Comment 51 maximlevitsky 2011-09-10 22:16:35 UTC
And you need to use 3.1.0-rc5+ because, as you remember, that's where we put the patch that loads nvidia's ctxprog.
Comment 52 Diego Viola 2011-09-10 22:24:33 UTC
(In reply to comment #51)
> And you need to use 3.1.0-rc5+ because as you remember we put there patch that
> loads nvidia's ctxprog

Correct. I will just SSH in next time I get a hang, if I can make it hang, and get the dmesg. I will report here anything I find. Thanks.
Comment 53 Diego Viola 2011-09-10 22:40:12 UTC
I've noticed this on my dmesg now:

[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x00078b0000) subc 5 class 0x8297 mthd 0x15f0 data 0x01000000
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 5 [0x000078b0] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT

https://gist.github.com/1209217
Comment 54 Diego Viola 2011-09-10 22:55:02 UTC
Got this now (while playing Nexuiz) -- I got no crashes this time but this error in dmesg:

[diego@myhost ~]$ dmesg
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x000f7b0000) subc 5 class 0x8297 mthd 0x15f0 data 0x00000000
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 5 [0x0000f7b0] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
[diego@myhost ~]$ 


As you can see, I did a "dmesg -c" before, so I make sure I go this now.
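
The "VM: trapped ..." line in this log can be decoded the same way as the PGRAPH lines. The sketch below is an assumption based on the message format shown above, not a nouveau tool; the function name is illustrative.

```python
import re

VM_FAULT_RE = re.compile(
    r"VM: trapped (?P<access>\w+) at (?P<addr>0x[0-9a-f]+) "
    r"on ch (?P<ch>\d+) \[(?P<inst>0x[0-9a-f]+)\] "
    r"(?P<unit>\S+) reason: (?P<reason>\w+)"
)


def parse_vm_fault(line):
    """Return the VM fault fields, or None if the line does not match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "access": m["access"],
        "addr": int(m["addr"], 16),
        "ch": int(m["ch"]),
        "inst": int(m["inst"], 16),
        "unit": m["unit"],
        "reason": m["reason"],
    }


line = ("[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 "
        "on ch 5 [0x0000f7b0] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT")
fault = parse_vm_fault(line)
print(fault["access"], hex(fault["addr"]), fault["unit"], fault["reason"])
# → read 0x0 PGRAPH/VFETCH/00 PT_NOT_PRESENT

# Observation about this log only: the bracketed value equals the channel
# address from the preceding "PGRAPH - ch 5 (0x000f7b0000)" line shifted
# right by 12 bits.
print(fault["inst"] == 0x000F7B0000 >> 12)  # → True
```

The fault is a read at address 0 from the vertex fetch unit, on the same channel that reported the TRAP_VFETCH fault.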
Comment 55 Diego Viola 2011-09-10 23:34:18 UTC
(In reply to comment #54)
> Got this now (while playing Nexuiz) -- I got no crashes this time but this
> error in dmesg:
> 
> [diego@myhost ~]$ dmesg
> [drm] nouveau 0000:01:00.0: firmware ctxvals loaded
> uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
> uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
> uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
> [drm] nouveau 0000:01:00.0: firmware ctxvals loaded
> [drm] nouveau 0000:01:00.0: firmware ctxvals loaded
> [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
> [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000
> 00000000
> [drm] nouveau 0000:01:00.0: PGRAPH - TRAP
> [drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x000f7b0000) subc 5 class 0x8297
> mthd 0x15f0 data 0x00000000
> [drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 5
> [0x0000f7b0] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
> uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
> [diego@myhost ~]$ 
> 
> 
> As you can see, I did a "dmesg -c" before, so I make sure I go this now.

s/go/got/g
Comment 56 Diego Viola 2011-09-11 00:07:40 UTC
Again (while playing Nexuiz) -- no crashes.


[diego@myhost ~]$ dmesg
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
process `skype' is using obsolete setsockopt SO_BSDCOMPAT
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x00064f0000) subc 5 class 0x8297 mthd 0x0f04 data 0x00000000
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 5 [0x000064f0] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[diego@myhost ~]$
Comment 57 Diego Viola 2011-09-11 02:45:29 UTC
Created attachment 51046 [details]
SuperTuxKart crash AFTER suspend dmesg (KERNEL LOG)

SuperTuxKart just froze my whole machine again AFTER suspend/resume and I had to reboot, but I managed to SSH in and get the dmesg (see the attached file).

The interesting bit seems to be this:

[drm] nouveau 0000:01:00.0: Failed to idle channel 5.
Comment 58 Diego Viola 2011-09-11 16:56:10 UTC
I've been using KDE 4.7.1 (kwin with OpenGL compositing enabled) most of the time and playing games WITHOUT suspending first, and it's stable that way.

Most of the crashes with games happen AFTER suspending.
Comment 59 Diego Viola 2011-09-11 19:15:00 UTC
I suspended my computer (tm-suspend as root) then played SuperTuxKart and after completing the first race it just hung my computer.

I SSHed in and ran dmesg; here is the relevant log:

[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: Failed to idle channel 5.
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH idle timed out with status 0x00000203
[drm] nouveau 0000:01:00.0: PGRAPH idle timed out with status 0x00000303
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000303 0x00000001 0x00000000 0x00000000
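
Logs like the one above are easier to read when runs of identical lines are collapsed into a count, in the spirit of `uniq -c`. A minimal sketch, with an illustrative function name and an abbreviated copy of the first part of the log as input:

```python
from itertools import groupby


def collapse_repeats(lines):
    """Yield (line, count) for each run of identical consecutive lines."""
    for line, run in groupby(lines):
        yield line, sum(1 for _ in run)


TLB_203 = ("PGRAPH TLB flush idle timeout fail: "
           "0x00000203 0x00000001 0x00000000 0x00000000")
# Abbreviated form of the log above, with the common
# "[drm] nouveau 0000:01:00.0:" prefix dropped.
log = (["firmware ctxvals loaded"] + [TLB_203] * 21
       + ["Failed to idle channel 5."] + [TLB_203] * 2)

for line, count in collapse_repeats(log):
    print(f"{count:3d}x {line}")
```

This shows at a glance that 21 TLB flush timeouts preceded the "Failed to idle channel 5." message.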
Comment 60 Diego Viola 2011-09-11 19:15:37 UTC
pm-suspend, sorry
Comment 61 Diego Viola 2011-09-11 19:19:15 UTC
(In reply to comment #59)
> I suspended my computer (tm-suspend as root) then played SuperTuxKart and after
> completing the first race it just hung my computer.
> 
> I sssh-in and did a dmesg, here is the relevant log:
> 
> [drm] nouveau 0000:01:00.0: firmware ctxvals loaded
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: Failed to idle channel 5.
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203
> 0x00000001 0x00000000 0x00000000
> [drm] nouveau 0000:01:00.0: PGRAPH idle timed out with status 0x00000203
> [drm] nouveau 0000:01:00.0: PGRAPH idle timed out with status 0x00000303
> [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000303
> 0x00000001 0x00000000 0x00000000

The last time it crashed, I was running the game in windowed mode at maximum resolution (1680x1050), after suspending.
Comment 62 Diego Viola 2011-09-11 20:49:01 UTC
I tried running Doom 3 (without suspending my machine) and it ran very slowly (4 FPS). Doom 3 also thought I had only 64 MB of VRAM, wtf?
Comment 63 Diego Viola 2011-09-11 20:49:39 UTC
Doom 3 didn't crash though, but it was very slow.

I managed to run "timedemo demo1" and it passed the test but very slowly (4 FPS)
Comment 64 Diego Viola 2011-09-11 21:15:23 UTC
Sorry!

I was missing some lib32 libraries, such as the 32-bit DRI and Mesa packages. Now that I have installed them, Doom 3 runs freaking FAST!

60+ FPS :D

And runs stable too (without suspending)
Comment 65 Diego Viola 2011-09-13 04:26:05 UTC
Maybe we should open another bug report for the crashes that happen after suspending, as this one is about the graphics corruption after suspend.
Comment 66 Diego Viola 2011-09-15 05:37:07 UTC
Hello,

I tried running SuperTuxKart after suspend and it crashed as usual.

Then I tried your suggestion: suspend without X, resume, unbind, restart nouveau module with your script, start X and play SuperTuxKart, but it crashed after playing for a while (1-2 laps).

I did a dmesg from my laptop via SSH on the affected computer and found nothing interesting, except these lines:

[drm] nouveau 0000:01:00.0: Failed to idle channel 4.
[drm] nouveau 0000:01:00.0: GPU lockup - switching to software fbcon
[drm] nouveau 0000:01:00.0: Failed to idle channel 2.

I also looked in Xorg.0.log but found nothing interesting there either.

Any ideas?
Comment 67 Diego Viola 2011-09-15 05:55:19 UTC
Just updated the title to something that describes the current problem better.
Comment 68 maximlevitsky 2011-09-17 07:31:00 UTC
We did further testing, and even though we had some promising results, it looks like this bug is too hard to fix, at least for me.

We found the following:

1. Forcing nouveau to run the POST tables (force_post=1) produced the same corruption as after suspend when we didn't use nvidia's ctxprog. The error reported was the same (failure in MP1 of TP1).
Also, we managed to crash the system twice when force_post was set (running SuperTuxKart).

Sadly, we couldn't crash the system anymore when we tried to repeat that while running Nexuiz. Yet.

Also, when I decoded the init tables, I found that they set up the memory controller, reclock the card, etc.

2. Nature of the crashes after resume from RAM with nvidia's ctxprog:
We found that disabling pageflip support made the crash occur less often. We also found that 2D activity, like opening Firefox or changing the virtual desktop, made the system crash right away, provided that a 3D app was already running (we used Unigine Tropics for that, I think).

3. Without nvidia's ctxprog, crashes happen as well.

4. We ruled out the possibility that the problem was the core clock being more than 1/2 of the shader clock. It is exactly 1/2.

5. ....
Comment 69 Diego Viola 2011-09-17 09:57:12 UTC
We're running SuperTuxKart built from SVN on X and nothing else like this:

xinit /usr/local/games/supertuxkart  --profile-laps=1000 --fullscreen

We ran this with ctxprog loaded and force_post=1 (without suspend) and SuperTuxKart just hung on lap 26/1000.
Comment 70 maximlevitsky 2011-09-17 20:27:04 UTC
Now the situation is much better. We did many tests, and we get this picture:

There are 3 problems that are more or less independent. All are caused by either suspend or running the BIOS init tables (tested).
None of these problems happens if neither of the above was done.

In other words, the BIOS init tables set some registers to harmful values, but the nvidia driver undoes that. BIOS writes really are evil, and one indeed needs a 'glove box' to handle BIOSes.

1. Fault in MP1 of TP0: we tested that often, and we did get it with force_post=1.
Nvidia's ctxprog (more likely ctxvals) 'fixes' this.

2. Semaphore-related fault, which happens after around 10 minutes of full-screen STK.
It was reproduced with and without forced POSTing, one time in each situation, with the same error according to the PFIFO state dumps.
It can be 'fixed' by turning pageflipping off.


3. Crash that happens if we run glxgears in addition to STK.
It happens very fast (~3 seconds from glxgears start), regardless of pageflipping.
Both PFIFO and PGRAPH appear to be idle, though.
Comment 71 maximlevitsky 2011-09-17 21:03:07 UTC
Created attachment 51300 [details]
list of POST register writes

Hopefully, when we replay that list of register writes, the problem will reappear, just like with force_post.
Comment 72 maximlevitsky 2011-09-17 21:13:45 UTC
Created attachment 51301 [details] [review]
demmio_replay patch

This is the tool I use to replay traces.
I adapted it to replay this trace.
Apply the patch to envytools.
Comment 73 Diego Viola 2011-09-18 05:59:07 UTC
I ran SuperTuxKart for 9 hours (620/50000 laps) in total, without suspending the machine.

The result: SuperTuxKart didn't crash or hang at all. I also ran glxgears in the middle of it and no hangs or crashes. Stability is really good without suspend.

I see lots of these messages in my dmesg, however:

[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff0d2d07
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff346120
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff254208
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff041d00
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
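Since the same trap repeats many times, it can help to collapse the log into counts. The following is only a rough sketch, assuming the "VM: trapped ..." line format is exactly as it appears in the log above; the function name and the sample text are made up for illustration, and in practice the input would be the saved output of dmesg.

```python
import re
from collections import Counter

# Sample input: VM fault lines copied from the log above, plus one
# unrelated line to show that non-matching lines are skipped.
SAMPLE = """\
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
"""

# Pattern written against the line format seen in this report only.
VM_FAULT = re.compile(
    r"VM: trapped (?P<op>\w+) at (?P<addr>0x[0-9a-fA-F]+) "
    r"on ch (?P<ch>\d+) \[0x[0-9a-fA-F]+\] (?P<unit>\S+) reason: (?P<reason>\S+)"
)


def summarize_vm_faults(log_text):
    """Count VM faults grouped by (operation, address, channel, unit, reason)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = VM_FAULT.search(line)
        if match:
            key = (
                match["op"],
                int(match["addr"], 16),
                int(match["ch"]),
                match["unit"],
                match["reason"],
            )
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for key, count in summarize_vm_faults(SAMPLE).items():
        print(count, key)
```

For the lines quoted above, every fault falls into a single group: a read at address 0 on channel 4 from PGRAPH/VFETCH/00 with reason PT_NOT_PRESENT.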
Comment 74 maximlevitsky 2011-09-18 15:41:52 UTC
OK folks, it was just one bit, and it was unset by the BIOS init tables.
One bit: 0x400000 of register 0x4008, the register that is responsible for the memory clock.

Actual values of this register:

before suspend:
0x90596400 PCLOCK.MPLL1 => { UNK0 = 0x6400 | P = 0x1 | UNK19 = 0xb | UNK28 = 0x1 | ENABLE }

after suspend:
0x90016400 PCLOCK.MPLL1 => { UNK0 = 0x6400 | P = 0x1 | UNK19 = 0 | UNK28 = 0x1 | ENABLE }

test1: 0x90096400 - corruptions, but stable (with ctxfw, hang after 10 secs)
test2: 0x90516400 - good (with ctxprog)
test3: 0x90416400 - good (with ctxprog), without ctxprog: good
test4: 0x90116400 - corruptions without ctxprog

It also seems that without nvidia's ctxprog, while we always got corruption, we didn't get hangs (but we tested just a few laps of STK+glxgears).
It doesn't matter much. It is one bit, and it's in an unknown PLL area....
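The register values quoted in this comment can be checked with a little bit arithmetic. This is only a sketch: the field positions in decode() are an assumption inferred from the decoded values shown above (UNK0, P, UNK19, UNK28, ENABLE), not taken from any hardware documentation, and the function names are made up for illustration.

```python
# Values of PCLOCK.MPLL1 (register 0x4008) as quoted in this comment.
BEFORE_SUSPEND = 0x90596400
AFTER_SUSPEND = 0x90016400
CRITICAL_BIT = 0x400000  # the bit reported as unset by the BIOS init tables

# Test values and the outcomes reported for them in this comment.
TESTS = {
    0x90096400: "corruptions",
    0x90516400: "good",
    0x90416400: "good",
    0x90116400: "corruptions",
}


def decode(value):
    """Split a register value into fields.

    The bit positions are an assumption inferred from the decoded
    values in this comment, not from hardware documentation.
    """
    return {
        "UNK0": value & 0xFFFF,
        "P": (value >> 16) & 0x7,
        "UNK19": (value >> 19) & 0xF,
        "UNK28": (value >> 28) & 0x1,
        "ENABLE": (value >> 31) & 0x1,
    }


def has_critical_bit(value):
    """Return True if bit 0x400000 is set in the register value."""
    return bool(value & CRITICAL_BIT)


if __name__ == "__main__":
    print(f"changed bits: {BEFORE_SUSPEND ^ AFTER_SUSPEND:#010x}")
    for name, value in (("before", BEFORE_SUSPEND), ("after", AFTER_SUSPEND)):
        print(name, decode(value), "critical bit:", has_critical_bit(value))
    for value, outcome in TESTS.items():
        print(f"{value:#010x} critical bit: {has_critical_bit(value)} -> {outcome}")
```

Suspend clears three bits in total (the XOR of the two values is 0x00580000), which is the whole UNK19 field under the assumed layout. Across the four test values, only bit 0x400000 lines up with the reported outcome: it is set in both "good" values and clear in both "corruptions" values.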
Comment 75 maximlevitsky 2011-09-18 15:45:49 UTC
Created attachment 51323 [details] [review]
Temporary hack to fix this

This is a hack to fix this issue.
It must be reverse-engineered further to get to the bottom of this.
Comment 76 Diego Viola 2011-09-18 15:53:03 UTC
Created attachment 51324 [details]
vbios.rom

attaching a dump of /sys/kernel/debug/dri/0/vbios.rom

cat /sys/kernel/debug/dri/0/vbios.rom > vbios.rom
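A quick sanity check of the dumped file is possible before attaching it. This sketch assumes the dump starts with a standard PCI option ROM header (bytes 0x55 0xAA, followed by the image size in 512-byte units); the helper name is hypothetical, and whether this particular dump follows that layout has not been verified here.

```python
import os


def rom_header_info(data):
    """Return (has_signature, size_in_bytes) for the start of a ROM image.

    Assumes the standard PCI option ROM header: signature bytes
    0x55 0xAA, then one byte giving the image size in 512-byte units.
    """
    if len(data) < 3:
        return (False, 0)
    has_signature = data[0] == 0x55 and data[1] == 0xAA
    size_in_bytes = data[2] * 512 if has_signature else 0
    return (has_signature, size_in_bytes)


if __name__ == "__main__":
    path = "vbios.rom"  # the file created by the cat command above
    if os.path.exists(path):
        with open(path, "rb") as rom:
            print(rom_header_info(rom.read(3)))
```

If the signature check fails, the dump is probably truncated or was not read from the expected debugfs file.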
Comment 77 Diego Viola 2011-09-19 07:43:32 UTC
I found another way to hang/crash the system.

1- Boot into console mode (id:3:initdefault: in /etc/inittab)

2- start X (and just X) in tty1

3- switch to tty2 and start STK in that tty2.

start STK in tty2 like this:

export DISPLAY=:0
/usr/local/games/supertuxkart  --profile-laps=50000 --fullscreen

When I start STK like that, it attempts to switch to X as soon as STK is launched.

Then X hangs right there. It won't even start STK; it hangs as soon as the VT is switched to X, and then I get errors like this one:

[ 17583.534] (EE) NOUVEAU(0): failed to set mode: Permission denied
Fatal server error:
[ 17583.534] EnterVT failed for screen 0

It seems like this issue is more related to this bug than anything else: https://bugzilla.redhat.com/show_bug.cgi?id=680677

If I try to do the same, e.g. reboot the machine, login into tty1, start X, switch to tty2, login into tty2, start glxgears in tty2, nothing happens... it doesn't even switch right back to X when I start glxgears in tty2, it runs glxgears and I see output in tty2, and I can switch to tt7 where X is running (alt+f7) and see glxgears running in there.

But when I try the same with STK, it attempts to switch to X directly and hangs the machine.
Comment 78 Diego Viola 2011-09-19 07:45:19 UTC
(In reply to comment #77)
> I found another way to hung/crash the system.
> 
> 1- Boot into console mode (id:3:initdefault: in /etc/inittab)
> 
> 2- start X (and just X) in tty1
> 
> 3- switch to tty2 and start STK in that tty2.
> 
> start STK in tty2 like this:
> 
> export DISPLAY=:0
> /usr/local/games/supertuxkart  --profile-laps=50000 --fullscreen
> 
> when I start STK like that it would attempt to switch to X as soon as STK is
> launched.
> 
> Then X would hung right there, it wont even start STK, but it would hung as
> soon as the VT is switched to X, then I would get errors like this one:
> 
> [ 17583.534] (EE) NOUVEAU(0): failed to set mode: Permission denied
> Fatal server error:
> [ 17583.534] EnterVT failed for screen 0
> 
> It seems like this issue is more related to this bug than anything else:
> https://bugzilla.redhat.com/show_bug.cgi?id=680677
> 
> If I try to do the same, e.g. reboot the machine, login into tty1, start X,
> switch to tty2, login into tty2, start glxgears in tty2, nothing happens... it
> doesn't even switch right back to X when I start glxgears in tty2, it runs
> glxgears and I see output in tty2, and I can switch to tt7 where X is running
> (alt+f7) and see glxgears running in there.
> 
> But whether I try the same with STK, it would attempt to switch to X directly
> and hung the machine.

tty7*
Comment 79 Diego Viola 2011-09-19 07:48:30 UTC
I was able to suspend the machine and run STK+glxgears for many hours, and I didn't have another crash. It seems like the hang I got in my last post is related to fast VT switching and nothing else, so it is a different bug.
Comment 80 Diego Viola 2011-09-19 08:10:54 UTC
If I SSH from my laptop to the affected computer, I can run STK+glxgears for hours with no hangs.

The last hangs I experienced only happen when there is a fast VT->X context switch.

So it seems like they are completely unrelated to the suspend hangs.

Thanks.
Comment 81 Diego Viola 2011-09-19 09:48:52 UTC
[root@myhost ~]# lspci -v
00:00.0 Host bridge: Intel Corporation 4 Series Chipset DRAM Controller (rev 03)
        Subsystem: ASUSTeK Computer Inc. Device 836d
        Flags: bus master, fast devsel, latency 0
        Capabilities: [e0] Vendor Specific Information: Len=0c <?>

00:01.0 PCI bridge: Intel Corporation 4 Series Chipset PCI Express Root Port (rev 03) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 0000d000-0000dfff
        Memory behind bridge: fa000000-feafffff
        Prefetchable memory behind bridge: 00000000e0000000-00000000efffffff
        Capabilities: [88] Subsystem: ASUSTeK Computer Inc. Device 836d
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
        Capabilities: [a0] Express Root Port (Slot+), MSI 00
        Kernel driver in use: pcieport

00:1b.0 Audio device: Intel Corporation N10/ICH 7 Family High Definition Audio Controller (rev 01)
        Subsystem: ASUSTeK Computer Inc. Device 8445
        Flags: bus master, fast devsel, latency 0, IRQ 6
        Memory at f9ffc000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [50] Power Management version 2
        Capabilities: [60] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00

00:1c.0 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 1 (rev 01) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
        I/O behind bridge: 00002000-00002fff
        Memory behind bridge: 80600000-807fffff
        Prefetchable memory behind bridge: 0000000080800000-00000000809fffff
        Capabilities: [40] Express Root Port (Slot+), MSI 00
        Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
        Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Device 8179
        Capabilities: [a0] Power Management version 2
        Kernel driver in use: pcieport
                                                                                                                                                                                                                                             
00:1c.1 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 2 (rev 01) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
        I/O behind bridge: 00001000-00001fff
        Memory behind bridge: 80200000-803fffff
        Prefetchable memory behind bridge: 0000000080400000-00000000805fffff
        Capabilities: [40] Express Root Port (Slot+), MSI 00
        Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
        Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Device 8179
        Capabilities: [a0] Power Management version 2
        Kernel driver in use: pcieport

00:1c.3 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 4 (rev 01) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
        I/O behind bridge: 0000e000-0000efff
        Memory behind bridge: feb00000-febfffff
        Prefetchable memory behind bridge: 0000000080000000-00000000801fffff
        Capabilities: [40] Express Root Port (Slot+), MSI 00
        Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
        Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Device 8179
        Capabilities: [a0] Power Management version 2
        Kernel driver in use: pcieport

00:1d.0 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #1 (rev 01) (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard
        Flags: bus master, medium devsel, latency 0, IRQ 23
        I/O ports at c480 [size=32]
        Kernel driver in use: uhci_hcd

00:1d.1 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #2 (rev 01) (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard
        Flags: bus master, medium devsel, latency 0, IRQ 19
        I/O ports at c800 [size=32]
        Kernel driver in use: uhci_hcd

00:1d.2 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #3 (rev 01) (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard
        Flags: bus master, medium devsel, latency 0, IRQ 18
        I/O ports at c880 [size=32]
        Kernel driver in use: uhci_hcd

00:1d.3 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #4 (rev 01) (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard
        Flags: bus master, medium devsel, latency 0, IRQ 16
        I/O ports at cc00 [size=32]
        Kernel driver in use: uhci_hcd

00:1d.7 USB Controller: Intel Corporation N10/ICH 7 Family USB2 EHCI Controller (rev 01) (prog-if 20 [EHCI])
        Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard
        Flags: bus master, medium devsel, latency 0, IRQ 23
        Memory at f9ffbc00 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
        Capabilities: [58] Debug port: BAR=1 offset=00a0
        Kernel driver in use: ehci_hcd

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1) (prog-if 01 [Subtractive decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=05, subordinate=05, sec-latency=32
        Capabilities: [50] Subsystem: ASUSTeK Computer Inc. Device 8179

00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
        Subsystem: ASUSTeK Computer Inc. P5KPL-VM Motherboard
        Flags: bus master, medium devsel, latency 0
        Capabilities: [e0] Vendor Specific Information: Len=0c <?>

00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01) (prog-if 8a [Master SecP PriP])
        Subsystem: ASUSTeK Computer Inc. P5KPL-VM Motherboard
        Flags: bus master, medium devsel, latency 0, IRQ 18
        I/O ports at 01f0 [size=8]
        I/O ports at 03f4 [size=1]
        I/O ports at 0170 [size=8]
        I/O ports at 0374 [size=1]
        I/O ports at ffa0 [size=16]
        Kernel driver in use: ata_piix

00:1f.2 IDE interface: Intel Corporation N10/ICH7 Family SATA IDE Controller (rev 01) (prog-if 8f [Master SecP SecO PriP PriO])
        Subsystem: ASUSTeK Computer Inc. P5KPL-VM Motherboard
        Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 22
        I/O ports at c400 [size=8]
        I/O ports at c080 [size=4]
        I/O ports at c000 [size=8]
        I/O ports at bc00 [size=4]
        I/O ports at b880 [size=16]
        Capabilities: [70] Power Management version 2
        Kernel driver in use: ata_piix

01:00.0 VGA compatible controller: nVidia Corporation G96 [GeForce 9500 GT] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: eVga.com. Corp. Device c958
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
        I/O ports at dc00 [size=128]
        Expansion ROM at fea80000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Kernel driver in use: nouveau
        Kernel modules: nouveau

02:00.0 Ethernet controller: Atheros Communications AR8121/AR8113/AR8114 Gigabit or Fast Ethernet (rev b0)
        Subsystem: ASUSTeK Computer Inc. P5KPL-CM Motherboard
        Flags: bus master, fast devsel, latency 0, IRQ 19
        Memory at febc0000 (64-bit, non-prefetchable) [size=256K]
        I/O ports at ec00 [size=128]
        Capabilities: [40] Power Management version 2
        Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [58] Express Endpoint, MSI 00
        Kernel driver in use: ATL1E

[root@myhost ~]#
Comment 82 Diego Viola 2011-09-19 10:33:42 UTC
Pictures of the affected card.

Front:

http://dl.dropbox.com/u/6005119/gpu/P1120649.JPG

Back:

http://dl.dropbox.com/u/6005119/gpu/P1120671.JPG
Comment 83 Diego Viola 2011-09-19 10:35:50 UTC
Created attachment 51365 [details]
lspci -a

added lspci -a
Comment 84 Diego Viola 2011-09-19 10:57:35 UTC
(In reply to comment #82)
> Pictures of the affected card.
> 
> Front:
> 
> http://dl.dropbox.com/u/6005119/gpu/P1120649.JPG
> 
> Back:
> 
> http://dl.dropbox.com/u/6005119/gpu/P1120671.JPG

card pictures (mirror), just in case.

http://ompldr.org/vYWY5aA/P1120649.JPG
http://ompldr.org/vYWY5Zw/P1120671.JPG
Comment 85 Diego Viola 2011-09-20 15:01:25 UTC
I failed to reproduce this same problem with this card:

01:00.0 VGA compatible controller: nVidia Corporation G96 [GeForce 9400 GT] (rev a1)

BTW, this problem is also present on Fedora 15; I tested it and the problem is there too.
Comment 86 Diego Viola 2011-09-20 15:08:41 UTC
I fail to reproduce this issue with this card as well:

02:00.0 VGA compatible controller: nVidia Corporation G92 [GeForce 9800 GT] (rev a2)


This issue is only present with my GeForce 9500 GT.

Weird.
Comment 87 Diego Viola 2011-09-20 16:10:31 UTC
I have a ThinkPad T510 with an Optimus GPU (hybrid Intel/Nvidia). I switched the GPU in the BIOS to use the Nvidia card so I could try to reproduce this issue on my laptop as well, and failed to reproduce it there too.

I have suspended the laptop many times, but glxgears appears just fine after suspend: no corrupted graphics, etc.

[diego@myhost ~]$ lspci|grep VGA
01:00.0 VGA compatible controller: nVidia Corporation GT218 [NVS 3100M] (rev a2)
[diego@myhost ~]$
Comment 88 Ben Skeggs 2011-09-22 22:09:47 UTC
Created attachment 51531 [details] [review]
More complete MPLL programming during POST

Give this patch a try.  It should correct nouveau's MPLL setup when cold-booting the card.
Comment 89 Diego Viola 2011-09-23 17:08:07 UTC
(In reply to comment #88)
> Created an attachment (id=51531) [details]
> More complete MPLL programming during POST
> 
> Give this patch a try.  It should correct nouveau's MPLL setup when
> cold-booting the card.

Works great, thank you so much! :)
Comment 90 Diego Viola 2011-09-23 17:16:26 UTC
I rebuilt the kernel today from Linus Torvalds' GitHub repository and applied darktama's patch.

Linux myhost 3.1.0-rc7+ #1 SMP Fri Sep 23 13:44:20 PYT 2011 x86_64 Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz GenuineIntel GNU/Linux

So far things are very stable!

I ran STK (100 laps) with glxgears and suspended many times. I also suspended while in-game, and things are very stable.

I also tried using KDE 4.7 with KWin OpenGL compositing before and after suspending, and things are very smooth and stable.

I'm going to close this bug report now and mark it fixed. calim and MaximLevitsky have already approved closing the bug report. calim has also stated that the real fix has been merged into nouveau git master and that it will be available in Linux 3.2. The commit is here: http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=a51d43c1b27581e780f60b6d724d146db94b31c5

Thanks to everyone who has helped!
