Description: When I suspend my system (pm-suspend) and then resume it, my graphics get corrupted. The only thing I can do to get rid of the glitches is to reboot the computer. I have tried this with Linux 3.0.4 and 3.0.1 and I get the same problem with both.
Hardware:
Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
2 GB RAM
01:00.0 VGA compatible controller: nVidia Corporation G96 [GeForce 9500 GT] (rev a1)
Software:
Arch Linux x86-64
Linux 3.0.4
KDE 4.7
xorg-server 1.10.3.901-1
xf86-video-nouveau 0.0.16_git20110726-1
How to reproduce the problem:
1- install archlinux x86-64
2- pacman -Syu
3- install xf86-video-nouveau, nouveau-dri
4- install KDE4
5- pm-suspend as root
6- resume the computer
7- watch the graphics glitches
Please note that I have OpenGL compositing turned on in KWin. I have tried disabling it, but the problem still persists. I also tried restarting X, but the problem is still there. The only workaround I have found is to reboot the computer completely, but as soon as I suspend my computer again the graphics get corrupted again.
Created attachment 50904 [details] NOUVEAU IRC LOG If you need more detailed information, please see the attached IRC log from #nouveau. Or feel free to ask me questions, and I will gladly provide any help.
Created attachment 50905 [details] kernel log (dmesg) Please find the dmesg attached, I have captured this dmesg after/when I have reproduced the problem.
Created attachment 50906 [details] picture displaying the problem (graphics corruption) Attached some pictures displaying the problem after having reproduced it.
Created attachment 50907 [details] Another picture displaying the problem Another picture showing the problem.
Created attachment 50908 [details] one more picture showing the problem One more picture showing the problem.
On the image that shows Firefox I have selected text and dragged around.
I also notice that sometimes, when calling krunner (ALT + F2) in KDE and typing something like "konsole", the pixels where krunner is displayed get all messed up too. I don't even need to suspend to reproduce this. It doesn't always happen, but it happens most of the time. When I'm able to reproduce it I will take a picture. Thanks.
I had exactly the same problem. Back then it looked like two bugs. One was in mesa, which didn't upload relocations for the constant buffer. The other was possibly in the kernel, but I am not very sure about that, and anyway your kernel is new enough not to worry about it. So please make sure you have a recent enough mesa (7.11 should be enough). Also, to reproduce this easily, a debugfs file (/sys/kernel/debug/dri/0/evict_vram) was added. Just read that file a few times, like this:
cat /sys/kernel/debug/dri/0/evict_vram
(This assumes you have debugfs mounted on /sys/kernel/debug.)
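Whether debugfs is actually mounted where the comment above assumes can be checked before reading the file. A minimal sketch in Python, assuming the usual /proc/mounts layout (device, mount point, filesystem type, ...); the helper name debugfs_mountpoints is made up for illustration and is not part of any tool mentioned in this report:

```python
def debugfs_mountpoints(mounts_text):
    """Return the mount points of all debugfs entries in /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed field order: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


def is_debugfs_ready(mounts_text, expected="/sys/kernel/debug"):
    """True if debugfs is mounted on the path the evict_vram instructions assume."""
    return expected in debugfs_mountpoints(mounts_text)
```

To use it, pass the contents of /proc/mounts to is_debugfs_ready(). If it returns False, debugfs can usually be mounted as root with "mount -t debugfs none /sys/kernel/debug".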
(In reply to comment #10)
> I had exactly the same problem.
> Then it looked like 2 bugs. One was in mesa and the way it didn't upload
> relocations for constant buffer, and other problem possible was in kernel, but
> I am not very sure about that, and anyway you have new enough kernel not to
> worry about.
>
> So please make sure you have recent enough mesa (7.11 should be enough).
>
> Also to reproduce that easily (cat /sys/kernel/debug/dri/0/evict_vram) was
> added.
> Just read that file few times like that:
>
> cat /sys/kernel/debug/dri/0/evict_vram
>
> (That assumes you have debugfs mounted on /sys/kernel/debug)
Hello, I have mesa 7.11-2.
[diego@myhost ~]$ pacman -Q mesa
mesa 7.11-2
[diego@myhost ~]$
I tried mounting debugfs to /sys/kernel/debug/ and reading /sys/kernel/debug/dri/0/evict_vram a few times with cat but I cannot reproduce the issue that way.
Could you try reading that debug file many times, possibly in a loop? If that doesn't work, then it's a different issue, I guess. Please try updating to mesa from git, just in case the issue is fixed there.
Another thing: try compiz and see if it has the same issue.
(In reply to comment #12)
> could you try reading that debug file many times, possibly in a loop?
> If that doesn't work, then its different issue I guess.
>
> Please try to update to mesa from git, just in case issue is fixed there.
I've been running "cat /sys/kernel/debug/dri/0/evict_vram" inside a while loop, once per second, for more than an hour:
$ while true; do cat /sys/kernel/debug/dri/0/evict_vram; sleep 1; done
but I couldn't reproduce the issue that way. How can I upgrade mesa from git? Any instructions?
BTW, running this crashed my machine:
while true; do cat /sys/kernel/debug/dri/0/evict_vram; done
When I added "sleep 1" it didn't crash my machine; I believe I exhausted my hardware. But still, I couldn't reproduce the issue. Any instructions for upgrading mesa to git? Thanks.
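Since the unthrottled loop above took the machine down, a bounded and throttled variant may be safer for this kind of test. The following is only a sketch in Python under that assumption; the function name read_repeatedly and its parameters are hypothetical, and it simply opens and reads the file the same way cat would:

```python
import time


def read_repeatedly(path, count, delay_s=1.0):
    """Read `path` `count` times, sleeping `delay_s` seconds between reads.

    Returns the number of reads that succeeded.
    """
    ok = 0
    for i in range(count):
        try:
            with open(path, "rb") as f:
                f.read()
            ok += 1
        except OSError as exc:
            # Report the failure and keep going, so one bad read does not end the test
            print(f"read {i + 1} failed: {exc}")
        if i + 1 < count:
            time.sleep(delay_s)
    return ok
```

For example, read_repeatedly("/sys/kernel/debug/dri/0/evict_vram", 300), run as root, would perform 300 reads at one-second intervals (about five minutes) and then stop on its own.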
Oh, I didn't mean to make you run that for an hour! A few minutes should have done it. It's clear you have another bug. So the only suggestion left is to update mesa (and try compiz too). It's not difficult at all to compile mesa; I can even guide you if you join IRC now.
(In reply to comment #16)
> Oh, I didn't mean to make you run that for a hour!, a few minutes should have
> done it.
> Its clear you have another bug.
> So only suggestion left is to update mesa (and try compiz too).
> Its not difficult at all to compile mesa, I can even guide you if you join IRC
> now
What's your nick on IRC?
I tried compiz, and changing window managers doesn't help: the graphics corruption is still there after a suspend/resume.
Any ideas?
I tried running openbox and kwin without OpenGL compositing, and after suspending/resuming they don't exhibit this problem. The problem only occurs when OpenGL compositing is enabled.
I have compiled libdrm and Mesa from Git, the problem still persists.
Created attachment 50946 [details] glxgears after suspend/resume We (MaximLevitsky and I) have found that no 3D app survives suspend/resume. Whenever I start glxgears after suspend/resume, the graphics look corrupted. Screenshots of glxgears after suspend/resume are attached (and also logs).
Created attachment 50947 [details] glxgears after suspend/resume maximized
So now we found that after suspend, starting glxgears (and that means pretty much any 3D app) produces the corruption shown on the screenshot in the previous comment. On the first glxgears invocation after an s2ram cycle, the kernel log gets this:
[ 297.625761] EXT4-fs (sda2): re-mounted. Opts: commit=0
[ 297.628545] EXT4-fs (sda3): re-mounted. Opts: commit=0
[ 306.243318] eth0: no IPv6 routers present
[ 347.083807] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_MP - TP0: Unhandled ustatus 0x00020000
[ 347.083812] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[ 347.083818] [drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x0001b07000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010
According to envytools, it's TP0.MPC_TRAP bit MP1. The failing PFIFO method isn't always present in the kernel log (e.g. it wasn't there when we tested compiz), and it decodes to the QUERY_GET method. Any ideas?
OK. After making the kernel handle the MP1 error bit, the error was decoded, and we tried several times. These are the errors we got:
1st try:
[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 9, opcode ffc8c2bf ffc8c2bf
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x0001430000) subc 5 class 0x8297 mthd 0x0f04 data 0x3f4b6dfa
2nd try:
[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 1, opcode ff1f515e ff1f515e
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001cd1000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010
3rd try:
[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 13, opcode ff1f515e ff1f515e
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001930000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010
4th try:
[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07c000 warp 1, opcode ffc7c1bf ffc7c1bf
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x000322c000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010
It was also found that 2D usage after resume from RAM and/or xvideo playback don't trigger that MP fault. The following suggestions from Marcin Slusarz were also tried: We restarted X after resume from RAM, and once again got no errors. We ran rendercheck -tblend; just 9 out of 18 tests passed, but there was no TP error. That is somewhat strange, because if the nouveau drm module didn't restore some of the TP/MP state, the error should have been triggered by X, as it does use shaders that have no choice but to run on MPs/TPs. Running just glxinfo was also attempted. That does all the 3D init but doesn't render anything (I checked this). No errors. That is ***ing strange, that's all I can say, and I am out of ideas for now.
Same problem when doing hibernate, even using 'shutdown' mode. So neither missing magic help from the BIOS on resume nor a screwup by the BIOS on resume is to be suspected.
Another sample:
[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 1, opcode ff1f515e ff1f515e
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001ccd000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010
And another:
[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 07bf00 warp 12, opcode 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001319000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010
All errors suggest that it is always MP1 of TP0 that executes random garbage code. So maybe, maybe......
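Comparing these traps by eye gets tedious once there are more than a handful. A small Python sketch that pulls the TP, MP, address, warp and opcode out of PGRAPH_TRAP_MP_EXEC lines in the format shown in this thread; the regular expression is inferred from the samples above only (it is not taken from the nouveau source), and the function names are made up:

```python
import re
from collections import Counter

# Pattern inferred from the log samples in this report, not from the driver code
TRAP_RE = re.compile(
    r"PGRAPH_TRAP_MP_EXEC - TP (\d+) MP (\d+): (\w+) at ([0-9a-f]+) "
    r"warp (\d+), opcode ([0-9a-f]+) ([0-9a-f]+)"
)


def parse_mp_traps(log_text):
    """Return one dict per PGRAPH_TRAP_MP_EXEC line found in log_text."""
    traps = []
    for m in TRAP_RE.finditer(log_text):
        traps.append({
            "tp": int(m.group(1)),
            "mp": int(m.group(2)),
            "reason": m.group(3),
            "addr": int(m.group(4), 16),
            "warp": int(m.group(5)),
            "opcode": (m.group(6), m.group(7)),
        })
    return traps


def summarize(traps):
    """Count traps per (TP, MP, fault address) combination."""
    return Counter((t["tp"], t["mp"], hex(t["addr"])) for t in traps)
```

Feeding the output of dmesg to parse_mp_traps() and then summarize() gives a quick view of whether the fault keeps landing on the same unit and address.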
So I took a shot in the dark and I almost scored. This is what I found: the nouveau gallium3d driver reserves 3x512KB buffers, one for each shader type. We reduced this to 3x32K buffers and the problem is gone! (Although one crash with kwin and glxgears was reported; it's not clear yet if it was related.) It's not easy to say what is going on, but these are the possibilities I have in mind:
1. A memory allocation failure that somehow goes silent, or another similar ttm-related bug. Knowing about all the titanic memory moving, that might be the case after all (the system has 1GB of vram and 2GB of system memory...)
2. Issues with the VM, for example large pages that might be involved and somewhat disabled.
3. Issues with the TLB??
4. ????
Hi, I've been playing Nexuiz for at least 2 hours at 1680x1050 with effects set to High; I get 30-45 FPS with my card. I don't get corrupted graphics anymore, which is nice, and I managed to suspend/hibernate WHILE playing the game without the game crashing a single time during suspend/hibernate, which I find impressive. So the game runs stable, but I also got 2 or 3 crashes. What was the command for getting a trace? Thanks Maxim, I'm very impressed with Nouveau, it rocks.
calim's response to this issue on #nouveau:
06:06 < calim> diegoviola: that has to be some issue with kernel not restoring VRAM properly, the addresses the MP executes at look sane but the RAM contents are obviously garbage
06:11 < calim> reducing the size of some buffer just hides the issue though
06:12 < calim> or it might step over the large/small pages boundary, could be a hint
06:13 < calim> he's already thought about all that
06:13 < calim> s/about/of
06:13 < calim> will probably have to look at the kernel's suspend code next
06:14 < calim> I mean nouveau's suspend code
06:16 < calim> maybe double the size (from 32 KiB up to the original 512 KiB) and check at which point it starts to fail to see if it really is the large-pages allocation boundary
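The doubling experiment calim suggests amounts to a short sweep over buffer sizes. A sketch of that sweep in Python; the `works` callback is purely hypothetical and stands for the manual procedure (rebuild mesa with that buffer size, suspend/resume, run glxgears, check for corruption), which nothing here automates:

```python
def doubling_sizes(start_kib=32, stop_kib=512):
    """Sizes in KiB from start_kib up to stop_kib inclusive, doubling each step."""
    sizes = []
    size = start_kib
    while size <= stop_kib:
        sizes.append(size)
        size *= 2
    return sizes


def first_failing_size(sizes_kib, works):
    """Return the first size for which works(size) is False, or None if all pass.

    `works` is a placeholder for the manual suspend/resume test at that size.
    """
    for size in sizes_kib:
        if not works(size):
            return size
    return None
```

With the defaults, doubling_sizes() yields 32, 64, 128, 256 and 512 KiB, so at most five rebuild-and-suspend cycles are needed to find where the failure starts.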
I've just tried the latest kernel from git, and the error is still there. I get corrupted graphics after doing a suspend/resume on glxgears.
[diego@myhost ~]$ uname -a
Linux myhost 3.1.0-rc5+ #1 SMP Sat Sep 10 01:15:44 PYT 2011 x86_64 Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz GenuineIntel GNU/Linux
[diego@myhost ~]$
I get this with the latest kernel from git head.
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_MP - TP0: Unhandled ustatus 0x00020000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001e81000) subc 5 class 0x8297 mthd 0x1b0c data 0x1000f010
[diego@myhost ~]$ uname -a
Linux myhost 3.1.0-rc5+ #1 SMP Sat Sep 10 01:15:44 PYT 2011 x86_64 Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz GenuineIntel GNU/Linux
[diego@myhost ~]$
Created attachment 51024 [details] kernel .config attached my kernel .config file, used for 3.0.4 and 3.1.0-rc5+ (git head)
Created attachment 51025 [details] kernel .config reattach kernel config as text file
Thanks for testing the latest kernel; sadly, as you see, the problem is still there. This is a summary of findings so far:
1. The fault always happens in MP1 of TP0, after resume from RAM.
2. It doesn't matter if the program was run through suspend or launched after suspend.
3. It doesn't matter if it was suspend or hibernate, even hibernate without BIOS support (echo shutdown > /sys/power/disk).
4. The actual size of the allocated buffer doesn't matter. However, just pretending that we allocated smaller buffers for each code type (by passing a smaller size to nouveau_resource_alloc) fixes the issue. (32K was the largest size that worked.)
5. Buffers that glxgears happens to upload (tested on his and my system, and they appear not to change):
7fc00 - 2nd vertex shader - 392 bytes
7fe00 - vertex shader - 392 bytes
fff00 - pixel shader - 24 bytes
However, the fault address is almost always 7bf00 or something very close to it (7bf08, 7c000). That address is way outside the areas that were uploaded.
6. Filling the whole code buffer with a pattern (using nouveau_bo_map) made MP1 execute that pattern, once again at 7bf00. Tested many times. However, if in addition to doing so the code buffer upload was skipped, all TPs (all 4 of them) faulted in tandem at 7fc00 trying to execute the pattern.
Small update and corrections. This is a summary of findings so far:
1. The fault always happens in MP1 of TP0, after resume from RAM. That's of course very strange.
2. It doesn't matter if the program was run through suspend or launched after suspend. This means that it's not an error in preserving channel/program state through suspend. Everything is uploaded anew after the program starts.
3. It doesn't matter if it was suspend or hibernate, even hibernate without BIOS support (echo shutdown > /sys/power/disk). In other words, a 'magic init' done on boot but missing after resume, or conversely a screwup by the BIOS after resume, is ruled out.
4. The actual size of the allocated buffer doesn't matter. However, just pretending that we allocated smaller buffers for each code type (by passing a smaller size to nouveau_resource_alloc) fixes the issue. (32K was the largest size that worked.) Also, support for large pages was disabled in the kernel and that didn't help. This means that it's not an allocation failure. It could be a memory copy failure, though, although initially the buffer is allocated in video RAM. This also means that the _code offset_ is what matters. In addition to that, note that the code was uploaded by SIFC, but it takes the end address of the area to copy to, so the offset from the start of our buffer really shouldn't matter.
5. Buffers that glxgears happens to upload (tested on his and my system, and they appear not to change):
7fc00 - 2nd vertex shader - 392 bytes
7fe00 - vertex shader - 392 bytes
fff00 - pixel shader - 24 bytes
However, the fault address is almost always 7bf00 or something very close to it (7bf08, 7c000). That address is way outside the areas that were uploaded, which means that MP1 really executes undefined code that just wasn't uploaded. Filling the whole code buffer with a pattern (using nouveau_bo_map) made MP1 execute that pattern, once again at 7bf00. Tested many times.
However, if in addition to doing so the code buffer upload was skipped, all TPs (all 4 of them) faulted in tandem at 7fe00 (vertex shader) trying to execute the pattern.
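The claim that the fault address lies outside everything that was uploaded can be checked with a little arithmetic. A sketch in Python using the addresses and sizes from point 5; it assumes (this is not stated in the report) that all of these values are offsets into the same code area, and the function names are made up:

```python
# Upload ranges reported for glxgears: (name, start offset, size in bytes)
UPLOADS = [
    ("2nd vertex shader", 0x7FC00, 392),
    ("vertex shader", 0x7FE00, 392),
    ("pixel shader", 0xFFF00, 24),
]


def covering_upload(addr, uploads=UPLOADS):
    """Return the name of the upload whose range contains addr, or None."""
    for name, start, size in uploads:
        if start <= addr < start + size:
            return name
    return None


def gap_below_first_upload(addr, uploads=UPLOADS):
    """Distance in bytes from addr up to the lowest uploaded start offset."""
    first = min(start for _, start, _ in uploads)
    return first - addr


for fault in (0x7BF00, 0x7BF08, 0x7C000):
    print(hex(fault), covering_upload(fault), hex(gap_below_first_upload(fault)))
```

None of the three fault addresses falls inside an uploaded range, and 7bf00 sits 0x3d00 (15616) bytes below the start of the lowest upload at 7fc00.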
OK, we tested, and the memory is correctly uploaded. We also dumped the pushbuffers, and nouveau does set the address of the code to execute correctly. Yet, for some damn reason, the GPU still executes from an undefined location. OK, we mmiotraced the blob, extracted its ctxvals/ctxprog, loaded them, and yes, the problem disappeared. It was also noted that some graphical corruptions disappeared. So, folks, we really need better coverage of ctxprog and especially ctxvals, because these seem to contain chipset-revision-specific workarounds.... Could we allow loading nvidia's ctxprog/ctxvals, then, so that users could do that in an easier way than patching their kernel?
Created attachment 51036 [details] ctxvals mmiotrace
Created attachment 51037 [details] ctxprog mmiotrace
Created attachment 51038 [details] ctxprog prepared for loading
Created attachment 51039 [details] ctxvals prepared for load
Created attachment 51040 [details] [review] patch to allow nouveau load binary ctxprog
I was playing Nexuiz, got some hang while playing, had to reboot the computer. Unfortunately when running this I get no such file or directory (after reboot)
[root@myhost ~]# cat /sys/kernel/debug/printk/crash_dmesg | strings
cat: /sys/kernel/debug/printk/crash_dmesg: No such file or directory
[root@myhost ~]#
(In reply to comment #42)
> I was playing Nexuiz, got some hang while playing, had to reboot the computer.
>
> Unfortunately when running this I get no such file or directory (after reboot)
>
> [root@myhost ~]# cat /sys/kernel/debug/printk/crash_dmesg | strings
> cat: /sys/kernel/debug/printk/crash_dmesg: No such file or directory
> [root@myhost ~]#
I'm very happy that I'm able to suspend/hibernate and graphics don't get screwed anytime though, the stability seems to have improved a lot. Thanks!
I've been playing Nexuiz at max resolution with effects set to max for something like 2 hours and I only managed to crash it once, I'd say it's working very well. Great Work!
I got a crash in SuperTuxKart as well.
s/anytime/anymore/g
That's because you don't have my blackbox patch applied.
(In reply to comment #47)
> Thats because you don't have my blackbox patch applied
Could you please upload your patch here?
Created attachment 51042 [details] [review] retrieve kernel log after crash patch When compiling the kernel, enable CONFIG_HWMEM_PRINTK, or, as it is called in its description, 'Log printk message buffer into fixed physical address', under kernel hacking->kernel debugging.
(In reply to comment #48)
> (In reply to comment #47)
> > Thats because you don't have my blackbox patch applied
>
> could you please upload your patch here?
I have your patch applied to my 3.0.4 kernel, but not to 3.1.0-rc5+ (git), I believe.
And you need to use 3.1.0-rc5+ because, as you remember, that is where we put the patch that loads nvidia's ctxprog.
(In reply to comment #51)
> And you need to use 3.1.0-rc5+ because as you remember we put there patch that
> loads nvidia's ctxprog
Correct. I will just SSH in next time I get a hang, if I can make it hang, and get the dmesg. I will report here anything I find. Thanks.
I've noticed this on my dmesg now:
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x00078b0000) subc 5 class 0x8297 mthd 0x15f0 data 0x01000000
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 5 [0x000078b0] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
https://gist.github.com/1209217
Got this now (while playing Nexuiz) -- I got no crashes this time but this error in dmesg:
[diego@myhost ~]$ dmesg
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x000f7b0000) subc 5 class 0x8297 mthd 0x15f0 data 0x00000000
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 5 [0x0000f7b0] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
[diego@myhost ~]$
As you can see, I did a "dmesg -c" before, so I make sure I go this now.
(In reply to comment #54)
> As you can see, I did a "dmesg -c" before, so I make sure I go this now.
s/go/got/g
Again (while playing Nexuiz) -- no crashes.
[diego@myhost ~]$ dmesg
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
process `skype' is using obsolete setsockopt SO_BSDCOMPAT
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: release dev 2 ep01-ISO, period 1, phase 0, 158 us
uhci_hcd 0000:00:1d.3: reserve dev 2 ep01-ISO, period 1, phase 0, 158 us
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 5 (0x00064f0000) subc 5 class 0x8297 mthd 0x0f04 data 0x00000000
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 5 [0x000064f0] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[diego@myhost ~]$
Created attachment 51046 [details] SuperTuxKart crash AFTER suspend dmesg (KERNEL LOG) SuperTuxKart just froze my machine again AFTER suspend/resume. It basically froze my whole machine and I had to reboot, but I managed to SSH in and get the dmesg (see the attached file). The interesting bit seems to be this:
[drm] nouveau 0000:01:00.0: Failed to idle channel 5.
I've been using KDE 4.7.1 (kwin with OpenGL compositing enabled) most of the time, and playing games without suspending first, and it's stable that way. Most of the crashes with games happen AFTER suspending.
I suspended my computer (tm-suspend as root), then played SuperTuxKart, and after completing the first race it just hung my computer. I SSHed in and ran dmesg; here is the relevant log:
[drm] nouveau 0000:01:00.0: firmware ctxvals loaded
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: Failed to idle channel 5.
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
[drm] nouveau 0000:01:00.0: PGRAPH idle timed out with status 0x00000203
[drm] nouveau 0000:01:00.0: PGRAPH idle timed out with status 0x00000303
[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00000203 0x00000001 0x00000000 0x00000000
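Logs like the one above are easier to read when runs of identical lines are collapsed into a line plus a repeat count. A minimal Python sketch of that idea, using only the standard library; the helper names are made up, and it works on whatever lines it is given rather than on this log specifically:

```python
from itertools import groupby


def collapse_repeats(lines):
    """Collapse runs of identical consecutive lines into (line, count) pairs."""
    return [(line, sum(1 for _ in run)) for line, run in groupby(lines)]


def format_collapsed(lines):
    """Render collapsed lines, appending a repeat marker where count > 1."""
    out = []
    for line, count in collapse_repeats(lines):
        out.append(line if count == 1 else f"{line}  [repeated {count} times]")
    return out
```

For instance, format_collapsed(text.splitlines()) applied to saved dmesg output keeps one copy of each consecutive run, so the lines that differ (such as "Failed to idle channel 5.") stand out.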
pm-suspend, sorry
(In reply to comment #59)
The last time it crashed I was running the game in windowed mode at max resolution (1680x1050), AFTER suspending.
I've tried running Doom 3 (without suspending my machine) and it ran very slow (4 FPS), and Doom 3 thought I had only 64 MB of VRAM, wtf?
Doom 3 didn't crash though, but it was very slow. I managed to run "timedemo demo1" and it passed the test but very slowly (4 FPS)
Sorry! I was missing some lib32 libs, like DRI and mesa (32 bits), now I installed those and Doom 3 runs freaking FAST! 60+ FPS :D And runs stable too (without suspending)
Maybe we should open another bug report for the crashes that happen after suspending, as this one is for the graphics corruption after suspend.
Hello, I tried running SuperTuxKart after suspend and it crashed as usual. Then I tried your suggestion: suspend without X, resume, unbind, restart the nouveau module with your script, start X and play SuperTuxKart, but it crashed after playing for a while (1-2 laps). I ran dmesg on the affected computer from my laptop via SSH and found nothing interesting, except these lines:
[drm] nouveau 0000:01:00.0: Failed to idle channel 4.
[drm] nouveau 0000:01:00.0: GPU lockup - switching to software fbcon
[drm] nouveau 0000:01:00.0: Failed to idle channel 2.
I looked in Xorg.0.log too but found nothing interesting there either. Any ideas?
Just updated the title to something that describes the current problem better.
We did further testing, and even though we had some promising results, it looks like this bug is too hard to fix, at least for me. We found the following:
1. Forcing nouveau to run the POST tables (force_post=1) produced the same corruption as after suspend when we didn't use nvidia's ctxprog. The error reported was the same (failure in MP1 of TP1). We also managed to crash the system twice with force_post set (running SuperTuxKart). Sadly, we couldn't crash the system again when we tried to repeat that running Nexuiz. Yet. Also, when I decoded the init tables, I found that they set up the memory controller, reclock the card, etc.
2. Nature of the crashes after resume from RAM with nvidia's ctxprog: we found that disabling pageflip support made the crash occur less often. We also found that 2D activity, like opening Firefox or changing the virtual desktop, made the system crash right away, provided a 3D app was already running (we used Unigine Tropics for that, I think).
3. Without nvidia's ctxprog, crashes happen as well.
4. We ruled out the theory that the problem is the core clock being more than 1/2 of the shader clock. It's exactly 1/2.
5. ....
We're running SuperTuxKart built from SVN on X, and nothing else, like this:
xinit /usr/local/games/supertuxkart --profile-laps=1000 --fullscreen
We ran this with ctxprog loaded and force_post=1 (without suspend) and SuperTuxKart just hung on lap 26/1000.
Now the situation is much better. We did many tests and got this picture: there are 3 problems that are more or less independent. All are caused by either suspend or running the BIOS init tables (tested). None of these problems happens if neither of the above was done. In other words, the BIOS init tables set some registers to harmful values, but the nvidia driver undoes that - BIOS writes really are evil, and indeed one needs a 'glove box' to handle BIOSes.
1. Fault in MP1 of TP0 - we tested that often, and we did get it with force_post=1. Nvidia's ctxprog (more likely ctxvals) 'fixes' this.
2. Semaphore-related fault, which happens after around 10 minutes of full-screen STK. It was reproduced with and without forced POSTing, once in each situation, with the same error according to PFIFO state dumps. Can be 'fixed' by turning pageflipping off.
3. Crash that happens if we run glxgears in addition to STK. It happens very fast (~3 seconds from glxgears start), regardless of pageflipping. Both PFIFO and PGRAPH appear to be idle, though.
Created attachment 51300 [details] list of POST register writes Hopefully, when we replay that list of register writes, the problem will reappear, just like with force_post.
Created attachment 51301 [details] [review] demmio_replay patch This is the tool I use to replay traces. I adapted it to replay this trace. Apply the patch to envytools.
I've run SuperTuxKart for 9 hours (620/50000 laps) in total, without suspending the machine. The result: SuperTuxKart didn't crash or hang at all. I also ran glxgears in the middle of it and got no hangs or crashes. Stability is really good without suspend. I see lots of these messages in my dmesg, however:
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff0d2d07
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff346120
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff254208
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff041d00
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
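For anyone who wants to tally these repeated VM fault messages rather than read them by eye, here is a minimal Python sketch. The regular expression is only an assumption based on the message format quoted in this comment, and the name summarize is made up for illustration; it is not part of nouveau or any existing tool.

```python
import re
from collections import Counter

# One occurrence of the message group quoted in this comment.
SAMPLE = """\
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
[drm] nouveau 0000:01:00.0: PGRAPH - TRAP
[drm] nouveau 0000:01:00.0: PGRAPH - ch 4 (0x0001980000) subc 5 class 0x8297 mthd 0x15f0 data 0xff0d2d07
[drm] nouveau 0000:01:00.0: VM: trapped read at 0x0000000000 on ch 4 [0x00001980] PGRAPH/VFETCH/00 reason: PT_NOT_PRESENT
"""

# Assumed format, derived only from the lines above; other kernels may print it differently.
VM_FAULT = re.compile(
    r"VM: trapped (?P<access>\w+) at (?P<addr>0x[0-9a-f]+) on ch (?P<ch>\d+) "
    r"\[0x[0-9a-f]+\] (?P<unit>\S+) reason: (?P<reason>\w+)"
)

def summarize(log_text):
    """Count VM faults grouped by (channel, unit, reason, address)."""
    counts = Counter()
    for m in VM_FAULT.finditer(log_text):
        key = (int(m.group("ch")), m.group("unit"), m.group("reason"), m.group("addr"))
        counts[key] += 1
    return counts

if __name__ == "__main__":
    for key, n in summarize(SAMPLE).items():
        print(n, key)
```

Feeding it the full dmesg text instead of SAMPLE would show whether all faults hit the same channel and address, as the four groups quoted here do.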
OK folks, it was just one bit, and it was unset by the BIOS init tables. One bit: 0x400000 of register 0x4008, the register that is responsible for the memory clock. Actual values of this register:
before suspend: 0x90596400 PCLOCK.MPLL1 => { UNK0 = 0x6400 | P = 0x1 | UNK19 = 0xb | UNK28 = 0x1 | ENABLE }
after suspend: 0x90016400 PCLOCK.MPLL1 => { UNK0 = 0x6400 | P = 0x1 | UNK19 = 0 | UNK28 = 0x1 | ENABLE }
test1: 0x90096400 - corruptions, but stable (with ctxfw, hang after 10 secs)
test2: 0x90516400 - good (with ctxprog)
test3: 0x90416400 - good (with ctxprog), without ctxprog: good
test4: 0x90116400 - corruptions without ctxprog
It also seems that without nvidia's ctxprog, while we always got corruption, we didn't get hangs (but we tested just a few laps of STK+glxgears). It doesn't matter much. It is one bit, and it's in an unknown PLL area....
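To make the bit arithmetic in this comment easier to check, here is a small Python sketch that decodes the quoted register values. The "good"/"corrupt" labels are my reading of the results listed above, and the UNK19 field width (4 bits starting at bit 19) is an assumption inferred from the decoder output, not something taken from the hardware documentation.

```python
# Values of register 0x4008 (PCLOCK.MPLL1) as quoted in the comment above.
SUSPECT_BIT = 0x400000  # the bit reported as unset by the BIOS init tables

# Outcome labels are an interpretation of the comment, not measured data.
SAMPLES = {
    "before suspend": (0x90596400, "good"),
    "after suspend": (0x90016400, "corrupt"),
    "test1": (0x90096400, "corrupt"),
    "test2": (0x90516400, "good"),
    "test3": (0x90416400, "good"),
    "test4": (0x90116400, "corrupt"),
}

def unk19(value):
    """Extract the field printed as UNK19 (assumed: 4 bits starting at bit 19)."""
    return (value >> 19) & 0xF

def suspect_bit_set(value):
    """True if bit 0x400000 is set in the register value."""
    return bool(value & SUSPECT_BIT)

if __name__ == "__main__":
    before, after = SAMPLES["before suspend"][0], SAMPLES["after suspend"][0]
    # XOR shows every bit that differs across suspend, not only 0x400000.
    print(f"changed bits: 0x{before ^ after:08x}")
    for name, (value, outcome) in SAMPLES.items():
        print(f"{name:15s} 0x{value:08x} UNK19=0x{unk19(value):x} "
              f"bit 0x400000 set: {suspect_bit_set(value)} ({outcome})")
```

On these six samples, bit 0x400000 is set exactly in the cases listed as good, which matches the conclusion in the comment even though suspend clears two neighbouring bits as well.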
Created attachment 51323 [details] [review] Temporary hack to fix this This is a hack to fix this issue. It must be reverse-engineered further to get to the bottom of this.
Created attachment 51324 [details] vbios.rom Attaching a dump of /sys/kernel/debug/dri/0/vbios.rom, made with:
cat /sys/kernel/debug/dri/0/vbios.rom > vbios.rom
I found another way to hang/crash the system.
1- Boot into console mode (id:3:initdefault: in /etc/inittab)
2- start X (and just X) in tty1
3- switch to tty2 and start STK in that tty2.
Start STK in tty2 like this:
export DISPLAY=:0
/usr/local/games/supertuxkart --profile-laps=50000 --fullscreen
When I start STK like that, it attempts to switch to X as soon as STK is launched. Then X hangs right there. It won't even start STK; it hangs as soon as the VT is switched to X, and then I get errors like this one:
[ 17583.534] (EE) NOUVEAU(0): failed to set mode: Permission denied
Fatal server error:
[ 17583.534] EnterVT failed for screen 0
It seems like this issue is more related to this bug than anything else: https://bugzilla.redhat.com/show_bug.cgi?id=680677
If I try the same with glxgears, e.g. reboot the machine, log in on tty1, start X, switch to tty2, log in on tty2, start glxgears in tty2, nothing happens... it doesn't even switch back to X when I start glxgears in tty2. It runs glxgears and I see output in tty2, and I can switch to tt7 where X is running (alt+f7) and see glxgears running there. But when I try the same with STK, it attempts to switch to X directly and hangs the machine.
(In reply to comment #77) > I found another way to hung/crash the system. > > 1- Boot into console mode (id:3:initdefault: in /etc/inittab) > > 2- start X (and just X) in tty1 > > 3- switch to tty2 and start STK in that tty2. > > start STK in tty2 like this: > > export DISPLAY=:0 > /usr/local/games/supertuxkart --profile-laps=50000 --fullscreen > > when I start STK like that it would attempt to switch to X as soon as STK is > launched. > > Then X would hung right there, it wont even start STK, but it would hung as > soon as the VT is switched to X, then I would get errors like this one: > > [ 17583.534] (EE) NOUVEAU(0): failed to set mode: Permission denied > Fatal server error: > [ 17583.534] EnterVT failed for screen 0 > > It seems like this issue is more related to this bug than anything else: > https://bugzilla.redhat.com/show_bug.cgi?id=680677 > > If I try to do the same, e.g. reboot the machine, login into tty1, start X, > switch to tty2, login into tty2, start glxgears in tty2, nothing happens... it > doesn't even switch right back to X when I start glxgears in tty2, it runs > glxgears and I see output in tty2, and I can switch to tt7 where X is running > (alt+f7) and see glxgears running in there. > > But whether I try the same with STK, it would attempt to switch to X directly > and hung the machine. tty7*
I was able to suspend the machine and run STK+glxgears for many hours without another crash. It seems like the hang I got in my last post is related to fast VT switching and nothing else, so it is a different bug.
If I ssh from my laptop to the affected computer, I can run STK+glxgears for hours with no hangs. The last hangs I experienced only happen when there is a fast VT->X context switch. So it seems like they are completely unrelated to the suspend hangs. Thanks.
[root@myhost ~]# lspci -v 00:00.0 Host bridge: Intel Corporation 4 Series Chipset DRAM Controller (rev 03) Subsystem: ASUSTeK Computer Inc. Device 836d Flags: bus master, fast devsel, latency 0 Capabilities: [e0] Vendor Specific Information: Len=0c <?> 00:01.0 PCI bridge: Intel Corporation 4 Series Chipset PCI Express Root Port (rev 03) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 I/O behind bridge: 0000d000-0000dfff Memory behind bridge: fa000000-feafffff Prefetchable memory behind bridge: 00000000e0000000-00000000efffffff Capabilities: [88] Subsystem: ASUSTeK Computer Inc. Device 836d Capabilities: [80] Power Management version 3 Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit- Capabilities: [a0] Express Root Port (Slot+), MSI 00 Kernel driver in use: pcieport 00:1b.0 Audio device: Intel Corporation N10/ICH 7 Family High Definition Audio Controller (rev 01) Subsystem: ASUSTeK Computer Inc. Device 8445 Flags: bus master, fast devsel, latency 0, IRQ 6 Memory at f9ffc000 (64-bit, non-prefetchable) [size=16K] Capabilities: [50] Power Management version 2 Capabilities: [60] MSI: Enable- Count=1/1 Maskable- 64bit+ Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00 00:1c.0 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 1 (rev 01) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=04, subordinate=04, sec-latency=0 I/O behind bridge: 00002000-00002fff Memory behind bridge: 80600000-807fffff Prefetchable memory behind bridge: 0000000080800000-00000000809fffff Capabilities: [40] Express Root Port (Slot+), MSI 00 Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Capabilities: [90] Subsystem: ASUSTeK Computer Inc. 
Device 8179 Capabilities: [a0] Power Management version 2 Kernel driver in use: pcieport 00:1c.1 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 2 (rev 01) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=03, subordinate=03, sec-latency=0 I/O behind bridge: 00001000-00001fff Memory behind bridge: 80200000-803fffff Prefetchable memory behind bridge: 0000000080400000-00000000805fffff Capabilities: [40] Express Root Port (Slot+), MSI 00 Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Device 8179 Capabilities: [a0] Power Management version 2 Kernel driver in use: pcieport 00:1c.3 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 4 (rev 01) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=02, subordinate=02, sec-latency=0 I/O behind bridge: 0000e000-0000efff Memory behind bridge: feb00000-febfffff Prefetchable memory behind bridge: 0000000080000000-00000000801fffff Capabilities: [40] Express Root Port (Slot+), MSI 00 Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Device 8179 Capabilities: [a0] Power Management version 2 Kernel driver in use: pcieport 00:1d.0 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #1 (rev 01) (prog-if 00 [UHCI]) Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard Flags: bus master, medium devsel, latency 0, IRQ 23 I/O ports at c480 [size=32] Kernel driver in use: uhci_hcd 00:1d.1 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #2 (rev 01) (prog-if 00 [UHCI]) Subsystem: ASUSTeK Computer Inc. 
P5KPL-VM,P5LD2-VM Mainboard Flags: bus master, medium devsel, latency 0, IRQ 19 I/O ports at c800 [size=32] Kernel driver in use: uhci_hcd 00:1d.2 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #3 (rev 01) (prog-if 00 [UHCI]) Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard Flags: bus master, medium devsel, latency 0, IRQ 18 I/O ports at c880 [size=32] Kernel driver in use: uhci_hcd 00:1d.3 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #4 (rev 01) (prog-if 00 [UHCI]) Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard Flags: bus master, medium devsel, latency 0, IRQ 16 I/O ports at cc00 [size=32] Kernel driver in use: uhci_hcd 00:1d.7 USB Controller: Intel Corporation N10/ICH 7 Family USB2 EHCI Controller (rev 01) (prog-if 20 [EHCI]) Subsystem: ASUSTeK Computer Inc. P5KPL-VM,P5LD2-VM Mainboard Flags: bus master, medium devsel, latency 0, IRQ 23 Memory at f9ffbc00 (32-bit, non-prefetchable) [size=1K] Capabilities: [50] Power Management version 2 Capabilities: [58] Debug port: BAR=1 offset=00a0 Kernel driver in use: ehci_hcd 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1) (prog-if 01 [Subtractive decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=05, subordinate=05, sec-latency=32 Capabilities: [50] Subsystem: ASUSTeK Computer Inc. Device 8179 00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01) Subsystem: ASUSTeK Computer Inc. P5KPL-VM Motherboard Flags: bus master, medium devsel, latency 0 Capabilities: [e0] Vendor Specific Information: Len=0c <?> 00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01) (prog-if 8a [Master SecP PriP]) Subsystem: ASUSTeK Computer Inc. 
P5KPL-VM Motherboard Flags: bus master, medium devsel, latency 0, IRQ 18 I/O ports at 01f0 [size=8] I/O ports at 03f4 [size=1] I/O ports at 0170 [size=8] I/O ports at 0374 [size=1] I/O ports at ffa0 [size=16] Kernel driver in use: ata_piix 00:1f.2 IDE interface: Intel Corporation N10/ICH7 Family SATA IDE Controller (rev 01) (prog-if 8f [Master SecP SecO PriP PriO]) Subsystem: ASUSTeK Computer Inc. P5KPL-VM Motherboard Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 22 I/O ports at c400 [size=8] I/O ports at c080 [size=4] I/O ports at c000 [size=8] I/O ports at bc00 [size=4] I/O ports at b880 [size=16] Capabilities: [70] Power Management version 2 Kernel driver in use: ata_piix 01:00.0 VGA compatible controller: nVidia Corporation G96 [GeForce 9500 GT] (rev a1) (prog-if 00 [VGA controller]) Subsystem: eVga.com. Corp. Device c958 Flags: bus master, fast devsel, latency 0, IRQ 16 Memory at fd000000 (32-bit, non-prefetchable) [size=16M] Memory at e0000000 (64-bit, prefetchable) [size=256M] Memory at fa000000 (64-bit, non-prefetchable) [size=32M] I/O ports at dc00 [size=128] Expansion ROM at fea80000 [disabled] [size=512K] Capabilities: [60] Power Management version 3 Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+ Capabilities: [78] Express Endpoint, MSI 00 Capabilities: [b4] Vendor Specific Information: Len=14 <?> Kernel driver in use: nouveau Kernel modules: nouveau 02:00.0 Ethernet controller: Atheros Communications AR8121/AR8113/AR8114 Gigabit or Fast Ethernet (rev b0) Subsystem: ASUSTeK Computer Inc. P5KPL-CM Motherboard Flags: bus master, fast devsel, latency 0, IRQ 19 Memory at febc0000 (64-bit, non-prefetchable) [size=256K] I/O ports at ec00 [size=128] Capabilities: [40] Power Management version 2 Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+ Capabilities: [58] Express Endpoint, MSI 00 Kernel driver in use: ATL1E [root@myhost ~]#
Pictures of the affected card. Front: http://dl.dropbox.com/u/6005119/gpu/P1120649.JPG Back: http://dl.dropbox.com/u/6005119/gpu/P1120671.JPG
Created attachment 51365 [details] lspci -a added lspci -a
(In reply to comment #82) > Pictures of the affected card. > > Front: > > http://dl.dropbox.com/u/6005119/gpu/P1120649.JPG > > Back: > > http://dl.dropbox.com/u/6005119/gpu/P1120671.JPG card pictures (mirror), just in case. http://ompldr.org/vYWY5aA/P1120649.JPG http://ompldr.org/vYWY5Zw/P1120671.JPG
I failed to reproduce this same problem with this card: 01:00.0 VGA compatible controller: nVidia Corporation G96 [GeForce 9400 GT] (rev a1) BTW this problem is also present on Fedora 15, I tested it and the problem is there also.
I fail to reproduce this issue with this card as well: 02:00.0 VGA compatible controller: nVidia Corporation G92 [GeForce 9800 GT] (rev a2) This issue is only present with my GeForce 9500 GT. Weird.
I have a ThinkPad T510 with an Optimus GPU (hybrid Intel/Nvidia). I switched the GPU in the BIOS to use the Nvidia card so I could try to reproduce this issue on my laptop as well, and failed to reproduce it there too. I have suspended the laptop many times, but glxgears appears just fine after suspend: no corrupted graphics, etc.
[diego@myhost ~]$ lspci|grep VGA
01:00.0 VGA compatible controller: nVidia Corporation GT218 [NVS 3100M] (rev a2)
[diego@myhost ~]$
Created attachment 51531 [details] [review] More complete MPLL programming during POST Give this patch a try. It should correct nouveau's MPLL setup when cold-booting the card.
(In reply to comment #88) > Created an attachment (id=51531) [details] > More complete MPLL programming during POST > > Give this patch a try. It should correct nouveau's MPLL setup when > cold-booting the card. Works great, thank you so much! :)
I've rebuilt the kernel today from Linus Torvalds' GitHub repository and applied darktama's patch.
Linux myhost 3.1.0-rc7+ #1 SMP Fri Sep 23 13:44:20 PYT 2011 x86_64 Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz GenuineIntel GNU/Linux
So far things are very stable! I've run STK (100 laps) with glxgears and suspended many times. I also suspended while in-game, and things are very stable. I also tried using KDE 4.7 with KWin OpenGL compositing before and after suspending, and things are very smooth and stable. I'm going to close this bug report now and mark it fixed. calim and MaximLevitsky already approved closing the bug report. calim has also stated that the real fix has been merged into nouveau git master and that it will be available in Linux 3.2. The commit is here: http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=a51d43c1b27581e780f60b6d724d146db94b31c5 Thanks to everyone who has helped!