Bugzilla – Bug 35876
Hard GPU hangs on NV86
Last modified: 2012-02-12 13:14:23 UTC
Created attachment 45137 [details]
Hardware: Geforce 8400M GS - (0x086700a2)
Acer Aspire 5720G
Userspace: Git tips of all components except Xserver.
The attached file lists the latest commits of each tree.
Kernel drm version: dd34154f9bd75d13caefffd9d6086a1d23a45856
drm/nouveau: use static vidshift of 2 on volt 0x30 tables
This already happened here many times.
Just 2D usage of compiz is enough.
I will eventually test if even just 2D usage can trigger this.
I get this in the kernel log:
<3>[17638.684109] [drm] nouveau 0000:01:00.0: vm flush timeout: engine 0
<3>[17644.904441] [drm] nouveau 0000:01:00.0: PRAMIN flush timeout
<6>[17645.869768] [drm] nouveau 0000:01:00.0: PFIFO_INTR 0x04000000 - Ch 2
<3>[17645.966182] [drm] nouveau 0000:01:00.0: vm flush timeout: engine 5
(Extracted using a RAM-based "black box": a fixed area of system memory that I read back after a crash.)
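To see how the messages above are spaced in time, a small parser can help. This is only a sketch: it assumes the line layout shown in the excerpt above (an optional `<N>` log-level prefix, a bracketed timestamp in seconds, then `[drm] nouveau <device>: <message>`), and the names `parse_line` and `gaps` are made up for illustration.

```python
import re

# Matches lines such as:
#   <3>[17638.684109] [drm] nouveau 0000:01:00.0: vm flush timeout: engine 0
# The "<N>" log-level prefix is optional (assumed layout, taken from the excerpt).
LINE_RE = re.compile(
    r"^(?:<(?P<level>\d)>)?\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] nouveau "
    r"(?P<dev>[0-9a-f:.]+): (?P<msg>.*)$"
)

SAMPLE_LOG = [
    "<3>[17638.684109] [drm] nouveau 0000:01:00.0: vm flush timeout: engine 0",
    "<3>[17644.904441] [drm] nouveau 0000:01:00.0: PRAMIN flush timeout",
    "<6>[17645.869768] [drm] nouveau 0000:01:00.0: PFIFO_INTR 0x04000000 - Ch 2",
    "<3>[17645.966182] [drm] nouveau 0000:01:00.0: vm flush timeout: engine 5",
]


def parse_line(line):
    """Return a dict with level, ts, dev and msg, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    level = m.group("level")
    return {
        "level": int(level) if level is not None else None,
        "ts": float(m.group("ts")),
        "dev": m.group("dev"),
        "msg": m.group("msg"),
    }


def gaps(lines):
    """Return (seconds since previous matching line, message) for each line after the first."""
    events = [e for e in map(parse_line, lines) if e is not None]
    return [(b["ts"] - a["ts"], b["msg"]) for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    for delta, msg in gaps(SAMPLE_LOG):
        print(f"+{delta:.3f}s  {msg}")
```

On the four lines above this shows that the PRAMIN flush timeout follows the first vm flush timeout by roughly 6.2 seconds, while the remaining two messages arrive within about a second of it.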
The system freezes completely.
I might have been lucky with this bug, dunno.
I looked at the documentation in envytools, and it mentions that you can't do a TLB flush while PGRAPH is running on NV86.
Well, that's known, and the code contains a workaround.
What I did in addition to that is take Martin Peres's code that suspends/resumes PGRAPH and hook it into nv84_graph_tlb_flush.
No hangs so far, but that doesn't prove anything yet.
I added this code:
So PGRAPH is paused when it is supposed to be idle.
Yet, his nv50_graph_pause often complains that:
[18931.100046] [drm] nouveau 0000:01:00.0: PGRAPH: PGRAPH paused while running a ctxprog, NV40_PGRAPH_CTXCTL_0310 = 0x11
[18937.770017] [drm] nouveau 0000:01:00.0: PGRAPH: wait for idle fail: 00000000 00000000 00000000 00000101!
Let's wait and see, but it could be that I am right and we need to pause PGRAPH more aggressively here.
What version of the Linux kernel do you use? I'm having the same problem with vanilla kernel v2.6.39-rc6. But there were no problems with v2.6.37 (maybe I haven't tested it enough).
I think this should be fixed in the kernel (more exactly, in the "nouveau" kernel driver).
Yes, it is indeed pageflip support.
Putting 'return FALSE' in can_exchange fixes the problem; last time I must have forgotten to 'make install' or something.
However, 'Option "PageFlip" "false"' doesn't help, because even if it is set, the condition in can_exchange still allows flipping in some cases
(if nouveau_exa_pixmap_is_onscreen() == FALSE).
Not sure why that check is there.
@Alexander Potashev - the older kernel you describe simply doesn't have the pageflipping code, so of course you don't see the problem there.
Also note that if the game doesn't run full-screen, I can still see rare flickering with pageflip disabled (with that return FALSE).
Probably, as was suggested before, pageflipping just exposes a problem that was already there.
Drat, disregard the comments; these are for another bug.
And for a comment that is relevant to this bug report: I started a long endurance test using 2.6.35.
I will see if it hangs in the same way.
I remember that long ago there were no hangs.
Of course, that endurance test ended long ago; it lasted maybe 2 hours. It crashes.
I'll soon be tackling that at full power, as this is the last issue I have with nouveau (well, I haven't fixed power usage yet, but at least I can write the discovered registers to lower power usage here. Of course I will attempt to make them apply to other cards, etc...)
I almost solved it! (trivial to fix now).
I switched my system to the nvidia ctxprog and ran the Unigine Tropics demo, which used to hang after 3~5 minutes of runtime.
Now it has already been running for 40 minutes (and I will leave it for a few hours) with compiz on while I use the system, and I did an s2ram cycle with Tropics and compiz running.
No changes were made in mesa or other components, so it's the ctxprog for sure.
I guess I did score with my shot in the dark...
In total, Tropics ran for an hour and 30 minutes before I shut it down.
For the last 20 minutes it ran with the card upclocked to pm level 2.
Now running Heaven benchmark.
And I forgot to mention that my trace that reduces power usage was run while Tropics was running, and everything is just fine!
OK, I ran Unigine Heaven for an hour and 30 minutes. I don't want to leave the system stressed and unattended. It sure works, beyond any doubt.
So far, so good. No hangs, everything I throw at nouveau works.
For now I'll use this ctxprog + my power saving settings for a week or so to ensure that there are no crashes.
Then I will debug the ctxprog.
So far it's just perfect.
Created attachment 50470 [details] [review]
fix for stability issues on NV86
That's all. I have finished the comparison.
Like it or hate it, in the end the source of so many crashes, data losses and general suffering was found to be just one bit: a single bit that the nouveau ctxprog generator didn't set in one of the ctxvals in the per-MP state.
(Well, my card has 2 MPs, so that would make that 2 bits, I guess :-) )
The attached patch was tested with a half-hour run of Unigine Tropics.
Unless you have reason to suspect that this bit can break other NV86 cards, please merge it.
Well, Tropics is stable all right, but playing supertuxkart for about 5 laps got me another freeze. Well... that saga won't end.
Of course, the bit that fixes Tropics does fix it, but it now seems to be only part of the problem.
Now I am once again running closer to the nvidia ctxprog. Who knows, maybe that will help, but since stk is not automated, I can't test it properly until I implement automatic driving there, which should be very easy to do.
OK, supertuxkart crashes often. I'd say often enough to justify automating it, so that I can run it in the background instead of playing it, as it's not fun to play a game too much, especially this game, which I don't like that much.
In fact, I reproduced a hang with latest mesa, no reclocking and nvidia ctxprog.
(The only thing that was on is the power saving magic, but I had these crashes without it as well, so it's not likely to be the reason; I will still test without it soon.)
Created attachment 50512 [details]
mmio dump of PGRAPH
This is an mmio dump of PGRAPH after a crash, in the unlikely case that it can help narrow down the problem.
Also I did another dump right after this one, and the difference is:
-004008f0: 00000000 00000000 21c18e7b 00c10001
+004008f0: 00000000 00000000 21f057a8 00c10001
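To pin down which register changed between the two dumps, the rows can be compared word by word. The sketch below assumes each dump row is a hexadecimal base address followed by consecutive 32-bit words, so the n-th word sits at base + 4*n; that layout is my reading of the rows above, not something the dump tool documents here, and `parse_row` and `diff_rows` are hypothetical helper names.

```python
def parse_row(row):
    """Split 'ADDR: w0 w1 w2 w3' (optionally prefixed with '+' or '-') into (base, words)."""
    addr_text, words_text = row.split(":", 1)
    base = int(addr_text.strip().lstrip("+-"), 16)
    words = [int(w, 16) for w in words_text.split()]
    return base, words


def diff_rows(before, after, word_size=4):
    """Return (address, old_value, new_value) for every word that differs.

    Assumes consecutive words of word_size bytes starting at the row's base address.
    """
    base_a, words_a = parse_row(before)
    base_b, words_b = parse_row(after)
    if base_a != base_b or len(words_a) != len(words_b):
        raise ValueError("rows do not describe the same register range")
    return [
        (base_a + i * word_size, a, b)
        for i, (a, b) in enumerate(zip(words_a, words_b))
        if a != b
    ]


if __name__ == "__main__":
    old = "-004008f0: 00000000 00000000 21c18e7b 00c10001"
    new = "+004008f0: 00000000 00000000 21f057a8 00c10001"
    for addr, a, b in diff_rows(old, new):
        print(f"0x{addr:06x}: 0x{a:08x} -> 0x{b:08x}")
```

Under that layout assumption, the only word that differs between the two dumps is the third one in the row, i.e. offset 0x4008f8.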
Created attachment 50577 [details]
another mmio dump of PGRAPH
Here is an mmio dump of PGRAPH taken right after the freeze, without killing any processes, which I wrongly did before the former dump.
I forgot to mention that the crash is from running supertuxkart with compiz.
I somehow suspect that it's the combination that causes the crash, so I'll test without compiz as well soon.
Created attachment 50580 [details]
mmio dump of PGRAPH #2
Another crash, running supertuxkart without compiz but with reclocking (and that for sure doesn't matter, as it crashed many times without it).
Now I was able to reproduce it without either compiz or reclocking.
It took a whole hour of automated STK run.
I attach PFIFO and PGRAPH dumps.
It looks like a PFIFO cache error.
Created attachment 50607 [details]
PGRAPH mmio dump
Created attachment 50608 [details]
PFIFO mmio dump
In the PFIFO dump it appears that the game channel was already killed, and PFIFO tried switching to the DDX channel.
I have another, older PFIFO dump where it's not killed, and once again there is an attempt to switch to channel #2, which is the DDX channel.
In fact, the channel mapping (which just happens to be this way) is as follows.
Without compiz:
1 - framebuffer - idle (while in X of course)
2 - xserver EXA - occasional bursts while the game runs; mostly on-screen panel updates (clock, etc.), I think
3 - xserver AIGLX - not used at all, as I don't launch such clients
4 - game
With compiz:
1 - framebuffer
2 - xserver EXA
3 - xserver AIGLX
4 - compiz - again, short bursts of activity from time to time while the game runs, probably to redraw windows that X clients (the same panel, for example) write to.
5 - game
Also, channel 0 seems not to be used at all; it and channel 127 are sort of special.
Created attachment 50609 [details]
Older PFIFO dump
Created attachment 50621 [details]
Another PFIFO dump
This time I made sure X doesn't submit anything and thus game channel was the only channel submitting commands.
Thanks to Marcin Kościelnicki, I now understand that the hung PFIFO state in the dumps was just a consequence of an attempt to execute a 'software' command in the command stream.
These commands do cause PFIFO errors, but that is normal: the error triggers an interrupt, and the kernel driver handles the error.
In this case the software command was used to signal a pageflip, and it was issued for every full-screen 3D application.
However, I forgot to mention that I was running the nouveau kernel driver with the 'msi=1' option, which makes the hardware use message signaled interrupts; these are known to be broken on several systems because of chipset bugs or other reasons.
Because of that, in rare cases the interrupt wasn't delivered to the kernel, causing PFIFO to hang while waiting for software to service it.
When I disabled the 'msi' option (it is disabled by default), the problem went away and I was able to run supertuxkart for an hour and 50 minutes. I repeated that test a few times with different variations, including running with nouveau's ctxprog and compiz. I am quite sure (knocks wood) that the problem is gone.
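For anyone wanting to check whether they are running with the same option, here is a rough sketch. It only covers the case where the option was given on the kernel command line in the generic `module.param=value` form (here `nouveau.msi=1`); options set through modprobe configuration would not show up this way. The helper names and the sample command line are made up for illustration.

```python
def module_param_from_cmdline(cmdline, module, param):
    """Return the last value given as 'module.param=value', or None if absent."""
    prefix = f"{module}.{param}="
    value = None
    for token in cmdline.split():
        if token.startswith(prefix):
            value = token[len(prefix):]  # later occurrences override earlier ones
    return value


def msi_requested(cmdline):
    """True if the command line explicitly asks nouveau to use MSI."""
    return module_param_from_cmdline(cmdline, "nouveau", "msi") == "1"


if __name__ == "__main__":
    # Hypothetical command line; on a live system one would read /proc/cmdline instead.
    example = "root=/dev/sda1 ro nouveau.msi=1 quiet"
    print("MSI explicitly enabled:", msi_requested(example))
```

If this reports that MSI was explicitly enabled and the machine shows the hangs described here, removing the option is the first thing to try.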
A side note: I was getting hangs on one small webgl demo, http://helloracer.com/webgl/, and I was using an older version of mesa because the latest had rendering issues (resolved now, and believed to actually be a chromium bug). That demo generated a lot of PGRAPH errors (it doesn't now), and once again, since each error triggers an interrupt, eventually the same hang would happen.
Also, XV syncs to VBlank, and that also involves interrupts. I had several hangs in the past while using XV, so I wouldn't be surprised if that was the cause (I have used the 'msi' option since it was introduced, and didn't expect this to happen).
That's all, and it seems (knocks wood again) that my card is finally stable.
Back to the hangs caused by the Unigine demos. I reproduced that once more, without my patch of course.
[10459.780042] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
[10464.810946] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
[10469.826439] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
[10474.841967] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
[10479.770036] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
That is from the kernel log. PFIFO appears to be idle and PGRAPH hung, while its MPs appear active.
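A quick way to confirm that the repeated messages above really report the same state each time is to extract the four status words and the timestamps. This is a sketch that assumes the message format shown above; it does not try to decode what the individual words mean, and `parse_tlb_timeout` and `summarize` are hypothetical names.

```python
import re

# Assumed format, taken from the log excerpt:
#   [ts] [drm] nouveau <dev>: PGRAPH TLB flush idle timeout fail: 0x... 0x... 0x... 0x...
TLB_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\].*PGRAPH TLB flush idle timeout fail:\s*"
    r"(?P<words>(?:0x[0-9a-fA-F]+\s*)+)$"
)

PREFIX = "[drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail:"
STATUS = "0x00c00603 0x0000014d 0x00005600 0x00000000"
SAMPLE = [
    f"[{ts}] {PREFIX} {STATUS}"
    for ts in ("10459.780042", "10464.810946", "10469.826439",
               "10474.841967", "10479.770036")
]


def parse_tlb_timeout(line):
    """Return (timestamp, tuple of status words) or None if the line does not match."""
    m = TLB_RE.search(line.strip())
    if m is None:
        return None
    return float(m.group("ts")), tuple(int(w, 16) for w in m.group("words").split())


def summarize(lines):
    """Count matching lines, collect distinct status tuples and gaps between messages."""
    parsed = [p for p in map(parse_tlb_timeout, lines) if p is not None]
    stamps = [ts for ts, _ in parsed]
    return {
        "count": len(parsed),
        "distinct_status": {words for _, words in parsed},
        "intervals": [b - a for a, b in zip(stamps, stamps[1:])],
    }


if __name__ == "__main__":
    info = summarize(SAMPLE)
    print(info["count"], "timeouts,", len(info["distinct_status"]), "distinct status value(s)")
    print("intervals:", ", ".join(f"{d:.2f}s" for d in info["intervals"]))
```

For the five lines above, all status words are identical and the messages are about five seconds apart, which fits a PGRAPH that is stuck rather than slowly making progress.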
Created attachment 50739 [details]
PFIFO after tropics induced hang
Created attachment 50740 [details]
PGRAPH after tropics induced hangs.
This time it hung after 4 minutes btw.
It has really worked for me for a very, very long time. Nouveau here is pretty much uncrashable.