Bug 35876 - Hard GPU hangs on NV86
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/nouveau
x86-64 (AMD64) Linux (All)
Importance: medium major
Assigned To: Nouveau Project
Depends on:
Reported: 2011-04-01 08:58 UTC by maximlevitsky
Modified: 2012-02-12 13:14 UTC (History)

git snapshot (11.62 KB, application/octet-stream)
2011-04-01 08:58 UTC, maximlevitsky
fix for stability issues on NV86 (808 bytes, patch)
2011-08-22 19:44 UTC, maximlevitsky
mmio dump of PGRAPH (6.45 KB, text/plain)
2011-08-23 18:04 UTC, maximlevitsky
another mmio dump of PGRAPH (7.26 KB, text/plain)
2011-08-25 13:41 UTC, maximlevitsky
mmio dump of PGRAPH #2 (7.22 KB, text/plain)
2011-08-25 16:55 UTC, maximlevitsky
PGRAPH mmio dump (21.76 KB, text/plain)
2011-08-26 19:57 UTC, maximlevitsky
PFIFO mmio dump (116.61 KB, text/plain)
2011-08-26 19:57 UTC, maximlevitsky
Older PFIFO dump (116.65 KB, text/plain)
2011-08-26 20:05 UTC, maximlevitsky
Another PFIFO dump (116.63 KB, text/plain)
2011-08-27 14:30 UTC, maximlevitsky
PFIFO after tropics induced hang (116.60 KB, text/plain)
2011-08-30 18:57 UTC, maximlevitsky
PGRAPH after tropics induced hangs. (26.06 KB, application/octet-stream)
2011-08-30 18:57 UTC, maximlevitsky

Description maximlevitsky 2011-04-01 08:58:47 UTC
Created attachment 45137 [details]
git snapshot

Hardware: Geforce 8400M GS - (0x086700a2)
          Acer Aspire 5720G

Userspace: Git tips of all components except Xserver.
          In attached file, you see the latest commits of each tree

Kernel drm version: dd34154f9bd75d13caefffd9d6086a1d23a45856
          drm/nouveau: use static vidshift of 2 on volt 0x30 tables

This already happened here many times.
Just 2D usage of compiz is enough.
I will eventually test if even just 2D usage can trigger this.

I get this in the kernel log:

<3>[17638.684109] [drm] nouveau 0000:01:00.0: vm flush timeout: engine 0
<3>[17644.904441] [drm] nouveau 0000:01:00.0: PRAMIN flush timeout
<6>[17645.869768] [drm] nouveau 0000:01:00.0: PFIFO_INTR 0x04000000 - Ch 2
<3>[17645.966182] [drm] nouveau 0000:01:00.0: vm flush timeout: engine 5

(Extracted using a RAM-based "black box", a fixed area in system memory that I read after a crash.)

System freezes fully.
Comment 1 maximlevitsky 2011-04-01 19:49:11 UTC
I might have been lucky with this bug, dunno.
I looked at the documentation in envytools, and it mentions that you can't do a TLB flush while PGRAPH is running on NV86.

Well, that's known, and the code contains a workaround.

What I did in addition to that is I took Martin Peres's code that suspends/resumes pgraph and hooked it into nv84_graph_tlb_flush.
No hangs yet, but that doesn't mean anything yet.

I added this code:

	nv50_vm_flush_engine(dev, 0);

So PGRAPH is paused when it is supposed to be idle.
Yet, his nv50_graph_pause often complains that:

[18931.100046] [drm] nouveau 0000:01:00.0: PGRAPH: PGRAPH paused while running a ctxprog, NV40_PGRAPH_CTXCTL_0310 = 0x11
[18937.770017] [drm] nouveau 0000:01:00.0: PGRAPH: wait for idle fail: 00000000 00000000 00000000 00000101!

Let's wait and see, but it could be that I am right and we need to pause PGRAPH more aggressively here.
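
To make the ordering concrete, here is a minimal user-space sketch of the idea: pause PGRAPH, flush engine 0, then resume. It is not the actual driver code. The `fake_*` functions and `struct fake_gpu` are hypothetical stand-ins for `nv50_graph_pause`, its resume counterpart, `nv50_vm_flush_engine` and `nv84_graph_tlb_flush`, and the simulated state only records whether the flush ran while PGRAPH was paused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated device state; a hypothetical stand-in for the real hardware. */
struct fake_gpu {
    bool pgraph_paused;
    int  flushes_while_paused;   /* flushes issued with PGRAPH paused */
    int  flushes_while_running;  /* flushes issued with PGRAPH running */
};

/* Hypothetical stand-in for nv50_graph_pause(). Always succeeds here;
 * the real helper can complain, e.g. when a ctxprog is running. */
static bool fake_graph_pause(struct fake_gpu *gpu)
{
    gpu->pgraph_paused = true;
    return true;
}

/* Hypothetical stand-in for the matching resume helper. */
static void fake_graph_resume(struct fake_gpu *gpu)
{
    gpu->pgraph_paused = false;
}

/* Stand-in for nv50_vm_flush_engine(dev, 0): records the state it saw. */
static void fake_vm_flush_engine(struct fake_gpu *gpu, int engine)
{
    (void)engine;
    if (gpu->pgraph_paused)
        gpu->flushes_while_paused++;
    else
        gpu->flushes_while_running++;
}

/* Sketch of the hooked TLB flush path: pause, flush, resume. */
static bool fake_graph_tlb_flush(struct fake_gpu *gpu)
{
    bool paused;

    assert(gpu != NULL);
    paused = fake_graph_pause(gpu);
    fake_vm_flush_engine(gpu, 0);
    if (paused)
        fake_graph_resume(gpu);
    return paused;
}
```

Calling `fake_graph_tlb_flush()` on a zero-initialized `struct fake_gpu` should leave `flushes_while_running` at 0, which is the property the real hook is meant to guarantee.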
Comment 2 Alexander Potashev 2011-05-07 06:51:44 UTC
What version of the Linux kernel do you use? I'm having the same problem with vanilla kernel v2.6.39-rc6. But there were no problems with v2.6.37 (maybe I haven't tested it enough).

I think this should be fixed in the kernel (more precisely, in the "nouveau" kernel driver).
Comment 3 maximlevitsky 2011-05-20 19:44:38 UTC
Yes, it is indeed pageflip support.
Putting 'return FALSE' in can_exchange fixes the problem; last time I must have forgotten to 'make install' or something.

However, 'Option "PageFlip" "false"' doesn't help, because even if it is set, the condition in can_exchange still allows flipping in some cases
(if nouveau_exa_pixmap_is_onscreen() == FALSE).
I'm not sure why that check is there.

@Alexander Potashev - the older kernel you describe just doesn't have the pageflipping code, so of course you don't see it there.
Comment 4 maximlevitsky 2011-05-20 19:46:05 UTC
Also note that if the game doesn't run full-screen, I can still see rare flickering with pageflip disabled (with that return FALSE).
Probably, as was suggested before, it just exposes a problem that was already there.
Comment 5 maximlevitsky 2011-05-20 19:58:00 UTC
Drat, disregard the comments, these are for another bug.
Comment 6 maximlevitsky 2011-05-21 06:47:05 UTC
And for a comment that is relevant to this bug report: I started a long endurance test using 2.6.35.
We'll see if it hangs in the same way.

I remember that long ago there were no hangs.
Comment 7 maximlevitsky 2011-08-04 17:08:30 UTC
Of course, that endurance test ended long ago and lasted maybe 2 hours. It crashes.

I will soon be tackling that at full power, as this is the last issue I have with nouveau (well, I haven't fixed power usage yet, but at least I can write the discovered registers to lower power usage here. Of course I will attempt to make them apply to other cards, etc.)
Comment 8 maximlevitsky 2011-08-12 15:24:20 UTC
I almost solved it! (trivial to fix now).

I switched my system to the nvidia ctxprog and ran the Unigine Tropics demo, which used to hang after 3~5 minutes of runtime.
Now it has already run for 40 minutes (and I will leave it for a few hours) with compiz on while I use the system, and I did an s2ram cycle with Tropics and compiz running.
No changes were made in mesa or other components, so it's for sure the ctxprog.

I did score my shot in the dark I guess...
Comment 9 maximlevitsky 2011-08-12 16:37:51 UTC
In total, Tropics was running for an hour and 30 minutes when I just shut it down.
For the last 20 minutes it ran with the card upclocked to pm level 2.

Now running Heaven benchmark.
Comment 10 maximlevitsky 2011-08-12 16:39:19 UTC
And I forgot to mention that my trace that reduces power usage was run while Tropics was running, and everything is just fine!
Comment 11 maximlevitsky 2011-08-12 17:49:47 UTC
OK, I ran Unigine Heaven for an hour and 30 minutes. I don't want to leave the system stressed and unattended. It sure works, beyond any doubt.
Comment 12 maximlevitsky 2011-08-13 17:33:32 UTC
So far, so good. No hangs, everything I throw at nouveau works.

For now I'll use this ctxprog + my power saving settings for a week or so to ensure that there are no crashes.

Then will debug the ctxprog.

So far it's just perfect.
Comment 13 maximlevitsky 2011-08-22 19:44:59 UTC
Created attachment 50470 [details] [review]
fix for stability issues on NV86

That's all. I have finished the compare process.

Like it or hate it, in the end the source of so many crashes, data losses and general suffering was found to be just one bit: a single bit that the nouveau ctxprog generator did not set in one of the ctxvals in the per-MP state.
(Well, my card has 2 MPs, so that would make that 2 bits I guess :-) )
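
For anyone who wants to repeat that kind of comparison, a minimal sketch of a word-by-word diff of two ctxval arrays follows. It is only an illustration under assumptions: the function name `ctxval_diff` is hypothetical, and it assumes both context images are available as arrays of 32-bit words of equal length. It does not encode the actual offset or bit from the attached patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare two arrays of 32-bit context values word by word.
 * Returns the total number of differing bits. If at least one word
 * differs, *first_index receives the index of the first such word and
 * *first_mask the XOR of that word pair (the bits that differ in it).
 */
static unsigned ctxval_diff(const uint32_t *a, const uint32_t *b, size_t n,
                            size_t *first_index, uint32_t *first_mask)
{
    unsigned bits = 0;
    int found = 0;

    assert(a && b && first_index && first_mask);
    for (size_t i = 0; i < n; i++) {
        uint32_t x = a[i] ^ b[i];

        if (x == 0)
            continue;
        if (!found) {
            *first_index = i;
            *first_mask = x;
            found = 1;
        }
        /* Count set bits by clearing the lowest one on each pass. */
        for (; x != 0; x &= x - 1)
            bits++;
    }
    return bits;
}
```

With two MPs, a single missing bit in the per-MP state would show up here as two differing bits in two different words.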

The attached patch was tested with a half-hour run of Unigine Tropics.
Unless you have reason to suspect that this bit can break other NV86 cards, please merge it.
Comment 14 maximlevitsky 2011-08-22 20:51:13 UTC
Well, Tropics is stable all right, but playing supertuxkart for about 5 laps got me another freeze. Well... that saga won't end.

Of course, the bit that fixes Tropics does fix it, but now it seems to be only part of the problem.

Now I am once again running closer to the nvidia ctxprog. Who knows, maybe that will help, but since STK is not automated, I can't test it properly until I implement automatic driving there, which should be very easy to do.
Comment 15 maximlevitsky 2011-08-23 18:02:52 UTC
OK, supertuxkart crashes often. I'd say often enough to justify making it automatic, so that I can run it in the background instead of playing it; playing a game too much is no fun, especially this game, which I don't like that much.

In fact, I reproduced a hang with the latest mesa, no reclocking, and the nvidia ctxprog.
(The only thing that was on is the power saving magic, but I had these crashes without it as well, so it is not likely to be the reason. I will still test without it soon.)
Comment 16 maximlevitsky 2011-08-23 18:04:04 UTC
Created attachment 50512 [details]
mmio dump of PGRAPH

This is an mmio dump of PGRAPH after the crash, in the unlikely case that it can help narrow down the problem.
Comment 17 maximlevitsky 2011-08-23 18:05:03 UTC
Also I did another dump right after this one, and the difference is:

-004008f0: 00000000 00000000 21c18e7b 00c10001
+004008f0: 00000000 00000000 21f057a8 00c10001
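
To turn a diff like this into a register address, a small helper can parse the two rows and report the first word that changed. This is a sketch under assumptions: the helper names are hypothetical, and it assumes each dump row is a base address followed by four consecutive 32-bit words (4-byte stride), which matches the format shown above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct mmio_row {
    uint32_t base;     /* address of the first word in the row */
    uint32_t word[4];  /* four consecutive 32-bit values */
};

/* Parse "004008f0: 00000000 00000000 21c18e7b 00c10001", with an
 * optional leading '-' or '+' from diff output.
 * Returns 0 on success, -1 if the line does not match. */
static int mmio_parse_row(const char *line, struct mmio_row *row)
{
    unsigned int base, w[4];

    assert(line && row);
    if (*line == '-' || *line == '+')
        line++;
    if (sscanf(line, "%x: %x %x %x %x",
               &base, &w[0], &w[1], &w[2], &w[3]) != 5)
        return -1;
    row->base = base;
    for (int i = 0; i < 4; i++)
        row->word[i] = w[i];
    return 0;
}

/* Return the address of the first word that differs between two rows
 * with the same base, or 0 if they match or are not comparable. */
static uint32_t mmio_first_diff(const struct mmio_row *a,
                                const struct mmio_row *b)
{
    if (a->base != b->base)
        return 0;
    for (int i = 0; i < 4; i++)
        if (a->word[i] != b->word[i])
            return a->base + 4u * (uint32_t)i;
    return 0;
}
```

Under that layout assumption, the two rows above differ only in the third word, i.e. at address 0x4008f8.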
Comment 18 maximlevitsky 2011-08-25 13:41:49 UTC
Created attachment 50577 [details]
another mmio dump of PGRAPH

Here is an mmio dump of PGRAPH right after the freeze, without killing any processes, which I wrongly did for the former dump.
Comment 19 maximlevitsky 2011-08-25 13:43:03 UTC
Forgot to mention that the crash is from running supertuxkart with compiz.
I somehow suspect that it's the combination that causes the crash, so I'll test without compiz as well soon.
Comment 20 maximlevitsky 2011-08-25 16:55:12 UTC
Created attachment 50580 [details]
mmio dump of PGRAPH #2

Another crash running supertuxkart without compiz, but with reclocking (and that for sure doesn't matter as it crashed many times without it).
Comment 21 maximlevitsky 2011-08-26 19:56:20 UTC
Now I was able to reproduce it without either compiz or reclocking.
It took a whole hour of automated STK run.

I attach PFIFO and PGRAPH dumps.

It looks like a PFIFO cache error.
Comment 22 maximlevitsky 2011-08-26 19:57:00 UTC
Created attachment 50607 [details]
PGRAPH mmio dump
Comment 23 maximlevitsky 2011-08-26 19:57:27 UTC
Created attachment 50608 [details]
PFIFO mmio dump
Comment 24 maximlevitsky 2011-08-26 20:04:33 UTC
In the PFIFO dump it appears that the game channel was already killed, and it tried switching to the DDX channel.
I have another, older dump of PFIFO where it's not killed, and once again there is an attempt to switch to channel #2, which is for the DDX.

In fact, the channel mapping (which just happens to be this way) is:

Without compiz:

1 - framebuffer - idle (while in X of course)
2 - xserver EXA - occasional bursts while the game runs; mostly on-screen panel updates (clock, etc.), I think
3 - xserver AIGLX - not used at all, as I don't launch such clients
4 - game

With compiz:

1 - framebuffer
2 - xserver EXA
3 - xserver AIGLX
4 - compiz - again, short bursts of activity from time to time while the game runs, probably to redraw windows that X clients (the same panel, for example) write to
5 - game

Also, channel 0 seems not to be used at all; it and channel 127 are sort of special.
Comment 25 maximlevitsky 2011-08-26 20:05:15 UTC
Created attachment 50609 [details]
Older PFIFO dump
Comment 26 maximlevitsky 2011-08-27 14:30:25 UTC
Created attachment 50621 [details]
Another PFIFO dump

This time I made sure X doesn't submit anything and thus game channel was the only channel submitting commands.
Comment 27 maximlevitsky 2011-08-30 17:48:40 UTC
Thanks to Marcin Kościelnicki, I understand now that the hung PFIFO dumps were just a consequence of an attempt to execute a 'software' command in the command stream.
These commands do cause PFIFO errors, but that is normal, as the error triggers an interrupt and the kernel driver handles the error.

In this case the software command was used to signal a pageflip, and it was issued for every full-screen 3D application.

However, I forgot to mention that I was running the kernel nouveau driver with the 'msi=1' option, which makes the hardware use message signaled interrupts; these are known to be broken on several systems because of chipset bugs or other reasons.
Because of that, in rare cases the interrupt wasn't delivered to the kernel, causing PFIFO to hang while waiting for software to service that interrupt.
When I disabled the 'msi' option, which is disabled by default, the problem was gone and I was able to test run supertuxkart for an hour and 50 minutes. I actually repeated that test a few times with different variations, including running with nouveau's ctxprog and compiz. I am quite sure (knocks wood) that the problem is gone.

A side note: I was getting hangs on one small WebGL demo, http://helloracer.com/webgl/, and I was using an older version of mesa because the latest had rendering issues (resolved now, and believed to actually be a chromium bug). That demo generated a lot of PGRAPH errors (it doesn't now), and once again, since each error triggers an interrupt, eventually the same hang would happen.

Also, running XV syncs to VBlank, and that also involves interrupts. I had several hangs in the past while using XV, so I won't be surprised if that was the cause (I had used the 'msi' option since it was introduced, and didn't expect this to happen).

That's all, and it seems (knocks wood again) that my card is finally stable.
Comment 28 maximlevitsky 2011-08-30 18:56:38 UTC
Back to the hangs caused by the Unigine demos. I reproduced that once more, without my patch of course.

I got:

[10459.780042] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
[10464.810946] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
[10469.826439] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
[10474.841967] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000
[10479.770036] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00c00603 0x0000014d 0x00005600 0x00000000

In kernel log. PFIFO appears to be idle, and PGRAPH hung, while its MPs appear active.
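
When collecting many such lines, it can help to pull the four status words out programmatically, for example to check whether they change between timeouts. A minimal sketch follows; the function name `parse_tlb_timeout` is hypothetical, and the sketch deliberately assigns no meaning to the individual words.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the four hex words that follow the timeout message.
 * Returns 0 on success, -1 if the line is not a timeout line. */
static int parse_tlb_timeout(const char *line, unsigned int status[4])
{
    static const char marker[] = "PGRAPH TLB flush idle timeout fail:";
    const char *p;

    assert(line && status);
    p = strstr(line, marker);
    if (p == NULL)
        return -1;
    p += sizeof(marker) - 1;  /* skip past the marker text */
    /* %x accepts an optional "0x" prefix on each word. */
    if (sscanf(p, " %x %x %x %x",
               &status[0], &status[1], &status[2], &status[3]) != 4)
        return -1;
    return 0;
}
```

In the log above all five lines carry identical words, so comparing the parsed arrays of consecutive lines would report no change between timeouts.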
Comment 29 maximlevitsky 2011-08-30 18:57:13 UTC
Created attachment 50739 [details]
PFIFO after tropics induced hang
Comment 30 maximlevitsky 2011-08-30 18:57:57 UTC
Created attachment 50740 [details]
PGRAPH after tropics induced hangs.

This time it hung after 4 minutes btw.
Comment 31 maximlevitsky 2012-02-12 13:14:23 UTC
It has really worked for me for a very, very long time. Nouveau here is pretty much uncrashable.