Created attachment 87485 [details]
nouveau error lines in kernel log for three incidents
This is taking my bug usptream from the distro level bug I first filed at
I've encountered several system crashes in relation to nouveau and graphics work lately. Most notably three crashes in the past few days, with kernel version 3.11.4-gentoo. lspci reports the graphics card as
VGA compatible controller: NVIDIA Corporation G84 [GeForce 8400 GS] (rev a1)
I'm attaching an abridged version of the log files first, for a quick glance. It is the output of “grep -h -C1 -E '\] nouveau [^ 0]|Linux version'” applied to the currently available kernel logs. I can see no obious pattern there. I've read that you prefer untruncated dmesg output, but since this issue causes my system to reboot, calling dmesg is not an option, and since I'm reporting several incidents, this might be a quick starting point.
The crashes themselves aren't reported as Oops, or maybe they are but the file system is already unavailable by that time, so I don't notice. Unfortunately I don't have a serial terminal attached for logging at all times.
I'll attach the unabridged logs shortly, but I can't even know whether these successfully flushed all their data. I'll give a few details on each of these incidents with the corresponding unabridged log.
Created attachment 87486 [details]
Oct 08 incident
I don't recall the exact circumstances of this Oct 08 incident. First and last line of that error:
nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 2 [X] get 0x002002965c put 0x002002ae2c ib_get 0x00000015 ib_put 0x0000002f state 0x8000e684 (err: INVALID_CMD) push 0x00406040
nouveau E[ PGRAPH][0000:01:00.0] ch 2 [0x000fb33000 X] subc 7 class 0x8297 mthd 0x1358 data 0x00004303
Created attachment 87487 [details]
Oct 10 incident
The Oct 10 incident was while I was using Stellarium in windowed mode, and Firefox displaying and downsizing a pretty large JPEG of the orion nebula from Wikipedia: https://upload.wikimedia.org/wikipedia/commons/0/07/Orion_Nebula_-_Hubble_2006_mosaic.jpg. So lots of graphics activity, and a good deal of 3D rendering on the part of Stellarium.
nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 4 [stellarium] get 0x002016a6bc put 0x002017c714 ib_get 0x00000201 ib_put 0x00000202 state 0x80000000 (err: INVALID_CMD) push 0x00406040
nouveau E[ PGRAPH][0000:01:00.0] ch 2 [0x000fb33000 X] subc 7 class 0x8297 mthd 0x1288 data 0xffffffff
After that crash, I rebooted and just restored my Firefox session, without opening Stellarium. It crashed again. I'm really surprised that I can find no evidence of this boot in my logs. Strange.
Created attachment 87488 [details]
Oct 11 incident
The Oct 11 incident was mostly Firefox opening a large number of PDF documents with massive use of (rather low-res) graphics. Probably kept a number of them buffered somewhere? I'll quote the first and the last TWO lines, since the penultimate one seems similar to the last one of the previous crashes.
nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 2 [X] get 0x002002a90c put 0x002002be40 ib_get 0x0000012b ib_put 0x00000141 state 0x80004861 (err: INVALID_CMD) push 0x00406040
nouveau E[ PGRAPH][0000:01:00.0] ch 2 [0x000fb33000 X] subc 0 class 0x5039 mthd 0x0328 data 0x00000000
nouveau E[ PFB][0000:01:00.0] trapped read at 0x0022733000 on channel 0x0000fb33 [X] PGRAPH/DISPATCH/M2M_IN reason: PAGE_NOT_PRESENT
I noticed (after reading http://nouveau.freedesktop.org/wiki/Bugs/ the second time) that I forgot about the various version numbers besides kernel version.
Created attachment 87490 [details]
Oct 10 incident reproduced
(In reply to comment #2)
> Oct 10 incident
I was able to reproduce this incident, again with stellarium and that big Wikipedia image in Firefox. I had netconsole reporting to a different computer, with “dmesg -n 8”, which is where the attached log came from.
1. I didn't close all non-essential aspects of my session. Should have…
2. Opened the image in Firefox
3. Started stellarium
4. Reduced stellarium to windowed mode
5. Switched to firefox
6. Zoomed into image (i.e. 100% instead of fit window)
7. Scrolled around a bit
8. Might have zoomed back out, not sure
9. Noticed the mouse cursor movement to freeze for a moment
10. Received the attached error message on the netconsole receiver
11. System seemed responsive again, i.e. mouse cursor moved fluently
12. Tried to switch to firefox
13. System crashed without further output to netconsole
I guess that the freezing of the mouse cursor might have actually been due to the time it took to send those log messages. I.e. when logging to disk, that time might have been shorter. In any case, I wasn't aware that the actual problem might have occurred before the eventual crash the first time around.
The commonality here is that in all cases you get
[22145.017855] nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 4 [stellarium] get 0x0020049408 put 0x0020053fcc ib_get 0x00000349 ib_put 0x00000352 state 0x80007641 (err: INVALID_CMD) push 0x00406040
But 0x00406040 is a perfectly valid command, ever since the Riva TNT chips. It should decode as
method = 0x10
subchan = 3
count = 0x10
This makes sense. These 4 methods deal with semaphore manipulation.
<reg32 offset="0x10" name="SEMAPHORE_ADDRESS_HIGH" variants="NV84-"/>
<reg32 offset="0x14" name="SEMAPHORE_ADDRESS_LOW" variants="NV84-"/>
<reg32 offset="0x18" name="SEMAPHORE_SEQUENCE" variants="NV84-"/>
<reg32 offset="0x1c" name="SEMAPHORE_TRIGGER" variants="NV84-">
which happens a fair bit for CPU vs GPU syncs.
Not sure whether this will be of any use, but I'm trying to think of ways which might cause this, in my own pretty naive way of thinking about hardware drivers and graphics cards. Simply ignore dumb ideas. I don't need answers to these questions, as long as you know that the answers exists and don't help us here.
(In reply to comment #6)
> But 0x00406040 is a perfectly valid command, ever since the Riva TNT chips.
Could it be that the command is invalid AT THAT TIME? Perhaps due to a race condition? Could it be that the semaphore is already high?
> happens a fair bit for CPU vs GPU syncs.
Therefore the command likely won't fail all the time. The load pattern must be part of the issue.
Do things like stack overflows occur in GPU memory? Could excessive allocation of e.g. texture memory somehow clobber memory needed by GPU firmware for methods such like this?
What other resources might become exhausted by concurrent access from graphics-intensive applications? Would that exhaustion be reported in some way?
Of course, actual hardware breakage is always an option. But the thing doesn't seem random enough to suggest a simple hardware defect as the main cause, imho. Of course it would be nice if I had another card with the same kind of chips to plug into my system, just to rule out that alternative…
I must confess, I cannot follow how you decode 0x00406040 to method = 0x10. The way I read BEGIN_NV04, I can find your subc = 0, and size = 0x10, but I'd read mthd = 0x40 directly from the least significant bits. Did I miss something?
Right you are! Method = 0x40. Which doesn't correspond to a valid subchan method. But in that case, I'd expect to see a INVALID_MTHD on the next word -- the command itself is still valid. (Just looking at the pseudo-code at https://github.com/envytools/envytools/blob/master/hwdocs/fifo/dma-pusher.rst -- perhaps it's inaccurate.)
So... the mystery is where does this 0x40 method come from. There are a lot of potential culprites. Given that it's happening both to X and to some non-X process that just uses mesa, I'm going to place the blame on libdrm or the kernel... but that seems a bit far-fetched too. Dunno.
As for the memory management issues, of course, anything is possible. But it shouldn't happen :)
(In reply to comment #8)
> Right you are! Method = 0x40.
Glad my comment was not completely useless… :)
> So... the mystery is where does this 0x40 method come from.
Can we add a bit of debug code somewhere, to detect that specific value when it enters the ring, and emit useful information like a kernel stack trace? I don't mind recompiling my kernel to test stuff, and it seems chances of reproducing this are surprisingly high, but I haven't written any kernel code myself so far.
Created attachment 87553 [details] [review]
report 0x00406040 value in nouveau_bo_wr32
(In reply to comment #9)
> (In reply to comment #8)
> > So... the mystery is where does this 0x40 method come from.
> Can we add a bit of debug code somewhere, to detect that specific value when
> it enters the ring, and emit useful information like a kernel stack trace?
Gave this a stab, using a BUG_ON conditional the way this attachment describes it. Unless static function scope variables don't work the same in kernel space and in unser space, this should report if, BEGIN_NV04, OUT_RING or some other command was used to write this value.
Result: I managed to get that first DMA_PUSHER line all by itself:
nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 4 [stellarium] get 0x0020111a6c put 0x0020116e10 ib_get 0x00000070 ib_put 0x00000097 state 0x80007641 (err: INVALID_CMD) push 0x00406040
With no stack trace or other indication that my BUG_ON got triggered. Also with no subsequent error messages, or system crash.
So I'd conclude that the line does not neccessarily lead to a crash, and that the critical value does not get written via nouveau_bo_wr32. Where else could it come from? What is the most central location where I might put such code?
Perhaps I can interest you in WARN_ON_ONCE.
Your code should cover all in-kernel cases, but doesn't being to touch the user-submitted code. Take a look at nouveau_gem_pushbuf_validate for that -- it presently doesn't do any actual data validation. You could also do the check in nv50_dma_push. Basically the main ring buffer just contains pointers to "subrings" (for your generation of card).
[As an aside, there are a few other bugs that show attachments that include the same 0x00406040 thing... So you're not the only one seeing it.]
(In reply to comment #11)
> Perhaps I can interest you in WARN_ON_ONCE.
That is very useful, thanks a lot.
So far I left the BUG_ON in place, assuming that it wouldn't trigger in any case. I was wrong: I just got a kernel BUG report from what seems to be the BUG_ON I added.
[45018.412278] ------------[ cut here ]------------
[45018.416902] kernel BUG at drivers/gpu/drm/nouveau/nouveau_bo.c:465!
[45018.423162] invalid opcode: 0000 [#1] PREEMPT SMP
[45018.428001] Modules linked in: nls_cp850 vfat fat usb_storage tun autofs4 ipv6 btrfs xor zlib_deflate raid6_pq libcrc32c dm_mod fuse nfs lockd nf_conntrack_h323 nf_conntrack_sip nf_conntrack_irc nf_conntrack_ftp nf_conntrack uhci_hcd sunrpc loop nouveau usbhid snd_hda_codec_via snd_hda_intel ohci_pci snd_hda_codec ohci_hcd snd_hwdep snd_bt87x ehci_pci ehci_hcd snd_pcm video usbcore sr_mod cdrom mxm_wmi i2c_algo_bit kvm_amd kvm ttm drm_kms_helper snd_page_alloc drm microcode k10temp pcspkr evdev snd_timer snd ata_generic i2c_core r8169 asus_atk0110 sym53c8xx parport_pc scsi_transport_spi backlight mii parport wmi button usb_common pata_atiixp soundcore acpi_cpufreq mperf
[45018.488491] CPU: 0 PID: 2834 Comm: X Not tainted 3.11.4-gentoo #1
[45018.494579] Hardware name: System manufacturer System Product Name/M4A785TD-V EVO, BIOS 2105 07/23/2010
[45018.504216] task: ffff8803ebc09910 ti: ffff8803e7dc4000 task.ti: ffff8803e7dc4000
[45018.511689] RIP: 0010:[<ffffffffa038d7fb>] [<ffffffffa038d7fb>] nouveau_bo_wr32+0x4b/0x50 [nouveau]
[45018.520850] RSP: 0018:ffff8803e7dc5bd0 EFLAGS: 00010246
[45018.526157] RAX: 0000000000000000 RBX: ffff8803ec2d8780 RCX: 0000000000000000
[45018.533284] RDX: 0000000000406040 RSI: ffffc9001019f934 RDI: 0000000000000001
[45018.540409] RBP: 0000000000000000 R08: ffffc90010196000 R09: ffffc90010196000
[45018.547535] R10: 0000000000000000 R11: 0000000000000000 R12: 000000002001a020
[45018.554660] R13: 0000000000406040 R14: ffff8802933b2280 R15: 0000000000000000
[45018.561787] FS: 00007f9d09b26880(0000) GS:ffff8803ffc00000(0000) knlGS:00000000f6da2b90
[45018.569865] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[45018.575607] CR2: 00007ff639a65000 CR3: 00000003ec68f000 CR4: 00000000000007f0
[45018.584742] ffffffffa039343a ffffffffa038ce76 ffff8803e934c6c0 ffff8803e97bde00
[45018.592196] 0000000000000000 ffff8803e7dc5d18 ffffffffa038a02d ffff88015ebe80c0
[45018.599644] ffff8803ec2d8780 ffff8803ec2d8780 ffff8803e7dc5d18 0000000000000000
[45018.607098] Call Trace:
[45018.609550] [<ffffffffa039343a>] ? nv84_fence_emit32+0xda/0x1a0 [nouveau]
[45018.616424] [<ffffffffa038ce76>] ? nouveau_bo_placement_set+0x76/0x130 [nouveau]
[45018.623904] [<ffffffffa038a02d>] ? nouveau_fence_emit+0x3d/0xb0 [nouveau]
[45018.630780] [<ffffffffa038a8c4>] ? nouveau_fence_new+0x64/0xb0 [nouveau]
[45018.637568] [<ffffffffa0389898>] ? nv50_dma_push+0xc8/0xf0 [nouveau]
[45018.644012] [<ffffffffa038fdbb>] ? nouveau_gem_ioctl_pushbuf+0x35b/0x12f0 [nouveau]
[45018.651746] [<ffffffff811166f0>] ? __pollwait+0x110/0x110
[45018.657232] [<ffffffffa00ed105>] ? drm_ioctl+0x4b5/0x5b0 [drm]
[45018.663156] [<ffffffffa038fa60>] ? nouveau_gem_ioctl_new+0x1b0/0x1b0 [nouveau]
[45018.670456] [<ffffffff8111587b>] ? do_vfs_ioctl+0x8b/0x510
[45018.676025] [<ffffffff81104b45>] ? vfs_read+0x165/0x190
[45018.681333] [<ffffffff81115da0>] ? SyS_ioctl+0xa0/0xc0
[45018.686553] [<ffffffff813d7212>] ? system_call_fastpath+0x16/0x1b
[45018.692725] Code: 85 c0 75 04 89 16 c3 90 89 d7 66 0f 1f 44 00 00 e9 bb c8 e7 e0 0f b7 0d b4 30 06 00 8d 79 01 66 85 c9 66 89 3d a7 30 06 00 75 d5 <0f> 0b 0f 1f 00 41 57 41 bf ff 07 00 00 41 56 41 55 41 54 55 53
[45018.712668] RIP [<ffffffffa038d7fb>] nouveau_bo_wr32+0x4b/0x50 [nouveau]
[45018.719482] RSP <ffff8803e7dc5bd0>
[45018.735366] ---[ end trace 706c9cb9b21fa0ba ]---
[45033.732881] nouveau E[ X] failed to idle channel 0xcccc0000 [X]
[45048.725980] nouveau E[ X] failed to idle channel 0xcccc0000 [X]
I compared the machine code to a disassembly of nouveau_bo_wr32, and this is indeed in the code path conditioned by a comparison with 0x406040, even though that comparison itself is not among the dumped bytes.
This time, I recall no extraordinary graphics workload. Machine was mostly idle, with display in power save mode. Didn't wake up from that, though, and didn't react to NumLock either. I managed to ssh into the machine and save a dmesg before rebooting. So no automatic reboot this time, which might be because the problematic value didn't proceed down the pipe.
Does the stack trace provide any insight into what might be going on here? Does it tell us whether the bug is in kernel space or in user space?
(In reply to comment #11)
> Take a look at nouveau_gem_pushbuf_validate for that --
> it presently doesn't do any actual data validation.
Does that mean any unprivileged process with access to the video device can send garbage to the GPU and crash the system?
> You could also do the check in nv50_dma_push.
Had a look at that, and didn't understand what kind of data to inspect. But now it seems like this would pass through my bug reporting facility in any case. Unless the thing I reported was a false alarm, and would have been interpreted as something other than a command.
Can you figure out exactly which write this is pointing at? For some of the values, it could be perfectly legitimate to have the 0x406040 value (e.g. the sequence id, or the low bits of the address).
(Load up nouveau.ko in gdb, and run "disassemble nv84_fence_emit32". Note that gdb emits offsets in decimal as opposed to the above offsets which are in hex, for maximal confusion.)
Ideally we'd have some sort of (optional) command stream validator that made sure that there was no funny business going on.
My point about nv50_dma_push is that this is the path that user-submitted command streams take; they don't go through nouveau_bo_wr*, they are written by userspace via mmap'd sections, and then commands are written to the ring to jump to them.
(In reply to comment #13)
> Can you figure out exactly which write this is pointing at? For some of the
> values, it could be perfectly legitimate to have the 0x406040 value (e.g.
> the sequence id, or the low bits of the address).
It's the sequence id. The inlined OUT_RING function makes line numbers almost unusable, but I see the right shift for upper_32_bits, and it's the third nouveau_bo_wr32 call after that. False alarm?
> My point about nv50_dma_push is that this is the path that user-submitted
> command streams take; they don't go through nouveau_bo_wr*, they are written
> by userspace via mmap'd sections, and then commands are written to the ring
> to jump to them.
I see. And after all, I got the error message in comment #10 without triggering the BUG_ON, so there has to be a way around this check.
So what exactly do I examine in nv50_dma_push? I see arguments delta and length, and the delta is used to compute some offset. But all of these are declared as integers of various sizes and signedness, not as pointers. Should I cast one or the other to void* and iterate over length 32bit words? Or length/4 32bit words?
I'm seeing this bug a lot. I think it's the same bug, but it rarely shows in any of my logs, because the machine locks. It can happen with web browsing, movie watching or just about anything. Since I'm getting these crashes at least once a day on my main computer, I'm very much interested in solving the problem.
What I sometimes get are (the numbers vary, except for the last one):
kernel: nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 2 [X] get 0x002002b03c put 0x002002cca0 ib_get 0x0000015d ib_put 0x00000162 state 0x80004861 (err: INVALID_CMD) push 0x00406040
I'm seeing this on two different machines.
Machine 1: Gentoo with "NVIDIA NV96"
Machine 2: Fedora with "NVIDIA NV84"
I can't do any major debugging, but if there's anything simple I can do to help, please let me know.
Created attachment 95315 [details]
Mar 05 incident
My system keeps crashing on me, and quite often this DMA_PUSHER INVALID_CMD is the last line in ther kernel log. I also had one X lockup, where I could ssh into the machine and see a message in the log as well. Trying to reboot the system I got a number of "failed to idle channel" messages I'm attaching the log for that.
Today I got a crash without log message, but perhaps the machine was simply unable to flush the log. I triggered that by launching flightgear. Since that's the first time I launched flightgear, I guess this bug here makes that package unusable for me, which is very sad.
I'm still willing to help fix this, but I'll need instructions as I wrote in comment #14. If there is any other way to help, let me know.
3.13 and 3.14 are _much better_ wrt NV84.
Had some third party module incompatibilities with 3.13, but after finding suitable patches, I now upgraded to 3.13.6 and things look well so far. Which is to say, flightgear started several times without a single crash so far. Still see some lines about nouveau in dmesg, but haven't detected any adverse consequences so far. Nevertheless, here are the lines in question:
[25251.856093] nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 2 [X] get 0x00200111e4 put 0x0020011204 ib_get 0x000002e8 ib_put 0x0000035d state 0x80004861 (err: INVALID_CMD) push 0x00406040
[28962.591861] nouveau E[ PFIFO][0000:01:00.0] DMA_PUSHER - ch 2 [X] get 0x0020027e98 put 0x0020028470 ib_get 0x00000185 ib_put 0x000001ab state 0xc000ef05 (err: MEM_FAULT) push 0x00406040
[28962.592096] nouveau E[ PFB][0000:01:00.0] trapped read at 0xfffffffffc on channel 0x0000fcb0 [unknown] PFIFO/PFIFO_READ/PUSHBUF reason: PT_NOT_PRESENT
So the first line, the INVALID_CMD, is still there occasionally. At least once in my current session.
Just got flightgear (version 3 now) crash on me again, unfortunately without anything saved to the log on disk. My kernel is 3.13.7 at the moment; 3.14.0 on Gentoo still lacks support by the zfs module, but since I don't need that for anything essential, I might give it a try now nevertheless.
I also often experience my UI freezing for perhaps 10 seconds or so, but recovering after that. Apparently I haven't saved the dmesg for one such incident, but it was something about invalid map address if I remember correctly. Will try to save that next time around.
So on the whole, 3.13 is more stable for my NV84 than 3.12, but still less stable than older kernels were for a long time. 3.14 to be tested next.
flightgear 3 crashed my 3.14.0 as well, but I can't reproduce now with netconsole set up. I hate those problems which occur often enough to be annoying but not often enough to be easily reproducible!
*** Bug 72978 has been marked as a duplicate of this bug. ***
*** Bug 79532 has been marked as a duplicate of this bug. ***
This can reproduce on PPC32 machines easily:
Not sure if this is the same issue, since my system stays alive (only the java process terminates) and the command is 00400040 in my case:
nouveau 0000:01:00.0: gr: DATA_ERROR 0000000d [BEGIN_END_ACTIVE]
nouveau 0000:01:00.0: gr: 00100000  ch 7 [003f7c3000 java] subc 3 class 8597 mthd 12e4 data 00000000
nouveau 0000:01:00.0: gr: DATA_ERROR 0000000d [BEGIN_END_ACTIVE]
nouveau 0000:01:00.0: gr: 00100000  ch 7 [003f7c3000 java] subc 3 class 8597 mthd 1510 data 00000000
nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 7 [java] get 0020198ff8 put 0020213390 ib_get 000002ae ib_put 000002b5 state 80007641 (err: INVALID_CMD) push 00400040
nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 7 [java] get 002001aff4 put 0020094fd4 ib_get 000002b1 ib_put 000002b5 state 80007641 (err: INVALID_CMD) push 00400040
nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 7 [java] get 002001aff4 put 0020094fd4 ib_get 000002b4 ib_put 000002b5 state 80007641 (err: INVALID_CMD) push 00400040
nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 7 [java] get 0020115b4c put 0020193068 ib_get 000002b7 ib_put 000002b8 state 80007ac8 (err: INVALID_CMD) push 00400040
nouveau 0000:01:00.0: gr: DATA_ERROR 00000005 [INVALID_ENUM]
nouveau 0000:01:00.0: gr: 00100000  ch 7 [003f7c3000 java] subc 3 class 8597 mthd 1ac4 data 06360150
nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 7 [java] get 00201c0fa8 put 0020214e24 ib_get 000002b9 ib_put 000002ba state 80007641 (err: INVALID_CMD) push 00400040
nouveau 0000:01:00.0: java: failed to idle channel 7 [java]
Very easily reproduceable with sweethome3d.
Carsten, I have the same "00400040" code in my error log:
and it can be reproduced every time by playing youtube videos in the epiphany web browser (version 3.26).
Similar problem and symptoms here, X and keyboard freeze ssh is working @ NVIDIA GT218 (GeForce 210). I have two invalid commands however (00502031 and 00400040)
nouveau 0000:03:00.0: fifo: DMA_PUSHER - ch 3 [systemd-logind] get 0020026b94 put 002002accc ib_get 000002fd ib_put 000002fe state 80004861 (err: INVALID_CMD) push 00502031
nouveau 0000:03:00.0: fifo: DMA_PUSHER - ch 3 [systemd-logind] get 002002accc put 002002accc ib_get 000002fd ib_put 000002fe state 80000000 (err: INVALID_CMD) push 00400040