Created attachment 29769 [details]
lspci -vv output for the video card
When, e.g., moving the cursor over the KDE menu or moving a window, X hangs and starts using 100% CPU. Sometimes it even hangs before KDE has finished loading. It pretty much hangs in under a minute if I do anything at all.
The cursor still moves, but nothing else works.
More specific info, as obtained from "gdb --pid `pidof X`", follows.
The hung X sometimes shuts down after I detach from it; the odds seem to rise the more I have used stepi in gdb.
Backtrace: (gdb's parameter string expansions removed, as they contained garbage. They are visible in the attached gdb-full.log, though)
#0 0x00007fc4c3af5127 in ioctl () from /lib/libc.so.6
#1 0x00007fc4c2d8bb26 in drmIoctl (fd=9, request=1074291845, arg=0x7fffec89f5c0) at xf86drm.c:188
#2 0x00007fc4c2d8bd3f in drmCommandWrite (fd=9, drmCommandIndex=<value optimized out>, data=0x7fffec89f5c0, size=18446744073709551615) at xf86drm.c:2402
#3 0x00007fc4c2930011 in nouveau_bo_wait (bo=0x2397790, cpu_write=0, no_wait=0, no_block=0) at nouveau_bo.c:399
#4 0x00007fc4c29301d5 in nouveau_bo_map_range (bo=0x2397790, delta=0, size=<value optimized out>, flags=0) at nouveau_bo.c:442
#5 0x00007fc4c2b43ce5 in NVAccelDownloadM2MF (pspix=0x2614d70, x=<value optimized out>, y=0, w=156, h=102,
dst=0x2614db0, dst_pitch=624) at nouveau_exa.c:125
#6 0x00007fc4c2b44d1e in nouveau_exa_download_from_screen (pspix=0x2614d70, x=0, y=0, w=156, h=102,
dst=0x2614db0, dst_pitch=624) at nouveau_exa.c:480
#7 0x00007fc4c10d271a in exaCopyDirty (migrate=0x7fffec89f950, pValidDst=0x2517ee8, pValidSrc=<value optimized out>, transfer=0x7fc4c2b44ca7 <nouveau_exa_download_from_screen>,
fallback_src=0x7fc4b9000b00 <Address 0x7fc4b9000b00 out of bounds>,
#8 0x00007fc4c10d2a69 in exaDoMoveOutPixmap (migrate=0x7fffec89f950) at exa_migration.c:256
#9 0x00007fc4c10d30b5 in exaDoMigration (pixmaps=0x7fffec89f950, npixmaps=1, can_accel=0) at exa_migration.c:677
#10 0x00007fc4c10cf329 in exaGetImage (pDrawable=0x2614d70, x=0, y=0, w=156, h=102, format=2, planeMask=4294967295, d=0x2624660) at exa_accel.c:1331
#11 0x00000000004cc141 in miSpriteGetImage (pDrawable=0x2614d70, sx=0, sy=0, w=156, h=102, format=2, planemask=4294967295, pdstLine=0x2624660) at misprite.c:281
#12 0x0000000000446116 in ProcGetImage (client=0x2528750) at dispatch.c:2067
#13 0x0000000000447c4a in Dispatch () at dispatch.c:454
#14 0x000000000043069d in main (argc=9, argv=0x7fffec89fc78, envp=<value optimized out>) at main.c:438
I've gotten roughly the above backtrace multiple times; the last few functions are always the same:
nouveau_bo_wait() (was "?? ()" before I compiled everything with -ggdb, but probably the same)
Right after these in the stack trace, I have so far seen NVAccelDownloadM2MF and NVAccelUploadM2MF. (Again, before the recompile with -ggdb it was always "?? ()", so I might have missed some callers, but they are probably mostly the same functions.)
If I step through the code one instruction at a time with 'stepi', execution loops over these:
ioctl () from /lib/libc.so.6
0x7fc4c3af5127 <ioctl+7>: cmp $0xfffffffffffff001,%rax
0x4df396 <SmartScheduleTimer>: mov 0x2b47a3(%rip),%rax # 0x793b40 <_DYNAMIC+3496>
0x4df39d <SmartScheduleTimer+7>: mov 0x2b4b54(%rip),%rdx # 0x793ef8 <_DYNAMIC+4448>
0x4df3a4 <SmartScheduleTimer+14>: mov (%rdx),%rdx
0x4df3a7 <SmartScheduleTimer+17>: add %rdx,(%rax)
0x4df3aa <SmartScheduleTimer+20>: retq
<signal handler called>
0x7fc4c5b8ea10 <__restore_rt>: mov $0xf,%rax
0x7fc4c5b8ea17 <__restore_rt+7>: syscall
So the instruction shown in ioctl() (the cmp at ioctl+7) never seems to get executed. The looped assembly lines have so far been the same every time I have looked at X with gdb after a hang.
I am using 64-bit Gentoo; the git revisions of the installed packages are:
As I said, it's very easy to trigger the hang, so on the positive side, I'm able to test patches quickly (as soon as I figure out how to tell emerge to use them :)
Created attachment 29770 [details]
xorg.conf in use
Created attachment 29771 [details]
Contains markers added afterwards so it can be seen at which point the messages occur.
For example "[* 91.26 ] *** Notice: after startx ***" means that startx has been run before that point
Created attachment 29772 [details]
The log file stopped changing before the hang happened
Created attachment 29773 [details]
log of gdb session
Probably not much use, but attaching for completeness :)
Contains backtrace and register values during the stepping
The trigger of this problem seems to be the PFIFO_DMA_PUSHER interrupt, after which the channel is stuck. The spinning afterwards is just a side effect: it looks like the kernel is always returning EAGAIN (right?) from the DRM_NOUVEAU_GEM_CPU_PREP ioctl, and user space does not know to time out (and it probably does not need to, since we spin in user space and the process can be killed).
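To illustrate the spinning described above: if a call keeps failing with EAGAIN and the caller simply reissues it on EINTR/EAGAIN, the loop never exits. The following is only a minimal sketch in C of such a retry loop with a cap added. The names retry_bounded, attempt_fn and always_eagain are hypothetical and made up for this example; this is not the libdrm or nouveau code, and whether a timeout is desirable here is a separate question.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for one ioctl() attempt:
 * returns 0 on success, or -1 with errno set on failure. */
typedef int (*attempt_fn)(void *ctx);

/* Retry while the attempt fails with EINTR or EAGAIN, but give up after
 * max_attempts tries instead of spinning forever.
 * Returns 0 on success, or -1 with errno set: ETIMEDOUT when the retry
 * budget is exhausted, otherwise the errno left by the failed attempt. */
int retry_bounded(attempt_fn attempt, void *ctx, unsigned max_attempts)
{
    assert(attempt != NULL);

    for (unsigned i = 0; i < max_attempts; i++) {
        if (attempt(ctx) == 0)
            return 0;
        if (errno != EINTR && errno != EAGAIN)
            return -1; /* a real error: report it unchanged */
    }
    errno = ETIMEDOUT;
    return -1;
}

/* Stand-in for a wait on a stuck channel (an assumption for this sketch):
 * it always reports EAGAIN and counts how often it was called. */
int always_eagain(void *ctx)
{
    unsigned *calls = ctx;
    (*calls)++;
    errno = EAGAIN;
    return -1;
}
```

Calling retry_bounded(always_eagain, &calls, 5) with calls initialised to 0 returns -1 with errno set to ETIMEDOUT after five attempts. The same loop without the cap would keep a CPU busy indefinitely, which matches the 100% CPU symptom reported here.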
We'd have to find out why the command stream is garbled; that's what a pusher interrupt means AFAIK: bad command packet format. The randomness sounds like a race.
I hope this is (not?) just another manifestation of the kernel memory problems we've been seeing recently on certain setups. Unfortunately I don't have any further insight into this right now.
I decided to test if the current nouveau works for me - and it does, perfectly.
It is much lighter on the processor than the nvidia driver when watching videos.
Not sure if it was my hardware, the kernel, nouveau, or something else, but if I had to pick, I'd say hardware. At one point I decreased the memory speed in the BIOS after noticing that the motherboard (which supports DDR2-667) had apparently decided to run the memory at 800MHz, which sounds like the kind of thing that could have caused the whole problem. When I was first trying to get nouveau to work, I hadn't even thought of looking there, since the problem was dependent on the software I was running.
All in all, one more happy user of nouveau.
Many thanks to all the developers for an excellent driver.