24092 – X with nouveau hangs in nouveau_bo_map_range when doing anything

Summary: X with nouveau hangs in nouveau_bo_map_range when doing anything

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/nouveau
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Nouveau Project
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2009-09-22 12:31 UTC by Aleksi Torhamo
Modified:	2010-10-02 19:39 UTC
CC List:	2 users

Attachments
lspci -vv output for the video card (2.16 KB, text/plain) 2009-09-22 12:31 UTC, Aleksi Torhamo	no flags
xorg.conf in use (601 bytes, text/plain) 2009-09-22 12:32 UTC, Aleksi Torhamo	no flags
dmesg output (36.92 KB, text/plain) 2009-09-22 12:39 UTC, Aleksi Torhamo	no flags
Xorg log (25.13 KB, text/plain) 2009-09-22 12:44 UTC, Aleksi Torhamo	no flags
log of gdb session (34.13 KB, text/plain) 2009-09-22 12:46 UTC, Aleksi Torhamo	no flags

Description Aleksi Torhamo 2009-09-22 12:31:24 UTC

Created attachment 29769
lspci -vv output for the video card

When, e.g., moving the cursor over a KDE menu or moving a window, X hangs and starts using 100% CPU. Sometimes it hangs even before KDE has finished loading. It hangs in under a minute if I do anything at all.
The cursor still moves, but nothing else works.

More specific info, as obtained from "gdb --pid `pidof X`", follows.
The hung X sometimes shuts down after I detach from it; the odds seem to rise the more I have used stepi in gdb.

Backtrace: (gdb's parameter string expansions removed, as they contained garbage. They are visible in the attached gdb-full.log, though)
#0  0x00007fc4c3af5127 in ioctl () from /lib/libc.so.6
#1  0x00007fc4c2d8bb26 in drmIoctl (fd=9, request=1074291845, arg=0x7fffec89f5c0) at xf86drm.c:188
#2  0x00007fc4c2d8bd3f in drmCommandWrite (fd=9, drmCommandIndex=<value optimized out>, data=0x7fffec89f5c0, size=18446744073709551615) at xf86drm.c:2402
#3  0x00007fc4c2930011 in nouveau_bo_wait (bo=0x2397790, cpu_write=0, no_wait=0, no_block=0) at nouveau_bo.c:399
#4  0x00007fc4c29301d5 in nouveau_bo_map_range (bo=0x2397790, delta=0, size=<value optimized out>, flags=0) at nouveau_bo.c:442
#5  0x00007fc4c2b43ce5 in NVAccelDownloadM2MF (pspix=0x2614d70, x=<value optimized out>, y=0, w=156, h=102,
    dst=0x2614db0, dst_pitch=624) at nouveau_exa.c:125
#6  0x00007fc4c2b44d1e in nouveau_exa_download_from_screen (pspix=0x2614d70, x=0, y=0, w=156, h=102,
    dst=0x2614db0, dst_pitch=624) at nouveau_exa.c:480
#7  0x00007fc4c10d271a in exaCopyDirty (migrate=0x7fffec89f950, pValidDst=0x2517ee8, pValidSrc=<value optimized out>, transfer=0x7fc4c2b44ca7 <nouveau_exa_download_from_screen>,
    fallback_src=0x7fc4b9000b00 <Address 0x7fc4b9000b00 out of bounds>,
--
#8  0x00007fc4c10d2a69 in exaDoMoveOutPixmap (migrate=0x7fffec89f950) at exa_migration.c:256
#9  0x00007fc4c10d30b5 in exaDoMigration (pixmaps=0x7fffec89f950, npixmaps=1, can_accel=0) at exa_migration.c:677
#10 0x00007fc4c10cf329 in exaGetImage (pDrawable=0x2614d70, x=0, y=0, w=156, h=102, format=2, planeMask=4294967295, d=0x2624660) at exa_accel.c:1331
#11 0x00000000004cc141 in miSpriteGetImage (pDrawable=0x2614d70, sx=0, sy=0, w=156, h=102, format=2, planemask=4294967295, pdstLine=0x2624660) at misprite.c:281
#12 0x0000000000446116 in ProcGetImage (client=0x2528750) at dispatch.c:2067
#13 0x0000000000447c4a in Dispatch () at dispatch.c:454
#14 0x000000000043069d in main (argc=9, argv=0x7fffec89fc78, envp=<value optimized out>) at main.c:438

I've gotten roughly the above backtrace multiple times; the innermost frames are always the same:
ioctl()
drmIoctl()
drmCommandWrite()
nouveau_bo_wait() (was "?? ()" before I compiled everything with -ggdb, but probably the same)
nouveau_bo_map_range()

As the immediate caller of these in the stack trace, I have seen NVAccelDownloadM2MF and NVAccelUploadM2MF so far. (Again, before the recompile with -ggdb it was always "?? ()", so I might have missed some callers, but they were probably mostly the same functions.)
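
As a cross-check on which ioctl is being repeated, the request value shown in frame #1 (request=1074291845) can be split into its fields. The following is only a sketch: it assumes the usual Linux ioctl number layout on x86-64 and libdrm's DRM_COMMAND_BASE of 0x40; the struct and function names are made up for illustration and are not part of libdrm or nouveau.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of a Linux ioctl request number on x86-64:
 * nr in bits 0-7, type in bits 8-15, size in bits 16-29,
 * direction in bits 30-31. */
struct ioctl_fields {          /* hypothetical helper type */
    unsigned nr;               /* command number */
    unsigned type;             /* subsystem "magic" character */
    unsigned size;             /* size of the argument struct in bytes */
    unsigned dir;              /* 0 none, 1 write, 2 read, 3 read+write */
};

static struct ioctl_fields decode_ioctl(uint32_t request)
{
    struct ioctl_fields f;
    f.nr   = request & 0xff;
    f.type = (request >> 8) & 0xff;
    f.size = (request >> 16) & 0x3fff;
    f.dir  = (request >> 30) & 0x3;
    return f;
}

/* Driver-private DRM commands are numbered from DRM_COMMAND_BASE upwards. */
#define DRM_COMMAND_BASE 0x40

/* Returns the driver-private command index for a decoded DRM ioctl. */
static unsigned drm_driver_command(struct ioctl_fields f)
{
    assert(f.type == 'd');             /* 'd' is the DRM ioctl type */
    assert(f.nr >= DRM_COMMAND_BASE);  /* only driver-private commands */
    return f.nr - DRM_COMMAND_BASE;
}
```

Calling decode_ioctl(1074291845) gives type 'd', direction "write" (matching drmCommandWrite in frame #2), an 8-byte argument, and driver command index 0x45. Which nouveau command that index names depends on the nouveau_drm.h of the revisions listed below, which is not reproduced here.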

If I step the code one instruction at a time with 'stepi', execution loops over these:

ioctl () from /lib/libc.so.6
0x7fc4c3af5127 <ioctl+7>:       cmp    $0xfffffffffffff001,%rax

utils.c
0x4df396 <SmartScheduleTimer>:  mov    0x2b47a3(%rip),%rax        # 0x793b40 <_DYNAMIC+3496>
0x4df39d <SmartScheduleTimer+7>:        mov    0x2b4b54(%rip),%rdx        # 0x793ef8 <_DYNAMIC+4448>
0x4df3a4 <SmartScheduleTimer+14>:       mov    (%rdx),%rdx
0x4df3a7 <SmartScheduleTimer+17>:       add    %rdx,(%rax)
0x4df3aa <SmartScheduleTimer+20>:       retq

<signal handler called>
0x7fc4c5b8ea10 <__restore_rt>:  mov    $0xf,%rax
0x7fc4c5b8ea17 <__restore_rt+7>:        syscall

So the line in ioctl() never seems to get executed. The looped assembly lines have so far been the same every time I have looked at X with gdb after a hang.

I am using 64-bit Gentoo; the git revisions of the installed packages are:
xf86-video-nouveau df94ebdbcd89c1678ac243217e7f5b20cbbe857c
nouveau-drm 3d6747a2b1576782fe74975a353b356cfc936505
libdrm ac71f0849928f4b2fbb69c01304ac6f9df8916a1

As I said, it's very easy to trigger the hang, so on the positive side, I'm able to test patches quickly (as soon as I figure out how to tell emerge to use them :)

Comment 1 Aleksi Torhamo 2009-09-22 12:32:43 UTC

Created attachment 29770
xorg.conf in use

Comment 2 Aleksi Torhamo 2009-09-22 12:39:49 UTC

Created attachment 29771
dmesg output

Contains markers added afterwards so it can be seen at which point the messages occur.
For example "[*  91.26    ] *** Notice: after startx ***" means that startx has been run before that point

Comment 3 Aleksi Torhamo 2009-09-22 12:44:59 UTC

Created attachment 29772
Xorg log

The log file stopped changing before the hang happened

Comment 4 Aleksi Torhamo 2009-09-22 12:46:49 UTC

Created attachment 29773
log of gdb session

Probably not much use, but attaching for completeness :)
Contains backtrace and register values during the stepping

Comment 5 Pekka Paalanen 2009-09-23 10:21:34 UTC

The trigger of this problem seems to be the PFIFO_DMA_PUSHER interrupt, after which the channel is stuck. The spinning afterwards is just a side effect: it looks like the kernel is always returning EAGAIN (right?) from the DRM_NOUVEAU_GEM_CPU_PREP ioctl, and user space does not know to time out (and it probably does not need to, since we spin in user space and can kill it).

We'd have to find out why the command stream is garbled; AFAIK that is what a pusher interrupt means: a bad command packet format. The randomness sounds like a race.

I hope this is (not?) just another manifestation of the kernel memory problems we've been seeing recently on certain setups. Unfortunately I don't have any further insight into this right now.
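
For illustration, the spin described above matches a retry loop that repeats a call for as long as it fails with EAGAIN or EINTR. Below is a minimal sketch of a bounded variant that gives up after a fixed number of attempts. It is not libdrm's or nouveau's actual code: op_fn, retry_bounded and fake_busy_op are hypothetical names, and the callback stands in for a real ioctl(fd, request, arg) call.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback standing in for a real ioctl() call. */
typedef int (*op_fn)(void *ctx);

/* Call op until it succeeds or fails with an error other than
 * EINTR/EAGAIN, but at most max_tries times. When the attempts are
 * exhausted, return -1 with errno set to ETIMEDOUT instead of
 * spinning forever. */
static int retry_bounded(op_fn op, void *ctx, int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        int ret = op(ctx);
        if (ret != -1 || (errno != EINTR && errno != EAGAIN))
            return ret;   /* success, or an error we must not retry */
    }
    errno = ETIMEDOUT;
    return -1;
}

/* Fake operation for demonstration: reports EAGAIN until the counter
 * it is given reaches zero, then succeeds. */
static int fake_busy_op(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining > 0) {
        (*remaining)--;
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

With a counter of 3 and max_tries of 10, retry_bounded(fake_busy_op, &counter, 10) returns 0; with a counter far larger than max_tries it returns -1 with errno set to ETIMEDOUT. Whether a timeout is actually desirable here is a separate question, since, as noted above, a process spinning in user space can still be killed.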

Comment 6 Aleksi Torhamo 2010-10-02 19:39:41 UTC

I decided to test if the current nouveau works for me - and it does, perfectly.
It is much lighter on the processor when watching videos than the nvidia driver.

Not sure if it was my hardware, the kernel, nouveau, or something else, but if I had to pick, I'd say hardware. At one point I decreased the memory speed in the BIOS after noticing that the motherboard (which supports DDR2-667) had apparently decided to run the memory at 800MHz, which sounds like the kind of thing that could have caused the whole problem. When I was first trying to get nouveau to work, I hadn't even thought of looking there, since the problem depended on the software I was running.

All in all, one more happy user of nouveau.
Many thanks to all the developers for an excellent driver.