Summary: | Radeon + DRI on r300: X goes 99.9% CPU
---|---
Product: | DRI
Component: | General
Status: | RESOLVED FIXED
Severity: | critical
Priority: | high
Version: | XOrg git
Hardware: | All
OS: | Linux (All)
Reporter: | Doron <doron.fediuck>
Assignee: | Default DRI bug account <dri-devel>
CC: | dusanc
Description
Doron
2008-08-11 02:42:52 UTC
Created attachment 18215 [details]
Configuration file.
The relevant device for my layout is with Identifier "Alone".
Created attachment 18216 [details]
X Log File.
Does it also happen without Option "DynamicClocks"?

Hi Michel,
Thanks for the quick response.
Sadly, yes. I just commented out DynamicClocks (the default is off), and the same behavior occurred.

Hi,
Almost a month has passed with no comments... Any chance of fixing this bug?
How can I help?

Thanks,
Doron.

(In reply to comment #5)
> Hi,
> Almost a month has passed with no comments... Any chance of fixing this bug?
> How can I help?
>
> Thanks,
> Doron.

Did you try with EXA instead of XAA? (Add 'Option "AccelMethod" "exa"' to xorg.conf)

Thanks Giacomo,
But no change: same behavior.
Also, I moved to the new xorg-server, and it remains the same. I'll attach the latest version's log.

Version details:
doronf ~ # Xorg -version
X.Org X Server 1.5.0
Release Date:
X Protocol Version 11, Revision 0
Build Operating System: Linux 2.6.25-gentoo-r7 i686
Current Operating System: Linux doronf 2.6.25-gentoo-r7 #6 PREEMPT Fri Aug 15 19:13:18 IDT 2008 i686
Build Date: 07 September 2008 10:43:07AM

I hope someone can give me a hand here, since this will help many r300 users.

Thanks,
Doron

Created attachment 18719 [details]
X 1.5.0 log file
(In reply to comment #7)
> I hope someone can give me a hand here,
> since this will help many r300 users.

How do you know that? This is the only report I've seen of a problem like this.

Does the problem also occur with the radeon kernel module from drm Git instead of from the Linux kernel?

BTW, you don't need to disable acceleration completely to disable the DRI; you can use Option "DRI" "off". Does that still avoid the problem?

(In reply to comment #9)
Hi Michel,

> (In reply to comment #7)
> > I hope someone can give me a hand here,
> > since this will help many r300 users.
>
> How do you know that? This is the only report I've seen of a problem like this.

I have seen many users with this issue. Not all of them know how to pinpoint the problem, so they open various issues in places like X mailing lists, distro forums, etc. You can just google for it and see:
http://www.google.com/search?hl=en&q=X+hang+%2Bcpu+r300&btnG=Search
Since it looks like a busy loop (which causes X to go to 100% CPU), breaking this loop will help others as well.

> Does the problem also occur with the radeon kernel module from drm Git instead
> of from the Linux kernel?

No, I haven't tried that. This is a company laptop and I can't afford to harm it with bleeding-edge sources. Sorry... If there's a "safe" branch I may give it a go, but I haven't seen such a branch so far.

> BTW, you don't need to disable acceleration completely to disable the DRI, you
> can use Option "DRI" "off". Does that still avoid the problem?

I'll try that, thanks!

Doron

(In reply to comment #10)
> Since it looks like a busy loop (which causes X to go 100%), breaking
> this loop will help others as well.

Those are typical symptoms of a GPU lockup, which can be caused by any number of different things. The usual causes result in the hang after e.g. running certain 3D applications or on X server startup; it's rare on VT switches.

BTW, were the log files captured after reproducing the problem? If not, please attach one that was.
Also, which version of xf86-video-ati is this? If it's older than 6.9.0 or at least 6.8.0, please try a newer one.

(In reply to comment #11)
Dear Michel,

> Those are typical symptoms of a GPU lockup, which can be caused by any number
> of different things. The usual causes result in the hang after e.g. running
> certain 3D applications or on X server startup, it's rare on VT switches.

I'm sorry, from a programmer's point of view it behaves like it's in a loop. I'm always ready to learn ;)
I just want to add that I do not need to run a heavy-duty 3D application; I only need to turn off the xdm service. Then, from the console, I run startx and switch back to the console. When I try to switch back to X (VT 7), X hangs. The same behavior, of course, occurs when I start xdm (KDE), log in and switch to VT 1: X hangs the minute I try to go back to X.

> BTW, were the log files captured after reproducing the problem? If not, please
> attach one that was. Also, which version of xf86-video-ati is this? If it's
> older than 6.9.0 or at least 6.8.0, please try a newer one.

All logs were captured while X was hanging. I have no choice when it hangs, so I reboot the machine. These are the logs that X produced. I checked the .xsession-errors file, but there is nothing significant there.
As for xf86-video-ati, I'm using version 6.9.0.
Since I'm using Gentoo, I can recompile anything with the relevant USE flag. So if you have a debug USE flag somewhere, I can turn it on. Just tell me how to help.

Thanks again!
Doron.

(In reply to comment #12)
> I'm sorry, from a programmer's point of view it behaves like it's in a loop.

It is in a loop, waiting for the GPU to finish processing the commands emitted to it previously, but that never happens because the GPU is locked up. The loop is just a symptom of the actual problem.

Hi Michel,
We may have some progress due to your DRI suggestion...
I commented out NoAccel, and added DRI false.
For some strange reason, I still had acceleration (moving windows didn't flicker as I'm used to...). Also, the hang behavior remained, i.e. switching VTs caused X to go to 99.9% CPU.
I double-checked the X log: you can see the DRI is off, and a Mesa driver is being used for GLX.
So this leaves me confused: is Mesa causing this behavior?

I'm attaching the relevant log.

Thanks,
Doron

Created attachment 18737 [details]
X with Accel on, DRI off log file
(In reply to comment #14)
> For some strange reason, I had acceleration (window moving didn't flicker as
> I'm used to...).

Nothing strange, you only disabled the DRI, not all acceleration.

> Also the hang behavior maintained. IE- switching vt's caused X to
> go 99.9% CPU.

Hmm, then I'm not sure anymore this really is a GPU lockup... it would be interesting if when the X server is hanging, you could log in via ssh, attach gdb to the X server process and attach the output of 'bt full'. It'll be much more useful if the X server binaries have debugging symbols.

(In reply to comment #16)
> Hmm, then I'm not sure anymore this really is a GPU lockup... it would be
> interesting if when the X server is hanging, you could log in via ssh, attach
> gdb to the X server process and attach the output of 'bt full'. It'll be much
> more useful if the X server binaries have debugging symbols.

Michel,
I'm not sure gdb will attach to the X process since it's consuming most of the CPU... But I can give it a go.
As for the symbols, I found that xorg-server has a debug USE flag I can turn on and recompile.
Any other binaries which should be recompiled in debug mode ? (ie- xf86-video-ati, etc.)

(In reply to comment #17)
> Any other binaries which should be recompiled in debug mode ? (ie-
> xf86-video-ati, etc.)

Yeah, xserver and xf86-video-ati for starters.

OK, here's the gdb output and then my explanations:
====================================================
Continuing.

Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0xb7a796c0 (LWP 7591)]
0xb7f26424 in __kernel_vsyscall ()
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0xb7f26424 in __kernel_vsyscall ()
Continuing.

Program received signal SIGINT, Interrupt.
0xb7f26424 in __kernel_vsyscall ()
#0  0xb7f26424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7b34fe9 in ioctl () from /lib/libc.so.6
No symbol table info available.
#2  0xb79d6a71 in drmCommandNone () from /usr/lib/libdrm.so.2
No symbol table info available.
#3  0x00006444 in ?? ()
No symbol table info available.
#4  0x00000000 in ?? ()
No symbol table info available
====================================================

Explanations:
1. I recompiled X and x11-drivers with the debug USE flag, but it looks like a lot of symbol information is still missing. I'm not sure which packages are responsible.
2. I tried following your scenario, i.e. attaching gdb to X when it hangs, but as I expected, gdb fails to attach. So instead I attached gdb before the X hang and then switched VTs to trigger the hang.
3. I had to use ctrl+c in order to make gdb stop and give me a prompt.
4. I used the original configuration, i.e. including DRI.

That's it. If there's anything I can do better just let me know.

Doron.

I'd like to report that I had pretty similar problems.
I had them with kernels 2.6.25 and 2.6.26, but they happen randomly: I use some 3D app (tremulous, flightgear), and some time after I finish, the system becomes unresponsive. Only the mouse moves, there is no keyboard, and when I log in through nxserver on that machine everything works fine, except that X consumes 100% on a dual-core CPU. Only a reboot solves it. It happened to me with both EXA and XAA.
The GPU is a Radeon X800AIW.
There's nothing in the log files.
I'm using:

X -version
X Window System Version 1.3.0

dmesg|grep "drm"
[drm] Initialized drm 1.1.0 20060810
[drm] Initialized radeon 1.29.0 20080528 on minor 0

In xorg.conf:
    Identifier "Card0"
    Driver "radeon"
    VendorName "ATI Technologies Inc"
    BoardName "R430 [Radeon X800 XL] (PCIe)"
    BusID "PCI:1:0:0"
    Option "EnablePageFlip" "on"
    Option "ColorTiling" "1"
    # Option "AccelMethod" "EXA"
    Option "AccelDFS" "1"

I had these same symptoms much earlier, at the time of the 2.6.23 kernel, but those freezes happened while playing 3D games (ET:RTCW), not after.
Please ask if you need more info.
Have a nice day

OK, better luck this time...
I read about disabling stripping and about some compiler flags, and re-compiled X, DRM and xf86-video-ati (again...).
I also found the way to turn on the debug of the drm module.
So here it is, this is what gdb shows while X hangs:
(I caused the SIGINT with ctrl+c in order to see where X hangs...)
==============================================================================
Continuing.

Program received signal SIGINT, Interrupt.
0xb7f84424 in __kernel_vsyscall ()
#0  0xb7f84424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7b92fe9 in ioctl () from /lib/libc.so.6
No symbol table info available.
#2  0xb7a34af2 in drmCommandNone (fd=10, drmCommandIndex=4) at xf86drm.c:2247
No locals.
#3  0xb79d2534 in RADEONWaitForIdleCP (pScrn=0xa126fd0) at radeon_commonfuncs.c:697
        _ret = <value optimized out>
        ret = -16
        info = (RADEONInfoPtr) 0xa129f08
        i = 186
        __FUNCTION__ = "RADEONWaitForIdleCP"
#4  0xb7a162d3 in RADEONSyncCP (pScreen=0xa130048, marker=2462) at radeon_exa_funcs.c:80
        pScrn = (ScrnInfoPtr) 0xa126fd0
#5  0xb78ab803 in exaWaitSync (pScreen=0xa130048) at exa.c:1036
No locals.
#6  0xb78acaab in ExaDoPrepareAccess (pDrawable=0xa2b35f8, index=0) at exa.c:495
        pExaScr = (ExaScreenPrivPtr) 0xa12f008
        pPixmap = (PixmapPtr) 0xa17dff8
        offscreen = 1
#7  0xb78acbab in exaPrepareAccessReg (pDrawable=0xa2b35f8, index=0, pReg=0xa17e0cc) at exa.c:520
        pixmaps = {{as_dst = 1, as_src = 0, pPix = 0xa17dff8, pReg = 0xa17e0cc}}
#8  0xb78ad0a1 in exaImageGlyphBlt (pDrawable=0xa2b35f8, pGC=0xa32d320, x=2, y=947, nglyph=6, ppciInit=0xa326f18, pglyphBase=0x0) at exa_accel.c:895
        pPriv = (FbGCPrivPtr) 0xa329a44
        ppci = <value optimized out>
        pci = <value optimized out>
        pglyph = <value optimized out>
        gWidth = <value optimized out>
        gHeight = <value optimized out>
        opaque = <value optimized out>
        gx = <value optimized out>
        gy = <value optimized out>
        glyph = (void (*)(FbBits *, FbStride, int, FbStip *, FbBits, int, int)) 0xb78bdba0 <fbGlyph32>
        dst = <value optimized out>
        dstStride = <value optimized out>
        dstBpp = <value optimized out>
        dstXoff = <value optimized out>
        dstYoff = <value optimized out>
        depthMask = <value optimized out>
        pPixmap = (PixmapPtr) 0xa17dff8
        pending_damage = (RegionPtr) 0xa17e0cc
        xoff = 0
        yoff = 0
#9  0x0816e0e6 in damageText (pDrawable=0xa2b35f8, pGC=0xa32d320, x=2, y=13, count=6, chars=0xa2aaa78 "", fontEncoding=TwoD16Bit, textType=3) at damage.c:1466
        info = <value optimized out>
        i = 6
        n = 6
        w = 0
        imageblt = 1
#10 0x0816e18d in damageImageText16 (pDrawable=0xa2b35f8, pGC=0xa32d320, x=2, y=13, count=6, chars=0xa2aaa78) at damage.c:1547
        pGCPriv = (DamageGCPrivPtr) 0xa32d38c
        oldFuncs = (GCFuncs *) 0x81c99c0
#11 0x0808ba1b in doImageText (client=0xa2a4148, c=0xbf99d940) at dixfonts.c:1561
        err = <value optimized out>
        lgerr = <value optimized out>
        fpe = <value optimized out>
#12 0x0808bbb4 in ImageText (client=0xa2a4148, pDraw=0xa2b35f8, pGC=0x6444, nChars=6, data=0xa2aaa78 "", xorg=2, yorg=13, reqType=<value optimized out>, did=10485784) at dixfonts.c:1612
        local_closure = {client = 0xa2a4148, pDraw = 0xa2b35f8, pGC = 0xa32d320, nChars = 6 '\006', data = 0xa2aaa78 "", xorg = 2, yorg = 13, reqType = 77 'M', imageText = 0x816e0f0 <damageImageText16>, itemSize = 2, did = 10485784, slept = 0}
#13 0x08086803 in ProcImageText16 (client=0xa2a4148) at dispatch.c:2231
        err = -16
        pDraw = (DrawablePtr) 0x6444
        pGC = (GC *) 0x0
#14 0x08089144 in Dispatch () at dispatch.c:454
        result = <value optimized out>
        client = <value optimized out>
        nready = 0
        start_tick = 27860
#15 0x0806f98b in main (argc=9, argv=0xbf99db04, envp=Cannot access memory at address 0x644c
) at main.c:441
        pScreen = <value optimized out>
        i = 1
        error = 134673718
        xauthfile = <value optimized out>
        alwaysCheckForInput = {0, 1}
The program is running.  Quit anyway (and detach it)?
(y or n) Detaching from program: /usr/bin/X, process 9236
==============================================================================

As for the drm kernel module, these are the debug messages I got. The same messages repeat many times in a loop:
==============================================================================
Sep 9 10:56:23 doronf [drm:drm_unlocked_ioctl] pid=9236, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
Sep 9 10:56:23 doronf [drm:radeon_cp_idle]
Sep 9 10:56:23 doronf [drm:radeon_do_cp_idle]
Sep 9 10:56:23 doronf [drm:drm_unlocked_ioctl] ret = -16
Sep 9 10:56:23 doronf [drm:drm_unlocked_ioctl] pid=9236, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
Sep 9 10:56:23 doronf [drm:radeon_cp_idle]
Sep 9 10:56:23 doronf [drm:radeon_do_cp_idle]
Sep 9 10:56:23 doronf [drm:drm_unlocked_ioctl] ret = -16
Sep 9 10:56:23 doronf [drm:drm_unlocked_ioctl] pid=9236, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
Sep 9 10:56:23 doronf [drm:radeon_cp_idle]
==============================================================================
(Here I rebooted).

How can we proceed from here?

Thanks!
Doron

(In reply to comment #21)
> #3 0xb79d2534 in RADEONWaitForIdleCP (pScrn=0xa126fd0) at
> radeon_commonfuncs.c:697

Okay, so this does look like a GPU lockup after all. Can you get a backtrace with the DRI disabled as well? GPU lockups are unusual with the DRI disabled. One more idea though - does Option "RenderAccel" "off" avoid the problem?

BTW, as I requested, please attach backtraces (and generally larger bits of information) instead of cluttering up the comments with them.

(In reply to comment #20)
> I'd like to report that I had pretty similar problems.
> Had them with kernels 2.6.25 and .26, but they happen randomly, when I use some
> 3D app (tremulous, flightgear) and when I finish after some time [...]

Did you read comment #11 and comment #13?
It's not the eventual symptoms that matter (they tend to be the same or similar) but what triggers them - VT switching from console back to X for this bug report.

(In reply to comment #22)
> Okay, so this does look like a GPU lockup after all. Can you get a backtrace
> with the DRI disabled as well? GPU lockups are unusual with the DRI disabled.
> One more idea though - does Option "RenderAccel" "off" avoid the problem?

I tried several times to reproduce it with DRI off. The problem is much more severe: once X hangs, gdb hangs as well, including the ssh session... At this stage I can only press the power button and wait...
I need some help to try and set a breakpoint just before the hang occurs. Can you give me a function name or similar to use as a breakpoint before the VT switch?

As for RenderAccel, I'll give it a go in the next session (with the breakpoint), since I want to cold boot as little as possible. I'm afraid I'll harm the hard disk or something else...

> BTW, as I requested, please attach backtraces (and generally larger bits of
> information) instead of cluttering up the comments with them.

My apologies. Accepted starting now.

(In reply to comment #24)
> I tried several times repeating it with DRI off. The problem is much more
> intense. ie- Once X hangs, gdb hangs as well, including the ssh session...

Sounds like there may even be two separate problems with the DRI enabled or disabled.

> Can you give me a function name or similar to use as a break point before vt
> switch ?

RADEONEnterVT is the driver function called when switching from console to X.

Hi Michel.
I tried RenderAccel off, but there was no real change.
So I commented out NoAccel, and added DRI off.
I'm attaching the gdb log. I used RADEONEnterVT as a breakpoint.
It looks like I managed to capture the loop, but I may be wrong, since it may be a loop over icons.
Can you have a look and see?

Doron

Created attachment 18782 [details]
gdb tracing with break point.
(In reply to comment #26)
> Can you have a look and see ?

I've never seen this kind of gdb output, and looking at it I can't help but feel like I'm looking for a needle in a haystack.

If you tell gdb to 'finish' at the RADEONEnterVT breakpoint with the DRI disabled, do you get a gdb prompt back or does it hang before that? If the former, the problem is somewhere outside that function.

(In reply to comment #28)
> If you tell gdb to 'finish' at the RADEONEnterVT breakpoint with the DRI
> disabled, do you get a gdb prompt back or does it hang before that? If the
> former, the problem is somewhere outside that function.

Well, I actually get this break twice: the first time when I switch to VT1 (always OK), and then when I try to switch back to VT7 (which will hang).
I did what you suggested, and on the 2nd break I got a gdb prompt. So I typed finish and got the following:

Program received signal SIGUSR1, User defined signal 1.
0xb7f47424 in __kernel_vsyscall ()
Run till exit from #0 0xb7f47424 in __kernel_vsyscall ()
0xb7b5695d in select () from /lib/libc.so.6

At this point the screen was black and I got another prompt, so I quit gdb. X returned (and I quickly turned it off to avoid hangs...)
What do you make of it?

(In reply to comment #29)
> Program received signal SIGUSR1, User defined signal 1.
> 0xb7f47424 in __kernel_vsyscall ()
> Run till exit from #0 0xb7f47424 in __kernel_vsyscall ()
> 0xb7b5695d in select () from /lib/libc.so.6

That's not the RADEONEnterVT breakpoint but a SIGUSR1, which is part of any VT switch. You can tell gdb not to stop on SIGUSR1 using 'handle SIGUSR1 nostop'.

Created attachment 18801 [details]
gdb avoiding normal switch, using finish
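
Editor's sketch: the gdb steps discussed in the comments above (ignore the SIGUSR1 that accompanies every VT switch, optionally break on RADEONEnterVT, then collect 'bt full') could be scripted roughly as follows. This is only an illustration under stated assumptions: the helper name `build_gdb_command`, its `break_at` parameter and the choice of gdb's batch mode are hypothetical and were not used in this thread, and PID 9236 is simply the X server PID that appears in the earlier backtrace. The gdb options `-p`, `-batch` and `-ex` and the commands `handle`, `break`, `continue` and `bt full` are standard gdb.

```python
import shlex


def build_gdb_command(pid, break_at=None):
    """Build an argv list that attaches gdb to `pid` in batch mode.

    Hypothetical helper: it only assembles the command line and does not
    run gdb. `break_at` is an optional function name to stop at before
    the backtrace is taken.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")

    # SIGUSR1 is delivered on every VT switch; pass it through silently.
    gdb_commands = ["handle SIGUSR1 nostop noprint pass"]
    if break_at is not None:
        # Stop at the given function, then let the server run until then.
        gdb_commands += [f"break {break_at}", "continue"]
    # Full backtrace including local variables.
    gdb_commands.append("bt full")

    argv = ["gdb", "-p", str(pid), "-batch"]
    for command in gdb_commands:
        argv += ["-ex", command]
    return argv


if __name__ == "__main__":
    # Example PID taken from the backtrace earlier in this report.
    print(shlex.join(build_gdb_command(9236, break_at="RADEONEnterVT")))
```

Redirecting the output of the printed command to a file gives something that can be attached to the bug instead of being pasted into a comment. Whether the breakpoint resolves depends on the driver module being loaded with debugging symbols, which this sketch does not check.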
Hi Michel,
OK, using the handle command I got to the breakpoint.
I used finish 7 times, and X hung!
Can you see something meaningful here?
Doron.
(In reply to comment #31)
> Can you see something meaningful here ?

I'm afraid not - Dispatch() is the main protocol request processing function, so it could still be pretty much anything; we've only ruled out that the problem is triggered from RADEONEnterVT directly.

Does the problem also happen with a 'naked' X server (without any clients)?

Other than that I'm running out of ideas, maybe someone else can chime in...

(In reply to comment #32)
> Does the problem also happen with a 'naked' X server (without any clients)?

No. I can switch out of and into X with no issues.

> Other than that I'm running out of ideas, maybe someone else can chime in...

I can only hope...

(In reply to comment #33)
> > Does the problem also happen with a 'naked' X server (without any clients)?
> No. I can switch out of and into X with no issues.

Interesting - so the problem could be related to a specific acceleration primitive which isn't hit by the root weave. With XAA, you can disable specific acceleration primitives using Option "XaaNo..." documented in the xorg.conf manpage. Basically, for each primitive listed after the log line

(II) RADEON(0): Using XFree86 Acceleration Architecture (XAA)

try the corresponding XaaNo... option and see if you can find a single such option which avoids the problem.

Some good news!
Today I emerged the latest released drm: x11-base/x11-drm-20080710.
So far I've been using x11-base/x11-drm-20071019.
I'm very happy to say the problem is gone.
I want to test it for a day or so and then I'll close this bug.
In the meantime, cross your fingers ;)

Doron.

Hi Michel and others,
Everything is up and running. I even managed to run compiz-fusion, sleep and resume. All fine.
Here's a list of relevant configuration for future reference:

The main packages' working versions:
    x11-base/xorg-server-1.5.0
    x11-base/x11-drm-20080710
    x11-apps/mesa-progs-7.1
    x11-libs/libdrm-2.3.1
    x11-drivers/xf86-video-ati-6.9.0
    x11-drivers/xf86-input-keyboard-1.3.1
    x11-drivers/xf86-input-mouse-1.3.0

Device settings (in xorg.conf):
    Identifier "Alone"
    Driver "radeon"
    VendorName "ATI Technologies Inc"
    BoardName "M24GL [Mobility FireGL V3200] rev 128"
    #Option "NoAccel" # [<bool>]
    Option "MonitorLayout" "AUTO,NONE" # [<str>]
    #Option "DynamicClocks" "on" # [<bool>]
    Option "AccelMethod" "EXA" #"XAA"
    #Option "AccelDFS" "1"
    BusID "PCI:1:0:0"

I hope others will benefit from this as well.
Thanks a lot for all the help!

Doron.
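
Editor's sketch: the DRM debug messages quoted earlier in this report (cmd=0x6444, nr=0x44, ret = -16) can be decoded to show why they fit the "waiting for an idle GPU" explanation given in the thread. The snippet below is a minimal illustration, not part of the original report. It assumes the generic Linux ioctl number layout (as on the i686 system here), and the function names `decode_ioctl`, `driver_command_index` and `errno_name` are hypothetical helpers; the constants (type byte 'd', driver command base 0x40) follow the kernel's DRM headers.

```python
import errno

DRM_IOCTL_TYPE = "d"       # ioctl "type" byte shared by DRM ioctls (0x64)
DRM_COMMAND_BASE = 0x40    # first driver-specific DRM ioctl number


def decode_ioctl(cmd):
    """Split a Linux ioctl request number into its generic bit fields."""
    return {
        "nr": cmd & 0xFF,                  # bits 0-7: command number
        "type": chr((cmd >> 8) & 0xFF),    # bits 8-15: subsystem type byte
        "size": (cmd >> 16) & 0x3FFF,      # bits 16-29: argument size
        "direction": (cmd >> 30) & 0x3,    # bits 30-31: 0 means no argument
    }


def driver_command_index(cmd):
    """Return the driver-specific index of a DRM ioctl, or None if not one.

    Simplification: the upper bound of the driver-specific range is not
    checked here.
    """
    fields = decode_ioctl(cmd)
    if fields["type"] != DRM_IOCTL_TYPE or fields["nr"] < DRM_COMMAND_BASE:
        return None
    return fields["nr"] - DRM_COMMAND_BASE


def errno_name(ret):
    """Map a negative kernel return value such as -16 to its errno name."""
    return errno.errorcode.get(-ret, "unknown")


if __name__ == "__main__":
    print(decode_ioctl(0x6444))
    print(driver_command_index(0x6444))
    print(errno_name(-16))
```

For the values in the log, this yields type 'd', number 0x44 with no argument, driver index 4 (agreeing with drmCommandIndex=4 in frame #2 of the backtrace and with the radeon_cp_idle messages), and EBUSY for -16. In other words, the X server keeps asking the kernel whether the command processor is idle and keeps being told it is busy.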