Summary: | reproducible GPU ioctl lockup OR cpu spin (if compiz ON/OFF) with -radeon and r300-r500 cards | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | xorg | Reporter: | martin <mnemo> | ||||||||
Component: | Driver/Radeon | Assignee: | Xorg Project Team <xorg-team> | ||||||||
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> | ||||||||
Severity: | critical | ||||||||||
Priority: | medium | CC: | michel.brabants, oyvind | ||||||||
Version: | 7.4 (2008.09) | ||||||||||
Hardware: | Other | ||||||||||
OS: | Linux (All) | ||||||||||
Description
martin
2009-05-06 08:22:48 UTC
Created attachment 25555
gdb trace show xorg CPU spin caused by visiting website mundoplus.tv
If I turn off compiz and open that URL in firefox xorg still locks up but instead of being stuck permanently blocking in drmIoctl() it goes into a CPU spin.
At this time the backtrace is essentially:
#1 0xb7d35ea9 in ioctl () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7b30a6d in drmDMA () from /usr/lib/libdrm.so.2
#3 0xb7aa1948 in RADEONCPGetBuffer (pScrn=0x9e575c8) at ../../src/radeon_accel.c:651
#4 0xb7af60fb in RADEONPrepareSolidCP (pPix=0xa508620, alu=3, pm=4294967295, fg=0) at ../../src/radeon_exa_funcs.c:92
#5 0xb78ce96a in exaFillRegionSolid (pDrawable=0xa508620, pRegion=0xa4fef40, pixel=0, planemask=4294967295, alu=<value optimized out>)
at ../../exa/exa_accel.c:939
#6 0xb78d0312 in exaPolyFillRect (pDrawable=0xa508620, pGC=0xa0ba0f8, nrect=1, prect=0xa4865cc) at ../../exa/exa_accel.c:751
#7 0x08180b94 in damagePolyFillRect (pDrawable=0xa508620, pGC=0xa0ba0f8, nRects=1, pRects=0xa4865cc) at ../../../miext/damage/damage.c:1404
#8 0x0808a4f0 in ProcPolyFillRectangle (client=0xa4e4008) at ../../dix/dispatch.c:1769
#9 0x0808d57f in Dispatch () at ../../dix/dispatch.c:437
If I put breakpoints on the three topmost stack frames, I see ioctl() and drmDMA() being hit constantly, but the breakpoint on RADEONCPGetBuffer() is never hit, so I don't think that function ever exits.
Breakpoint 1, 0xb7d35e90 in ioctl () from /lib/tls/i686/cmov/libc.so.6
Continuing.
Breakpoint 2, 0xb7b309f5 in drmDMA () from /usr/lib/libdrm.so.2
Continuing.
Breakpoint 1, 0xb7d35e90 in ioctl () from /lib/tls/i686/cmov/libc.so.6
Continuing.
Breakpoint 2, 0xb7b309f5 in drmDMA () from /usr/lib/libdrm.so.2
Continuing.
etc etc
I'm attaching a full gdb log showing this trace.
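A minimal sketch of the breakpoint pattern described above, assuming the buffer getter is entered once and then retries the DMA request internally while the GPU never frees a buffer. This is not the driver's code: `fake_drm_dma`, `get_buffer_sketch`, `fake_dma_calls` and `max_attempts` are hypothetical names used only for illustration, and the attempt cap exists only so the demo terminates.

```c
#include <errno.h>

/* Hypothetical counter: how many times the DMA request was issued. */
static unsigned long fake_dma_calls = 0;

/* Hypothetical stand-in for drmDMA(): if the GPU is locked up, no buffer
 * is ever freed, so every request reports -EBUSY. */
static int fake_drm_dma(int gpu_locked_up)
{
    fake_dma_calls++;
    return gpu_locked_up ? -EBUSY : 0;
}

/* Hypothetical buffer getter: called once, then loops internally.
 * A breakpoint on this function fires only on entry, while breakpoints
 * on the inner call fire on every retry. */
static int get_buffer_sketch(int gpu_locked_up, unsigned long max_attempts)
{
    unsigned long i;

    for (i = 0; i < max_attempts; i++) {
        if (fake_drm_dma(gpu_locked_up) == 0)
            return 0;       /* a buffer became available */
    }
    return -EBUSY;          /* demo-only bail-out */
}
```

Calling `get_buffer_sketch(1, 1000)` enters the outer function once but issues 1000 inner requests, which is the same shape as seeing ioctl() and drmDMA() hit repeatedly while the RADEONCPGetBuffer() breakpoint stays silent.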
We're getting _a lot_ of duplicate bug reports (and confirmations) for this bug in the downstream bug tracker: around 10 bug reports, each with 1-4 users confirming the problem. Hardware affected (all confirmed using a live CD, i.e. no configuration problems on the machines):
01:00.0 VGA compat: ATI Technologies Inc RV516 [Mobility Radeon X1350]
01:00.0 VGA compat [0300]: ATI Technologies Inc RV350 AP [Radeon 9600] [1002:4150]
01:00.0 VGA compat [0300]: ATI Technologies Inc Radeon R350 [Radeon 9800 Pro] [1002:4e48]
01:00.0 VGA compat [0300]: ATI Technologies Inc Radeon Mobility X1400 [1002:7145]
01:00.0 VGA compat [0300]: ATI Technologies Inc M56P [Radeon Mobility X1600] [1002:71c5]

Does Option "RenderAccel" "off" or Option "DRI" "off" or any other usual suspect work around the problem?

With DRI off the bug no longer reproduces. If I remove DRI "off" and use RenderAccel "off" instead, the bug comes back. Option "AccelMethod" "XAA" also makes the bug go away (I guess it's just hitting a different execution path).
As I mentioned earlier, if you have compiz ON this turns into an ioctl() that blocks indefinitely with 0% CPU activity in xorg. If you have compiz OFF it instead turns into a CPU spin inside Xorg, hitting the breakpoints in the order I explained in comment #1. This latter fact means that on modern dual core machines people basically see "50% CPU in use by Xorg"; their system is sluggish but can still be operated. Quad core users, on the other hand, see "xorg hogs 25% CPU constantly".
By looking more carefully into the downstream bug reports I think I have another 10 duplicates of this bug. Quad core users tend to report "a performance problem" rather than a freeze/lockup, so I didn't realize at first that these bugs were probably duplicates.
There is still one thing that makes me not want to dup all those bugs against my bug though: some of these "25% CPU in xorg" bugs show a spinning stacktrace rooted in exaGlyphs(), whereas, as I explained above, the CPU spin that I see for _this_ bug is rooted in RADEONPrepareSolidCP(). However, both of these stacks have drmIoctl() and ioctl() as their two topmost stack frames.
FWIW, an example of this (a CPU spinning xorg stack rooted in the exaGlyphs function) in the downstream tracker is bug 347078:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-ati/+bug/347078
So, I even believe that upstream FDO bug 21683 is a potential duplicate of this bug:
https://bugs.freedesktop.org/show_bug.cgi?id=21683

all defaults except EXANoComposite==true.... still freezes
all defaults except EXANoDownloadFromScreen==true.... still freezes
all defaults except EXANoUploadToScreen==true.... DOES NOT FREEZE
Let me know if there is anything else that can help narrow it down.

I've downloaded all the files on that website using "wget -m" and then I opened them one by one in Firefox. The offending item is a bitmap with 8K pixels height:
http://mundoplus.tv/tpl/v3/panBR.png
I put a copy here in case they change their website:
http://pages.minimum.se/crashers/ddx_stressers/panBR.png
An interesting detail is that the following 10K pixels wide bitmap opens just fine on the same machine:
http://pages.minimum.se/crashers/ddx_stressers/Singapore_port_panorama.jpg
(this latter 10K wide bitmap actually crashes the intel DDX driver when running in UXA mode but that's another story)

otaylor said in #radeon that he was able to repro this bug on Fedora 10 + r500 card.

Created attachment 25789
UploadToScreen coordinate paranoia
Does this patch help?

(In reply to comment #5)
> So, I even believe that upstream FDO bug 21683 is a potential duplicate of this
> bug:
> https://bugs.freedesktop.org/show_bug.cgi?id=21683
Hold your horses...
that only talks about slowdowns, not a lockup.
Backtraces are generally not useful for diagnosing GPU lockups because they will just more or less randomly show one of the places where the drivers wait for the GPU to catch up (which never happens because it's locked up). Instead one has to focus on what triggers the lockup.

(In reply to comment #9)
> Created an attachment (id=25789)
> UploadToScreen coordinate paranoia
>
> Does this patch help?
<snip>
Yes, it does. With these extra coordinate checks the crash no longer occurs here. Tested both the website and the individual PNG image in Firefox. I have an ATI X1400 Radeon Mobility [1002:7145] w/128 MB RAM. Patch applied to master radeon branch as of today.
For the hell of it, I also tested with a 10x6000 PNG image in Firefox (which should be *within* the limits introduced by this patch, no?), and that works as well, no crash. Also, no crash for a 10x7000 PNG image.
This problem also affects ATI Radeon Mobility X1600/M56P [1002:71c5] with 256MB RAM.

(In reply to comment #10)
> With these extra coordinate checks the crash no longer occurs here.
So far, so good. Can you set a gdb breakpoint on the RADEONUploadToScreenCP() line that returns FALSE after these new checks, and attach a backtrace from when it triggers?

(In reply to comment #10)
> <snip>
Also tested without the "..(x + w) > 8191" criterion (that is, with only the y+height check), and still no crash. Tested the website, plus a 12000x10 PNG and a 10x12000 PNG (both PNGs have transparency). And also tested with a 9000x9000 PNG (also w/transparency).
I'll see if I can get you the debugging info you have requested.

(In reply to comment #11)
> So far, so good. Can you set a gdb breakpoint on the RADEONUploadToScreenCP()
> line that returns FALSE after these new checks, and attach a backtrace from
> when it triggers?
I succeeded in attaching gdb to the Xorg process and setting the required breakpoint. I did this from a VT. Then I told gdb to continue execution (with the c command), but when switching back to Xorg all I get is a black screen (had to reboot).

(In reply to comment #13)
> <snip>
I'll have a look at this:
http://www.x.org/wiki/Development/Documentation/ServerDebugging
...

(In reply to comment #14)
> <snip>
Perhaps this will do it for me:
handle SIGUSR1 nostop
Apparently gdb halts things on VT switch. I don't have the possibility of logging in via ssh where I'm currently at.

(In reply to comment #11)
> <snip>
This is as far as I got today:
(gdb) break radeon_exa_funcs.c:276
Breakpoint 1 at 0xb78d0b0e: file ../../src/radeon_exa_funcs.c, line 276.
(gdb) continue
Continuing.
[Switching to Thread 0xb79a36d0 (LWP 2978)]
Breakpoint 1, RADEONUploadToScreenCP (pDst=0xa0580008, x=0, y=8190, w=16, h=2, src=0xa3bef70 "<BINARY GARBAGE>", src_pitch=64) at ../../src/radeon_exa_funcs.c:276
276 return FALSE;
(gdb) backtrace full
#0 RADEONUploadToScreenCP (pDst=0xa0580008, x=0, y=8190, w=16, h=2, src=0xa3bef70 "<BINARY GARBAGE>", src_pitch=64) at ../../src/radeon_exa_funcs.c:276
        pScrn = (ScrnInfoPtr) 0x9776e18
        info = (RADEONInfoPtr) 0x9777320
        bpp = 32
        hpass = 3213989268
        buf_pitch = 3213989272
        dst_pitch_off = 2690121736
        __FUNCTION__ = "RADEONUploadToScreenCP"
#1 0xb765b255 in ?? () from /usr/lib/xorg/modules//libexa.so
No symbol table info available.
#2 0x0818258d in ?? ()
No symbol table info available.
#3 0x0808a20e in ProcPutImage ()
No symbol table info available.
#4 0x0808d57f in Dispatch ()
No symbol table info available.
#5 0x080722ed in main ()
No symbol table info available.
The breakpoint was triggered by opening this image:
http://mundoplus.tv/tpl/v3/panBR.png

(In reply to comment #16)
> <snip>
> The breakpoint was triggered by opening this image:
> http://mundoplus.tv/tpl/v3/panBR.png
Oh, and I compiled the radeon driver with no gcc optimization (-O0).

I did some more testing with sizes because of the number 8192 (=2^13), which is the height of the image that triggers this. Here are the results:
16x8191 PNG image: No crash. [http://folk.uio.no/oyvinst/fdsbug21598/16x8191.png]
16x8192 PNG image: FREEZE [http://folk.uio.no/oyvinst/fdsbug21598/16x8192.png]
16x8193 PNG image: No crash. [http://folk.uio.no/oyvinst/fdsbug21598/16x8193.png]
It looks like this bug has something to do with the height being exactly 8192.
I created various PNG test images of different dimensions; they can be found here:
http://folk.uio.no/oyvinst/fdsbug21598/
Of all these images, *only* the 16x8192 image triggers the crash/freeze. All images were created in Gimp with transparency (I don't know if that's really necessary).
Hope this might help somewhat in tracking this down.

I modified my -ati driver according to the patch in comment #9 and then I put a breakpoint on the "return FALSE" just below it. This coordinate hack patch indeed makes the bug go away! And when I surf to mundoplus.tv I do hit the proposed breakpoint; here is the "bt full" from that breakpoint. I compiled with DEB_BUILD_OPTIONS="noopt nostrip", which I think means essentially "-O0 -g3" or something like that.
Program received signal SIGINT, Interrupt.
0xb7f6a430 in __kernel_vsyscall ()
(gdb) break radeon_exa_funcs.c:277
Breakpoint 2 at 0xb796c956: file ../../src/radeon_exa_funcs.c, line 277.
(gdb) info breakpoints
Num Type Disp Enb Address What
2 breakpoint keep y 0xb796c956 in RADEONUploadToScreenCP at ../../src/radeon_exa_funcs.c:277
(gdb) c
Continuing.
Breakpoint 2, RADEONUploadToScreenCP (pDst=0xa2c94008, x=0, y=8190, w=16, h=2, src=0x8a97228 "<BINARY GARBAGE>", src_pitch=64) at ../../src/radeon_exa_funcs.c:277
277 return FALSE;
(gdb) bt full
#0 RADEONUploadToScreenCP (pDst=0xa2c94008, x=0, y=8190, w=16, h=2, src=0x8a97228 "<BINARY GARBAGE>", src_pitch=64) at ../../src/radeon_exa_funcs.c:277
        pScrn = (ScrnInfoPtr) 0x85fd5e8
        info = (RADEONInfoPtr) 0x85fbea8
        bpp = 32
        hpass = 2731098120
        buf_pitch = 3077628745
        dst_pitch_off = 3213388776
        __func__ = "RADEONUploadToScreenCP"
#1 0xb770f344 in exaPutImage (pDrawable=0xa2c94008, pGC=0x8a71738, depth=24, x=0, y=8190, w=16, h=2, leftPad=0, format=2, bits=0x8a97228 "<BINARY GARBAGE>") at ../../exa/exa_accel.c:211
No locals.
#2 0x08182a10 in damagePutImage (pDrawable=0xa2c94008, pGC=0x8a71738, depth=24, x=0, y=8190, w=16, h=2, leftPad=0, format=2, pImage=0x8a97228 "<BINARY GARBAGE>") at ../../../miext/damage/damage.c:905
        pGCPriv = (DamageGCPrivPtr) 0x8a054b0
        oldFuncs = (GCFuncs *) 0x8213a80
#3 0x0808a301 in ProcPutImage (client=0x8980830) at ../../dix/dispatch.c:1897
        pGC = (GC *) 0x8a71738
        pDraw = (DrawablePtr) 0xa2c94008
        length = <value optimized out>
#4 0x0808cff7 in Dispatch () at ../../dix/dispatch.c:437
        result = <value optimized out>
        client = (ClientPtr) 0x8980830
        nready = 0
        start_tick = 660
#5 0x080722fd in main (argc=10, argv=0xbf886f24, envp=0xbf886f50) at ../../dix/main.c:397
        i = <value optimized out>
        alwaysCheckForInput = {0, 1}
(gdb)

Actually, when I run the repro with the patch from comment #9 it hits the subsequent "return FALSE" twice. The second time the "bt full" is:
(gdb) bt full
#0 RADEONUploadToScreenCP (pDst=0xa2896008, x=0, y=8190, w=16, h=2, src=0x8a97228 "<BINARY GARBAGE>", src_pitch=64) at ../../src/radeon_exa_funcs.c:277
        pScrn = (ScrnInfoPtr) 0x85fd5e8
        info = (RADEONInfoPtr) 0x85fbea8
        bpp = 32
        hpass = 2726912008
        buf_pitch = 3077628745
        dst_pitch_off = 3213388776
        __func__ = "RADEONUploadToScreenCP"
#1 0xb770f344 in exaPutImage (pDrawable=0xa2896008, pGC=0x899cc58, depth=24, x=0, y=8190, w=16, h=2, leftPad=0, format=2, bits=0x8a97228 "<BINARY GARBAGE>") at ../../exa/exa_accel.c:211
No locals.
#2 0x08182a10 in damagePutImage (pDrawable=0xa2896008, pGC=0x899cc58, depth=24, x=0, y=8190, w=16, h=2, leftPad=0, format=2, pImage=0x8a97228 "<BINARY GARBAGE>") at ../../../miext/damage/damage.c:905
        pGCPriv = (DamageGCPrivPtr) 0x89aa428
        oldFuncs = (GCFuncs *) 0x8213a80
#3 0x0808a301 in ProcPutImage (client=0x8980830) at ../../dix/dispatch.c:1897
        pGC = (GC *) 0x899cc58
        pDraw = (DrawablePtr) 0xa2896008
        length = <value optimized out>
#4 0x0808cff7 in Dispatch () at ../../dix/dispatch.c:437
        result = <value optimized out>
        client = (ClientPtr) 0x8980830
        nready = 0
        start_tick = 680
#5 0x080722fd in main (argc=10, argv=0xbf886f24, envp=0xbf886f50) at ../../dix/main.c:397
        i = <value optimized out>
        alwaysCheckForInput = {0, 1}
(gdb) c

I can confirm that 16x8191.png and 16x8193.png load just fine on unpatched jaunty versions whereas 16x8192.png triggers the bug.
I also grepped the -ati driver for the number 8192, and that turns up several very interesting source lines that use this number, for example as maximum texture sizes and for vport_scissor. Maybe the "viewport cropping" has an off-by-one error?

With <width>x8193, does the breakpoint get hit? What about 8191x<height> vs. 8192x<height> vs. 8193x<height>?
P.S. Please try to keep comments tidy to avoid cluttering up reports.

Using the -ati driver modified with the coord paranoia patch as per comment #9:
For 16x8192 the breakpoint gets hit.
For 16x8191 and 16x8193 the breakpoint does not get hit.
Using the same patched -ati driver and breakpoint set I also tried (after rotating oyvind's PNGs in GIMP; I'm unsure what this transformation did to the transparency, if that matters):
8191x16 (no crash, no breakpoint hit)
8192x16 (no crash, no breakpoint hit)
8193x16 (no crash, no breakpoint hit)
I also reverted to the unpatched jaunty driver and, just as I suspected, 8191x16, 8192x16 and 8193x16 load just fine there as well.
If needed, you can find my rotated PNGs here:
http://pages.minimum.se/crashers/ddx_stressers/ati_bug_21598/

Created attachment 25819
EXA coordinate limit fixups
This is probably the real fix. Not sure about the R600 changes, but they make everything consistent with the stricter checks in R600CheckComposite().

(In reply to comment #23)
> Using the same patched -ati driver and breakpoint set I also tried (after
> rotating oyvind's PNGs in GIMP, i'm unsure about what this transformation did
> to transparency if that matters):
> 8191x16 (no crash, no breakpoint hit)
> 8192x16 (no crash, no breakpoint hit)
> 8193x16 (no crash, no breakpoint hit)
I don't think the transparency matters, but I realized in the meantime that it hits the pitch limit at a much lower width.

I can confirm that the patch from comment #24 makes it possible to open all PNGs mentioned in this bug report, and no lockups or issues were seen (nothing barfed in xorg.log or dmesg).
Is this suitable for cherry picking onto the 6.12.2 we currently have in Ubuntu? The patch applies cleanly and works nicely from what I can tell with my limited testing so far at least.
Thanks a ton MrCooper.

Fix pushed to the Git master and 6.12 branches, thanks for all the testing.
I haven't pushed the R600 changes as they don't seem to be necessary. Maybe the stricter check in R600CheckComposite() should be relaxed instead.

Hello, I'm still having the firefox lockup with the mentioned PNG image, and my xorg was apparently released in August.
Maybe this is normal, maybe not, but I just wanted to share it. My xorg info:
X.Org X Server 1.6.3.901 (1.6.4 RC 1)
Release Date: 2009-8-25
X Protocol Version 11, Revision 0
Build Operating System: Linux 2.6.30-ARCH x86_64
Current Operating System: Linux natuur 2.6.30-ARCH #1 SMP PREEMPT Wed Sep 9 14:16:44 CEST 2009 x86_64
Build Date: 04 September 2009 05:45:43PM
(--) PCI:*(0:1:0:0) 1002:5b63:17ee:0373 ATI Technologies Inc RV370 [Sapphire X550 Silent] rev 0, Mem @ 0xd8000000/134217728, 0xfe9f0000/65536, I/O @ 0x0000c000/256, BIOS @ 0x????????/131072
(--) PCI: (0:1:0:1) 1002:5b73:17ee:0372 ATI Technologies Inc RV370 secondary [Sapphire X550 Silent] rev 0, Mem @ 0xfe9e0000/65536
Kind regards, Michel

My ati module version:
(II) Module ati: vendor="X.Org Foundation"
compiled for 1.6.1, module version = 6.12.2
Module class: X.Org Video Driver
ABI class: X.Org Video Driver, version 5.0
It seems to be 6.12 as mentioned above.
Kind regards, Michel

(In reply to comment #28)
> My ati-module-version:
>
> (II) Module ati: vendor="X.Org Foundation"
> compiled for 1.6.1, module version = 6.12.2
The fix is only in 6.12.3.

Hello, thank you. I'll upgrade then :).
Kind regards, Michel |
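For readers skimming the thread, the kind of bounds check discussed around the "coordinate paranoia" patch can be sketched as below. This is only an illustration under stated assumptions: the thread quotes a criterion of the form "(x + w) > 8191" and mentions a y+height check, but the patch itself is not reproduced here, so `upload_coords_ok` and `SKETCH_COORD_LIMIT` are hypothetical names and the exact form of the real check is assumed.

```c
#include <stdbool.h>

/* Hypothetical limit for illustration, taken from the "(x + w) > 8191"
 * criterion quoted in the thread. */
#define SKETCH_COORD_LIMIT 8191

/* Hypothetical helper: returns true when the upload rectangle stays
 * within the limit, false when the caller should take the fallback
 * path (the "return FALSE" line the breakpoints were set on). */
static bool upload_coords_ok(int x, int y, int w, int h)
{
    if ((x + w) > SKETCH_COORD_LIMIT)
        return false;
    if ((y + h) > SKETCH_COORD_LIMIT)
        return false;
    return true;
}
```

With the values from the backtraces (x=0, y=8190, w=16, h=2), y + h is 8192, so `upload_coords_ok(0, 8190, 16, 2)` returns false, which matches the breakpoint being hit for the 16x8192 image. The same rectangle one row higher, `upload_coords_ok(0, 8189, 16, 2)`, passes.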