Hi,

I am experiencing a problem with Xorg, aiglx and a GL compositing manager, namely compiz or metacity.

To reproduce:
- Enable aiglx
- 'startx'
- 'killall twm'
- 'compiz --replace --indirect-rendering --strict-binding gconf&' OR 'metacity&' with the compositor enabled in gconf
- Hit ctrl-alt-F1
- Hit ctrl-alt-F7

=> The machine stops responding: VT 1 is still displayed on the screen and the keyboard LEDs no longer work. The only thing to do is a hard reboot.

I am using:
- xserver retrieved from anonymous git on 08-12-2006 (patched with http://people.freedesktop.org/~krh/compiz-on-aiglx/xorg-x11-server-1.1.0-gl-include-inferiors.patch to make compiz work with it)
- mesa retrieved from cvs the same day
- ati 6.6.1 (open source, not binary) for my Radeon Mobility 7500
- compiz from quinn's cvs, patched with what is available on the above page to make it work with aiglx
- metacity from cvs

I tried debugging it in gdb, using the script proposed by the wiki page 'DebuggingTheXserver' on freedesktop.org. I had to add '-kb' to the server args. Unfortunately, there does not seem to be any segfault, so gdb does not give any backtrace. The server log does not contain any error message either.

I am attaching my xorg.conf, my Xorg.0.log whose end corresponds to the freeze, and the gdb_log.$PID produced by the debugging script, although it does not contain a backtrace.
Created attachment 6598 [details] gdb log file, (mostly useless as there is no backtrace)
Created attachment 6599 [details] Xorg log file
Created attachment 6600 [details] Xorg config file
OK, I managed to get a backtrace: I noticed that although the keyboard does not respond, the ACPI buttons do! So I programmed my power button to execute a little script that attaches gdb to X and asks for a backtrace.

Attached is the full output of gdb, and here are the last calls of the backtrace:

    #1 0xb7df5e79 in ioctl () from /lib/libc.so.6
    #2 0xb7f96dd5 in drmGetLock (fd=9, context=3, flags=3216313812) at xf86drm.c:1221
    #3 0xb2ec4db5 in radeonGetLock (rmesa=0x83e6290, flags=3216313812) at radeon_lock.c:78
    #4 0xb2ecec41 in radeonUploadTexImages (rmesa=0x83e6290, t=0x85434b8, face=0) at radeon_texmem.c:353

This looks like a race condition. Now that I know how to debug it, feel free to ask me for whatever info you need.

P.S.: maybe we can put the trick about the ACPI button on the wiki? It seems to me a nice way to debug a freeze like this one when you only have one machine.
Created attachment 6602 [details] 'ps aux' output at freeze time
It'll be interesting to see the full backtrace (you seem to have attached the output of ps instead; could you also attach a log file from when this happened?), but I'd guess that this is a deadlock between the 2D and 3D drivers. The 2D driver holds the DRI lock between LeaveVT and EnterVT to prevent 3D clients from touching the hardware while switched away, but AIGLX relies on calling DRIUnlock() to enable the 3D driver to take the lock within the X server process. Apparently, AIGLX calls into the 3D driver before the 2D driver releases the lock in EnterVT. Hopefully this can be solved with some reordering.
You're right, I attached the wrong file. I am now attaching the right backtrace file, and the log file retrieved exactly at the same moment.
Created attachment 6613 [details] gdb backtrace when attaching to the frozen X server
Created attachment 6614 [details] Xorg.0.log at freeze time
I realized that X is already stuck waiting for the lock even before I hit ctrl-alt-F7 to come back to it. The dispatcher must be processing some pending call coming from aiglx that requires the DRI lock after LeaveVT has been called. I am attaching a full backtrace obtained just after switching to tty1.
Created attachment 6618 [details] full gdb backtrace when attaching to the frozen X server
Thanks. The problem seems to be that AIGLX calls into the 3D driver while the X server is switched away. As I suspect we can't switch existing AIGLX contexts from hardware acceleration to software rendering when switching VTs, the only solution I can think of is to freeze AIGLX clients using ClientSleep() when switching away and reviving them when switching back, e.g. in an EnableDisableFBAccess wrapper. Kristian, do you agree with this analysis?
Another idea: Instead of doing the current 'unlock server context - lock GLX context - unlock GLX context - lock server context' dance, we could add an interface to the DRI drivers to tell them 'the lock will be held whenever you are called, so you don't ever need to grab it'. That should also avoid clip list races with glucose that Adam Jackson pointed out on IRC.
Created attachment 6679 [details] [review]
Happy VT Switch

Here's a patch that fixes the problem for me. What we do here is to block all GLX clients when switching away from the X server VT and resume them when switching back. Also, when switched away, new GLX clients are immediately blocked.

One thing that could be improved with this patch is to only resume those clients that we put to sleep ourselves, but that should be a stylistic detail.
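For readers following the thread, the behaviour described in this comment can be sketched as a small stand-alone model. This is only an illustration of the idea, not the attached patch: `ToyServer`, `ToyGlxClient` and all `toy_*` functions are hypothetical names, and the real X server keeps this state elsewhere and blocks clients through its own dispatch machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLIENTS 8

/* Hypothetical stand-in for per-client GLX state (not the X server's struct). */
typedef struct {
    bool in_use;   /* slot holds a connected GLX client */
    bool blocked;  /* requests from this client are not dispatched */
} ToyGlxClient;

typedef struct {
    ToyGlxClient clients[MAX_CLIENTS];
    bool switched_away; /* true between leaving and re-entering the VT */
} ToyServer;

void toy_server_init(ToyServer *s)
{
    *s = (ToyServer){0};
}

/* Register a new GLX client. Returns its slot, or -1 if the table is full.
 * A client that connects while the server is switched away starts blocked. */
int toy_add_client(ToyServer *s)
{
    for (int i = 0; i < MAX_CLIENTS; i++) {
        if (!s->clients[i].in_use) {
            s->clients[i].in_use = true;
            s->clients[i].blocked = s->switched_away;
            return i;
        }
    }
    return -1;
}

/* Switching away: block every GLX client so nothing reaches the 3D driver. */
void toy_leave_vt(ToyServer *s)
{
    s->switched_away = true;
    for (int i = 0; i < MAX_CLIENTS; i++)
        if (s->clients[i].in_use)
            s->clients[i].blocked = true;
}

/* Switching back: resume all GLX clients (including ones blocked on arrival). */
void toy_enter_vt(ToyServer *s)
{
    s->switched_away = false;
    for (int i = 0; i < MAX_CLIENTS; i++)
        if (s->clients[i].in_use)
            s->clients[i].blocked = false;
}

/* True if a request from this client would be dispatched right now. */
bool toy_dispatch(const ToyServer *s, int idx)
{
    assert(idx >= 0 && idx < MAX_CLIENTS);
    return s->clients[idx].in_use && !s->clients[idx].blocked;
}
```

Like the patch as described, this model resumes every GLX client on the way back rather than only those it put to sleep itself.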
I just tested the patch and it fixes the problem. Thanks Kristian.
Created attachment 6680 [details]
full gdb backtrace when attaching to frozen patched X server

I spoke a little too quickly. Although it does not freeze on every VT switch, I was able to freeze it as before. The attached backtrace is slightly different though.
(In reply to comment #16)
> Created an attachment (id=6680) [edit]
> full gdb backtrace when attaching to frozen patched X server
>
> I spoke a little too quickly. Although it does not freeze on every VT switch, I
> was able to freeze it as before. The attached backtrace is slightly different
> though.

Yeah, that's the one case I'm not handling: when we get a callback from the resource manager to clean up after a client has gone, we end up calling into the DRI driver even though we've blocked all clients. The fix here is to queue up contexts for destruction when we're called to destroy one while switched off the VT. Once we switch back, we can loop through the list and clean up properly. Will attach an updated patch tomorrow.
Created attachment 6681 [details] [review]
Updated patch

Here's an updated version that implements the idea described above; it should fix the lockup you see. Depending on the order in which XF86 and DRI register their block and release handlers, the DRI lock may or may not be taken when the VT switch callback is invoked. If you see lockups, try removing the __glXleaveServer() and __glXenterServer() calls from this part of the code (in glxResumeClients):

+    __glXleaveServer();
+    for (cx = glxPendingDestroyContexts; cx != NULL; cx = next) {
+        next = cx->next;
+
+        cx->destroy(cx);
+    }
+    glxPendingDestroyContexts = NULL;
+    __glXenterServer();
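The deferred-destruction idea from the comments above can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the patch itself: `ToyContext` and the `toy_*` functions are hypothetical, the "driver" only sets a flag, and the server enter/leave calls around the loop are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a GLX context (not the server's real struct). */
typedef struct ToyContext ToyContext;
struct ToyContext {
    ToyContext *next;               /* link in the pending-destroy list */
    void (*destroy)(ToyContext *);  /* driver-side teardown hook */
    bool destroyed;                 /* set once the driver has torn it down */
};

static bool toy_switched_away = false;
static ToyContext *toy_pending_destroy = NULL;

/* Pretend driver teardown; the real one would touch the hardware. */
static void toy_driver_destroy(ToyContext *cx)
{
    cx->destroyed = true;
}

void toy_context_init(ToyContext *cx)
{
    assert(cx != NULL);
    cx->next = NULL;
    cx->destroy = toy_driver_destroy;
    cx->destroyed = false;
}

/* Called when the resource manager cleans up after a departed client.
 * While switched away, only queue the context; do not call the driver. */
void toy_free_context(ToyContext *cx)
{
    if (toy_switched_away) {
        cx->next = toy_pending_destroy;
        toy_pending_destroy = cx;
        return;
    }
    cx->destroy(cx);
}

void toy_leave_vt(void)
{
    toy_switched_away = true;
}

/* On switching back, drain the queue and destroy each context for real. */
void toy_enter_vt(void)
{
    ToyContext *cx, *next;

    toy_switched_away = false;
    for (cx = toy_pending_destroy; cx != NULL; cx = next) {
        next = cx->next;  /* read the link first: destroy may free cx */
        cx->destroy(cx);
    }
    toy_pending_destroy = NULL;
}
```

Saving `next` before calling `destroy` mirrors the loop quoted from the patch: once a context has been destroyed, its link can no longer be trusted.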
(In reply to comment #18)
> Created an attachment (id=6681) [edit]
> Updated patch

I should say that I haven't tested this patch at all...
Kristian, as I alluded to on IRC, I'm not sure suspending the clients is such a good idea after all. E.g., when the server is switched away for an extended period of time, won't the GLX client display connections time out? (OTOH, direct rendering clients will block on the hardware lock as well, but they can at least theoretically keep the display connection alive in a different thread).

Do you think the other approach outlined in comment #13 is not feasible? Seems to me like it could be simpler to implement and would provide benefits beyond fixing this problem. There's at least one minor drawback though: the DRI drivers will usually sleep when they find empty cliprects. Probably not a big issue though when the server is switched away.

Then again, this approach could serve as a second line of defence with older DRI drivers that don't support the other approach in any case.
(In reply to comment #20)
> Kristian, as I alluded to on IRC, I'm not sure suspending the clients is such a
> good idea after all. E.g., when the server is switched away for an extended
> period of time, won't the GLX client display connections time out? (OTOH, direct
> rendering clients will block on the hardware lock as well, but they can at least
> theoretically keep the display connection alive in a different thread).

I don't think this is a problem. I left my workstation on overnight with compiz running on a switched-away X server. When I got back the next day and switched back to the X server, everything was still running. Blocking on the X connection and blocking on the DRI lock are two different things, but they look similar to the application (some GLX call ends up blocking). The only thing that would cause apps to time out in this case would be if Xlib has some kind of built-in timeout when the display doesn't reply, but I don't know that that exists.

> Do you think the other approach outlined in comment #13 is not feasible? Seems
> to me like it could be simpler to implement and would provide benefits beyond
> fixing this problem. There's at least one minor drawback though: the DRI drivers
> will usually sleep when they find empty cliprects. Probably not a big issue
> though when the server is switched away.
>
> Then again, this approach could serve as a second line of defence with older DRI
> drivers that don't support the other approach in any case.

I think the current patch is pretty much what we want. It's not only a case of avoiding the deadlock, it's also about preventing anything from touching the hw while we're switched away. Since we can't switch to sw-rendering (like core X drawing does), blocking the clients seems like the best choice.
(In reply to comment #21)
> The only thing that would cause apps to timeout in this case would be if Xlib
> has some kind of built-in time out when the display doesn't reply, but I don't
> know that that exists.

Won't TCP connections time out, e.g.? Good to know this isn't an issue for local connections though.

> I think the current patch is pretty much what we want.

I agree it's a good solution for the deadlock for now; we can always experiment with the other approach later on.

> It's not only a case of avoiding the deadlock, it's also about preventing
> anything from touching the hw while we're switched away. Since we can't switch
> to sw-rendering (like core X drawing does) blocking the clients seems like the
> best choice.

The DRI drivers should detect the empty cliprects and not touch the hardware in that case. All we need is a mechanism that allows doing this while the server holds the HW lock.
Hi, guys. Is Kristian's temporary patch included in git?
If yes: where can I download the latest files I need to get this to work?
If not: what source do I have to download to apply the patch to?

Thanks! And good work, Kristian! :) This problem was giving me a big headache.
(In reply to comment #23)
> Hi, guys. Is Kristian's temporary patch included in git?
> If yes; Where can I downloaded the latest files I need to get this to work?
> If not; What source do I have to download to apply the patch too?

It's not in git yet. The patch applies to the latest git xserver tree, or at least the tree from around when I posted the patch. I haven't committed it yet, because I was wondering if we should try Michel's approach where we zero out the clip rects instead of blocking the clients.

> Thanks! And good work, Kristian! :) This problem was giving me a big headache.

Good to hear it works for you.
(In reply to comment #24)
> It's not in git yet - the patch applies to the latest git xserver tree, or at least the tree from around when
> I posted the patch. I haven't committed it yet, because I was wondering if we should try Michel's
> approach where we zero out the clip rects instead of blocking the clients.

I see. I know nothing about the technical part of this patch, so I'll just trust you guys to work out the best solution :)

> Good to hear it works for you.

Well, I haven't tried it yet -- but I will try it tomorrow to see if it works for me. Either way, I appreciate the work you put into it.
(In reply to comment #24)
> I haven't committed it yet, because I was wondering if we should try Michel's
> approach

As far as I'm concerned, it would be a good idea to commit your patch in any case.

> where we zero out the clip rects instead of blocking the clients.

Note that this isn't the point of my idea; the X server already clears all cliprects when switching away. The idea would be to make it possible for the X server to call into the 3D driver with the hardware lock held. In the special case of the server being switched away, the 3D driver should then detect the empty cliprects and 'do nothing', but this approach would have more general implications on AIGLX performance and/or correctness, in particular with something like glucose. It might be better to discuss this on a mailing list.
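As a rough illustration of the alternative discussed in this comment, the model below shows a driver entry point that can be told the caller already holds the hardware lock, and that does nothing when the cliprect list is empty. No such DRI interface is confirmed by this thread; `ToyDriver`, `ToyClipRect`, `toy_driver_draw` and the `lock_held_by_caller` parameter are all hypothetical, and the lock and hardware are reduced to counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical clip rectangle and driver state; counters replace real hardware. */
typedef struct {
    int x1, y1, x2, y2;
} ToyClipRect;

typedef struct {
    unsigned lock_grabs; /* times the driver took the hardware lock itself */
    unsigned hw_writes;  /* times the driver touched the "hardware" */
} ToyDriver;

/* Draw into every cliprect and return how many were drawn.
 * lock_held_by_caller models the proposed "the lock will be held whenever
 * you are called" contract: when true, the driver never grabs the lock. */
size_t toy_driver_draw(ToyDriver *drv, const ToyClipRect *rects,
                       size_t num_rects, bool lock_held_by_caller)
{
    size_t drawn = 0;

    assert(drv != NULL);
    assert(rects != NULL || num_rects == 0);

    if (!lock_held_by_caller)
        drv->lock_grabs++;      /* stands in for taking the lock */

    /* With an empty cliprect list (server switched away) the loop body
     * never runs, so the hardware is left alone. */
    for (size_t i = 0; i < num_rects; i++) {
        drv->hw_writes++;       /* stands in for emitting commands */
        drawn++;
    }

    /* Releasing the lock would go here when the driver took it itself. */
    return drawn;
}
```

In this model, a call made while switched away with the lock held by the caller neither blocks on the lock nor writes to the hardware, which is the property the comment is after.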
OK, committed the latest patch, closing this bug.
i have to ask one incredibly stupid question here, sorry .. but how do i install the patch ?