7916 – aiglx+{compiz|metacity&compositing}+switch to VT = freeze

Bug 7916 - aiglx+{compiz|metacity&compositing}+switch to VT = freeze

Summary: aiglx+{compiz|metacity&compositing}+switch to VT = freeze

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Server/General (show other bugs)
Version:	git
Hardware:	x86 (IA32) All

Importance:	high major
Assignee:	Xorg Project Team
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2006-08-18 10:16 UTC by Timothée Lecomte
Modified:	2006-10-29 18:10 UTC (History)
CC List:	5 users (show)

See Also:
i915 platform:
i915 features:

Attachments
gdb log file, (mostly useless as there is no backtrace) (4.58 KB, text/plain) 2006-08-18 10:18 UTC, Timothée Lecomte	no flags	Details
Xorg log file (41.64 KB, text/plain) 2006-08-18 10:19 UTC, Timothée Lecomte	no flags	Details
Xorg config file (6.25 KB, text/plain) 2006-08-18 10:20 UTC, Timothée Lecomte	no flags	Details
'ps aux' output at freeze time (4.99 KB, text/plain) 2006-08-19 00:17 UTC, Timothée Lecomte	no flags	Details
gdb backtrace when attaching to the frozen X server (5.75 KB, text/plain) 2006-08-19 13:47 UTC, Timothée Lecomte	no flags	Details
Xorg.0.log at freeze time (42.27 KB, text/plain) 2006-08-19 13:48 UTC, Timothée Lecomte	no flags	Details
full gdb backtrace when attaching to the frozen X server (7.52 KB, text/plain) 2006-08-19 23:44 UTC, Timothée Lecomte	no flags	Details
Happy VT Switch (6.42 KB, patch) 2006-08-24 21:09 UTC, Kristian Høgsberg	no flags	Details \| Splinter Review
full gdb backtrace when attaching to frozen patched X server (5.83 KB, text/plain) 2006-08-24 23:19 UTC, Timothée Lecomte	no flags	Details
Updated patch (6.94 KB, patch) 2006-08-25 00:12 UTC, Kristian Høgsberg	no flags	Details \| Splinter Review
Show Obsolete (3) View All

Description Timothée Lecomte 2006-08-18 10:16:04 UTC

Hi,

I am experiencing a problem with Xorg, aiglx and a GL compositing manager,
namely compiz or metacity.

To reproduce:
-Enable aiglx
-'startx'
-'killall twm'
-'compiz --replace --indirect-rendering --strict-binding gconf&'
OR metacity& with the compositor enabled in gconf
-hit ctrl-alt-F1
-hit ctrl-alt-F7
=> the machine stops to respond, the vt 1 is still displayed on the screen, the
keyboard leds do not work anymore. The only thing to do is a hard reboot.

I am using :
-xserver retrieved from anonymous git the 08-12-2006
(patched with
http://people.freedesktop.org/~krh/compiz-on-aiglx/xorg-x11-server-1.1.0-gl-include-inferiors.patch
to make compiz work with it)
-mesa retrieved from cvs the same day
-ati 6.6.1 (opensource, not binary) for my Radeon Mobility 7500
-compiz from quinn's cvs patched with what is available on the above page to
make it work with aiglx
-metacity from cvs

I tried debugging it in gdb, using the script proposed by the wiki page
'DebuggingTheXserver' on freesedktop.org. I had to add '-kb' to the server args.
Unfortunately, there does not seem to be any segfault, so gdb does not give any
backtrace. The server log read does not contain any error message either.

I am attaching my xorg.conf, my Xorg.0.log whose end corresponds to the freeze,
and the gdb_log.$PID as produced by the debugging script, although it does not
contain a backtrace.

Comment 1 Timothée Lecomte 2006-08-18 10:18:42 UTC

Created attachment 6598 [details]
gdb log file, (mostly useless as there is no backtrace)

Comment 2 Timothée Lecomte 2006-08-18 10:19:30 UTC

Created attachment 6599 [details]
Xorg log file

Comment 3 Timothée Lecomte 2006-08-18 10:20:04 UTC

Created attachment 6600 [details]
Xorg config file

Comment 4 Timothée Lecomte 2006-08-19 00:14:34 UTC

Ok, I managed to have a backtrace : I noticed that although the keyboard does
not respond, the acpi buttons do ! So I programmed my power button to execute a
little script that attaches gdb to X and ask for a backtrace.

Attached is the full output of gdb, and here is the last calls of the backtrace :

#1  0xb7df5e79 in ioctl () from /lib/libc.so.6
#2  0xb7f96dd5 in drmGetLock (fd=9, context=3, flags=3216313812)
    at xf86drm.c:1221
#3  0xb2ec4db5 in radeonGetLock (rmesa=0x83e6290, flags=3216313812)
    at radeon_lock.c:78
#4  0xb2ecec41 in radeonUploadTexImages (rmesa=0x83e6290, t=0x85434b8, face=0)
    at radeon_texmem.c:353

Seems this is a race condition. Now that I know how to debug it, feel free to
ask me for whatever info you need.

P.S. : maybe we can put the trick about the acpi button on the wiki ? It seems
to me a nice way to debug a freeze like this one when you only have one machine.

Comment 5 Timothée Lecomte 2006-08-19 00:17:09 UTC

Created attachment 6602 [details]
'ps aux' output at freeze time

Comment 6 Michel Dänzer 2006-08-19 07:52:37 UTC

It'll be interesting to see the full backtrace (you seem to have attached the
output of ps instead; could you also attach a log file from when this
happened?), but I'd guess that this is a deadlock between the 2D and 3D drivers.
The 2D driver holds the DRI lock between LeaveVT and EnterVT to prevent 3D
clients from touching the hardware while switched away, but AIGLX relies on
calling DRIUnlock() to enable the 3D driver to take the lock within the X server
process. Apparently, AIGLX calls into the 3D driver before the 2D driver
releases the lock in EnterVT. Hopefully this can be solved with some reordering.

Comment 7 Timothée Lecomte 2006-08-19 13:46:38 UTC

You're right, I attached the wrong file. I am now attaching the right backtrace
file, and the log file retrieved exactly at the same moment.

Comment 8 Timothée Lecomte 2006-08-19 13:47:51 UTC

Created attachment 6613 [details]
gdb backtrace when attaching to the frozen X server

Comment 9 Timothée Lecomte 2006-08-19 13:48:52 UTC

Created attachment 6614 [details]
Xorg.0.log at freeze time

Comment 10 Timothée Lecomte 2006-08-19 23:43:54 UTC

I realized that X is waiting for the dead lock even before I hit ctrl-alt-F7 to
come back to it. The dispatcher must be processing some pending call coming from
aiglx and requiring the dri lock after LeaveVT has been called.

I am attaching a full backtrace obtained just after going to tty1.

Comment 11 Timothée Lecomte 2006-08-19 23:44:39 UTC

Created attachment 6618 [details]
full gdb backtrace when attaching to the frozen X server

Comment 12 Michel Dänzer 2006-08-20 07:20:25 UTC

Thanks. The problem seems to be that AIGLX calls into the 3D driver while the X
server is switched away.

As I suspect we can't switch existing AIGLX contexts from hardware acceleration
to software rendering when switching VTs, the only solution I can think of is to
freeze AIGLX clients using ClientSleep() when switching away and reviving them
when switching back, e.g. in an EnableDisableFBAccess wrapper. Kristian, do you
agree with this analysis?

Comment 13 Michel Dänzer 2006-08-21 09:15:51 UTC

Another idea: Instead of doing the current 'unlock server context - lock GLX
context - unlock GLX context - lock server context' dance, we could add an
interface to the DRI drivers to tell them 'the lock will be held whenever you
are called, so you don't ever need to grab it'. That should also avoid clip list
races with glucose that Adam Jackson pointed out on IRC.

Comment 14 Kristian Høgsberg 2006-08-24 21:09:53 UTC

Created attachment 6679 [details] [review]
Happy VT Switch

Here's a patch that fixes the problem for me.  What we do here is to block all
GLX clients when switching away from the X server VT and resume them when
switching back.  Also, when switched away, new GLX clients are immediately
blocked.  One thing that could be improved with this patch is to only resume
those clients that we put to sleep ourselves, but that should be a stylistic
detail.

Comment 15 Timothée Lecomte 2006-08-24 22:12:03 UTC

I just tested the patch and it fixes the problem. Thanks Kristian.

Comment 16 Timothée Lecomte 2006-08-24 23:19:18 UTC

Created attachment 6680 [details]
full gdb backtrace when attaching to frozen patched X server

I spoke a little too quicky. Although it does not freeze on every VT switch, I
was able to freeze it as before. The attached backtrace is slightly different
though.

Comment 17 Kristian Høgsberg 2006-08-24 23:45:35 UTC

(In reply to comment #16)
> Created an attachment (id=6680) [edit]
> full gdb backtrace when attaching to frozen patched X server
> 
> I spoke a little too quicky. Although it does not freeze on every VT switch, I
> was able to freeze it as before. The attached backtrace is slightly different
> though.

Yeah, that's the one case I'm not handling - when we get a callback from the resource manager to clean 
up after a client has gone, we end up calling into the DRI driver even though we've blocked all clients.

The fix here is to queue up contexts for destruction when were called to destroy one while switched off 
the vt.  Once we switch back, we can loop through the list and clean up properly.  Will attach updated 
patch tomorrow.

Comment 18 Kristian Høgsberg 2006-08-25 00:12:30 UTC

Created attachment 6681 [details] [review]
Updated patch

Here's an updated version that implements the idea described above, it should
fix the lockup you see.

Depending on the order XF86 and DRI registers their block and release handlers,
the DRI lock may or may not be taken when the VT switch callback is invoked. 
If you see lockups, try removing the __glXleaveServer() __glXenterServer calls
from this part of the code (in glxResumeClients):

+    __glXleaveServer();
+    for (cx = glxPendingDestroyContexts; cx != NULL; cx = next) {
+	next = cx->next;
+
+	cx->destroy(cx);
+    }
+    glxPendingDestroyContexts = NULL;
+    __glXenterServer();

Comment 19 Kristian Høgsberg 2006-08-25 00:13:55 UTC

(In reply to comment #18)
> Created an attachment (id=6681) [edit]
> Updated patch

I should say that I haven't tested this patch at all...

Comment 20 Michel Dänzer 2006-08-25 02:19:56 UTC

Kristian, as I alluded to on IRC, I'm not sure suspending the clients is such a
good idea after all. E.g., when the server is switched away for an extended
period of time, won't the GLX client display connections time out? (OTOH, direct
rendering clients will block on the hardware lock as well, but they can at least
theoretically keep the display connection alive in a different thread).

Do you think the other approach outlined in comment #13 is not feasible? Seems
to me like it could be simpler to implement and would provide benefits beyond
fixing this problem. There's at least one minor drawback though: the DRI drivers
will usually sleep when they find empty cliprects. Probably not a big issue
though when the server is switched away.

Then again, this approach could serve as a second line of defence with older DRI
drivers that don't support the other approach in any case.

Comment 21 Kristian Høgsberg 2006-08-28 16:08:44 UTC

(In reply to comment #20)
> Kristian, as I alluded to on IRC, I'm not sure suspending the clients is such a
> good idea after all. E.g., when the server is switched away for an extended
> period of time, won't the GLX client display connections time out? (OTOH, direct
> rendering clients will block on the hardware lock as well, but they can at least
> theoretically keep the display connection alive in a different thread).

I don't think this is a problem - I left my workstation on overnight with compiz
running on a switched away X server.  When I got back next day and switch back
to the X server everything was still running.  Blocking on the X connection and
blocking on the DRI lock is two different things but they look similar to the
application (some glx call end up blocking).  The only thing that would cause
apps to timeout in this case would be if Xlib has some kind of built-in time out
when the display doesn't reply, but I don't know that that exists.

> Do you think the other approach outlined in comment #13 is not feasible? Seems
> to me like it could be simpler to implement and would provide benefits beyond
> fixing this problem. There's at least one minor drawback though: the DRI drivers
> will usually sleep when they find empty cliprects. Probably not a big issue
> though when the server is switched away.
> 
> Then again, this approach could serve as a second line of defence with older DRI
> drivers that don't support the other approach in any case.

I think the current patch is pretty much what we want.  It's not only a case of
avoiding the deadlock, it's also about preventing anyting from touching the hw
while we're switched away.  Since we can't switch to sw-rendering (like core X
drawing does) blocking the clients seems like the best choice.

Comment 22 Michel Dänzer 2006-08-28 17:33:16 UTC

(In reply to comment #21)
> The only thing that would cause apps to timeout in this case would be if Xlib
> has some kind of built-in time out when the display doesn't reply, but I don't
> know that that exists.

Won't TCP connections time out, e.g.? Good to know this isn't an issue for local
connections though.


> I think the current patch is pretty much what we want.  

I agree it's a good solution for the deadlock for now, we can always experiment
with the other approach later on.

> It's not only a case of avoiding the deadlock, it's also about preventing
> anyting from touching the hw while we're switched away.  Since we can't switch
> to sw-rendering (like core X drawing does) blocking the clients seems like the
> best choice.

The DRI drivers should detect the empty cliprects and not touch the hardware in
that case. All we need is a mechanism that allows doing this while the server
holds the HW lock.

Comment 23 Wesley Stessens 2006-09-03 11:02:07 UTC

Hi, guys. Is Kristian's temporary patch included in git?
If yes; Where can I downloaded the latest files I need to get this to work?
If not; What source do I have to download to apply the patch too?

Thanks! And good work, Kristian! :) This problem was giving me a big headache.

Comment 24 Kristian Høgsberg 2006-09-03 13:43:37 UTC

(In reply to comment #23)
> Hi, guys. Is Kristian's temporary patch included in git?
> If yes; Where can I downloaded the latest files I need to get this to work?
> If not; What source do I have to download to apply the patch too?

It's not in git yet - the patch applies to the latest git xserver tree, or at least the tree from around when 
I posted the patch.  I haven't committed it yet, because I was wondering if we should try Michels 
approach where we zero out the clip rects instead of blocking the clients. 
 
> Thanks! And good work, Kristian! :) This problem was giving me a big headache.

Good to hear it works for you.

Comment 25 Wesley Stessens 2006-09-03 16:54:16 UTC

(In reply to comment #24)
> It's not in git yet - the patch applies to the latest git xserver tree, or at
least the tree from around when 
> I posted the patch.  I haven't committed it yet, because I was wondering if we
should try Michels 
> approach where we zero out the clip rects instead of blocking the clients. 

I see. I know nothing about the technical part of this patch, so I'll just trust
you guys to work out the best solution :)

> Good to hear it works for you.

Well, I haven't tried it yet --
but I will try it tomorrow to see if it works for me
Either way, I appreciate the work you put into it.

Comment 26 Michel Dänzer 2006-09-04 02:57:57 UTC

(In reply to comment #24)
> I haven't committed it yet, because I was wondering if we should try Michels 
> approach 

As far as I'm concerned, it would be a good idea to commit your patch in any case.

> where we zero out the clip rects instead of blocking the clients. 

Note that this isn't the point of my idea; the X server already clears all
cliprects when switching away. The idea would be to make it possible for the X
server to call into the 3D driver with the hardware lock held. In the special
case of the server being switched away, the 3D driver should then detect the
empty cliprects and 'do nothing', but this approach would have more general
implications on AIGLX performance and/or correctness, in particular with
something like glucose. It might be better to discuss this on a mailing list.

Comment 27 Kristian Høgsberg 2006-09-07 12:39:15 UTC

OK, committed the latest patch, closing this bug.

Comment 28 Øivind Lund 2006-10-29 18:10:51 UTC

i have to ask one incredibly stupid question here, sorry .. but how do i install
the patch ?

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.