Bug 15884 - Multiple serious flaws in abstract socket support introduced in xtrans 1.1+
Summary: Multiple serious flaws in abstract socket support introduced in xtrans 1.1+
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Lib/xtrans (show other bugs)
Version: unspecified
Hardware: x86 (IA32) Linux (All)
: high major
Assignee: Xorg Project Team
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: xorg-7.4
  Show dependency treegraph
 
Reported: 2008-05-08 23:38 UTC by David Sainty
Modified: 2008-05-13 07:30 UTC (History)
3 users (show)

See Also:
i915 platform:
i915 features:


Attachments

Description David Sainty 2008-05-08 23:38:30 UTC
On my Linux EeePC, building with xtrans 1.1 or 1.2 breaks every single X client.  The clients can't connect to the server at all.

The new abstract socket support has two serious problems.

Problem number one: My kernel supports abstract socket namespaces but the standard X server does not.  The connect returns ECONNREFUSED, which causes the Unix domain connect attempt to fail fatally.  It should fall back to Unix domain (like the Changelog claims), but it only does that for ENOENT (see Xtranssock.c).

So ok, lets try adding ECONNREFUSED to the same case as ENOENT in SocketUNIXConnect() and see where that gets us.

Problem number two: If you do that you discover that the fall-back to Unix domain socket needs attention too.  It uses TRANS_TRY_CONNECT_AGAIN to trigger the reconnect.  If libX11 sees that it reasonably applies a sleep(1) between connects (better than spinning).  But in OUR case we don't want to wait a second!  Now every single X client takes an extra second to start up.

So any poor individual using clients using xtrans 1.1 where the fallback actually WORKS will still have to put up with all their X clients now taking exactly one second longer to run.

(It's a little awkward to fix this properly in xtrans.  My local fix is to disable abstract socket support entirely.)
Comment 1 James Cloos 2008-05-09 01:12:02 UTC
Did you recompile the server after installing xtrans 1.2?

The server, libX11 need to be compiled with the same version of xtrans
installed.  Or at least both in the same partition of the “xtrans
abstract-socket continuum”.  :-/

The recent bug fix to the abstract sockets was, in my case, done on a
kernel that had CONFIG_NAMESPACES=y but CONFIG_*_NS=n.  

Is it CONFIG_NET_NS that allows for abstract socket namespaces?
Do I also need CONFIG_SECURITY to replicate what you found?

(I’d prefer to minimize kernel recompiles to help fix this; they take
quite a bit of time on my (*old* laptop).

Do your server and clients exist in different namespaces?

As for the delay when dropping back from abstract to non-abstract,
I don’t see any problem with having two RETRY return codes, one
suggesting a pause (as TRANS_TRY_CONNECT_AGAIN currently does) and
one suggesting an immediate retry.  We’d need to do a coordinated
release of new version of libX11 and xtrans to do it, but that isn’t
too huge of a burden.

Or perhaps SoctetUNIXConnect() should make the 2nd attempt itself
rather than punting that decision up to whatever called it?

If you only recompiled libX11, please also recompile the xserver
to confirm whether that fixes abstract sockets for your box.

Either way we’ll figure out a better fallback stratagy that having to
have a sleep(1) between the connect(2) attempts.
Comment 2 David Sainty 2008-05-09 17:04:52 UTC
(In reply to comment #1)
> Did you recompile the server after installing xtrans 1.2?
> 
> The server, libX11 need to be compiled with the same version of xtrans
> installed.  Or at least both in the same partition of the "xtrans
> abstract-socket continuum".  :-/

No I didn't.  That's actually a hard ask - I'm using the standard Eee PC
image, which includes an X server.  I don't want to replace the X server, I
just want to compile X clients.  I'm using Pkgsrc for software management,
which is why I get a newer libX11/xtrans than the base system.

Obviously the newer library should work with the older server though, in the
best tradition of X!

> The recent bug fix to the abstract sockets was, in my case, done on a
> kernel that had CONFIG_NAMESPACES=y but CONFIG_*_NS=n.  
> 
> Is it CONFIG_NET_NS that allows for abstract socket namespaces?
> Do I also need CONFIG_SECURITY to replicate what you found?
> 
> (I'd prefer to minimize kernel recompiles to help fix this; they take
> quite a bit of time on my (*old* laptop).

I'm not sure, I'm using a stock kernel.  But netstat shows this, which
suggests that abstract socket support is present:

unix  2      [ ]         DGRAM       1996     @/org/kernel/udev/udevd

> Do your server and clients exist in different namespaces?

They must do.  The (stock) X server build predates the release of xtrans 1.1.
xdpyinfo says:

  X.Org version: 1.3.0

netstat doesn't list an abstract socket for the X server.

> As for the delay when dropping back from abstract to non-abstract,
> I don't see any problem with having two RETRY return codes, one
> suggesting a pause (as TRANS_TRY_CONNECT_AGAIN currently does) and
> one suggesting an immediate retry.  We'd need to do a coordinated
> release of new version of libX11 and xtrans to do it, but that isn't
> too huge of a burden.
> 
> Or perhaps SoctetUNIXConnect() should make the 2nd attempt itself
> rather than punting that decision up to whatever called it?

Yes, those are the two obvious options.  The former option changes the xtrans
interface though, it wasn't clear to me that that was a good thing.

I made a patch that implements the latter option, and that did fix the problem
for me.  But the patch used connect() a second time on the same socket
(because SocketUNIXConnect() doesn't create the socket it is harder to repeat
that step), which I seem to recall isn't portable, though it looks like
Linux doesn't mind.  But perhaps it wouldn't violate the xtrans interface to
conditionally create a new socket in SocketUNIXConnect() for this case.  It
might make the code look a little odd though.

> If you only recompiled libX11, please also recompile the xserver
> to confirm whether that fixes abstract sockets for your box.

That's kind of awkward to do, given that I'm using the stock server.  But we
can probably safely assume that doing that would successfully hide the bugs
though :)  Or if it didn't work it presumably wouldn't be related to the bug
reported here.

If the abstract socket connect was successful then neither the ECONNREFUSED
nor the sleep(1) problems would hit.  Essentially the bug reported here is in
the fall-back portions of the abstract socket changes.
Comment 3 James Cloos 2008-05-10 05:33:09 UTC
>> Did you recompile the server after installing xtrans 1.2?

> No I didn't.  That's actually a hard ask - I'm using the standard Eee
> PC image, which includes an X server.  I don't want to replace the X
> server, I just want to compile X clients.  I'm using Pkgsrc for
> software management, which is why I get a newer libX11/xtrans than the
> base system.

I got that impression; just wanted to be sure.

> Obviously the newer library should work with the older server though,
> in the best tradition of X!

Yes.  We definitely want a smooth experience for all.

>> Do your server and clients exist in different namespaces?

> netstat doesn't list an abstract socket for the X server.

OK.  An xserver built against xtrans 1.1 creates bogus abstract sockets,
and as you see xtrans 1.0 and older don’t support them at all.

When I helped debug the 1.1 issue I didn’t notice the delay.  But I
didn’t try out backing xtrans down to 1.0, compiling an xserver against
that, upgrading back to 1.2 and compiling libX11 against that.  So I
never saw the issue with the return values from the syscall.

The references to namespaces made me think of CONFIG_NAMESPACES, which
is used by the kernel’s support for guest OSs.  Since that is not
relevant here please ignore that part of comment #1.  :-/

> Yes, those are the two obvious options.  The [new return code] option
> changes the xtrans interface though, it wasn't clear to me that that
> was a good thing.

It does require coordinated releases.  We already have to do that for
some other stuff, so while not a perfect solution it is doable if we
need to.

I think more discussion is in order though.  I don’t want to push such
a change unilaterally.

OTOH, it’d likely also help the case where the client has unix support
but the server only listens on tcp....

> I made a patch that implements the [retry in SocketUNIXConnect()]
> option, and that did fix the problem for me.

Good.

> But the patch used connect() a second time on the same socket (because
> SocketUNIXConnect() doesn't create the socket it is harder to repeat
> that step), which I seem to recall isn't portable,

I wonder whether it is portable w/in the set of OSs which support
abstract unix domain sockets?  That asked, I see that XtransSock.c only
#defines HAVE_ABSTRACT_SOCKETS #ifdef linux, so portability isn’t as
huge of an issue as it could be.  At least until HAVE_ABSTRACT_SOCKETS
is more widely #defined....

> But perhaps it wouldn't violate the xtrans interface to conditionally
> create a new socket in SocketUNIXConnect() for this case.  It might
> make the code look a little odd though.

I think this will need a bit of discussion on xorg@lists.freedesktop.org.

> If the abstract socket connect was successful then neither the
> ECONNREFUSED nor the sleep(1) problems would hit.  Essentially the
> bug reported here is in the fall-back portions of the abstract socket
> changes.

Indeed.

Another possibility — especially for your use case — is to add a
--disable-abstract-unix-transport configure option to xtrans.m4.
That would allow one to build libX11 a/o the xserver w/o abstract
socket support.

For now I’d suggest patching XtransSock.c before compiling libX11 to
#undef HAVE_ABSTRACT_SOCKETS.  But we *will* get this to work w/o the
unnecessary delay.
Comment 4 Adam Jackson 2008-05-13 07:30:52 UTC
I've fixed xtrans to treat ECONNREFUSED and ENOENT identically for unix sockets.

I'm just going to remove the sleeps from the xlib code here.  For non-local transports, there's enough delay in the network already, and for local transports we should be able to figure out connection failure immediately.  accept() backlog is really only an issue over the network.  Besides, libraries shouldn't sleep.

Fixed in git, thanks!


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.