Bug 69026

Summary:	DBus daemon hangup
Product:	dbus	Reporter:	Yannick Lanz <yannick.lanz>
Component:	core	Assignee:	Havoc Pennington <hp>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	medium	CC:	chengwei.yang.cn, yannick.lanz
Version:	unspecified	Keywords:	patch
Hardware:	ARM
OS:	Linux (All)
Whiteboard:	review?
i915 platform:		i915 features:
Attachments:	Log and result [PATCH] Check EINVAL for accept4() [PATCH v2] Check EINVAL for accept4()

Description Yannick Lanz 2013-09-06 11:01:51 UTC

Created attachment 85321 [details]
Log and result

Hi, 

DBus version: 1.6.14
Linux kernel: 2.6.31

I don't know if it's a bug or a bad cross-compilation but I have compiled DBus for arm (IM.X25). I firstly launch the DBus daemon with debug support:

$ dbus-daemon --system

The output is in the attached file.

Then I launch this programm:

int main()
{
	DBusError error;
	DBusConnection *conn;

	printf("D-Bus bus testing application\n");

	dbus_error_init (&error);
	conn = dbus_bus_get (DBUS_BUS_SYSTEM, &error);
	if (!conn) {
    	fprintf (stderr, "%s: %s\n",
             error.name, error.message);
		return 1;
    }
	return 0;
}

After that, the test application is blocked and the DBus daemon is hangup:
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  586     1 root     R     2756  4.8   0 70.0 dbus-daemon --system

When I launch the daemon with strace, I have a infinite loop trying to do a socket pair:

fcntl64(-1, F_GETFD)                    = -1 EBADF (Bad file descriptor)
write(2, "595: ", 5595: )                    = 5
write(2, "[dbus-server-socket.c(203):socke"..., 48[dbus-server-socket.c(203):socket_handle_watch] ) = 48
write(2, "Failed to accept a client connec"..., 58Failed to accept a client connection: Bad file descriptor
) = 58
clock_gettime(CLOCK_MONOTONIC, {7887, 312515930}) = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
clock_gettime(CLOCK_MONOTONIC, {7887, 318238546}) = 0
ewrite(2, "595: ", 5595: )                    = 5
write(2, "[dbus-server-socket.c(181):socke"..., 48[dbus-server-socket.c(181):socket_handle_watch] ) = 48
write(2, "Handling client connection, flag"..., 38Handling client connection, flags 0x1
) = 38
socketpair(PF_AX25, SOCK_CLOEXEC|0xbed3a7e0, 3202066396, 0x80000) = -1 EINVAL (Invalid argument)
write(2, "595: ", 5595: )                    = 5
write(2, "[dbus-sysdeps-unix.c(1956):_dbus"..., 41[dbus-sysdeps-unix.c(1956):_dbus_accept] ) = 41
write(2, "client fd -1 accepted\n", 22client fd -1 accepted
) = 22

Comment 1 Simon McVittie 2013-09-06 14:39:03 UTC

(In reply to comment #0)
> socketpair(PF_AX25, SOCK_CLOEXEC|0xbed3a7e0, 3202066396, 0x80000) = -1
> EINVAL (Invalid argument)

Either your strace executable is wrong, or your dbus-daemon is wrong, because we shouldn't be using AX25 sockets (whatever those are). All our calls to socketpair() are for AF_UNIX sockets (and that's hard-coded, so it seems unlikely that it could go wrong).

I think your cross-compilation may have gone wrong: did you perhaps cross-compile such that headers from a different architecture were included?

What was your configure command line for cross-compilation?

Which D-Bus version?

Comment 2 Simon McVittie 2013-09-06 14:41:59 UTC

> write(2, "Failed to accept a client connec"..., 58Failed to accept a client
> connection: Bad file descriptor

The dbus-daemon is probably busy-looping on a poll() of an invalid fd. In principle it shouldn't be possible for an invalid fd to get into the main loop at all, but perhaps some sort of "can't happen" socket syscall failure might have that effect?

I think this is probably miscompilation.

Comment 3 Yannick Lanz 2013-09-10 20:01:07 UTC

Hello,

Thanks for your help. I have finally found where is the problem with GDB. 
The problem is in the file "dbus-sysdeps-unix.c" in the function "_dbus_accept". 

The functions accept4 is supported by my cross toolchain and the define SOCK_CLOEXEC too but... the line

/* We assume that if accept4 is available SOCK_CLOEXEC is too */
client_fd = accept4 (listen_fd, &addr, &addrlen, SOCK_CLOEXEC);
cloexec_done = client_fd >= 0;

return always a errno 22 (Invalid argument). I don't know why but I have remplaced this call by accept and I set the variable cloexec_done to 0 like that:

#ifdef HAVE_ACCEPT4
  /* We assume that if accept4 is available SOCK_CLOEXEC is too */
  //client_fd = accept4 (listen_fd, &addr, &addrlen, SOCK_CLOEXEC);
  //cloexec_done = client_fd >= 0;

  client_fd = accept (listen_fd, &addr, &addrlen);
  cloexec_done = 0;

  //if (client_fd < 0 && errno == ENOSYS)
#endif
  //{
    //client_fd = accept (listen_fd, &addr, &addrlen);
  //}

For the moment, there is no problem visibility due to this modification.
Thank at all

Comment 4 Chengwei Yang 2013-09-11 03:34:32 UTC

(In reply to comment #3)
> Hello,
> 
> Thanks for your help. I have finally found where is the problem with GDB. 
> The problem is in the file "dbus-sysdeps-unix.c" in the function
> "_dbus_accept". 
> 
> The functions accept4 is supported by my cross toolchain and the define
> SOCK_CLOEXEC too but... the line
> 
> /* We assume that if accept4 is available SOCK_CLOEXEC is too */
> client_fd = accept4 (listen_fd, &addr, &addrlen, SOCK_CLOEXEC);
> cloexec_done = client_fd >= 0;
> 
> return always a errno 22 (Invalid argument). I don't know why but I have

So seems we need a patch to check EINVAL for the errno of accept4, currently, only ENOSYS checked.

According to the above behavior, we need drop the below comment.
/* We assume that if accept4 is available SOCK_CLOEXEC is too */

> remplaced this call by accept and I set the variable cloexec_done to 0 like
> that:
> 
> #ifdef HAVE_ACCEPT4
>   /* We assume that if accept4 is available SOCK_CLOEXEC is too */
>   //client_fd = accept4 (listen_fd, &addr, &addrlen, SOCK_CLOEXEC);
>   //cloexec_done = client_fd >= 0;
> 
>   client_fd = accept (listen_fd, &addr, &addrlen);
>   cloexec_done = 0;
> 
>   //if (client_fd < 0 && errno == ENOSYS)
> #endif
>   //{
>     //client_fd = accept (listen_fd, &addr, &addrlen);
>   //}
> 
> For the moment, there is no problem visibility due to this modification.
> Thank at all

Comment 5 Chengwei Yang 2013-09-11 05:06:17 UTC

Created attachment 85600 [details] [review]
[PATCH] Check EINVAL for accept4()

Yannick Lanz, could you help to verify this patch? Thank you in advance.

Comment 6 Simon McVittie 2013-09-11 09:56:48 UTC

(In reply to comment #4)
> According to the above behavior, we need drop the below comment.
> /* We assume that if accept4 is available SOCK_CLOEXEC is too */

There are two levels of "available" going on here.

One is what's in the libc headers, and will compile successfully: that's the only thing that Autoconf can check. The other is what the running kernel actually supports.

I think that comment would be better phrased as:

    At compile-time, we assume that if accept4() is available in
    libc headers, SOCK_CLOEXEC is too. At runtime, it is still
    not necessarily true that either is supported by the running kernel.

(In reply to comment #5)
> Created attachment 85600 [details] [review]
> [PATCH] Check EINVAL for accept4()

This is a good start (assuming it works for Yannick - I won't apply it until that's confirmed), but something is still wrong here: when the accept4() fails, it shouldn't result in a busy-loop.

I suspect what's going on might be something like this:

* we're watching the listening fd
* when it becomes readable, we try to accept4() on it
* on failure, it's still readable (there's still an incoming connection waiting)
* because select() and poll() are level-triggered, as soon as we go back to the main loop, we notice it's still readable and try to accept4() again
* busy-loop

If the accept operation fails with a "fatal-looking" error, we should probably issue a _dbus_warn() and stop watching the fd.

Unfortunately:

       Linux accept() (and accept4()) passes already-pending network errors on
       the new socket as an error code from accept().  This  behavior  differs
       from  other  BSD  socket  implementations.   For reliable operation the
       application should detect the network errors defined for  the  protocol
       after  accept() and treat them like EAGAIN by retrying.  In the case of
       TCP/IP, these are ENETDOWN,  EPROTO,  ENOPROTOOPT,  EHOSTDOWN,  ENONET,
       EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH.

so we should probably only stop watching the fd on specific errors: from the Linux accept4(2) man page, EBADF, EINVAL, ENOTSOCK, EOPNOTSUPP look like candidates.

Comment 7 Yannick Lanz 2013-09-11 11:26:20 UTC

> Yannick Lanz, could you help to verify this patch? Thank you in advance.
(In reply to comment #6)
I can't verify it before tonight

Concerning the incompatibility, I have found a similar thread about udev (https://github.com/gentoo/eudev/issues/7). It seems that the kernel 2.6.31 doesn't support accept4 so it's not a ARM toolchain problem but definitely the kernel so all architectures are potentially affected ?

Concerning the solution, I'm available for testing yours patch

Comment 8 Yannick Lanz 2013-09-11 17:01:56 UTC

(In reply to comment #5)
> Created attachment 85600 [details] [review] [review]
> [PATCH] Check EINVAL for accept4()
> 
> Yannick Lanz, could you help to verify this patch? Thank you in advance.

I have successfully patched dbus-1.6.14 and coss-compiled then ran without problem.

Comment 9 Chengwei Yang 2013-09-12 05:36:33 UTC

(In reply to comment #6)
> (In reply to comment #4)
> > According to the above behavior, we need drop the below comment.
> > /* We assume that if accept4 is available SOCK_CLOEXEC is too */
> 
> There are two levels of "available" going on here.
> 
> One is what's in the libc headers, and will compile successfully: that's the
> only thing that Autoconf can check. The other is what the running kernel
> actually supports.

Yes, exactly, so I didn't drop the comment in my patch.

> 
> I think that comment would be better phrased as:
> 
>     At compile-time, we assume that if accept4() is available in
>     libc headers, SOCK_CLOEXEC is too. At runtime, it is still
>     not necessarily true that either is supported by the running kernel.

Sure, the above comment looks good to me, I'll adopt it in the patch v2.

> 
> (In reply to comment #5)
> > Created attachment 85600 [details] [review] [review]
> > [PATCH] Check EINVAL for accept4()
> 
> This is a good start (assuming it works for Yannick - I won't apply it until
> that's confirmed), but something is still wrong here: when the accept4()
> fails, it shouldn't result in a busy-loop.

Yes, neither accept4() or accept().

> 
> I suspect what's going on might be something like this:
> 
> * we're watching the listening fd
> * when it becomes readable, we try to accept4() on it
> * on failure, it's still readable (there's still an incoming connection
> waiting)
> * because select() and poll() are level-triggered, as soon as we go back to
> the main loop, we notice it's still readable and try to accept4() again
> * busy-loop
> 
> If the accept operation fails with a "fatal-looking" error, we should
> probably issue a _dbus_warn() and stop watching the fd.
> 
> Unfortunately:
> 
>        Linux accept() (and accept4()) passes already-pending network errors
> on
>        the new socket as an error code from accept().  This  behavior 
> differs
>        from  other  BSD  socket  implementations.   For reliable operation
> the
>        application should detect the network errors defined for  the 
> protocol
>        after  accept() and treat them like EAGAIN by retrying.  In the case
> of
>        TCP/IP, these are ENETDOWN,  EPROTO,  ENOPROTOOPT,  EHOSTDOWN, 
> ENONET,
>        EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH.
> 
> so we should probably only stop watching the fd on specific errors: from the
> Linux accept4(2) man page, EBADF, EINVAL, ENOTSOCK, EOPNOTSUPP look like
> candidates.

That's somehow a fatal error, especially in common using scenario, the dbus-daemon only listen on one address. If the only address removed, it becomes useless.

Comment 10 Chengwei Yang 2013-09-12 05:39:42 UTC

Created attachment 85689 [details] [review]
[PATCH v2] Check EINVAL for accept4()

applied code comments from Simon.

Comment 11 Simon McVittie 2013-09-16 12:09:18 UTC

Fixed in git for 1.6.16, 1.7.6 - thanks

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.