Bug 87999 - dbus-daemon on Windows: Failed to bind socket "localhost:59367": Address already in use
Summary: dbus-daemon on Windows: Failed to bind socket "localhost:59367": Address alre...
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: 1.8
Hardware: Other Windows (All)
: medium normal
Assignee: D-Bus Maintainers
QA Contact: D-Bus Maintainers
URL:
Whiteboard: review?
Keywords: patch
: 88858 (view as bug list)
Depends on:
Blocks:
 
Reported: 2015-01-03 17:52 UTC by Ralf Habacker
Modified: 2015-02-07 12:05 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
threaded test case (1.30 KB, text/plain)
2015-01-03 17:52 UTC, Ralf Habacker
Details
CMake build system file (675 bytes, text/plain)
2015-01-03 17:52 UTC, Ralf Habacker
Details
non threaded test case (804 bytes, text/plain)
2015-01-03 17:52 UTC, Ralf Habacker
Details
Patch (3.16 KB, patch)
2015-01-03 17:53 UTC, Ralf Habacker
Details | Splinter Review
Fix AF_UNSPEC issue (969 bytes, patch)
2015-01-27 22:52 UTC, Ralf Habacker
Details | Splinter Review
Simplified test case (1.83 KB, patch)
2015-01-28 22:21 UTC, Ralf Habacker
Details | Splinter Review
Add manual tcp test case (3.05 KB, patch)
2015-01-29 14:10 UTC, Ralf Habacker
Details | Splinter Review
Add manual tcp test case (update 1) (3.41 KB, patch)
2015-01-29 15:00 UTC, Ralf Habacker
Details | Splinter Review
Fix dbus-daemon on Windows Failed to bind socket (2.15 KB, patch)
2015-01-30 19:26 UTC, Ralf Habacker
Details | Splinter Review
[1/2] _dbus_listen_tcp_socket: use NI_NUMERICSERV to determine port number (1.24 KB, patch)
2015-02-04 13:20 UTC, Simon McVittie
Details | Splinter Review
[2/2] _dbus_listen_tcp_socket: comment on another reason to ignore EADDRINUSE (1.74 KB, patch)
2015-02-04 13:23 UTC, Simon McVittie
Details | Splinter Review
_dbus_listen_tcp_socket: Keep windows code to fetch port from tcp listening socket in sync with unix. (1.91 KB, patch)
2015-02-04 14:04 UTC, Ralf Habacker
Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Ralf Habacker 2015-01-03 17:52:15 UTC
Created attachment 111692 [details]
threaded test case

Reported by Thomas Sondergaard <ts@medical-insight.com>:

When I run the attached test programs I get a handful or so of cases where the dbus-daemon fails to start with a message like

Failed to bind socket "localhost:59367": Address already in use

This surprises me. First I thought it was some kind of race condition when starting multiple buses at the same time (dbus-test.cc), but it also fails a handful of times out of a thousand when only one process is started at a time (dbus-test-nothreads.cc).

This is using D-Bus Message Bus Daemon 1.8.10 on Windows.

Steps to reproduce:

1. Compile the small test program on Windows
2. Make sure dbus-daemon is in the path
3. Run the test program

The nothreads version takes about 4 minutes to run.
Comment 1 Ralf Habacker 2015-01-03 17:52:40 UTC
Created attachment 111693 [details]
CMake build system file
Comment 2 Ralf Habacker 2015-01-03 17:52:55 UTC
Created attachment 111694 [details]
non threaded test case
Comment 3 Ralf Habacker 2015-01-03 17:53:46 UTC
Created attachment 111695 [details] [review]
Patch
Comment 4 Simon McVittie 2015-01-05 09:46:17 UTC
Comment on attachment 111695 [details] [review]
Patch

Review of attachment 111695 [details] [review]:
-----------------------------------------------------------------

::: dbus/dbus-sysdeps-win.c
@@ +1760,5 @@
>               to use the same port */
>            if (!port || !strcmp(port, "0"))
>              {
> +              int result;
> +              struct sockaddr_storage addr;

I'd prefer to keep using mysockaddr_gen, adding a struct sockaddr_storage member to it if necessary...

@@ +1762,5 @@
>              {
> +              int result;
> +              struct sockaddr_storage addr;
> +              socklen_t addrlen;
> +              char portbuf[50];

Why 50? Numeric TCP ports fit in 6 bytes ("65535" + '\0') and you could use NI_NUMERICSERV to ensure that it's returned as a number.

If we happen to end up on port 5432, say, I'm pretty sure "127.0.0.1:postgresql" is not the form we want.

(Perhaps Windows never returns non-numeric service names, I don't know.)

@@ +1765,5 @@
> +              socklen_t addrlen;
> +              char portbuf[50];
> +
> +              addrlen = sizeof (addr);
> +              result = getsockname (fd, (struct sockaddr*) &addr, &addrlen);

... because then you could use &addr.Address here and not need the cast.

@@ +1768,5 @@
> +              addrlen = sizeof (addr);
> +              result = getsockname (fd, (struct sockaddr*) &addr, &addrlen);
> +
> +              if (result == -1 ||
> +                  (res = getnameinfo ((struct sockaddr*)&addr, addrlen, NULL, 0,

Likewise here.
Comment 5 Simon McVittie 2015-01-05 09:49:53 UTC
(In reply to Simon McVittie from comment #4)
> > +              int result;
> > +              struct sockaddr_storage addr;
> 
> I'd prefer to keep using mysockaddr_gen, adding a struct sockaddr_storage
> member to it if necessary...

The other half of this review comment is of course

> @@ +1765,5 @@
> > +              result = getsockname (fd, (struct sockaddr*) &addr, &addrlen);
> 
> ... because then you could use &addr.Address here and not need the cast.

this one.

Casting to (struct sockaddr*) is, strictly speaking, not allowed in standard C (it violates "strict aliasing" which can lead the compiler to make incorrect optimizations).

On gcc, we specifically tell the compiler not to make the assumptions about aliasing that standard C would otherwise allow; but that's not ideal, and if we can easily replace a cast with a union, we should.

On non-gcc, it's entirely possible that this sort of thing is silently mis-optimized.
Comment 6 Ralf Habacker 2015-01-05 10:59:51 UTC
(In reply to Simon McVittie from comment #4)
> Comment on attachment 111695 [details] [review] [review]
> Patch
...
> @@ +1762,5 @@
> >              {
> > +              int result;
> > +              struct sockaddr_storage addr;
> > +              socklen_t addrlen;
> > +              char portbuf[50];
> 
> Why 50? 

I don't know. You may ask the author of the related parts in dbus-sysdeps-unix.c, from which this fragement is copied into

>Numeric TCP ports fit in 6 bytes ("65535" + '\0') and you could use
> NI_NUMERICSERV to ensure that it's returned as a number.

which is used later
Comment 7 Ralf Habacker 2015-01-05 13:16:59 UTC
(In reply to Ralf Habacker from comment #6)
> (In reply to Simon McVittie from comment #4)
> > Comment on attachment 111695 [details] [review] [review] [review]
> > Patch
> ...
> > @@ +1762,5 @@
> > >              {
> > > +              int result;
> > > +              struct sockaddr_storage addr;
> > > +              socklen_t addrlen;
> > > +              char portbuf[50];
> > 
> > Why 50? 
> 
> I don't know. You may ask the author of the related parts in
> dbus-sysdeps-unix.c, from which this fragement is copied into

http://cgit.freedesktop.org/dbus/dbus/tree/dbus/dbus-sysdeps-unix.c#n1508
Comment 8 Thomas Sondergaard 2015-01-05 15:40:02 UTC
Does _dbus_listen_tcp_socket() really have to be so complicated in the first place? I'm not really sure why it is doing the whole rebind thing. If it is because of IPv6 there must be a better way to do it.
Comment 9 Thomas Sondergaard 2015-01-16 07:28:39 UTC
What is required to move the fix for this bug forward?
Comment 10 Simon McVittie 2015-01-16 13:25:26 UTC
(In reply to Thomas Sondergaard from comment #8)
> Does _dbus_listen_tcp_socket() really have to be so complicated in the first
> place?

"Maybe". The TCP support originally existed (approx. a decade ago) for the benefit of people who were sharing home directories via NFS and displays via X11 and trying to have a shared dbus-daemon for the whole mess; later, it was re-used as the main transport on Windows, because we prefer AF_LOCAL (aka AF_UNIX) sockets but Windows doesn't have those.

The core use-cases for D-Bus are that it provides the session and system message buses on modern Linux (and other Unixes like *BSD) over AF_LOCAL sockets, neither of which requires TCP; so I'm afraid TCP is a bit of a second class citizen.

(In reply to Thomas Sondergaard from comment #9)
> What is required to move the fix for this bug forward?

Please adapt the patch to keep using a union instead of violating strict aliasing, and use NI_NUMERICSERV to guarantee getting a number rather than potentially a non-numeric service name. http://msdn.microsoft.com/en-us/library/windows/desktop/ms738532%28v=vs.85%29.aspx says that NI_NUMERICSERV exists on Windows (and in fact also that not using it would usually make this code fail on Windows Server 2003 or older).

Optionally, also size the buffer more sensibly for the resulting requirement (5 digits + \0), but I'm willing to leave that bit alone.

As for simplifications, I'd review a follow-up patch to simplify, but it looks as though the complexity might in fact be necessary (so my review of that change would be "sorry, but no"), because what we seem to be trying to do here is:

* resolve the hostname from the requested address to one or more IPv4 and/or IPv6 addresses;
* listen on each of those addresses.

Suppose you try to listen on "tcp:host=myserver.example.com" and myserver.example.com resolves to 1.1.1.1, 2.2.2.2, 6666::6666 and 7777::7777 due to some sort of multi-homed arrangement. We want to end up listening on all of 1.1.1.1:54321, 2.2.2.2:54321, [6666::6666]:54321 and [7777::7777]:54321, where 54321 is arbitrary but consistent; so that we can tell our clients "hey I'm listening on tcp:host=myserver.example.com,port=54321" and whichever of the multiple resolutions of our name they choose, they can connect to us.

The more likely scenario in practice is that you try to listen on "localhost", which resolves to 127.0.0.1 and maybe also ::1, or on "0.0.0.0", or on "::". However, the code seems to be trying to deal with connecting by arbitrarily complicated hostname for the benefit of the NFS/X11/TCP crowd... this is what you get when features were added a decade ago without the use-cases being made clear :-(
Comment 11 Ralf Habacker 2015-01-27 22:52:33 UTC
Created attachment 112893 [details] [review]
Fix AF_UNSPEC issue
Comment 12 Ralf Habacker 2015-01-27 23:00:18 UTC
(In reply to Ralf Habacker from comment #0)
> Created attachment 111692 [details]
> threaded test case
> 
> Reported by Thomas Sondergaard <ts@medical-insight.com>:
> 
> When I run the attached test programs I get a handful or so of cases where
> the dbus-daemon fails to start with a message like
> 
> Failed to bind socket "localhost:59367": Address already in use

I got the same error running cross compiled test-dbus-daemon.exe with wine.

I rerun after applying your patch, which then returns 

     /creds: Failed to start message bus: Failed to bind socket "localhost:40361": Invalid argument

I only get rid of the error message in the test-dbus-daemon case with the 'Fix AF_UNSPEC issue' patch applied (with or without your patch). 

According to http://dbus.freedesktop.org/doc/dbus-specification.html#transports-tcp-sockets a workaround would be to use the following listen address: 

    tcp:host=localhost,family=ipv4

without any additional patch from this bug applied. Can you check that ?
Comment 13 Ralf Habacker 2015-01-27 23:13:46 UTC
(In reply to Ralf Habacker from comment #12)
> (In reply to Ralf Habacker from comment #0)
> > Created attachment 111692 [details]
...
> I rerun after applying your patch, which then returns 
> 
>      /creds: Failed to start message bus: Failed to bind socket
> "localhost:40361": Invalid argument
> 
This indicates that the dbus code tries to use a ipv6 address, which has been already documentated at bug 61922.
Comment 14 Ralf Habacker 2015-01-27 23:24:46 UTC
(In reply to Simon McVittie from comment #10)
> (In reply to Thomas Sondergaard from comment #8)
... 
> The more likely scenario in practice is that you try to listen on
> "localhost", which resolves to 127.0.0.1 and maybe also ::1, or on
> "0.0.0.0", or on "::".

in the test-dbus-daemon case this is: 

/creds: _dbus_listen_tcp_socket  localhost 0 (null)
initial: host 127.0.0.1 port 0

first address: 
bind to -> host 127.0.0.1 port 6304 => okay

second address 
bind to -> host ::1 port 6304 => fails

Failed to start message bus: Failed to bind socket 1 "localhost:6304": Invalid argument
Comment 15 Ralf Habacker 2015-01-27 23:43:46 UTC
(In reply to Ralf Habacker from comment #0)
> This is using D-Bus Message Bus Daemon 1.8.10 on Windows.

BTW: On which Windows version and variant (32bit /64bit) do you get this error ?
Comment 16 Ralf Habacker 2015-01-28 00:32:39 UTC
(In reply to Simon McVittie from comment #10)
> (In reply to Thomas Sondergaard from comment #8)
... 
> Optionally, also size the buffer more sensibly for the resulting requirement
> (5 digits + \0), but I'm willing to leave that bit alone.

from https://msdn.microsoft.com/de-de/library/windows/desktop/ms738532%28v=vs.85%29.aspx
... 

"To simplify determining buffer requirements for the host and serv parameters, the following values for maximum host name length and maximum service name are defined in the Ws2tcpip.h header file."
...
#define NI_MAXSERV    32
#define NI_MAXHOST  1025
Comment 17 Thomas Sondergaard 2015-01-28 07:34:51 UTC
(In reply to Ralf Habacker from comment #15)
> (In reply to Ralf Habacker from comment #0)
> > This is using D-Bus Message Bus Daemon 1.8.10 on Windows.
> 
> BTW: On which Windows version and variant (32bit /64bit) do you get this
> error ?

Windows 7 Pro 64-bit, dbus build as 32-bit with MSVC 2013.
Comment 18 Ralf Habacker 2015-01-28 22:21:23 UTC
Created attachment 112923 [details] [review]
Simplified test case

The appended patch contains a simplified test case.
Comment 19 Ralf Habacker 2015-01-29 10:20:52 UTC
(In reply to Ralf Habacker from comment #18)
> Created attachment 112923 [details] [review] [review]
> Simplified test case
> 
> The appended patch contains a simplified test case.

Here are a few results: 

running on opensuse 13.1 x86_64 wine-32bit (1.7.35)
285: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:47288": Address already in use
450: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:44718": Address already in use
640: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:56540": Address already in use
660: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:59881": Address already in use
661: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:48059": Address already in use

running on opensuse 13.1 x86_64 wine-64bit (1.7.35)
293: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:59367": Address already in use
411: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:59881": Address already in use
708: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:42919": Address already in use

running on Windows 7 32bit
227: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:63993": Address already in use
483: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:64250": Address already in use
738: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:64507": Address already in use
992: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:64764": Address already in use

running on Windows 7 64bit
188: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:51657": Address already in use
444: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:51914": Address already in use
699: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:52171": Address already in use
953: org.freedesktop.DBus.Error.Failed Failed to bind socket "127.0.0.1:52428": Address already in use
Comment 20 Simon McVittie 2015-01-29 11:22:21 UTC
(In reply to Ralf Habacker from comment #12)
>      /creds: Failed to start message bus: Failed to bind socket
> "localhost:40361": Invalid argument

For this to be useful information I think we'd need to know the exact parameters to the failing system call: strace would probably be a useful tool.
Comment 21 Simon McVittie 2015-01-29 11:23:46 UTC
(In reply to Simon McVittie from comment #10)
> Optionally, also size the buffer more sensibly for the resulting requirement
> (5 digits + \0), but I'm willing to leave that bit alone.

(In reply to Ralf Habacker from comment #16)
> "To simplify determining buffer requirements for the host and serv
> parameters, the following values for maximum host name length and maximum
> service name are defined in the Ws2tcpip.h header file."
> ...
> #define NI_MAXSERV    32
> #define NI_MAXHOST  1025

Making it NI_MAXSERV + 1 bytes long would also be fine.
Comment 22 Simon McVittie 2015-01-29 11:29:19 UTC
(In reply to Ralf Habacker from comment #19)
> Here are a few results: 

With which combination of patches applied?
Comment 23 Simon McVittie 2015-01-29 11:30:24 UTC
Comment on attachment 112893 [details] [review]
Fix AF_UNSPEC issue

Review of attachment 112893 [details] [review]:
-----------------------------------------------------------------

If this is helpful I'd be fine with applying it. Bug #61922 apparently breaks IPv6 anyway.
Comment 24 Simon McVittie 2015-01-29 11:36:49 UTC
Comment on attachment 112923 [details] [review]
Simplified test case

Review of attachment 112923 [details] [review]:
-----------------------------------------------------------------

Obviously if this is just throwaway code it's fine, but if you want to merge it:

Please call it manual-tcp or something. I've been trying to be fully consistent about test-* = automated test, manual-* = manual test.

Also please add it to test/Makefile.am. If I'm reading the CMake correctly, its Automake self can be the same as manual-dir-iter:

manual_dir_iter_SOURCES = manual-dir-iter.c
manual_dir_iter_CPPFLAGS = $(static_cppflags)
manual_dir_iter_LDADD = $(top_builddir)/dbus/libdbus-internal.la

(but with the obvious name change).

::: test/test-tcp.c
@@ +13,5 @@
> +    for (i = 0; i < 1000; i++)
> +    {
> +        dbus_error_init (&error);
> +        server = _dbus_server_new_for_tcp_socket ("localhost", "localhost", "0", "ipv4", &error, FALSE);
> +        if (server == 0)

minor: absence of a server isn't 0, it's NULL (just for clarity, the result is functionally identical)

@@ +18,5 @@
> +            printf("%d: %s %s\n",i, error.name, error.message);
> +        else
> +            printf("%d: %s \n",i, dbus_server_get_address(server));
> +
> +    }

Please unref the server (if non-NULL) or free the error (otherwise) each time round the loop, if you want to commit this as a useful future thing.
Comment 25 Simon McVittie 2015-01-29 11:38:38 UTC
(In reply to Simon McVittie from comment #24)
> I've been trying to be fully
> consistent about test-* = automated test, manual-* = manual test

... and by "manual test" I mean tests that are useless without human interpretation of their results, not just tests that need help to run (although as it happens, both our current manual tests are manual in both ways).
Comment 26 Ralf Habacker 2015-01-29 12:39:23 UTC
(In reply to Simon McVittie from comment #22)
> (In reply to Ralf Habacker from comment #19)
> > Here are a few results: 
> 
> With which combination of patches applied?

attachment 112923 [details] [review] only (I hope bugzilla could resolve it, I did not found the doc page describing how to use bugs/atachments keys)
Comment 27 Ralf Habacker 2015-01-29 14:10:00 UTC
Created attachment 112938 [details] [review]
Add manual tcp test case

- fix of review issues
- autotools support
Comment 28 Simon McVittie 2015-01-29 14:21:21 UTC
Comment on attachment 112938 [details] [review]
Add manual tcp test case

Review of attachment 112938 [details] [review]:
-----------------------------------------------------------------

Build system is fine, but you have committed test-tcp.c (without fixes) instead of manual-tcp.c (with fixes).
Comment 29 Ralf Habacker 2015-01-29 15:00:59 UTC
Created attachment 112939 [details] [review]
Add manual tcp test case (update 1)

- fix including wrong file
Comment 30 Simon McVittie 2015-01-29 18:48:40 UTC
Comment on attachment 112939 [details] [review]
Add manual tcp test case (update 1)

Review of attachment 112939 [details] [review]:
-----------------------------------------------------------------

Much better :-) Please merge this to master if you think you'll want this test-case in future.
Comment 31 Ralf Habacker 2015-01-29 19:34:06 UTC
(In reply to Ralf Habacker from comment #19)
> (In reply to Ralf Habacker from comment #18)
> > Created attachment 112923 [details] [review] [review] [review]
> > Simplified test case
> > 
> > The appended patch contains a simplified test case.
> 
> Here are a few results: 
> 
> running on opensuse 13.1 x86_64 wine-32bit (1.7.35)
> 285: org.freedesktop.DBus.Error.Failed Failed to bind socket
> "127.0.0.1:47288": Address already in use
> 450: org.freedesktop.DBus.Error.Failed Failed to bind socket
> "127.0.0.1:44718": Address already in use
> 640: org.freedesktop.DBus.Error.Failed Failed to bind socket
> "127.0.0.1:56540": Address already in use
> 660: org.freedesktop.DBus.Error.Failed Failed to bind socket
> "127.0.0.1:59881": Address already in use
> 661: org.freedesktop.DBus.Error.Failed Failed to bind socket
> "127.0.0.1:48059": Address already in use

Thanks to wine's build in trace support I got the following with:

WINEDEBUG=trace+winsock wine bin/manual-tcp.exe > out.log 2>&1 
...
...

i=148: 
trace:winsock:WS_getaddrinfo "localhost", "0" 0x6ffc80 -> 0x6ffc7c 0
trace:winsock:WS_getaddrinfo => 0x1245b0, flags 0x1, family 2, type 1, protocol 6, len 16, name (null), addr { family AF_INET, address 127.0.0.1, port 0 }
trace:winsock:WS_socket af=2 type=1 protocol=0
trace:winsock:WSASocketA af=2 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW af=2 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW        created 005c
trace:winsock:WS_bind socket 005c, ptr 0x1115d0 { family AF_INET, address 127.0.0.1, port 0 }, length 16
-> note: binding to port 0 => bad

trace:winsock:WS_listen socket 005c, backlog 30
trace:winsock:WS_getsockname socket: 005c, ptr 0x6ffc60, len 0000001c
trace:winsock:WS_getsockname => { family AF_INET, address 127.0.0.1, port 54992 }
trace:winsock:WS_getaddrinfo "localhost", "53462" 0x6ffc80 -> 0x6ffc7c 0
trace:winsock:WS_getaddrinfo => 0x1245b0, flags 0x1, family 2, type 1, protocol 6, len 16, name (null), addr { family AF_INET, address 127.0.0.1, port 53462 }
trace:winsock:WS_socket af=2 type=1 protocol=0
trace:winsock:WSASocketA af=2 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW af=2 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW        created 0060
trace:winsock:WS_bind socket 0060, ptr 0x1115d0 { family AF_INET, address 127.0.0.1, port 53462 }, length 16
-> note: binding to requested port => good

trace:winsock:WS_listen socket 0060, backlog 30
trace:winsock:WSAIoctl 005c, _IOW('f', 126, 4), 0x6ffc1c, 4, 0x6ffc1c, 4, 0x6ffbbc, (nil), (nil)
trace:winsock:WSAIoctl -> FIONBIO (1)
trace:winsock:WSAIoctl 0060, _IOW('f', 126, 4), 0x6ffc1c, 4, 0x6ffc1c, 4, 0x6ffbbc, (nil), (nil)
trace:winsock:WSAIoctl -> FIONBIO (1)
trace:winsock:WS_closesocket socket 005c
trace:winsock:WS_closesocket socket 0060

148: tcp:host=localhost,port=53462,guid=f28c08fde3728dc00372f7c754ca87f3 

trace:winsock:WS_getaddrinfo "localhost", "0" 0x6ffc80 -> 0x6ffc7c 0
trace:winsock:WS_getaddrinfo => 0x1245b0, flags 0x1, family 2, type 1, protocol 6, len 16, name (null), addr { family AF_INET, address 127.0.0.1, port 0 }
trace:winsock:WS_socket af=2 type=1 protocol=0
trace:winsock:WSASocketA af=2 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW af=2 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW        created 005c
trace:winsock:WS_bind socket 005c, ptr 0x1115d0 { family AF_INET, address 127.0.0.1, port 0 }, length 16

-> binding again to port 0

trace:winsock:WS_listen socket 005c, backlog 30
trace:winsock:WS_getsockname socket: 005c, ptr 0x6ffc60, len 0000001c
trace:winsock:WS_getsockname => { family AF_INET, address 127.0.0.1, port 44461 }
trace:winsock:WS_getaddrinfo "localhost", "44461" 0x6ffc80 -> 0x6ffc7c 0
trace:winsock:WS_getaddrinfo => 0x1245b0, flags 0x1, family 2, type 1, protocol 6, len 16, name (null), addr { family AF_INET, address 127.0.0.1, port 44461 }
trace:winsock:WS_socket af=2 type=1 protocol=0
trace:winsock:WSASocketA af=2 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW af=2 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW        created 0060
trace:winsock:WS_bind socket 0060, ptr 0x1115d0 { family AF_INET, address 127.0.0.1, port 44461 }, length 16

-> binding to requested port

trace:winsock:WS_closesocket socket 0060
trace:winsock:WS_closesocket socket 005c

149: org.freedesktop.DBus.Error.Failed Failed to bind socket "localhost:44461": Address already in use

conclusion: The error message is misleading, it should be 

149: org.freedesktop.DBus.Error.Failed Failed to bind socket "localhost:0": Address already in use

According to http://hea-www.harvard.edu/~fine/Tech/addrinuse.html a possible reuse may be delayed, which causes the "address already in use" error.

-> This is an implementation error in _dbus_listen_tcp_socket()
Comment 32 Ralf Habacker 2015-01-29 20:08:45 UTC
(In reply to Ralf Habacker from comment #31)
....
> According to http://hea-www.harvard.edu/~fine/Tech/addrinuse.html a possible
> reuse may be delayed, which causes the "address already in use" error.
> 
> -> This is an implementation error in _dbus_listen_tcp_socket()

and here is an answer: http://stackoverflow.com/questions/20789091/get-a-random-port-for-tcp-connection
Comment 33 Simon McVittie 2015-01-30 00:07:39 UTC
(In reply to Ralf Habacker from comment #31)
> trace:winsock:WS_bind socket 005c, ptr 0x1115d0 { family AF_INET, address
> 127.0.0.1, port 0 }, length 16
> -> note: binding to port 0 => bad

No, on the first time through the loop (when you have no idea what you want), this is absolutely fine - binding to port 0 is precisely how you ask the kernel to give you an arbitrary unused port.

> trace:winsock:WS_bind socket 005c, ptr 0x1115d0 { family AF_INET, address
> 127.0.0.1, port 0 }, length 16
> 
> -> binding again to port 0

This is binding socket 0x5C to "127.0.0.1:0", which means "hello kernel, please give me a real port number, I don't care which one". Fine.

> trace:winsock:WS_getsockname socket: 005c, ptr 0x6ffc60, len 0000001c
> trace:winsock:WS_getsockname => { family AF_INET, address 127.0.0.1, port
> 44461 }

Next, we use getsockname() find out which port the kernel has given us. The answer turns out to be 44461.

> trace:winsock:WSASocketW        created 0060
> trace:winsock:WS_bind socket 0060, ptr 0x1115d0 { family AF_INET, address
> 127.0.0.1, port 44461 }, length 16

*This* is bad though. We're trying to bind a different socket, 0x60, to exactly the same port that socket 0x5C already has, 127.0.0.1:44461 - there's no way that can possibly work...

> 149: org.freedesktop.DBus.Error.Failed Failed to bind socket
> "localhost:44461": Address already in use

... and indeed, it fails, with a correct error message - we tried to bind a second to localhost:44461 and it didn't work.

> According to http://hea-www.harvard.edu/~fine/Tech/addrinuse.html a possible
> reuse may be delayed, which causes the "address already in use" error.

Perhaps that is possible, but that isn't the bug seen in your trace.

I think what's going on here is this:

* First we do the getaddrinfo for the requested address, localhost:0.
  Let's suppose it returns 127.0.0.1:0 and [::1]:0.

* OK, so we go into the while loop, do a socket() (in your trace this is
  0x5C), bind it to 127.0.0.1:0 and listen(). Good so far.

* Now, under "if (!_dbus_string_get_length(retport))", we do a getsockname()
  and find that we have been given port 44461.

* Now we do a goto back to redo_lookup_with_port, and look up
  localhost:44461. We get 127.0.0.1:44461 and [::1]:44461 (let's say).

* Back into the while loop, and now we try to make two more socket()s,
  and bind the first (0x60 in your trace) to 127.0.0.1:44461 and the
  second to [::1]:44461 - but we never get as far as the second,
  because the first one quite reasonably fails, because we already have
  our original socket 0x5C listening on that same port!

Based on that analysis, what I can't understand is how this function could ever work correctly with port 0 - as far as I can see, it should *always* hit this error!

On Unix, we specifically ignore EADDRINUSE with a slightly misleading comment about IPv4 vs. IPv6. It is true that we can conceivably get EADDRINUSE for that reason, but I think it's also masking the fact that we consistently get EADDRINUSE because we're trying to bind to the same IP:port repeatedly and that can never work.

Looking at your trace where it *did* work, to try to find out why it could ever succeed, here's something interesting:

trace:winsock:WS_listen socket 005c, backlog 30
trace:winsock:WS_getsockname socket: 005c, ptr 0x6ffc60, len 0000001c
trace:winsock:WS_getsockname => { family AF_INET, address 127.0.0.1, port 54992 }
trace:winsock:WS_getaddrinfo "localhost", "53462" 0x6ffc80 -> 0x6ffc7c 0
trace:winsock:WS_getaddrinfo => 0x1245b0, flags 0x1, family 2, type 1, protocol 6, len 16, name (null), addr { family AF_INET, address 127.0.0.1, port 53462 }

WTF? 53462 is not the same as 54992. Why are we looking up localhost:53462?

$ python
>>> hex(53462)
'0xd0d6'
>>> hex(54992)
'0xd6d0'

... it's because we're not doing the correct ntohs() to fix the endianness of the port number!

So the fact that we are getting the endianness of the port number wrong has *mostly* masked the fact that the function's logic is wrong - it fails if and only if both bytes happen to be the same. Sure enough:

>>> hex(44461)
'0xadad'

Your failing case is failing because in this case ntohs(sin_port) == sin_port.

Incidentally, I thin I've also just spotted why IPv6 doesn't work:

snprintf( portbuf, sizeof( portbuf ) - 1, "%d", addr.AddressIn.sin_port );

Even with the ntohs() added, that line can only be correct for an IPv4 address. If it is really an IPv6 address, you are "type-punning" - accessing the wrong member of a union - and instead of the port number, you're going to get whatever 2 bytes of an IPv6 address happen to be at the same offset as the port number of an IPv4 address.
Comment 34 Simon McVittie 2015-01-30 00:16:16 UTC
If you change Attachment #111695 [details] as I suggested (use a union instead of casts, and use NI_NUMERICSERV) I think that should indeed fix this, and possibly also Bug #61922.

Now that I understand why these changes work, I'm a bit happier about applying Attachment #111695 [details] as-is, if you'll review and test a follow-up patch from me to apply what I suggested in Comment #4 (to both Unix and Windows - the Unix implementation has the same issues) on master.
Comment 35 Ralf Habacker 2015-01-30 08:34:29 UTC
Comment on attachment 112939 [details] [review]
Add manual tcp test case (update 1)

committed
Comment 36 Ralf Habacker 2015-01-30 19:26:58 UTC
Created attachment 112983 [details] [review]
Fix dbus-daemon on Windows Failed to bind socket
Comment 37 Ralf Habacker 2015-01-30 19:32:50 UTC
Comment on attachment 112893 [details] [review]
Fix AF_UNSPEC issue

committed
Comment 38 Ralf Habacker 2015-01-30 19:58:21 UTC
(In reply to Simon McVittie from comment #33)
> (In reply to Ralf Habacker from comment #31)
> > trace:winsock:WS_bind socket 005c, ptr 0x1115d0 { family AF_INET, address
> > 127.0.0.1, port 0 }, length 16
> > -> note: binding to port 0 => bad
> 
> No, on the first time through the loop (when you have no idea what you
> want), this is absolutely fine - binding to port 0 is precisely how you ask
> the kernel to give you an arbitrary unused port.

you'r right, I thought this trace where on a lower function level inside wine, where the real port is used. 

... 
> Your failing case is failing because in this case ntohs(sin_port) ==
> sin_port.

nice catch :-)

> Incidentally, I thin I've also just spotted why IPv6 doesn't work:
> 
> snprintf( portbuf, sizeof( portbuf ) - 1, "%d", addr.AddressIn.sin_port );
> 
> Even with the ntohs() added, that line can only be correct for an IPv4
> address. If it is really an IPv6 address, you are "type-punning" - accessing
> the wrong member of a union - and instead of the port number, you're going
> to get whatever 2 bytes of an IPv6 address happen to be at the same offset
> as the port number of an IPv4 address.
I fixed that hopefully in the updated patch.
Comment 39 Simon McVittie 2015-01-30 20:37:10 UTC
Comment on attachment 112983 [details] [review]
Fix dbus-daemon on Windows Failed to bind socket

Review of attachment 112983 [details] [review]:
-----------------------------------------------------------------

Looks good, ship it. Please leave the bug open for now, to remind me that I still need to fix the issues pointed out in my earlier review for both Unix and Windows (unless you feel like preparing that patch yourself of course).
Comment 40 Simon McVittie 2015-01-30 20:38:01 UTC
(In reply to Ralf Habacker from comment #37)
> Comment on attachment 112893 [details] [review]
> Fix AF_UNSPEC issue
> 
> committed

You can probably revert this, now that you've (hopefully) fixed the reason why IPv6 was broken.
Comment 41 Ralf Habacker 2015-01-30 21:06:21 UTC
Comment on attachment 112983 [details] [review]
Fix dbus-daemon on Windows Failed to bind socket

committed
Comment 42 Ralf Habacker 2015-01-30 21:25:04 UTC
(In reply to Simon McVittie from comment #40)
> (In reply to Ralf Habacker from comment #37)
> > Comment on attachment 112893 [details] [review] [review]
> > Fix AF_UNSPEC issue
> > 
> > committed
> 
> You can probably revert this, now that you've (hopefully) fixed the reason
> why IPv6 was broken.

Unfortunally it is not fixed.

wine bin/manual-tcp.exe ipv6 

trace:winsock:DllMain 0x7ec90000 0x1 0x1
trace:winsock:WSAStartup verReq=2
trace:winsock:WSAStartup succeeded starts: 1
trace:winsock:WS_getaddrinfo "localhost", "0" 0x6ffc80 -> 0x6ffc7c 0
trace:winsock:WS_inet_ntop family 23, addr (0x111628), buffer (0x6ffbce), len 46
trace:winsock:WS_getaddrinfo => 0x1115f8, flags 0x1, family 23, type 1, protocol 6, len 28, name (null), addr { family AF_INET6, address ::1, port 0 }
trace:winsock:WS_socket af=23 type=1 protocol=0
trace:winsock:WSASocketA af=23 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW af=23 type=1 protocol=0 protocol_info=(nil) group=0 flags=0x1
trace:winsock:WSASocketW        created 005c
trace:winsock:WS_inet_ntop family 23, addr (0x111628), buffer (0x6ffb7c), len 46
trace:winsock:WS_bind socket 005c, ptr 0x111620 { family AF_INET6, address ::1, port 0 }, length 28
trace:winsock:WS_closesocket socket 005c

0: org.freedesktop.DBus.Error.Failed Failed to bind socket "localhost:0": Invalid argument

on linux: 

test/manual-tcp ipv6 |
less0: org.freedesktop.DBus.Error.FileNotFound Failed to lookup host/port: "localhost:0": Name or service not known (-2)
Comment 43 Ralf Habacker 2015-01-31 11:43:19 UTC
(In reply to Ralf Habacker from comment #42)
> (In reply to Simon McVittie from comment #40)
> > (In reply to Ralf Habacker from comment #37)
> > > Comment on attachment 112893 [details] [review] [review] [review]
> > > Fix AF_UNSPEC issue
> > > 
> > > committed
> > 
> > You can probably revert this, now that you've (hopefully) fixed the reason
> > why IPv6 was broken.
> 
> Unfortunally it is not fixed.
> 
> wine bin/manual-tcp.exe ipv6 
> 
...
> 0: org.freedesktop.DBus.Error.Failed Failed to bind socket "localhost:0":
> Invalid argument
> 
> on linux: 
> 
> test/manual-tcp ipv6 |
> less0: org.freedesktop.DBus.Error.FileNotFound Failed to lookup host/port:
> "localhost:0": Name or service not known (-2)

ipv6 works on one machine I have access to, on the other machine not. The message indicates the problem that this system do not has ipv6 enabled. I guess a better error message would be useful.
Comment 44 Ralf Habacker 2015-01-31 11:55:43 UTC
(In reply to Ralf Habacker from comment #43)

> ipv6 works on one machine I have access to, on the other machine not. The
> message indicates the problem that this system do not has ipv6 enabled. I
> guess a better error message would be useful.

Moved further ipv6 issue fixing to bug 61922.
Comment 45 Ralf Habacker 2015-02-04 07:23:47 UTC
(In reply to Ralf Habacker from comment #36)
> Created attachment 112983 [details] [review] [review]
> Fix dbus-daemon on Windows Failed to bind socket

As this affects basic dbus functionality on windows, any problems with backporting to 1.8 branch ?
Comment 46 Simon McVittie 2015-02-04 11:46:57 UTC
(In reply to Ralf Habacker from comment #45)
> As this affects basic dbus functionality on windows, any problems with
> backporting to 1.8 branch ?

No problem, please go ahead, after doing whatever testing you feel is appropriate.

(In reply to Simon McVittie from comment #39)
> Please leave the bug open for now, to remind me that I
> still need to fix the issues pointed out in my earlier review for both Unix
> and Windows (unless you feel like preparing that patch yourself of course).

This part (when done) should not be backported, I don't think.
Comment 47 Ralf Habacker 2015-02-04 12:41:17 UTC
(In reply to Simon McVittie from comment #46)
> (In reply to Ralf Habacker from comment #45)
> > As this affects basic dbus functionality on windows, any problems with
> > backporting to 1.8 branch ?
> 
> No problem, please go ahead, after doing whatever testing you feel is
> appropriate.
which would be manual-tcp.

> (In reply to Simon McVittie from comment #39)
> > Please leave the bug open for now, to remind me that I
> > still need to fix the issues pointed out in my earlier review for both Unix
> > and Windows (unless you feel like preparing that patch yourself of course).
> 
> This part (when done) should not be backported, I don't think.

The remaining patches for unix are 
1. the comment change (minor) 
2. use NI_NUMERICSERV in getnameinfo() without manual-tcp returns something like this on unix (also minor ?): 

3: tcp:host=localhost,port=reachout,guid=456aa710a90a078f7c15178754d21233 
41: tcp:host=localhost,port=reachout,guid=ee6934d12f069cc09153faf954d21233 
480: tcp:host=localhost,port=csccredir,guid=57c65a486dbbfa6906443e4654d21234 
648: tcp:host=localhost,port=allpeers,guid=08bfd44456a7d793ee532a2954d21234 
930: tcp:host=localhost,port=guttersnex,guid=ce5b90a831cbb0326f6de9f154d21235
Comment 48 Simon McVittie 2015-02-04 13:03:58 UTC
(In reply to Ralf Habacker from comment #47)
> The remaining patches for unix are 
> 1. the comment change (minor) 
> 2. use NI_NUMERICSERV in getnameinfo() without manual-tcp returns something
> like this on unix (also minor ?): 
> 
> 3: tcp:host=localhost,port=reachout,guid=456aa710a90a078f7c15178754d21233 
> 41: tcp:host=localhost,port=reachout,guid=ee6934d12f069cc09153faf954d21233 
> 480: tcp:host=localhost,port=csccredir,guid=57c65a486dbbfa6906443e4654d21234 
> 648: tcp:host=localhost,port=allpeers,guid=08bfd44456a7d793ee532a2954d21234 
> 930: tcp:host=localhost,port=guttersnex,guid=ce5b90a831cbb0326f6de9f154d21235

Yes. Neither of those should go in 1.8 but it would be nice to fix them for master (on both platforms).
Comment 49 Simon McVittie 2015-02-04 13:20:35 UTC
Created attachment 113156 [details] [review]
[1/2] _dbus_listen_tcp_socket: use NI_NUMERICSERV to determine  port number

If we happen to have been given (say) port 30865, we want to
represent that as host=localhost,port=30865 and not
host=localhost,port=csync2.
Comment 50 Simon McVittie 2015-02-04 13:23:02 UTC
Created attachment 113157 [details] [review]
[2/2] _dbus_listen_tcp_socket: comment on another reason to  ignore EADDRINUSE

Not being aware of the second reason described here caused bug #87999
in the equivalent code on Windows.

---

I thought I'd have to fix one or both of those in Windows too (and ask you to test it), but I was misremembering: some of the earlier patches had these issues, but the one you eventually merged was fine. So I'm not touching the Windows code right now, and would appreciate a review for the Unix side.

If you later do a patch to make the Windows side use getnameinfo() as well, I'd like it to use NI_NUMERICSERV, but that change isn't important.
Comment 51 Ralf Habacker 2015-02-04 13:32:08 UTC
Comment on attachment 113156 [details] [review]
[1/2] _dbus_listen_tcp_socket: use NI_NUMERICSERV to determine  port number

Review of attachment 113156 [details] [review]:
-----------------------------------------------------------------

looks good
Comment 52 Ralf Habacker 2015-02-04 13:32:57 UTC
Comment on attachment 113157 [details] [review]
[2/2] _dbus_listen_tcp_socket: comment on another reason to  ignore EADDRINUSE

Review of attachment 113157 [details] [review]:
-----------------------------------------------------------------

looks good
Comment 53 Ralf Habacker 2015-02-04 14:04:49 UTC
Created attachment 113159 [details] [review]
_dbus_listen_tcp_socket: Keep windows code to fetch port from  tcp listening socket in sync with unix.

The usage of NI_MAXSERV is according to the example located at https://msdn.microsoft.com/de-de/library/windows/desktop/ms738532%28v=vs.85%29.aspx
Comment 54 Simon McVittie 2015-02-04 15:09:16 UTC
Comment on attachment 113159 [details] [review]
_dbus_listen_tcp_socket: Keep windows code to fetch port from  tcp listening socket in sync with unix.

Review of attachment 113159 [details] [review]:
-----------------------------------------------------------------

::: dbus/dbus-sysdeps-win.c
@@ +1826,4 @@
>  
> +              if (getsockname(fd, &addr.Address, &addrlen) == SOCKET_ERROR ||
> +                (res = getnameinfo (&addr.Address, addrlen, NULL, 0,
> +                                    portbuf, NI_MAXSERV,

I'd still prefer (..., portbuf, sizeof (portbuf), ...) (even though the numeric value is identical) because that way it's instantly obvious that there can't be any overflow, without having to backtrack to the declaration of portbuf.

@@ +1826,5 @@
>  
> +              if (getsockname(fd, &addr.Address, &addrlen) == SOCKET_ERROR ||
> +                (res = getnameinfo (&addr.Address, addrlen, NULL, 0,
> +                                    portbuf, NI_MAXSERV,
> +                                    NI_NUMERICSERV)) != 0)

Strictly speaking, getsockname and getnameinfo have different error semantics - getnameinfo isn't guaranteed to set errno unless it returns EAI_SYSTEM - so they should probably each have their own error-handling block with different inputs to dbus_set_error().

But this is good enough until we fix that in Unix too, which we should do at some point.
Comment 55 Simon McVittie 2015-02-04 15:21:39 UTC
(In reply to Simon McVittie from comment #54)
> Strictly speaking, getsockname and getnameinfo have different error
> semantics - getnameinfo isn't guaranteed to set errno unless it returns
> EAI_SYSTEM

... but looking at MSDN, Microsoft's getnameinfo has the additional guarantee that it sets the result of WSAGetLastError(), so I think this is wrong on Unix but your patch is fine for Windows.
Comment 56 Simon McVittie 2015-02-04 15:22:49 UTC
Comment on attachment 113156 [details] [review]
[1/2] _dbus_listen_tcp_socket: use NI_NUMERICSERV to determine  port number

fixed for 1.9.10, thanks
Comment 57 Simon McVittie 2015-02-04 15:22:58 UTC
Comment on attachment 113157 [details] [review]
[2/2] _dbus_listen_tcp_socket: comment on another reason to  ignore EADDRINUSE

fixed for 1.9.10, thanks
Comment 58 Ralf Habacker 2015-02-05 10:52:02 UTC
Comment on attachment 113159 [details] [review]
_dbus_listen_tcp_socket: Keep windows code to fetch port from  tcp listening socket in sync with unix.

committed to git master with mentioned NI_MAXSERV to sizeof(portbuf) change.
Comment 59 Ralf Habacker 2015-02-07 12:05:29 UTC
*** Bug 88858 has been marked as a duplicate of this bug. ***


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.