Bug 66867

Summary: Gnome session won't start because d-bus auth fails abjectly
Product: dbus
Reporter: Jim Carter <jimc>
Component: core
Assignee: Havoc Pennington <hp>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: normal    
Priority: medium    
Version: 1.5   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:

Description Jim Carter 2013-07-13 00:59:54 UTC
Versions:
OpenSuSE 12.3
Kernel 3.7.10-1.16
gnome-session-core-3.6.2-2.2.1
dbus-1-1.6.8-2.3.1 (sorry, the bug report form lacks an item for 1.6.x)

Symptom: Our users' home directories reside on file servers and are mounted by NFS (root squashed) on workstations. Whenever any user tries to start a GNOME session on any workstation with SuSE 12.3 (OK on SuSE 11.4), it pops up a box with a graphic of a computer with an unhappy face on its screen, and the text says: "Oh, no! Something terrible has happened and Gnome cannot recover. Please log out and try again."

~/.xsession-errors shows this message:

gnome-session[3274]: WARNING: Failed to connect to system bus: 
Exhausted all available authentication mechanisms
(tried: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS) 
(available: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS)

Identical or similar messages appear for power-plugin, media-keys-plugin, color-plugin of gnome-settings-daemon, and a non-obvious JavaScript function, but I suspect that the gnome-session failure may be the killer.

/var/run/dbus/system_bus_socket exists and dbus-daemon has it open (to several clients).  Permissions are 777 and containing directories are 755.  

Searching on Google for "Exhausted all available authentication mechanisms":
Several people have the problem but nobody reports a fix.  

http://dbus.freedesktop.org/doc/dbus-specification.html
When a client opens a d-bus connection, it first authenticates.  It can guess
an initial mechanism but in case of failure the server will eventually announce
the mechanisms that it supports.  These are SASL mechanisms with some d-bus
specific additions.  Gotcha:
    DBUS_COOKIE_SHA1 -- 
	Client sends its user name, hex encoded.
	Server sends a cookie context (e.g. org_freedesktop_session_bus),
	    integer index of cookie (there might be several),
	    random challenge string
	Client decodes the challenge string (with cookie as key; what algo?)
	    and generates its own random challenge string.
	Client sends back "challenge sha1(decoded:challenge:cookie)"
	Server performs the same calculation; success if equal. 
	The cookie is found in $HOME/.dbus-keyrings/${cookie_context}
	The server writes on this file and a lock file.  
	Since this is all happening over NFS, the system d-bus server lacks 
	    permission to write.  Or read.  For Mathtest the directory
	    does not even exist; the server cannot create it.
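
The response calculation in this exchange can be sketched as follows. This is a minimal sketch, not the libdbus implementation: the function names and sample values are hypothetical, and it assumes (following the outline above) that the digest is SHA-1 over the server challenge, the client challenge and the cookie, joined by colons. As far as the D-Bus specification goes, the "decoding" of the challenge appears to be plain hex decoding of the message payload rather than a keyed operation; the hex wire encoding is left out here.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def cookie_sha1_response(server_challenge: str, cookie: str,
                         client_challenge: Optional[str] = None) -> str:
    """Build the client's reply "client_challenge sha1hex" (hypothetical helper)."""
    if client_challenge is None:
        # The client contributes its own random challenge.
        client_challenge = secrets.token_hex(16)
    composite = f"{server_challenge}:{client_challenge}:{cookie}"
    digest = hashlib.sha1(composite.encode("ascii")).hexdigest()
    return f"{client_challenge} {digest}"


def server_accepts(server_challenge: str, cookie: str, reply: str) -> bool:
    """Repeat the calculation on the server side and compare the digests."""
    client_challenge, _, digest = reply.partition(" ")
    expected = cookie_sha1_response(server_challenge, cookie, client_challenge)
    return hmac.compare_digest(expected.partition(" ")[2], digest)


# Sample values only; a real cookie is read from ~/.dbus-keyrings/<context>.
reply = cookie_sha1_response("1a2b3c", "deadbeefcookie")
print(server_accepts("1a2b3c", "deadbeefcookie", reply))     # same cookie on both sides
print(server_accepts("1a2b3c", "some-other-cookie", reply))  # server sees a different cookie
```

Both sides must be able to read the same cookie file, which is why an inaccessible home directory defeats this mechanism.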

Workaround: I jiggered /etc/X11/xdm/Xsession to do this (running as the user, idiotproofing checks not shown):
    uid=$(stat -c %u "$HOME/.")  # Numeric UID of the user (UID itself is read-only in bash)
    ln -s "/run/user/$uid" "$HOME/.dbus-keyrings"

Now Gnome will usually start up error-free, or at least with no failures to authenticate to d-bus.  However, occasionally it fails as before to authenticate to d-bus, or one of the plugins fails to authenticate. This suggests a race condition, but I have no idea where the culprit might be hiding.  

Suggestion to the developers:  Put the cookie file/directory in a location known to be on the local machine such as /var/run or /tmp.  Likely the clients would not have to be touched, just a coordinated replacement of the server and client d-bus libraries.  

Another suggestion: Add another auth mechanism using SO_PASSCRED and SCM_CREDENTIALS and desist with the cookie file business.  This could only be on Linux (kernel 2.4 and above), and some BSD variants with a different protocol.

Does the problem occur with other desktop environments?  I haven't reviewed log files to find out, but they are more graceful in handling inaccessible features and do start up.  Power management is a likely problem area.  The workaround with the symlink to /run/user/$UID would be effective for any desktop environment.
Comment 1 Simon McVittie 2013-07-16 12:04:47 UTC
(In reply to comment #0)
> Another suggestion: Add another auth mechanism using SO_PASSCRED and
> SCM_CREDENTIALS and desist with the cookie file business.  This could only
> be on Linux (kernel 2.4 and above), and some BSD variants with a different
> protocol.

That's the EXTERNAL mechanism, which has been supported for years - I think it might have been the first one implemented, in fact. It must be failing for you for some reason...

The maintainers of D-Bus in SuSE might have some useful insight?

As you point out, EXTERNAL can only work on vaguely modern Linux and *BSD, but that should cover the majority of D-Bus users.
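
For reference, the client side of EXTERNAL is small. Below is a minimal sketch of the first line a client sends, assuming the wire format given in the D-Bus specification (the UID written as a decimal string and then hex-encoded). The function name is hypothetical, and the NUL byte that precedes the handshake is not shown. The server does not take this claimed UID on trust; it checks it against the credentials the kernel reports for the socket.

```python
def auth_external_line(uid: int) -> bytes:
    """Return the AUTH command a client would send for the given numeric UID."""
    hex_uid = str(uid).encode("ascii").hex()
    return b"AUTH EXTERNAL " + hex_uid.encode("ascii") + b"\r\n"


print(auth_external_line(1000))  # b'AUTH EXTERNAL 31303030\r\n'
```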

> Symptom: Our users' home directories reside on file servers and are mounted
> by NFS (root squashed) on workstations.

I believe the current status of NFS-home with D-Bus is "none of the maintainers use it; good luck".

> Suggestion to the developers:  Put the cookie file/directory in a location
> known to be on the local machine such as /var/run or /tmp.

/var is not known to be local; neither is /tmp; neither is the root filesystem.
Comment 2 Jim Carter 2013-08-03 19:57:39 UTC
@simon, thanks for pointing me in a useful direction.  This bug (or my understanding of it) has mutated and evolved.  It turns out that D-Bus successfully does most of EXTERNAL authentication, but then tries to obtain user info and fails.  The given UID is not in the local /etc/passwd and D-Bus does not try either nscd or net directory services (NIS in our case) even though they are available. If I add the user to the local /etc/passwd (not practical for production), EXTERNAL authentication succeeds the first time and every time.  
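
The lookup that fails here is an ordinary UID-to-passwd query through NSS. Below is a minimal sketch of such a query, using Python's standard pwd module as a stand-in for the C library call (which call dbus-daemon actually makes is not established in this report). lookup_user and the sample UIDs are hypothetical.

```python
import pwd
from typing import Optional


def lookup_user(uid: int) -> Optional[pwd.struct_passwd]:
    """Resolve a numeric UID through NSS; None if no configured source knows it."""
    try:
        return pwd.getpwuid(uid)
    except KeyError:
        return None


for uid in (0, 2147483646):  # root, and a UID assumed to be unassigned
    entry = lookup_user(uid)
    if entry is None:
        print(f"uid {uid}: no passwd entry from any source in nsswitch.conf")
    else:
        print(f"uid {uid}: {entry.pw_name}, home {entry.pw_dir}")
```

A freshly started process reads the current /etc/nsswitch.conf, so its answer can differ from what a long-running daemon started earlier in boot sees.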

The same D-Bus symptom is seen for all 3 desktop environments: Gnome, KDE and XFCE; but it is only fatal for Gnome; the others stumble forward without being able to contact ConsoleKit or power control.  After a reboot the D-Bus failure is seen two or three times, after which it mysteriously self-heals.  

The kludge with symlinking ~/.dbus-keyrings is ineffective, because D-Bus can't find the home directory to do DBUS_COOKIE_SHA1, and even if it could, on OpenSuSE D-Bus runs as the messagebus user, so it could not deposit the cookie.  I had just assumed it was running as root, and that later success was caused by the intervention.  

The reason is bizarre: when I run OpenSuSE 12.3 out of the box, just modifying /etc/nsswitch.conf to use NIS and DNS as appropriate, it hangs while booting, and I can't even get on the machine to identify positively which service is hanging.  My kludgey workaround was to link in a files-only /etc/nsswitch.conf very early, and only after the network has started do I link in the network-enabled /etc/nsswitch.conf.  D-Bus starts way before that, knowing only of the files, and it has experience from early tries to authenticate root that nscd has not started.  This scenario is not proven step by step in straces, but it explains all the symptoms observed, including eventual self-healing when a timeout passes and it can attempt nscd again, succeeding.  

My workaround was to put in a systemd unit "after" network.target and nscd.service which restarts dbus.service.  This kills some but not all connected daemons; I made it "before" upower.service and console-kit-daemon.service, which do not reconnect.  (OpenSuSE starts ConsoleKit preemptively as an optimization, even though it's bus-activated.)  This is brutal but it gets the job done; now my users can start up a Gnome session and get it right the first time and every time.

It's probably not justified for D-Bus to change to accommodate my specific use-case, but D-Bus is very important and general robustness is valuable.  How could D-Bus be changed to be more immune from the vagaries of directory services?  And I'm wondering if my hang on boot could be related and could be fixed by an intervention to avoid using directory services.  

If a user is authenticated by SO_PEERCRED (the EXTERNAL mechanism for D-Bus) but has no user info in any accessible directory service (/etc/passwd), is it really necessary to reject the authentication?  If not, the whole can of worms can be bypassed in contexts where EXTERNAL works, i.e. Linux.  If D-Bus has a service to report user info about the connection, it could give a "service not available" error or report the user as Nobody.  
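
The point that the kernel supplies the UID without any directory lookup can be illustrated directly. Below is a minimal sketch for Linux only, using a socketpair in place of a real bus connection. peer_credentials is a hypothetical helper, and the struct layout (pid, uid, gid as 32-bit values) is assumed to match the Linux struct ucred.

```python
import socket
import struct

UCRED = struct.Struct("iII")  # Linux struct ucred: pid_t, uid_t, gid_t


def peer_credentials(sock: socket.socket) -> tuple:
    """Ask the kernel who is on the other end of a Unix socket (Linux only)."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, UCRED.size)
    return UCRED.unpack(raw)


server_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    pid, uid, gid = peer_credentials(server_end)
    # No passwd, NIS or nscd lookup is involved in obtaining these numbers.
    print(f"peer pid={pid} uid={uid} gid={gid}")
finally:
    server_end.close()
    client_end.close()
```

Only the later step of turning the numeric UID into a passwd entry depends on directory services.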

If you have to fall back to DBUS_COOKIE_SHA1, you need to put the cookie in some directory which the server is assured of permission to write on.  Linux can be used on a diskless workstation so I shouldn't have previously said "on the local machine", but there has got to be someplace that the D-Bus user has write access to.  If this were in /tmp or /var/run or something like that, and if the keyring file were named using the user's numeric UID, it could all be done without reference to directory services.

But in OpenSuSE, D-Bus runs as -u messagebus, so it's not clear how it could ever chown the file to prove that the authenticating user could read it.  (It might run as root on other OS's or there might be a setUID helper that the Linux implementation doesn't have.)  We'd better make sure that EXTERNAL auth always succeeds in Linux.

Flexibility would be added if the server would tell the client, in its first challenge, where it had put the keyring.  But of course this would be a different mechanism.
Comment 3 Simon McVittie 2013-08-05 10:41:36 UTC
(In reply to comment #2)
> My workaround was to put in a systemd unit "after" network.target and
> nscd.service which restarts dbus.service.

Restarting the system dbus-daemon is not a supported action. It might coincidentally work, but if you do that, "you're on your own".

I think the root cause of your problem is the use of NIS, rather than the remote home directories.

> when I run OpenSuSE 12.3 out of the box, just modifying
> /etc/nsswitch.conf to use NIS and DNS as appropriate, it hangs while
> booting, and I can't even get on the machine to identify positively which
> service is hanging.

I suspect this is because a service that starts before networking needs a uid that is not in your /etc/passwd. System users (such as messagebus) should be in /etc/passwd or in some sort of local cache, so that system services that are essential for networking can start, so that networking can come up, so that the rest of the system can work.

> D-Bus starts way before that, knowing only of the files, and it
> has experience from early tries to authenticate root that nscd has
> not started.

Have you tried giving dbus-daemon "Wants=nscd" and "After=nscd", and ensuring that nscd does not depend on anything that can't happen that early?

If you're using the glibc nscd, there were a lot of serious bug reports about it during Debian 7 development - Debian now seems to be recommending use of unscd instead. You might have better results with that.

Where I work, our sysadmins do remote uid synching by writing a local cache (nss-db or something, I think) on each server, and keeping that up-to-date out-of-band - this makes our servers considerably more reliable, by ensuring that they can boot (albeit maybe with a slightly outdated user database) even when disconnected. I would advocate that approach, if possible.

> My kludgey workaround was to link in a files-only
> /etc/nsswitch.conf very early, and only after the network has
> started do I link in the network-enabled /etc/nsswitch.conf.

It's possible that dbus-daemon has already cached a negative query result for a user it expects to see, or something?

> If a user is authenticated by SO_PEERCRED (the EXTERNAL mechanism for D-Bus)
> but has no user info in any accessible directory service (/etc/passwd), is
> it really necessary to reject the authentication?

To try to avoid subtle security flaws, the general philosophy is "if strange things are going on in security-sensitive code, if in doubt, reject". This might be a situation where D-Bus is being too strict, but we'd have to think about it carefully to make sure we're not opening up vulnerabilities.

> If you have to fall back to DBUS_COOKIE_SHA1, you need to put the cookie in
> some directory which the server is assured of permission to write on.

As far as I understand it, it's the clients (not the dbus-daemon) that write these files - the client is proving that it can write to a file in its own home directory.

I'm not sure whether DBUS_COOKIE_SHA1 is even relevant for the system bus, though - the unprivileged dbus-daemon user can't necessarily read users' home directories either. The system bus should really be using EXTERNAL.

> If this were in /tmp or /var/run or something
> like that, and if the keyring file were named using the user's numeric UID,
> it could all be done without reference to directory services.

Using well-known filenames in /tmp results in trivial denial-of-service, and sometimes also symlink attacks.
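
Both problems can be reproduced in a scratch directory. This minimal sketch assumes a hypothetical predictable file name (dbus-cookie-1000) and uses a temporary directory in place of /tmp. An "attacker" pre-creates a symlink under that name. An exclusive create then fails (the denial of service), while a naive open follows the link and overwrites the attacker's chosen target (the symlink attack).

```python
import os
import tempfile


def demo(base: str) -> tuple:
    target = os.path.join(base, "victim-file")            # file the attacker wants clobbered
    predictable = os.path.join(base, "dbus-cookie-1000")  # hypothetical well-known name
    with open(target, "w") as f:
        f.write("important")
    os.symlink(target, predictable)                       # attacker gets there first

    # Careful creation: O_EXCL refuses a name that already exists, including
    # a symlink, so the service cannot create its file (denial of service).
    try:
        os.close(os.open(predictable, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600))
        exclusive_blocked = False
    except FileExistsError:
        exclusive_blocked = True

    # Naive creation: follows the symlink and overwrites the target.
    with open(predictable, "w") as f:
        f.write("cookie")
    with open(target) as f:
        clobbered = f.read() == "cookie"
    return exclusive_blocked, clobbered


with tempfile.TemporaryDirectory() as scratch:
    result = demo(scratch)
print(result)
```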

> But in OpenSuSE, D-Bus runs as -u messagebus

The system dbus-daemon should always run as an unprivileged user, typically called messagebus or dbus.
Comment 4 Jim Carter 2013-08-07 22:29:13 UTC
Definitely the root cause of this mess is using network directory services.  If I didn't use NIS or LDAP I wouldn't have the problem.  I haven't actually proven on the net that uses LDAP that D-Bus can't be contacted shortly after boot, but the hang on boot, cured by linking in a files-only /etc/nsswitch.conf, definitely happens equally with LDAP and NIS.  

I do have all system users/groups like messagebus in a local password/shadow/group file, in sync on all hosts.  The OS should not refer to any ordinary users until one of them types his loginID in the greeter box, and a cached negative result is therefore unlikely.  The symptom happens not during boot but afterward, caused by a snarl-up early in the boot process. 

I tried your suggestion for hacking the unit files for D-Bus (minus the restarting kludge).  In the first attempt, I altered the unit for dbus.socket, putting it After=nscd.service.  The OS booted, but when I (as root) tried to start a session using SSH (pubkey) or a console login, I authenticated successfully but was kicked off.  The console login reported: "Cannot make/remove an entry for the specified session", i.e. console-kit-daemon is inoperative. I can see no sign that D-Bus started, though I could see the message for rsyslogd, and dbus-daemon usually comes just after that.  

In the second attempt I reverted dbus.socket, and made dbus.service After=nscd.service.  "After" was not honored; dbus-daemon started just before console-kit-daemon (the first client of D-Bus), and just after rsyslogd (i.e. early, way before nscd).  The OS booted normally and would let users on, except that early Gnome sessions came to a terrible demise, i.e. users logging in early could not authenticate to the system D-Bus.  

Paranoia about security is certainly a good idea.  Maybe you or one of the other developers could remember what bad thing might happen if you believe in EXTERNAL auth for a nonexistent user.  Certainly DBUS_COOKIE_SHA1 is impossible without user info (the homedir).  But I can't see any way a hacker could gain any advantage by being nonexistent, and I think it's reasonable to rely on the login authentication system to keep out the riff-raff.  It's valuable to disentangle mission-critical infrastructure from the possibly unavailable network directory service.  

On unscd, OpenSuSE v12.2 had that and I used it, but in v12.3 they have reverted to the non-U nscd and I went along.  In v11.2 or v11.4 I had nscd freeze on me once too often, and discontinued it, but I was attracted by the promised improvements in unscd.  Likely unscd and nscd were merged in the past year or so.
Comment 5 Jim Carter 2013-08-08 21:25:44 UTC
I think I have a feasible workaround.  
    The unit that restarted D-Bus has been removed (yay!)
    /etc/nsswitch.conf was reverted to use files then NIS or DNS.
    Units that switched in a files-only nsswitch.conf were removed. 
    nscd.service was hacked to be Before=dbus.service.  See Simon's comment 3.

This combination does not hang during booting, and as soon as the network comes up and user logins are allowed, users in NIS can start a Gnome session and successfully authenticate to the system D-Bus.  I'm assuming that D-Bus gets user info from glibc's NSS, which first tries nscd, successfully because of Before=dbus.service.  Apparently when the network is not yet up, nscd is less picky or prone to hanging than the lookup code intrinsic to NSS.  

So I think my issue is now finished.  However, I do suggest that D-Bus could avoid potential nasty snarl-ups if it didn't need to look up passwd info, except for DBUS_COOKIE_SHA1.  

Thank you for your help in this, getting me pointed on a path that led to a solution.
Comment 6 Simon McVittie 2013-08-27 15:20:07 UTC
(In reply to comment #5)
>     nscd.service was hacked to be Before=dbus.service.

This...

> However, I do suggest that D-Bus could
> avoid potential nasty snarl-ups if it didn't need to look up passwd info,
> except for DBUS_COOKIE_SHA1

basically agrees with Bug #28355 (which is about LDAP, not NIS, but the general idea is the same - "get your users from the network").

*** This bug has been marked as a duplicate of bug 28355 ***
