Bug 24264 - Crash on removing NULL value from hash in device_remove() (with encrypted drives)
Crash on removing NULL value from hash in device_remove() (with encrypted dri...
Status: RESOLVED FIXED
Product: udisks
Classification: Unclassified
Component: general
unspecified
All All
: medium normal
Assigned To: David Zeuthen (not reading bugmail)
https://bugs.launchpad.net/bugs/414407
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-10-01 13:50 UTC by Martin Pitt
Modified: 2009-10-09 08:18 UTC (History)
0 users

See Also:


Attachments
don't remove NULL values from hash tables (2.21 KB, patch)
2009-10-09 00:51 UTC, Martin Pitt
Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Pitt 2009-10-01 13:50:45 UTC
We got several duplicate reports about dk-disks crashing in device_remove(), apparently when trying to remove a NULL key from the hash tables:

#0  IA__g_str_hash (v=0x0) at /build/buildd/glib2.0-2.21.4/glib/gstring.c:99
	p = (const signed char *) 0x0
	h = <value optimized out>
#1  0x00926d97 in g_hash_table_remove_internal (hash_table=0x8227e90, 
    key=0x0, notify=1) at /build/buildd/glib2.0-2.21.4/glib/ghash.c:195
	node = <value optimized out>
	node_index = <value optimized out>
	__PRETTY_FUNCTION__ = "g_hash_table_remove_internal"
#2  0x0804dc53 in device_remove (daemon=0x8231c08, d=<value optimized out>)
    at devkit-disks-daemon.c:748
	device = <value optimized out>
	native_path = <value optimized out>
	__PRETTY_FUNCTION__ = "device_remove"
[...]
(See http://launchpadlibrarian.net/30388984/ThreadStacktrace.txt for the full trace).

So it seems that one of device->priv->{native_path,device_file,object_path,dev} is NULL. Unfortunately the stack trace isn't precise enough to tell which one in particular. Most duplicates reported that this happens on trying to mount an encrypted drive.

One could just paper over this by checking for NULL keys before trying to remove from the hash table. It's probably a good idea for robustness anyway, but it would paper over the fact that something in the internal state keeping is wrong. These properties aren't queried on device removal, but on addition, so it doesn't sound like a race condition on already lost data on removal.

native_path is already set in devkit_disks_device_new(), so it's unlikely to be NULL. However, object_path is set in register_disks_device(), and device_file and dev in update_info(). So there is one possible path which could lead to this:

        devkit_disks_device_set_device_file (device, g_udev_device_get_device_file (device->priv->d));
        if (device->priv->device_file == NULL) {
                g_warning ("No device file for %s", device->priv->native_path);
                goto out;
        }

So perhaps device_add() shouldn't add such invalid ones to the hash tables, or devices which don't have all four data fields should be discarded at all?
Comment 1 Martin Pitt 2009-10-09 00:51:22 UTC
Created attachment 30207 [details] [review]
don't remove NULL values from hash tables

Since the duplicates of this crash keep coming (11 already), I robustified device_remove() to not attempt to remove NULL values from the hash tables, since that's guaranteed to blow up. It does not fix the underlying status bookkeeping problem, of course, so I added a TODO comment.

IMHO the patch is good and should be applied anyway, but it's certainly not complete yet.

I tried to replicate this under various conditions with an encrypted USB drive (all of the reports are for encrypted drives); yanking it out right after plugging in, entering correct password and yanking it out, entering wrong password, or none at all, killing/restarting daemon in between, but I never got any of the values to be NULL. So this remains a bit of a mystery to me.

There was one recent comment in the downstream bug, after I asked for a reproduction recipe:

  "upon first boot, when adding encrypted usb hdds it gives a "DBUS timeout" message after the enter encryption password prompt, then cannot mount the volume manually via cryptsetup until the usb drive is taken out and plugged back in (cannot access device)."

But a d-bus timeout rather sounds like a consequence of dk-disks crashing than the reason for the crash.
Comment 2 David Zeuthen (not reading bugmail) 2009-10-09 08:15:20 UTC
(In reply to comment #0)
> However, object_path is set in register_disks_device(), and device_file
> and dev in update_info(). So there is one possible path which could lead to
> this:
> 
>         devkit_disks_device_set_device_file (device,
> g_udev_device_get_device_file (device->priv->d));
>         if (device->priv->device_file == NULL) {
>                 g_warning ("No device file for %s", device->priv->native_path);
>                 goto out;
>         }
> 
> So perhaps device_add() shouldn't add such invalid ones to the hash tables, or
> devices which don't have all four data fields should be discarded at all?

Yeah, I think there's some race with 'change', then 'remove' and our process handles the 'change' event just before the 'remove' event (e.g. the 'remove' event happened in udev already - it's just queued up in our process). So that brings us here.

Actually we don't handle neither the udev 'move' event nor the situation where the device file is changing. To handle this we need to keep the global hash tables up to date when processing a 'change' event. I've committed a patch to do this (hold on for the URL) and it should fix this problem as well.

Comment 3 David Zeuthen (not reading bugmail) 2009-10-09 08:18:01 UTC
Here's the patch

http://cgit.freedesktop.org/DeviceKit/DeviceKit-disks/commit/?id=7d9be0143e8d5b34364c5b7dc27cf735da84b558

Martin, any chance you can get the upstream reporters to test it? Thanks.

Please reopen the bug if there's still a crash.