Bug 18046

Summary: crashes in file_monitor_remove_watch() (NULL item?)
Product: ConsoleKit Reporter: Martin Pitt <martin.pitt>
Component: DaemonAssignee: william.jon.mccann
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: jw+debian, pachoramos1
Version: unspecified   
Hardware: x86 (IA32)   
OS: Linux (All)   
URL: https://launchpad.net/bugs/269651
Whiteboard:
i915 platform: i915 features:
Attachments: Ubuntu patch
Patch to serialise watch removals
Serialise removals, and avoid using freed data caused by removals

Description Martin Pitt 2008-10-13 10:31:01 UTC
Hi!

We get a lot of bug reports about crashes with this signature:

g_hash_table_remove_internal (hash_table=0x9237db8, key=0x0, notify=1)
file_monitor_remove_watch (monitor=0x9240d60, watch=0x0) at ck-file-monitor-inotify.c:245
ck_file_monitor_remove_notify (monitor=0x9240d60, id=2) at ck-file-monitor-inotify.c:529
ck_tty_idle_monitor_stop (monitor=0x9243980) at ck-tty-idle-monitor.c:211
ck_session_finalize (object=0x92490e0) at ck-session.c:855

http://launchpadlibrarian.net/17571430/Stacktrace.txt has a complete and fully symbolic stack trace. It seems that in some cases, file_monitor_remove_watch() is called with watch == NULL, which leads to this crash.

The dodgy approach would be to just test for this condition in file_monitor_remove_watch(), but I guess watch == NULL is a "this should not happen(TM)" case, and there is a deeper logic error?
Comment 1 Martin Pitt 2008-10-13 23:07:38 UTC
Created attachment 19645 [details] [review]
Ubuntu patch

For now I applied a patch which intercepts removing a NULL watch. Even if that condition "should not happen", it is defensive, and after removing a watch the state is consistent again anyway.
Comment 2 James Westby 2008-10-18 19:08:55 UTC
Hi,

I believe I have found the cause of this bug.

It is to do with the interaction of the watch removal caused by
the dbus name change and the inotify IN_IGNORED notification.

Both will trigger a watch to be removed, so there can potentially
be two removals attempted. However, this is not checked for in the
code. Even if it was, it would still be potentially racy (even
if the hole would be a lot smaller), so I think the inotify
removals should be pushed in to the main thread, like the rest of
the responses to inotify events, where they can safely check
whether the watch has already been removed.

I will try and write a patch for this now.

Thanks,

James
Comment 3 James Westby 2008-10-19 05:20:03 UTC
Created attachment 19742 [details] [review]
Patch to serialise watch removals

Hi,

Here is my proposed patch to fix this issue.

The first thing it does is move the removal of a watch
caused by an IN_IGNORED event in to the main thread, where
they will be serialised with the other removals, preventing
race conditions.

Then, for the case where two removals are generated it adds
two checks for the watch being already removed. It firstly
checks for watch->wd == -1 in file_monitor_remove_watch,
which will be True if the watch is already removed, and handles
the case when the inotify removal ends up after the dbus one.

It then checks notify->watch->notifies for the case when the
dbus removal ends up after the inotify one.

The testcase I was using for this is thanks to "iponeverything"
in the Ubuntu bug. It is opening 10 xterms and doing "su -" in
each (i.e. opening lots of sessions), and then closing them all
one by one using the close button of the WM.

I am going to seek an upload of this patch in to Ubuntu, as the
bug is currently one of the most reported there, and the crashes
mean people lose desktop functionality until they restart their
session.

Thanks,

James
Comment 4 James Westby 2008-10-19 11:03:45 UTC
Hi,

Unfortunately my patch does not completely solve the problem.

After applying the patch it is still possible to get the crash
with the same recipe (I think I was closing them too quickly,
so the inotify watches hadn't been added yet).

The stacktrace shows that the crash is in g_slist_remove in
ck_file_monitor_remove_notify. My patch adds a guard for
NULL watch->notifies, but the value is not NULL, but neither
is it a valid pointer (0x10 in case I am looking at).

The inotify IN_IGNORED event does cause a removal, and it does
set watch->notifies to NULL prior to this, but by the time
the other removal event arrives the value is 0x10, which I
don't understand.

Any insights you may have would be appreciated.

Thanks,

James
Comment 5 James Westby 2008-10-20 07:18:47 UTC
Created attachment 19760 [details] [review]
Serialise removals, and avoid using freed data caused by removals

Hi,

Further debugging revealed this issue with the previous patch.
While the removals were serialised each notify still contained
a reference to a watch, which will have been freed if a removal
was already triggered, causing a segfault.

I changed the code to also loop through watch->notifies when
removing the watch due to inotify, and NULL each notify->watch
reference, the code then checks this before trying to delete
the watch itself if asked to remove the notify.

In order to prevent other race conditions in this area I also
made the inotify code not pass a watch to emit_events_in_idle,
as the watch may get freed in the meantime. It instead passes
the wd and the emit loop looks up the watch, discarding the
event if the watch has been removed.

I did however leave in the code that checks for a removed watch
before doing anything with inotify, as I hoped that this would
just optimise this case.

Please review the attached patch for inclusion. I am going to
request testing of the patch in the Ubuntu bug, and if that
reveals no further problems seek an upload of the patch.

Thanks,

James
Comment 6 Martin Pitt 2008-10-21 02:19:21 UTC
For the record, this is a recipe to reliably reproduce the crash, thanks to "iponeverything" in the Ubuntu bug (slightly simplified):

--------- 8< -----------
- Attach a an strace to CK so you can see when you see it fall down:
  sudo strace -p `pidof console-kit-daemon`

- Open additional xterm's like so: xterm -geometry 20x10 &
  open 9 or 10 of them.

- now go through and use "su - someuser" in all of the open xterms.

- Next go thru and close each xterm one by one by using the close button in the upper right corner -- by the time you get to the forth or fifth window -- console-kit-daemon will have segfault'ed for you.
--------- 8< -----------

I tried for half an hour to turn this into a noninteractive test script, but failed unfortunately.

I tested CK with James' patch, and it works very well. I don't get crashes any more, and both X and pam-ck-connector based sessions work as usual, and even hammering it with something like

for i in `seq 50`; do
    ck-launch-session sleep 5 &
    sleep 0.1
done

works correctly.
Comment 7 william.jon.mccann 2009-02-11 17:23:27 UTC
Sorry for the really long delay.  And thanks for looking into this!

I think your analysis is correct.  In retrospect I should have designed this to use ref-counted objects.  This patch seems like a good solution for now.  Longer term I'd like to try to add ACCESS inotify support to GIO and use GFileMonitor.

I've committed this to master with a few small changes such as casting the wd int into a pointer and vice versa.

I'd appreciate any testing you can do of the committed patch.  I'll be rolling a release with it shortly.

Thanks again.
Comment 8 Martin Pitt 2009-02-11 23:51:43 UTC
Thanks, William. For the testing record, we have had this patch applied for four months now, in a stable release, and haven't heard bad things about it.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.