Hi! We get a lot of bug reports about crashes with this signature: g_hash_table_remove_internal (hash_table=0x9237db8, key=0x0, notify=1) file_monitor_remove_watch (monitor=0x9240d60, watch=0x0) at ck-file-monitor-inotify.c:245 ck_file_monitor_remove_notify (monitor=0x9240d60, id=2) at ck-file-monitor-inotify.c:529 ck_tty_idle_monitor_stop (monitor=0x9243980) at ck-tty-idle-monitor.c:211 ck_session_finalize (object=0x92490e0) at ck-session.c:855 http://launchpadlibrarian.net/17571430/Stacktrace.txt has a complete and fully symbolic stack trace. It seems that in some cases, file_monitor_remove_watch() is called with watch == NULL, which leads to this crash. The dodgy approach would be to just test for this condition in file_monitor_remove_watch(), but I guess watch == NULL is a "this should not happen(TM)" case, and there is a deeper logic error?
Created attachment 19645 [details] [review] Ubuntu patch For now I applied a patch which intercepts removing a NULL watch. Even if that condition "should not happen", it is defensive, and after removing a watch the state is consistent again anyway.
Hi, I believe I have found the cause of this bug. It is to do with the interaction of the watch removal caused by the dbus name change and the inotify IN_IGNORED notification. Both will trigger a watch to be removed, so there can potentially be two removals attempted. However, this is not checked for in the code. Even if it was, it would still be potentially racy (even if the hole would be a lot smaller), so I think the inotify removals should be pushed in to the main thread, like the rest of the responses to inotify events, where they can safely check whether the watch has already been removed. I will try and write a patch for this now. Thanks, James
Created attachment 19742 [details] [review] Patch to serialise watch removals Hi, Here is my proposed patch to fix this issue. The first thing it does is move the removal of a watch caused by an IN_IGNORED event in to the main thread, where they will be serialised with the other removals, preventing race conditions. Then, for the case where two removals are generated it adds two checks for the watch being already removed. It firstly checks for watch->wd == -1 in file_monitor_remove_watch, which will be True if the watch is already removed, and handles the case when the inotify removal ends up after the dbus one. It then checks notify->watch->notifies for the case when the dbus removal ends up after the inotify one. The testcase I was using for this is thanks to "iponeverything" in the Ubuntu bug. It is opening 10 xterms and doing "su -" in each (i.e. opening lots of sessions), and then closing them all one by one using the close button of the WM. I am going to seek an upload of this patch in to Ubuntu, as the bug is currently one of the most reported there, and the crashes mean people lose desktop functionality until they restart their session. Thanks, James
Hi, Unfortunately my patch does not completely solve the problem. After applying the patch it is still possible to get the crash with the same recipe (I think I was closing them too quickly, so the inotify watches hadn't been added yet). The stacktrace shows that the crash is in g_slist_remove in ck_file_monitor_remove_notify. My patch adds a guard for NULL watch->notifies, but the value is not NULL, but neither is it a valid pointer (0x10 in case I am looking at). The inotify IN_IGNORED event does cause a removal, and it does set watch->notifies to NULL prior to this, but by the time the other removal event arrives the value is 0x10, which I don't understand. Any insights you may have would be appreciated. Thanks, James
Created attachment 19760 [details] [review] Serialise removals, and avoid using freed data caused by removals Hi, Further debugging revealed this issue with the previous patch. While the removals were serialised each notify still contained a reference to a watch, which will have been freed if a removal was already triggered, causing a segfault. I changed the code to also loop through watch->notifies when removing the watch due to inotify, and NULL each notify->watch reference, the code then checks this before trying to delete the watch itself if asked to remove the notify. In order to prevent other race conditions in this area I also made the inotify code not pass a watch to emit_events_in_idle, as the watch may get freed in the meantime. It instead passes the wd and the emit loop looks up the watch, discarding the event if the watch has been removed. I did however leave in the code that checks for a removed watch before doing anything with inotify, as I hoped that this would just optimise this case. Please review the attached patch for inclusion. I am going to request testing of the patch in the Ubuntu bug, and if that reveals no further problems seek an upload of the patch. Thanks, James
For the record, this is a recipe to reliably reproduce the crash, thanks to "iponeverything" in the Ubuntu bug (slightly simplified): --------- 8< ----------- - Attach a an strace to CK so you can see when you see it fall down: sudo strace -p `pidof console-kit-daemon` - Open additional xterm's like so: xterm -geometry 20x10 & open 9 or 10 of them. - now go through and use "su - someuser" in all of the open xterms. - Next go thru and close each xterm one by one by using the close button in the upper right corner -- by the time you get to the forth or fifth window -- console-kit-daemon will have segfault'ed for you. --------- 8< ----------- I tried for half an hour to turn this into a noninteractive test script, but failed unfortunately. I tested CK with James' patch, and it works very well. I don't get crashes any more, and both X and pam-ck-connector based sessions work as usual, and even hammering it with something like for i in `seq 50`; do ck-launch-session sleep 5 & sleep 0.1 done works correctly.
Sorry for the really long delay. And thanks for looking into this! I think your analysis is correct. In retrospect I should have designed this to use ref-counted objects. This patch seems like a good solution for now. Longer term I'd like to try to add ACCESS inotify support to GIO and use GFileMonitor. I've committed this to master with a few small changes such as casting the wd int into a pointer and vice versa. I'd appreciate any testing you can do of the committed patch. I'll be rolling a release with it shortly. Thanks again.
Thanks, William. For the testing record, we have had this patch applied for four months now, in a stable release, and haven't heard bad things about it.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.