Bug 95191

Summary: Memory leak when running test-(d)bus test on Windows
Product: dbus Reporter: yfei <yiyangfei36>
Component: coreAssignee: D-Bus Maintainers <dbus>
Status: RESOLVED MOVED QA Contact: D-Bus Maintainers <dbus>
Severity: major    
Priority: medium CC: ralf.habacker, smcv
Version: 1.10Keywords: patch
Hardware: x86-64 (AMD64)   
OS: Windows (All)   
Whiteboard: review-, almost ready
i915 platform: i915 features:
Attachments: Log from running "nmake check" on Win64 build using VS2013.
Fix memory leak in _dbus_replace_install_prefix() on success case on Windows.
Display thread id in _dbus_verbose to be able to see thread issues.
On Windows make access to member 'refcount' of struct DBusBabysitter thread save.
Prefix debug messages displayed with _dbus_verbose() with a timestamp.
Follow up of commit 5bd460ecf027bb36ffd21d325739bb51df94521f
Display thread id in _dbus_verbose to be able to see thread issues (update 1)
Prefix debug messages displayed with _dbus_verbose() with a timestamp (update 1)
On Windows make access to member 'refcount' of struct DBusBabysitter thread save (update 1)
Eliminates a race condition accessing DBusBabysitter instance at startup of babysitter() on Windows.
Fix memory leak issue in spawn code on Windows.
Fix memory leak issue in spawn code on Windows (update 1)
Display thread id in _dbus_verbose to be able to see thread issues (update 2)
dbus-spawn-win: Do not attempt to call child_setup (again)
Fix compiling on windows/mingw with automake build system.
Migrate platform specific spawn tests into cross platform tests.
Migrate platform specific spawn tests into cross platform tests (update 1)
Fix ambiguous setup of DBusBabySitter struct member child_handle on Windows.
Fix memory leak issue in spawn code on Windows (update 2)
Migrate platform specific spawn tests into cross platform tests (update 2)
travis: compile with mingw in both production and debug flavours
test-segfault: add missing include of <stdio.h> on Windows
Migrate platform specific spawn tests into cross platform tests.
Use mutex to protect access to DBusBabysitter struct members on Windows.
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter().
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter() (update 1)
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter().
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter().

Description yfei 2016-04-28 16:12:54 UTC
Created attachment 123326 [details]
Log from running "nmake check" on Win64 build using VS2013.

When running test-dbus spawn, I get memory leak errors.  Full unit test output is attached.  Test was run in Win10 x64, VS2013 compiler, "nmake check".
Comment 1 yfei 2016-04-28 21:26:16 UTC
Ran unit tests with Visual Leak Detector enabled but it did not report any leaks.  Noticed that when running the tests in debug mode through debugger, the leak becomes intermittent.  The behavior suggests that it is possible check_memleaks() is called before all dbus_free() are called.  Added a sleep(100) to the beginning of check_memleaks() and no more leaks are reported.  Any thoughts on what might be causing this behavior?
Comment 2 Simon McVittie 2016-04-29 11:11:05 UTC
(In reply to yfei from comment #1)
> Ran unit tests with Visual Leak Detector enabled but it did not report any
> leaks.  Noticed that when running the tests in debug mode through debugger,
> the leak becomes intermittent.  The behavior suggests that it is possible
> check_memleaks() is called before all dbus_free() are called.  Added a
> sleep(100) to the beginning of check_memleaks() and no more leaks are
> reported.  Any thoughts on what might be causing this behavior?

My guess would be that some resource is only freed when the spawned subprocess exits, and in the runs where it succeeded, check_memleaks() might be happening before that (i.e. the test might not be waiting for the spawned subprocess to finish).
Comment 3 Ralf Habacker 2016-05-02 05:29:30 UTC
(In reply to Simon McVittie from comment #2)

> My guess would be that some resource is only freed when the spawned
> subprocess exits, and in the runs where it succeeded, check_memleaks() might
> be happening before that (i.e. the test might not be waiting for the spawned
> subprocess to finish).

Spawning on windows needs more time as on unix, so this may indeed the reason.
Comment 4 Ralf Habacker 2016-05-09 12:52:46 UTC
Comment on attachment 123326 [details]
Log from running "nmake check" on Win64 build using VS2013.

>test-dbus: running message tests
>...
>8 dbus_malloc blocks were not freed

This is not spawn related.
Comment 5 Ralf Habacker 2016-05-09 12:54:36 UTC
I personally did not get the leaks with test-dbus only with test-bus

xxx@yyy:~/src/dbus-cmake-cross-build> DBUS_TEST_DAEMON=z:/home/xxx/src/dbus-cmake-cross-build/bin/dbus-daemon.exe DBUS_SESSION_BUS_ADDRESS= DBUS_FATAL_WARNINGS=0 DBUS_TEST_DATA=z:/home/xxx/src/dbus-cmake-cross-build/test/data DBUS_TEST_DBUS_LAUNCH=z:/home/xxx/src/dbus-cmake-cross-build/bin/dbus-launch.exe DBUS_TEST_HOMEDIR=z:/home/xxx/src/dbus-cmake-cross-build/dbus bin/test-bus.exe --tap config-parser
Z:\home\xxx\src\dbus-cmake-cross-build\bin\test-bus.exe: Running config file parser test
Testing retrieving the default session service directories
    default service dir: Z:/home/xxx/src/dbus-cmake-cross-build/share/dbus-1/services
    default service dir: C:\Program Files\Common Files/dbus-1/services
    test service dir: 'Z:/home/xxx/src/dbus-cmake-cross-build/share/dbus-1/services'
    current standard service dir: 'Z:/home/xxx/src/dbus-cmake-cross-build/share/dbus-1/services'
    test service dir: 'C:\Program Files\Common Files/dbus-1/services'
    current standard service dir: 'C:\Program Files\Common Files/dbus-1/services'
default system service dir skipped
Testing valid files:
    forbidding.conf
    entities.conf
    session.conf
    incoming-limit.conf
    debug-allow-all-sha1.conf
    multi-user.conf
    basic.conf
    finite-timeout.conf
    check-own-rules.conf
        Check name org.freedesktop: not allowed
        Check name org.freedesktop.ManySystem: not allowed
        Check name org.freedesktop.ManySystems: allowed
        Check name org.freedesktop.ManySystems.foo: allowed
        Check name org.freedesktop.ManySystems.foo.bar: allowed
        Check name org.freedesktop.ManySystems2: not allowed
        Check name org.freedesktop.ManySystems2.foo: not allowed
        Check name org.freedesktop.ManySystems2.foo.bar: not allowed
    many-rules.conf
    systemd-activation.conf
    debug-allow-all.conf
    listen-unix-runtime.conf
Testing invalid files:
    badselinux-2.conf
    not-well-formed.conf
    truncated-file.conf
    circular-2.conf
    circular-3.conf
    badselinux-1.conf
    circular-1.conf
Comparing equivalent files:
    entities-1.conf
    entities-2.conf
Comparing equivalent files:
    basic-2.conf
    basic-1.conf
Z:\home\xxx\src\dbus-cmake-cross-build\bin\test-bus.exe: checking for memleaks
72 dbus_malloc blocks were not freed
Unit test failed: memleaks
Comment 6 yfei 2016-05-09 13:48:03 UTC
The leaks I saw were intermittent when I ran the tests repeatedly on the same PC.
Comment 7 Ralf Habacker 2016-05-09 14:28:27 UTC
(In reply to Ralf Habacker from comment #5)
> Z:\home\xxx\src\dbus-cmake-cross-build\bin\test-bus.exe: checking for memleaks
> 72 dbus_malloc blocks were not freed
> Unit test failed: memleaks
00246480 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00246ae8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00247830 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0024ecf8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0024ef78 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0024f3c8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00251b20 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00253198 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
002538e0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
002540f8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00254e18 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00254fd0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00258470 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
002599c8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0025bd18 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0025d8a0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0025eff0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0025f5f8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
002609b8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00262908 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00262c50 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00263270 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00265648 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00266fc8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00268208 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0026a1c0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0026a660 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0026b580 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0026bb90 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0026bf28 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0026c638 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00271cc0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00273460 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00274138 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00274788 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00276d68 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00277010 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00278370 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00279db8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0027b6d0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0027d300 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0027e8f8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0027eed8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0027f090 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00280cd0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00282268 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
002828e8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00282b68 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00283b18 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00285758 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00286ad0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00286e18 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00287510 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00289c50 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0028a088 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0028a3a8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0028bbf0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0028be98 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0028d938 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0028f860 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0028fbf8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
002903b8 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00291060 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00292a70 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00293dc0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00295bd0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00295fd0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
002964a0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00299270 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00299948 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
00299eb0 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
0029a380 realloc(268) allocated at [/home/xxx/src/dbus/dbus/dbus-memory.c:678]
Comment 8 Ralf Habacker 2016-05-09 19:53:52 UTC
(In reply to yfei from comment #6)
> The leaks I saw were intermittent when I ran the tests repeatedly on the
> same PC.

On a different pc I also got intermittent memory leaks on running 'test-dbus --tap spawn'. 

test-dbus: running spawn tests
8 dbus_malloc blocks were not freed

The related memory allocation is 
002465c0 malloc(12) ()
Comment 9 Ralf Habacker 2016-05-09 20:40:13 UTC
(In reply to Ralf Habacker from comment #8)
> (In reply to yfei from comment #6)
> > The leaks I saw were intermittent when I ran the tests repeatedly on the
> > same PC.
> 
> On a different pc I also got intermittent memory leaks on running 'test-dbus
> --tap spawn'. 
> 
> test-dbus: running spawn tests
> 8 dbus_malloc blocks were not freed
> 
> The related memory allocation is 
> 002465c0 malloc(12) ()

002465c0 malloc(12) () [/home/xxx/src/dbus/dbus/dbus-memory.c:829]

_dbus_register_shutdown_func_unlocked (DBusShutdownFunction  func,
                                       void                 *data)
{
827:  ShutdownClosure *c;
828:
829:  c = dbus_new (ShutdownClosure, 1);
Comment 10 Ralf Habacker 2016-05-09 21:19:16 UTC
(In reply to Ralf Habacker from comment #7)
> (In reply to Ralf Habacker from comment #5)
> > Z:\home\xxx\src\dbus-cmake-cross-build\bin\test-bus.exe: checking for memleaks
> > 72 dbus_malloc blocks were not freed
> > Unit test failed: memleaks
10: 002463f8 realloc(268) (S:\ralf\src\dbus-cmake-cross-x86-build\) [/home/ralf/src/dbus/dbus/dbus-string.c:366]

This is DBusString instance holding the install root and allocated in 

static dbus_bool_t
reallocate_for_length (DBusRealString *real,
                       int             new_length)
{
...
  /* But be sure we always alloc at least space for the new length */
  new_allocated = MAX (new_allocated,
                       new_length + _DBUS_STRING_ALLOCATION_PADDING);

  _dbus_assert (new_allocated >= real->allocated); /* code relies on this */
366:  new_str = dbus_realloc (real->str - real->align_offset, new_allocated);
Comment 11 Ralf Habacker 2016-05-09 21:35:19 UTC
Created attachment 123585 [details] [review]
Fix memory leak in _dbus_replace_install_prefix() on success case on Windows.

Bug:
Comment 12 Ralf Habacker 2016-05-10 00:01:07 UTC
(In reply to Ralf Habacker from comment #9)
> (In reply to Ralf Habacker from comment #8)
> > (In reply to yfei from comment #6)
> > > The leaks I saw were intermittent when I ran the tests repeatedly on the
> > > same PC.
> > 
> > On a different pc I also got intermittent memory leaks on running 'test-dbus
> > --tap spawn'. 
> > 
> > test-dbus: running spawn tests
> > 8 dbus_malloc blocks were not freed
> > 
other results 
Unit test failed: memleaks
2715: 002465c0 malloc(12) () [/home/ralf/src/dbus/dbus/dbus-memory.c:854]
2690: 002649d8 calloc(24) () [/home/ralf/src/dbus/dbus/dbus-watch.c:236]
2691: 00264af0 malloc(15) (spawn_and_kill) [/home/ralf/src/dbus/dbus/dbus-internals.c:547]
2703: 00264f20 malloc(8) () [/home/ralf/src/dbus/dbus/dbus-spawn-win.c:416]
2689: 00266140 calloc(72) () [/home/ralf/src/dbus/dbus/dbus-spawn-win.c:97]

> 2691: 00264af0 malloc(15) (spawn_and_kill) [/home/ralf/src/dbus/dbus/dbus-internals.c:547]
This is while running the "spawn_and_kill" test case

> 2689: 00266140 calloc(72) () [/home/ralf/src/dbus/dbus/dbus-spawn-win.c:97]
This references the related sitter instance. 

Looks like a reference counting issue in DBusBabysitter: The spawn is handled in a different thread. Both threads unrefs the DBusBabysitter instance and write to the refcount member as shown in the follow debug infos:

function      threadid sitter refcount
_dbus_babysitter_new 68 00245F78 1
_dbus_babysitter_ref 205 00245F78 2
_dbus_babysitter_unref 68 00245F78 old: 2
_dbus_babysitter_unref 68 00245F78 new: 1
_dbus_babysitter_unref 68 00245F78 -> delete sitter instance (looks wrong: should only be when refcount == 0)
close_socket_to_babysitter 68 00245F78 0
_dbus_babysitter_unref 205 00245F78 old: 1
_dbus_babysitter_unref 205 00245F78 new: 0
_dbus_babysitter_unref 205 00245F78 -> delete sitter instance
close_socket_to_babysitter 205 00245F78 0

Especially the follow entries shows that there is something wrong
>_dbus_babysitter_unref 68 00245F78 old: 2
>_dbus_babysitter_unref 68 00245F78 new: 1
-> new refcount is 1 
>_dbus_babysitter_unref 68 00245F78 -> delete sitter instance
-> Bug: the instance is only removed when refcount == 0

I changed refcount type from int to DBusAtomic and got 

function      threadid sitter refcount
_dbus_babysitter_new 263 00265D00 1
_dbus_babysitter_ref 73 00265D00 2
_dbus_babysitter_unref 263 00265D00 old: 2
_dbus_babysitter_unref 263 00265D00 new: 1
_dbus_babysitter_unref 263 00265D00 new: 1 end
-> no additional delete 
_dbus_babysitter_unref 73 00265D00 old: 2
_dbus_babysitter_unref 73 00265D00 new: 0
_dbus_babysitter_unref 73 00265D00 -> delete sitter instance
-> this is expected and okay

_dbus_babysitter_unref 73 00265D00 old: 2
_dbus_babysitter_unref 73 00265D00 new: 0
-> still unclear why jumping from 2 to zero: additional threading issue ?

@Simon: Any hints ?
Comment 13 Ralf Habacker 2016-05-10 05:28:21 UTC
(In reply to Ralf Habacker from comment #9)
> (In reply to Ralf Habacker from comment #8)
> > (In reply to yfei from comment #6)
> > > The leaks I saw were intermittent when I ran the tests repeatedly on the
> > > same PC.
> > 
> > On a different pc I also got intermittent memory leaks on running 'test-dbus
> > --tap spawn'. 
> > 
> > test-dbus: running spawn tests
> > 8 dbus_malloc blocks were not freed
> > 
> > The related memory allocation is 
> > 002465c0 malloc(12) ()
> 
> 002465c0 malloc(12) () [/home/xxx/src/dbus/dbus/dbus-memory.c:829]
> 
> _dbus_register_shutdown_func_unlocked (DBusShutdownFunction  func,
>                                        void                 *data)
> {
> 827:  ShutdownClosure *c;
> 828:
> 829:  c = dbus_new (ShutdownClosure, 1);

function threadid instance refcount/(comment) 
_dbus_babysitter_new 191 00266060 1
_dbus_babysitter_ref 154 00266060 2
_dbus_babysitter_unref 191 00266060 old: 2
_dbus_babysitter_unref 191 00266060 new: 1
_dbus_babysitter_unref 191 00266060 new: 1 end
dbus_shutdown 191 002465C0 (shutdown ShutdownClosure)
_dbus_babysitter_unref 154 00266060 old: 1
_dbus_babysitter_unref 154 00266060 new: 0
test-dbus: checking for memleaks
_dbus_babysitter_unref 154 00266060 delete
close_socket_to_babysitter 154 00266060 0
8 dbus_malloc blocks were not freed
Unit test failed: memleaks
_dbus_register_shutdown_func_unlocked 154 002465C0 (create ShutdownClosure)

-> access to registered_globals and/or implementation of creating/shutdown ShutdownClosure is not thread safe
Comment 14 Simon McVittie 2016-05-10 11:23:44 UTC
(In reply to Ralf Habacker from comment #12)
> Looks like a reference counting issue in DBusBabysitter: The spawn is
> handled in a different thread.

This is unique to Windows: on Unix, the babysitter is a forked subprocess.
Comment 15 Simon McVittie 2016-05-10 11:29:43 UTC
Comment on attachment 123585 [details] [review]
Fix memory leak in _dbus_replace_install_prefix() on success case on Windows.

Review of attachment 123585 [details] [review]:
-----------------------------------------------------------------

Looks good. I don't think this is applicable on 1.10 or older - we reworked _dbus_replace_install_prefix() since 1.10 (it used to use a fixed-length static buffer).
Comment 16 Simon McVittie 2016-05-10 11:37:13 UTC
(In reply to Ralf Habacker from comment #12)
> Looks like a reference counting issue in DBusBabysitter: The spawn is
> handled in a different thread. Both threads unrefs the DBusBabysitter
> instance and write to the refcount member

Yes, if it is manipulated from two different threads, you have to use either _dbus_atomic_inc()/_dbus_atomic_dec() or a mutex.

It would also be worthwhile to audit what else is used in the non-main thread. dbus-*-util-* and bus/ are specifically not thread-safe, so you can't use any APIs from there in the non-main thread, unless those APIs are known to be thread-safe (in which case they should have a comment so we don't break that property).

There are basically two common patterns for this:

* use a mutex (a global one for dbus-spawn-win.c is probably fine) to
  protect whatever state is shared between the main thread and the other
  thread; or

* write to particular members from the main thread, transfer ownership
  to the other thread with a memory barrier (_dbus_atomic_inc() on a
  refcount is enough), and do not touch those members from the main thread
  ever again (except perhaps to free the struct, which comes after a
  _dbus_atomic_dec() from the other thread, which is also a memory barrier
  so it's OK)
Comment 17 Simon McVittie 2016-05-10 11:50:33 UTC
(In reply to Ralf Habacker from comment #12)
> _dbus_babysitter_unref 73 00265D00 old: 2
> _dbus_babysitter_unref 73 00265D00 new: 0
> -> still unclear why jumping from 2 to zero: additional threading issue ?

How are you inspecting the refcount? Please show me the code.

To avoid a time-of-check/time-of-use bug, you should always use _dbus_atomic_inc() and _dbus_atomic_dec() if at all possible, like in _dbus_server_ref_unlocked() and _dbus_server_unref_unlocked(). They both return what the value was *before* the increment or decrement - _dbus_atomic_inc(x) is an atomic version of x++ and _dbus_atomic_dec() is an atomic version of x-- (not ++x and --x). Use that returned value instead of calling _dbus_atomic_get().

If the reference-counting is particularly tricky, it might be worth adding the same machinery as for _dbus_server_trace_ref().

_dbus_atomic_get() is only valid to use in situations where it is OK that the value might change immediately after it returns (for instance, we use it for the functions that are called by the Stats interface, where it's OK if the result is out-of-date by a few milliseconds). Don't use it unless you can spell out why it's OK. Direct access to atomic->value is always wrong, regardless.
Comment 18 Ralf Habacker 2016-05-10 12:19:31 UTC
Comment on attachment 123585 [details] [review]
Fix memory leak in _dbus_replace_install_prefix() on success case on Windows.

committed to git master branch
Comment 19 Ralf Habacker 2016-05-10 15:02:59 UTC
Created attachment 123606 [details] [review]
Display thread id in _dbus_verbose to be able to see thread issues.

Bug:
Comment 20 Ralf Habacker 2016-05-10 15:06:14 UTC
Created attachment 123607 [details] [review]
On Windows make access to member 'refcount' of struct DBusBabysitter thread save.
Comment 21 Ralf Habacker 2016-05-10 15:47:30 UTC
(In reply to Simon McVittie from comment #17)
> (In reply to Ralf Habacker from comment #12)
> > _dbus_babysitter_unref 73 00265D00 old: 2
> > _dbus_babysitter_unref 73 00265D00 new: 0
> > -> still unclear why jumping from 2 to zero: additional threading issue ?
> 
> How are you inspecting the refcount? Please show me the code.

see attachment 123607 [details] [review]
> 
> To avoid a time-of-check/time-of-use bug, you should always use
> _dbus_atomic_inc() and _dbus_atomic_dec() if at all possible, like in
> _dbus_server_ref_unlocked() and _dbus_server_unref_unlocked(). 

the calls to dbus_assert can still be optimized in this way.


> If the reference-counting is particularly tricky, it might be worth adding
> the same machinery as for _dbus_server_trace_ref().
see attachment 123607 [details] [review]
> 
> _dbus_atomic_get() is only valid to use in situations where it is OK that
> the value might change immediately after it returns 

sitter->child_handle may need a mutex because it is written from both threads. Also deleting the sitter instance in _dbus_babysitter_unref may require a mutex to about access to a partially destroyed instance.
Comment 22 Ralf Habacker 2016-05-10 15:48:26 UTC
(In reply to Ralf Habacker from comment #21)
> sitter->child_handle may need a mutex because it is written from both
> threads. Also deleting the sitter instance in _dbus_babysitter_unref may
> require a mutex to about access to a partially destroyed instance.

sed 's/about/avoid/g'
Comment 23 Ralf Habacker 2016-05-10 16:11:21 UTC
(In reply to Ralf Habacker from comment #21)
> (In reply to Simon McVittie from comment #17)
> > (In reply to Ralf Habacker from comment #12)
> > > _dbus_babysitter_unref 73 00265D00 old: 2
> > > _dbus_babysitter_unref 73 00265D00 new: 0
> > > -> still unclear why jumping from 2 to zero: additional threading issue ?
> 
with attachment 123606 [details] [review] and 123607 applied I can see something the following order 

spawn_and_kill: (will fail malloc 3 with 3 failures)
===
63:0x0035 1462896352.247877 [dbus/dbus-internals.c(524):_dbus_trace_ref] DBusBabysitter 002647F0 0 -> 1 refs (_dbus_babysitter_new)
-> thread 0x0035 creates the sitter 

63:0x0035 1462896352.248038 [dbus/dbus-sysdeps-win.c(1141):_dbus_socketpair] full-duplex pipe 196:196 <-> 200:200
63:0x0041 1462896352.248131 [dbus/dbus-internals.c(524):_dbus_trace_ref] DBusBabysitter 002647F0 1 -> 2 refs (_dbus_babysitter_ref)
-> thread 0x0041 refs the sitter 

63:0x0041 1462896352.248139 [dbus/dbus-spawn-win.c(616):babysitter] babysitter: spawning spawn_and_kill
63:0x0035 1462896352.252271 [dbus/dbus-internals.c(524):_dbus_trace_ref] DBusBabysitter 002647F0 2 -> 1 refs (_dbus_babysitter_unref)
-> thread 0x0035 unrefs the sitter 

63:0x0041 1462896352.252276 [dbus/dbus-internals.c(524):_dbus_trace_ref] DBusBabysitter 002647F0 1 -> 0 refs (_dbus_babysitter_unref)
-> thread 0x0041 (the non main thread) destroyes the sitter  ???

63:0x0041 1462896352.252300 [dbus/dbus-spawn-win.c(186):close_socket_to_babysitter] Closing babysitter
63:0x0035 1462896352.252314 [dbus/dbus-internals.c(1021):run_failing_each_malloc] 

BTW: The timestamp (e.g. 1462896352.252314) comes from an additional local patch of dbus_verbose
Comment 24 Ralf Habacker 2016-05-11 06:09:15 UTC
Created attachment 123615 [details] [review]
Prefix debug messages displayed with _dbus_verbose() with a timestamp.

The new format of output generated with dbus_verbose() is <pid>:<thread> <timestamp> <text>

Bug:
Comment 25 Ralf Habacker 2016-05-11 06:10:23 UTC
Created attachment 123616 [details] [review]
Follow up of commit 5bd460ecf027bb36ffd21d325739bb51df94521f

Bug: https://bugs.freedesktop.org/show_bug.cgi?id=95191

- remove unused variable enabled
- reduce usage of _dbus_atomic_get()

This patch will be merged into the related commit after review
Comment 26 yfei 2016-05-11 15:43:34 UTC
Thank you for investigating this memory leak.  With the latest code on master branch, test-bus consistently passes.  However, test-dbus still consistently shows one memory leak during spawn test:

X:\qaweb\opensource_dbus\build\x64>bin\test-dbus test\data spawn
Test data in test\data
test-dbus: running spawn tests
test-dbus: checking for memleaks
1 dbus_malloc blocks were not freed
Unit test failed: memleaks
Comment 27 Ralf Habacker 2016-05-11 17:19:18 UTC
(In reply to yfei from comment #26)
> Thank you for investigating this memory leak.  With the latest code on
> master branch, test-bus consistently passes.  However, test-dbus still
> consistently shows one memory leak during spawn test:
> 
> X:\qaweb\opensource_dbus\build\x64>bin\test-dbus test\data spawn
> Test data in test\data
> test-dbus: running spawn tests
> test-dbus: checking for memleaks
> 1 dbus_malloc blocks were not freed
> Unit test failed: memleaks

This is caused by a design issue in the windows implementation of _dbus_spawn_async_with_babysitter() for example in the 'spawn_nonexistant' test case showed by the following code example.

  if (_dbus_spawn_async_with_babysitter (&sitter, "spawn_nonexistent", argv, NULL,
                                         NULL, NULL,
                                         &error))
-> at this point an additional thread has been started, which starts the requested application. The exit status is returned back to the main thread by sitter->child_handle.
    {
      _dbus_babysitter_block_for_child_exit (sitter);
-> at this location the child process should has been exited. The function is implemented as 

_dbus_babysitter_block_for_child_exit (DBusBabysitter *sitter)
{
  if (sitter->child_handle == NULL)
    return;

  WaitForSingleObject (sitter->end_sync_event, INFINITE);
}

-> In the mentioned test case sitter->child_handle is zero, which let this function returns immediately instead of waiting until child has been exited.

      _dbus_babysitter_set_child_exit_error (sitter, &error);
    }

   if (sitter)
    _dbus_babysitter_unref (sitter);

-> In the mentioned test case this is called before the ..._unref() counterpart in the non-main thread. 
The non main thread then calls _unref() later and destroys the sitter instance. Before that happens the main thread starts the mem leak check, which detects the not or partially destroyed sitter instance.
Comment 28 yfei 2016-05-11 17:53:48 UTC
The pointer to the sitter object is passed into the thread function babysitter() as a parameter, and the sitter->child_handle is populated there by spawn_program().  There is also a start_sync_event that the main thread waits on to know when the program has actually launched inside the thread.  So I'm not understanding why sitter->child_handle would be zero if the program launched successfully.
Comment 29 Ralf Habacker 2016-05-11 18:03:11 UTC
(In reply to yfei from comment #28)
> The pointer to the sitter object is passed into the thread function
> babysitter() as a parameter, and the sitter->child_handle is populated there
> by spawn_program().  There is also a start_sync_event that the main thread
> waits on to know when the program has actually launched inside the thread. 
> So I'm not understanding why sitter->child_handle would be zero if the
> program launched successfully.

The 'spawn_nonexistant' test case tries to spawn a non existant executable and therefore sitter->child_handle is zero.
Comment 30 yfei 2016-05-11 19:58:09 UTC
It seems the sitter is intended to track the spawned program.  In fact, even waiting on  sitter->end_sync_event is not a guarantee because the sitter is freed after that signal is raised and memory check could happen in between those two actions.  The problem seems to be that nothing is tracking the thread.  Perhaps a _dbus_babysitter_block_for_thread_exit() should be added to wait until the thread has also returned to guarantee everything created by _dbus_spawn_async_with_babysitter() has been freed.
Comment 31 Ralf Habacker 2016-05-12 06:13:05 UTC
(In reply to yfei from comment #30)
> It seems the sitter is intended to track the spawned program.  In fact, even
> waiting on  sitter->end_sync_event is not a guarantee because the sitter is
> freed after that signal is raised and memory check could happen in between
> those two actions.  The problem seems to be that nothing is tracking the
> thread.  Perhaps a _dbus_babysitter_block_for_thread_exit() should be added
> to wait until the thread has also returned to guarantee everything created
> by _dbus_spawn_async_with_babysitter() has been freed.

Sounds good, but the testcases are also used on unix/linux where the spawn is performed by fork/exec, so adding  _dbus_babysitter_block_for_thread_exit() would only be usable on windows (or platforms where spawing is implemented with threading too). 

Maybe it would be possible to add an empty _dbus_babysitter_block_for_thread_exit() on unix and to use something like this

  if (_dbus_spawn_async_with_babysitter (...)
    {
      _dbus_babysitter_block_for_child_exit (sitter);
      _dbus_babysitter_set_child_exit_error (sitter, &error);
      _dbus_babysitter_block_for_thread_exit (sitter);
    }

or this 

  if (_dbus_spawn_async_with_babysitter (...)
    {
      if (_dbus_babysitter_block_for_child_exit (sitter))
        _dbus_babysitter_set_child_exit_error (sitter, &error);
      else
        _dbus_babysitter_block_for_thread_exit (sitter);
    }

Another option would be to add this check to _dbus_babysitter_block_for_child_exit()

@simon: what do you think ?
Comment 32 yfei 2016-05-12 15:57:18 UTC
It will have to be the first option since _dbus_babysitter_block_for_child_exit() has no return values.  I don't think putting the wait for thread exit inside _dbus_babysitter_block_for_child_exit() because some code may not care to wait for thread exit.
Comment 33 Simon McVittie 2016-05-12 18:03:51 UTC
While reading the code to check on this, I noticed another threading bug which would probably be good to fix first:

static DWORD __stdcall
babysitter (void *parameter)
{
  int ret = 0;
  DBusBabysitter *sitter = (DBusBabysitter *) parameter;

  PING();
  _dbus_babysitter_ref (sitter);           // <-----
  ...
  _dbus_babysitter_unref (sitter);

  return ...;
}

If the main thread gets far enough to destroy sitter before the babysitter thread gets to the line I've marked, then the babysitter thread will crash.

What we should do is:

  sitter_thread = (HANDLE) CreateThread (NULL, 0, babysitter,
    _dbus_babysitter_ref (sitter), 0, &sitter_thread_id);

and remove the _dbus_babysitter_ref() call from babysitter. This ensures that the babysitter thread already owns its one reference to the babysitter when it starts up, and eliminates the race condition (assuming that our refcounting is thread-safe, which perhaps it isn't, but you already know how to fix that).

OK, now back to the current problem:

(In reply to Ralf Habacker from comment #31)
> the testcases are also used on unix/linux where the spawn
> is performed by fork/exec, so adding 
> _dbus_babysitter_block_for_thread_exit() would only be usable on windows (or
> platforms where spawing is implemented with threading too). 

No, they aren't. The tests that run on Unix are in dbus/dbus-spawn.c, which is only used on Unix. The tests that run on Windows are in dbus/dbus-spawn-win.c, which is only used on Windows. You can do whatever you like within that file, as long as the API presented to the rest of libdbus remains the same.

> Another option would be to add this check to
> _dbus_babysitter_block_for_child_exit()

Let's take a step back here.

What we want to do is to wait until the child has exited (so we can poll its status), and also the babysitter thread has exited (so we don't leak). Neither of these is done (or even compiled) in production code; they only exist so the tests can use them. The end_sync_event HANDLE struct member doesn't exist when the test code isn't compiled in, either.

We wait for the child to exit by doing a WaitForSingleObject() on the end_sync_event. We don't currently wait for the babysitter thread to exit, but if we did, from a quick look at GLib it seems we would do so by calling WaitForSingleObject() on *that*.

This raises the question: if we already have a convenient HANDLE that we can wait for, namely the thread, then why are we bothering to have end_sync_event at all? We could just do:

  sitter->thread = (HANDLE) CreateThread (NULL, 0, babysitter,
      _dbus_sitter_ref (sitter), 0, &sitter_thread_id);
...
static void
_dbus_babysitter_block_for_child_exit (DBusBabysitter *sitter)
{
  WaitForSingleObject (sitter->thread, INFINITE);
}

and remove all references to end_sync_event. The babysitter thread indicates that its job is over by terminating. Simple!
Comment 34 Simon McVittie 2016-05-12 18:06:03 UTC
(In reply to yfei from comment #32)
> It will have to be the first option since
> _dbus_babysitter_block_for_child_exit() has no return values.

Again, that function isn't called outside dbus-spawn-win.c (it can't be, because it's static), and it isn't called except in tests anyway. So if it would benefit from having a return value, it would be fine to add one.

>  I don't think
> putting the wait for thread exit inside
> _dbus_babysitter_block_for_child_exit() because some code may not care to
> wait for thread exit.

The only things that call this function are tests in the same file, and they all want to wait for both the child process and the babysitter thread to finish, as far as I can see.
Comment 35 Simon McVittie 2016-05-12 18:12:59 UTC
Comment on attachment 123607 [details] [review]
On Windows make access to member 'refcount' of struct DBusBabysitter thread save.

Review of attachment 123607 [details] [review]:
-----------------------------------------------------------------

(Some of this is probably fixed in your follow-up patch already, but you attached them separately so that's how I'm reviewing them)

::: dbus/dbus-spawn-win.c
@@ +99,5 @@
> +_dbus_babysitter_trace_ref (DBusBabysitter *sitter,
> +    int old_refcount,
> +    int new_refcount,
> +    const char *why)
> +{

No need for a separate declaration, just put the static keyword on the definition:

static void
_dbus_babysitter_trace_ref (etc.)
{
  ...
}

@@ +112,5 @@
>  static DBusBabysitter*
>  _dbus_babysitter_new (void)
>  {
>    DBusBabysitter *sitter;
> +  static int enabled = -1;

unused

@@ +174,3 @@
>    PING();
>    _dbus_assert (sitter != NULL);
> +  _dbus_assert (_dbus_atomic_get (&sitter->refcount) > 0);

Atomic operations are relatively expensive, so prefer:

  old_refcount = _dbus_atomic_inc (&sitter->refcount);
  _dbus_assert (old_refcount > 0);
  _dbus_babysitter_trace_ref (...);

@@ +215,4 @@
>  
>    PING();
>    _dbus_assert (sitter != NULL);
> +  _dbus_assert (_dbus_atomic_get (&sitter->refcount) > 0);

Same as in ref(), base the assertion on the old_refcount
Comment 36 Simon McVittie 2016-05-12 18:24:03 UTC
Comment on attachment 123606 [details] [review]
Display thread id in _dbus_verbose to be able to see thread issues.

Review of attachment 123606 [details] [review]:
-----------------------------------------------------------------

::: dbus/dbus-sysdeps-unix.c
@@ +4432,5 @@
>  }
>  
> +unsigned long _dbus_current_thread_id ()
> +{
> +    return pthread_self();

Sorry, this doesn't work. pthread_self() returns a pthread_t, which can be anything - an integer, a pointer or even a struct.

Refactoring: instead of _dbus_current_thread_id(), introduce a sysdeps function something like this:

#ifdef DBUS_ENABLE_VERBOSE_MODE
/*
 * If we can identify the current process and/or thread, print them to stderr followed by a colon.
 */
void
_dbus_print_thread (void)
{
#ifdef __linux__
  /* we know a pthread_t is numeric on Linux */
  fprintf (stderr, "%lu: 0x%lx: ", _dbus_pid_for_log (), (unsigned long) pthread_self ());
#else
  /* in principle pthread_t isn't required to be printable */
  fprintf (stderr, "%lu: ", _dbus_pid_for_log ());
#endif
}
#endif

I think the Windows implementation would be:

void
_dbus_print_thread (void)
{
  fprintf (stderr, "%lu: 0x%04lx ", _dbus_pid_for_log (), GetCurrentThreadId ());
}
Comment 37 Simon McVittie 2016-05-12 18:26:44 UTC
Comment on attachment 123615 [details] [review]
Prefix debug messages displayed with _dbus_verbose() with a timestamp.

Review of attachment 123615 [details] [review]:
-----------------------------------------------------------------

I'm a little reluctant to make _dbus_verbose() *even* *more* verbose. It's already a sufficiently huge flood of information that I don't normally use it, even for debugging...

::: dbus/dbus-internals.c
@@ +394,4 @@
>    va_list args;
>    static dbus_bool_t need_pid = TRUE;
>    int len;
> +  long sec,usec;

space after comma

@@ +394,5 @@
>    va_list args;
>    static dbus_bool_t need_pid = TRUE;
>    int len;
> +  long sec,usec;
> +  _dbus_get_real_time(&sec, &usec);

space before open parenthesis
Comment 38 Simon McVittie 2016-05-12 18:28:38 UTC
Comment on attachment 123616 [details] [review]
Follow up of commit 5bd460ecf027bb36ffd21d325739bb51df94521f

Review of attachment 123616 [details] [review]:
-----------------------------------------------------------------

This all looks good, and addresses almost all of the issues I had with your previous version.

If you realise one of your patches is not quite right, and I haven't showed any signs of reviewing it yet, it's usually easier for both of us if you just attach the fixed version (with the fixes already squashed in).
Comment 39 Simon McVittie 2016-05-12 18:44:15 UTC
(In reply to Ralf Habacker from comment #21)
> sitter->child_handle may need a mutex because it is written from both
> threads.

Yes. The same mutex should protect have_spawn_errno, spawn_errno, child_status and have_child_status, I think.

You could either allocate a native Win32 mutex with CreateMutex() and put it in the babysitter structure (there's no point in going through the portable abstraction because this is non-portable code anyway), or re-use the global _DBUS_LOCK (sysdeps) which is only used in a couple of places anyway, or introduce a new _DBUS_LOCK (spawn).

> Also deleting the sitter instance in _dbus_babysitter_unref may
> require a mutex to about access to a partially destroyed instance.

If the babysitter thread holds a reference for its entire lifetime, and unrefs it at the end, then I think we're fine without holding a mutex during unref. If the last-unref is in the main thread, then we know the babysitter thread has released its ref, so it isn't using the object any more and hence there's no data race.

Similarly, if the last-unref is in the babysitter thread, then we know the main thread has released *its* ref, so again it can't be using the object any more.

This reasoning relies on refcounting being atomic, but we already need to make that true.
Comment 40 Ralf Habacker 2016-05-12 20:34:01 UTC
Created attachment 123655 [details] [review]
Display thread id in _dbus_verbose to be able to see thread issues (update 1)
Comment 41 Ralf Habacker 2016-05-12 20:34:49 UTC
Created attachment 123656 [details] [review]
Prefix debug messages displayed with _dbus_verbose() with a timestamp (update 1)
Comment 42 Ralf Habacker 2016-05-12 20:35:25 UTC
Created attachment 123657 [details] [review]
On Windows make access to member 'refcount' of struct DBusBabysitter thread save (update 1)
Comment 43 Ralf Habacker 2016-05-12 20:41:25 UTC
(In reply to Simon McVittie from comment #33)

> No, they aren't. The tests that run on Unix are in dbus/dbus-spawn.c, which
> is only used on Unix. The tests that run on Windows are in
> dbus/dbus-spawn-win.c, which is only used on Windows. 

why ? 
As far as I know does _dbus_spawn_async_with_babysitter () provide a cross platform api for spawning processes, so there should also be a cross platform set of test cases.
Comment 44 Ralf Habacker 2016-05-12 22:25:34 UTC
(In reply to Simon McVittie from comment #37)
> Comment on attachment 123615 [details] [review] [review]
> Prefix debug messages displayed with _dbus_verbose() with a timestamp.
> 
> Review of attachment 123615 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> I'm a little reluctant to make _dbus_verbose() *even* *more* verbose. It's
> already a sufficiently huge flood of information that I don't normally use
> it, even for debugging...

First i tried to understand the threading issue without any timing information and was not able to see if there are races between the threads (indicated by a few usec's) or timeouts (in msec range). After adding the timestamp I was able to see that there are no timeouts involved.
Comment 45 Ralf Habacker 2016-05-12 23:46:39 UTC
Created attachment 123660 [details] [review]
Eliminates a race condition accessing DBusBabysitter instance at startup of babysitter() on Windows.
Comment 46 Ralf Habacker 2016-05-12 23:47:42 UTC
Created attachment 123661 [details] [review]
Fix memory leak issue in spawn code on Windows.
Comment 47 Ralf Habacker 2016-05-12 23:53:50 UTC
Created attachment 123662 [details] [review]
Fix memory leak issue in spawn code on Windows (update 1)

Merge minor fix.
Comment 48 Simon McVittie 2016-05-13 11:11:24 UTC
Comment on attachment 123660 [details] [review]
Eliminates a race condition accessing DBusBabysitter instance at startup of babysitter() on Windows.

Review of attachment 123660 [details] [review]:
-----------------------------------------------------------------

Looks good
Comment 49 Simon McVittie 2016-05-13 11:12:32 UTC
Comment on attachment 123657 [details] [review]
On Windows make access to member 'refcount' of struct DBusBabysitter thread save (update 1)

Review of attachment 123657 [details] [review]:
-----------------------------------------------------------------

Looks good. Please amend the commit message to say "thread safe" instead of "thread save".
Comment 50 Ralf Habacker 2016-05-13 11:27:55 UTC
(In reply to Simon McVittie from comment #48)
> Comment on attachment 123660 [details] [review] [review]
> Eliminates a race condition accessing DBusBabysitter instance at startup of
> babysitter() on Windows.
> 
> Review of attachment 123660 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> Looks good

According to https://bugs.freedesktop.org/show_bug.cgi?id=95160#c16 I would say this is for dbus-1.10. Are I'm correct ? Same for the remaining patches ?
Comment 51 Simon McVittie 2016-05-13 11:28:50 UTC
(In reply to Ralf Habacker from comment #44)
> First i tried to understand the threading issue without any timing
> information and was not able to see if there are races between the threads
> (indicated by a few usec's) or timeouts (in msec range). After adding the
> timestamp I was able to see that there are no timeouts involved.

I'm on the fence about this. I can see that timestamps and thread IDs are occasionally useful, but the majority of time they're just noise (of which we have plenty already).

The rule of thumb with threads is that if you can't prove there isn't a race, then there is a race. Timings being further apart in normal use just means you have to be really unlucky to see it.

(In reply to Ralf Habacker from comment #43)
> why ? 
> As far as I know does _dbus_spawn_async_with_babysitter () provide a cross
> platform api for spawning processes, so there should also be a cross
> platform set of test cases.

Good idea. I didn't write it, but if I had, I would have probably done as you suggest, maybe putting the tests in dbus/dbus-spawn-test.c or something. I'd be happy to review a patch that did that.

This would require exposing _dbus_babysitter_block_for_child_exit() as extern (although it wouldn't need to be a DBUS_PRIVATE_EXPORT if the implementation and tests were compiled into the same module on each platform), and probably adding some similar functions to make it testable.

Looking at the diff between the Unix and Windows implementations, the tests are really quite similar. Apart from some cosmetic differences in how variables are initialized:

* test-segfault is reported as SIGNALED on Unix or EXITED on Windows,
  for reasons we've discussed already on Bug #92721 (your commit dcdb29a1).
  That could just be #ifdef DBUS_WIN, like it is in the other places it's
  relevant.

* Similarly, killing a spawned executable with kill() is SIGNALED on Unix
  or EXITED on Windows.

* The Windows code path says:

  /* Don't run the obnoxious segfault test by default,
   * it's a pain to have to click all those error boxes.
   */

  which looks like the same thing you and yfei fixed on Bug #95155 -
  maybe we can remove that difference now?
Comment 52 Simon McVittie 2016-05-13 11:31:19 UTC
(In reply to Ralf Habacker from comment #50)
> According to https://bugs.freedesktop.org/show_bug.cgi?id=95160#c16 I would
> say this is for dbus-1.10. Are I'm correct ? Same for the remaining patches ?

Attachment #123657 [details] and Attachment #123660 [details] are certainly dbus-1.10 material. Attachment #123662 [details] probably is too, but I haven't reviewed it yet.

The _dbus_verbose() enhancements aren't, even if we do apply them to master.
Comment 53 Simon McVittie 2016-05-13 11:34:03 UTC
Comment on attachment 123662 [details] [review]
Fix memory leak issue in spawn code on Windows (update 1)

Review of attachment 123662 [details] [review]:
-----------------------------------------------------------------

::: dbus/dbus-spawn-win.c
@@ +787,4 @@
>  static void
>  _dbus_babysitter_block_for_child_exit (DBusBabysitter *sitter)
>  {
> +  WaitForSingleObject (sitter->thread_handle, INFINITE);

I think this deserves a comment, perhaps something like

/* The thread terminates after the child does. We want to wait for the thread,
 * not just the child, to avoid data races and ensure that it has freed all
 * its memory. */

Looks good other than that, and also for dbus-1.10.

I think we still need a mutex to protect the child_handle, exit status and other members that are used in both threads.
Comment 54 Simon McVittie 2016-05-13 11:36:37 UTC
Comment on attachment 123655 [details] [review]
Display thread id in _dbus_verbose to be able to see thread issues (update 1)

Review of attachment 123655 [details] [review]:
-----------------------------------------------------------------

I should try this on Linux before we land it, but it looks good in principle.

You're the main user of _dbus_verbose() most of the time (I try to avoid it because it's just too noisy to see what's going on!) so if you find it useful, I suppose we might as well have it.
Comment 55 Simon McVittie 2016-05-13 11:37:17 UTC
Comment on attachment 123656 [details] [review]
Prefix debug messages displayed with _dbus_verbose() with a timestamp (update 1)

Review of attachment 123656 [details] [review]:
-----------------------------------------------------------------

Yeah, if you find this useful, we might as well do it.
Comment 56 Ralf Habacker 2016-05-13 11:42:11 UTC
Comment on attachment 123660 [details] [review]
Eliminates a race condition accessing DBusBabysitter instance at startup of babysitter() on Windows.

recreated for dbus-1.10 because of diverged patch context and committed to dbus-1.10
Comment 57 Ralf Habacker 2016-05-13 11:42:29 UTC
Comment on attachment 123657 [details] [review]
On Windows make access to member 'refcount' of struct DBusBabysitter thread save (update 1)

committed to dbus-1.10 with amend commit message
Comment 58 Ralf Habacker 2016-05-13 11:48:01 UTC
(In reply to Simon McVittie from comment #54)
> (I try to avoid it because it's just too noisy to see what's going on!) 

Because it misses categories and levels to be able to concentrate on the real interesting stuff. May be an extension similiar to https://wiki.winehq.org/Debug_Channels may be useful in this area.
Comment 59 Ralf Habacker 2016-05-13 11:51:26 UTC
(In reply to Ralf Habacker from comment #58)
> (In reply to Simon McVittie from comment #54)
> > (I try to avoid it because it's just too noisy to see what's going on!) 
> 
> Because it misses categories and levels to be able to concentrate on the
> real interesting stuff. May be an extension similiar to
> https://wiki.winehq.org/Debug_Channels may be useful in this area.

For example thread related message could be enabled by using 

DBUS_VERBOSE=+thread

or if they enabled by default suppressed with 

DBUS_VERBOSE=-thread.
Comment 60 Simon McVittie 2016-05-13 12:19:20 UTC
(In reply to Simon McVittie from comment #54)
> I should try this on Linux before we land it, but it looks good in principle.

No, looks like it fails. https://travis-ci.org/d-bus/dbus/jobs/129975913#L920

I think the easiest and nicest fix would be to move _dbus_print_thread() into dbus-sysdeps-pthread.c (Unix code path) and dbus-sysdeps-thread-win.c (Windows code path).
Comment 61 Simon McVittie 2016-05-13 12:38:58 UTC
(In reply to Ralf Habacker from comment #58)
> Because it misses categories and levels to be able to concentrate on the
> real interesting stuff. May be an extension similiar to
> https://wiki.winehq.org/Debug_Channels may be useful in this area.

Hmm, maybe. Adding more complexity isn't particularly appealing, but neither is wading through full _dbus_verbose() logs.

Having said that, the refcount-tracing stuff that I added a while ago is a bit like a simplistic form of debug channels (it's activated with an extra environment variable, since logging each ref/unref would be very, very noisy).

In projects that have fancy logging, I usually find that what I what in practice is a form of logging where I'll see most of the potentially interesting messages (for instance in libdbus, things that happen once per DBusMessage would probably be appropriate), but not the really, really spammy stuff that is completely routine and happens many times per second (like ref/unref and interactions with the main loop). For instance, GLib with G_MESSAGES_DEBUG=all is usually good enough to not bother switching individual libraries' debug messages on/off, but Wine relay tracing is too much.

Perhaps identifying the parts of libdbus that are far too noisy (main loop, sockets), and making only those areas check an extra environment variable like my refcount-tracing does, would be a reasonable effort to usefulness ratio?

That should probably be a separate "enhancement" bug if you're interested in pursuing it.
Comment 62 Ralf Habacker 2016-05-13 13:03:09 UTC
Created attachment 123682 [details] [review]
Display thread id in _dbus_verbose to be able to see thread issues (update 2)
Comment 63 Simon McVittie 2016-05-13 13:13:15 UTC
Comment on attachment 123682 [details] [review]
Display thread id in _dbus_verbose to be able to see thread issues (update 2)

Review of attachment 123682 [details] [review]:
-----------------------------------------------------------------

Looks good, if all flavours of https://travis-ci.org/d-bus/dbus/builds/129993182 pass
Comment 64 Simon McVittie 2016-05-13 13:21:19 UTC
(In reply to Simon McVittie from comment #63)
> Looks good, if all flavours of
> https://travis-ci.org/d-bus/dbus/builds/129993182 pass

They do. Ship it!
Comment 65 Ralf Habacker 2016-05-13 13:29:33 UTC
Comment on attachment 123682 [details] [review]
Display thread id in _dbus_verbose to be able to see thread issues (update 2)

committed to master
Comment 66 Ralf Habacker 2016-05-13 13:34:10 UTC
Comment on attachment 123656 [details] [review]
Prefix debug messages displayed with _dbus_verbose() with a timestamp (update 1)

committed to master
Comment 67 Simon McVittie 2016-05-13 13:56:46 UTC
Created attachment 123695 [details] [review]
dbus-spawn-win: Do not attempt to call child_setup (again)

This reinstates part of commit 420f3474, fixing a mis-merge in
commit 8c0d5980.

---

The build fails when cross-compiling for Windows on travis-ci: https://travis-ci.org/d-bus/dbus/builds/129998634

It will probably fail in any environment that targets Windows. Untested, but I think this should solve it.

Only needed on master.
Comment 68 Ralf Habacker 2016-05-13 14:04:48 UTC
Comment on attachment 123695 [details] [review]
dbus-spawn-win: Do not attempt to call child_setup (again)

thanks, committed to master
Comment 69 Ralf Habacker 2016-05-13 21:51:19 UTC
(In reply to Simon McVittie from comment #64)
> (In reply to Simon McVittie from comment #63)
> > Looks good, if all flavours of
> > https://travis-ci.org/d-bus/dbus/builds/129993182 pass
> 
> They do. Ship it!

Unfortunally the cross mingw32 builds on obs reports an issue not having stderr defined (from https://build.opensuse.org/public/build/home:rhabacker:branches:windows:mingw:win32:unstable/openSUSE_13.2/x86_64/mingw32-dbus-1/_log)

[  169s] dbus-sysdeps-thread-win.c: In function '_dbus_print_thread':
[  169s] dbus-sysdeps-thread-win.c:308:3: warning: implicit declaration of function 'fprintf' [-Wimplicit-function-declaration]
[  169s]    fprintf (stderr, "%lu: 0x%04lx: ", _dbus_pid_for_log (), GetCurrentThreadId ());
[  169s]    ^
[  169s] dbus-sysdeps-thread-win.c:308:3: warning: incompatible implicit declaration of built-in function 'fprintf'
[  169s] dbus-sysdeps-thread-win.c:308:3: note: include '<stdio.h>' or provide a declaration of 'fprintf'
[  169s] dbus-sysdeps-thread-win.c:308:12: error: 'stderr' undeclared (first use in this function)
[  169s]    fprintf (stderr, "%lu: 0x%04lx: ", _dbus_pid_for_log (), GetCurrentThreadId ());
[  169s]             ^
Comment 70 Ralf Habacker 2016-05-14 08:02:57 UTC
Created attachment 123738 [details] [review]
Fix compiling on windows/mingw with automake build system.

fix issue in comment 69
Comment 71 Ralf Habacker 2016-05-14 08:51:25 UTC
Created attachment 123739 [details] [review]
Migrate platform specific spawn tests into cross platform tests.
Comment 72 Ralf Habacker 2016-05-14 08:58:28 UTC
Created attachment 123740 [details] [review]
Migrate platform specific spawn tests into cross platform tests (update 1)

- added missing file dbus-spawn-test.c
- added comment about enabled segfault test.
Comment 73 Ralf Habacker 2016-05-14 09:36:26 UTC
(In reply to Simon McVittie from comment #39)
> (In reply to Ralf Habacker from comment #21)
> > sitter->child_handle may need a mutex because it is written from both
> > threads.
> 
> Yes. The same mutex should protect have_spawn_errno, spawn_errno,
> child_status and have_child_status, I think.

Are I'm right with the assumption that a mutex is required to lock shared write access to a variable ?

child_status, have_child_status, spawn_errno and have_spawn_errno are only written in _dbus_babysitter_new () (non main thread not active) and in the non main thread. The main threads only reads *child_status and *spawn_errno

Also the non-main thread setup child_handle and spawn_errno and *then* the related have_xxx, which looks safe to me.
From that observations I would say now that there is no need for an additional mutex.

While looking at member child_handle there may be an issue in babysitter() 

  sitter->child_handle = spawn_program (sitter->log_name,
					sitter->argv, sitter->envp);
-> child_handle may be -1, wich indicates a spawn error
  PING();
  if (sitter->child_handle == (HANDLE) -1)
    {
      sitter->child_handle = NULL;
-> child_handle is 0

in the main thread child_handle is polled for example by 

_dbus_babysitter_get_child_exited (DBusBabysitter *sitter)
{
  PING();
  return (sitter->child_handle == NULL);
}

which will get an incorrect status.  I will add a related patch.

> If the babysitter thread holds a reference for its entire lifetime, and
> unrefs it at the end, then I think we're fine without holding a mutex during
> unref. If the last-unref is in the main thread, then we know the babysitter
> thread has released its ref, so it isn't using the object any more and hence
> there's no data race.
okay

> Similarly, if the last-unref is in the babysitter thread, then we know the
> main thread has released *its* ref, so again it can't be using the object
> any more.
okay

> This reasoning relies on refcounting being atomic, but we already need to
> make that true.
which has been fixed.
Comment 74 Ralf Habacker 2016-05-14 09:36:52 UTC
Created attachment 123741 [details] [review]
Fix ambiguous setup of DBusBabySitter struct member child_handle on Windows.
Comment 75 Ralf Habacker 2016-05-14 09:43:00 UTC
Created attachment 123742 [details] [review]
Fix memory leak issue in spawn code on Windows (update 2)

- Added note
Comment 76 Ralf Habacker 2016-05-14 23:36:41 UTC
Created attachment 123754 [details] [review]
Migrate platform specific spawn tests into cross platform tests (update 2)

- fix spawn_exit error name
- fix cpp defines indent
Comment 77 Simon McVittie 2016-05-16 09:43:54 UTC
Comment on attachment 123738 [details] [review]
Fix compiling on windows/mingw with automake build system.

Review of attachment 123738 [details] [review]:
-----------------------------------------------------------------

Hmm, I wonder why this succeeded on travis-ci, which uses mingw-w64. Are you using mingw-w64 too, or some other fork of mingw?

::: dbus/dbus-sysdeps-thread-win.c
@@ +28,4 @@
>  #include "dbus-threads.h"
>  #include "dbus-list.h"
>  
> +#include "stdio.h"

Should be <stdio.h>. Always use <> for system headers.

OK to push with that change. I assume this is only for master since it's a fix for a master-only verbose enhancement.
Comment 78 Simon McVittie 2016-05-16 09:45:17 UTC
Comment on attachment 123741 [details] [review]
Fix ambiguous setup of DBusBabySitter struct member child_handle on Windows.

Review of attachment 123741 [details] [review]:
-----------------------------------------------------------------

Fine (and also for 1.10 if you want)
Comment 79 Simon McVittie 2016-05-16 09:46:19 UTC
Comment on attachment 123742 [details] [review]
Fix memory leak issue in spawn code on Windows (update 2)

Review of attachment 123742 [details] [review]:
-----------------------------------------------------------------

Go ahead
Comment 80 Simon McVittie 2016-05-16 10:28:46 UTC
(In reply to Ralf Habacker from comment #73)
> Are I'm right with the assumption that a mutex is required to lock shared
> write access to a variable ?

It's more complicated than that. Any time you change memory in thread A, thread B will not necessarily see that change happen until/unless there is a "memory barrier" that makes thread B drop its cache. Mutexes and atomic operations are examples of memory barriers.

> child_status, have_child_status, spawn_errno and have_spawn_errno are only
> written in _dbus_babysitter_new () (non main thread not active) and in the
> non main thread. The main threads only reads *child_status and *spawn_errno

You can't assume the main thread will ever see them change, unless there's a memory barrier.

> Also the non-main thread setup child_handle and spawn_errno and *then* the
> related have_xxx, which looks safe to me.

This sort of guard isn't enough. If a change to memory does become visible to the other thread, there's no guarantee that other changes to memory will become visible in any particular order. It is possible for the main thread to see have_spawn_errno change to TRUE without also seeing spawn_errno change to the correct value, even though the babysitter thread changed spawn_errno first!

I think you really do need a mutex here.
Comment 81 Simon McVittie 2016-05-16 10:37:55 UTC
Comment on attachment 123754 [details] [review]
Migrate platform specific spawn tests into cross platform tests (update 2)

Review of attachment 123754 [details] [review]:
-----------------------------------------------------------------

::: dbus/Makefile.am
@@ +266,5 @@
>  	dbus-string-util.c			\
>  	dbus-sysdeps-util.c			\
>  	dbus-test.c				\
> +	dbus-test.h                             \
> +	dbus-spawn-test.c

Either dbus-spawn-test.c should all be #ifdef DBUS_ENABLE_EMBEDDED_TESTS like dbus-test.c is, or this Makefile.am change should instead be

if DBUS_ENABLE_EMBEDDED_TESTS
DBUS_UTIL_SOURCES += dbus-spawn-test.c
endif

equivalent to how you've hooked it up in CMake. Choose one - either will do.
Comment 82 Ralf Habacker 2016-05-16 12:18:29 UTC
(In reply to Simon McVittie from comment #77)
> Comment on attachment 123738 [details] [review] [review]
> Fix compiling on windows/mingw with automake build system.
> 
> Review of attachment 123738 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> Hmm, I wonder why this succeeded on travis-ci, which uses mingw-w64. Are you
> using mingw-w64 too, or some other fork of mingw?

The reason is that travis mingw builds do *not* enable DBUS_ENABLE_VERBOSE_MODE, which hides such errors.
 
> Should be <stdio.h>. Always use <> for system headers.
I will fix it.
> OK to push with that change. I assume this is only for master since it's a
> fix for a master-only verbose enhancement.
yes.
Comment 83 Ralf Habacker 2016-05-16 12:19:17 UTC
Comment on attachment 123742 [details] [review]
Fix memory leak issue in spawn code on Windows (update 2)

committed to master
Comment 84 Ralf Habacker 2016-05-16 12:19:42 UTC
Comment on attachment 123741 [details] [review]
Fix ambiguous setup of DBusBabySitter struct member child_handle on Windows.

committed to dbus-1.10 and merged into master
Comment 85 Ralf Habacker 2016-05-16 12:20:17 UTC
Comment on attachment 123738 [details] [review]
Fix compiling on windows/mingw with automake build system.

fixed to use <> tag and committed to master
Comment 86 Simon McVittie 2016-05-16 12:55:52 UTC
Created attachment 123779 [details] [review]
travis: compile with mingw in both production and debug  flavours

This should detect anything that is only a build failure when embedded
tests or verbose mode are enabled, such as the missing include fixed
in commit 4858faf. I'm not enabling embedded tests and verbose mode
orthogonally because we don't want CI builds to take too long.

---

(In reply to Ralf Habacker from comment #82)
> The reason is that travis mingw builds do *not* enable
> DBUS_ENABLE_VERBOSE_MODE, which hides such errors.

Good point. This only matters for bugs in verbose mode or embedded tests if they are also platform-specific, but it's probably still worth testing.

Testing this on travis-ci (for master and dbus-1.10-ci, but not dbus-1.10 - dbus-1.10-ci consists of dbus-1.10, plus cherry-picked CI stuff from master, and is tested but never released). Hopefully it'll work first time...
Comment 87 Simon McVittie 2016-05-16 13:33:35 UTC
Created attachment 123782 [details] [review]
test-segfault: add missing include of <stdio.h> on Windows

Needed for fprintf (stderr, ...).

---

Did this work with CMake? If it did, I'm not sure why?

Detected by Attachment #123779 [details] on master and dbus-1.10-ci. I'll apply it to master and dbus-1.10.
Comment 88 Ralf Habacker 2016-05-16 13:35:58 UTC
Created attachment 123783 [details] [review]
Migrate platform specific spawn tests into cross platform tests.

Also enable segfault checks on windows because the reason why it
has been disabled has been fixed with bug #95155.

Bug:
Comment 89 Ralf Habacker 2016-05-16 13:37:13 UTC
Comment on attachment 123754 [details] [review]
Migrate platform specific spawn tests into cross platform tests (update 2)

superseeded
Comment 90 Simon McVittie 2016-05-16 15:21:26 UTC
(In reply to Simon McVittie from comment #86)
> Testing this on travis-ci (for master and dbus-1.10-ci, but not dbus-1.10 -
> dbus-1.10-ci consists of dbus-1.10, plus cherry-picked CI stuff from master,
> and is tested but never released).

It failed, but additionally adding Attachment #123782 [details] makes it pass.
Comment 91 Simon McVittie 2016-05-16 15:23:42 UTC
Comment on attachment 123783 [details] [review]
Migrate platform specific spawn tests into cross platform tests.

Review of attachment 123783 [details] [review]:
-----------------------------------------------------------------

Go ahead, for master only.
Comment 92 Ralf Habacker 2016-05-16 19:35:12 UTC
Comment on attachment 123783 [details] [review]
Migrate platform specific spawn tests into cross platform tests.

committed to master
Comment 93 Ralf Habacker 2016-05-16 19:36:26 UTC
Comment on attachment 123782 [details] [review]
test-segfault: add missing include of <stdio.h> on Windows

committed to master
Comment 94 Ralf Habacker 2016-05-16 19:58:13 UTC
Comment on attachment 123779 [details] [review]
travis: compile with mingw in both production and debug  flavours

Review of attachment 123779 [details] [review]:
-----------------------------------------------------------------

Would it not be required to enable the same variant for cmake too ?
Comment 95 Simon McVittie 2016-05-17 11:50:35 UTC
(In reply to Ralf Habacker from comment #94)
> Would it not be required to enable the same variant for cmake too ?

It's a balance: if we have too many builds, travis-ci takes forever to build them all and we don't get prompt feedback. This is the same reason I don't build every combination of (verbose or not) x (checks or not) x (assertions or not) with Autotools.

The Autotools build seems to be the one with the finest-grained checks for functionality and the most careful/cautious build setup, which means it's the most likely to break in unforeseen configurations: for instance, the reason we didn't catch those missing inclusions of stdio.h appears to be that the CMake build unconditionally includes stdio.h in config.h.

The Autotools build is also the only one I'm interested in supporting as a "first class" configuration on Unix. The major Linux/*BSD distributions have a lot of Autotools expertise, and are particularly keen on being able to make changes centrally that affect all Autotools packages (compiler hardening options, capturing debug symbols, cross-compiling, relocating libraries for multilib/multiarch, that sort of thing). You've made the CMake build support a large subset of what the Autotools build does, which is great, but for unusual situations or configurations our answer should still be "use Autotools, it can already do this".
Comment 96 Ralf Habacker 2016-05-17 14:39:24 UTC
Created attachment 123830 [details] [review]
Use mutex to protect access to DBusBabysitter struct members on Windows.
Comment 97 Ralf Habacker 2016-05-18 13:03:31 UTC
(In reply to Simon McVittie from comment #95)

> for instance, the
> reason we didn't catch those missing inclusions of stdio.h appears to be
> that the CMake build unconditionally includes stdio.h in config.h.

indeed: from generated config.h

#if defined(_WIN32) || defined(_WIN64)
// mingw mode_t
# ifdef HAVE_STDIO_H
#  include <stdio.h>
# endif
...

Seems to be an additional reason to work on bug 85418.
Comment 98 Ralf Habacker 2016-05-18 13:03:41 UTC
Comment on attachment 123779 [details] [review]
travis: compile with mingw in both production and debug  flavours

Review of attachment 123779 [details] [review]:
-----------------------------------------------------------------

looks good
Comment 99 Simon McVittie 2016-05-18 16:04:59 UTC
Comment on attachment 123830 [details] [review]
Use mutex to protect access to DBusBabysitter struct members on Windows.

Review of attachment 123830 [details] [review]:
-----------------------------------------------------------------

I would prefer to use CreateMutex(), WaitForSingleObject(), ReleaseMutex() and CloseHandle() directly, because this whole file is Windows-specific *anyway*, and if we do that then we don't need to export the _dbus_cmutex stuff from libdbus.

libdbus-internal.la is specifically not thread-safe in general, so dbus-spawn can only be correct if its use of a thread "behind the scenes" is completely hidden from the API. Not exposing the mutex symbols as API, even privately, seems a good way to remind us of that.

I haven't done a thorough review looking for error-handling code paths that forget to release the mutex, because I wanted to raise design points first.

::: dbus/dbus-spawn-win.c
@@ +111,4 @@
>    if (sitter == NULL)
>      return NULL;
>  
> +  _dbus_cmutex_new_at_location (&sitter->mutex);

Check for sitter->mutex != NULL. If it's NULL, you need to free the babysitter and return NULL. (You probably can't use _dbus_babysitter_unref() here because the babysitter isn't fully allocated yet.)

@@ +301,3 @@
>    PING();
> +  _dbus_cmutex_lock (sitter->mutex);
> +  state = sitter->child_handle == NULL;

For better clarity:

state = (sitter->child_handle == NULL);

This probably also deserves a comment mentioning that sitter->child_handle only changes from non-NULL to NULL, never the other way, so it's OK for a caller to assume that after get_child_exited() returns TRUE, the other information about the exit status remains valid.

If that wasn't true, this would be another data race, and we'd have to combine _dbus_babysitter_get_child_exited(), _dbus_babysitter_set_child_exit_error(), etc. into a single function that fetches all the fields while holding the same lock.

For master, it might still be a worthwhile follow-up to combine those APIs into a single function that only needs to take and release the lock once, something like:

  /*
   * If the child is still running, return FALSE without setting error or exit_status.
   *
   * If the child has exited, set error and return TRUE. If the error is
   * DBUS_ERROR_SPAWN_CHILD_EXITED, also place the exit status in exit_status.
   */
  dbus_bool_t _dbus_babysitter_get_child_exited (int *exit_status, DBusError *error);

(We'd also need to combine their Unix equivalents, but that's easy, because Unix dbus-spawn is single-threaded.)

@@ +399,5 @@
> +                                           toggled_function,
> +                                           data,
> +                                           free_data_function);
> +  _dbus_cmutex_unlock (sitter->mutex);
> +  return result;

The DBusWatch stuff is specifically not thread-safe, so this can only validly be called from the main thread anyway, and protecting sitter->watches with a mutex seems pointless.

Hmm... if the DBusBabysitter's last-unref occurs in the non-main thread, that would be Bad, precisely because DBusWatchList isn't thread-safe. Can we ensure that the last-unref occurs in the main thread instead? I think that could be done by removing the unref from the end of babysitter(), and moving it to just before handle_watch() returns, where it will be triggered by writing to sitter->socket_to_main.sock?

As a bonus, that would mean we wouldn't need to take the mutex in unref(), because we would know that unref() is only ever happening in the main thread. We could probably even revert the change that made the refcounting atomic, because all refcounting would occur in the main thread. Effectively, the main thread would hold one extra reference on behalf of the non-main thread for (at least) the duration between the non-main thread starting, and the non-main thread writing to its socket to notify the main thread that it's finishing.

All this complexity just because you can't select() on a child process :-(

(You can't portably do that on Unix either, unfortunately)

@@ +645,5 @@
>        DWORD status;
>  
>        PING();
>        // wait until process finished
> +      WaitForSingleObject (handle, INFINITE);

(This is the key "slow" operation that justifies having a non-main thread at all.)

@@ +667,2 @@
>    send (sitter->socket_to_main.sock, " ", 1, 0);
> +  _dbus_cmutex_unlock (sitter->mutex);

If we move the unref to the main thread like I suggested above, then this becomes problematic, because as soon as we've sent the 1-byte "message", the main thread can legitimately free the mutex!

I think it might be better like this:

  SOCKET sock;

  _dbus_cmutex_lock (...);

  sock = sitter->socket_to_main.sock;
  ... /* other things that also require the lock, maybe the initialization bit
       * near the beginning */

  _dbus_cmutex_unlock (...);

  ...

  send (sock, " ", 1, 0);

We know that the socket (and mutex) won't be freed for as long as the main thread holds a ref on our behalf, i.e. until handle_watch() goes off, which can't happen until after we have written this byte into the socket.

The socket could be freed at any time after we have sent the byte into it, so we must send the byte and then never use it again (by returning from the thread).

@@ +703,1 @@
>    sitter->log_name = _dbus_strdup (log_name);

Is log_name ever used by the non-main thread? If not, it doesn't need to be protected by the mutex.

It would probably be helpful, at least on master, to arrange the struct members into groups: protected by the mutex, used by main thread only, and used by non-main thread only.

You don't actually need to lock the mutex for accesses to other struct members until just before the other thread is created, because until that point, there's no other thread that could race with the main thread. I believe creating a thread is already a memory barrier (because how else could it work?) but if you're concerned, lock the mutex just before creating the other thread, so the first thing it will do is block on its own attempt to lock the mutex.

@@ +775,5 @@
>      }
>  
>    PING();
> +  _dbus_cmutex_lock (sitter->mutex);
> +  event = sitter->start_sync_event;

What's the purpose of this start_sync_event anyway? It seems to be there to ensure that the non-main thread has attempted to call spawn_program() before the main thread passes this point... but if we're doing that, I don't see why we don't just spawn_program() from the main thread? That might be a useful simplification for master (probably best not done in 1.10).

If I'm reading this correctly, then the only things we need to do in the non-main thread are waiting for the child handle, maybe getting its exit code (if GetExitCodeProcess() can be slow), and writing to socket_to_main to report back to the main thread. Everything else could happen in either thread, and if as much as possible happens in the main thread, that'll make it all a lot simpler.

@@ +807,3 @@
>    sitter->finished_cb = finished;
>    sitter->finished_data = user_data;
> +  _dbus_cmutex_unlock (sitter->mutex);

finished_cb and finished_data don't need to be protected by the mutex. They are only used in:

* _dbus_babysitter_set_result_function(), which does not need to be thread-safe, because the -util library is specifically not thread-safe in general, so we can assume that library users only call it in the main thread

* handle_watch(), which is called by the DBusLoop in the main thread
Comment 100 Ralf Habacker 2016-05-19 13:29:20 UTC
Created attachment 123913 [details] [review]
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter().

This patch does not not include the race fix mentioned in comment 99
Comment 101 Ralf Habacker 2016-05-20 04:22:08 UTC
Comment on attachment 123783 [details] [review]
Migrate platform specific spawn tests into cross platform tests.

committed to master
Comment 102 Ralf Habacker 2016-05-20 06:36:37 UTC
(In reply to Simon McVittie from comment #99)
> Comment on attachment 123830 [details] [review] [review]
> Use mutex to protect access to DBusBabysitter struct members on Windows.
> 
> Review of attachment 123830 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> I would prefer to use CreateMutex(), WaitForSingleObject(), ReleaseMutex()
> and CloseHandle() directly, because this whole file is Windows-specific
> *anyway*, and if we do that then we don't need to export the _dbus_cmutex
> stuff from libdbus.
> 
> libdbus-internal.la is specifically not thread-safe in general, so
> dbus-spawn can only be correct if its use of a thread "behind the scenes" is
> completely hidden from the API. Not exposing the mutex symbols as API, even
> privately, seems a good way to remind us of that.
> 
> I haven't done a thorough review looking for error-handling code paths that
> forget to release the mutex, because I wanted to raise design points first.
> 
> ::: dbus/dbus-spawn-win.c
> @@ +111,4 @@
> >    if (sitter == NULL)
> >      return NULL;
> >  
> > +  _dbus_cmutex_new_at_location (&sitter->mutex);
> 
> Check for sitter->mutex != NULL. If it's NULL, you need to free the
> babysitter and return NULL. (You probably can't use _dbus_babysitter_unref()
> here because the babysitter isn't fully allocated yet.)
fixed

> 
> @@ +301,3 @@
> >    PING();
> > +  _dbus_cmutex_lock (sitter->mutex);
> > +  state = sitter->child_handle == NULL;
> 
> For better clarity:
> 
> state = (sitter->child_handle == NULL);

fixed 

> This probably also deserves a comment mentioning that sitter->child_handle
> only changes from non-NULL to NULL, never the other way, so it's OK for a
> caller to assume that after get_child_exited() returns TRUE, the other
> information about the exit status remains valid.
> 
> If that wasn't true, this would be another data race, and we'd have to
> combine _dbus_babysitter_get_child_exited(),
> _dbus_babysitter_set_child_exit_error(), etc. into a single function that
> fetches all the fields while holding the same lock.
> 
> For master, it might still be a worthwhile follow-up to combine those APIs
> into a single function that only needs to take and release the lock once,
> something like:
> 
>   /*
>    * If the child is still running, return FALSE without setting error or
> exit_status.
>    *
>    * If the child has exited, set error and return TRUE. If the error is
>    * DBUS_ERROR_SPAWN_CHILD_EXITED, also place the exit status in
> exit_status.
>    */
>   dbus_bool_t _dbus_babysitter_get_child_exited (int *exit_status, DBusError
> *error);
> 
> (We'd also need to combine their Unix equivalents, but that's easy, because
> Unix dbus-spawn is single-threaded.)

will be fixed in an additional patch

> @@ +399,5 @@
> > +                                           toggled_function,
> > +                                           data,
> > +                                           free_data_function);
> > +  _dbus_cmutex_unlock (sitter->mutex);
> > +  return result;
> 
> The DBusWatch stuff is specifically not thread-safe, so this can only
> validly be called from the main thread anyway, and protecting
> sitter->watches with a mutex seems pointless.
> 
> Hmm... if the DBusBabysitter's last-unref occurs in the non-main thread,
> that would be Bad, precisely because DBusWatchList isn't thread-safe. Can we
> ensure that the last-unref occurs in the main thread instead? I think that
> could be done by removing the unref from the end of babysitter(), and moving
> it to just before handle_watch() returns, where it will be triggered by
> writing to sitter->socket_to_main.sock?

need to be check with updated patch again
 
> As a bonus, that would mean we wouldn't need to take the mutex in unref(),
> because we would know that unref() is only ever happening in the main
> thread. We could probably even revert the change that made the refcounting
> atomic, because all refcounting would occur in the main thread. Effectively,
> the main thread would hold one extra reference on behalf of the non-main
> thread for (at least) the duration between the non-main thread starting, and
> the non-main thread writing to its socket to notify the main thread that
> it's finishing.
> 
> All this complexity just because you can't select() on a child process :-(
> 
> (You can't portably do that on Unix either, unfortunately)
> 
> @@ +645,5 @@
> >        DWORD status;
> >  
> >        PING();
> >        // wait until process finished
> > +      WaitForSingleObject (handle, INFINITE);
> 
> (This is the key "slow" operation that justifies having a non-main thread at
> all.)
> 
> @@ +667,2 @@
> >    send (sitter->socket_to_main.sock, " ", 1, 0);
> > +  _dbus_cmutex_unlock (sitter->mutex);
> 
> If we move the unref to the main thread like I suggested above, then this
> becomes problematic, because as soon as we've sent the 1-byte "message", the
> main thread can legitimately free the mutex!
> 
> I think it might be better like this:
> 
>   SOCKET sock;
> 
>   _dbus_cmutex_lock (...);
> 
>   sock = sitter->socket_to_main.sock;
>   ... /* other things that also require the lock, maybe the initialization
> bit
>        * near the beginning */
> 
>   _dbus_cmutex_unlock (...);
> 
>   ...
> 
>   send (sock, " ", 1, 0);
>
> We know that the socket (and mutex) won't be freed for as long as the main
> thread holds a ref on our behalf, i.e. until handle_watch() goes off, which
> can't happen until after we have written this byte into the socket.
> 
> The socket could be freed at any time after we have sent the byte into it,
> so we must send the byte and then never use it again (by returning from the
> thread).

will be fixed in update 2.
 
> @@ +703,1 @@
> >    sitter->log_name = _dbus_strdup (log_name);
> 
> Is log_name ever used by the non-main thread? If not, it doesn't need to be
> protected by the mutex.

it is used for dbus_verbose purpose
 
> It would probably be helpful, at least on master, to arrange the struct
> members into groups: protected by the mutex, used by main thread only, and
> used by non-main thread only.

fixed

> You don't actually need to lock the mutex for accesses to other struct
> members until just before the other thread is created, because until that
> point, there's no other thread that could race with the main thread. I
> believe creating a thread is already a memory barrier (because how else
> could it work?) but if you're concerned, lock the mutex just before creating
> the other thread, so the first thing it will do is block on its own attempt
> to lock the mutex.
> 
> @@ +775,5 @@
> >      }
> >  
> >    PING();
> > +  _dbus_cmutex_lock (sitter->mutex);
> > +  event = sitter->start_sync_event;
> 
> What's the purpose of this start_sync_event anyway? It seems to be there to
> ensure that the non-main thread has attempted to call spawn_program() before
> the main thread passes this point... but if we're doing that, I don't see
> why we don't just spawn_program() from the main thread? That might be a
> useful simplification for master (probably best not done in 1.10).
> 
> If I'm reading this correctly, then the only things we need to do in the
> non-main thread are waiting for the child handle, maybe getting its exit
> code (if GetExitCodeProcess() can be slow), and writing to socket_to_main to
> report back to the main thread. Everything else could happen in either
> thread, and if as much as possible happens in the main thread, that'll make
> it all a lot simpler.

uploaded patch contains this refactoring, which may solve several raised issues, but may also introduce additional issues.
 
> @@ +807,3 @@
> >    sitter->finished_cb = finished;
> >    sitter->finished_data = user_data;
> > +  _dbus_cmutex_unlock (sitter->mutex);
> 
> finished_cb and finished_data don't need to be protected by the mutex. 

I oversaw that in the first patch will be fixed in an update.
Comment 103 Ralf Habacker 2016-05-20 06:46:00 UTC
Created attachment 123936 [details] [review]
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter() (update 1)

- move mutex unlock up in handle_watch ()
- remove mutex usage in _dbus_babysitter_set_result_function ()
- use local sock variable in babysitter ()
Comment 104 Simon McVittie 2016-07-11 11:02:02 UTC
Comment on attachment 123936 [details] [review]
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter() (update 1)

Review of attachment 123936 [details] [review]:
-----------------------------------------------------------------

This is nearly there. I think we'll end up with much more robust subprocess-starting as a result of these changes.

::: dbus/dbus-spawn-win.c
@@ +55,4 @@
>  #include <process.h>
>  #endif
>  
> +/* Local implementations of win related mutex functions to be able to track them */

I'd prefer these just inlined entirely: "HANDLE mutex;" and using the Windows primitives directly. There's no point in using a portable abstraction in non-portable code.

This is OK if you have a strong reason to prefer it though.

@@ +93,2 @@
>  
> +    /* used by both threads and guarded by a mutex */

"guarded by @mutex" or "guarded by the mutex". It is specifically the mutex named "mutex" that guards these, and that's important information.

@@ +269,4 @@
>    PING();
> +  _dbus_win_cmutex_lock (sitter->mutex);
> +  handle = sitter->child_handle;
> +  _dbus_win_cmutex_unlock (sitter->mutex);

Move the unlock down to the end of the function, and use the "goto out" pattern:

  _dbus_win_cmutex_lock (sitter->mutex);

  if (sitter->child_handle == NULL)
    goto out;

  PING();
  TerminateProcess (sitter->child_handle, 12345);

out:
  _dbus_win_cmutex_unlock (sitter->mutex);

Otherwise, there's nothing to stop the other thread from closing the child_handle after you have unlocked the mutex, but before you have terminated the child_handle.

In general, this pattern

  lock ();
  local = struct->nonlocal;
  unlock ();

  do_something (local);

is usually wrong, and should be replaced by this pattern

  lock ();
  do_something (struct->nonlocal);
  unlock ();

although sometimes this pattern can be OK:

  lock ();
  local = ref (struct->nonlocal);
  unlock ();

  do_something (local);
  unref (local);

@@ +289,5 @@
>    PING();
> +  _dbus_win_cmutex_lock (sitter->mutex);
> +  state = (sitter->child_handle == NULL);
> +  _dbus_win_cmutex_unlock (sitter->mutex);
> +  return state;

I think this one is OK, because the child_handle never changes from NULL back to non-NULL.

@@ +312,2 @@
>    if (!_dbus_babysitter_get_child_exited (sitter))
>      _dbus_assert_not_reached ("Child has not exited");

It would be better to move the lock() call above here, and check for sitter->child_handle != NULL directly instead of calling get_child_exited(), so that the whole function call happens under one lock. It's usually wrong to have this pattern:

   lock ();
   check precondition;
   unlock ();

   lock ();
   do things;
   unlock ();

because the precondition could theoretically have changed while we were unlocked.

@@ +624,5 @@
>  
> +  CloseHandle (sitter->child_handle);
> +  sitter->child_handle = NULL;
> +  sock = sitter->socket_to_main.sock;
> +  _dbus_win_cmutex_unlock (sitter->mutex);

Do the unlock after the send here, same reasoning as above.

@@ +630,2 @@
>  
>    _dbus_babysitter_unref (sitter);

I would still prefer to have this change in addition:

> I think that
> could be done by removing the unref from the end of babysitter(), and moving
> it to just before handle_watch() returns, where it will be triggered by
> writing to sitter->socket_to_main.sock?
Comment 105 Ralf Habacker 2016-07-14 17:43:36 UTC
Created attachment 125072 [details] [review]
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter().

The child process is now started from the main thread. Afterwards a
second thread is created to provide a blocking free child exit detection.

Also make the implementation thread safe by using a mutex to protect access
to some DBusBabysitter struct members.

Bug:
Comment 106 Ralf Habacker 2016-07-14 17:52:02 UTC
(In reply to Simon McVittie from comment #104)
> Comment on attachment 123936 [details] [review] [review]
> Simplify windows implementation of
> _dbus_spawn_async_with_babysittery_sitter() (update 1)
> 
> Review of attachment 123936 [details] [review] [review]:
> -----------------------------------------------------------------
> 
> > +/* Local implementations of win related mutex functions to be able to track them */
> 
> I'd prefer these just inlined entirely: "HANDLE mutex;" and using the
> Windows primitives directly. There's no point in using a portable
> abstraction in non-portable code.

removed for now. May be better to use an abstraction later in whole windows code for easier reading.
For me it is much easier to read 

 _dbus_win_mutex_unlock (sitter->mutex);

as 

 WaitForSingleObject (sitter->mutex, INFINITE);


> @@ +93,2 @@
> >  
> > +    /* used by both threads and guarded by a mutex */
> 
> "guarded by @mutex" or "guarded by the mutex". It is specifically the mutex
> named "mutex" that guards these, and that's important information.

fixed

> @@ +269,4 @@
> >    PING();
> > +  _dbus_win_cmutex_lock (sitter->mutex);
> > +  handle = sitter->child_handle;
> > +  _dbus_win_cmutex_unlock (sitter->mutex);
> 
> Move the unlock down to the end of the function, and use the "goto out"
> pattern:

fixed

>   _dbus_win_cmutex_lock (sitter->mutex);
> 
>   if (sitter->child_handle == NULL)
>     goto out;
> 
>   PING();
>   TerminateProcess (sitter->child_handle, 12345);
> 
> out:
>   _dbus_win_cmutex_unlock (sitter->mutex);
>

fixed 

> Otherwise, there's nothing to stop the other thread from closing the
> child_handle after you have unlocked the mutex, but before you have
> terminated the child_handle.
> 
> In general, this pattern
> 
>   lock ();
>   local = struct->nonlocal;
>   unlock ();
> 
>   do_something (local);
> 
> is usually wrong, and should be replaced by this pattern
> 
>   lock ();
>   do_something (struct->nonlocal);
>   unlock ();
> 
> although sometimes this pattern can be OK:
> 
>   lock ();
>   local = ref (struct->nonlocal);
>   unlock ();
> 
>   do_something (local);
>   unref (local);

Thanks for this explanation.
 
> @@ +289,5 @@
> >    PING();
> > +  _dbus_win_cmutex_lock (sitter->mutex);
> > +  state = (sitter->child_handle == NULL);
> > +  _dbus_win_cmutex_unlock (sitter->mutex);
> > +  return state;
> 
> I think this one is OK, because the child_handle never changes from NULL
> back to non-NULL.
> 
> @@ +312,2 @@
> >    if (!_dbus_babysitter_get_child_exited (sitter))
> >      _dbus_assert_not_reached ("Child has not exited");
> 
> It would be better to move the lock() call above here, and check for
> sitter->child_handle != NULL directly instead of calling get_child_exited(),
> so that the whole function call happens under one lock.

fixed 

> @@ +624,5 @@
> >  
> > +  CloseHandle (sitter->child_handle);
> > +  sitter->child_handle = NULL;
> > +  sock = sitter->socket_to_main.sock;
> > +  _dbus_win_cmutex_unlock (sitter->mutex);
> 
> Do the unlock after the send here, same reasoning as above.

fixed 

> @@ +630,2 @@
> >  
> >    _dbus_babysitter_unref (sitter);
> 
> I would still prefer to have this change in addition:
> 
> > I think that
> > could be done by removing the unref from the end of babysitter(), and moving
> > it to just before handle_watch() returns, where it will be triggered by
> > writing to sitter->socket_to_main.sock?

Sorry, I will not fix, because then test-dbus complains about memory leaks we want to fix.
Comment 107 Ralf Habacker 2016-07-14 18:07:42 UTC
Created attachment 125075 [details] [review]
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter().

The child process is now started from the main thread. Afterwards a
second thread is created to provide a blocking free child exit detection.

Also make the implementation thread safe by using a mutex to protect access
to some DBusBabysitter struct members.

Bug:
Comment 108 Ralf Habacker 2016-07-14 18:09:13 UTC
Comment on attachment 125075 [details] [review]
Simplify windows implementation of _dbus_spawn_async_with_babysittery_sitter().

- fix usage of DBusCMutex type
Comment 109 Ralf Habacker 2016-07-23 07:40:43 UTC
(In reply to Ralf Habacker from comment #106)


> > I would still prefer to have this change in addition:
> > 
> > > I think that
> > > could be done by removing the unref from the end of babysitter(), and moving
> > > it to just before handle_watch() returns, where it will be triggered by
> > > writing to sitter->socket_to_main.sock?

To be sure I do not misunderstand your request: You want me to move the unref call to the end of handle_watch() or to move it to which location ? 

The current location of the unref is after writing to the mentioned socket.

Moving the unref call to handle_watch() reintroduce the memory leak.
Comment 110 Simon McVittie 2016-07-23 09:55:31 UTC
(In reply to Ralf Habacker from comment #109)
> To be sure I do not misunderstand your request: You want me to move the
> unref call to the end of handle_watch() or to move it to which location ? 

It's probably easiest if I write a patch for you to test.
Comment 111 GitLab Migration User 2018-10-12 21:27:42 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/dbus/dbus/issues/149.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.