Bug 29702

Summary:	Unit tests are racy/buggy
Product:	Telepathy	Reporter:	Olli Salli <ollisal>
Component:	tp-qt	Assignee:	Olli Salli <ollisal>
Status:	RESOLVED FIXED	QA Contact:	Telepathy bugs list <telepathy-bugs>
Severity:	major
Priority:	high	CC:	andrunko
Version:	git master
Hardware:	All
OS:	All
URL:	http://git.collabora.co.uk/?p=user/oggis/telepathy-qt4.git;a=shortlog;h=refs/heads/fix-racy-tests
Whiteboard:
i915 platform:		i915 features:
Bug Depends on:
Bug Blocks:	25126

Description Olli Salli 2010-08-20 06:40:35 UTC

I'm seeing random unit tests failures (some of which are segfaults!) here when running make check, especially with -j but also without.

At least the conference channel test segfaults quite often, I'll try to identify and fix other suspects.

This blocks the next release - I can't really reliably detect any regressions if the tests are this flaky.

Comment 1 Olli Salli 2010-08-22 08:28:30 UTC

The branch fixes races identified thus far, and also adds a script to repeat tests for a given number of times, saving the logs for the failed runs. This is useful for detecting intermittent test failures which stem from race conditions.

Once this is merged, it probably should added to the release procedure to do tools/repeat-tests.sh check 100 and tools/repeat-tests.sh valgrind-check 10 (or more, but it takes a lot of time) to identify potential racy tests. The valgrind variant is useful because the tests run a lot slower under valgrind, so some types of race conditions become much more common due to the altered timings.

Comment 2 Olli Salli 2010-08-22 09:30:09 UTC

The fixes bring about some graciously overlapping rules-of-thumb for
implementing tests:
 - NEVER assume that only one event happens when you call processEvents()
unless further events can't happen without additional requests (otherwise, if
the test executes slower than the service, two service events may happen, which
then get processed with one processEvents() call and any checks depending on
those events after that call are then unpredictable)
 - NEVER implement a while (state != desired) {processEvents()} style loop if
the state can go to desired and back to something else without making further
requests (otherwise Murphy says it will go to desired during a processEvents()
call, and then again to something undesired during that same call, and never go
back to desired afterwards - so we'll just busy-loop infinitely...)
 - NEVER assume that you can reach checking that an event has happened (such as
you asking somebody for their presence) before another simulated event
overwriting it (such as them accepting) happens
 - even more generally, NEVER assume that you can reach starting to wait for a
simulated network event to happen before it already happened (so always check
for it to not have already happened before starting to wait)

Comment 3 Olli Salli 2010-08-23 02:09:06 UTC

Fix merged to master. Unit tests run reliably now. This was verified with repeat-tests.sh "make check" 1000 and repeat-tests.sh "make check-valgrind" 100, both of which completed without a single failure, which is impressive considering the branch lowers delay between simulated network events from 1000ms to just 1ms. This has the consequence of speeding up the tests immensely.

If you, however, notice that running unit tests now occasionally fails / hangs for you, please reopen the bug with the log from the failed run. This sort of fixing can never be 100% - if the races are sufficiently rare to not have happened in the finite amount of test runs I've done (which is several tens of thousands though), I haven't been able to fix them. For example one of the races fixed only happened three times in 10000 test runs.

Will be in 0.3.8.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.