37381 – client-types.py appears to have a race

Bug 37381 - client-types.py appears to have a race

Summary: client-types.py appears to have a race

Status:	RESOLVED MOVED

Alias:	None

Product:	Telepathy
Classification:	Unclassified
Component:	gabble (show other bugs)
Version:	git master
Hardware:	Other All

Importance:	medium normal
Assignee:	Telepathy bugs list
QA Contact:	Telepathy bugs list

URL:	http://cgit.collabora.com/git/user/al...
Whiteboard:
Keywords:	patch

Duplicates (1):	44016 (view as bug list)
Depends on:
Blocks:

Reported:	2011-05-19 10:39 UTC by Will Thompson
Modified:	2019-12-03 19:52 UTC (History)
CC List:	5 users (show)

See Also:
i915 platform:
i915 features:

Attachments
bug37381-pingpong.patch (2.58 KB, patch) 2012-02-13 10:43 UTC, Alban Crequy	Details \| Splinter Review
client-types test: avoid a race (942 bytes, patch) 2013-09-11 12:02 UTC, Simon McVittie	Details \| Splinter Review
Presence: use a sequence number, not a timestamp, to test "most recent" (14.44 KB, patch) 2013-09-11 12:03 UTC, Simon McVittie	Details \| Splinter Review
View All

Description Will Thompson 2011-05-19 10:39:26 UTC

Today I pushed a trivial patch to a test to Gabble, and the tests failed on 3 out of the 4 build slaves. Two of them were races, but one of them was a genuine failure <http://buildbot.telepathy.im/builders/Gabble%20-%20plugins%20-%20amd64/builds/52>:

Traceback (most recent call last):
  File "/var/lib/buildbot/telepathy/gabble-plugins-amd64/build/tests/twisted/gabbletest.py", line 627, in exec_test_deferred
    fun(queue, bus, conns[0], streams[0])
  File "./client-types.py", line 218, in test2
    assertSameSets(['pc'], types[handle])
KeyError: dbus.UInt32(2L)
FAIL: client-types.py

`tt -r client-types.py` reproduced the issue pretty quickly. It looks like a pretty straight race: contact_online() sends a disco reply but doesn't actually wait around to check that Gabble's actually received it, and immediately afterwards the test asks for handle's client types. So I threw a sync_stream() into the 'not initial' branch of contact_online(). But that still seemed to be racy:

Traceback (most recent call last):
  File "/home/wjt/src/tp/gabble/tests/twisted/gabbletest.py", line 627, in exec_test_deferred
    fun(queue, bus, conns[0], streams[0])
  File "./client-types.py", line 221, in test2
    assertSameSets(['pc'], types[handle])
  File "/home/wjt/src/tp/gabble/tests/twisted/servicetest.py", line 565, in assertSameSets
    pretty(exp_set), pretty(val_set)))
AssertionError: expected contents:
set(['pc'])
got:
set([dbus.String(u'phone')])
FAIL: client-types.py

so then I gave up.

Comment 1 Alban Crequy 2012-02-08 09:46:11 UTC

The first race is fixed by your sync_stream.

The second race is interesting :) It can be reproduced at 100% if you add a time.sleep(2) just before the "phone comes online". The race happens when the 'phone' and the 'pc' come online at different time in second since Epoch.

It is due to src/presence.c aggregate_resources(): the sort criteria to find the best resource has 2 problems:

- it depends on last_available in seconds, and the unit tests cannot guarantee that both resources will come online at the same Epoch-second.

- it sorts both on last_available and on client_type but none of them has a priority over the other, so the result of the sorting will depend on the previous order. We need to decide whether last_available or client_type is more important.

Comment 2 Will Thompson 2012-02-08 09:56:22 UTC

(In reply to comment #1)
> The first race is fixed by your sync_stream.
> 
> The second race is interesting :) It can be reproduced at 100% if you add a
> time.sleep(2) just before the "phone comes online". The race happens when the
> 'phone' and the 'pc' come online at different time in second since Epoch.
> 
> It is due to src/presence.c aggregate_resources(): the sort criteria to find
> the best resource has 2 problems:
> 
> - it depends on last_available in seconds, and the unit tests cannot guarantee
> that both resources will come online at the same Epoch-second.

This could perhaps be addressed by doing something like tests/test-presence.c? It explicitly controls the time at which events appear to occur.

> - it sorts both on last_available and on client_type but none of them has a
> priority over the other, so the result of the sorting will depend on the
> previous order. We need to decide whether last_available or client_type is more
> important.

Comment 3 Alban Crequy 2012-02-09 03:27:44 UTC

*** Bug 44016 has been marked as a duplicate of this bug. ***

Comment 4 Simon McVittie 2012-02-09 06:04:38 UTC

sync_stream, fine.

The other patch I don't like at all. Making the conditional even more giant seems like loss, adding an arbitrary 2-second window to our source code for the benefit of a test even more so.

If we have to, I'd prefer to just not test the "resources differing only by time" case in the client-types test: that test checks that the highest-priority resource has its client type used, and test-presence.c checks that time is used correctly, so it seems safe to not test.

Comment 5 Alban Crequy 2012-02-13 10:43:34 UTC

Created attachment 56982 [details] [review]
bug37381-pingpong.patch

(In reply to comment #4)

> The other patch I don't like at all. Making the conditional even more giant
> seems like loss, adding an arbitrary 2-second window to our source code for
> the benefit of a test even more so.

I can remove the 2-second window.

But I don't like the current conditional because there is no proper priority between the resources and it is dependent on the order of the resources in the linked list.

As an example, the attached test/scenario will pass but I think it should not pass. It adds 10 resources online, and the client type will alternate between 'pc' and 'phone' while the resources are added. And removing one pc will make the client type change from 'pc' to 'phone' although there is still several pcs online.

Comment 6 Simon McVittie 2013-09-11 12:00:54 UTC

Cc Jonny and Sjoerd, contributors to Bug #32139, in the hope that they can shed some light on why the algorithm is the way it is...

The limited resolution of last_available is irritating, but we can fix that by making it a sequence number rather than a timestamp.

(In reply to comment #0)
> If there are multiple clients with the same availability, but different
> client types, then Gabble seems to (sometimes?, always?) use phone as the
> client type.

It picks the "best" resource, and uses its status, status message and client type. That seems right - the problem is the algorithm for "best", which is pretty odd.

The big messy conditional looks wrong, or at least non-deterministic:

      if (best == NULL ||
          r->status > best->status ||
          (r->status == best->status &&
              (r->last_available > best->last_available ||
               r->priority > best->priority)) ||
          (r->client_type & GABBLE_CLIENT_TYPE_PC
              && !(best->client_type & GABBLE_CLIENT_TYPE_PC)))
        best = r;

I'd feel happier about this if it was expressed in the form of a strcmp()-style "is x better than y?" function. In addition to the info in the struct itself, comparison currently depends on an implicit extra piece of info: when a resource is added (goes from unavailable to any online status), it goes to the end of the list; when a resource is updated (goes from any online status to any other online status), it doesn't. If the comparison depending on that is intentional, I'd prefer it to be explicit.

As a result, this:

    - phone and Empathy both offline
    - phone becomes Available
      - phone is selected
    - Empathy becomes Busy
      - Empathy is selected, because its client type is "better"

does not lead to the same result as this:

    - phone and Empathy both offline
    - Empathy becomes Busy
      - Empathy is selected
    - phone becomes Available
      - phone is selected, because its status is "better"

Adding a couple of extra parameters to resource_better_than() could let us use it for both "which resource should I use for chat/calls/FT?" and "which resource should be displayed in the contact list?", with different rules if necessary - but we'd need to agree on what those rules were. Is "PC beats phone" a last-resort tie-breaker, or top priority, for instance?

Things that seem reasonable as a basis for comparison to me include:

* something is better than nothing (unless it has negative priority and we're trying to communicate with it)
* PCs are better (for status display) than phones, according to Bug #32139
* phones are better (for calls) than PCs
* high status (available) is better than low status (away)
* if both are available, the one that transitioned more recently is better
* if both are away/busy, the one that was available more recently is better
* high priority is better than low priority
* older connection is better than newer connection? (or the other way round?)

but the question is how to deal with each pair "x1 > x2 && y1 < y2". The bullet points there match how resource_better_than() behaves (first = most important).

Comment 7 Simon McVittie 2013-09-11 12:02:55 UTC

Created attachment 85624 [details] [review]
client-types test: avoid a race

---

Will noticed this was necessary, but not sufficient, further up this bug.

Comment 8 Simon McVittie 2013-09-11 12:03:17 UTC

Created attachment 85625 [details] [review]
Presence: use a sequence number, not a timestamp, to test  "most recent"

This means the client-types test deterministically gets a different
result: it was previously relying on the PC and the phone coming online
near-simultaneously, which resulted in the PC "winning" the comparison
(except when the test ran close to an integer number of
seconds-since-the-epoch, when it failed).

Now, the phone has a more recent "most recently available"
sequence number, so it "wins". This might be considered to make
<https://bugs.freedesktop.org/show_bug.cgi?id=32139> regress,
depending which tie-breaker you consider to be more important.

The change to prefer_higher_priority_resources() illustrates one
effect of this change: the priority never actually matters
unless neither of the resources under consideration has ever
been "available", because the "most recently available"
sequence number is considered more important.

It's not completely clear whether the "most recently available"
sequence number is intended to tie-break between resources that are
away/busy (display the one that went away more recently), or that are
available (display the one that arrived more recently) - at the
moment it does both.

Comment 9 Simon McVittie 2013-09-11 12:09:56 UTC

(In reply to comment #5)
> As an example, the attached test/scenario will pass but I think it should
> not pass. It adds 10 resources online, and the client type will alternate
> between 'pc' and 'phone' while the resources are added. And removing one pc
> will make the client type change from 'pc' to 'phone' although there is
> still several pcs online.

If we used the same logic as resource_better_than(), the result of this test would change to displaying the "dnd" presence of the most recent PC to come online, ignoring the "available" phones. Would people prefer that?

Bug #32139 suggests that Sjoerd would prefer the "available" phones to be preferred over the "dnd" PCs, which could be achieved by skipping the "strongly prefer client type" step at the beginning of resource_better_than(), and adding a "weakly prefer client type" step after comparing statuses. Where should that go? Before "last available"? After "last available"? After priority? Answers on a postcard.

Comment 10 GitLab Migration User 2019-12-03 19:52:02 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/telepathy/telepathy-gabble/issues/148.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.