Bug 60851 - server must always be responsive, in particular after SetActiveAddressBooks
Status: RESOLVED MOVED
Alias: None
Product: SyncEvolution
Classification: Unclassified
Component: PIM Manager
Version: unspecified
Hardware: Other
OS: All
Importance: highest major
Assignee: SyncEvolution Community
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 55918
Reported: 2013-02-14 17:13 UTC by Patrick Ohly
Modified: 2018-10-13 12:42 UTC

See Also:


Attachments
2000 contacts, folks 7f1fd0dd0, generating contacts one-by-one while folks runs (8.15 KB, text/plain)
2013-03-26 13:47 UTC, Patrick Ohly
2000 contacts, folks with patch, generating contacts one-by-one while folks runs (45.30 KB, text/plain)
2013-03-26 13:48 UTC, Patrick Ohly
2000 contacts, folks 7f1fd0dd0, generating contacts before folks runs (8.05 KB, text/plain)
2013-03-26 13:48 UTC, Patrick Ohly
2000 contacts, folks with patch, generating contacts before folks runs (24.98 KB, text/plain)
2013-03-26 13:49 UTC, Patrick Ohly

Description Patrick Ohly 2013-02-14 17:13:54 UTC
Need to write a test which regularly pings the server while a reconfiguration of the active address books is running, and improve the performance of that reconfiguration.

A related question is whether SetActiveAddressBooks() should return before the work is done (currently it does), and what queries should return while the reconfiguration runs (currently they may return data that belongs to disabled address books; that data will be removed shortly, for some definition of "shortly").
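
For illustration, the ping loop could look roughly like this - a minimal sketch with dbus-python, assuming the org._01.pim.contacts well-known bus name and the GetAllPeers() method visible in the logs below; the 0.2s threshold and iteration count are arbitrary, not the actual test code:

import time
import dbus

bus = dbus.SessionBus()
manager = dbus.Interface(
    bus.get_object('org._01.pim.contacts', '/org/01/pim/contacts'),
    'org._01.pim.contacts.Manager')

too_slow = []
for _ in range(100):
    start = time.time()
    manager.GetAllPeers()   # cheap call which should always return promptly
    duration = time.time() - start
    if duration > 0.2:      # arbitrary responsiveness threshold
        too_slow.append((start, duration))
    time.sleep(0.2)

assert too_slow == [], too_slow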
Comment 1 Patrick Ohly 2013-02-15 14:31:56 UTC
I've written a test and can reproduce the problem:

mkdir /tmp/testing
cd /tmp/testing
export XDG_CONFIG_HOME=/tmp/testing/temp-testpim/config XDG_DATA_HOME=/tmp/testing/temp-testpim/local/cache
~pohly/src/syncevolution/test/dbus-session.sh bash
TEST_DBUS_QUIET=1 PATH=/home/pohly/work/syncevolution/src:$PATH TESTPIM_TEST_ACTIVE_NUM=1000  ~pohly/src/syncevolution/src/dbus/server/pim/testpim.py -v TestContacts.testActive

======================================================================
FAIL: testActive (__main__.TestContacts)
TestContacts.testActive - reconfigure active address books several times
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pohly/src/syncevolution/src/dbus/server/pim/testpim.py", line 1647, in testActive
    w.checkpoint('full view')
  File "/home/pohly/src/syncevolution/src/dbus/server/pim/testpim.py", line 248, in checkpoint
    self.check()
  File "/home/pohly/src/syncevolution/src/dbus/server/pim/testpim.py", line 239, in check
    self.test.assertEqual([], tooslow)
  File "/home/pohly/src/syncevolution/test/testdbus.py", line 1321, in assertEqualCustom
    unittest.TestCase.assertEqual(self, self.stripDBus(a, sortLists), self.stripDBus(b, sortLists), msg)
AssertionError: Lists differ: [] != [(1360937032.086724, 78.482029...

Second list contains 1 additional elements.
First extra element 0:
(1360937032.086724, 78.48202991485596)

- []
+ [(1360937032.086724, 78.48202991485596)]
    ^^^^^^             ^^^^
start time             duration of GetAllPeers call

What I see in the dbus.log of that test are many "new contact data" messages from evolution-addressbook-factory to syncevo-dbus-server, followed by a GetAllPeers message (the ping) and directly afterwards the reply:

signal sender=:1.1011 -> dest=(null destination) serial=1036 path=/test/dbus/py; interface=t.d.p; member=log2
   string "87.300s: waiting for view with contacts"
method call sender=:1.1011 -> dest=:1.1013 serial=1037 path=/org/01/pim/contacts; interface=org._01.pim.contacts.Manager; member=GetAllPeers
method return sender=:1.1013 -> dest=:1.1011 reply_serial=1037
   array [
      dict entry(
         string "test-dbus-a"
         array [
            dict entry(
               string "address"
               string "xxx"
            )
            dict entry(
               string "protocol"
               string "PBAP"
            )
         ]
      )
      dict entry(
         string "test-dbus-b"
         array [
            dict entry(
               string "address"
               string "xxx"
            )
            dict entry(
               string "protocol"
               string "PBAP"
            )
         ]
      )
      dict entry(
         string "test-dbus-c"
         array [
            dict entry(
               string "address"
               string "xxx"
            )
            dict entry(
               string "protocol"
               string "PBAP"
            )
         ]
      )
   ]
signal sender=:1.1011 -> dest=(null destination) serial=1038 path=/test/dbus/py; interface=t.d.p; member=log2
   string "165.982s: ping failure: duration 78.482030s, error None"

The dbus.log does not contain time stamps; checking with "dbus-monitor --profile" while the test runs confirms that the reply is sent quite a bit later (over a minute). Attaching with gdb shows that the process is busy inside folks code during that time and only rarely serves new method calls in the main event loop.

My theory is that when receiving many contacts from EDS, folks inside syncevo-dbus-server can't keep up with evolution-addressbook-factory and lets contact data pile up in its incoming D-Bus channel. The D-Bus method call then gets tacked onto the end of that pile and only gets seen by syncevo-dbus-server when all data is processed.

What makes the situation worse is that the test adds contacts one-by-one in quick succession, so evolution-addressbook-factory sends one notification message per contact. This is not the normal startup situation, although it might happen when importing straight from a phone during syncing. Allowing folks to read only after all test contacts are already in EDS cuts the duration of GetAllPeers from 78 to 2 seconds - but that's still too much.

I am not sure yet what the right solution will be. Using direct reading inside folks will change the timing, but might still lead to a long period during which no other events are processed. Ultimately, loading data and processing changes must become a lower-priority task inside syncevo-dbus-server which must not interfere with other, more important work.
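
The general shape of such a lower-priority task, sketched in Python with GLib (PyGObject) rather than the actual C++/Vala code; process_one_contact() is a hypothetical stand-in for the real folks change processing:

from gi.repository import GLib

pending = list(range(2000))  # stand-in for queued contact changes
loop = GLib.MainLoop()

def process_one_contact(change):
    pass  # hypothetical: fold one change into the unified contacts

def process_some():
    # Work in small batches at idle priority, so that incoming D-Bus
    # method calls, dispatched at default priority, run in between.
    for _ in range(min(10, len(pending))):
        process_one_contact(pending.pop(0))
    if pending:
        return True   # keep the idle source installed
    loop.quit()
    return False      # all done, remove the idle source

GLib.idle_add(process_some, priority=GLib.PRIORITY_DEFAULT_IDLE)
loop.run()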
Comment 2 Patrick Ohly 2013-02-15 14:58:01 UTC
I also found that a ReadContacts() call with a large number of ids can lock up the server for a while, because *that* call is implemented synchronously. I looked briefly into what it would take to make it asynchronous, but stopped after thinking about some of the consequences.

By design, the call has to return all results at once, which implies assembling the response in memory and returning it when done. When other event processing is interleaved with that data gathering, the result might become inconsistent (a contact gets prepared and stored in the response, then modified before the result is sent). Preventing that is much harder than letting the application break up large ReadContacts() calls into multiple smaller ones, which is what I am now doing in testpim.py.
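
The chunking amounts to something like this (a sketch, not the actual testpim.py code; 'view' stands for a D-Bus proxy of the object implementing ReadContacts(), and the chunk size is arbitrary):

def read_contacts_chunked(view, ids, chunk=50):
    # Several small ReadContacts() calls instead of one huge one:
    # the server's main loop regains control between the calls.
    results = []
    for i in range(0, len(ids), chunk):
        results.extend(view.ReadContacts(ids[i:i + chunk]))
    return results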
Comment 3 Michael Hasselmann 2013-02-21 10:17:18 UTC
Isn't that something one could improve by adding Direct Read Access (DRA) to libfolks? I don't know of any method (in D-Bus) to prioritize messages that are sent to the same bus.

Also, do I read correctly that SetActiveAddressBooks() should be turned into a sync call and only return when reading of new address book data is complete?
Comment 4 Patrick Ohly 2013-02-21 11:07:43 UTC
(In reply to comment #3)
> Isn't that something one could improve by adding Direct Read Access to
> libfolks?

No, I don't think so. The actual work would still need to be chopped up into pieces which get processed after dealing with the higher-priority D-Bus events.

> I don't know of any method (in D-Bus) to prioritize messages that
> are sent to the same bus.

Right, which is why all messages must be taken out of the bus and queued for later processing, if necessary. The obvious downside is that memory use in the recipient will balloon if it cannot keep up.

It may be easier to avoid that with DRA (because the producer of the data is local and can be told to slow down), but that'll involve further API changes.
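
A rough sketch of that take-it-off-the-bus-and-queue pattern, continuing the Python/GLib sketch from comment #1 (handler and worker names are hypothetical; the real code is C++/Vala):

from gi.repository import GLib

queue = []  # notifications drained from the bus, awaiting processing

def on_contacts_changed(*change):
    # Do almost nothing in the signal handler, so the D-Bus channel
    # never backs up; memory use grows instead if we fall behind.
    if not queue:
        GLib.idle_add(drain, priority=GLib.PRIORITY_DEFAULT_IDLE)
    queue.append(change)

def handle_change(change):
    pass  # hypothetical: the expensive per-change processing

def drain():
    for _ in range(min(10, len(queue))):  # small batch per callback
        handle_change(queue.pop(0))
    return bool(queue)  # True = call again while work remains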

> Also, do I read correctly that SetActiveAddressBooks() should be turned into
> a sync call and only return when reading of new address book data is
> complete?

That was indeed tossed around as an idea, but it is hard to implement (folks doesn't signal when the change is complete), so I'd prefer to keep the call semantics as they are right now (the call doesn't wait).

Again, these are orthogonal problems. Even if the call were waiting, D-Bus event processing would have to continue while it waits.
Comment 5 Patrick Ohly 2013-02-22 08:15:05 UTC
A big step towards solving this problem was posted in folks for review; see https://bugzilla.gnome.org/show_bug.cgi?id=694385 and the patch https://bugzilla.gnome.org/review?bug=694385&attachment=237091 in particular.

The TEST_DBUS_QUIET=1 TESTPIM_TEST_ACTIVE_NUM=1000 test now passes with D-Bus messages replied to within 0.3 seconds or less - not perfect, but considerably better than before.
Comment 6 Patrick Ohly 2013-03-26 13:45:56 UTC
Collabora enhanced performance in recent folks. While re-doing benchmarks I noticed that the worst-case performance for 1000 contacts isn't as bad as I remembered it. However, I have not done an exact comparison.

The patch in https://bugzilla.gnome.org/show_bug.cgi?id=694385 is still relevant, though. Here's a comparison with TEST_DBUS_QUIET=1 TESTPIM_TEST_ACTIVE_NUM=2000 TESTPIM_TEST_ACTIVE_RESPONSE=-1 and filling the address books one-by-one while folks processes the changes.

I used an updated testpim.py which prints all samples instead of checking them. TESTPIM_TEST_ACTIVE_RESPONSE=-1 enables that mode.

stats (https://github.com/rustyrussell/stats) is used to calculate min-max(avg+/-stddev) for each phase. Full files will be attached.

Without patch:

$ stats folks-master-unpatched-2000.log
testActive (__main__.TestContacts)
TestContacts.testActive - reconfigure active address books several times ...
Pinging server at intervals of 0.200000s.
full view: ping duration: 0.012624-137.164563(45.7348+/-65)
contact data: ping duration: 0.005966-0.030444(0.0161076+/-0.0069)
['', 'a', 'b', 'c'] -> []: ping duration: 0.008605-0.763513(0.386059+/-0.38)
[] -> ['c']: ping duration: 0.008072-1.226016(0.149534+/-0.38)
['c'] -> ['b']: ping duration: 0.004165-0.836381(0.106718+/-0.26)
['b'] -> ['b', 'c']: ping duration: 0.005579-1.218825(0.145583+/-0.38)
['b', 'c'] -> ['a']: ping duration: 0.008640-1.261694(0.158903+/-0.36)
['a'] -> ['a', 'c']: ping duration: 0.005384-1.210349(0.143487+/-0.38)
['a', 'c'] -> ['a', 'b']: ping duration: 0.004604-0.328148(0.0475674+/-0.099)
['a', 'b'] -> ['a', 'b', 'c']: ping duration: 0.006383-1.224438(0.150367+/-0.38)
['a', 'b', 'c'] -> ['']: ping duration: 0.011664-1.227834(0.198583+/-0.38)
[''] -> ['', 'c']: ping duration: 0.005088-1.201952(0.143017+/-0.37)
['', 'c'] -> ['', 'b']: ping duration: 0.004293-0.313067(0.045124+/-0.095)
['', 'b'] -> ['', 'b', 'c']: ping duration: 0.011554-1.373705(0.27695+/-0.52)
['', 'b', 'c'] -> ['', 'a']: ping duration: 0.004793-1.213316(0.158278+/-0.35)
['', 'a'] -> ['', 'a', 'c']: ping duration: 0.005222-1.234311(0.149249+/-0.38)
['', 'a', 'c'] -> ['', 'a', 'b']: ping duration: 0.004743-0.122250(0.0208045+/-0.034)
['', 'a', 'b'] -> ['', 'a', 'b', 'c']: ping duration: 0.005526-1.225899(0.146111+/-0.38)
ok

----------------------------------------------------------------------
Ran 1 test in 905.721s
OK

Note the 137s maximum response time while populating the data ("full view").

With patch:

$ stats folks-master-patched-2000.log
testActive (__main__.TestContacts)
TestContacts.testActive - reconfigure active address books several times ...
Pinging server at intervals of 0.200000s.
full view: ping duration: 0.004023-0.103919(0.0433612+/-0.025)
contact data: ping duration: 0.005415-0.037208(0.0193849+/-0.0088)
['', 'a', 'b', 'c'] -> []: ping duration: 0.011281-0.751190(0.381236+/-0.37)
[] -> ['c']: ping duration: 0.004249-0.028754(0.0124643+/-0.0061)
['c'] -> ['b']: ping duration: 0.003715-0.207745(0.0281574+/-0.046)
['b'] -> ['b', 'c']: ping duration: 0.006746-0.032266(0.0171924+/-0.0075)
['b', 'c'] -> ['a']: ping duration: 0.005372-0.378336(0.0332488+/-0.076)
['a'] -> ['a', 'c']: ping duration: 0.003991-0.034730(0.0185342+/-0.0083)
['a', 'c'] -> ['a', 'b']: ping duration: 0.004354-0.205181(0.0245261+/-0.04)
['a', 'b'] -> ['a', 'b', 'c']: ping duration: 0.004429-0.053523(0.0222038+/-0.013)
['a', 'b', 'c'] -> ['']: ping duration: 0.007220-0.567568(0.0461981+/-0.13)
[''] -> ['', 'c']: ping duration: 0.004500-0.031244(0.0149976+/-0.0077)
['', 'c'] -> ['', 'b']: ping duration: 0.005897-0.218556(0.0247215+/-0.043)
['', 'b'] -> ['', 'b', 'c']: ping duration: 0.004165-0.045688(0.0226147+/-0.011)
['', 'b', 'c'] -> ['', 'a']: ping duration: 0.004427-0.572938(0.0389641+/-0.1)
['', 'a'] -> ['', 'a', 'c']: ping duration: 0.004139-0.041826(0.021645+/-0.011)
['', 'a', 'c'] -> ['', 'a', 'b']: ping duration: 0.007139-0.221841(0.0326704+/-0.039)
['', 'a', 'b'] -> ['', 'a', 'b', 'c']: ping duration: 0.004101-0.060816(0.0274812+/-0.017)
ok

----------------------------------------------------------------------
Ran 1 test in 946.532s
OK

The maximum response time during loading gets reduced to 0.1s.

On the downside, total runtime increases by 4%.

When running the test so that all data is already in EDS when folks starts, the "full view" max response time is 2.3s without the patch and 0.2s with it. Total runtime goes up from 797s to 848s (6%); this cannot be compared directly with the other setup, because loading no longer overlaps with generating the data.
Comment 7 Patrick Ohly 2013-03-26 13:47:25 UTC
Created attachment 77050 [details]
2000 contacts, folks 7f1fd0dd0, generating contacts one-by-one while folks runs
Comment 8 Patrick Ohly 2013-03-26 13:48:01 UTC
Created attachment 77051 [details]
2000 contacts, folks with patch, generating contacts one-by-one while folks runs
Comment 9 Patrick Ohly 2013-03-26 13:48:36 UTC
Created attachment 77052 [details]
2000 contacts, folks 7f1fd0dd0, generating contacts before folks runs
Comment 10 Patrick Ohly 2013-03-26 13:49:11 UTC
Created attachment 77053 [details]
2000 contacts, folks with patch, generating contacts before folks runs
Comment 11 Patrick Ohly 2013-03-26 19:28:25 UTC
All relevant patches in folks were accepted; see https://bugzilla.gnome.org/show_bug.cgi?id=694385

TODO:
- try out DRA in combination with testActive; blocked by not being able to call
  e_book_client_open_direct_sync from Vala (not tracked upstream yet)
Comment 12 GitLab Migration User 2018-10-13 12:42:32 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/SyncEvolution/syncevolution/issues/89.

