Bug 77272

Summary: PBAP: download in chunks to make progress after interrupts
Product: SyncEvolution
Component: PBAP
Status: RESOLVED FIXED
Severity: normal
Priority: highest
Version: unspecified
Hardware: Other
OS: All
Reporter: Patrick Ohly <patrick.ohly>
Assignee: Patrick Ohly <patrick.ohly>
CC: nairb1958, syncevolution-issues

Description Patrick Ohly 2014-04-10 16:01:48 UTC
Downloading in chunks with PullAll and Offset+MaxCount was originally brought up for issue #72112. The purpose mentioned there (being able to stop the sync between chunks) was better addressed with the BlueZ 5.15 suspend/resume API.

However, downloading in chunks is useful for another purpose: just downloading large address books (5000 contacts, 8KB photo data in each contact) from slow devices (Samsung Galaxy S3) can take 45 minutes.

If the driver always turns off the car after a 15 minute commute, he'll never get the entire address book cached. Downloading in chunks can help by starting the download at varying offsets.

The downside is that the PBAP spec doesn't support downloading in chunks reliably if contacts get deleted during the download.

Comment 1 Terrence Enger 2014-05-11 19:53:12 UTC
*** Bug 78528 has been marked as a duplicate of this bug. ***

Comment 2 Patrick Ohly 2014-05-28 09:51:20 UTC
Here's a potential (and not 100% correct!) algorithm for transferring a
complete address book:

uint16 used = GetSize() # not the same as maximum offset!
uint16 start = choose_start()
uint16 chunksize = choose_chunk_size()

uint16 i
for (i = start; i < used; i += chunksize) {
   PullAll( Offset = i, MaxCount = chunksize)
}
for (i = 0; i < start; i += chunksize) {
   PullAll( Offset = i, MaxCount = min(chunksize, start - i))
}

Note that GetSize() is specified as returning the number of entries in
the selected phonebook object that are actually used (i.e. indexes that
correspond to non-NULL entries). This is relevant if contacts get
deleted after starting the session. In that case, the algorithm above
will not necessarily read all contacts. Here's an example:
        offsets #0 till #99, with contacts #10 till #19 deleted
        chunksize = 10
        GetSize() = 90
        
=> this will request offsets #0 till #89, missing contacts #90 till #99

I think this can be fixed with an additional PullAll, leading to:

for (i = start; i < used; i += chunksize) {
   PullAll( Offset = i, MaxCount = chunksize)
}
PullAll(Offset = i) # no MaxCount!
for (i = 0; i < start; i += chunksize) {
   PullAll( Offset = i, MaxCount = min(chunksize, start - i))
}

The additional PullAll() is meant to read all contacts at the end which
would not be covered otherwise.
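
To check the reasoning, here is a small Python simulation of both variants. It is only a sketch: pull_all(), get_size() and download() are hypothetical stand-ins, and it assumes that PullAll returns up to MaxCount used entries located at or after Offset, which is one reading of the spec and not verified against real phones.

```python
def pull_all(phonebook, offset, max_count=None):
    """Assumed PullAll model: used entries at index >= offset, at most max_count."""
    entries = [(i, c) for i, c in enumerate(phonebook)
               if i >= offset and c is not None]
    return entries if max_count is None else entries[:max_count]


def get_size(phonebook):
    """Number of used (non-NULL) entries, not the maximum offset."""
    return sum(1 for c in phonebook if c is not None)


def download(phonebook, start, chunksize, final_pull=True):
    """Chunked download starting at 'start', wrapping around to offset 0."""
    used = get_size(phonebook)
    received = []
    i = start
    while i < used:
        received += pull_all(phonebook, i, chunksize)
        i += chunksize
    if final_pull:
        # Extra request without MaxCount: picks up everything past 'used'.
        received += pull_all(phonebook, i)
    i = 0
    while i < start:
        received += pull_all(phonebook, i, min(chunksize, start - i))
        i += chunksize
    return received


# Offsets #0 till #99 with contacts #10 till #19 deleted, as in the example.
phonebook = [None if 10 <= n <= 19 else f"contact {n}" for n in range(100)]
expected = {n for n, c in enumerate(phonebook) if c is not None}

for final_pull in (False, True):
    got = {n for n, _ in download(phonebook, 0, 10, final_pull=final_pull)}
    print("final pull:", final_pull, "missed offsets:", sorted(expected - got))
```

Under that assumption, the run without the extra request misses offsets #90 till #99, while the run with it covers every used entry, also for start offsets other than 0.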

Now the other problem: MaxCount means "read chunksize contacts
starting at #i". Therefore the algorithm above will end up reading contacts
multiple times occasionally. Example:

        offsets #0 till #99, with contact #0 deleted
        chunksize = 10
        GetSize() = 99

PullAll(Offset = 0, MaxCount = 10) => returns 10 contacts #1 till #10 (inclusive)
PullAll(Offset = 10, MaxCount = 10) => returns 10 contacts #10 till #19
=> contact #10 appears twice in the result

The duplicate cannot be filtered out easily because the UID is not
reliable. This could be addressed by keeping a hash of each contact and
discarding those that are exact matches for already seen contacts. It's
easier to accept the duplicate and remove it during the next sync.
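
For illustration, the hash-based filtering could look roughly like this in Python. This is a sketch only: drop_exact_duplicates() is a hypothetical helper, contacts are assumed to arrive as vCard text, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def drop_exact_duplicates(vcards):
    """Keep the first copy of each contact, compared by content hash."""
    seen = set()
    unique = []
    for vcard in vcards:
        digest = hashlib.sha256(vcard.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(vcard)
    return unique


chunk1 = ["BEGIN:VCARD\nFN:Alice\nEND:VCARD", "BEGIN:VCARD\nFN:Bob\nEND:VCARD"]
chunk2 = ["BEGIN:VCARD\nFN:Bob\nEND:VCARD", "BEGIN:VCARD\nFN:Carol\nEND:VCARD"]
print(len(drop_exact_duplicates(chunk1 + chunk2)))
```

Note that this also collapses two separate phone entries which happen to have byte-identical content, and it does not catch a contact that was modified between two chunks.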

There are two more aspects that I chose to ignore above: how to
implement the choice of start offset and chunk size.

Start offset could be random (no persistent state needed) or could
continue where the last sync left off. The latter will require a write
after each PullAll() (in case of unexpected shutdowns), even if nothing
ever changes. Is that acceptable? Probably not. I prefer choosing
randomly.

The chunk size depends on the size of the average contact. Make it too
small, and we end up generating lots of individual transfers. Make it
too large (say 1000), and we still have chunks that never transfer
completely. We could tune the chunk size so that on average, each transfer has a certain size in bytes. TODO: how large?

Once we have such a target size in bytes, perhaps we can let the
algorithm adjust the chunk size dynamically: start small (100?), then
increase or decrease depending on the observed size of the returned
contacts.
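
A possible shape for that adjustment, sketched in Python. The function name and the 1 MB target are made up for the example (the target size is still the open TODO above); the cap at 65535 reflects that MaxCount is a uint16.

```python
def chunk_size_for_target(target_bytes, observed_bytes, observed_contacts,
                          fallback=100):
    """Contacts per chunk so that an average chunk is about target_bytes."""
    if observed_bytes <= 0 or observed_contacts <= 0:
        return fallback  # nothing measured yet: start small
    average = observed_bytes / observed_contacts
    count = int(target_bytes // average)
    return max(1, min(0xFFFF, count))  # MaxCount is a uint16


# Hypothetical 1 MB target, first chunk of 100 contacts came to 600000 bytes.
print(chunk_size_for_target(1_000_000, 600_000, 100))
```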

Comment 3 Patrick Ohly 2014-07-14 12:08:55 UTC
Implemented, included in master. It is turned off by default.

commit 527b47c80ef105e7cbdfb170541615fd3e906906
Author: Patrick Ohly <patrick.ohly@intel.com>
Date:   Wed Jul 2 17:33:13 2014 +0200

    PBAP: transfer in chunks (FDO #77272)
    
    If enabled via env variables, PullAll transfers will be limited to
    a certain number of contacts at different offsets until all data got
    pulled. See README for details.
    
    When transferring in chunks, the enumeration of contacts for the engine
    no longer matches the PBAP enumeration. Debug output uses "offset #x"
    for PBAP and "ID y" for the engine.

From the PBAP README:

Transferring in chunks
======================

The default is to pull all contacts in one transfer. This can be
changed to transfer in chunks. Optionally the size of the chunks can
be adjusted dynamically at runtime to achieve a certain time per
transfer.

The purpose of transferring in chunks is twofold:
  1. It avoids having to pull the entire address book into a file
     which then has to be kept around until syncing is complete.
  2. By randomly starting at different offsets, eventually all
     data gets added to the local cache even if no sync ever
     completes.

This gets configured with environment variables:

SYNCEVOLUTION_PBAP_CHUNK_MAX_COUNT_PHOTO=<number of contacts>
  A value larger than 0 enables chunking when transferring contacts with photo data.

SYNCEVOLUTION_PBAP_CHUNK_MAX_COUNT_NO_PHOTO=<number of contacts>
  A value larger than 0 enables chunking when transferring contacts without photo data.

SYNCEVOLUTION_PBAP_CHUNK_TRANSFER_TIME=<seconds>
  The desired duration of each transfer. Indirectly also controls the amount
  of data which has to be buffered. Defaults to 30 seconds, turned off
  with any value <= 0 seconds.

SYNCEVOLUTION_PBAP_CHUNK_TIME_LAMBDA=<0 to 1>
  Controls how quickly new measurements adapt the chunk size. 0 is fastest
  (= next transfer uses exactly the calculated number of contacts), 1 is not
  at all (= all transfers use the initial number). Defaults to 0.1.

SYNCEVOLUTION_PBAP_CHUNK_OFFSET=<0 to number of contacts in phone>
  Overrides the random selection of the start offset. Useful for debugging.
  Offsets which are out of range get mapped into a valid offset.
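
The interplay of SYNCEVOLUTION_PBAP_CHUNK_TRANSFER_TIME and SYNCEVOLUTION_PBAP_CHUNK_TIME_LAMBDA can be illustrated with the following Python sketch. It is not taken from the SyncEvolution sources: next_chunk_size() is a hypothetical function, and the exponential smoothing formula is merely one that matches the two extremes described above (0 = use the calculated number, 1 = keep the initial number).

```python
def next_chunk_size(current, transferred, elapsed, target_time=30.0, lam=0.1):
    """Blend the current max count with the count that would hit target_time.

    current:     max count used for the last transfer
    transferred: contacts actually received in that transfer
    elapsed:     duration of that transfer in seconds
    """
    if target_time <= 0 or elapsed <= 0 or transferred <= 0:
        return current  # tuning turned off or nothing to measure
    calculated = transferred * target_time / elapsed
    blended = lam * current + (1 - lam) * calculated
    return max(1, min(0xFFFF, round(blended)))


# 100 contacts arrived in 15s, so 200 contacts would take about 30s.
for lam in (0.0, 0.1, 1.0):
    print(lam, next_chunk_size(100, 100, 15.0, lam=lam))
```

With these inputs, lambda 0 jumps straight to 200, lambda 1 stays at 100, and the default 0.1 lands close to the calculated value.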

For example, consider a Samsung Galaxy S3, Android 4.3, average
contact size 6KB with photo data and 235B without. The transfer rate
is 40KB/s with photo data, 17KB/s without. To achieve 30s per chunk,
one needs to choose 243 contacts per chunk with photo data or 2500
without. A transfer of 1000 contacts without photos completes in under
17 seconds, with photos under 2:05 minutes. In this case, downloading
in chunks was almost as fast as transferring all at once.

To debug transferring in chunks, run
  SYNCEVOLUTION_DEBUG=1 syncevolution --daemon=no --export - \
     backend=pbap loglevel=4 \
     database=obex-bt://64:B3:10:C0:8C:2E 2>&1 | grep -e transferred -e "pullall" -e "max count"
