Bug 49687 - Zeitgeist based LogStore
Summary: Zeitgeist based LogStore
Status: RESOLVED MOVED
Alias: None
Product: Telepathy
Classification: Unclassified
Component: logger
Version: unspecified
Hardware: Other
OS: All
Importance: medium normal
Assignee: Telepathy bugs list
QA Contact: Telepathy bugs list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-09 05:36 UTC by Seif Lotfy
Modified: 2019-12-03 19:31 UTC
CC List: 2 users

See Also:


Attachments

Description Seif Lotfy 2012-05-09 05:36:51 UTC
As part of my research to improve the logging backend for Telepathy I would like to propose a new LogStore.

http://telepathy.freedesktop.org/wiki/Logger/ZeitgeistStore

To sum it up, it doesn't have to use Zeitgeist, but Zeitgeist would make our lives much easier. The idea is to store events in one table and keep the extra data (the text of the events) in another table for quick lookups.
In general I think this would let us search text faster, by experimenting with FTS indexers like Xapian or Lucene, and also let us look up non-text events faster.
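To make the split concrete, here is a rough, hypothetical sketch of such a two-table layout using the SQLite C API (none of these names come from the actual proposal on the wiki):

#include <sqlite3.h>
#include <stdio.h>

int
main (void)
{
  sqlite3 *db;
  char *err = NULL;

  if (sqlite3_open ("tpl-sketch.db", &db) != SQLITE_OK)
    return 1;

  sqlite3_exec (db,
      /* events: small fixed-size columns, cheap to scan and filter */
      "CREATE TABLE IF NOT EXISTS events ("
      "  event_id   INTEGER PRIMARY KEY,"
      "  timestamp  INTEGER NOT NULL,"
      "  account_id INTEGER NOT NULL,"
      "  target_id  INTEGER NOT NULL,"
      "  event_type INTEGER NOT NULL);"
      /* message_text: the bodies, kept apart so they can later be fed
       * to an FTS indexer (SQLite FTS, Xapian, Lucene, ...) */
      "CREATE TABLE IF NOT EXISTS message_text ("
      "  event_id INTEGER PRIMARY KEY REFERENCES events(event_id),"
      "  body     TEXT NOT NULL);",
      NULL, NULL, &err);

  if (err != NULL)
    fprintf (stderr, "schema error: %s\n", err);
  sqlite3_free (err);
  sqlite3_close (db);
  return 0;
}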

I will update the wiki and designs as we go through this bug.
Comment 1 Cosimo Alfarano 2012-05-09 06:10:46 UTC
Note that this bug is an evolution of Bug#26908.

Not sure if it's needed as a separate bug (I'd say yes, to keep generic-sqlite and ZG detached).
In that case I'd rename this bug to mention Zeitgeist explicitly in the title, and keep the discussion about integrating TPL with Zeitgeist (which uses SQLite internally anyway) here.
If ZG is seen as a reasonable approach, we can eventually merge the two bugs.
Comment 2 Guillaume Desmottes 2012-05-09 06:33:55 UTC
I'm not sure it's really worth logging the opening/closing of text channels. In practice they are closed when the user closes the chat window. Some users (like me) tend to keep chat windows open for days while others (like Sjoerd) close them all the time. So that doesn't really tell us anything about the 'state' (terminated or not).
In Empathy, the log viewer groups chats into 'conversations' (which can be expanded or collapsed), where a conversation is a group of messages exchanged with at most $N minutes between 2 consecutive messages (with $N == 20 or something).
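Concretely, the grouping rule amounts to something like this minimal sketch (not Empathy's actual code; it assumes a sorted array of message timestamps):

#include <glib.h>

#define CONVERSATION_GAP (20 * 60)  /* $N == 20 minutes, in seconds */

/* timestamps: sorted message times (seconds since the epoch).
 * Returns a newly allocated array where result[i] is the conversation
 * index assigned to message i. */
static guint *
group_into_conversations (const gint64 *timestamps, guint n_messages)
{
  guint *conversation = g_new0 (guint, n_messages);
  guint current = 0;
  guint i;

  for (i = 1; i < n_messages; i++)
    {
      if (timestamps[i] - timestamps[i - 1] > CONVERSATION_GAP)
        current++;  /* gap longer than $N: a new conversation starts */
      conversation[i] = current;
    }

  return conversation;
}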

If Zeitgeist fails to assign an ID to the event, or if the logger exits while waiting for the ID, then the content of the message seems to be lost.
Comment 3 Seif Lotfy 2012-05-09 06:39:12 UTC
(In reply to comment #2)
> I'm not sure it's really worth logging opening/closing of text channels. In
> practice they are closed when user closes the chat window. Some users (like me)
> tend to keep chat windows opened during days while others (like Sjoerd) closes
> them all the time. So that doesn't really give us anything about the 'state'
> (terminated or not).
> In Empathy, the log viewer groups chats by 'conversations' (which can be
> expended or closed) where a conversation is a group of messages exchanged with
> at most $N minutes between 2 consecutive messages (with $N == 20 or something).

Yeah, I agree that we need a better definition of conversations. However, it is not essential for the event log; I just thought it could be useful. What matters are the send/receive events.

> If Zeitgeist fails to assign an ID to the even or if the logger exits while
> waiting for the id then the content of the message seems to be lost.

Good point. How would you propose solving the issue, though? I mean, with or without Zeitgeist, if the logger exits we lose the message. Or did I miss something?
Comment 4 Nicolas Dufresne 2012-05-09 08:14:17 UTC
> > If Zeitgeist fails to assign an ID to the even or if the logger exits while
> > waiting for the id then the content of the message seems to be lost.
> 
> Good point. How would you propose solving the issue though. I mean with or
> without Zeitgeist if the logger exists we lose the message. Or did I miss
> something?

The pending message cache could be improved to take this into account.
Comment 5 Cosimo Alfarano 2012-05-09 08:20:50 UTC
(In reply to comment #3)
> Good point. How would you propose solving the issue though. I mean with or
> without Zeitgeist if the logger exists we lose the message. Or did I miss
> something?

Seif, the problem is due to the fact that the logged info is split in two: Event Log and Body Index.
Without ZG (the Event Log), there would be just one sync call to SQLite, to store the message at once.

We are talking about a clean exit, where for example TPL times out after some time without channel activity (I don't think it acts that way yet, but it was supposed to).

The bigger issue is that the second part (the body) is the one most likely to be lost, and it's also the most important (or at least as important as some of the other event info).

I don't care if I don't remember the avatar used with the message or the geolocation of the event.
I care if I don't have the body, or if I cannot associate it with the timestamp or from/to.

Is there any way to invert the process?

1- write the body with the minimum set of needed info into the Body Index (even if duplicated in the Log later), assigning a primary key X
2- write the Log, telling ZG that this event is related to X (or giving it our own event_id).
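Something like the sketch below, where the body_index table and its columns are invented purely to illustrate the ordering (the ZG push in step 2 is only hinted at in a comment):

#include <sqlite3.h>

/* Step 1: persist the body plus the minimum metadata needed to make
 * sense of it, and return its primary key X (or -1 on failure). */
static sqlite3_int64
store_body_first (sqlite3 *db, sqlite3_int64 timestamp,
                  const char *sender, const char *body)
{
  sqlite3_stmt *stmt;
  sqlite3_int64 body_key = -1;

  if (sqlite3_prepare_v2 (db,
        "INSERT INTO body_index (timestamp, sender, body) "
        "VALUES (?, ?, ?)", -1, &stmt, NULL) != SQLITE_OK)
    return -1;

  sqlite3_bind_int64 (stmt, 1, timestamp);
  sqlite3_bind_text (stmt, 2, sender, -1, SQLITE_TRANSIENT);
  sqlite3_bind_text (stmt, 3, body, -1, SQLITE_TRANSIENT);

  if (sqlite3_step (stmt) == SQLITE_DONE)
    body_key = sqlite3_last_insert_rowid (db);
  sqlite3_finalize (stmt);

  /* Step 2 (asynchronous, not shown): push the event to the Log/ZG,
   * carrying body_key as the payload that links the two records. */
  return body_key;
}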

This is a scenario in which we have a private DB for what we need and delegate all the extra data to ZG, rather than having the private DB keep only what ZG cannot store / is better not stored in ZG.

Seif, on a (not so) unrelated subject: can you explain how you thought to manage the "delete event" API? It might be insightful for this problem.
Scenario: TPL's event_id_X is deleted from ZG while TPL is not running?

I don't remember it exactly, but it was a sort of recovery procedure. Can it be applied in this case? (This would cover the situation where TPL exits before ZG returns, though not the case where ZG does not answer.)
Comment 6 Seif Lotfy 2012-05-09 09:55:41 UTC
(In reply to comment #5)
> (In reply to comment #3)
> > Good point. How would you propose solving the issue though. I mean with or
> > without Zeitgeist if the logger exists we lose the message. Or did I miss
> > something?
> 
> Seif, the problem is due to fact that the logged info is split in two: Event
> Log and Body Index.
> Without ZG (the Event Log), there would be just one sync call to Sqlite, to
> store the message at once.
> 
> We are talking of a clean exit, where for example TPL times out after some time
> without channel activities (I don't think it acts that way yet, but it was
> supposed to).

Very good point.

One of the main reasons I thought of splitting the DB and wrapping the TBI (message DB) is to be able to full-text index the messages with an FTS library such as Xapian or Lucene.

Having the events and the messages in one DB just makes it less clean in terms of separation of concerns. What if we decide that Xapian is not such a good FTS indexer? That would mean wrapping the whole DB all over again.

Anyhow, we could create a dedicated, integrated event DB which does its own sync read/write for the Log. We could do that with our own integration or with the upcoming libzeitgeist2, which creates a DB for you without the daemon. In both cases, if your worry is doing stuff async, then switching to sync writing should not be an issue.

> The bigger issue is that the second part (body) is the one most likely to be
> lost and it's also the most important (or at least equally important with some
> other event's info).

When can it get lost?
 
> I don't care if I don't remember the avatar used with the message or the
> geolocation of the event.
> I care if I don't have the body or I cannot associate the it with timestamp or
> from/to.
> 
> Is there any way to invert the process?
> 
> 1- write the body with the minimum set of needed info into the Body Index (even
> if duplicated in the Log later), assigning a primary key X
> 2- write the Log, telling ZG that this event is related to X (or giving it our
> own event_id).

Sadly, you can't give Zeitgeist an event_id for an event.

Well, a good solution is to have a temp_table which stores all the info as-is (strings) as soon as it arrives. When an interaction happens we first dump it into the temp_table, then we insert into the Log and then into the TBI. Once both insertions have taken place we remove the entry from the temp_table. This way, if TPL quits or crashes, or the Log is not reachable in the middle of the process, the next time TPL starts it will find the temp_table non-empty and try to drain it (a rough sketch of this flow follows after the list below). You might ask why not keep the temp_table as our main storage. Well:
1) it is hard to build an FTS index around it.
2) it will have duplicate string entries (for example the target string), which can be costly and should rather be stored as an int.
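A rough sketch of that flow, with an invented temp_table layout and plain SQLite calls (the Log/TBI insertions themselves are left out):

#include <sqlite3.h>

/* Called as soon as an interaction arrives: keep a raw copy. */
static sqlite3_int64
temp_table_store (sqlite3 *db, const char *account, const char *target,
                  sqlite3_int64 timestamp, const char *body)
{
  sqlite3_stmt *stmt;
  sqlite3_int64 row = -1;

  sqlite3_prepare_v2 (db,
      "INSERT INTO temp_table (account, target, timestamp, body) "
      "VALUES (?, ?, ?, ?)", -1, &stmt, NULL);
  sqlite3_bind_text (stmt, 1, account, -1, SQLITE_TRANSIENT);
  sqlite3_bind_text (stmt, 2, target, -1, SQLITE_TRANSIENT);
  sqlite3_bind_int64 (stmt, 3, timestamp);
  sqlite3_bind_text (stmt, 4, body, -1, SQLITE_TRANSIENT);
  if (sqlite3_step (stmt) == SQLITE_DONE)
    row = sqlite3_last_insert_rowid (db);
  sqlite3_finalize (stmt);
  return row;
}

/* Called only after both the Log and the TBI acknowledged the event. */
static void
temp_table_purge (sqlite3 *db, sqlite3_int64 row)
{
  sqlite3_stmt *stmt;

  sqlite3_prepare_v2 (db, "DELETE FROM temp_table WHERE rowid = ?",
                      -1, &stmt, NULL);
  sqlite3_bind_int64 (stmt, 1, row);
  sqlite3_step (stmt);
  sqlite3_finalize (stmt);
}

On startup, a single SELECT over temp_table is enough to find anything that still needs to be replayed into the Log and the TBI.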

> This is a scenario in which we have a private DB for what we need and delegate
> to ZG all the extra data, rather to have the private DB to keep what ZG cannot
> store/is better not store in ZG.

I can't follow. Can you elaborate?
 
> Seif, on a (not so) unrelated subject: can you explain how you thought to
> manage the "delete event" API? It might be insightful for this problem.
> Scenario: TPL's event_id_X is deleted from ZG while TPL is not running?
> 
> I don't remember it exactly, but it was a sort of recovery procedure. Can it be
> applied in this case? (this would cover the situation for which TPL exists
> before ZG returns, not in case ZG does not answer though).

Well, in that case we would have a Zeitgeist extension that does nothing but listen for deleted events and store the deleted event_ids. Once TPL runs again it requests all the deleted events from Zeitgeist and updates its TBI.
Comment 7 Cosimo Alfarano 2012-05-10 03:16:09 UTC
(In reply to comment #6)
> One of the main reasons I thought of splitting the DB is to wrap the TBI
> (message DB) is to be able to fullindex the messages, with and FTS library such
> as Xapian or Lucene.

> Having the Events and the Messages in one DB just makes it not so clean in
> terms of separation of concern. What if we decide that Xapian is not so good of
> a FTS indexer. This would mean wrapping the whole DB all over again.

OK, can you expand this thought in a rationale on the wiki?
What are the requirements for having the FTS indexer working?
Actually, an FTS section mentioning Xapian or any other FTS indexer is missing.

What's the difference between having what you proposed (an event_id+body table) and a table with more columns? Would a JOIN on two tables work?
 
> Anyhow we could create a dedicated integrated event DB. Which does its own 
> sync read/write for the Log. We could do that with a our own integration or 
> with the upcoming libzeitgeist2 which creates a DB for you without the daemon. 

I thought you already put a section about those possibilities on the wiki.
Please, add a section about it as well, so we have a complete scenario.

> In both cases if your worries are doing stuff async then swtiching to sync 
> writing should not be an issue.

I don't think the async writing is what worries people; rather, it is what makes them (including me) critical.
As long as there is a weak point in the proposal, it's not really viable.

> > The bigger issue is that the second part (body) is the one most likely to be
> > lost and it's also the most important (or at least equally important with some
> > other event's info).
> 
> When can it get lost.

Lost = the daemon shuts down before the callback is fired (including ZG never calling us back).
How can it happen? Normal D-Bus service life cycle, desktop life cycle, etc.
This part we can work on.

Last but not least, TPL crashes: the longer storing the whole info takes (in terms of steps rather than time; I know ZG is fast), the higher the possibility of data loss on a crash. About this we cannot do much, but we need to make the TPL architecture less susceptible to inconsistencies in such situations as well.

> > I don't care if I don't remember the avatar used with the message or the
> > geolocation of the event.
> > I care if I don't have the body or I cannot associate the it with timestamp or
> > from/to.
> > 
> > Is there any way to invert the process?
> > 
> > 1- write the body with the minimum set of needed info into the Body Index (even
> > if duplicated in the Log later), assigning a primary key X
> > 2- write the Log, telling ZG that this event is related to X (or giving it our
> > own event_id).
> 
> Sadly you can't give Zeitgeist an event_id to an event.
> 
> Well a good solution is to have a temp_table which stores all the info as it is
> (strings) as soon as they arrive. When an interaction happens we will first
> dump it in the temp_table. Then we insert into the log then into the TBI. Once
> both insertions took place we remove from the temp_table. This way if TPL quits
> or crashes, or the Log is not reachable the middle of a process, the next time
> tpl start it will find the temp_table not-empty and try to empty it.

This is similar to the approach we use for pending messages; I think Nicolas was thinking of a similar thing in Comment #4.

My idea is to not consider the temp table temporary at all, but part of the log.
You already have the data, why remove it?

This would also make TPL queries (the log_manager_get_FOO() calls) ask just one place instead of two.

> You might
> ask why not keep the temp_table as our main storage. Well:
> 1) it is hard to do a FTS index around it.

I look forward to seeing it on the Wiki. Would it help to have two tables?
One for the body and one for the rest (timestamp, id).

> 2) it will have duplicate string entries for example the target string. Which
> can be costly and should be rather stored as an int.

I don't understand what you mean. Is it a write() problem?
We already write the data fully in the temp table.

If this is an issue, it can be avoided by refactoring into multiple tables:

| tpl_id | event_id | body | (table 1)
| contact_id | contact_id_number (yeah, silly name) | (table 2)
| tpl_id | timestamp | contact_id_number | (table 3)

Table 2 is written when a new contact enters the log (this write() happens once in the table's lifetime for each contact who will eventually contact us).
An in-memory cache (hashtable) can be used in the LogStore for recently contacted people, so as to avoid continuous queries (reads) to table 2 as well.

This way we have a body and then only integers to deal with, in the average situation.
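Roughly, reusing the column names from the sketch above (the "contacts" table name and the GLib-side cache are illustrative, not an actual design):

#include <glib.h>
#include <sqlite3.h>

/* Maps contact identifier (string) -> contact_id_number (int), so
 * table 2 is only hit the first time a contact shows up; assumes
 * contacts(contact_id TEXT UNIQUE, contact_id_number INTEGER PRIMARY KEY). */
static GHashTable *contact_cache = NULL;

static int
lookup_contact_id_number (sqlite3 *db, const char *contact)
{
  gpointer cached;
  sqlite3_stmt *stmt;
  int id = -1;

  if (contact_cache == NULL)
    contact_cache = g_hash_table_new_full (g_str_hash, g_str_equal,
                                           g_free, NULL);

  if (g_hash_table_lookup_extended (contact_cache, contact, NULL, &cached))
    return GPOINTER_TO_INT (cached);

  /* Not cached: insert-or-fetch from table 2, then remember it. */
  sqlite3_prepare_v2 (db,
      "INSERT OR IGNORE INTO contacts (contact_id) VALUES (?)",
      -1, &stmt, NULL);
  sqlite3_bind_text (stmt, 1, contact, -1, SQLITE_TRANSIENT);
  sqlite3_step (stmt);
  sqlite3_finalize (stmt);

  sqlite3_prepare_v2 (db,
      "SELECT contact_id_number FROM contacts WHERE contact_id = ?",
      -1, &stmt, NULL);
  sqlite3_bind_text (stmt, 1, contact, -1, SQLITE_TRANSIENT);
  if (sqlite3_step (stmt) == SQLITE_ROW)
    id = sqlite3_column_int (stmt, 0);
  sqlite3_finalize (stmt);

  if (id >= 0)
    g_hash_table_insert (contact_cache, g_strdup (contact),
                         GINT_TO_POINTER (id));
  return id;
}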

The real problem is how to deal with data loss :)

> > This is a scenario in which we have a private DB for what we need and delegate
> > to ZG all the extra data, rather to have the private DB to keep what ZG cannot
> > store/is better not store in ZG.
> 
> I can't follow. Can you elaborate?

It's just a consideration on how ZG and the private DB are used.

WRT my former idea (a) of 
1- writing the whole needed info into SQLite
2- push the event info to ZG

and your idea (b) of
1- push the event info to ZG
2- writing the info that ZG does not store into SQLite

a) is a way to have a local SQLite DB with the majority of the info we need, and use ZG (from TPL) to get the rest of the info. If for any reason ZG is not running, we can still work.
ZG has partially duplicated info, but that wouldn't be a big issue in my opinion.

In b) ZG has the main role, and the private SQLite is there only because ZG cannot do FTS for the moment.
Delegating the whole log to ZG would be OK; the fact that it cannot handle the body index for the moment is what is actually creating the problem (see callback), and fixing that is probably the ideal solution.
However, we would completely rely on ZG, which means that if ZG is down, we cannot do it. Can that be an issue?
Comment 8 Nicolas Dufresne 2012-05-10 06:29:42 UTC
Just before we forget, make sure all the operations are cancellable (see GCancellable for the architecture). While this is not in the Logger API, it is an important goal for fast change. The proposed Walker API is cancellable by design too, so that one would need no more work.
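As a concrete reminder of the pattern (plain GIO, not Logger code; g_file_load_contents_async() just stands in for whatever LogStore/Walker call gets written), a cancellable async operation looks like this:

#include <gio/gio.h>

static GMainLoop *loop;

static void
loaded_cb (GObject *source, GAsyncResult *result, gpointer user_data)
{
  GError *error = NULL;
  char *contents = NULL;
  gsize length = 0;

  if (!g_file_load_contents_finish (G_FILE (source), result,
                                    &contents, &length, NULL, &error))
    {
      if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED))
        g_print ("operation was cancelled, nothing to clean up\n");
      else
        g_print ("error: %s\n", error->message);
      g_clear_error (&error);
    }
  else
    {
      g_print ("read %" G_GSIZE_FORMAT " bytes\n", length);
      g_free (contents);
    }

  g_main_loop_quit (loop);
}

int
main (void)
{
  GFile *file = g_file_new_for_path ("/etc/hostname");
  GCancellable *cancellable = g_cancellable_new ();

  g_file_load_contents_async (file, cancellable, loaded_cb, NULL);

  /* Whoever holds the GCancellable can abort the pending operation,
   * e.g. when the log viewer window goes away. */
  g_cancellable_cancel (cancellable);

  loop = g_main_loop_new (NULL, FALSE);
  g_main_loop_run (loop);

  g_object_unref (cancellable);
  g_object_unref (file);
  return 0;
}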
Comment 9 Seif Lotfy 2012-05-10 07:45:46 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > One of the main reasons I thought of splitting the DB is to wrap the TBI
> > (message DB) is to be able to fullindex the messages, with and FTS library such
> > as Xapian or Lucene.
> 
> > Having the Events and the Messages in one DB just makes it not so clean in
> > terms of separation of concern. What if we decide that Xapian is not so good of
> > a FTS indexer. This would mean wrapping the whole DB all over again.
> 
> OK, can you expand this thought in a rationale on the wiki?
> What are the requirements for having the FTS indexer working?
> Actually, an FTS section mentioning Xapian or any other FTS indexer is missing.

I will update the wiki on FTS as soon as possible...
 
> What's the difference between having what you proposed (evend_id+body table)
> and a table with more columns? Would a JOIN on two tables work?

Joining on two tables works.
In theory we could have one table. In practice it would cost us a LOT of space and resources.
Why?
Because, for example, storing the account path or the target_id as strings in the table over and over would require a lot of space and would reduce speed. We would need to map integer ids to those strings, which would live in another table. Integers are easier to compare and don't need as much space as strings in the DB.
Also, having the message strings together with the events in one table would make event lookup queries slower than having an events table alone. This matters, for example, when populating the "recently called" list, where we are not interested in messages.

Splitting the log from the content of a message would allow us to perform much more efficiently. Zeitgeist internally already maps the account path and such to integer ids, which takes a lot of overhead work away from the log.
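Using the same illustrative two-table layout as the sketch in the bug description, the difference boils down to which query has to touch the text (column names are still invented):

#include <sqlite3.h>

/* Cheap query: never touches the (large) message_text table;
 * event_type 2 stands for "call" here, purely for illustration. */
static const char *RECENT_CALLS_SQL =
  "SELECT target_id, MAX(timestamp) FROM events "
  "WHERE event_type = 2 "
  "GROUP BY target_id ORDER BY MAX(timestamp) DESC LIMIT 10";

/* Heavier query: only runs when the bodies are actually wanted. */
static const char *CONVERSATION_SQL =
  "SELECT e.timestamp, e.account_id, t.body "
  "FROM events e JOIN message_text t ON t.event_id = e.event_id "
  "WHERE e.target_id = ? ORDER BY e.timestamp";

static int
prepare_queries (sqlite3 *db, sqlite3_stmt **recent, sqlite3_stmt **conv)
{
  int rc = sqlite3_prepare_v2 (db, RECENT_CALLS_SQL, -1, recent, NULL);
  if (rc != SQLITE_OK)
    return rc;
  return sqlite3_prepare_v2 (db, CONVERSATION_SQL, -1, conv, NULL);
}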

> > Anyhow we could create a dedicated integrated event DB. Which does its own 
> > sync read/write for the Log. We could do that with a our own integration or 
> > with the upcoming libzeitgeist2 which creates a DB for you without the daemon. 
> 
> I thought you already put a section about those possibilities on the wiki.
> Please, add a section about it as well, so we have a complete scenario.

Yeah, I need to find a way to express myself properly for the scenarios. But I think the rest of my answers here will clarify a lot.

> 
> > In both cases if your worries are doing stuff async then swtiching to sync 
> > writing should not be an issue.
> 
> I don't think that the async writing is what worries people, but what make them
> (including me) critical.
> As long as there is a weak point in the proposal, it's not really viable.

OK, now I am lost. What is not viable?
 
> > > The bigger issue is that the second part (body) is the one most likely to be
> > > lost and it's also the most important (or at least equally important with some
> > > other event's info).
> > 
> > When can it get lost.
> 
> Lost = the daemon shuts down before the callback is fired (including ZG never
> called us back).
> How can it happen? Normal dbus service life cycle, desktop lifecycle, etc.
> This part we can work on.
> 
> Last but not least TPL crashes: the longer it takes (in term of steps, rather
> than time, I know ZG is fast) of storing the whole info, the higher the
> possibility of data loss on a crash. This one we cannot do much, but we need to
> make TPL arch less susceptible to inconsistencies on such situations as well.
> 
> > > I don't care if I don't remember the avatar used with the message or the
> > > geolocation of the event.
> > > I care if I don't have the body or I cannot associate the it with timestamp or
> > > from/to.
> > > 
> > > Is there any way to invert the process?
> > > 
> > > 1- write the body with the minimum set of needed info into the Body Index (even
> > > if duplicated in the Log later), assigning a primary key X
> > > 2- write the Log, telling ZG that this event is related to X (or giving it our
> > > own event_id).
> > 
> > Sadly you can't give Zeitgeist an event_id to an event.
> > 
> > Well a good solution is to have a temp_table which stores all the info as it is
> > (strings) as soon as they arrive. When an interaction happens we will first
> > dump it in the temp_table. Then we insert into the log then into the TBI. Once
> > both insertions took place we remove from the temp_table. This way if TPL quits
> > or crashes, or the Log is not reachable the middle of a process, the next time
> > tpl start it will find the temp_table not-empty and try to empty it.
> 
> This is a similar approach to what we use for pending messages, I think Nicolas
> was thinking of a similar thing on Comment #4
> 
> My idea is not considering the temp table temporary at all, but part of the
> log.
> You have already the data, why removing it?

I get your concern here. The answer is tricky. The temp_table saves events in a simple, almost raw mapped format for easier lookup. Keeping that raw format as storage would cost us a lot of space and is not optimized. We would need to optimize it for querying and such, which would end up being an implementation of a whole new SQLite DB similar to what I just proposed.

> This also would make TPL queries (the log_manger_get_FOO()) not asking two
> places, but just one.

Well, yes, but at the cost of speed and space; check my explanation above.

> > You might
> > ask why not keep the temp_table as our main storage. Well:
> > 1) it is hard to do a FTS index around it.
> 
> I look forward to seeing it on the Wiki. Would it help to have two tables?
> One for the body and one for the rest (timestamp, id).

Not really. If we use SQLite's FTS then yes, it would. If we use Xapian or Lucene, both need their own DB format, which is efficient for FTS but not for normal querying. Also, FTS DBs use a LOT of space compared to a normal SQLite DB. My ZG DB is 20 MB and my FTS DB for Zeitgeist (we also used the split method) is 43 MB. And Zeitgeist does not really log big text, only headers and mimetypes and such.
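For reference, "SQLite's FTS" here means something along these lines (fts4 was the module available at the time; the table name is invented). The virtual table keeps its own tokenized copy of the text, which is where the extra disk space goes:

#include <sqlite3.h>

static int
create_and_query_fts (sqlite3 *db)
{
  sqlite3_stmt *stmt;
  int rc;

  rc = sqlite3_exec (db,
      "CREATE VIRTUAL TABLE IF NOT EXISTS message_fts USING fts4(body)",
      NULL, NULL, NULL);
  if (rc != SQLITE_OK)
    return rc;

  /* MATCH uses the full-text index instead of a LIKE table scan. */
  rc = sqlite3_prepare_v2 (db,
      "SELECT rowid FROM message_fts WHERE body MATCH ?",
      -1, &stmt, NULL);
  if (rc != SQLITE_OK)
    return rc;

  sqlite3_bind_text (stmt, 1, "zeitgeist", -1, SQLITE_STATIC);
  while (sqlite3_step (stmt) == SQLITE_ROW)
    ;  /* sqlite3_column_int64 (stmt, 0) is a matching message's rowid */
  return sqlite3_finalize (stmt);
}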
 
> > 2) it will have duplicate string entries for example the target string. Which
> > can be costly and should be rather stored as an int.
> 
> I don't understand what you mean. Is it a write() problem?
> We already write the data fully in the temp table.

The temp table, if not optimized, will be storing a lot of strings, which will cost time when querying, since looking up strings is slower than looking up integers. We would need to have tables mapping property values to integers, which would lead to more or less a log like Zeitgeist :D

> It this is an issue, it can be avoided re-factoring into multiple tables
> 
> | tpl_id | event_id | body | (table 1)
> | contact_id | contact_id_number (yeah, silly name) | (table 2)
> | tpl_id | timestamp | contact_id_number | (table 3)
> Table 2 is written when a new contact enters the log (this write() happens once
> in the table lifetime for each contact who will eventually contact us).
> An in memory cache (hashtable) can be used in the LogStore for the recently
> contacted people, so to avoid continuous queries (read) to table 2 as well.
> 
> This way we have a body and then only integers to deal with, on the average
> situation.

Tables 2 and 3 are more or less provided by the Log internally. Table 1, however, as explained before, if we decide that this temp_table is not a temp_table at all, would force us to use the SQLite FTS and would require a lot of implementation on our side to optimize.

If we keep the temp_table temporary, and entries are purged once written to the actual Log and TBI, we will reduce the implementation effort, and the queries will be efficient since the temp_table would ideally always have just one entry in it.

> The real problem is how to deal with data loss :)
> 
> > > This is a scenario in which we have a private DB for what we need and delegate
> > > to ZG all the extra data, rather to have the private DB to keep what ZG cannot
> > > store/is better not store in ZG.
> > 
> > I can't follow. Can you elaborate?
> 
> It's just a considaration on how ZG and the private DB are used.
> 
> WRT my former idea (a) of 
> 1- writing the whole needed info into SQLite
> 2- push the event info to ZG
> 
> and your idea (b) of
> 1- push the event info to ZG
> 2- writing the info that ZG does not store into SQLite
> 
> a) is a way to have a local SQLite DB with the majority of the info we need,
> and use ZG (from TPL) to get the rest of the info. If for any reason ZG is not
> running, we still can work.
> ZG has partially duplicated info, but it wouldn't be a bit issue in my opinion.

Agreed, this wouldn't be an issue, but it would require a lot of implementation.

> in b) ZG has a main role, and the private SQLite is there only because ZG
> cannot do FTS for the moment.
> Delegating the whole log to ZG would be OK, the fact that it cannot handle the
> body index for the moment is what actually creating the problem (see callback),
> fixing that is probably the ideal solution.
> Although, we completely relay on ZG, which means that if ZG is down, we cannot
> do it. Can it be an issue?

Well, Zeitgeist could take over the whole TBI in the form of an extension. That means Zeitgeist would have an extension, just like its FTS extension, that hooks into "post_insert_event" and writes the body into a new table. This would however move the D-Bus API into the Zeitgeist domain. :/

But I think I have a good idea. We could easily create an observer extension in Zeitgeist that listens to Telepathy events and does the reading and writing in Zeitgeist internally. Or the LogStore could forward events to Zeitgeist (via the exposed extension API). In both cases Zeitgeist would maintain the Log and the TBI. Zeitgeist then replies with an event_id, and the LogStore can remove the event from its "temp_table".
Reading will be done directly.

So, to sum it up: if the LogStore is to push the events to Zeitgeist (a rough sketch of the callback side follows at the end of this comment):
1) TPL would tell Zeitgeist to log an event with the body, while keeping a copy of the event and the body in a temp_table.
2) Zeitgeist does its own magic, and returns an id for the event (which means event and body were successfully stored).
3) TPL gets the id back (via callback) and removes the entry from the temp_table.

Otherwise, if Zeitgeist has an extension with an observer:
1) The extension in Zeitgeist would listen to events from Telepathy.
2) When an event occurs, Zeitgeist does its own magic, like it does with its current FTS index, maintaining two DBs.

In both cases deleting stuff will be handled internally in Zeitgeist by the extension.

To read, TPL would have direct access to the DB of Zeitgeist using libzeitgeist2 and would extract the info it needs.
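For the first flow, step 3 would look roughly like the sketch below; the callback signature and the temp_table layout are hypothetical, since the real ones depend on the ZG client API:

#include <glib.h>
#include <sqlite3.h>

typedef struct
{
  sqlite3      *db;        /* the private TPL database */
  sqlite3_int64 temp_row;  /* rowid saved when the message was cached */
} PendingInsert;

/* Invoked from the ZG insertion callback once the new id is known. */
static void
on_event_stored (guint32 zg_event_id, gpointer user_data)
{
  PendingInsert *pending = user_data;
  sqlite3_stmt *stmt;

  /* The event and its body are now safely in Zeitgeist, so the
   * crash-recovery copy can be dropped. */
  sqlite3_prepare_v2 (pending->db,
      "DELETE FROM temp_table WHERE rowid = ?", -1, &stmt, NULL);
  sqlite3_bind_int64 (stmt, 1, pending->temp_row);
  sqlite3_step (stmt);
  sqlite3_finalize (stmt);

  g_print ("event %u stored, temp_table entry %lld purged\n",
           zg_event_id, (long long) pending->temp_row);
  g_slice_free (PendingInsert, pending);
}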
Comment 10 Seif Lotfy 2012-05-18 10:18:18 UTC
I updated the wiki. Please tell me what you think
Comment 11 GitLab Migration User 2019-12-03 19:31:28 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/telepathy/telepathy-logger/issues/25.

