Bug 48834

Summary: Optimize FindEventIds query speed
Product: Zeitgeist Reporter: Siegfried-Angel Gevatter Pujals <siggi.gevatter>
Component: EngineAssignee: Seif Lotfy <seif>
Status: RESOLVED FIXED QA Contact: zeitgeist-bugs <zeitgeist-bugs>
Severity: normal    
Priority: medium CC: seif
Version: 0.9.x   
Hardware: Other   
OS: All   
Whiteboard:
Attachments: script to test the original query
Improve querying time for queries without templates
Remove usage of IN and replace with AND/OR
Query speed comparison
Query speed comparison
Added simple check if it is possible to use event table instead of event_view

Description Siegfried-Angel Gevatter Pujals 2012-04-17 10:54:19 UTC
Created attachment 60204 [details]
script to test the original query

At https://bugs.launchpad.net/zeitgeist/+bug/844877, Michal Hruby reported a particular query that he found to be slow. I'm attaching a script to check that query.

Some preliminary testing indicates that there seems to be some opportunity for optimizing queries (esp. noticeably with big databases, of the order of ~200.000 events).
Comment 1 Siegfried-Angel Gevatter Pujals 2012-04-30 08:30:58 UTC
I've been looking into some simpler query first. Particularly, to search for MOST_RECENT_SUBJECTS with no conditions, we are currently using the following code:

| SELECT id FROM event_view
| NATURAL JOIN (
|    SELECT subj_id, MAX(timestamp) AS timestamp
|    FROM event_view GROUP BY subj_id)
| GROUP BY subj_id
| ORDER BY timestamp DESC LIMIT 10;

This takes ~0.6 seconds with 300k events and ~0.06 seconds with 30k events.

A more straightforward query like the following only takes half as much time (in both cases):

| SELECT id
| FROM event_view
| GROUP BY subj_id
| ORDER BY MAX(timestamp) DESC LIMIT 10;

However, this doesn't work since «each non-aggregate expression ... is evaluated once for an arbitrarily selected row» [0], so "id" may be that of any event, not necessarily the most recent one. We can't use "MAX(id)" either, since id and timestamp are unrelated.

SQLite supports user-defined aggregate functions, so maybe we could do something like "NEWEST_EVENT(id, timestamp)".

[0] https://www.sqlite.org/lang_select.html#resultset
Comment 2 Seif Lotfy 2012-05-12 19:27:40 UTC
(In reply to comment #1)
> I've been looking into some simpler query first. Particularly, to search for
> MOST_RECENT_SUBJECTS with no conditions, we are currently using the following
> code:
> 
> | SELECT id FROM event_view
> | NATURAL JOIN (
> |    SELECT subj_id, MAX(timestamp) AS timestamp
> |    FROM event_view GROUP BY subj_id)
> | GROUP BY subj_id
> | ORDER BY timestamp DESC LIMIT 10;
> 
> This takes ~0.6 seconds with 300k events and ~0.06 seconds with 30k events.
> 
> A more straightforward query like the following only takes half as much time
> (in both cases):
> 
> | SELECT id
> | FROM event_view
> | GROUP BY subj_id
> | ORDER BY MAX(timestamp) DESC LIMIT 10;
> 
> However, this doesn't work since «each non-aggregate expression ... is
> evaluated once for an arbitrarily selected row» [0], so "id" may be that of any
> event, not necessarily the most recent one. We can't use "MAX(id)" either,
> since id and timestamp are unrelated.
> 
> SQLite supports user-defined aggregate functions, so maybe we could do
> something like "NEWEST_EVENT(id, timestamp)".
> 
> [0] https://www.sqlite.org/lang_select.html#resultset

OK I have a fix for the first example you posted. The patch is attached and basically what it does is check if the WHERE clause is empty or not. Works like charm.
Comment 3 Seif Lotfy 2012-05-12 19:28:38 UTC
Created attachment 61536 [details] [review]
Improve querying time for queries without templates
Comment 4 Seif Lotfy 2012-05-13 13:59:02 UTC
Created attachment 61583 [details] [review]
Remove usage of IN and replace with AND/OR

So I tested this patch on the the following test http://pastebin.com/b6bCMj94
I ended up saving around 0.03 - 0.05 seconds for these queries. After a bit of investigation it turns out that the IN statement adds a bit to the query plan.

In this case trunk gives me a query explanation of 34 lines while my patch reduces this explanation to 7 lines.
Comment 5 Trever Fischer 2012-05-14 13:42:39 UTC
Created attachment 61650 [details]
Query speed comparison

(In reply to comment #4)
> Created attachment 61583 [details] [review] [review]
> Remove usage of IN and replace with AND/OR
> 
> So I tested this patch on the the following test http://pastebin.com/b6bCMj94
> I ended up saving around 0.03 - 0.05 seconds for these queries. After a bit of
> investigation it turns out that the IN statement adds a bit to the query plan.
> 
> In this case trunk gives me a query explanation of 34 lines while my patch
> reduces this explanation to 7 lines.

The attached graph includes a comparison against your comment in Bug 49023. It shows longer query times in all instances when compared to the modified indexes.
Comment 6 Trever Fischer 2012-05-14 15:36:39 UTC
Created attachment 61659 [details]
Query speed comparison

(In reply to comment #5)
> Created attachment 61650 [details]
> Query speed comparison
> 
> (In reply to comment #4)
> > Created attachment 61583 [details] [review] [review] [review]
> > Remove usage of IN and replace with AND/OR
> > 
> > So I tested this patch on the the following test http://pastebin.com/b6bCMj94
> > I ended up saving around 0.03 - 0.05 seconds for these queries. After a bit of
> > investigation it turns out that the IN statement adds a bit to the query plan.
> > 
> > In this case trunk gives me a query explanation of 34 lines while my patch
> > reduces this explanation to 7 lines.
> 
> The attached graph includes a comparison against your comment in Bug 49023. It
> shows longer query times in all instances when compared to the modified
> indexes.

And *this* is the correct graph. In the last graph I rebuilt zeitgeist but didn't restart the engine. Your patch increases speed everywhere, but only slightly. Still, its an improvement.
Comment 7 Seif Lotfy 2012-06-18 11:13:12 UTC
Created attachment 63184 [details] [review]
Added simple check if it is possible to use event table instead of event_view

Using event table instead of event_view when possible improves speed of querying.
I added a check if it is possible to use event table instead of event_view before query execution.
Comment 8 Seif Lotfy 2013-01-14 11:53:36 UTC
This is fixed in master