Summary: | Optimize FindEventIds query speed | ||
---|---|---|---|
Product: | Zeitgeist | Reporter: | Siegfried-Angel Gevatter Pujals <siggi.gevatter> |
Component: | Engine | Assignee: | Seif Lotfy <seif> |
Status: | RESOLVED FIXED | QA Contact: | zeitgeist-bugs <zeitgeist-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | seif |
Version: | 0.9.x | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
script to test the original query
Improve querying time for queries without templates Remove usage of IN and replace with AND/OR Query speed comparison Query speed comparison Added simple check if it is possible to use event table instead of event_view |
Description
Siegfried-Angel Gevatter Pujals
2012-04-17 10:54:19 UTC
I've been looking into some simpler query first. Particularly, to search for MOST_RECENT_SUBJECTS with no conditions, we are currently using the following code: | SELECT id FROM event_view | NATURAL JOIN ( | SELECT subj_id, MAX(timestamp) AS timestamp | FROM event_view GROUP BY subj_id) | GROUP BY subj_id | ORDER BY timestamp DESC LIMIT 10; This takes ~0.6 seconds with 300k events and ~0.06 seconds with 30k events. A more straightforward query like the following only takes half as much time (in both cases): | SELECT id | FROM event_view | GROUP BY subj_id | ORDER BY MAX(timestamp) DESC LIMIT 10; However, this doesn't work since «each non-aggregate expression ... is evaluated once for an arbitrarily selected row» [0], so "id" may be that of any event, not necessarily the most recent one. We can't use "MAX(id)" either, since id and timestamp are unrelated. SQLite supports user-defined aggregate functions, so maybe we could do something like "NEWEST_EVENT(id, timestamp)". [0] https://www.sqlite.org/lang_select.html#resultset (In reply to comment #1) > I've been looking into some simpler query first. Particularly, to search for > MOST_RECENT_SUBJECTS with no conditions, we are currently using the following > code: > > | SELECT id FROM event_view > | NATURAL JOIN ( > | SELECT subj_id, MAX(timestamp) AS timestamp > | FROM event_view GROUP BY subj_id) > | GROUP BY subj_id > | ORDER BY timestamp DESC LIMIT 10; > > This takes ~0.6 seconds with 300k events and ~0.06 seconds with 30k events. > > A more straightforward query like the following only takes half as much time > (in both cases): > > | SELECT id > | FROM event_view > | GROUP BY subj_id > | ORDER BY MAX(timestamp) DESC LIMIT 10; > > However, this doesn't work since «each non-aggregate expression ... is > evaluated once for an arbitrarily selected row» [0], so "id" may be that of any > event, not necessarily the most recent one. We can't use "MAX(id)" either, > since id and timestamp are unrelated. > > SQLite supports user-defined aggregate functions, so maybe we could do > something like "NEWEST_EVENT(id, timestamp)". > > [0] https://www.sqlite.org/lang_select.html#resultset OK I have a fix for the first example you posted. The patch is attached and basically what it does is check if the WHERE clause is empty or not. Works like charm. Created attachment 61536 [details] [review] Improve querying time for queries without templates Created attachment 61583 [details] [review] Remove usage of IN and replace with AND/OR So I tested this patch on the the following test http://pastebin.com/b6bCMj94 I ended up saving around 0.03 - 0.05 seconds for these queries. After a bit of investigation it turns out that the IN statement adds a bit to the query plan. In this case trunk gives me a query explanation of 34 lines while my patch reduces this explanation to 7 lines. Created attachment 61650 [details] Query speed comparison (In reply to comment #4) > Created attachment 61583 [details] [review] [review] > Remove usage of IN and replace with AND/OR > > So I tested this patch on the the following test http://pastebin.com/b6bCMj94 > I ended up saving around 0.03 - 0.05 seconds for these queries. After a bit of > investigation it turns out that the IN statement adds a bit to the query plan. > > In this case trunk gives me a query explanation of 34 lines while my patch > reduces this explanation to 7 lines. The attached graph includes a comparison against your comment in Bug 49023. It shows longer query times in all instances when compared to the modified indexes. Created attachment 61659 [details] Query speed comparison (In reply to comment #5) > Created attachment 61650 [details] > Query speed comparison > > (In reply to comment #4) > > Created attachment 61583 [details] [review] [review] [review] > > Remove usage of IN and replace with AND/OR > > > > So I tested this patch on the the following test http://pastebin.com/b6bCMj94 > > I ended up saving around 0.03 - 0.05 seconds for these queries. After a bit of > > investigation it turns out that the IN statement adds a bit to the query plan. > > > > In this case trunk gives me a query explanation of 34 lines while my patch > > reduces this explanation to 7 lines. > > The attached graph includes a comparison against your comment in Bug 49023. It > shows longer query times in all instances when compared to the modified > indexes. And *this* is the correct graph. In the last graph I rebuilt zeitgeist but didn't restart the engine. Your patch increases speed everywhere, but only slightly. Still, its an improvement. Created attachment 63184 [details] [review] Added simple check if it is possible to use event table instead of event_view Using event table instead of event_view when possible improves speed of querying. I added a check if it is possible to use event table instead of event_view before query execution. This is fixed in master |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.