Bug 8821 - Performance Improvement for TextPage:findText()
Summary: Performance Improvement for TextPage:findText()
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: general (show other bugs)
Version: unspecified
Hardware: x86 (IA32) Linux (All)
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-10-29 16:15 UTC by David Benjamin
Modified: 2018-08-20 22:09 UTC (History)
2 users (show)

See Also:
i915 platform:
i915 features:


Attachments
patch for TextOutputDev.cc (from 0.5.4 release) (1.13 KB, patch)
2006-10-29 16:16 UTC, David Benjamin
Details | Splinter Review
Better diff format (1.99 KB, patch)
2006-11-02 16:56 UTC, David Benjamin
Details | Splinter Review
Implementation of KMP (Knuth-Morris-Pratt search algorithm) (8.42 KB, patch)
2010-06-13 23:45 UTC, Johannes Buchner
Details | Splinter Review
starting from HEAD, only reformats whitespaces (8.66 KB, patch)
2010-06-15 08:51 UTC, Johannes Buchner
Details | Splinter Review
Implementation of KMP (Knuth-Morris-Pratt search algorithm) (7.92 KB, patch)
2010-06-15 08:52 UTC, Johannes Buchner
Details | Splinter Review

Description David Benjamin 2006-10-29 16:15:47 UTC
(Sorry if I did the patch file wrong or anything like that... this is my first
(and hopefully not the last :-D) contribution to a FOSS project other than a
typo correction, so... yeah. :-) )

Anyways, so I made some minor improvements to TextPage:findText in
poppler/TextOutputDev.cc.

Hopefully I didn't make any mis-optimizations out of misunderstanding bits of
the codebase. :-)

Changes:
- If a block is past what we've already found, then break.
- If a line is past the end of the searching position, break if searching forwards
- If a line is past what we've already found, break if searching forwards,
otherwise just continue
- When searching within a line, instead of searching according to backward,
search via searchBackwards which takes line rotation into account, thus allowing
us to break once the first match in a line is found.

I've also got an implementation of KMP (
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm ) with
caching of failure functions since most programs do find-as-you-type, but I need
to make sure the (minor) extra complexity incurred doesn't outweigh the O()
improvement. Probably will attach that in a couple days.


I hope it's of some use. (Changes made against the 0.5.4 release.)


Hmm... I take it the attach file thing is after commiting the report? (I'm not
that familiar with Bugzilla yet.)
Comment 1 David Benjamin 2006-10-29 16:16:37 UTC
Created attachment 7577 [details] [review]
patch for TextOutputDev.cc (from 0.5.4 release)
Comment 2 David Benjamin 2006-10-30 09:59:21 UTC
Oops. enhancement means "Request for enhancement" which isn't quite right.
Sorry. :-) I'll just set them to the default settings I guess.
Comment 3 Albert Astals Cid 2006-10-30 10:39:38 UTC
Thanks for the patch David. I'm going to ask you two things:
 * please repost the patch with diff -u, that makes patches more readable, at 
least imho.
 * Send a mail to poppler mailing list asking if someone can review it (give a 
link to here) as my knowledge of that part of the code is not as big as i 
would want.
Comment 4 David Benjamin 2006-11-02 16:56:36 UTC
Created attachment 7638 [details] [review]
Better diff format

Okay, redid with diff -u... will post to mailing list as soon as I figure out
how that is done. ;-)
Comment 5 Albert Astals Cid 2006-11-06 10:54:18 UTC
Go to http://poppler.freedesktop.org and take the poppler mailing list link
Comment 6 Albert Astals Cid 2010-02-09 14:59:22 UTC
Does this still apply?
Comment 7 David Benjamin 2010-02-09 17:24:34 UTC
Oh, hey. This thing.

I would imagine it doesn't, given that it's been 3-4 years. There might also been some problems with the patch, or perhaps it was the KMP I never posted? I don't remember.

I'll take a look at it sometime and see what can be done, or if even it's worth touching. I don't think this younger me ever checked things in a profiler.
Comment 8 Johannes Buchner 2010-06-13 15:44:23 UTC
I'd also be interested in a KMP version. Maybe I'll give it a shot. 
David, could you dig up your KMP patch and post it (no matter how bad it is)?
Comment 9 Johannes Buchner 2010-06-13 23:36:15 UTC
Review of attachment 7638 [details] [review]:

This patch is about making "backwards" mean backwards when the page is upside down (rotation=2). I'm uncertain if the semantics have been discussed enough.
Since it is not about performance, it should be in another bug.
Comment 10 Johannes Buchner 2010-06-13 23:45:34 UTC
Created attachment 36255 [details] [review]
Implementation of KMP (Knuth-Morris-Pratt search algorithm)

I implemented the Knuth-Morris-Pratt algorithm.

As far as I see it works correctly. I tested with the glib-demo tool. But I didn't test backward search, although I implemented it (simply by reversing array access indices).

I was able to see about twice the speed in some corner cases, but usually it has no gain. However, in no case I saw a degradation.

Please review and test!

It should also be considered that we rebuild the search string (Unicode conversion) and the next-array for KMP on every call of TextPage::findText. Since it is normally called through all pages, one could save some calculation there.

I also looked at Boyer-Moore, but it seems stupid to make a lookup table that covers the whole Unicode space, and implementing it over a list or hashmap pretty much takes away the benefit of BM.
Comment 11 Johannes Buchner 2010-06-14 00:06:29 UTC
(In reply to comment #10)
> It should also be considered that we rebuild the search string (Unicode
> conversion) and the next-array for KMP on every call of TextPage::findText.
> Since it is normally called through all pages, one could save some calculation
> there.
I just realized that there is no easy fix for this since findText is a operation based on pages, not documents. All in all, the dealing with blocks and lines is quite clumsy, but I guess there is no other way.
Comment 12 Johannes Buchner 2010-06-15 08:51:10 UTC
Created attachment 36291 [details] [review]
starting from HEAD, only reformats whitespaces

whitespaces fixes first in one patch
Comment 13 Johannes Buchner 2010-06-15 08:52:08 UTC
Created attachment 36292 [details] [review]
Implementation of KMP (Knuth-Morris-Pratt search algorithm)

Patch without whitespace (apply other patch first)
Comment 14 Albert Astals Cid 2010-06-15 15:49:29 UTC
You didn't seemed really convinced in the mailing list this should go in, should i spend my time reviewing it or not?
Comment 15 Johannes Buchner 2010-06-15 16:54:17 UTC
(In reply to comment #14)
> You didn't seemed really convinced in the mailing list this should go in,
> should i spend my time reviewing it or not?

Please do.
Comment 16 Carlos Garcia Campos 2010-06-17 01:57:26 UTC
Do I have to apply both patches? There seems to be already unrelated changes in the combination of both patches. Could you provide a single patch with only the kmp implementation, please?
Comment 17 Johannes Buchner 2010-06-17 02:08:39 UTC
You can either apply the one from 2010-06-13 23:45 PDT, or the two later ones. They are the same, but I was asked to split them since I cleaned up the whitespace mess (which is all that the first patch does).
The patches from David are unrelated.
Comment 18 Carlos Garcia Campos 2010-06-29 00:36:44 UTC
Hi Johannes, sorry for the delay. There seems to be still urelated changes in your second patch. Could you please send a single patch with only changes required to implement KMP search? That code is inherited from xpdf code, we avoid to fix indentation or whitespaces issues because it would make future xpdf merges harder for us.
Comment 19 Carlos Garcia Campos 2010-06-29 02:14:07 UTC
Review of attachment 36292 [details] [review]:

::: poppler/TextOutputDev.cc
@@ +3386,3 @@
+  j = 0;
+  kmpNext[0] = -1;
+  for (int i = 0; i < m; i++) {

var i was initialized to 2 before and you are redefining it here.
Comment 20 Albert Astals Cid 2011-06-19 14:56:36 UTC
So what's the status on this?
Comment 21 GitLab Migration User 2018-08-20 22:09:43 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/223.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.