Bug 28053 - poppler is too slow when searching this file
Summary: poppler is too slow when searching this file
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: glib frontend
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
Reported: 2010-05-10 11:10 UTC by Pablo Rodríguez
Modified: 2018-08-21 10:59 UTC

Attachments
glib/poppler-page.cc (52.02 KB, patch)
2011-04-22 03:09 UTC, Sebastian Kums
poppler/TextOutputDev.h (23.88 KB, patch)
2011-04-22 03:11 UTC, Sebastian Kums
poppler/TextOutputDev.cc (123.05 KB, patch)
2011-04-22 03:12 UTC, Sebastian Kums

Description Pablo Rodríguez 2010-05-10 11:10:50 UTC
poppler-glib (and possibly other frontends) is too slow when searching http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:C:2010:083:FULL:ES:PDF.

Could you check this?

Thanks,


Pablo
Comment 1 Albert Astals Cid 2010-05-10 11:25:15 UTC
Any chance you can give some more detail on what you are doing?
Comment 2 Pablo Rodríguez 2010-05-10 14:21:46 UTC
(In reply to comment #1)
> Any chance you can give some more detail on what you are doing?

To compare search speeds, I searched for "Unternehmen" in http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:C:2010:083:FULL:ES:PDF and it took 116 secs to report that the word was not found.

The same search for "Unternehmen" takes 25 secs to report no match in http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.

Something seems wrong here, since the second document is larger both in file size and in amount of text.

BTW, I used a German word (one that won't be found) to make evince search the whole document.


Pablo
Comment 3 James Cloos 2010-05-11 07:32:17 UTC
I took a look at that file.

It has an unusually large number of objects, somewhere on the order of
8000 objects for a file with only 300+ pages.

I suspect the large number of objects is what slows down poppler.
Comment 4 Carlos Garcia Campos 2010-05-11 07:36:13 UTC
Yes, and I think most of the time is spent parsing ICC color space objects.
Comment 5 Albert Astals Cid 2010-05-11 11:45:03 UTC
And why do we parse ICC color spaces when extracting text?
Comment 6 Carlos Garcia Campos 2010-05-11 12:08:44 UTC
(In reply to comment #5)
> And why do we parse ICC color spaces when extracting text?

good question, I guess we need to add if (!out->needNonText()) return; in more places in Gfx.cc
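
The guard suggested above could look roughly like the following. This is only a minimal sketch: OutputDevSketch, TextOnlyDevSketch, GfxSketch and the counter are hypothetical stand-ins invented for illustration, not poppler's actual classes. Only the needNonText() name is taken from the comment.

```cpp
// Hypothetical stand-ins for illustration; these are NOT poppler's classes.
struct OutputDevSketch {
    virtual ~OutputDevSketch() = default;
    // Same idea as the needNonText() check mentioned above: does this
    // device care about anything besides text?
    virtual bool needNonText() const { return true; }
};

// A device that only extracts text, as a search would use.
struct TextOnlyDevSketch : OutputDevSketch {
    bool needNonText() const override { return false; }
};

struct GfxSketch {
    explicit GfxSketch(OutputDevSketch *dev) : out(dev) {}

    // Stand-in for an operator handler that would parse a color space.
    void opSetFillColorSpace() {
        if (!out->needNonText()) {
            return; // text-only device: skip the expensive parse
        }
        ++colorSpacesParsed; // placeholder for the actual ICC parsing work
    }

    OutputDevSketch *out;
    int colorSpacesParsed = 0;
};
```

With a text-only device the handler returns before doing any color space work, which is the saving being proposed for text extraction.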
Comment 7 Pablo Rodríguez 2010-05-14 09:18:19 UTC
I have two PDF documents containing the same text (a book typeset from different Lambda/XeLaTeX sources) that also show a huge difference when searching for non-existent text, as with the documents above: 22 vs 3 secs.

I don't think that ICC color space objects are the problem in the documents I refer to.

Is there a way that I can provide the test you need to check this bug? Sorry, but I would rather avoid submitting those files.

Thanks for your help,


Pablo
Comment 8 Sebastian Kums 2011-04-22 03:07:28 UTC
I had a look at this recently because I am heavily affected by this bug: I search a lot in big PDF documents (1000+ pages), and a search takes 2-3 minutes. As you probably know, the reason is that poppler needs to render each page (at least the text portion) in order to extract its text. After the search, the rendered page is discarded in order to conserve memory (I guess).

I wrote a little proof-of-concept enhancement that reduces the search duration to less than 5 seconds in a 1000-page document for consecutive (!) searches. Please mind that this is really just a proof of concept! I used pretty dirty programming techniques, because (a) I have hardly any time and (b) I am not a good C programmer. All I want is to propose an idea. Here is how it works:

I introduced a static variable in glib/poppler-page.cc: poppler_page_find_text() named text_cache. The contents of this variable are never deleted, since it is static. (Again, there are probably more elegant methods to achieve this, but this is only a demo.) The program flow is as follows:

1. evince calls poppler_page_find_text() to search for text in a page
   given as an argument.
2. poppler_page_find_text() now checks, if the page to be searched is
   already in the cache (i.e., the variable text_cache).
2.a. The first time that the given page is searched, this will not be
     the case, because the page has never been rendered before. Therefore,
     the program flow is as usual:
2.a.1. The page is rendered to text_dev.
2.a.2. Then, text_dev->findText() is called to search for the text.
2.a.3. I made a little addition at this point: while the rendered page is 
       usually discarded after this operation, I write the plain text to
       text_cache by calling text_dev->cacheText() - a function I added.
       Only after this, the rendered page is discarded. Please
       note that text_cache only contains the plain text, which is a lot
       smaller in memory than the whole rendition of the page (a few hundred 
       bytes vs. several hundred kB per page).
2.b. The next time the user performs a search on this very page, the plain
     text of the page will be found in the text_cache.
     poppler_page_find_text() therefore calls a function I added, named
     poppler_page_scan_text_cache(), which searches the text_cache for
     the given keyword. poppler_page_scan_text_cache() only tells IF the
     keyword is contained in the page. It cannot tell WHERE it is
     located. However, this result is returned very fast, because the
     page does not need to be rendered.
2.b.a. If the text was not found on the page, then poppler_page_find_text()
       aborts immediately - there is no need to render the page, because
       it can be assumed that the text is not contained in the page.
2.b.b. If the text was found in the cached text page, then the page is
       rendered as usual and text_dev->findText is called to determine
       the exact locations of the text on the page.

The big performance improvement is made in 2.b.a. poppler saves the effort of rendering a whole lot of pages, because before rendering, it checks if the text is contained in the page at all (which is very fast). Pages that do not contain the text are skipped.
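
The flow in steps 1 to 2.b.b can be sketched in a self-contained form as follows. This is not the attached patch: PageSearcherSketch, Match, render() and renderCount() are hypothetical names made up for illustration, and a plain string per page stands in for the real rendering to text_dev.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration; none of these names are poppler API.
struct Match {
    std::size_t offset; // position of a hit within the page text
};

class PageSearcherSketch {
public:
    // pages[i] stands in for the text that rendering page i would produce.
    explicit PageSearcherSketch(std::vector<std::string> pages)
        : pages_(std::move(pages)) {}

    std::vector<Match> findText(std::size_t page, const std::string &needle) {
        auto cached = textCache_.find(page);
        if (cached != textCache_.end() &&
            cached->second.find(needle) == std::string::npos) {
            return {}; // step 2.b.a: cached text has no hit, skip rendering
        }
        // Steps 2.a and 2.b.b: render the page and locate every match.
        const std::string text = render(page);
        std::vector<Match> matches;
        for (std::size_t pos = text.find(needle); pos != std::string::npos;
             pos = text.find(needle, pos + 1)) {
            matches.push_back({pos});
        }
        // Step 2.a.3: keep only the plain text (no-op if already cached).
        textCache_.emplace(page, text);
        return matches;
    }

    // Number of (simulated) expensive renders performed so far.
    int renderCount() const { return renderCount_; }

private:
    const std::string &render(std::size_t page) {
        ++renderCount_;
        return pages_.at(page);
    }

    std::vector<std::string> pages_;
    std::unordered_map<std::size_t, std::string> textCache_;
    int renderCount_ = 0;
};
```

In this sketch the cache is a member keyed by page index rather than a static variable, so it goes away together with the searcher object. That is a design choice of the sketch, not something the proof of concept does.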

Of course, the first time that the document is opened, the search will be slow as usual, because the text cache is empty. But every consecutive search is way faster, because only pages containing the search string are rendered. I believe that acroread even goes as far as saving the text_cache to disk so that it can be restored in the future without the need to render the whole document again. I did not implement this, though.

To sum it up:
Pros:
- Dramatically improved search speed for consecutive searches (obviously).
Cons:
- Minimal memory overhead to cache text of document (negligible, IMO).
- The first search is still slow (could be improved by saving text_cache to disk and restoring it, next time the document is opened).
- If the search string is found on all pages, the search is still slow, because the text_cache produces only hits and all pages are rendered. This is a different story, though. Ideally evince should not render any but the currently displayed page. It would be sufficient, if poppler returned only the page numbers that contain the text in question and not the exact location of all occurrences. Only when the user goes to a page with a hit, the page should be rendered. But then again, why would I search a document for a string, which is found on every page anyway?
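
The "page numbers only" idea from the last point could be sketched like this, assuming the plain text of every page is already cached. pagesContaining() and its parameters are hypothetical names for illustration and exist in neither poppler nor evince.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: cachedText[i] holds the plain text of page i.
// Returns the indices of pages that contain needle, without rendering
// anything and without computing where on the page the hits are.
std::vector<std::size_t> pagesContaining(const std::vector<std::string> &cachedText,
                                         const std::string &needle) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < cachedText.size(); ++i) {
        if (cachedText[i].find(needle) != std::string::npos) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

A viewer could then render a page and compute exact match locations only when the user actually navigates to one of the returned pages.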

I will attach my adapted source. The source is for poppler 0.12.4. It is old - I know - but this is current on my distro (Ubuntu 10.04). The patch can be easily adapted to the most recent version of poppler, though.

Files:
- poppler/TextOutputDev.h: added "struct TextCachePage" and "TextOutputDev::cacheText()"
- poppler/TextOutputDev.cc: added "TextOutputDev::cacheText()"
- glib/poppler-page.cc: added "poppler_page_scan_text_cache()" and modified "poppler_page_find_text()"

P.S. I can totally understand, if the poppler developers say, this is crap. I just want to present an idea, which others might benefit from, too. If you don't like it, feel free to toss it. I am using it happily.
Also, sorry for the lengthy post.
Comment 9 Sebastian Kums 2011-04-22 03:09:55 UTC
Created attachment 45937
glib/poppler-page.cc

added "poppler_page_scan_text_cache()" and modified "poppler_page_find_text()"
Comment 10 Sebastian Kums 2011-04-22 03:11:51 UTC
Created attachment 45938
poppler/TextOutputDev.h

added "struct TextCachePage" and "TextOutputDev::cacheText()"
Comment 11 Sebastian Kums 2011-04-22 03:12:42 UTC
Created attachment 45939
poppler/TextOutputDev.cc

added "TextOutputDev::cacheText()"
Comment 12 Pablo Rodríguez 2018-05-26 15:31:32 UTC
Any news on this issue?
Comment 13 GitLab Migration User 2018-08-21 10:59:28 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/462.

