poppler-glib (and possibly other frontends) is too slow when searching http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:C:2010:083:FULL:ES:PDF. Could you check this? Thanks, Pablo
Any chance you can give some more detail on what you are doing?
(In reply to comment #1) > Any chance you can give some more detail on what you are doing? To check search speed, I search for "Unternehmen" in http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:C:2010:083:FULL:ES:PDF and it takes 116 seconds to report that the word was not found. The same search takes only 25 seconds in http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf, even though the second document is larger in both file size and amount of text, so something seems wrong here. BTW, I used a German word (one that won't be found) so that evince has to search the whole document. Pablo
I took a look at that file. It has an unusually large number of objects, somewhere on the order of 8000 for a file with only 300+ pages. I suspect the large number of objects is what slows down poppler.
Yes, and I think most of the time is spent parsing ICC color space objects.
And why do we parse ICC color spaces when extracting text?
(In reply to comment #5) > And why do we parse ICC color spaces when extracting text? Good question. I guess we need to add "if (!out->needNonText()) return;" in more places in Gfx.cc.
I have two PDF documents containing the same text (a book built from different Lambda/XeLaTeX sources) that also show a huge difference when searching for non-existing text, as in the documents above: 22 vs 3 seconds. I don't think ICC color space objects are the problem in the documents I refer to. Is there a way I can provide the information you need to check this bug? Sorry, but I would rather avoid submitting those files. Thanks for your help, Pablo
I had a look at this recently, because I am heavily affected by this bug: I search a lot in big PDF documents (1000+ pages), and a search takes 2-3 minutes. As you probably know, the reason is that poppler needs to render each page (at least the text portion) in order to extract its text. After the search, the rendered page is discarded in order to conserve memory (I guess).

I wrote a little proof-of-concept enhancement, which reduces the search duration to less than 5 seconds in a 1000-page document for consecutive (!) searches. Please mind that this is really just a proof of concept! I used pretty dirty programming techniques, because (a) I have hardly any time and (b) I am not a good C programmer. All I want is to propose an idea.

Here is how it works: I introduced a static variable named text_cache in poppler_page_find_text() in glib/poppler-page.cc. Its contents are never deleted, since it is static. (Again, there are probably more elegant ways to achieve this, but this is only a demo.) The program flow is as follows:

1. evince calls poppler_page_find_text() to search for text on a page given as an argument.
2. poppler_page_find_text() now checks if the page to be searched is already in the cache (i.e., in text_cache).
2.a. The first time a given page is searched, this will not be the case, because the page has never been rendered before. Therefore, the program flow is as usual:
2.a.1. The page is rendered to text_dev.
2.a.2. text_dev->findText() is called to search for the text.
2.a.3. I made a little addition at this point: while the rendered page is usually discarded after this operation, I first write the plain text to text_cache by calling text_dev->cacheText() - a function I added. Only after this is the rendered page discarded. Note that text_cache only contains the plain text, which is a lot smaller in memory than the whole rendition of the page (a few hundred bytes vs. several hundred kB per page).
2.b. The next time the user performs a search on this very page, the plain text of the page will be found in text_cache. poppler_page_find_text() therefore calls a function I added named poppler_page_scan_text_cache(), which searches text_cache for the given keyword. poppler_page_scan_text_cache() only tells IF the keyword is contained in the page; it cannot tell WHERE it is located. However, this result is returned very fast, because the page does not need to be rendered.
2.b.a. If the text was not found in the cached text, poppler_page_find_text() aborts immediately - there is no need to render the page, because it can be assumed that the text is not on the page.
2.b.b. If the text was found in the cached text, the page is rendered as usual and text_dev->findText() is called to determine the exact locations of the text on the page.

The big performance improvement comes from 2.b.a: before rendering, poppler checks very cheaply whether the text is contained in the page at all, and pages that do not contain it are skipped entirely. Of course, the first time the document is opened, the search will be as slow as before, because the text cache is empty, but every consecutive search is way faster, because only pages containing the search string are rendered. I believe acroread even goes as far as saving its text cache to disk so that it can be restored later without re-rendering the whole document. I did not implement this, though.

To sum it up:
Pros:
- Dramatically improved search speed for consecutive searches (obviously).
Cons:
- Minimal memory overhead to cache the text of the document (negligible, IMO).
- The first search is still slow (could be improved by saving text_cache to disk and restoring it the next time the document is opened).
- If the search string is found on all pages, the search is still slow, because the text_cache produces only hits and all pages are rendered. This is a different story, though: ideally, evince should not render any page but the currently displayed one. It would be sufficient if poppler returned only the page numbers that contain the text in question, not the exact location of every occurrence; only when the user goes to a page with a hit should that page be rendered. But then again, why would I search a document for a string that is found on every page anyway?

I will attach my adapted source. It is for poppler 0.12.4 - old, I know, but that is what is current on my distro (Ubuntu 10.04). The patch can easily be adapted to the most recent version of poppler, though.

Files:
- poppler/TextOutputDev.h: added "struct TextCachePage" and "TextOutputDev::cacheText()"
- poppler/TextOutputDev.cc: added "TextOutputDev::cacheText()"
- glib/poppler-page.cc: added "poppler_page_scan_text_cache()" and modified "poppler_page_find_text()"

P.S. I can totally understand if the poppler developers say this is crap. I just want to present an idea which others might benefit from, too. If you don't like it, feel free to toss it. I am using it happily. Also, sorry for the lengthy post.
Created attachment 45937 [details] [review] glib/poppler-page.cc added "poppler_page_scan_text_cache()" and modified "poppler_page_find_text()"
Created attachment 45938 [details] [review] poppler/TextOutputDev.h added "struct TextCachePage" and "TextOutputDev::cacheText()"
Created attachment 45939 [details] [review] poppler/TextOutputDev.cc added "TextOutputDev::cacheText()"
Any news on this issue?
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/462.