Bug 30688

Summary: file rendered too slow (and searching extremely slow)
Product: poppler
Reporter: Pablo Rodríguez <freedesktop>
Component: general
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED
QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:

Description Pablo Rodríguez 2010-10-07 13:06:26 UTC
Hi there,

when paging through http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:C:2010:083:0201:0328:ES:PDF (pressing either PgUp or PgDown), rendering the page where you stop takes almost a second.

I think this is too slow compared with the same operation on http://partners.adobe.com/public/developer/en/acrobat/sdk/pdf/javascript/AcroJS.pdf.

Searching is also too slow. Searching for "Übersetzungseinheit" in the first PDF document, so that the whole document has to be searched, takes 38 secs with both poppler-0.12.4 and poppler-0.14.3.

Searching for the same term in the second PDF document (http://partners.adobe.com/public/developer/en/acrobat/sdk/pdf/javascript/AcroJS.pdf) takes about 13 secs, even though that document is bigger.

I don't know whether the searching speed is related to bug 28053.

Just in case it might help,


Pablo
Comment 1 srousseau 2016-01-11 11:02:43 UTC
I am hitting the same issue.
Searching is extremely slow. 

With pdftotext 0.12.4, searching for a string in a large set of PDFs (2249 files) took 2m14.163s.
With pdftotext 0.35, searching the exact same set of PDFs takes more than 2 hours (I will update with the exact time when it is over).
Comment 2 srousseau 2016-01-11 11:10:54 UTC
I tried assigning the bug to the main poppler contributor.
Not sure this is the right thing to do ..
Comment 3 Albert Astals Cid 2016-01-11 11:12:18 UTC
It's not
Comment 4 srousseau 2016-01-11 16:05:52 UTC
Sorry Albert.

I tried the latest version of pdftotext (0.39), but we are still hitting the issue.
I can provide a sample PDF if necessary. All our PDFs are extremely slow (between 2 and 3 minutes to get "textified").
Comment 5 Albert Astals Cid 2016-01-11 21:23:07 UTC
We have lots of bugs, so what would really help is if you could provide a patch.
Comment 6 srousseau 2016-01-15 12:29:33 UTC
OK I will try to look into it.
Not really my area, but I can take some time.

Just for the record.

With version 0.12.4 :
time find . -name '*10.12.2015*.pdf' -exec sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "Archivage documentaire"' \;

real    2m14.163s
user    0m55.425s
sys     0m11.211s


With version 0.39.0 :
time find . -name '*10.12.2015*.pdf' -exec sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "Archivage documentaire"' \;

real    5108m21.180s
user    4237m23.460s
sys     54m20.175s


Both tests were run on the exact same set of PDF files.
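
To see which files account for the difference, extraction can be timed per file instead of for the whole run. The following is only a sketch and was not part of the runs above: time_pdfs and the EXTRACT_CMD override are hypothetical names, date +%s gives whole-second resolution only, and the file name is passed as an argument rather than spliced into the sh -c string.

```shell
# Print "<seconds> <file>" for each PDF given as an argument, slowest first.
# EXTRACT_CMD is a hypothetical override so the sketch can be exercised
# without poppler; by default pdftotext is run and its text is discarded.
time_pdfs() {
    for file in "$@"; do
        start=$(date +%s)
        # The file name is passed as an argument, not embedded in a command string.
        "${EXTRACT_CMD:-pdftotext}" "$file" - >/dev/null 2>&1
        end=$(date +%s)
        printf '%d %s\n' "$((end - start))" "$file"
    done | sort -rn
}
```

For example, time_pdfs ./*10.12.2015*.pdf | head would list the ten slowest files in the current directory. Unlike the find command above, the glob does not descend into subdirectories.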
Comment 7 Jason Crain 2016-01-15 13:24:00 UTC
srousseau: It's hard to tell because the link in the original report no longer works, but it sounds like you are talking about a different issue. The original report was about searching speed, and you are asking about pdftotext runtime. Your issue sounds more like one of the bugs about depthFirstSearch being slow. You might try Brian's last patch in bug 3188 to see if it fixes it (if it still applies).
Comment 8 srousseau 2016-01-21 12:11:30 UTC
We found the root cause of our problem.
We had issues with some corrupted PDF documents created by a third-party product.
The pdftotext slowness only occurred with those corrupted PDFs. It works well with the others.

The lesson learnt is that old pdftotext versions do not seem to care about this kind of corruption.
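
For anyone hitting the same thing, one way to spot such files in a batch is to put a time limit on each extraction. This is a rough sketch under assumptions, not what we actually ran: flag_slow_pdfs, EXTRACT_CMD and LIMIT are hypothetical names, and it relies on GNU coreutils timeout, which exits with status 124 when the limit is hit.

```shell
# Run text extraction with a time limit and report files that exceed it
# or that make the extractor fail.
# Assumptions: GNU coreutils `timeout` is installed; EXTRACT_CMD and LIMIT
# are hypothetical knobs that default to pdftotext and 60 seconds.
flag_slow_pdfs() {
    for file in "$@"; do
        timeout "${LIMIT:-60}" "${EXTRACT_CMD:-pdftotext}" "$file" - >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            # timeout killed the command because it ran past the limit.
            printf 'TIMEOUT %s\n' "$file"
        elif [ "$status" -ne 0 ]; then
            # The extractor itself reported an error.
            printf 'ERROR(%d) %s\n' "$status" "$file"
        fi
    done
}
```

A timeout does not prove that a file is corrupted, and a damaged file may still be converted quickly and without an error status, so the output is only a list of candidates to inspect.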
Comment 9 Albert Astals Cid 2016-12-02 11:02:21 UTC
I'm going to have to close this bug since the linked .pdf documents don't exist anymore.

Sorry about that.

Next time attach the documents or make sure the links you provide won't die.
Comment 10 Pablo Rodríguez 2016-12-02 15:44:08 UTC
Here is a similar document: http://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:C:2016:202:FULL.

poppler-0.34 needed 72s to search the whole document.

Is this PDF document OK?
Comment 11 Albert Astals Cid 2016-12-02 22:44:04 UTC
I don't know what you mean with "poppler-0.34 needed 72s to search the whole document." since poppler doesn't really have a search tool you can be using, but ok, i tried

okular -> searches "caramelo" in around 46s
evince -> searches "caramelo" in around 46s
pdftotext -> converts to text in around 20s
Adobe Reader -> searches "caramelo" in around 12s

So there's some improvement that could be made.

If i were you i would attach the document or save it in a place it won't get lost, otherwise you risk this bug being closed again.

And you may want to edit the subject to talk about search speed and not rendering speed if that's what you're speaking about
Comment 12 Pablo Rodríguez 2016-12-03 11:36:04 UTC
(In reply to Albert Astals Cid from comment #11)
> I don't know what you mean with "poppler-0.34 needed 72s to search the whole
> document." since poppler doesn't really have a search tool you can be using,
> but ok, i tried
> 
> okular -> searches "caramelo" in around 46s
> evince -> searches "caramelo" in around 46s
> pdftotext -> converts to text in around 20s
> Adobe Reader -> searches "caramelo" in around 12s
mupdf searches "caramelo" in less than 2s on my laptop, which is more than 10 years old.

> If i were you i would attach the document or save it in a place it won't get
> lost, otherwise you risk this bug being closed again.
I’d rather not attach the file. I can provide a new link if required.

> And you may want to edit the subject to talk about search speed and not
> rendering speed if that's what you're speaking about
I’m speaking of both: rendering and searching speed. (There may even be a common root for their slowness.)
Comment 13 GitLab Migration User 2018-08-20 21:32:45 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/1.