Hi there, when paging through http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:C:2010:083:0201:0328:ES:PDF (pressing either PgUp or PgDown), rendering the page once you stop takes almost a second. I think this is too slow compared with the same operation on http://partners.adobe.com/public/developer/en/acrobat/sdk/pdf/javascript/AcroJS.pdf. Searching is also too slow. Searching for "Übersetzungseinheit" in the first PDF document, so that the whole document has to be searched, takes 38 secs with both poppler-0.12.4 and poppler-0.14.3. Searching for the same term in the second PDF document (http://partners.adobe.com/public/developer/en/acrobat/sdk/pdf/javascript/AcroJS.pdf) takes about 13 secs, and that document is bigger. I don't know whether the searching speed is related to bug 28053. Just in case it might help, Pablo
I am hitting the same issue. Searching is extremely slow. With pdftotext 0.12.4, searching for a string across 2249 PDFs took 2m14.163s. With pdftotext 0.35, searching the exact same set of PDFs takes more than 2 hours (I will update with the exact time when it is over).
I tried assigning the bug to the main poppler contributor. Not sure this is the right thing to do.
It's not
Sorry Albert. I tried the latest version of pdftotext (0.39), but we are still hitting the issue. I can provide a sample PDF if necessary; all our PDFs are extremely slow (between 2 and 3 minutes to get "textified").
We have lots of bugs, so what would really help is if you could provide a patch.
OK, I will try to look into it. Not really my area, but I can take some time.

Just for the record.

With version 0.12.4:

time find . -name '*10.12.2015*.pdf' -exec sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "Archivage documentaire"' \;
real 2m14.163s
user 0m55.425s
sys 0m11.211s

With version 0.39.0:

time find . -name '*10.12.2015*.pdf' -exec sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "Archivage documentaire"' \;
real 5108m21.180s
user 4237m23.460s
sys 54m20.175s

Test performed on the exact same set of PDF files.
srousseau: It's hard to tell because the link in the original report no longer works, but it sounds like you are talking about a different issue. It was about searching speed and you are asking about pdftotext runtime. Your issue sounds more like one of the bugs about depthFirstSearch being slow. You might try Brian's last patch in bug 3188 to see if it fixes it. (if it still applies?)
We found the root cause of our problem. We had issues with some corrupted PDF documents created by a third-party product. The pdftotext slowness only occurred with those corrupted PDFs; it works well with the others. The lesson learnt is that old pdftotext versions do not seem to care about this kind of corruption.
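For anyone hitting a similar slowdown on a large batch, a minimal sketch of how the problematic files might be isolated: run the converter on each PDF under a time budget and report the ones that exceed it or fail. The function name `flag_slow_pdfs` and the `LIMIT` and `CONVERTER` variables are hypothetical, introduced only for illustration, and the sketch assumes GNU coreutils `timeout` is available.

```shell
# Sketch only: flag PDFs whose text extraction exceeds a time budget.
# flag_slow_pdfs, LIMIT and CONVERTER are illustrative names, not part of poppler.
flag_slow_pdfs() {
    dir="$1"
    limit="${LIMIT:-30}"                 # seconds allowed per file (assumed default)
    converter="${CONVERTER:-pdftotext}"  # command invoked as: converter FILE -

    find "$dir" -name '*.pdf' | while IFS= read -r f; do
        # Discard the extracted text; only the exit status matters here.
        timeout "$limit" "$converter" "$f" - </dev/null >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            # GNU timeout exits with 124 when the time limit was reached
            printf 'SLOW %s\n' "$f"
        elif [ "$status" -ne 0 ]; then
            printf 'ERROR(%s) %s\n' "$status" "$f"
        fi
    done
}
```

Called as `LIMIT=10 flag_slow_pdfs .`, this prints one line per file that timed out or made the converter exit with a non-zero status. A damaged file that still converts quickly and exits with status 0 would not be reported.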
I'm going to have to close this bug since the .pdf documents linked don't exist anymore. Sorry about that. Next time attach the documents or make sure the links you provide won't die.
Here you have a similar document http://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:C:2016:202:FULL. poppler-0.34 needed 72s to search the whole document. Is this PDF document OK?
I don't know what you mean by "poppler-0.34 needed 72s to search the whole document." since poppler doesn't really have a search tool you can be using, but OK, I tried:

okular -> searches "caramelo" in around 46s
evince -> searches "caramelo" in around 46s
pdftotext -> converts to text in around 20s
Adobe Reader -> searches "caramelo" in around 12s

So there's some improvement that could be made.

If I were you I would attach the document or save it in a place where it won't get lost, otherwise you risk this bug being closed again.

And you may want to edit the subject to talk about search speed and not rendering speed, if that's what you're speaking about.
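The pdftotext figure can be measured on its own, which may help tell how much of the time goes into text extraction rather than into the viewer. A rough sketch, where `elapsed_seconds` is a hypothetical helper and `document.pdf` stands in for a local copy of the file; it only has whole-second resolution and assumes `date +%s` is supported:

```shell
# Sketch only: print how many whole seconds a command takes.
# elapsed_seconds is an illustrative name, not an existing tool.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1      # run the command, discarding its output
    end=$(date +%s)
    echo $((end - start))
}
```

For example, `elapsed_seconds pdftotext document.pdf -` prints the number of seconds the conversion to text took.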
(In reply to Albert Astals Cid from comment #11)
> I don't know what you mean with "poppler-0.34 needed 72s to search the whole
> document." since poppler doesn't really have a search tool you can be using,
> but ok, i tried
>
> okular -> searches "caramelo" in around 46s
> evince -> searches "caramelo" in around 46s
> pdftotext -> converts to text in around 20s
> Adobe Reader -> searches "caramelo" in around 12s

mupdf searches "caramelo" in less than 2s on my >10yo laptop.

> If i were you i would attach the document or save it in a place it won't get
> lost, otherwise you risk this bug being closed again.

I’d rather avoid attaching the file and providing a new link if required.

> And you may want to edit the subject to talk about search speed and not
> rendering speed if that's what you're speaking about

I’m speaking of both: rendering and searching speed. (There may even be a common root for their slowness.)
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/1.