Created attachment 96955: Example file. In order to monitor our experiments we usually create PDF files using Python's matplotlib. The files open very quickly on Debian Squeeze with evince 2.30.3-2+squeeze1 and poppler 0.12.4-1.2. We recently started upgrading our machines from Squeeze to Wheezy but had to stop because the PDF files can no longer be opened: evince consumes a lot of CPU and memory. We also tried okular and the problem is still there, which makes us think that it might be a poppler issue. The evince version on Debian Wheezy is 3.4.0-3.1 and the poppler version is 0.18.4-6. I have just tested with evince 3.12.0-1 and poppler 0.24.5-2 and it is also very slow. I attach an example file.
Created attachment 96961: profile of pdftoppm -png i04d_20140404_1836.pdf /tmp/x

For an additional data point, I have Fedora 20 with the MATE desktop, and "atril" (the MATE version of evince) is also very slow on this file, while gs 9.14 opens it almost instantly. Fedora 20 has poppler 0.24.3.

pdftoppm is also slow, which might make this easier to debug because pdftoppm is easier to build than evince. pdftops processes the file in under 1 second. The slowness might be related to rasterizing a lot of text in small fonts, because evince and atril have to do that while pdftops can just pass the fonts through. pdftoppm and pdftocairo both fail with -jpeg, but pdftoppm -png works, although it is slow.

    $ time /usr/bin/pdftoppm -png i04d_20140404_1836.pdf /tmp/x.ps
    real    1m30.491s
    user    1m30.235s
    sys     0m0.202s

Here are the first few lines of a profile of the pdftoppm command above. I attached the full profile, which I made from a git clone of poppler from Feb 19, 2014.

      %   cumulative   self                 self     total
     time   seconds   seconds      calls   s/call   s/call  name
     62.62     16.99    16.99  749102576     0.00     0.00  Splash::pipeRunAARGB8(SplashPipe*)
     22.71     23.15     6.16      31880     0.00     0.00  Splash::fillWithPattern(SplashPath*, bool, SplashPattern*, double)
      2.65     23.87     0.72    1708951     0.00     0.00  SplashXPathScanner::renderAALine(SplashBitmap*, int*, int*, int, bool)

I suspect that the pattern fill is part of the font rasterization. Adding "-aaVector no -aa no" reduces the time a little.

    $ time /usr/bin/pdftoppm -aaVector no -aa no -png i04d_20140404_1836.pdf /tmp/x
    real    0m57.463s
    user    0m57.164s
    sys     0m0.247s
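The two timings above can be reproduced side by side with a small harness. This is only a sketch: the helper names (`time_command`, `compare`, `pdftoppm_variants`) are made up for illustration, and it assumes `pdftoppm` is on the PATH and accepts the `-png`, `-aa` and `-aaVector` options used in the commands above.

```python
import subprocess
import time


def time_command(argv, timeout=None):
    """Run argv once; return (wall-clock seconds, exit status)."""
    start = time.perf_counter()
    proc = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return time.perf_counter() - start, proc.returncode


def compare(variants, timeout=None):
    """Time every named command line; return {name: wall-clock seconds}."""
    results = {}
    for name, argv in variants.items():
        seconds, status = time_command(argv, timeout=timeout)
        if status != 0:
            raise RuntimeError(f"{name} exited with status {status}")
        results[name] = seconds
    return results


def pdftoppm_variants(pdf, out_prefix):
    """Hypothetical helper: the two pdftoppm command lines from this comment."""
    return {
        "default": ["pdftoppm", "-png", pdf, out_prefix],
        "no-aa": ["pdftoppm", "-aaVector", "no", "-aa", "no",
                  "-png", pdf, out_prefix],
    }
```

Calling `compare(pdftoppm_variants("i04d_20140404_1836.pdf", "/tmp/x"))` would return the wall-clock time of each variant. Only wall-clock time is measured here, not the user/sys split that `time` reports.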
There's something ultra weird here: qt4/tests/test-poppler-qt4 takes "only" 12 seconds while pdftoppm takes 50 seconds. As far as I can remember they should be using the exact same code paths.
And pdftotext also takes more than a minute ^_^
Tracked the regression to this commit:

    9c5612f6e013a8698eff6531ec388a7e6c1fb89a is the first bad commit
    commit 9c5612f6e013a8698eff6531ec388a7e6c1fb89a
    Author: Marek Kasik <mkasik@redhat.com>
    Date:   Fri Feb 12 14:31:01 2010 +0100

        Distinguish between columns and tables when selecting text

        This commit add ability to detect tables in text by checking borders
        of 4 neighbouring text blocks for arrangement (to the left, to the
        right, center, ...). Detected border of whole table is then stored
        in ExMin, ExMax, EyMin and EyMax of each block together with id of
        detected table. Sorting of blocks is then performed on the these
        borders to be able to distinguish tables from columns. Pasting of
        selected text was modified so that tables are pasted correctly
        (even with multi line cells).

    :040000 040000 e58e22d3707422029f1ca753868164eb22cf8bb4 46b3ae4c8d7cee01fc6e69a4038140ecab7ce361 M poppler

The slowdown is in a recursive function, TextBlock::visitDepthFirst. The algorithm in there runs in O(n^3) time in the worst case, and the tables patch makes the worst case more likely.

I think the pdftoppm/pdftocairo issue mentioned above is a separate problem, possibly caused by the large size of the document (819 inches x 30 inches).
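To put the page size in perspective, here is a rough estimate of the bitmap a rasterizer has to fill for an 819 x 30 inch page. It is a back-of-the-envelope sketch, not a description of poppler's internals: `raster_size` is a made-up helper, 150 DPI is pdftoppm's documented default resolution, and 3 bytes per pixel (8-bit RGB) is an assumption.

```python
import math


def raster_size(width_in, height_in, dpi=150, bytes_per_pixel=3):
    """Return (width_px, height_px, pixels, bytes) for a page rendered at dpi.

    bytes_per_pixel=3 assumes an 8-bit RGB bitmap; the real renderer may
    round sizes differently or use another pixel format.
    """
    width_px = math.ceil(width_in * dpi)
    height_px = math.ceil(height_in * dpi)
    pixels = width_px * height_px
    return width_px, height_px, pixels, pixels * bytes_per_pixel


width_px, height_px, pixels, size_bytes = raster_size(819, 30)
print(width_px, height_px, pixels, size_bytes)
# prints: 122850 4500 552825000 1658475000
```

Under these assumptions the page is about 553 million pixels, roughly 1.5 GiB of uncompressed RGB data. That is the same order of magnitude as the 749 million calls to Splash::pipeRunAARGB8 in the profile above, which is consistent with the page size being a factor but does not prove it.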
Marek?
(In reply to comment #5)
> Marek?

Hi, it seems that the last 2 patches in #3188 improve the speed of "pdftotext i04d_20140404_1836.pdf" a lot. They took it from 131.5 seconds to 2.2 seconds on my computer. Brian's patch by itself improved it from 131.5 seconds to 3.2 seconds. I've updated my patch; Brian's still applies cleanly.

Regards
Marek
*** Bug 80573 has been marked as a duplicate of this bug. ***
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/438.