Created attachment 96955
To monitor our experiments we usually create PDF files using Python matplotlib. The files open very quickly on Debian Squeeze with evince 2.30.3-2+squeeze1 and poppler 0.12.4-1.2.
We recently started upgrading our machines from Squeeze to Wheezy, but we had to stop because the PDF files can no longer be opened: evince consumes a lot of CPU and memory. We also tried okular, but the problem is still there, which makes us think that it might be a poppler issue. The evince version on Debian Wheezy is 3.4.0-3.1 and the poppler version is 0.18.4-6.
I have just tested with evince 3.12.0-1 and poppler 0.24.5-2 and it is also very slow.
I attach an example file.
Created attachment 96961
profile of pdftoppm -png i04d_20140404_1836.pdf /tmp/x
For an additional data point, I have Fedora 20 with the mate desktop, and "atril" (the mate version of evince) is also very slow on this file while gs 9.14 opens it almost instantly. Fedora 20 has poppler 0.24.3.
pdftoppm is also slow, which might make this easier to debug because pdftoppm is easier to build than evince.
pdftops processes the file in under 1 second.
It might be related to rasterizing a lot of text in small fonts because evince and atril have to do that while pdftops can just pass the fonts through.
pdftoppm and pdftocairo both fail with -jpeg, but pdftoppm -png works, although it is slow.
$ time /usr/bin/pdftoppm -png i04d_20140404_1836.pdf /tmp/x.ps
Here are the first few lines of a profile of the pdftoppm command above. I attached the full profile, which I made from a git clone of poppler from Feb 19, 2014.
  %   cumulative     self                 self    total
 time    seconds  seconds       calls   s/call   s/call  name
62.62      16.99    16.99   749102576     0.00     0.00  Splash::pipeRunAARGB8(SplashPipe*)
22.71      23.15     6.16       31880     0.00     0.00  Splash::fillWithPattern(SplashPath*, bool, SplashPattern*, double)
 2.65      23.87     0.72     1708951     0.00     0.00  SplashXPathScanner::renderAALine(SplashBitmap*, int*, int*, int, bool)
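The s/call columns in the flat profile above all print as 0.00 because gprof rounds to two decimals. A small sketch that recovers the per-call self time from the "self seconds" and "calls" columns; the numbers are copied from the three rows above, and the helper name `self_time_per_call` is only illustrative:

```python
# (self seconds, calls) copied from the three profile rows quoted above.
profile = {
    "Splash::pipeRunAARGB8": (16.99, 749_102_576),
    "Splash::fillWithPattern": (6.16, 31_880),
    "SplashXPathScanner::renderAALine": (0.72, 1_708_951),
}


def self_time_per_call(self_seconds, calls):
    """Average self time of one call, in seconds."""
    return self_seconds / calls


for name, (seconds, calls) in profile.items():
    nanoseconds = self_time_per_call(seconds, calls) * 1e9
    print(f"{name}: {nanoseconds:,.0f} ns per call")
```

Read this way, pipeRunAARGB8 is cheap per call (roughly 23 ns) and dominates only because it is called about 750 million times, while fillWithPattern spends about 0.19 ms of self time per call.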
I suspect that the pattern fill is part of the font rasterization.
Adding "-aaVector no -aa no" reduces the time a little.
$ time /usr/bin/pdftoppm -aaVector no -aa no -png i04d_20140404_1836.pdf /tmp/x
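For anyone who wants to repeat the comparison, here is a minimal sketch of a timing harness for the two pdftoppm invocations quoted above. The helper names `time_command` and `pdftoppm_variants` are made up for this example, and it assumes pdftoppm is on the PATH and the attached PDF is in the current directory; it only uses the flags already shown in this report.

```python
import shutil
import subprocess
import sys
import time
from pathlib import Path


def time_command(argv):
    """Run argv to completion and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def pdftoppm_variants(pdf, out_prefix):
    """Build the two command lines from this report: default and no antialiasing."""
    return {
        "default": ["pdftoppm", "-png", pdf, out_prefix],
        "no antialiasing": ["pdftoppm", "-aaVector", "no", "-aa", "no",
                            "-png", pdf, out_prefix],
    }


if __name__ == "__main__":
    pdf = "i04d_20140404_1836.pdf"  # the attachment from this report
    if shutil.which("pdftoppm") and Path(pdf).exists():
        for label, argv in pdftoppm_variants(pdf, "/tmp/x").items():
            print(f"{label}: {time_command(argv):.1f} s")
    else:
        print("pdftoppm or the sample PDF is missing; nothing timed",
              file=sys.stderr)
```

Wall-clock time is used rather than CPU time, so repeated runs on an otherwise idle machine give the most comparable numbers.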
There's something very weird here: qt4/tests/test-poppler-qt4 takes "only" 12 seconds while pdftoppm takes 50 seconds.
As far as I can remember, they should be using the exact same code paths.
And pdftotext also takes more than a minute ^_^
Tracked the regression to this commit:
9c5612f6e013a8698eff6531ec388a7e6c1fb89a is the first bad commit
Author: Marek Kasik <firstname.lastname@example.org>
Date: Fri Feb 12 14:31:01 2010 +0100
Distinguish between columns and tables when selecting text
This commit add ability to detect tables in text by checking borders
of 4 neighbouring text blocks for arrangement (to the left, to the right,
center, ...). Detected border of whole table is then stored in ExMin, ExMax,
EyMin and EyMax of each block together with id of detected table. Sorting
of blocks is then performed on the these borders to be able to distinguish
tables from columns.
Pasting of selected text was modified so that tables are pasted correctly
(even with multi line cells).
:040000 040000 e58e22d3707422029f1ca753868164eb22cf8bb4 46b3ae4c8d7cee01fc6e69a4038140ecab7ce361 M poppler
The slowdown is in a recursive function, TextBlock::visitDepthFirst. The algorithm there runs in O(n^3) time in the worst case, and the tables patch makes the worst case more likely. I think the pdftoppm/pdftocairo issue mentioned above is a separate problem, possibly caused by the large size of the document (819 inches x 30 inches).
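To illustrate how a depth-first ordering can end up cubic, here is a toy model. It is not poppler's code: the ordering rule `must_precede`, the inner rescan, and the function `count_checks` are all assumptions made up for this sketch. Each visit scans every block for unvisited candidates, and each candidate costs one more pass over the list.

```python
def count_checks(n):
    """Count inner-loop checks made by a toy depth-first ordering of n blocks."""
    visited = [False] * n
    order = []
    checks = 0

    def must_precede(j, i):
        # Hypothetical ordering rule: lower-numbered blocks come first.
        return j < i

    def visit(i):
        nonlocal checks
        visited[i] = True
        for j in range(n):          # scan every block for a predecessor
            if visited[j]:
                continue
            for _ in range(n):      # stand-in for a per-candidate rescan
                checks += 1
            if must_precede(j, i):
                visit(j)
        order.append(i)

    # Worst case for this toy: start from the block that all others precede.
    for i in reversed(range(n)):
        if not visited[i]:
            visit(i)
    assert order == list(range(n))
    return checks


for n in (25, 50, 100):
    print(n, count_checks(n))
```

In this toy the count works out to n^2 (n - 1) / 2, so doubling the number of blocks multiplies the work by roughly eight. A page with many small text blocks, like the attached plot, is where that kind of growth becomes noticeable.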
(In reply to comment #5)
It seems that the last two patches in #3188 improve the speed of "pdftotext i04d_20140404_1836.pdf" a lot. Together they took it from 131.5 seconds to 2.2 seconds on my computer. Brian's patch on its own took it from 131.5 seconds to 3.2 seconds.
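Expressed as speedup factors, using only the timings reported in this comment (the helper `speedup` is just for illustration):

```python
def speedup(before_seconds, after_seconds):
    """How many times faster the run became."""
    return before_seconds / after_seconds


baseline = 131.5  # seconds for pdftotext before the patches
print(f"both patches:     {speedup(baseline, 2.2):.1f}x")
print(f"Brian's patch only: {speedup(baseline, 3.2):.1f}x")
```

That is roughly a 60x improvement with both patches and roughly 41x with Brian's patch alone.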
I've updated my patch; Brian's still applies cleanly.
*** Bug 80573 has been marked as a duplicate of this bug. ***
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/438.