Bug 77087

Summary: High CPU usage
Product: poppler Reporter: Oriol Mula-Valls <oriol.mula-valls>
Component: general Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED QA Contact:
Severity: major    
Priority: medium CC: gpoo+bfdo, jason, mkasik, oriol.mula-valls
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments: Example file
profile of pdftoppm -png i04d_20140404_1836.pdf  /tmp/x

Description Oriol Mula-Valls 2014-04-05 16:50:51 UTC
Created attachment 96955 [details]
Example file

To monitor our experiments, we usually create PDF files with Python's matplotlib. These files open very quickly on Debian Squeeze with evince 2.30.3-2+squeeze1 and poppler 0.12.4-1.2.

We recently started upgrading our machines from Squeeze to Wheezy, but we had to stop because the PDF files can no longer be opened: evince consumes a lot of CPU and memory. We also tried okular, but the problem persists, which makes us think it might be a poppler issue. The evince version on Debian Wheezy is 3.4.0-3.1 and the poppler version is 0.18.4-6.

I have just tested with evince 3.12.0-1 and poppler 0.24.5-2, and it is also really slow.

I attach an example file.
Comment 1 William Bader 2014-04-05 22:02:13 UTC
Created attachment 96961 [details]
profile of pdftoppm -png i04d_20140404_1836.pdf  /tmp/x

For an additional data point, I have Fedora 20 with the MATE desktop, and "atril" (the MATE version of evince) is also very slow on this file, while gs 9.14 opens it almost instantly.  Fedora 20 has poppler 0.24.3.

pdftoppm is also slow, which might make this easier to debug because pdftoppm is easier to build than evince.

pdftops processes the file in under 1 second.

It might be related to rasterizing a lot of text in small fonts because evince and atril have to do that while pdftops can just pass the fonts through.

pdftoppm and pdftocairo both fail with -jpeg, but pdftoppm -png works, although it is slow.

$ time /usr/bin/pdftoppm -png i04d_20140404_1836.pdf  /tmp/x.ps
real    1m30.491s
user    1m30.235s
sys     0m0.202s

Here are the first few lines of a profile of the pdftoppm command above.  I attached the full profile, which I made from a git clone of poppler from Feb 19, 2014.

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 62.62     16.99    16.99 749102576     0.00     0.00  Splash::pipeRunAARGB8(SplashPipe*)
 22.71     23.15     6.16    31880     0.00     0.00  Splash::fillWithPattern(SplashPath*, bool, SplashPattern*, double)
  2.65     23.87     0.72  1708951     0.00     0.00  SplashXPathScanner::renderAALine(SplashBitmap*, int*, int*, int, bool)

I suspect that the pattern fill is part of the font rasterization.
Adding "-aaVector no -aa no" reduces the time a little.
$ time /usr/bin/pdftoppm -aaVector no -aa no -png i04d_20140404_1836.pdf  /tmp/x
real    0m57.463s
user    0m57.164s
sys     0m0.247s
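
For reference, the share of time taken by the top entries in the profile excerpt above can be tallied with a short script. This is only a sketch: it assumes the gprof-style column layout shown (percent time, cumulative seconds, self seconds, calls, two per-call columns, then the function name), and the helper name parse_flat_profile is made up for illustration.

```python
# The three rows quoted from the profile above, copied verbatim.
PROFILE = """\
 62.62     16.99    16.99 749102576     0.00     0.00  Splash::pipeRunAARGB8(SplashPipe*)
 22.71     23.15     6.16    31880     0.00     0.00  Splash::fillWithPattern(SplashPath*, bool, SplashPattern*, double)
  2.65     23.87     0.72  1708951     0.00     0.00  SplashXPathScanner::renderAALine(SplashBitmap*, int*, int*, int, bool)
"""


def parse_flat_profile(text):
    """Parse flat-profile rows into dicts (hypothetical helper, assumed layout)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split off the six numeric columns; the name itself may contain spaces.
        pct, cum, self_s, calls, _self_call, _total_call, name = line.split(None, 6)
        rows.append({
            "percent": float(pct),
            "cumulative_s": float(cum),
            "self_s": float(self_s),
            "calls": int(calls),
            "name": name.strip(),
        })
    return rows


rows = parse_flat_profile(PROFILE)
top_two = sum(r["percent"] for r in rows[:2])
print(f"top two functions: {top_two:.2f}% of sampled time")
for r in rows:
    print(f"{r['percent']:6.2f}%  {r['self_s']:6.2f}s  {r['calls']:>10}  {r['name']}")
```

On the rows quoted above, the two Splash functions together account for 62.62 + 22.71 = 85.33 percent of the sampled time.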
Comment 2 Albert Astals Cid 2014-04-05 22:46:02 UTC
There's something ultra weird here: qt4/tests/test-poppler-qt4 takes "only" 12 seconds while pdftoppm takes 50 seconds.

As far as I can remember, they should be using the exact same codepaths.
Comment 3 Albert Astals Cid 2014-04-05 22:48:41 UTC
And pdftotext also takes more than a minute  ^_^
Comment 4 Jason Crain 2014-04-13 16:25:59 UTC
Tracked the regression to this commit:

9c5612f6e013a8698eff6531ec388a7e6c1fb89a is the first bad commit
commit 9c5612f6e013a8698eff6531ec388a7e6c1fb89a
Author: Marek Kasik <mkasik@redhat.com>
Date:   Fri Feb 12 14:31:01 2010 +0100

    Distinguish between columns and tables when selecting text
    
    This commit add ability to detect tables in text by checking borders
    of 4 neighbouring text blocks for arrangement (to the left, to the right,
    center, ...). Detected border of whole table is then stored in ExMin, ExMax,
    EyMin and EyMax of each block together with id of detected table. Sorting
    of blocks is then performed on the these borders to be able to distinguish
    tables from columns.
    Pasting of selected text was modified so that tables are pasted correctly
    (even with multi line cells).

:040000 040000 e58e22d3707422029f1ca753868164eb22cf8bb4 46b3ae4c8d7cee01fc6e69a4038140ecab7ce361 M      poppler


The slowdown is in a recursive function, TextBlock::visitDepthFirst.  The algorithm there runs in O(n^3) time in the worst case, and the tables patch makes the worst case more likely.  I think the pdftoppm/pdftocairo issue mentioned above is a separate problem, possibly caused by the large size of the document (819 inches x 30 inches).
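
To illustrate how a recursive depth-first ordering can reach cubic cost, here is a toy model. It is not poppler's TextBlock::visitDepthFirst code: the chain layout, the "before" rule (j < i), and the name count_checks are all assumptions chosen only to show three nested scans over the same block list.

```python
def count_checks(n):
    """Count inner-loop steps for a toy chain of n blocks (hypothetical model).

    Block j is "before" block i when j < i. Before recursing into j, the
    visitor scans every block to see whether an unvisited block lies
    strictly between j and i; that scan is the third nested loop.
    """
    visited = [False] * n
    checks = 0

    def visit(i):
        nonlocal checks
        visited[i] = True
        for j in range(n):                  # scan all candidate predecessors
            if visited[j] or j >= i:
                continue
            blocked = False
            for k in range(n):              # full "in between" scan per candidate
                checks += 1
                if not visited[k] and j < k < i:
                    blocked = True          # no early exit: models the worst case
            if not blocked:
                visit(j)                    # recurse depth-first

    # Start from the last block; the recursion reaches all the others.
    for i in reversed(range(n)):
        if not visited[i]:
            visit(i)
    return checks


for n in (10, 20, 40):
    print(n, count_checks(n))
```

In this model the count is n * n * (n - 1) / 2, so doubling the number of blocks multiplies the work by roughly eight. A page with many small text blocks would fall on the expensive side of a curve like this.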
Comment 5 Albert Astals Cid 2014-04-13 17:23:11 UTC
Marek?
Comment 6 Marek Kasik 2014-04-14 16:04:15 UTC
(In reply to comment #5)
> Marek?

Hi,

it seems that the last 2 patches in #3188 improve the speed of "pdftotext i04d_20140404_1836.pdf" a lot. Together they took it from 131.5 seconds to 2.2 seconds on my computer. Brian's patch by itself improved it from 131.5 seconds to 3.2 seconds.
I've updated my patch; Brian's still applies cleanly.

Regards

Marek
Comment 7 Jason Crain 2014-07-06 08:59:06 UTC
*** Bug 80573 has been marked as a duplicate of this bug. ***
Comment 8 GitLab Migration User 2018-08-21 10:56:23 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/438.
