Bug 54746

Summary: High CPU usage on reading specific file
Product: poppler Reporter: Alexander Hunziker <alex.hunziker>
Component: utils Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium CC: L.Bonnaud, mr
Version: unspecified   
Hardware: All   
OS: All   
Whiteboard:
Attachments: the file

Description Alexander Hunziker 2012-09-10 19:04:12 UTC
A specific PDF file (http://ubuntuone.com/4ELfHGFXVtDAtU0lWsLT6G) causes very high CPU load and takes a long time to render using evince/cairo. On my dual core 2 GHz machine it takes 30 seconds to show the first page alone.

This was discovered by tracker choking when indexing this file (see https://bugzilla.gnome.org/show_bug.cgi?id=680897 comments 14 ff).

The built-in PDF reader of the chromium browser has no issues at all showing this file.
Comment 1 Albert Astals Cid 2012-09-10 19:57:14 UTC
Assigning to the cairo backend: splash takes 74 ms, cairo takes 14 s.
Comment 2 Laurent Bonnaud 2015-04-28 18:29:28 UTC
The file cannot be downloaded any longer.  Could someone please upload it again?
Comment 3 Albert Astals Cid 2015-04-28 22:02:35 UTC
Created attachment 115413 [details]
the file
Comment 4 Laurent Bonnaud 2015-04-29 07:00:41 UTC
Thanks!

I confirm that this bug still exists in poppler 0.30.0 in Ubuntu 15.04.

Viewing this file takes a very long time, both in evince and okular.
Even pdftotext takes ages to convert it (into a useless text file)
and indexing it with baloo or tracker also uses too much CPU.
Comment 5 Laurent Bonnaud 2015-04-29 07:20:49 UTC
Other data points:

 - xpdf also uses a lot of CPU to display the file

 - gv, which uses GhostScript, and gs itself can display the file instantly
Comment 6 Thomas Freitag 2015-04-29 15:11:10 UTC
The problem here is neither the splash backend nor the cairo backend. Rendering the PDF is fast with both splash and cairo.
The problem is the TextOutputDev that okular and evince use to make the text searchable. TextOutputDev tries to sort the text into reading order by performing a topological sort, and this is really slow for this document. It is not even necessary: because all fonts are embedded and all use a custom encoding, the content will never be searchable.

But I'm not an expert in TextOutputDev; I just figured out that pdftotext -raw is much faster. So perhaps evince and okular can detect that the text will not be readable and skip the text extraction?
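
To illustrate why ordering by pairwise constraints can get expensive, here is a toy sketch in Python. It is not poppler's code: `Box`, `must_precede`, and `reading_order` are hypothetical names, and the "above, or to the left on the same row" rule is an assumption made up for this example. It only shows that building the constraint graph compares every pair of boxes, whereas raw mode can emit boxes in the order they arrive.

```python
from collections import deque, namedtuple

# Hypothetical text box; y grows downward, as on a page.
Box = namedtuple("Box", "name x0 y0 x1 y1")


def must_precede(a, b):
    """Toy rule (an assumption, not poppler's): a is read before b if it lies
    entirely above b, or shares a row with b and lies entirely to its left."""
    if a.y1 <= b.y0:
        return True
    same_row = a.y0 < b.y1 and b.y0 < a.y1
    return same_row and a.x1 <= b.x0


def reading_order(boxes):
    """Topologically sort boxes; return (ordered boxes, pair comparisons made)."""
    n = len(boxes)
    successors = [[] for _ in range(n)]
    indegree = [0] * n
    comparisons = 0
    # Building the constraint graph looks at every ordered pair: n * (n - 1).
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            comparisons += 1
            if must_precede(boxes[i], boxes[j]):
                successors[i].append(j)
                indegree[j] += 1
    # Kahn's algorithm: repeatedly emit a box that has no unmet predecessor.
    ready = deque(i for i in range(n) if indegree[i] == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for j in successors[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    # The toy rule can produce cycles; leftover boxes keep their input order.
    emitted = set(order)
    order.extend(i for i in range(n) if i not in emitted)
    return [boxes[i] for i in order], comparisons
```

With n boxes the nested loop makes n * (n - 1) comparisons, so a page with ten thousand text fragments would need roughly 10^8 of them in this toy version, while emitting the boxes in input order needs none.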
Comment 7 Laurent Bonnaud 2015-10-07 09:40:10 UTC
Good catch!  Here are time measurements on my system:

$ time pdftotext bug-poppler54746.pdf

real    5m52.502s
user    5m52.884s
sys     0m0.016s

$ time pdftotext -raw bug-poppler54746.pdf

real    0m0.280s
user    0m0.244s
sys     0m0.020s

There are perhaps bugs in evince and okular (I will report them), but a similar bug exists in pdftotext and should be fixed...
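
For anyone who wants to repeat the comparison on another file, the two runs above can be scripted. This is a minimal sketch, assuming `pdftotext` is on the PATH; the helper names `pdftotext_cmd`, `time_command`, and `compare` are made up here, and `-` as the output file sends the text to stdout so nothing is written to disk.

```python
import subprocess
import time


def pdftotext_cmd(pdf_path, raw=False):
    """Build the pdftotext command line; '-' writes the text to stdout."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")  # keep content stream order, skip reading-order sort
    cmd += [pdf_path, "-"]
    return cmd


def time_command(cmd):
    """Run cmd, discard its output, and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare(pdf_path):
    """Return (default_seconds, raw_seconds) for one PDF."""
    return (
        time_command(pdftotext_cmd(pdf_path)),
        time_command(pdftotext_cmd(pdf_path, raw=True)),
    )
```

Calling `compare("bug-poppler54746.pdf")` returns wall-clock times only, corresponding to the `real` lines in the measurements above.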
Comment 8 Jose Aliste 2015-10-07 13:39:15 UTC
I think one thing to do is to run the TextOutputDev and the splash/cairo output dev in parallel, and if the TextOutputDev takes too long, just abort that part and emit some sort of error. This sounds like something to do in Evince and Okular, but I am just stating the idea here. Evince is not thread-safe (we queue calls to Poppler), so making this work in Evince will take more effort. I don't know about Okular.
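
The "give up if it takes too long" part of this idea can be sketched outside the viewers. The Python below is a substitute technique, not the in-process design described above: it runs the `pdftotext` command-line tool as a child process with a time limit and falls back to `-raw` when the sorted extraction does not finish. The helper names and the 5-second default are assumptions chosen for illustration.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Return the command's stdout as bytes, or None on timeout or failure."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout if done.returncode == 0 else None


def extract_text(pdf_path, timeout_s=5.0):
    """Try reading-order extraction first; fall back to -raw if it is too slow.

    Returns (text_bytes, mode), where mode is "sorted", "raw", or None.
    """
    text = run_with_timeout(["pdftotext", pdf_path, "-"], timeout_s)
    if text is not None:
        return text, "sorted"
    text = run_with_timeout(["pdftotext", "-raw", pdf_path, "-"], timeout_s)
    if text is not None:
        return text, "raw"
    return None, None
```

A caller that gets back `(None, None)` can report that no text is available instead of blocking the user interface or the indexer.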
Comment 9 Albert Astals Cid 2015-10-07 13:48:41 UTC
The actual problem is that the TextOutputDev sorting code is veeeeery slow; that's why raw vs. non-raw is so different.
Comment 10 GitLab Migration User 2018-08-21 10:33:16 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/254.
