A specific PDF file (http://ubuntuone.com/4ELfHGFXVtDAtU0lWsLT6G) causes very high CPU load and takes a long time to render using evince/cairo. On my dual core 2 GHz machine it takes 30 seconds to show the first page alone. This was discovered by tracker choking when indexing this file (see https://bugzilla.gnome.org/show_bug.cgi?id=680897 comments 14 ff). The built-in PDF reader of the chromium browser has no issues at all showing this file.
Assigning to the cairo backend: splash takes 74 ms, cairo takes 14 s.
The file cannot be downloaded any longer. Could someone please upload it again?
Created attachment 115413: the file
Thanks! I confirm that this bug still exists in poppler 0.30.0 in Ubuntu 15.04. Viewing this file takes a very long time, both in evince and okular. Even pdftotext takes ages to convert it (into a useless text file) and indexing it with baloo or tracker also uses too much CPU.
Other data points:
- xpdf also uses a lot of CPU to display the file
- gv, which uses Ghostscript, and gs itself can display the file instantly
The problem here is neither the splash backend nor the cairo backend. Rendering the PDF is fast with both. The problem is the TextOutputDev that okular and evince use to make the text searchable. TextOutputDev tries to sort the text into reading order by performing a topological sort, and this is really slow for this document. It is not even necessary here: all fonts are embedded and all use a custom encoding, so the content will never be searchable. I'm not an expert in TextOutputDev; I just figured out that pdftotext -raw is much faster. Perhaps evince and okular can detect that the text will not be readable and skip the text extraction?
Good catch! Here are time measurements on my system:

$ time pdftotext bug-poppler54746.pdf
real 5m52.502s
user 5m52.884s
sys 0m0.016s

$ time pdftotext -raw bug-poppler54746.pdf
real 0m0.280s
user 0m0.244s
sys 0m0.020s

There are perhaps bugs in evince and okular (I will report them), but a similar bug exists in pdftotext and should be fixed...
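For anyone who wants to repeat the comparison on another machine, here is a minimal Python sketch. It assumes pdftotext is on PATH and relies on two documented pdftotext behaviours: the -raw option, and "-" as the output file meaning stdout. The helper names time_command and compare are made up for illustration and are not part of poppler.

```python
import subprocess
import time
from typing import Dict, Sequence


def time_command(cmd: Sequence[str]) -> float:
    """Return the wall-clock seconds taken by cmd; its output is discarded."""
    start = time.perf_counter()
    subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return time.perf_counter() - start


def compare(pdf_path: str, tool: str = "pdftotext") -> Dict[str, float]:
    """Time reading-order extraction against -raw extraction of one PDF.

    Assumes `tool` is installed; subprocess raises FileNotFoundError otherwise.
    """
    return {
        "reading order": time_command([tool, pdf_path, "-"]),
        "raw": time_command([tool, "-raw", pdf_path, "-"]),
    }
```

Calling compare("bug-poppler54746.pdf") returns the two timings in seconds. It only measures wall-clock time, so it corresponds to the "real" rows above, not to user or sys.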
I think one thing to do is to run the TextOutputDev and the splash/cairo output dev in parallel, and if the TextOutputDev takes too long, just abort that part and emit some sort of error. This sounds like something to do in Evince and Okular, but I am just stating the idea here. Evince is not thread-safe (we queue calls to Poppler), so making this work in evince will take more effort. I don't know about Okular.
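A rough sketch of the "give up after a time budget" idea, written in Python around the pdftotext command line instead of inside a viewer. This is a swapped-in technique: it uses a child process with a timeout, not threads inside Evince or Okular, and it falls back to -raw rather than emitting an error. The function names and the 5 second default are assumptions for illustration only.

```python
import subprocess
from typing import Optional, Sequence


def run_with_budget(cmd: Sequence[str], budget_s: float) -> Optional[bytes]:
    """Run cmd and return its stdout, or None if it exceeds budget_s seconds.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=budget_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


def extract_text(pdf_path: str, budget_s: float = 5.0,
                 tool: str = "pdftotext") -> Optional[bytes]:
    """Try reading-order extraction first; fall back to -raw on timeout.

    Assumes `tool` is on PATH. Returns None if both attempts time out.
    """
    text = run_with_budget([tool, pdf_path, "-"], budget_s)
    if text is None:
        # Reading-order sorting took too long; skip it and take raw order.
        text = run_with_budget([tool, "-raw", pdf_path, "-"], budget_s)
    return text
```

An indexer could use something like this as a stopgap so that a single file cannot occupy a core for minutes, but it only limits the damage; the slow sorting itself is untouched.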
The actual problem is that the TextOutputDev sorting code is very, very slow; that's why raw and non-raw are so different.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/254.