Bug 85196

Summary: Huge spike in CPU and memory usage by tracker extractor due to rogue file
Product: poppler
Component: general
Version: unspecified
Hardware: Other
OS: All
Status: RESOLVED MOVED
Severity: normal
Priority: medium
Reporter: Martyn Russell <mr>
Assignee: poppler-bugs <poppler-bugs>
CC: badshah400, dominique-freedesktop.org, rishi.is

Description Martyn Russell 2014-10-19 11:43:02 UTC
Original bug is reported here for GNOME:
https://bugzilla.gnome.org/show_bug.cgi?id=738704

Here for SUSE:
https://bugzilla.opensuse.org/show_bug.cgi?id=898323

The extractor crashes on a PDF which looks like it has no text and is just an image; see the GNOME bug for the link to it.

I can confirm this bug, but it's not a Tracker bug as far as I can see. We call:

  text = poppler_page_get_text (page);

and we run out of memory; the API call also takes an age to return.
Comment 1 Martyn Russell 2014-10-19 11:44:26 UTC
Meant to say, this is using version 0.26.5
Comment 2 Adrian Johnson 2014-10-20 11:39:03 UTC
The PDF is drawing the dots in the chart with the unicode character U+22C5 DOT OPERATOR. If you have enough memory and patience the file will be successfully processed. On my machine it takes 202 seconds and has peak memory usage of 2.7GB. The output file contains over 100,000 U+22C5 characters.

I recall a discussion a few years ago about improving the efficiency of the text extraction: 

http://lists.freedesktop.org/archives/poppler/2010-November/006646.html

I'm not sure what happened to those patches.
Comment 3 Martyn Russell 2014-10-20 12:27:46 UTC
(In reply to Adrian Johnson from comment #2)
> The PDF is drawing the dots in the chart with the unicode character U+22C5
> DOT OPERATOR. If you have enough memory and patience the file will be
> successfully processed. On my machine it takes 202 seconds and has peak
> memory usage of 2.7GB. The output file contains over 100,000 U+22C5
> characters.

Yeah, still, for a 2 MB file, that's rather a lot of memory to draw 100k characters. The speed is also the reason Tracker will SIGABRT on this file: that's way too long to extract some text from a PDF. Arguably, there is none anyway :)

Is there another, more efficient API we could use, or a way to detect whether there is any content to extract in the first place, so that we can avoid this problem?
 
> I recall a discussion a few years ago about improving the efficiency of the
> text extraction: 
> 
> http://lists.freedesktop.org/archives/poppler/2010-November/006646.html
> 
> I'm not sure what happened to those patches.

This is clearly a problem that extends beyond Tracker if Evince is using poppler too.
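
Until text extraction itself gets cheaper, a caller can at least bound the damage from a file like this one. The following is a minimal sketch of one generic, caller-side guard; it is not something proposed in this thread and it does not use any poppler API. It runs the work in a forked child with POSIX CPU-time and address-space limits, so a runaway extraction is killed instead of exhausting the machine. The names `extract_fn`, `cap_resource`, `run_with_limits`, `quick_job` and `spin_forever` are hypothetical and exist only for illustration; the two callbacks are stand-ins for real extraction work.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback type: the work to run under limits, for example a
 * function that opens the PDF and calls poppler_page_get_text(). */
typedef int (*extract_fn)(void *user_data);

/* Lower both the soft and hard limit of `resource` to `value`, clamped to
 * the hard limit already in force (an unprivileged process cannot raise it). */
static int cap_resource(int resource, rlim_t value)
{
    struct rlimit lim;

    if (getrlimit(resource, &lim) != 0)
        return -1;
    if (lim.rlim_max != RLIM_INFINITY && value > lim.rlim_max)
        value = lim.rlim_max;
    lim.rlim_cur = value;
    lim.rlim_max = value;
    return setrlimit(resource, &lim);
}

/* Run fn(user_data) in a child process limited to cpu_seconds of CPU time
 * and mem_bytes of address space.
 * Returns the child's exit status (0..255), -1 if fork() or waitpid()
 * failed, or -2 if the child was terminated by a signal (for example
 * because it exceeded its CPU limit). */
int run_with_limits(extract_fn fn, void *user_data,
                    rlim_t cpu_seconds, rlim_t mem_bytes)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: apply the limits, then do the work. */
        if (cap_resource(RLIMIT_CPU, cpu_seconds) != 0 ||
            cap_resource(RLIMIT_AS, mem_bytes) != 0)
            _exit(127);
        _exit(fn(user_data) & 0xff);
    }

    /* Parent: wait for the child and report how it ended. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return -2;
    return WEXITSTATUS(status);
}

/* Stand-in for a well-behaved extraction: returns immediately. */
int quick_job(void *unused)
{
    (void)unused;
    return 0;
}

/* Stand-in for a pathological extraction: burns CPU until it is killed. */
int spin_forever(void *unused)
{
    (void)unused;
    volatile unsigned long counter = 0;
    for (;;)
        counter++;
}
```

In a real extractor the callback would also have to hand the extracted text back to the parent, for instance over a pipe, which this sketch leaves out. The limits only contain the symptom; the extraction cost discussed above remains.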
Comment 4 GitLab Migration User 2018-08-20 21:47:25 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/72.
