Bug 106135 - Pathological case demonstrating massive slowdown
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2018-04-19 09:45 UTC by solo
Modified: 2018-08-20 21:59 UTC (History)


Attachments
before (4.92 MB, application/pdf)
2018-04-19 09:45 UTC, solo
after decompressing and recompressing (3.02 MB, application/pdf)
2018-04-19 09:47 UTC, solo
patch a (3.37 KB, patch)
2018-04-19 23:28 UTC, Albert Astals Cid
patch b (388 bytes, patch)
2018-04-19 23:28 UTC, Albert Astals Cid

Description solo 2018-04-19 09:45:57 UTC
Created attachment 138921
before

From a bug reported to pdfgrep at https://gitlab.com/pdfgrep/pdfgrep/issues/25

    The original file, before.pdf, took pdfgrep only 7 seconds to search.
    I then decompressed and recompressed the file to produce after.pdf. On
    this new file, pdfgrep now takes 80 seconds to search it. I also tested
    this procedure against some ebooks and found much worse results, such as
    an increase from 4s to 250s.

    It looks like this might be poppler related, since timing pdftotext on the
    files also exhibits a 10x difference in performance. But every other pdf
    viewer (Mac OS X Preview and Skim, mupdf, PDF.js) and parser (mutool,
    podofo, pdf-parser.py, pstotext/ghostscript) I tried doesn't exhibit any
    significant performance difference between these two files.
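The pdftotext comparison described above can be reproduced with a small timing harness. This is only a sketch: the helper names `time_command` and `compare` are made up for illustration, and it assumes `pdftotext` is on the PATH and that the two attachments have been saved locally.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command, discard its stdout, and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare(before_pdf, after_pdf, tool="pdftotext"):
    """Time the same tool on both files and return (before, after, ratio).

    Passing "-" as the output file makes pdftotext write the text to stdout.
    """
    t_before = time_command([tool, before_pdf, "-"])
    t_after = time_command([tool, after_pdf, "-"])
    return t_before, t_after, t_after / t_before
```

Calling `compare("before.pdf", "after.pdf")` (file names assumed from the attachments) returns both timings and their ratio; a single run is noisy, so repeating it a few times gives a fairer picture.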
Comment 1 solo 2018-04-19 09:47:00 UTC
Created attachment 138922
after decompressing and recompressing
Comment 2 Albert Astals Cid 2018-04-19 23:27:41 UTC
So there are two possible patches I came up with:

a) Don't parse all objects in an ObjectStream
b) Increase the number of cached ObjectStreams

Using a) on this particular PDF, the time goes from 40s to 10s on my machine (still more than the 3s the non compressed PDF takes).

Using b) it goes down to 3s, but at the expense of using more memory (the 50 in the patch is an arbitrary number and would obviously need improvement).
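
To illustrate why a larger cache can help so much, here is a minimal sketch of a least-recently-used cache of parsed object streams. It is not poppler's code: the class name, the string standing in for the expensive parse, the access pattern, and the small capacity of 5 are all hypothetical; only the 50 comes from the patch mentioned above.

```python
from collections import OrderedDict


class ObjectStreamCache:
    """Tiny LRU cache keyed by object-stream number (hypothetical, not poppler's class)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.parses = 0  # how many times a stream had to be (re)parsed

    def get(self, stream_num):
        if stream_num in self._items:
            self._items.move_to_end(stream_num)  # mark as most recently used
            return self._items[stream_num]
        self.parses += 1
        parsed = f"parsed-stream-{stream_num}"  # stand-in for the expensive parse
        self._items[stream_num] = parsed
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return parsed


def count_parses(capacity, accesses):
    """Replay an access pattern and report how many parses it triggered."""
    cache = ObjectStreamCache(capacity)
    for stream_num in accesses:
        cache.get(stream_num)
    return cache.parses


# Hypothetical access pattern: 10 passes cycling over 20 object streams.
accesses = [n for _ in range(10) for n in range(20)]
small = count_parses(5, accesses)   # cache smaller than the working set
large = count_parses(50, accesses)  # cache larger than the working set
print(small, large)
```

When the cache is smaller than the set of streams being cycled through, every access misses and the stream is parsed again; once the cache holds the whole working set, each stream is parsed only once. The price is keeping more parsed streams in memory at the same time.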


And about a), I'm not even sure it's an improvement in all cases. Here it only helps because for most ObjectStreams we just create the stream and then use one object instead of all of them; if we used them all, it'd be slower.
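
The trade-off in a) can be sketched by counting how many objects get parsed under each strategy. This is an illustration only, not the patch itself: the classes, the `parse` helper, and the 100-object stream are hypothetical.

```python
def parse(raw, counter):
    """Stand-in for parsing one object; counts how often it is called."""
    counter["parsed"] += 1
    return int(raw)


class EagerObjectStream:
    """Parses every object when the stream is created."""

    def __init__(self, raw_objects, counter):
        self.objects = [parse(raw, counter) for raw in raw_objects]

    def get(self, idx):
        return self.objects[idx]


class LazyObjectStream:
    """Parses an object only the first time it is requested."""

    def __init__(self, raw_objects, counter):
        self.raw = raw_objects
        self.counter = counter
        self.parsed = {}

    def get(self, idx):
        if idx not in self.parsed:
            self.parsed[idx] = parse(self.raw[idx], self.counter)
        return self.parsed[idx]


raw = [str(n) for n in range(100)]

# Case 1: only one object of the stream is ever used.
eager_counter = {"parsed": 0}
EagerObjectStream(raw, eager_counter).get(7)
lazy_counter = {"parsed": 0}
LazyObjectStream(raw, lazy_counter).get(7)

# Case 2: every object is used, so lazy parsing saves nothing.
lazy_all_counter = {"parsed": 0}
lazy_stream = LazyObjectStream(raw, lazy_all_counter)
for i in range(len(raw)):
    lazy_stream.get(i)
```

In the first case the eager stream parses 100 objects and the lazy one parses 1. In the second case both end up parsing all 100, and the lazy version pays extra bookkeeping on every lookup. The sketch only counts parses; it does not model the cost of locating an individual object inside a real stream, which is presumably where the "it'd be slower" part comes from.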

So I guess the ideal would be to check how much more memory b) uses and probably go with it?
Comment 3 Albert Astals Cid 2018-04-19 23:28:05 UTC
Created attachment 138933
patch a
Comment 4 Albert Astals Cid 2018-04-19 23:28:17 UTC
Created attachment 138934
patch b
Comment 5 GitLab Migration User 2018-08-20 21:59:06 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/141.

