106135 – Pathological case demonstrating massive slowdown

Status:	RESOLVED MOVED

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2018-04-19 09:45 UTC by solo
Modified:	2018-08-20 21:59 UTC
CC List:	1 user

Attachments
before (4.92 MB, application/pdf) 2018-04-19 09:45 UTC, solo
after decompressing and recompressing (3.02 MB, application/pdf) 2018-04-19 09:47 UTC, solo
patch a (3.37 KB, patch) 2018-04-19 23:28 UTC, Albert Astals Cid
patch b (388 bytes, patch) 2018-04-19 23:28 UTC, Albert Astals Cid

Description solo 2018-04-19 09:45:57 UTC

Created attachment 138921
before

From a bug reported to pdfgrep at https://gitlab.com/pdfgrep/pdfgrep/issues/25

    The original file, before.pdf, took pdfgrep only 7 seconds to search.
    I then decompressed and recompressed the file to produce after.pdf. On
    this new file, pdfgrep now takes 80 seconds to search it. I also tested
    this procedure against some ebooks and found much worse results, such as
    an increase from 4s to 250s.

    It looks like this might be poppler related, since timing pdftotext on the
    files also exhibits a 10x difference in performance. But every other pdf
    viewer (Mac OS X Preview and Skim, mupdf, PDF.js) and parser (mutool,
    podofo, pdf-parser.py, pstotext/ghostscript) I tried doesn't exhibit any
    significant performance difference between these two files.

Comment 1 solo 2018-04-19 09:47:00 UTC

Created attachment 138922
after decompressing and recompressing

Comment 2 Albert Astals Cid 2018-04-19 23:27:41 UTC

So there are two possible patches I came up with:

a) Don't parse all objects in an ObjectStream
b) Increase the number of cached ObjectStreams

Using a) on this particular PDF, the time goes from 40s to 10s on my machine (still more than the 3s for the non-compressed PDF).

Using b) it goes down to 3s, but at the expense of using more memory (the 50 in there is an arbitrary number and would obviously need tuning).


As for a), I'm not even sure it's an improvement in all cases: it only helps here because for most of the ObjectStreams we create the stream and then use just one object instead of all of them. If we used them all, it would be slower.
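The trade-off behind a) can be illustrated with a toy cost model. This is only a sketch under a loudly stated assumption: a lazy lookup of object i is assumed to decode objects 0..i from the start of the stream each time. That assumption is made up for illustration and is not taken from poppler's code or from the attached patch. The function names are hypothetical too.

```python
def eager_cost(num_objects, lookups):
    """Units decoded when every object is parsed once, up front."""
    # One sequential pass over the whole stream; later lookups are free.
    return num_objects if len(lookups) > 0 else 0


def lazy_cost(num_objects, lookups):
    """Units decoded when each lookup decodes only up to the requested object."""
    # Assumed cost model: reaching object i means decoding objects 0..i again.
    cost = 0
    for index in lookups:
        if not 0 <= index < num_objects:
            raise IndexError(index)
        cost += index + 1
    return cost


# One object used out of 100: lazy decodes 11 units, eager decodes 100.
single = (lazy_cost(100, [10]), eager_cost(100, [10]))

# All 100 objects used: lazy decodes 5050 units, eager still decodes 100.
everything = (lazy_cost(100, list(range(100))), eager_cost(100, list(range(100))))
```

Under this model, lazy parsing wins when only a few objects per stream are touched and loses badly when all of them are, which matches the concern raised in the comment.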

So I guess the ideal would be to check how much more memory b) uses and probably go with it?
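Why a larger cache as in b) helps so much can be sketched with a small least-recently-used cache that counts how often a stream has to be rebuilt. The class `StreamCache`, the function `count_builds`, the capacities, and the access pattern below are all hypothetical; this is not poppler's cache and the numbers are not measurements from the attached files.

```python
from collections import OrderedDict


class StreamCache:
    """LRU cache that counts how often an object stream must be rebuilt."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.builds = 0

    def get(self, stream_id):
        if stream_id in self.entries:
            self.entries.move_to_end(stream_id)  # mark as most recently used
            return self.entries[stream_id]
        # Cache miss: stands in for decoding and parsing the stream again.
        self.builds += 1
        parsed = f"parsed stream {stream_id}"
        self.entries[stream_id] = parsed
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return parsed


def count_builds(capacity, accesses):
    """Return how many stream builds a given access pattern triggers."""
    cache = StreamCache(capacity)
    for stream_id in accesses:
        cache.get(stream_id)
    return cache.builds


# Hypothetical pattern: objects are fetched round-robin from 10 streams, 5 times.
pattern = list(range(10)) * 5
small = count_builds(5, pattern)   # cache smaller than the working set
large = count_builds(10, pattern)  # cache holds the whole working set
```

With this pattern the small cache rebuilds a stream on every access (50 builds), while the larger one builds each stream once (10 builds). The price is that every cached stream stays in memory, which is the cost the comment suggests measuring before settling on a size.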

Comment 3 Albert Astals Cid 2018-04-19 23:28:05 UTC

Created attachment 138933
patch a

Comment 4 Albert Astals Cid 2018-04-19 23:28:17 UTC

Created attachment 138934
patch b

Comment 5 GitLab Migration User 2018-08-20 21:59:06 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/141.