Created attachment 138921 [details]
before

From a bug reported to pdfgrep at https://gitlab.com/pdfgrep/pdfgrep/issues/25

The original file, before.pdf, took pdfgrep only 7 seconds to search. I then decompressed and recompressed the file to produce after.pdf. On this new file, pdfgrep takes 80 seconds. I also tested this procedure against some ebooks and found much worse results, such as an increase from 4s to 250s.

This might be poppler related, since timing pdftotext on the two files also shows a 10x difference in performance. None of the other PDF viewers (Mac OS X Preview and Skim, mupdf, PDF.js) or parsers (mutool, podofo, pdf-parser.py, pstotext/ghostscript) I tried shows any significant performance difference between these two files.
Created attachment 138922 [details] after decompressing and recompressing
So there are two possible patches I came up with:
a) Don't parse all objects in an ObjectStream
b) Increase the number of cached ObjectStreams

Using a) on this particular PDF, the time goes from 40s to 10s on my machine (still more than the 3s for the non-compressed PDF). Using b) it goes to 3s, but at the expense of using more memory (the 50 in there is an arbitrary number and would obviously need improvement).

As for a), I'm not even sure it's an improvement in all cases. It only helps here because for most of the ObjectStreams we just create the stream and then use one object instead of all of them; if we used them all, it'd be slower.

So I guess the ideal would be to check how much more memory b) needs and probably go with it?
Created attachment 138933 [details] [review] patch a
Created attachment 138934 [details] [review] patch b
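To make the trade-off between the two patches easier to reason about, here is a toy model in Python. It is only a sketch and is not Poppler's code: the class `ObjectStreamCache`, the function `simulate`, the cost units, and every number in it (20 streams, 100 objects per stream, an open cost of 50) are assumptions made for illustration. It models an LRU cache of object streams, where a miss pays a fixed cost to open the stream plus the cost of parsing objects, and it compares eager parsing with a small cache, lazy parsing (approach a), and a larger cache (approach b).

```python
from collections import OrderedDict


class ObjectStreamCache:
    """Toy LRU cache of parsed object streams (hypothetical model, not Poppler code)."""

    def __init__(self, capacity, objects_per_stream, open_cost, lazy=False):
        self.capacity = capacity
        self.objects_per_stream = objects_per_stream
        self.open_cost = open_cost        # assumed fixed cost to open/decompress a stream
        self.lazy = lazy                  # True models "don't parse all objects"
        self.streams = OrderedDict()      # stream id -> set of parsed object indices
        self.cost = 0                     # accumulated work, in arbitrary units

    def fetch(self, stream_id, index):
        parsed = self.streams.get(stream_id)
        if parsed is None:
            # Cache miss: evict the least recently used stream if the cache is full.
            if len(self.streams) >= self.capacity:
                self.streams.popitem(last=False)
            self.cost += self.open_cost
            if self.lazy:
                parsed = set()
            else:
                # Eager mode parses every object in the stream up front.
                parsed = set(range(self.objects_per_stream))
                self.cost += self.objects_per_stream
            self.streams[stream_id] = parsed
        else:
            # Cache hit: mark the stream as most recently used.
            self.streams.move_to_end(stream_id)
        if index not in parsed:
            # Lazy mode parses a single object on demand.
            parsed.add(index)
            self.cost += 1
        return (stream_id, index)


def simulate(capacity, lazy, n_streams=20, objects_per_stream=100,
             open_cost=50, rounds=5):
    """Fetch one object from each stream per round; return (cost, resident streams)."""
    cache = ObjectStreamCache(capacity, objects_per_stream, open_cost, lazy)
    for r in range(rounds):
        for s in range(n_streams):
            cache.fetch(s, r)
    return cache.cost, len(cache.streams)


baseline = simulate(capacity=5, lazy=False)   # small cache, parse everything
patch_a = simulate(capacity=5, lazy=True)     # small cache, parse on demand
patch_b = simulate(capacity=50, lazy=False)   # larger cache, parse everything
```

With these assumed parameters, the access pattern cycles through more streams than the small cache can hold, so every fetch is a miss. Lazy parsing reduces the cost of each miss, while the larger cache removes the repeated misses entirely but keeps all 20 streams resident. The model does not capture the case where every object of a stream ends up being used, which is where lazy parsing could turn out slower.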
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/141.