Using virtual Stream::getChar() is very expensive and results in Lexer::getChar() showing up in the profile as second most expensive function during PDF loading (and that excludes the cost of its children). The reason for that is: in order to get next character from the stream, poppler needs at least one virtual function call + stream's eof logic, but in most cases it's more than that (e.g. often a stream is an embedded stream on top of flate stream on top of file stream which brings the cost to 3 virtual function calls + eof logic for each stream). That's very expensive and since getChar() is one of the most frequently called functions, the result is that an operation that should be essentially free. The attached patch fixes this by adding a way to get direct access to Stream's underlying buffer. That way a client (e.g. a Lexer) can request a buffer and getChar() logic becomes very fast "if buffer not empty, get char from buffer, otherwise ask for another buffer". Not every stream supports that so clients have to also have fallback logic that uses the current slow way of using Stream::getChar(). The additional interface to Stream is as follows: * GBool hasGetBuf() - returns true if a given stream supports getBuf()/ungetBuf() calls * GBool getBuf(char **bufOut, int *bufSizeOut, int maxSize) - stream returns a direct pointer to its underlying buffer in <bufOut> and <bufSizeOut>. It returns gFalse if reached EOF. If returns gTrue, *bufSizeOut must be greater than zero. Note that this is not a copy of data. Stream, not the client, is in control over size of the buffer. Client can only limit the size via <maxSize> (defaults to Stream::NO_SIZE_LIMIT which means that client doesn't care). This is sequential access i.e. subsequent calls to getBuf() return next portion of stream's data. * ungetBuf(int sizeToGoBack) - a client might not need all the data it got from the Stream, which might ruin the stream for other clients. That's why ungetBuf() is needed. Client uses it to "give back" unprocessed data to Stream so that when another client calls getBuf() or getChar(), the right data is returned. It's an equivalent of seeking back in a file. It might not be the prettiest interface, but it allows going fast. I've also converted Lexer to use this interface if available. It gives me around 4% speedup on loading a PDF (really depends on the type of PDF and type of streams inside that PDF). Frankly, I was disappointed that it's only ~4%. I was expecting much more. It turns out that the culprit is current implementation of flate stream, which is frequently used to compress streams inside PDFs. It decompresses data in very small chunks (e.g. 8 bytes on average per getBuf() call in my test) so we don't save nearly as much as if we were getting, say, 256 bytes at a time. I'm working on improving that as well, but this change lays the necessary foundation. Other filter streams (e.g. CCITT/Ascii/DCT) could also be improved by taking advantage of getBuf from their underlying stream and providing getBuf() interface to their clients. So far I've only implemented getBuf() for FileStream, EmbedStream and (not so efficiently) FlateStream. This turned out tricky to implement so how do I know it works? I ran my stress test which includes running a test program over a random collection of 1500+ pdfs, renders every page, records timing informations and all errors reported by poppler via error() function and also does a visual preview that I can see that pages are rendered correctly. That way I can tell if it crashed on any PDF and compare the results with previous runs. The app survived the stress test.
Created attachment 6842 [details] [review] Patch as described in the bug
a better approach would be to make getChar an inline function of Stream with : int getChar() { return (bufPtr >= bufEnd && !fillBuf()) ? EOF : (*bufPtr++ & 0xff); } (which is what FileStream does, but cannot be inlined as the method is virtual) Once all classes use this single *inlined* implementation it's easy to replace all existing getChar either with an efficient fillBuf or if it's not possible with a fillBuf method that fills a one char buffer (with the existing getChar implementation). Each derived class is free to implement the buffer it needs. I started a patch along these lines a long time ago against xpdf 1.01 and lost interest when 2.0 got out before I could submit a final version. There are probably some parts worth recovering if there is any interests. I didn't keep the benchmarks but I recall the improvement was quite nice with pdftotext.
The patch does not apply anymore, and since it seems you are not interested anymore in poppler development i'm closing the bug.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.