8143 – ~4% speedup on loading PDF through Lexer/Stream improvements

Bug 8143 - ~4% speedup on loading PDF through Lexer/Stream improvements

Summary: ~4% speedup on loading PDF through Lexer/Stream improvements

Status:	RESOLVED WONTFIX

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general (show other bugs)
Version:	unspecified
Hardware:	x86 (IA32) Windows (All)

Importance:	medium normal
Assignee:	poppler-bugs
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2006-09-06 00:45 UTC by Krzysztof Kowalczyk
Modified:	2011-06-19 14:54 UTC (History)
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
Patch as described in the bug (38.35 KB, patch) 2006-09-06 00:46 UTC, Krzysztof Kowalczyk	Details \| Splinter Review
View All

Description Krzysztof Kowalczyk 2006-09-06 00:45:19 UTC

Using virtual Stream::getChar() is very expensive and results in
Lexer::getChar() showing up in the profile as second most expensive function
during PDF loading (and that excludes the cost of its children).

The reason for that is: in order to get next character from the stream, poppler
needs at least one virtual function call + stream's eof logic, but in most cases
it's more than that (e.g. often a stream is an embedded stream on top of flate
stream on top of file stream which brings the cost to 3 virtual function calls +
eof logic for each stream). That's very expensive and since getChar() is one of
the most frequently called functions, the result is that an operation that
should be essentially free.

The attached patch fixes this by adding a way to get direct access to Stream's
underlying buffer. That way a client (e.g. a Lexer) can request a buffer and
getChar() logic becomes very fast "if buffer not empty, get char from buffer,
otherwise ask for another buffer".

Not every stream supports that so clients have to also have fallback logic that
uses the current slow way of using Stream::getChar().

The additional interface to Stream is as follows:
* GBool hasGetBuf() - returns true if a given stream supports
getBuf()/ungetBuf() calls
* GBool getBuf(char **bufOut, int *bufSizeOut, int maxSize) - stream returns a
direct pointer to its underlying buffer in <bufOut> and <bufSizeOut>. It returns
gFalse if reached EOF. If returns gTrue, *bufSizeOut must be greater than zero.
Note that this is not a copy of data. Stream, not the client, is in control over
size of the buffer. Client can only limit the size via <maxSize> (defaults to
Stream::NO_SIZE_LIMIT which means that client doesn't care). This is sequential
access i.e. subsequent calls to getBuf() return next portion of stream's data.
* ungetBuf(int sizeToGoBack) - a client might not need all the data it got from
the Stream, which might ruin the stream for other clients. That's why ungetBuf()
is needed. Client uses it to "give back" unprocessed data to Stream so that when
another client calls getBuf() or getChar(), the right data is returned. It's an
equivalent of seeking back in a file.

It might not be the prettiest interface, but it allows going fast.

I've also converted Lexer to use this interface if available. It gives me around
4% speedup on loading a PDF (really depends on the type of PDF and type of
streams inside that PDF).

Frankly, I was disappointed that it's only ~4%. I was expecting much more. It
turns out that the culprit is current implementation of flate stream, which is
frequently used to compress streams inside PDFs. It decompresses data in very
small chunks (e.g. 8 bytes on average per getBuf() call in my test) so we don't
save nearly as much as if we were getting, say, 256 bytes at a time. I'm working
on improving that as well, but this change lays the necessary foundation. 

Other filter streams (e.g. CCITT/Ascii/DCT) could also be improved by taking
advantage of getBuf from their underlying stream and providing getBuf()
interface to their clients. So far I've only implemented getBuf() for
FileStream, EmbedStream and (not so efficiently) FlateStream.

This turned out tricky to implement so how do I know it works? 

I ran my stress test which includes running a test program over a random
collection of 1500+ pdfs, renders every page, records timing informations and
all errors reported by poppler via error() function and also does a visual
preview that I can see that pages are rendered correctly. That way I can tell if
it crashed on any PDF and compare the results with previous runs. The app
survived the stress test.

Comment 1 Krzysztof Kowalczyk 2006-09-06 00:46:01 UTC

Created attachment 6842 [details] [review]
Patch as described in the bug

Comment 2 Ludovic Aubry 2008-12-06 17:45:07 UTC

a better approach would be to make getChar an inline function of Stream with :

int getChar() { return (bufPtr >= bufEnd && !fillBuf()) ? EOF : (*bufPtr++ & 0xff); }

(which is what FileStream does, but cannot be inlined as the method is virtual)

Once all classes use this single *inlined* implementation it's easy to
replace all existing getChar either with an efficient fillBuf or if it's not
possible with a fillBuf method that fills a one char buffer (with the existing
getChar implementation). Each derived class is free to implement the buffer it
needs.

I started a patch along these lines a long time ago against xpdf 1.01 and lost interest when 2.0 got out before I could submit a final version. There are probably some parts worth recovering if there is any interests.

I didn't keep the benchmarks but I recall the improvement was quite nice with pdftotext.

Comment 3 Albert Astals Cid 2011-06-19 14:54:21 UTC

The patch does not apply anymore, and since it seems you are not interested anymore in poppler development i'm closing the bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.