Bug 14126 - pdftotext not checking success of fseek(), so gives misleading errors if input not seekable
Summary: pdftotext not checking success of fseek(), so gives misleading errors if inpu...
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general (show other bugs)
Version: unspecified
Hardware: Other Linux (All)
: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-18 09:56 UTC by Ed Avis
Modified: 2008-03-22 07:51 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Patch to poppler-0.6.3 to check fseek() and similar calls (6.41 KB, patch)
2008-01-18 09:56 UTC, Ed Avis
Details | Splinter Review
Patch to check some fseek() calls, and check file is seekable on initialization (8.10 KB, patch)
2008-03-21 12:41 UTC, Ed Avis
Details | Splinter Review

Description Ed Avis 2008-01-18 09:56:11 UTC
Created attachment 13775 [details] [review]
Patch to poppler-0.6.3 to check fseek() and similar calls

pdftotext fails uncleanly if its input is not seekable.  For example

cat some.pdf | pdftotext /dev/stdin out

This produces errors such as

Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table

But the PDF file is not damaged - just the file is not seekable.  pdftotext doesn't check that its fseek() calls succeed.  In general I suggest it should check the result of all file I/O calls and do something half-sensible on failure.

(Instead of plain fseek() the code uses large-file-capable seek functions on platforms that have them - those too need to be checked.)

(If you need a PDF file that reproduces this bug, you can create one with 'echo | enscript -p - | ps2pdf - some.pdf'.  But I think it happens with pretty much any input file.)

I attach a patch which simplemindedly solves the problem by checking all fseek() calls and throwing an exception on failure, which produces an 'Illegal seek' message on stderr.  This is a bit better than ignoring the failure.  You might not want to apply this particular patch because it depends on C++ exception support, but something similar will be needed to check fseek() and related calls for success or failure.
Comment 1 Ed Avis 2008-03-20 12:40:49 UTC
Hi, I'd be interested to know what you think of this bug report and patch - firstly whether you agree that it's a bug that the return value of fseek() is being ignored, and secondly whether this patch is an acceptable fix.
Comment 2 Albert Astals Cid 2008-03-20 14:22:22 UTC
Do you read minds or something?

Just yesterday i was having a look at this bug report and really i don't know what to do.

Using exceptions is a no go, as the whole project does not uses exceptions. So one has to resort to other solutions.

For FoFiBase::readFile and GfxFont::readExtFontFile it is easy because they return a * that can be NULL in case of error, but all the Stream functions return void, changing them to bool is not acceptable since that would mean checking for errors in hundreds of places where that's not checked currently.

So the solution i liked most was something like return NULL in FoFiBase::readFile and GfxFont::readExtFontFile along with an error() call when we can not seek, and for all the others i think one should try to do a
setPos(1), getPos in PDFDoc::setup and if getPos returns != 1 we just assume the stream is not seekable and return a false in GBool.

Could you try coding such a patch?
Comment 3 Ed Avis 2008-03-20 14:31:58 UTC
>but all the Stream functions
>return void, changing them to bool is not acceptable since that would mean
>checking for errors in hundreds of places where that's not checked currently.

I feel that you just have to knuckle down and put in all the error checks, tedious though it may be... of course, this is the reason why exceptions were added to the language.  However, your compromise may be enough:

>So the solution i liked most was something like return NULL in
>FoFiBase::readFile and GfxFont::readExtFontFile along with an error() call when
>we can not seek, and for all the others i think one should try to do a
>setPos(1), getPos in PDFDoc::setup and if getPos returns != 1 we just assume
>the stream is not seekable and return a false in GBool.

This would take care of the /dev/stdin case.

It wouldn't work correctly when fseek() really does fail part way through, as might happen on network filesystems.  But you might ignore such cases - most Unix software is so buggy anyway with checking the success of file I/O that everyone mounts NFS as 'hard' mounts and failing operations just hang rather than reporting error codes.

>Could you try coding such a patch?

Certainly.  Do you require copyright assignment?
Comment 4 Ed Avis 2008-03-21 12:41:31 UTC
Created attachment 15382 [details] [review]
Patch to check some fseek() calls, and check file is seekable on initialization

Hi, here's a patch that does roughly what you suggested.
Comment 5 Albert Astals Cid 2008-03-22 07:51:30 UTC
Hi, i used the first part of your patch, for the second i went a different way you can see at http://cgit.freedesktop.org/poppler/poppler/commit/?id=066595dd06c930997d5ec65a06c822616af9baa0

So that will be fixed next week when we release 0.8, hope that's good enough for you.

And no, we don't requiere copyright assignment.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.