Bug 45163

Summary: pdftotext -bbox fails to write to stdout
Product: poppler Reporter: awendt
Component: utils Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
Attachments: Proposed patch

Description awendt 2012-01-23 23:33:23 UTC
The pdftotext man page says the following about the output file specified on the command line: If text-file is '-', the text is sent to stdout.

This does not work with the -bbox option. The HTML header and footer are correctly written to stdout, but the contents of the PDF file are appended to a file actually named '-' in the current directory.

If I specify "/dev/stdout" as the output file, I get the expected behaviour.
Comment 1 Albert Astals Cid 2012-01-26 11:17:14 UTC
Can you please write the exact command line you are using, what is the real output and what is the expected output?
Comment 2 awendt 2012-01-26 15:41:27 UTC
(In reply to comment #1)
> Can you please write the exact command line you are using, what is the real
> output and what is the expected output?

Sure... This is the output without -bbox, everything works correctly:

$ pdftotext test.pdf -
Hello!
This is a sample PDF file.

Same command with -bbox added, note how the body element has no content:

$ pdftotext -bbox test.pdf -
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta name="Creator" content="Writer"/>
<meta name="Producer" content="LibreOffice 3.4"/>
<meta name="CreationDate" content=""/>
</head>
<body>
</body>
</html>

Where did the content go? Into a file literally named '-':

$ cat ./-
<doc>
  <page width="612.000000" height="792.000000">
    <word xMin="56.800000" yMin="57.208000" xMax="88.084000" yMax="70.492000">Hello!</word>
    <word xMin="56.800000" yMin="71.008000" xMax="78.064000" yMax="84.292000">This</word>
    <word xMin="81.184000" yMin="71.008000" xMax="89.152000" yMax="84.292000">is</word>
    <word xMin="92.176000" yMin="71.008000" xMax="97.492000" yMax="84.292000">a</word>
    <word xMin="100.480000" yMin="71.008000" xMax="134.392000" yMax="84.292000">sample</word>
    <word xMin="137.464000" yMin="71.008000" xMax="159.424000" yMax="84.292000">PDF</word>
    <word xMin="162.436000" yMin="71.008000" xMax="181.336000" yMax="84.292000">file.</word>
  </page>
</doc>

The expected output is to have the <doc>...</doc> content inside the body element that is sent to stdout, and no file named '-' generated.

I can get the expected output with 'pdftotext -bbox test.pdf /dev/stdout' instead, but that is not very portable.

Basically, the code that writes the header and footer has a special case to convert a filename of '-' to stdout, but the code that writes the bbox content lacks the special case, so they interpret the output filename differently. (For some reason the output file is closed and reopened by these different components, instead of being left open.)
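
To illustrate the mismatch described above, here is a minimal sketch of the single-helper approach, in which every component resolves the output name through one function so that '-' is treated as stdout everywhere. It is written in Python for brevity and is not poppler's actual code (poppler is C++). The names open_output and write_bbox_document are hypothetical, and the header and footer strings are shortened placeholders.

```python
import sys


def open_output(name, mode="w"):
    """Resolve an output name to a stream.

    Hypothetical helper: '-' means stdout, as the man page promises;
    anything else is opened as a regular file.
    Returns (stream, should_close) so callers never close stdout.
    """
    if name == "-":
        return sys.stdout, False
    return open(name, mode), True


def write_bbox_document(name, words, header="<body>\n", footer="</body>\n"):
    """Write header, bbox content and footer to one stream.

    The stream is opened once and shared, so the header/footer code and
    the content code cannot disagree about what '-' means.
    """
    out, should_close = open_output(name)
    try:
        out.write(header)
        out.write("<doc>\n")
        for word in words:
            out.write(f"  <word>{word}</word>\n")
        out.write("</doc>\n")
        out.write(footer)
    finally:
        if should_close:
            out.close()


if __name__ == "__main__":
    # Everything goes to stdout; no file named '-' is created.
    write_bbox_document("-", ["Hello!", "This", "is", "a", "sample"])
```

Keeping the stream open across all three parts also avoids the close-and-reopen step mentioned above, which is where the two interpretations of the filename were able to diverge.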
Comment 3 Yury G. Kudryashov 2013-08-07 09:41:39 UTC
Created attachment 83776
Proposed patch
Comment 4 Albert Astals Cid 2013-08-08 18:46:55 UTC
Committed, thanks.
