Bug 89941

Summary: pdftotext: Add an option for more detailed bounding box information
Product: poppler Reporter: Jeremy Echols <jechols>
Component: utils Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: enhancement    
Priority: medium CC: jechols
Version: unspecified   
Hardware: All   
OS: All   
Whiteboard:
Attachments: Adds -bbox-layout command to pdftotext
Adds -bbox-layout command with man page update
Adds -bbox-layout command with man page update

Description Jeremy Echols 2015-04-07 17:58:15 UTC
Created attachment 114932
Adds -bbox-layout command to pdftotext

We're looking to generate ALTO-compatible XML (http://en.wikipedia.org/wiki/ALTO_%28XML%29) from PDFs. The current -bbox flag almost does what we need, but it omits two important levels of structure: blocks and lines.

I have written a patch based on 0.22.5 (to ensure compatibility with our CentOS 7 system). It appears to apply cleanly to the current master and, as far as I can tell, produces the same output there as my 0.22.5 hack. The change adds a new flag, -bbox-layout, whose output is still very generic but is sufficient for us to transform as needed.
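
For readers unfamiliar with the idea, here is a minimal sketch of nested bounding-box output, where each line carries the union of its words' boxes. It is not taken from the attached patch: the types Box and Word, the helper names, and the one-decimal formatting are all assumptions made for illustration, and a block level would wrap lines in the same way.

```cpp
#include <algorithm>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration; these are NOT poppler's TextWord/TextLine classes.
struct Box { double xMin, yMin, xMax, yMax; };
struct Word { Box box; std::string text; };

// Smallest box enclosing every word on a line (assumes at least one word).
Box unionOf(const std::vector<Word> &words) {
    Box b = words.front().box;
    for (const Word &w : words) {
        b.xMin = std::min(b.xMin, w.box.xMin);
        b.yMin = std::min(b.yMin, w.box.yMin);
        b.xMax = std::max(b.xMax, w.box.xMax);
        b.yMax = std::max(b.yMax, w.box.yMax);
    }
    return b;
}

// Escape the characters that are special in XML text content.
std::string escapeXml(const std::string &s) {
    std::string out;
    for (char c : s) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;"; break;
        case '>': out += "&gt;"; break;
        default: out += c;
        }
    }
    return out;
}

// Format a box as XML attributes (attribute names are an assumption).
std::string attrs(const Box &b) {
    std::ostringstream os;
    os << std::fixed << std::setprecision(1)
       << " xMin=\"" << b.xMin << "\" yMin=\"" << b.yMin
       << "\" xMax=\"" << b.xMax << "\" yMax=\"" << b.yMax << "\"";
    return os.str();
}

// Render one line element containing its word elements.
std::string renderLine(const std::vector<Word> &words) {
    std::ostringstream os;
    os << "<line" << attrs(unionOf(words)) << ">\n";
    for (const Word &w : words)
        os << "  <word" << attrs(w.box) << ">" << escapeXml(w.text) << "</word>\n";
    os << "</line>\n";
    return os.str();
}
```

Calling renderLine with two words yields a line element whose box spans both words, which is the kind of information a later ALTO transform would need.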
Comment 1 Albert Astals Cid 2015-07-14 22:19:40 UTC
Can you please update the man page too? pdftotext.1
Comment 2 Jeremy Echols 2015-07-15 19:54:10 UTC
Created attachment 117147
Adds -bbox-layout command with man page update

Adds the new flag and updates the man page to document it.  (This was created via git format-patch.)
Comment 3 Albert Astals Cid 2015-07-16 22:27:55 UTC
/home/tsdgeos/devel/poppler/utils/pdftotext.cc: In function ‘void printLine(FILE*, TextLine*)’:
/home/tsdgeos/devel/poppler/utils/pdftotext.cc:512:35: warning: format not a string literal and no format arguments [-Wformat-security]
   fprintf(f, wordXML.str().c_str());
                                   ^
Please fix.
Comment 4 Jeremy Echols 2015-07-21 16:50:42 UTC
Created attachment 117280
Adds -bbox-layout command with man page update

Adds the new flag, "-bbox-layout", and a man page addition, and deals with the -Wformat-security warning by using fputs instead of fprintf when there is no format string.
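
As a rough illustration of why the compiler complained: passing a runtime string as the format argument of fprintf means any '%' in it is parsed as a conversion specifier, whereas fputs writes the bytes verbatim. The helper below is a hypothetical sketch, not the code from the attachment; the name writeXml is an assumption.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not poppler's printLine): writes pre-built XML to f.
void writeXml(FILE *f, const std::string &xml) {
    // Unsafe form that triggers -Wformat-security:
    //   fprintf(f, xml.c_str());
    // A '%' inside xml would be treated as a conversion specifier.

    // Safe form: no format parsing, the string is written as-is.
    std::fputs(xml.c_str(), f);
}
```

An equivalent alternative would be fprintf(f, "%s", xml.c_str()), which keeps the format string a literal.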
Comment 5 Albert Astals Cid 2015-08-31 22:23:40 UTC
Pushed.
