Bug 89941 - pdftotext: Add an option for more detailed bounding box information
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: All All
Importance: medium enhancement
Assignee: poppler-bugs
 
Reported: 2015-04-07 17:58 UTC by Jeremy Echols
Modified: 2015-08-31 22:23 UTC

Attachments
Adds -bbox-layout command to pdftotext (6.30 KB, patch)
2015-04-07 17:58 UTC, Jeremy Echols
Adds -bbox-layout command with man page update (7.32 KB, patch)
2015-07-15 19:54 UTC, Jeremy Echols
Adds -bbox-layout command with man page update (7.31 KB, patch)
2015-07-21 16:50 UTC, Jeremy Echols

Description Jeremy Echols 2015-04-07 17:58:15 UTC
Created attachment 114932
Adds -bbox-layout command to pdftotext

We're looking to generate ALTO-compatible XML (http://en.wikipedia.org/wiki/ALTO_%28XML%29) from PDFs. The current -bbox flag almost does what we need, but it omits some important data: blocks and lines.

I have written a patch against 0.22.5 (to ensure compatibility with our CentOS 7 system). It appears to apply cleanly to the current master and, as far as I can tell, produces the same output there as my 0.22.5 hack. The change adds a new flag, -bbox-layout, whose output is still very generic but is sufficient for us to transform as needed.
Comment 1 Albert Astals Cid 2015-07-14 22:19:40 UTC
Can you please update the man page too? pdftotext.1
Comment 2 Jeremy Echols 2015-07-15 19:54:10 UTC
Created attachment 117147
Adds -bbox-layout command with man page update

Adds the new flag along with the man page change that documents it. (This patch was created via git format-patch.)
Comment 3 Albert Astals Cid 2015-07-16 22:27:55 UTC
/home/tsdgeos/devel/poppler/utils/pdftotext.cc: In function ‘void printLine(FILE*, TextLine*)’:
/home/tsdgeos/devel/poppler/utils/pdftotext.cc:512:35: warning: format not a string literal and no format arguments [-Wformat-security]
   fprintf(f, wordXML.str().c_str());
                                   ^
Please fix.
Comment 4 Jeremy Echols 2015-07-21 16:50:42 UTC
Created attachment 117280
Adds -bbox-layout command with man page update

Adds the new flag, "-bbox-layout", and a man page addition, and fixes the -Wformat-security warning by using fputs instead of fprintf where there is no format string.
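
For readers unfamiliar with the warning, here is a minimal sketch of the problem and the fix. The helper names buildWordXML and writeWord are hypothetical and are not taken from the patch. The only part that comes from this thread is the switch from fprintf to fputs when the string being written is data rather than a format string. The sketch also skips XML escaping of the word text, which real output would need.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper (not from the patch): serialize one word and its
// bounding box as a single XML element. No XML escaping is done here.
std::string buildWordXML(double xMin, double yMin, double xMax, double yMax,
                         const std::string &text) {
    std::ostringstream wordXML;
    wordXML << "<word xMin=\"" << xMin << "\" yMin=\"" << yMin
            << "\" xMax=\"" << xMax << "\" yMax=\"" << yMax << "\">"
            << text << "</word>\n";
    return wordXML.str();
}

// Hypothetical helper (not from the patch): write the prepared string.
//
// The flagged pattern was:
//     fprintf(f, wordXML.str().c_str());
// That passes document-derived data as the format string, so a '%' in the
// word text would be interpreted as a conversion specifier.
//
// fputs writes the bytes verbatim and interprets nothing.
void writeWord(FILE *f, const std::string &xml) {
    std::fputs(xml.c_str(), f);
}
```

An equivalent alternative would be fprintf(f, "%s", xml.c_str()), which also keeps the data out of the format-string position. fputs is simply the shorter way to say the same thing.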
Comment 5 Albert Astals Cid 2015-08-31 22:23:40 UTC
Pushed.

