Bug 9746 - pdftohtml: complex output: text rendered in background image
Summary: pdftohtml: complex output: text rendered in background image
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: high normal
Assignee: poppler-bugs
Reported: 2007-01-23 13:46 UTC by David Mackay
Modified: 2010-08-22 14:13 UTC

Attachments
Patch to fix the described issue (1.17 KB, patch)
2007-07-11 10:29 UTC, Evan Cortens
Fixed an oversight on my part, forgot to include the fix to pdftohtml.cc as well (1.45 KB, patch)
2007-07-11 10:57 UTC, Evan Cortens
sample PDF which demonstrates bug (12.09 KB, application/pdf)
2010-06-19 21:00 UTC, Phil Mocek

Description David Mackay 2007-01-23 13:46:13 UTC
If you run pdftohtml with the -c option for complex output, the text is rendered
into the background image as well as emitted as literal HTML strings.  The two don't
superimpose exactly (or even closely, sometimes), and the result looks very fuzzy.

It would be nice if the text were not placed in the background image, as was
the case with prior versions.
Comment 1 Evan Cortens 2007-07-11 10:29:55 UTC
Created attachment 10669
Patch to fix the described issue

This patch adds the methods getPSNoText and setPSNoText to the GlobalParams class, as well as the appropriate private GBool psNoText. In PSOutputDev::drawString, a return was added if globalParams->getPSNoText() returns true.

This is an exact copy from the original pdftohtml.
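
For readers without the attachment, here is a rough sketch of the idea behind the patch, not the patch itself. The two classes are simplified, hypothetical stand-ins for poppler's GlobalParams and PSOutputDev: the flag is a plain bool rather than GBool, the parameters object is passed by reference rather than read from the globalParams pointer, and drawString records strings in a vector instead of writing output. The emittedText accessor is invented for illustration only.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for poppler's GlobalParams.
// Only the psNoText flag and its accessors from the patch are modelled.
class GlobalParams {
public:
    bool getPSNoText() const { return psNoText; }
    void setPSNoText(bool noText) { psNoText = noText; }

private:
    bool psNoText = false;  // the patch declares this as GBool
};

// Hypothetical stand-in for PSOutputDev. Instead of producing real output,
// it records each string it would have drawn so the effect is observable.
class PSOutputDev {
public:
    explicit PSOutputDev(const GlobalParams &p) : params(p) {}

    void drawString(const std::string &s) {
        // Early return corresponding to the one the patch adds:
        // when psNoText is set, no text reaches the background image.
        if (params.getPSNoText()) {
            return;
        }
        emitted.push_back(s);
    }

    // Illustration-only accessor (not part of poppler).
    const std::vector<std::string> &emittedText() const { return emitted; }

private:
    const GlobalParams &params;
    std::vector<std::string> emitted;
};
```

With the flag left at its default, drawString records the text; once setPSNoText(true) has been called, later drawString calls become no-ops, so only the HTML output would carry the text.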
Comment 2 Evan Cortens 2007-07-11 10:57:04 UTC
Created attachment 10670
Fixed an oversight on my part, forgot to include the fix to pdftohtml.cc as well

In addition to the aforementioned changes, pdftohtml.cc also needs to be modified to enable psNoText.
Comment 3 Albert Astals Cid 2007-07-11 11:51:53 UTC
Fixed using a different patch because we are actually trying to kill GlobalParams.

Thanks for the report!
Comment 4 Phil Mocek 2010-06-19 20:58:40 UTC
I'm experiencing the same with pdftohtml 0.12.4 from the Ubuntu 10.4 package.  On pages which contain black boxes where text has been redacted, text is rendered in the background image.  On pages which do not have such boxes, text is not rendered in the background image.
Comment 5 Phil Mocek 2010-06-19 21:00:21 UTC
Created attachment 36372
sample PDF which demonstrates bug
Comment 6 Phil Mocek 2010-06-19 21:02:05 UTC
To reproduce: using attached PDF, run "pdftohtml -c -noframes -hidden -nomerge PoliceReport-2010202024.pdf PoliceReport-2010202024.html"
Comment 7 Albert Astals Cid 2010-08-22 14:13:11 UTC
Should be fixed in poppler >= 0.15.0

