Bug 9746

Summary: pdftohtml: complex output: text rendered in background image
Product: poppler
Component: general
Status: RESOLVED FIXED
Severity: normal
Priority: high
Version: unspecified
Hardware: x86 (IA32)
OS: Linux (All)
Reporter: David Mackay <mackay_d>
Assignee: poppler-bugs <poppler-bugs>
QA Contact:
CC: pmocek-freedesktop
Whiteboard:
Attachments: Patch to fix the described issue
Fixed an oversight on my part, forgot to include the fix to pdftohtml.cc as well
sample PDF which demonstrates bug

Description David Mackay 2007-01-23 13:46:13 UTC
If you run pdftohtml with the -c option for complex output, the text is rendered
into the background image as well as emitted as HTML literal strings.  The two
copies don't superimpose exactly (sometimes not even closely), and the result
looks very fuzzy.

It would be nice if the text were not placed in the background image, as was
the case with prior versions.
Comment 1 Evan Cortens 2007-07-11 10:29:55 UTC
Created attachment 10669
Patch to fix the described issue

This patch adds the methods getPSNoText and setPSNoText to the GlobalParams class, as well as the appropriate private GBool psNoText. In PSOutputDev::drawString, a return was added if globalParams->getPSNoText() returns true.

This is an exact copy from the original pdftohtml.
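
For readers without the attachment at hand, here is a minimal, self-contained sketch of the approach described above. It is not the attached patch: the two classes are simplified stand-ins that borrow the names GlobalParams and PSOutputDev, plain bool replaces GBool, the params object is passed by reference instead of going through the global globalParams pointer, and the drawnStrings() recorder is hypothetical, added only so the effect can be observed.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for GlobalParams: only the flag described in this
// comment is modelled. The real class uses GBool rather than bool.
class GlobalParams {
public:
    bool getPSNoText() const { return psNoText; }
    void setPSNoText(bool v) { psNoText = v; }

private:
    bool psNoText = false;  // text is drawn unless the flag is set
};

// Hypothetical miniature output device. Instead of emitting PostScript it
// records the strings it would have drawn.
class PSOutputDev {
public:
    explicit PSOutputDev(const GlobalParams &paramsA) : params(paramsA) {}

    void drawString(const std::string &s) {
        // Early return: when psNoText is set, no text reaches the output,
        // so the background image carries graphics only.
        if (params.getPSNoText()) {
            return;
        }
        drawn.push_back(s);
    }

    // Hypothetical accessor, for illustration only.
    const std::vector<std::string> &drawnStrings() const { return drawn; }

private:
    const GlobalParams &params;
    std::vector<std::string> drawn;
};
```

In this sketch, calling setPSNoText(true) before rendering makes every later drawString call a no-op, while the rest of the page content would be unaffected.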
Comment 2 Evan Cortens 2007-07-11 10:57:04 UTC
Created attachment 10670
Fixed an oversight on my part, forgot to include the fix to pdftohtml.cc as well

In addition to the aforementioned changes, pdftohtml.cc also needs to be modified to enable psNoText.
Comment 3 Albert Astals Cid 2007-07-11 11:51:53 UTC
Fixed using a different patch because we are actually trying to kill GlobalParams.

Thanks for the report!
Comment 4 Phil Mocek 2010-06-19 20:58:40 UTC
I'm experiencing the same problem with pdftohtml 0.12.4 from the Ubuntu 10.4 package.  On pages that contain black boxes where text has been redacted, text is rendered in the background image.  On pages that do not have such boxes, text is not rendered in the background image.
Comment 5 Phil Mocek 2010-06-19 21:00:21 UTC
Created attachment 36372
sample PDF which demonstrates bug
Comment 6 Phil Mocek 2010-06-19 21:02:05 UTC
To reproduce: using attached PDF, run "pdftohtml -c -noframes -hidden -nomerge PoliceReport-2010202024.pdf PoliceReport-2010202024.html"
Comment 7 Albert Astals Cid 2010-08-22 14:13:11 UTC
Should be fixed in poppler >= 0.15.0
