Bug 101807

Summary: pdftohtml: fakebold and dropshadow duplicated text
Product: poppler
Reporter: Jason Crain <jason>
Component: pdftohtml
Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
CC: media-x
Version: unspecified
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments: pdf-example.html - from pdftohtml -s -noframes

Description Jason Crain 2017-07-16 19:26:25 UTC
If you run pdftohtml on the PDF in bug #101770 (https://bugs.freedesktop.org/attachment.cgi?id=132659), the output contains duplicated and jumbled characters.

Some PDFs draw text multiple times to emulate bold text or drop shadows.  The main TextOutputDev goes to a lot of trouble to remove this duplicated text.  pdftohtml should do this too.
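
To illustrate the kind of filtering meant here, the following is a minimal sketch, not poppler's actual code. It assumes each drawn glyph is available as a character plus a position and font size, and it drops a glyph when the same character was already kept at almost the same position. The `Glyph` type, the `drop_duplicate_glyphs` name, and the tolerance values are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical record for one drawn character."""
    char: str
    x: float
    y: float
    font_size: float


def drop_duplicate_glyphs(glyphs, max_dx=0.1, max_dy=0.2):
    """Keep the first copy of each glyph and drop near-coincident repeats.

    max_dx and max_dy are illustrative tolerances expressed as fractions
    of the font size; they are assumptions, not values taken from poppler.
    """
    kept = []
    for g in glyphs:
        is_duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) < max_dx * g.font_size
            and abs(k.y - g.y) < max_dy * g.font_size
            for k in kept
        )
        if not is_duplicate:
            kept.append(g)
    return kept


# A fake-bold "A" drawn twice with a small offset, followed by a "B".
page = [
    Glyph("A", 10.0, 10.0, 12.0),
    Glyph("A", 10.3, 10.2, 12.0),
    Glyph("B", 20.0, 10.0, 12.0),
]
text = "".join(g.char for g in drop_duplicate_glyphs(page))
```

The comparison against every kept glyph is quadratic and only suitable for a demonstration; a real implementation would presumably restrict the search to nearby glyphs.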
Comment 1 Jason Crain 2017-07-16 19:39:15 UTC
Created attachment 132719
pdf-example.html - from pdftohtml -s -noframes

I've attached the HTML file resulting from running "pdftohtml -s -noframes pdf-example.pdf".  I haven't attached the images, but this should be enough to get the idea.  The HTML has several places where lines are duplicated.  Example:

<p style="..." class="ft10">1&#160;</p>
<p style="..." class="ft11">&#160;</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJENTA</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJENT</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJEN</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJE</p>
<p style="..." class="ft12">&#160;</p>
<p style="..." class="ft12">&#160;LUNATA</p>
<p style="..." class="ft12">LUNAT</p>
<p style="..." class="ft12">LUNA</p>
<p style="..." class="ft12">LUN</p>

I've elided the style attribute above to keep the lines a reasonable length.
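
The pattern in the example above (each repeated line is a prefix of the line before it, with the same class) can be spotted mechanically. The following is a rough sketch for checking pdftohtml output for this symptom, not a fix for the bug. The `find_prefix_duplicates` name and the regular expression are assumptions made for illustration, and the regex only handles simple one-line `<p>` elements like those shown.

```python
import html
import re

# Matches simple one-line <p ... class="...">text</p> elements.
P_RE = re.compile(r'<p[^>]*class="([^"]*)"[^>]*>(.*?)</p>')


def find_prefix_duplicates(html_text):
    """Return paragraph texts that are a prefix of the preceding paragraph
    with the same class, which is how the duplicates above appear."""
    suspects = []
    prev_cls, prev_text = None, ""
    for cls, raw in P_RE.findall(html_text):
        # Decode entities such as &#160; and normalise non-breaking spaces.
        text = html.unescape(raw).replace("\xa0", " ").strip()
        if text and cls == prev_cls and prev_text.startswith(text):
            suspects.append(text)
        prev_cls, prev_text = cls, text
    return suspects


sample = """
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJENTA</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJENT</p>
<p style="..." class="ft12">&#160;</p>
<p style="..." class="ft12">&#160;LUNATA</p>
<p style="..." class="ft12">LUNAT</p>
"""
suspects = find_prefix_duplicates(sample)
```

This heuristic would also flag legitimate text where one line happens to repeat the start of the previous one, so its results are only candidates for inspection.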
Comment 2 GitLab Migration User 2018-08-21 10:40:53 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new issue through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/321.
