101807 – pdftohtml: fakebold and dropshadow duplicated text

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	pdftohtml
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2017-07-16 19:26 UTC by Jason Crain
Modified:	2018-08-21 10:40 UTC
CC List:	1 user

Attachments
pdf-example.html - from pdftohtml -s -noframes (84.64 KB, text/html) 2017-07-16 19:39 UTC, Jason Crain

Description Jason Crain 2017-07-16 19:26:25 UTC

If you run pdftohtml on the PDF in bug #101770 (https://bugs.freedesktop.org/attachment.cgi?id=132659), it results in duplicated and jumbled characters.

Some PDFs draw text multiple times to emulate bold text or drop shadows.  The main TextOutputDev goes to a lot of trouble to remove this duplicated text.  pdftohtml should do this too.
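
To make the idea concrete, here is a minimal sketch of the kind of duplicate suppression described above. It is not poppler's code: the Glyph record, the fake_bold helper, and the 0.1 tolerance are all assumptions chosen for illustration. A glyph is treated as a duplicate when the same character has already been kept at almost the same position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical record for one drawn character (not a poppler type)."""
    char: str
    x: float
    y: float
    font_size: float


def drop_overdrawn_glyphs(glyphs, tolerance=0.1):
    """Keep the first copy of each glyph and drop later copies of the same
    character drawn within tolerance * font_size of a kept one.

    The tolerance value is an illustrative assumption.
    """
    kept = []
    for g in glyphs:
        limit = tolerance * g.font_size
        is_duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) <= limit
            and abs(k.y - g.y) <= limit
            for k in kept
        )
        if not is_duplicate:
            kept.append(g)
    return kept


def fake_bold(text, size=10.0, advance=6.0, passes=3, shift=0.2):
    """Simulate a PDF that emulates bold by drawing the whole string several
    times, each pass shifted slightly to the right."""
    glyphs = []
    for p in range(passes):
        for i, ch in enumerate(text):
            glyphs.append(Glyph(ch, i * advance + p * shift, 0.0, size))
    return glyphs
```

With these assumptions, fake_bold("LUNATA") yields 18 glyphs, and drop_overdrawn_glyphs reduces them to the six of a single pass. The two "A" characters both survive because they sit a full advance apart, well outside the tolerance.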

Comment 1 Jason Crain 2017-07-16 19:39:15 UTC

Created attachment 132719
pdf-example.html - from pdftohtml -s -noframes

I've attached the HTML file resulting from running "pdftohtml -s -noframes pdf-example.pdf".  I haven't attached the images but this should be enough to get the idea.  The HTML has several places where lines are duplicated.  Example:

<p style="..." class="ft10">1&#160;</p>
<p style="..." class="ft11">&#160;</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJENTA</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJENT</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJEN</p>
<p style="..." class="ft12">UPUTSTVO&#160;ZA&#160;PACIJE</p>
<p style="..." class="ft12">&#160;</p>
<p style="..." class="ft12">&#160;LUNATA</p>
<p style="..." class="ft12">LUNAT</p>
<p style="..." class="ft12">LUNA</p>
<p style="..." class="ft12">LUN</p>

I've elided the style attribute above to keep the lines a reasonable length.
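
For anyone checking other output files for the same symptom, the pattern above (each repeat is a truncated copy of the previous line) can be spotted with a small script. This is only a rough heuristic for detecting the problem in generated HTML, not a fix for pdftohtml. The function names are made up for this sketch, and it assumes each paragraph sits in a single <p>...</p> element as in the excerpt.

```python
import html
import re

# Matches one <p ...>text</p> element; assumes no nested tags inside.
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>")


def paragraph_texts(html_source):
    """Return the text of each <p> element, with entities decoded and
    surrounding whitespace (including non-breaking spaces) stripped."""
    return [html.unescape(body).strip() for body in PARAGRAPH.findall(html_source)]


def find_truncated_repeats(texts):
    """Return indexes of paragraphs that are shorter prefixes of the
    previous non-empty paragraph. Empty paragraphs are skipped."""
    flagged = []
    previous = ""
    for i, text in enumerate(texts):
        if not text:
            continue
        if previous and text != previous and previous.startswith(text):
            flagged.append(i)
        previous = text
    return flagged
```

Run on the excerpt above, this flags the three truncated "UPUTSTVO ZA PACIJENTA" lines and the three truncated "LUNATA" lines. A legitimate paragraph that happens to be a prefix of the one before it would also be flagged, so the result needs a manual look.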

Comment 2 GitLab Migration User 2018-08-21 10:40:53 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/321.
