Bug 12522 - pdftohtml -c generates html with very ugly/unusual spacing
Summary: pdftohtml -c generates html with very ugly/unusual spacing
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-09-22 09:14 UTC by Luis Villa
Modified: 2018-08-21 10:49 UTC
CC List: 2 users


Attachments
pdftohtml: do not set background color (566 bytes, text/plain)
2010-05-07 04:39 UTC, Jakub Wilk
An example PDF which will demonstrate the font spacing issues (37.74 KB, application/pdf)
2010-06-08 13:15 UTC, Mike Slegeir

Description Luis Villa 2007-09-22 09:14:38 UTC
pdftohtml -c is trying to respect the different font sizes present in this document:
http://altlaw.org/v1/cases/157903.pdf

but it generates HTML that looks very unusual:
http://altlaw.org/v1/cases/157903 (everything in the frames is generated with pdftohtml.)

Note all the extra/unusual spacing, at least in Firefox 2.
Comment 1 Jakub Wilk 2010-05-07 04:39:19 UTC
Created attachment 35491
pdftohtml: do not set background color

The patch doesn't implement the requested feature, but at least makes the default more sensible. Setting a grayish background if and only if -noframes is used doesn't really make sense.
Comment 2 Jakub Wilk 2010-05-07 04:42:05 UTC
Oops, sorry, the patch was meant to be submitted to another bug. :/
Comment 3 Albert Astals Cid 2010-05-10 07:57:14 UTC
Marked this patch as obsolete, please attach the patch to the correct bug.
Comment 4 Mike Slegeir 2010-06-08 13:15:36 UTC
Created attachment 36165
An example PDF which will demonstrate the font spacing issues

The issue appears to be caused by an adjustment of the font sizes.  The test document demonstrates the issue when run through pdftohtml and explains the problem:

The symptoms of the problem are that certain effects such as underlining and strike-through are misaligned, and occasional gaps in the text are present when a new section of text begins on the same line.
I believe the font shrinking comes from two lines of code: HTMLOutputDev.cc:124 and HTMLFonts.cc:118. Both lines subtract 1 from the font size; the former also truncates any fractional value.

I can't understand the purpose of the two subtractions.  The truncating does make sense though: it's better to round down than to round up because having text run together is much worse than having small gaps.  Can anyone motivate subtracting 2 from the PDF font size to get the HTML font size?
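
To make the arithmetic described above concrete, here is a minimal C++ sketch. It is not poppler's code: the function names are hypothetical, and the order of the two steps is an assumption based only on the description of HTMLOutputDev.cc:124 and HTMLFonts.cc:118 in this comment.

```cpp
// Illustrative model only; these helpers are hypothetical and do not
// appear in poppler. They reproduce the two adjustments described above.

// Step 1 (as described for HTMLOutputDev.cc:124): subtract 1 from the
// PDF font size, then truncate any fractional part.
constexpr int adjustInOutputDev(double pdfFontSize) {
    return static_cast<int>(pdfFontSize - 1.0);  // truncates toward zero
}

// Step 2 (as described for HTMLFonts.cc:118): subtract 1 again.
constexpr int adjustInFonts(int size) {
    return size - 1;
}

// Combined effect: the font size that ends up in the generated HTML.
constexpr int htmlFontSize(double pdfFontSize) {
    return adjustInFonts(adjustInOutputDev(pdfFontSize));
}

// How much smaller the HTML font is than the PDF font.
constexpr double shrinkage(double pdfFontSize) {
    return pdfFontSize - htmlFontSize(pdfFontSize);
}

// Compile-time checks of the arithmetic.
static_assert(htmlFontSize(12.0) == 10, "12pt PDF text becomes 10 in HTML");
static_assert(htmlFontSize(10.5) == 8, "10.5pt PDF text becomes 8 in HTML");
```

Under this model a whole-number size shrinks by exactly 2, and a fractional size shrinks by 2 plus its fractional part. Text laid out at the smaller size is narrower than the PDF positions expect, which would be consistent with the gaps and the misaligned underlines described above.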
Comment 5 GitLab Migration User 2018-08-21 10:49:03 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/384.

