Bug 28052 - pdftohtml loses some double lls in duplicate check
Summary: pdftohtml loses some double lls in duplicate check
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: All All
Importance: medium normal
Assignee: poppler-bugs
QA Contact:
Depends on:
Reported: 2010-05-10 08:29 UTC by Chris Faulhaber
Modified: 2018-08-20 22:02 UTC (History)
0 users

See Also:

Potential fix (1.81 KB, patch)
2010-05-10 17:36 UTC, Chris Faulhaber
Document exhibiting the problem (55.59 KB, application/pdf)
2010-05-25 12:28 UTC, Chris Faulhaber
Potential fix, revised (1.74 KB, patch)
2011-01-17 15:03 UTC, Svein Erik

Description Chris Faulhaber 2010-05-10 08:29:38 UTC
Problem: In some PDF documents the two glyphs of a double l (ll) overlap slightly, and pdftohtml drops the second l.  E.g., called becomes cal ed, all becomes al , and eventually becomes eventual y.

Version: poppler-0.13.3

Reason: In HtmlOutputDev.cc, class HtmlPage, method coalesce, there's a section of code that discards duplicate text for "fake boldface, drop shadows."  The overlapping lls trigger this duplicate check, so the second l is removed from the output.

The debug output shows:
x=139.68000..143.016000  y=626.076000..641.844000  size=15 'l'
x=142.80000..146.136000  y=626.076000..641.844000  size=15 'l'
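
As a rough check on these numbers, here is a minimal sketch of how a tolerance-based duplicate test could flag the second l.  The form of the test (every edge within a fraction of the box height) is my assumption and is not copied from HtmlOutputDev.cc; GlyphBox and looksLikeDuplicate are hypothetical names.  The 0.2 and 0.1 values are the fudge factors mentioned in Solution 1 below.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for one glyph's bounding box (not poppler's own type).
struct GlyphBox {
    double xMin, xMax, yMin, yMax;
    char ch;
};

// ASSUMED model of the duplicate test, not the code in HtmlOutputDev.cc:
// two glyphs count as duplicates when they are the same character and each
// edge of `b` lies within fudge * (height of `a`) of the matching edge of `a`.
bool looksLikeDuplicate(const GlyphBox &a, const GlyphBox &b, double fudge) {
    assert(a.yMax > a.yMin); // the tolerance needs a positive box height
    const double tol = fudge * (a.yMax - a.yMin);
    return a.ch == b.ch
        && std::fabs(b.xMin - a.xMin) < tol
        && std::fabs(b.xMax - a.xMax) < tol
        && std::fabs(b.yMin - a.yMin) < tol
        && std::fabs(b.yMax - a.yMax) < tol;
}
```

With the two boxes above, the horizontal offset is 3.12 and the box height is 15.768, so the tolerance is about 3.15 at a factor of 0.2 and about 1.58 at 0.1.  Under this model the second l is treated as a duplicate at 0.2 but not at 0.1.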

Due to my inexperience with the project I can't say what the best solution would be.  Here are a few options I've considered.  If you'd like to suggest a preferred method for solving this problem, I will implement it and submit a patch.  Note, however, that I have no test documents that contain actual duplicate text.

Solution 1: Decrease the fudge factor from 0.2 to 0.1.  This may not be reliable and could cause the duplicates which this code was originally meant to discard to resurface.  It will, however, let the lls through in my test documents.

Solution 2: Make the duplicate check a command-line option.  Documents that have both lls and duplicate text will still exhibit errors, though.

Solution 3: Use a different algorithm for determining duplicate text.  Perhaps the dupe check shouldn't drop a character that starts more than halfway across the bounding box of the previous character.  In this example, 141.348 is the horizontal midpoint of the first character's box, and the second character starts at 142.8, which is beyond it.  It seems unlikely that a boldface or drop-shadow copy would start that far past the starting point of its host character.
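
To make Solution 3 concrete, a minimal sketch of the midpoint rule follows.  GlyphBox, midpointX and startsPastMidpoint are hypothetical names used only for illustration, and this is not taken from any attached patch.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for one glyph's bounding box (not poppler's own type).
struct GlyphBox {
    double xMin, xMax, yMin, yMax;
};

// Horizontal midpoint of a glyph's bounding box.
double midpointX(const GlyphBox &box) {
    return box.xMin + (box.xMax - box.xMin) / 2.0;
}

// Solution 3 as worded above: a glyph that starts beyond the midpoint of the
// previous glyph's box is kept and never considered a duplicate.
bool startsPastMidpoint(const GlyphBox &prev, const GlyphBox &cur) {
    return cur.xMin > midpointX(prev);
}
```

For the two boxes from the debug output, midpointX of the first l is 141.348 and the second l starts at 142.8, so startsPastMidpoint returns true and the second l would be kept.  A copy offset by only a small amount (say 0.5, a made-up value) starts before the midpoint and would still go through the existing duplicate check.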
Comment 1 Albert Astals Cid 2010-05-10 08:52:27 UTC
We really don't have a preference, so propose your patch and I'll run it against the test suite to see whether it causes any regressions.  If it does not, I'll commit the patch.
Comment 2 Chris Faulhaber 2010-05-10 17:36:09 UTC
Created attachment 35556
Potential fix

I've gone with solution #3.  Tested against a few PDFs and it seems to work OK, but I don't have any that have true duplicates.
Comment 3 Albert Astals Cid 2010-05-25 12:14:38 UTC
Can you share a document that gets fixed by this patch?
Comment 4 Chris Faulhaber 2010-05-25 12:28:01 UTC
Created attachment 35851
Document exhibiting the problem

Here's an excerpt from a user reporting this elsewhere.  Note the ll in fellows and walls.
Comment 5 Albert Astals Cid 2010-05-25 14:49:23 UTC
Unfortunately that code causes a regression: the "False" on top of "Figure 3-8.  Operation of the CMPPS (Imm8=1) Instruction" is extracted as "Fals".

So it fixes some documents but breaks others, and I cannot commit the patch unless it causes no regressions.
Comment 6 Svein Erik 2011-01-17 15:03:35 UTC
Created attachment 42139
Potential fix, revised
Comment 7 Svein Erik 2011-01-17 15:07:21 UTC
The proposed patch fixes at least the two previous test cases and some more. This bug is really annoying for e-book conversion, as some books end up with no 'll', only 'l '. Please test it and see if it can be put into the main source tree.
Comment 8 Albert Astals Cid 2011-01-21 12:37:22 UTC
In http://www.openraw.org/files/2006rawsurveyreport.pdf a "53%" gets erroneously converted to "53".

Can you try to fix this regression please?
Comment 9 Svein Erik 2011-02-08 13:35:56 UTC
Looking through http://www.openraw.org/files/2006rawsurveyreport.pdf, it became apparent that the current algorithm is badly flawed. It would be useful to test with real examples where faux bold or faux italic is used.

Comment 10 GitLab Migration User 2018-08-20 22:02:06 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/165.