Problem: In some PDF documents, two adjacent l characters ("ll") overlap slightly, and pdftohtml drops the second l. E.g., called becomes cal ed, all becomes al , and eventually becomes eventual y.
Reason: In HtmlOutputDev.cc, class HtmlPage, method coalesce, there's a section of code that discards duplicate text for "fake boldface, drop shadows." The lls trigger this duplicate check, so the second l is removed from the output.
The debug output shows:
x=139.68000..143.016000 y=626.076000..641.844000 size=15 'l'
x=142.80000..146.136000 y=626.076000..641.844000 size=15 'l'
Due to my inexperience with the project, I can't say what the best solution is. Here are a few options I've considered. If you'd like to suggest a preferred method for solving this problem, I will implement it and submit a patch. However, I have no test documents that contain actual duplicate text.
Solution 1: Decrease the fudge factor from 0.2 to 0.1. This may not be reliable, and it could let the duplicates this code was originally meant to discard resurface. It will, however, let the lls through in my test documents.
Solution 2: Make the duplicate check a command-line option. Documents that have both lls and duplicate text will still exhibit errors, though.
Solution 3: Use a different algorithm for determining duplicate text. Perhaps the dupe check shouldn't drop characters that start more than halfway across the bounding box of the previous character. In this example, the first character spans 139.68 to 143.016, so its halfway point is 141.348, and the second character starts at 142.8, which is beyond that. It seems unlikely that a boldface or drop-shadow copy would start so far past the starting point of its host character.
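To make Solution 3 concrete, here is a minimal sketch of the halfway test as a standalone predicate. It is not the code in HtmlOutputDev.cc: the `GlyphBox` struct and the functions `horizontalMidpoint` and `mayBeDuplicate` are hypothetical names invented for illustration, and in practice a check like this would presumably sit alongside the existing tolerance comparisons rather than replace them.

```cpp
#include <string>

// Hypothetical stand-in for one extracted character and its bounding box.
// This is NOT poppler's actual string type; it only mirrors the fields
// shown in the debug output above.
struct GlyphBox {
    double xMin;
    double xMax;
    double yMin;
    double yMax;
    std::string text;
};

// Horizontal midpoint of a character's bounding box.
double horizontalMidpoint(const GlyphBox &g) {
    return (g.xMin + g.xMax) / 2.0;
}

// Solution 3 as a predicate (assumed form): `cur` is only considered a
// possible duplicate of `prev` if it carries the same text and starts
// no later than the midpoint of prev's bounding box.
bool mayBeDuplicate(const GlyphBox &prev, const GlyphBox &cur) {
    if (cur.text != prev.text) {
        return false;
    }
    return cur.xMin <= horizontalMidpoint(prev);
}
```

With the two characters from the debug output, the midpoint of the first l is 141.348 and the second l starts at 142.8, so `mayBeDuplicate` returns false and both characters would be kept.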
We really don't have an idea, so propose your patch and I'll run it against the test suite to see whether it causes any regressions. If it does not, I'll commit the patch.
Created attachment 35556
I've gone with solution #3. Tested against a few PDFs and it seems to work OK, but I don't have any that have true duplicates.
Can you share a document that gets fixed by this patch?
Created attachment 35851
Document exhibiting the problem
Here's an excerpt from a user reporting this elsewhere. Note the ll in fellows and walls.
Unfortunately, that code causes a regression in ftp://184.108.40.206/oldlinux/Linux.old/study/Ref-docs/manual%20Intel386/24319101.pdf: the "False" above "Figure 3-8. Operation of the CMPPS (Imm8=1) Instruction" is extracted as "Fals".
So it fixes some documents but breaks others, and I cannot commit the patch unless it causes no regressions.
Created attachment 42139
Potential fix, revised
The proposed patch fixes at least the two previous test cases, and some more. This bug is really annoying for e-book conversion, as some books end up with no 'll', only 'l '. Please test it and see whether it can go into the main source tree.
In http://www.openraw.org/files/2006rawsurveyreport.pdf a "53%" gets erroneously converted to "53".
Can you try to fix this regression please?
Looking through http://www.openraw.org/files/2006rawsurveyreport.pdf, it became apparent that the current algorithm is badly flawed. It would be useful to test using a real example where faux bold or faux italics are used.
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/165.