Running pdftohtml on the PDF from bug #101770 (https://bugs.freedesktop.org/attachment.cgi?id=132659) produces duplicated and jumbled characters. Some PDFs draw the same text several times to emulate bold text or drop shadows. The main TextOutputDev goes to a lot of trouble to remove this duplicated text; pdftohtml should do the same.
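To illustrate the kind of filtering being asked for, here is a minimal sketch in Python. It is not poppler's code and does not reproduce the exact rules TextOutputDev applies. It assumes that a glyph counts as a duplicate when the same character has already been drawn within a small fraction of the font size of the same position. The names Glyph, drop_overdrawn_glyphs, fake_bold, and the tolerance value are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One drawn character with its origin, in page units (hypothetical type)."""
    char: str
    x: float
    y: float
    font_size: float


def drop_overdrawn_glyphs(glyphs, tolerance=0.2):
    """Return glyphs in drawing order, skipping near-coincident repeats.

    A glyph is treated as a duplicate when an already-kept glyph has the
    same character and lies within tolerance * font_size on both axes.
    The tolerance is an assumed value, not one taken from poppler.
    """
    kept = []
    for g in glyphs:
        limit = tolerance * g.font_size
        is_duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) <= limit
            and abs(k.y - g.y) <= limit
            for k in kept
        )
        if not is_duplicate:
            kept.append(g)
    return kept


def fake_bold(text, x0=0.0, y=100.0, advance=6.0, font_size=10.0, offset=0.3):
    """Simulate a PDF that draws a string twice, slightly shifted, to look bold."""
    glyphs = []
    for dx in (0.0, offset):
        for i, ch in enumerate(text):
            glyphs.append(Glyph(ch, x0 + i * advance + dx, y, font_size))
    return glyphs


drawn = fake_bold("LUNATA")
print("".join(g.char for g in drawn))                         # LUNATALUNATA
print("".join(g.char for g in drop_overdrawn_glyphs(drawn)))  # LUNATA
```

The linear scan over kept glyphs keeps the sketch short; a real implementation would more likely compare only against nearby glyphs on the same line.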
Created attachment 132719: pdf-example.html (from pdftohtml -s -noframes)

I've attached the HTML file produced by running "pdftohtml -s -noframes pdf-example.pdf". I haven't attached the images, but this should be enough to get the idea.

The HTML has several places where lines are duplicated. Example:

    <p style="..." class="ft10">1 </p>
    <p style="..." class="ft11"> </p>
    <p style="..." class="ft12">UPUTSTVO ZA PACIJENTA</p>
    <p style="..." class="ft12">UPUTSTVO ZA PACIJENT</p>
    <p style="..." class="ft12">UPUTSTVO ZA PACIJEN</p>
    <p style="..." class="ft12">UPUTSTVO ZA PACIJE</p>
    <p style="..." class="ft12"> </p>
    <p style="..." class="ft12"> LUNATA</p>
    <p style="..." class="ft12">LUNAT</p>
    <p style="..." class="ft12">LUNA</p>
    <p style="..." class="ft12">LUN</p>

I've elided the style attribute above to keep the lines a reasonable length.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/321.