Created attachment 68923 [details] PDF file exhibiting this behavior

pdftohtml reports incorrect text positions for certain PDF documents. Attached is a document in which this happens. The following reasoning explains the problem.

A rendered image of the first page measures 1024x1408. The text "Brief article", which is highlighted in the screenshot, should be positioned 19% from the top, as seen here: http://imageshack.us/a/img526/6343/textshiftedpdf1.png

Poppler outputs this text with the following data when using pdftohtml -xml:

<text top="409" left="447" width="80" height="15" font="0">Brief article</text>

The dimensions of this page according to poppler, taken from the same XML file:

<page number="1" position="absolute" top="0" left="0" height="1488" width="1063">

According to poppler, the text would therefore be positioned at 409/1488 = 0.27 = 27% from the top, which is clearly wrong.

No warnings or errors were reported when converting this document.
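The arithmetic above can be reproduced with a minimal Python sketch. It assumes the two XML snippets quoted in the report; the closing </page> tag and the helper name relative_top are added here for illustration only and are not part of the original output or of any poppler API.

```python
import xml.etree.ElementTree as ET

# Snippets copied from the pdftohtml -xml output quoted in the report.
# Assumption: the <text> element is nested in the <page> element, and the
# closing </page> tag is added here so the fragment parses on its own.
PAGE_XML = (
    '<page number="1" position="absolute" top="0" left="0" '
    'height="1488" width="1063">'
    '<text top="409" left="447" width="80" height="15" font="0">'
    'Brief article</text>'
    '</page>'
)


def relative_top(page_xml: str, needle: str) -> float:
    """Return the 'top' of the first matching text element as a fraction of page height."""
    page = ET.fromstring(page_xml)
    page_height = float(page.get("height"))
    for text in page.iter("text"):
        # itertext() also collects text inside nested tags such as <b> or <i>.
        if "".join(text.itertext()).strip() == needle:
            return float(text.get("top")) / page_height
    raise ValueError(f"text {needle!r} not found on page")


ratio = relative_top(PAGE_XML, "Brief article")
print(f"reported position: {ratio:.0%} from the top")
```

Comparing this fraction with the 19% measured on the rendered image is a unit-independent check, since both values are ratios of the respective page height.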
I'm not an expert on text positions, but if you're right and the positions are wrong, I wonder why so many tools based on poppler, e.g. Okular, are able to highlight the text correctly when searching.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/342.