Created attachment 27184 [details] Sample PDF file
Binary package hint: evince
When I open a PDF file containing CJK characters in Evince and select some of those characters with the cursor, most of them disappear, and some are rendered as different characters or as ASCII characters. Using Ubuntu 8.10 with Evince 2.24.1.
ProblemType: Bug
Architecture: i386
DistroRelease: Ubuntu 8.10
ExecutablePath: /usr/bin/evince
Package: evince 2.24.1-0ubuntu1
ProcEnviron:
 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: evince
Uname: Linux 2.6.27-8-generic i686
Created attachment 35064 [details] [review] proposed patch for selection
Hi, there are two problems here.
The first one is that poppler doesn't create a new word (TextWord in the selection code) when two consecutive chars use different fonts, even though it does so when they use different font sizes.
The second one is that poppler checks the 'y' coordinate when looking for the beginning and end of the selection in a line (TextLine). It shouldn't do that. It is searching for a word within a line, so it should perform a one-dimensional search in 'x' only.
The attached patch fixes these two problems.
Regards
Marek
P.S.: the attached patch fixes bugs #6923, #9608, #9672, #13441 and #16305 for me
Created attachment 35066 [details] another example file
This is another example of the issue. The second "Hello" shows the problem when there are 2 different fonts in one word. The third "Hello" shows the problem of comparing 'y' coordinates. The horizontal part of the bounding box of the "ll" is smaller than the bounding box of the neighbouring characters, so the "ll" is not included in the string for painting when a user doesn't "hit" it during selection. Btw, I don't think that this is specific to the Cairo backend.
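To illustrate the 'y' comparison problem described in the last two comments, here is a minimal sketch in Python. It is not poppler's code: the `Piece` class, the `select` function and all coordinates are hypothetical, and the split of "Hello" into "He", "ll" and "o" is only an assumption made so that the middle piece can have a smaller bounding box than its neighbours.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """A hypothetical run of characters with its bounding box."""
    text: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def overlaps(a_min, a_max, b_min, b_max):
    """Return True when the closed intervals [a_min, a_max] and [b_min, b_max] intersect."""
    return a_min <= b_max and b_min <= a_max


def select(pieces, x0, y0, x1, y1, check_y):
    """Collect the text of the pieces of one line hit by a drag from (x0, y0) to (x1, y1).

    With check_y=True the piece must be hit in both dimensions.
    With check_y=False only the 'x' range is searched, as the comment above proposes.
    """
    x_lo, x_hi = sorted((x0, x1))
    y_lo, y_hi = sorted((y0, y1))
    selected = []
    for p in pieces:
        if not overlaps(p.x_min, p.x_max, x_lo, x_hi):
            continue
        if check_y and not overlaps(p.y_min, p.y_max, y_lo, y_hi):
            continue
        selected.append(p.text)
    return "".join(selected)


# Made-up geometry: "ll" covers a smaller vertical range than "He" and "o".
line = [
    Piece("He", 0, 20, 10, 20),
    Piece("ll", 20, 30, 12, 18),
    Piece("o", 30, 40, 10, 20),
]

# The drag runs along y = 11, which lies inside "He" and "o" but outside "ll".
print(select(line, 0, 11, 40, 11, check_y=True))
print(select(line, 0, 11, 40, 11, check_y=False))
```

With these made-up numbers, the two-dimensional check drops the "ll" piece, while the search in 'x' only keeps the whole word.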
Your patch breaks pdftotext for lots of PDFs that contain a word like iPhone where the "i" is written in italics. With your patch the output is "i Phone" instead of "iPhone", because the two parts use different fonts.
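The trade-off can be sketched in a few lines of Python. This is a simplified model and not poppler's implementation: `Char`, `split_words` and the font names are hypothetical, and joining words with a single space is an assumption standing in for whatever pdftotext really does between words.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """A hypothetical character with the font attributes used for word splitting."""
    text: str
    font: str
    size: float


def split_words(chars, split_on_font):
    """Group consecutive chars into words.

    A change of font size always starts a new word. A change of font starts
    a new word only when split_on_font is True, which models the proposed patch.
    """
    words = []
    current = []
    for c in chars:
        if current:
            prev = current[-1]
            size_changed = c.size != prev.size
            font_changed = c.font != prev.font
            if size_changed or (split_on_font and font_changed):
                words.append(current)
                current = []
        current.append(c)
    if current:
        words.append(current)
    return ["".join(ch.text for ch in w) for w in words]


# "iPhone" with an italic "i": same size, different font (font names are made up).
iphone = [Char("i", "Italic", 10.0)] + [Char(ch, "Regular", 10.0) for ch in "Phone"]

print(" ".join(split_words(iphone, split_on_font=False)))
print(" ".join(split_words(iphone, split_on_font=True)))
```

In this model, splitting on every font change helps selection of mixed-font words but changes the extracted text, which is the regression reported here.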
(In reply to comment #3) > Your patch breaks lots of PDF in pdftotext where you have a word like iPhone > where the "i" is written in italics, your patch will output "i Phone" instead > of "iPhone" because they are using different fonts. I thought so too but I didn't observe it. Could you point me to a pdf against which I can review my modifications of the patch? Marek
Created attachment 35196 [details] A PDF suffering from the regression
The old pdftotext outputs "This demo example uses a channel model described in [4]", whereas the version patched with your code outputs "This demo example uses a channel model described in [4 ]". If you read the file visually, the first version is the correct one.
I'm closing this since it seems to be already fixed (by http://cgit.freedesktop.org/poppler/poppler/commit/?id=f3a1b765bd6a58d327a80feedbe30e1c0792076e and http://cgit.freedesktop.org/poppler/poppler/commit/?id=5056e33e01ce0f7db1a5401b7b38d30e84eedf69).