Downstream report:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453

1) lsb_release -rd
Description: Ubuntu Vivid Vervet (development branch)
Release: 15.04

2) apt-cache policy evince
evince:
  Installed: 3.14.1-0ubuntu1
  Candidate: 3.14.1-0ubuntu1
  Version table:
 *** 3.14.1-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
        100 /var/lib/dpkg/status

3) What is expected to happen with the attached document is when one searches for:
über
it is found:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453/+attachment/102979/+files/example.pdf

4) What happens instead is it does not return any matches.

This is reproducible as far back as Ubuntu 7.04 with evince 0.8.1-0ubuntu1, with the older poppler version built for it, so probably not a regression.

WORKAROUND: Use the built-in PDF viewer+search with chromium-browser or chrome (doesn't work in Firefox).

apt-cache policy chromium-browser
chromium-browser:
  Installed: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Candidate: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Version table:
 *** 39.0.2171.65-0ubuntu0.14.04.1.1064 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/universe amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ trusty-security/universe amd64 Packages
        100 /var/lib/dpkg/status
     34.0.1847.116-0ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages

apt-cache policy google-chrome-stable:i386
google-chrome-stable:i386:
  Installed: 39.0.2171.95-1
  Candidate: 39.0.2171.95-1
  Version table:
 *** 39.0.2171.95-1 0
        500 http://dl.google.com/linux/chrome/deb/ stable/main i386 Packages
        100 /var/lib/dpkg/status
If you look at the copy and paste from Adobe Reader and Chrome, the word 'Über' is not actually in that document. The diaeresis is separate from the 'U'. We could make search looser by stripping out combining characters. It looks like that's what Adobe Reader and Chrome do. Is that the kind of thing that people would want? Being able to find 'uber' by searching for 'über' or 'ubér' or 'ũb̏ȇ̱r̽'? That might make some people upset. We already have bug #85702 requesting to make search stricter. Though personally I think a looser search makes sense.
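To make the "looser search" idea concrete, here is a minimal Python sketch of stripping combining characters before comparing. It works on plain strings only and is not poppler's code; the helper names `strip_combining` and `loose_find` are made up for illustration, and it assumes the marks appear in the text as Unicode combining characters.

```python
import unicodedata


def strip_combining(text: str) -> str:
    """Decompose with NFKD, then drop every combining mark (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for base characters, non-zero for marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_find(haystack: str, needle: str) -> bool:
    """Case-insensitive substring test that ignores combining marks."""
    return strip_combining(needle).casefold() in strip_combining(haystack).casefold()


if __name__ == "__main__":
    # Precomposed and decomposed spellings reduce to the same base letters.
    print(strip_combining("\u00fcber"))
    print(strip_combining("u\u0308ber"))
    print(loose_find("U\u0308ber uns", "\u00fcber"))
```

With this approach a query for 'uber', 'über' or 'ubér' matches the same text, which is exactly the trade-off raised above: it helps with documents like this one but makes it impossible to search for an accented form specifically.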
Jason Crain, thank you for your response. I appreciate keeping a lean code base to ease maintenance. However, I'm a strong proponent of meeting users' expectations of feature compatibility, to ease the transition for folks coming from an alternative operating system and/or PDF viewer. Hence, in this particular case, matching the search behaviour of Reader/Chrome makes sense.
Created attachment 112107
Remove combining characters from normalized text

This patch changes normalization so that combining characters are removed from the normalized text. This makes searching through TextPage::findText insensitive to these characters.

Also renames unicodeNormalizeNFKC to unicodeNormalizeSearch to make it clear it's no longer doing a regular NFKC normalization. Renames decomp_compat to decomp_compat_base because it now strips combining characters, leaving only base characters, in addition to compatibility decomposition.

Removes UnicodeCompTables.h and some compose functions. They're no longer needed since we're not recomposing the characters.

I'm not sure if UnicodeTypeTable.h and UnicodeCompTables.h are considered part of the public interface. They're included in the xpdf headers. Albert, is it OK to change these files in this way?
I'm not sure that removing this functionality is a good idea. Can't we just add an option to findText to enable a looser search and leave it to the front ends to decide if/how to expose this option?
I'm with Adrian, I don't think changing this at such a low level is a good idea.
I suppose if I add an option to findText, I should also add a flag (POPPLER_FIND_IGNORE_COMBINING?) to PopplerFindFlags, for the glib front end's poppler_page_find_text_with_options(). It would be nice if someone could confirm that evince would actually use this option.
(In reply to Jason Crain from comment #6)
> I suppose if I add an option to findText, I should also add a flag
> (POPPLER_FIND_IGNORE_COMBINING?) to PopplerFindFlags, for the glib front
> end's poppler_page_find_text_with_options(). It would be nice if someone
> could confirm that evince would actually use this option.

I don't see a reason why someone might want to search for ü and not find a word containing ü. So, if there are two methods in poppler core, I would change the glib bindings to use the one correctly finding combining characters.
Created attachment 113036
[draft] combine characters

I might be able to fix this in a better way by combining letters with nearby diacritic marks so that this document *would* contain ü. It seems to be a nice improvement for some latex documents. Attached patch can give you a rough idea of what I mean. It still needs a lot of work though.
I certainly remember we already did that combination somewhere, either in Okular or in poppler, but I can't find it, and of course the document does not work, so it may be a false memory :D

I think this may make sense, though then again preserving the old behaviour via a flag (even if not default) in the TextOutputDev may make sense if someone (not sure who, though) depends on it.
Created attachment 114485
Combine base characters and diacritical marks

My attempt to improve this. When you make a diacriticized character with LaTeX, ü for example, it will make a PDF with separate u and ¨ characters and draw them over each other. This patch detects when this happens and converts it to a combining character sequence so that pdftotext and the search function will see a ü and not separate characters. Also refactors some code (TextWord::ensureCapacity and TextWord::setInitialBounds) to avoid duplication.

Limitations:

It doesn't handle some of LaTeX's diacritic commands, such as \b for bar under letter or \d for dot under letter, because they are positioned differently and \d would be easy to confuse with a period. They don't seem to be used very often though.

If the base character is unusual, such as a math symbol or number, adding a combining character can make the result of pdftotext look a bit odd. I think this is because if the font or rendering engine doesn't know how to draw the character sequence, it will place the diacritic in a strange position, such as to the right of the letter. In these cases, the output of pdftotext is technically correct, it just looks odd when drawn on screen.

When selecting text in evince, you can separately select the character and diacritic. If that's a problem, I think I could fix it by adding clustering support so that a group of glyphs and characters is treated as a single unit. It would make this a much more invasive change, but maybe I should try it anyway. It would be nice to also fix the assumption that one glyph always maps to exactly one character.
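As a rough illustration of the approach described in this comment, the Python sketch below merges a spacing diacritic with a horizontally overlapping base letter and then composes the pair. It is not the patch itself: the glyph tuples, the `SPACING_TO_COMBINING` table and the function names are assumptions made for the example, and only horizontal overlap is checked, which is a deliberate simplification.

```python
import unicodedata
from typing import List, Optional, Tuple

Glyph = Tuple[str, float, float]  # (character, x_min, x_max), hypothetical layout data

# Illustrative subset: spacing diacritic -> combining equivalent.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u0060": "\u0300",  # GRAVE ACCENT -> COMBINING GRAVE ACCENT
}


def overlaps(a: Glyph, b: Glyph) -> bool:
    """True when the horizontal extents of two glyphs intersect."""
    return a[1] < b[2] and b[1] < a[2]


def combine_diacritics(glyphs: List[Glyph]) -> str:
    """Attach overlapping spacing diacritics to their base letter, then compose (NFC)."""
    out: List[str] = []
    prev: Optional[Glyph] = None  # base glyph behind out[-1], if any
    i = 0
    while i < len(glyphs):
        g = glyphs[i]
        mark = SPACING_TO_COMBINING.get(g[0])
        nxt = glyphs[i + 1] if i + 1 < len(glyphs) else None
        if mark and nxt and nxt[0] not in SPACING_TO_COMBINING and overlaps(g, nxt):
            # Diacritic drawn first, base letter drawn over it.
            out.append(nxt[0] + mark)
            prev = nxt
            i += 2
        elif mark and prev and overlaps(g, prev):
            # Base letter drawn first, diacritic drawn over it.
            out[-1] += mark
            i += 1
        else:
            out.append(g[0])
            prev = g if mark is None else None
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    word = [("\u00a8", 1.0, 4.0), ("u", 0.0, 5.0), ("b", 5.0, 10.0),
            ("e", 10.0, 15.0), ("r", 15.0, 19.0)]
    print(combine_diacritics(word))
```

A diacritic that does not overlap any letter is left as a separate character, so ordinary uses of ¨ or ` in running text are not touched in this sketch.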
I think it looks good as it is. If no one disagrees I'll commit in a week.
Pushed.
Hi Jason, thank you very much for the patch. By the way, today I was reading this PDF:

http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/lecture_notes/GTK_textview.pdf

and noticed that a lot of words with a double f, like 'buffer', are not found[1] when searching for them. Also, when copied to gedit, the 'ff' in the word shows up as the Unicode "not found" glyph.

So, is your patch covering this double f case? If so, please ignore this comment, but from a quick read of this bug I thought the double f case was not handled, as it isn't an accented word or diacritic.

Thank you.

[1] Some 'buffer' words are found, the ones in a code block, but the ones in the normal text are not. E.g. the 5th paragraph of the fourth page, which starts with "Locations within a text buffer are represented..."
(In reply to Nelson Benitez from comment #13)
> Hi Jason, thank you very much for the patch, btw, today I was reading this
> pdf:
>
> http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/lecture_notes/GTK_textview.pdf
>
> and noticed that lot of words with double f, like 'buffer', are not
> found[1] when searching for it, also when copied to gedit it shows the
> unicode not found glyph inplace of the 'ff' in the word.
>
> So, is your patch covering this double f case?

No, it does not fix that. That file has a different problem and I don't see a way of fixing it. The PDF creator would need to add some extra information before we could guess that character code 27 should be a double f.
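A toy Python model of why the viewer cannot recover the 'ff' on its own: text extraction needs some mapping from font character codes to Unicode, and when the entry for a code is missing there is nothing to guess from. The dictionaries and the `decode_codes` helper are hypothetical and only stand in for that "extra information"; they do not reflect poppler's internals or the actual contents of that file.

```python
import unicodedata
from typing import Dict, List


def decode_codes(codes: List[int], to_unicode: Dict[int, str]) -> str:
    """Map font character codes to text; unmapped codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    # Assumed encoding: ordinary letters use their ASCII codes, 27 is the ff ligature glyph.
    codes = [ord("b"), ord("u"), 27, ord("e"), ord("r")]

    without_mapping = {ord(c): c for c in "buer"}
    with_mapping = dict(without_mapping)
    with_mapping[27] = "\ufb00"  # LATIN SMALL LIGATURE FF

    print(decode_codes(codes, without_mapping))  # ligature is lost
    # Once the code is mapped, compatibility normalization yields plain "ff".
    print(unicodedata.normalize("NFKC", decode_codes(codes, with_mapping)))
```

In the second case the search string 'buffer' matches after NFKC normalization; in the first case no amount of normalization helps, because code 27 never became a ligature character in the first place.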
*** Bug 66569 has been marked as a duplicate of this bug. ***
(In reply to Jason Crain from comment #14)
> (In reply to Nelson Benitez from comment #13)
> > Hi Jason, thank you very much for the patch, btw, today I was reading this
> > pdf:
> >
> > http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/lecture_notes/GTK_textview.pdf
> >
> > and noticed that lot of words with double f, like 'buffer', are not
> > found[1] when searching for it, also when copied to gedit it shows the
> > unicode not found glyph inplace of the 'ff' in the word.
> >
> > So, is your patch covering this double f case?
>
> No, it does not fix that. That file has a different problem and I don't see
> a way of fixing it. The PDF creator would need to add some extra
> information before we could guess that character code 27 should be a double
> f.

Thanks Jason for the explanation, indeed it was a problem in the PDF creator. Just for completeness, I'm posting a link describing the problem and solution in pdfTeX:

http://tex.stackexchange.com/questions/31113/enable-searching-in-a-pdflatex-generated-document

Regards,