Bug 87215 - evince can not find ü in attached PDF
Summary: evince can not find ü in attached PDF
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: All Linux (All)
Importance: lowest trivial
Assignee: poppler-bugs
QA Contact:
URL: https://bugs.launchpad.net/ubuntu/+so...
Whiteboard:
Keywords:
Duplicates: 66569
Depends on:
Blocks:
 
Reported: 2014-12-11 05:44 UTC by Christopher M. Penalver
Modified: 2015-09-03 17:50 UTC

Attachments
Remove combining characters from normalized text (67.43 KB, patch)
2015-01-12 03:47 UTC, Jason Crain
[draft] combine characters (5.34 KB, patch)
2015-02-02 07:14 UTC, Jason Crain
Combine base characters and diacritical marks (18.88 KB, patch)
2015-03-20 04:30 UTC, Jason Crain

Description Christopher M. Penalver 2014-12-11 05:44:25 UTC
Downstream report:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453

1) lsb_release -rd
Description:	Ubuntu Vivid Vervet (development branch)
Release:	15.04

2) apt-cache policy evince
evince:
  Installed: 3.14.1-0ubuntu1
  Candidate: 3.14.1-0ubuntu1
  Version table:
 *** 3.14.1-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
        100 /var/lib/dpkg/status

3) What is expected to happen with the attached document is when one searches for:
über

it is found:
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453/+attachment/102979/+files/example.pdf

4) What happens instead is it does not return any matches. This is reproducible as far back as Ubuntu 7.04 with evince 0.8.1-0ubuntu1, with the older poppler version built for it, so probably not a regression.

WORKAROUND: Use the built-in PDF viewer+search with chromium-browser or chrome (doesn't work in Firefox).

apt-cache policy chromium-browser
chromium-browser:
  Installed: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Candidate: 39.0.2171.65-0ubuntu0.14.04.1.1064
  Version table:
 *** 39.0.2171.65-0ubuntu0.14.04.1.1064 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/universe amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ trusty-security/universe amd64 Packages
        100 /var/lib/dpkg/status
     34.0.1847.116-0ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages

apt-cache policy google-chrome-stable:i386
google-chrome-stable:i386:
  Installed: 39.0.2171.95-1
  Candidate: 39.0.2171.95-1
  Version table:
 *** 39.0.2171.95-1 0
        500 http://dl.google.com/linux/chrome/deb/ stable/main i386 Packages
        100 /var/lib/dpkg/status
Comment 1 Jason Crain 2014-12-15 23:10:02 UTC
If you look at the text copied and pasted from Adobe Reader and Chrome, the word 'Über' is not actually in that document.  The diaeresis is separate from the 'U'.  We could make search looser by stripping out combining characters.  It looks like that's what Adobe Reader and Chrome do.  Is that the kind of thing that people would want?  Being able to find 'uber' by searching for 'über' or 'ubér' or 'ũb̏ȇ̱r̽'?

That might make some people upset.  We already have bug 85702 requesting that search be made stricter.  Personally, though, I think a looser search makes sense.
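The looser search described above can be sketched in a few lines. This is only an illustration in Python using the standard `unicodedata` module, not poppler's code; the helper names `fold_marks` and `loose_find` are made up for this example.

```python
import unicodedata


def fold_marks(text: str) -> str:
    """Decompose text (NFKD) and drop characters with a nonzero combining class."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_find(haystack: str, needle: str) -> bool:
    """Case-insensitive substring test that ignores combining marks on both sides."""
    return fold_marks(needle).casefold() in fold_marks(haystack).casefold()


if __name__ == "__main__":
    for query in ("über", "ubér", "uber"):
        print(query, loose_find("Uber", query))
```

With this folding, every query in the loop matches 'Uber', which is exactly the behaviour that a request for stricter search would object to.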
Comment 2 Christopher M. Penalver 2014-12-16 02:14:04 UTC
Jason Crain, thank you for your response.

I appreciate keeping a lean code base to ease maintenance. However, I'm a strong proponent of meeting users' expectations of functional compatibility, to ease the transition for people coming from another operating system and/or PDF viewer.

Hence, in this particular case, matching the search behavior of Reader and Chrome makes sense.
Comment 3 Jason Crain 2015-01-12 03:47:02 UTC
Created attachment 112107
Remove combining characters from normalized text

This patch changes normalization so that combining characters are removed from the normalized text.  This makes searching through TextPage::findText insensitive to these characters.

Also, renames unicodeNormalizeNFKC to unicodeNormalizeSearch to make it clear it's no longer doing a regular NFKC normalization.  

Renames decomp_compat to decomp_compat_base because it now strips combining characters, leaving only base characters, in addition to compatibility decomposition.

Removes UnicodeCompTables.h and some compose functions.  They're no longer needed since we're not recomposing the characters.

I'm not sure if UnicodeTypeTable.h and UnicodeCompTables.h are considered part of the public interface.  They're included in the xpdf headers.  Albert, is it OK to change these files in this way?
Comment 4 Adrian Johnson 2015-01-12 11:25:23 UTC
I'm not sure that removing this functionality is a good idea. Can't we just add an option to findText to enable a looser search and leave it to the front ends to decide if/how to expose this option?
Comment 5 Albert Astals Cid 2015-01-12 19:04:48 UTC
I'm with Adrian; I don't think changing this at such a low level is a good idea.
Comment 6 Jason Crain 2015-01-21 06:39:53 UTC
I suppose if I add an option to findText, I should also add a flag (POPPLER_FIND_IGNORE_COMBINING?) to PopplerFindFlags, for the glib front end's poppler_page_find_text_with_options().  It would be nice if someone could confirm that evince would actually use this option.
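To make the opt-in idea concrete, here is a rough Python sketch of a search function that takes flags. It is not poppler's API: `FindFlags`, its member values, and `find_text` are hypothetical names for illustration, and the strict path assumes plain NFKC normalization.

```python
import enum
import unicodedata


class FindFlags(enum.Flag):
    # Hypothetical flags for this sketch; not the real PopplerFindFlags values.
    DEFAULT = 0
    CASE_SENSITIVE = 1
    IGNORE_COMBINING = 2


def _normalize(text: str, flags: FindFlags) -> str:
    if FindFlags.IGNORE_COMBINING in flags:
        # Loose mode: decompose, then drop combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    else:
        # Strict mode: regular NFKC, so marks still have to match.
        text = unicodedata.normalize("NFKC", text)
    if FindFlags.CASE_SENSITIVE not in flags:
        text = text.casefold()
    return text


def find_text(page_text: str, needle: str, flags: FindFlags = FindFlags.DEFAULT) -> bool:
    """Return True if needle occurs in page_text under the given flags."""
    return _normalize(needle, flags) in _normalize(page_text, flags)
```

A front end would then choose the behaviour per call, for example `find_text(text, "uber", FindFlags.IGNORE_COMBINING)`, while the default stays strict.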
Comment 7 Carlos Garcia Campos 2015-01-21 16:38:41 UTC
(In reply to Jason Crain from comment #6)
> I suppose if I add an option to findText, I should also add a flag
> (POPPLER_FIND_IGNORE_COMBINING?) to PopplerFindFlags, for the glib front
> end's poppler_page_find_text_with_options().  It would be nice if someone
> could confirm that evince would actually use this option.

I don't see a reason why someone might want to search for ü and not find a word containing ü. So, if there are two methods in poppler core, I would change the glib bindings to use the one correctly finding combining characters.
Comment 8 Jason Crain 2015-02-02 07:14:14 UTC
Created attachment 113036
[draft] combine characters

I might be able to fix this in a better way by combining letters with nearby diacritic marks so that this document *would* contain ü.  It seems to be a nice improvement for some latex documents.  Attached patch can give you a rough idea of what I mean.  It still needs a lot of work though.
Comment 9 Albert Astals Cid 2015-02-02 23:36:56 UTC
I certainly remember we already did that combination somewhere, either in okular or in poppler, but I can't find it and of course the document does not work, so it may be a false memory :D

I think this may make sense, though then again preserving the old behaviour via a flag (even if not default) in the TextOutputDev may make sense if someone (not sure who though) would be depending on it.
Comment 10 Jason Crain 2015-03-20 04:30:08 UTC
Created attachment 114485
Combine base characters and diacritical marks

My attempt to improve this.

When you make a character with a diacritic in LaTeX, ü for example, it will make a PDF with separate u and ¨ characters and draw them over each other.  This patch detects when this happens and converts it to a combining character sequence so that pdftotext and the search function will see a ü and not separate characters.  Also refactors some code (TextWord::ensureCapacity and TextWord::setInitialBounds) to avoid duplication.
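The general idea can be sketched as follows. This is a simplified Python illustration, not the patch itself: the `Glyph` class, the horizontal-overlap heuristic, the 0.5 threshold, and the small mapping table are all assumptions made for the example, and the final NFC step is a choice of this sketch.

```python
import unicodedata
from dataclasses import dataclass

# Spacing diacritics and their combining equivalents (small illustrative subset).
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u0060": "\u0300",  # GRAVE ACCENT -> COMBINING GRAVE ACCENT
}


@dataclass
class Glyph:
    char: str
    x_min: float
    x_max: float


def overlaps(a: Glyph, b: Glyph, min_fraction: float = 0.5) -> bool:
    """True if the horizontal overlap covers min_fraction of the narrower glyph."""
    overlap = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    narrower = min(a.x_max - a.x_min, b.x_max - b.x_min)
    return narrower > 0 and overlap >= min_fraction * narrower


def combine_glyphs(glyphs: list[Glyph]) -> str:
    """Merge an overlapping base glyph and spacing diacritic into one sequence."""
    pieces = []
    i = 0
    while i < len(glyphs):
        cur = glyphs[i]
        nxt = glyphs[i + 1] if i + 1 < len(glyphs) else None
        if nxt is not None and overlaps(cur, nxt):
            cur_mark = cur.char in SPACING_TO_COMBINING
            nxt_mark = nxt.char in SPACING_TO_COMBINING
            if cur_mark != nxt_mark:  # exactly one of the pair is a diacritic
                base, mark = (nxt, cur) if cur_mark else (cur, nxt)
                pieces.append(base.char + SPACING_TO_COMBINING[mark.char])
                i += 2
                continue
        pieces.append(cur.char)
        i += 1
    # Compose base + combining mark where a precomposed form exists.
    return unicodedata.normalize("NFC", "".join(pieces))
```

Given a ¨ glyph drawn over a U followed by b, e, r, `combine_glyphs` yields 'Über'; a ¨ that does not overlap its neighbours is left as a separate character.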

Limitations:

It doesn't handle some of LaTeX's diacritic commands, such as \b for bar under letter or \d for dot under letter, because they are positioned differently and \d would be easy to confuse with a period.  They don't seem to be used very often though.

If the base character is unusual, such as a math symbol or number, adding a combining character can make the result of pdftotext look a bit odd.  I think this is because if the font or rendering engine doesn't know how to draw the character sequence, it will place the diacritic in a strange position, such as to the right of the letter.  In these cases, the output of pdftotext is technically correct; it just looks odd when drawn on screen.

When selecting text in evince, you can separately select the character and diacritic.  If that's a problem, I think I could fix it by adding clustering support so that a group of glyphs and characters is treated as a single unit.  It would make this a much more invasive change, but maybe I should try it anyway.  It would be nice to also fix the assumption that one glyph always matches exactly one character.
Comment 11 Albert Astals Cid 2015-03-27 23:06:53 UTC
I think it looks good as it is.

If no one disagrees, I'll commit in a week.
Comment 12 Albert Astals Cid 2015-04-04 16:41:27 UTC
Pushed.
Comment 13 Nelson Benitez 2015-04-11 18:08:18 UTC
Hi Jason, thank you very much for the patch. By the way, today I was reading this PDF:

http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/lecture_notes/GTK_textview.pdf

and noticed that a lot of words with a double f, like 'buffer', are not found[1] when searching for them. Also, when copied to gedit, the text shows the Unicode not-found glyph in place of the 'ff' in the word.

So, is your patch covering this double f case? 

If so, please ignore this comment, but from a quick read of this bug I thought the double f case was not handled, as it isn't an accented word or diacritic.

Thank you.


[1] Some 'buffer' words are found, the ones in a code block, but the ones in the normal text are not. Eg. the 5th paragraph of the fourth page, that starts with "Locations within a text buffer are represented..."
Comment 14 Jason Crain 2015-04-13 00:24:45 UTC
(In reply to Nelson Benitez from comment #13)
> Hi Jason, thank you very much for the patch, btw, today I was reading this
> pdf:
> 
> http://www.compsci.hunter.cuny.edu/~sweiss/course_materials/csci493.70/
> lecture_notes/GTK_textview.pdf
> 
> and noticed that lot of words with double f,  like 'buffer', are not
> found[1] when searching for it, also when copied to gedit it shows the
> unicode not found glyph inplace of the 'ff' in the word.
> 
> So, is your patch covering this double f case? 

No, it does not fix that.  That file has a different problem and I don't see a way of fixing it.  The PDF creator would need to add some extra information before we could guess that character code 27 should be a double f.
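A toy Python model of why the mapping information matters is shown below. It is not how poppler decodes fonts: the `decode` helper and the code tables are invented for illustration, and only the code 27 for the double f comes from this thread.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # what a viewer typically shows for an unknown character


def decode(codes, to_unicode):
    """Map font character codes to text; unmapped codes become U+FFFD."""
    text = "".join(to_unicode.get(code, REPLACEMENT) for code in codes)
    # NFKC expands compatibility ligatures such as U+FB00 into plain letters.
    return unicodedata.normalize("NFKC", text)


# b, u, <ff ligature glyph>, e, r as font-specific codes (made-up encoding).
codes = [98, 117, 27, 101, 114]

without_mapping = {c: chr(c) for c in (98, 117, 101, 114)}
with_mapping = {**without_mapping, 27: "\ufb00"}  # LATIN SMALL LIGATURE FF

if __name__ == "__main__":
    print("buffer" in decode(codes, without_mapping))
    print("buffer" in decode(codes, with_mapping))
```

Without an entry for code 27 the extracted word contains a replacement character and cannot match 'buffer'; once the producer supplies the mapping, ordinary normalization is enough.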
Comment 15 Jason Crain 2015-04-17 04:32:37 UTC
*** Bug 66569 has been marked as a duplicate of this bug. ***
Comment 16 Nelson Benitez 2015-09-03 17:50:09 UTC
(In reply to Jason Crain from comment #14)
> No, it does not fix that.  That file has a different problem and I don't see
> a way of fixing it.  The PDF creator would need to add some extra
> information before we could guess that character code 27 should be a double
> f.

Thanks, Jason, for the explanation; indeed it was a problem in the PDF creator. Just for completeness, I'm posting a link describing the problem and its solution in pdfTeX:

http://tex.stackexchange.com/questions/31113/enable-searching-in-a-pdflatex-generated-document 

Regards,

