Bug 54268

Summary: problem copy/pasting CID? / Identity-H? text
Product: poppler Reporter: Frederic Peters <fpeters>
Component: general Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED NOTOURBUG QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   

Description Frederic Peters 2012-08-30 14:50:29 UTC
I have a large number of PDF files for which poppler extracts the wrong text (example at <http://people.gnome.org/~fpeters/pdf-identity-h-bug.pdf>).

The first page is fine, but a second page is attached that contains a single word in a monospace font (according to the document properties in poppler it is "FreeMono, Truetype (CID), encoded as Identity-H"). That word is displayed correctly but is converted to something entirely different when copy/pasting from evince, or when using the pdftotext or pdftohtml utilities.

The displayed word is "tapiraient" while the text extracted is "WDSLUDLHQW". In the series of documents I have, other examples give:

  DQJRLVVHUD -> angoissera
  HQDPRXUHU -> enamourer
  FRQWUHFDUUDLW -> contrecarrait

It looks like the mapping is always the same, and the letters keep their alphabetical order (e.g. D->a, E->?, F->c, G->?, H->e...). I checked poppler-data and there is a CMap/Identity-H file, but I couldn't figure out whether it is used, or relevant.
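The samples above are consistent with a constant shift: each extracted character is 29 code points below the displayed one (for example "W" is 87 and "t" is 116). The sketch below checks that guess against the four words from this report. The offset of 29 and the name `shift_extracted_text` are assumptions inferred from these samples only, not something poppler provides or confirms.

```python
# Assumption: the extracted codes sit a constant 29 below the intended
# ASCII codes. This is inferred from the four samples in this report only.
ASSUMED_OFFSET = 29  # hypothetical constant, not a poppler setting


def shift_extracted_text(extracted: str, offset: int = ASSUMED_OFFSET) -> str:
    """Shift every character up by `offset` code points."""
    return "".join(chr(ord(ch) + offset) for ch in extracted)


# Extracted text -> word actually displayed, as listed in the report.
samples = {
    "WDSLUDLHQW": "tapiraient",
    "DQJRLVVHUD": "angoissera",
    "HQDPRXUHU": "enamourer",
    "FRQWUHFDUUDLW": "contrecarrait",
}

for extracted, displayed in samples.items():
    assert shift_extracted_text(extracted) == displayed
```

If the shift really is constant, the unknown entries in the mapping above would come out as E->b and G->d, but none of the sample words contains those letters, so that remains unverified.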
Comment 1 Frederic Peters 2012-08-30 15:52:22 UTC
As suggested on IRC by Carlos, I had a friend try it with Adobe Reader, and the result is somewhat similar.

It gives (I don't know whether the Unicode characters will survive): 􀁗􀁄􀁓􀁌􀁕􀁄􀁌􀁈􀁑􀁗 which is U+100057, U+100044...; if the high 1 is ignored, the output is identical to what I get with poppler.
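The relation between the two outputs can be checked with a short sketch. It assumes the Adobe Reader characters are the poppler characters moved into plane 16 (a private-use range starting at U+100000), which is what the code points quoted above suggest. The names `PLANE16_BASE` and `strip_plane16_base` are made up for this illustration.

```python
# Assumption: Adobe Reader's output equals poppler's output plus 0x100000,
# as suggested by the code points U+100057, U+100044, ... quoted above.
PLANE16_BASE = 0x100000  # hypothetical name for the "high 1"


def strip_plane16_base(text: str) -> str:
    """Drop the 0x100000 base from characters that carry it."""
    return "".join(
        chr(ord(ch) - PLANE16_BASE) if ord(ch) >= PLANE16_BASE else ch
        for ch in text
    )


# Rebuild the Adobe Reader string from the poppler output rather than
# pasting private-use characters into the source.
poppler_output = "WDSLUDLHQW"
adobe_output = "".join(chr(PLANE16_BASE + ord(ch)) for ch in poppler_output)

assert ord(adobe_output[0]) == 0x100057
assert strip_plane16_base(adobe_output) == poppler_output
```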

I suppose there is nothing to do here but to consider the file invalid :/
