Bug 54268 - problem copy/pasting CID? / Identity-H? text
Summary: problem copy/pasting CID? / Identity-H? text
Status: RESOLVED NOTOURBUG
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: poppler-bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-30 14:50 UTC by Frederic Peters
Modified: 2012-08-30 15:52 UTC (History)
0 users



Attachments

Description Frederic Peters 2012-08-30 14:50:29 UTC
I have a large number of PDF files for which poppler's text extraction fails (example at <http://people.gnome.org/~fpeters/pdf-identity-h-bug.pdf>).

The first page is fine, but a second page is attached that contains a single word in a monospace font (according to the document properties in poppler, it is "FreeMono, TrueType (CID), encoded as Identity-H"). That word is displayed correctly but is converted to something entirely different when copying and pasting from evince, or when using the pdftotext or pdftohtml utilities.

The displayed word is "tapiraient", while the extracted text is "WDSLUDLHQW". In the series of documents I have, other examples give:

  DQJRLVVHUD -> angoissera
  HQDPRXUHU -> enamourer
  FRQWUHFDUUDLW -> contrecarrait

It looks like the mapping is always the same and preserves alphabetical order (e.g. D->a, E->?, F->c, G->?, H->e...). I checked poppler-data and there is a CMap/Identity-H file, but I couldn't figure out whether it is used, or relevant.
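
For illustration, the four samples above are all consistent with a constant shift of 29 code points ('D' is 0x44 and 'a' is 0x61). The sketch below only checks that observation against the words quoted in this report. The helper name `undo_shift` is made up for the example, and the fixed offset is an assumption drawn from these samples, not a property of Identity-H fonts in general.

```python
# Assumption: each extracted character sits a constant 29 code points
# below the displayed one, as the samples in this report suggest.
OFFSET = ord("a") - ord("D")  # 0x61 - 0x44 = 29


def undo_shift(extracted: str, offset: int = OFFSET) -> str:
    """Shift every character up by `offset` code points (hypothetical helper)."""
    return "".join(chr(ord(ch) + offset) for ch in extracted)


# Extracted text -> displayed word, copied from the report.
samples = {
    "WDSLUDLHQW": "tapiraient",
    "DQJRLVVHUD": "angoissera",
    "HQDPRXUHU": "enamourer",
    "FRQWUHFDUUDLW": "contrecarrait",
}

for garbled, displayed in samples.items():
    print(garbled, "->", undo_shift(garbled), undo_shift(garbled) == displayed)
```

This is a workaround for these particular files at best; other documents may use a different offset or no regular mapping at all.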
Comment 1 Frederic Peters 2012-08-30 15:52:22 UTC
As suggested on IRC by Carlos, I had a friend try it with Adobe Reader, and the result is somewhat similar.

It gives (I don't know if the Unicode characters will survive): 􀁗􀁄􀁓􀁌􀁕􀁄􀁌􀁈􀁑􀁗, which is U+100057, U+100044, and so on; if the high 0x100000 part is ignored, the output is identical to what I get with poppler.
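
As a rough check of that comparison, the sketch below subtracts 0x100000 from each code point of the Adobe Reader output. Only the first two code points are spelled out above; the remaining ones in `adobe_output` are filled in on the assumption that they follow the same pattern, and `strip_pua_offset` is a name made up for this example.

```python
# Start of the Supplementary Private Use Area-B, where the Adobe Reader
# output lands (U+100057, U+100044, ...).
PUA_B_BASE = 0x100000


def strip_pua_offset(text: str) -> str:
    """Drop the 0x100000 offset from characters at or above it (hypothetical helper)."""
    return "".join(
        chr(ord(ch) - PUA_B_BASE) if ord(ch) >= PUA_B_BASE else ch
        for ch in text
    )


# Assumed full sequence: the report lists only U+100057 and U+100044 explicitly.
adobe_output = (
    "\U00100057\U00100044\U00100053\U0010004C\U00100055"
    "\U00100044\U0010004C\U00100048\U00100051\U00100057"
)

print(strip_pua_offset(adobe_output))
```

Under that assumption the result is the same "WDSLUDLHQW" string that poppler extracts.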

I suppose there is nothing to do here but to consider the file invalid :/

