Bug 54268

Summary:	problem copy/pasting CID? / Identity-H? text
Product:	poppler	Reporter:	Frederic Peters <fpeters>
Component:	general	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED NOTOURBUG	QA Contact:
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	Other
OS:	All
Whiteboard:

Description Frederic Peters 2012-08-30 14:50:29 UTC

I have a large number of PDF files for which poppler's text extraction fails (example at <http://people.gnome.org/~fpeters/pdf-identity-h-bug.pdf>).

The first page is fine, but a second page has been attached to it, containing a single word in a monospace font (according to the document properties in poppler, it is "FreeMono, Truetype (CID), encoded as Identity-H"). That word is displayed correctly but is converted to something entirely different when copy/pasting from evince, or when using the pdftotext or pdftohtml utilities.

The displayed word is "tapiraient" while the word extracted as text is "WDSLUDLHQW". In the series of documents I have, other examples give:

  DQJRLVVHUD -> angoissera
  HQDPRXUHU -> enamourer
  FRQWUHFDUUDLW -> contrecarrait

It looks like the mapping is always the same, and letters are kept in the same order (e.g. D->a, E->?, F->c, G->?, H->e...). I checked poppler-data and there is a CMap/Identity-H file, but I couldn't figure out whether it is used or relevant.
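
For what it's worth, in the four samples above every extracted letter sits exactly 29 code points below the displayed one ('D' is 0x44 and 'a' is 0x61; 'W' is 0x57 and 't' is 0x74). Here is a minimal Python sketch that checks this. The function name and the constant are made up for illustration, and the offset is inferred only from these four words, so it may not hold for characters outside a-z or for other files:

```python
# Assumption: the offset of 29 is inferred from the four sample words in this
# report only; it is not taken from poppler or from the PDF itself.
OFFSET = 29

def shift_extracted(text: str, offset: int = OFFSET) -> str:
    """Shift every character of the extracted text up by a fixed offset."""
    return "".join(chr(ord(ch) + offset) for ch in text)

# Extracted text -> word actually displayed, as listed in this report.
samples = {
    "WDSLUDLHQW": "tapiraient",
    "DQJRLVVHUD": "angoissera",
    "HQDPRXUHU": "enamourer",
    "FRQWUHFDUUDLW": "contrecarrait",
}

for extracted, displayed in samples.items():
    recovered = shift_extracted(extracted)
    status = "match" if recovered == displayed else "MISMATCH"
    print(f"{extracted} -> {recovered} ({status})")
```

A constant offset like this would be consistent with the extracted values being some internal font index rather than Unicode, but that is only a guess on my part.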

Comment 1 Frederic Peters 2012-08-30 15:52:22 UTC

As suggested on IRC by Carlos, I had a friend try it with Adobe Reader, and the result is somewhat similar.

It gives (I don't know whether the Unicode characters will survive): 􀁗􀁄􀁓􀁌􀁕􀁄􀁌􀁈􀁑􀁗 which is U+100057, U+100044, ...; if the leading 0x100000 is ignored, the output is identical to what I get with poppler.
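
To make the comparison concrete, here is a small Python sketch that subtracts 0x100000 from each code point of the Adobe Reader output. The helper name is hypothetical, and the code point list is reconstructed from the two values quoted above plus the poppler output, not copied from Adobe Reader directly:

```python
# Assumption: Adobe Reader's output is the poppler output with 0x100000 added
# to every code point, as the two quoted values (U+100057, U+100044) suggest.
HIGH_BASE = 0x100000

def strip_high_base(text: str) -> str:
    """Subtract 0x100000 from code points at or above it; keep others as-is."""
    return "".join(
        chr(ord(ch) - HIGH_BASE) if ord(ch) >= HIGH_BASE else ch
        for ch in text
    )

# Code points reconstructed for illustration (W D S L U D L H Q W + 0x100000).
adobe_output = "".join(
    chr(cp)
    for cp in (
        0x100057, 0x100044, 0x100053, 0x10004C, 0x100055,
        0x100044, 0x10004C, 0x100048, 0x100051, 0x100057,
    )
)

poppler_output = "WDSLUDLHQW"
print(strip_high_base(adobe_output) == poppler_output)
```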

I suppose there is nothing to do here but to consider the file invalid :/
