I have a large number of PDF files on which poppler's text extraction somehow fails (example at <http://people.gnome.org/~fpeters/pdf-identity-h-bug.pdf>). The first page is fine, but a second page has been appended to it, containing a single word in a monospace font (according to the document properties in poppler it is "FreeMono, Truetype (CID), encoded as Identity-H"). That word is displayed correctly but is converted to something entirely different when copying and pasting from evince, or when using the pdftotext or pdftohtml utilities. The displayed word is "tapiraient", while the word extracted as text is "WDSLUDLHQW".

In the series of documents I have, other examples give:

DQJRLVVHUD -> angoissera
HQDPRXUHU -> enamourer
FRQWUHFDUUDLW -> contrecarrait

It looks like the mapping is always the same, and the letters keep their alphabetical order (e.g. D->a, E->?, F->c, G->?, H->e...). I checked poppler-data and there is a CMap/Identity-H file, but I couldn't figure out whether it is used, or relevant.
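For what it's worth, the four pairs above all fit a constant shift: each extracted letter is 29 code points below the displayed one (ord("t") - ord("W") = 29, ord("a") - ord("D") = 29). A minimal sketch to check this, assuming the offset holds for every letter; `shift_decode` and `OFFSET` are hypothetical names of mine, not anything from poppler, and the offset is inferred from these samples only:

```python
# Hypothetical helper, not part of poppler: undo the constant offset
# observed between the extracted text and the displayed text.
OFFSET = 29  # ord("t") - ord("W"), inferred from the sample words only


def shift_decode(extracted: str, offset: int = OFFSET) -> str:
    """Shift every character up by `offset` code points."""
    return "".join(chr(ord(ch) + offset) for ch in extracted)


# The extracted/displayed pairs quoted in this report.
samples = {
    "WDSLUDLHQW": "tapiraient",
    "DQJRLVVHUD": "angoissera",
    "HQDPRXUHU": "enamourer",
    "FRQWUHFDUUDLW": "contrecarrait",
}

for extracted, displayed in samples.items():
    decoded = shift_decode(extracted)
    print(f"{extracted} -> {decoded} (matches: {decoded == displayed})")
```

A constant offset like this would be consistent with the extracted codes being something other than character codes (glyph indices, for instance), but that is a guess on my part and not something verified against the file.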
As suggested on IRC by Carlos, I had a friend try it with Adobe Reader, and the result is somewhat similar. It gives (I don't know if the Unicode characters will survive): which is U00100057, U00100044...; if the high 1 is ignored, the output is identical to what I get with poppler. I suppose there is nothing to do here but to consider the file invalid :/
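To make the "high 1 is ignored" step concrete, here is a small sketch that masks off everything above the low 16 bits of the two code points quoted above. `strip_high_bits` is a hypothetical name, and the input list contains only the two values given in this comment, not the full Adobe Reader output:

```python
def strip_high_bits(code_points):
    """Keep only the low 16 bits of each code point, dropping the leading 0x10."""
    return "".join(chr(cp & 0xFFFF) for cp in code_points)


# The two code points quoted from Adobe Reader's output (the rest is elided).
adobe_output = [0x00100057, 0x00100044]

print(strip_high_bits(adobe_output))  # first two letters of the poppler output
```

0x57 and 0x44 are "W" and "D", which is how "WDSLUDLHQW" from poppler starts.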