Created attachment 62614 [details] The PDF file which renders OK

Hi, the attached PDF document displays OK in ghostscript, but when I run it through poppler's pdftops, ghostscript rejects it with the error at the end of this bug. I first spotted this because I couldn't print it via KDE/Cups. I'm using poppler 0.20.0 and ghostscript 9.05 on arch linux.

GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Error: /syntaxerror in -file-
Operand stack:
   Encoding --nostringval-- --nostringval-- 32
Execution stack:
   %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1894 1 3 %oparray_pop 1893 1 3 %oparray_pop 1877 1 3 %oparray_pop 1771 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push
Dictionary stack:
   --dict:1154/1684(ro)(G)-- --dict:0/20(G)-- --dict:78/200(L)-- --dict:67/75(L)-- --dict:5/10(L)--
Current allocation mode is local
Current file position is 11503
GPL Ghostscript 9.05: Unrecoverable error, exit code 1
Created attachment 62615 [details] The broken postscript file pdftops generated
Still non-working with poppler 0.63. File offset 11503 (0x2cef) reads:

hexdump -C -s 11493 -n 32 test_orig.ps
00002ce5  64 75 70 20 33 32 20 2f 3c 61 80 ff 38 61 20 70  |dup 32 /<a..8a p|
00002cf5  75 74 0a 64 75 70 20 33 33 20 2f 65 78 63 6c 61  |ut.dup 33 /excla|
----
%!PS-TrueTypeFont- 1
10 dict begin
/FontName /RVURUK+Verdana def
/FontType 42 def
/FontMatrix [1 0 0 1 0 0] def
/FontBBox [-102 -423 2963 2049] def
/PaintType 0 def
/Encoding 256 array
dup 0 /.notdef put
dup 1 /.notdef put
dup 2 /.notdef put
...
dup 30 /.notdef put
dup 31 /.notdef put
dup 32 /<a��8a put
dup 33 /exclam put
dup 34 /quotedbl put
------
This is coming from the /Differences array of the /Encoding dictionary in the original PDF:
<</Type/Encoding/Differences[32/#3Ca#80#FF8a 39/#3Ca#80#FF8a#80 46/#3Ca#80#F...
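To see how the escaped name in the /Differences array turns into the bytes shown in the hexdump, here is a minimal Python sketch that expands the `#XX` hex escapes of a PDF name. The function name `decode_pdf_name` is made up for this illustration and is not part of poppler.

```python
import re

def decode_pdf_name(raw: bytes) -> bytes:
    """Expand #XX hex escapes in a PDF name token (leading '/' already removed)."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),  # two hex digits -> one byte
        raw,
    )

name = decode_pdf_name(b"#3Ca#80#FF8a")
print(name)           # b'<a\x80\xff8a'
print(name.hex(" "))  # 3c 61 80 ff 38 61
```

The decoded bytes match the `3c 61 80 ff 38 61` run in the hexdump, i.e. the name is copied into the PostScript output unescaped, which is what the ghostscript parser trips over.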
Reading a bit more, I can state the following:

The PDF file is obviously broken. The /Differences array in the /Encoding dict contains invalid/broken entries:
- The names are invalid (invalid characters)
- Different code points have the same name

While the latter is not invalid (e.g. .notdef is commonly used for all codepoints without glyphs), it is obviously broken here, as the codepoints refer to *different* glyphs.

There are two possible solutions I can see here:
1. The file is broken. Call it a day
2. Follow the standard PDF 32000-1:2008, 9.6.6.4 [1]:
---
If a (3, 1) “cmap” subtable (Microsoft Unicode) is present:
• A character code shall be first mapped to a glyph name using the table described above.
• The glyph name shall then be mapped to a Unicode value by consulting the Adobe Glyph List (see the Bibliography).
• Finally, the Unicode value shall be mapped to a glyph description according to the (3, 1) subtable.
[...]
In any of these cases, if the glyph name cannot be mapped as specified, the glyph name shall be looked up in the font program’s “post” table (if one is present) and the associated glyph description shall be used.
[...]
If a character cannot be mapped in any of the ways described previously, a conforming reader may supply a mapping of its choosing.
---

This will fail in the first step; the font has no 'post' table, thus the third paragraph applies. As we know names in the AGL [2] are ASCII only, we can reject the bad names from the /Differences array, and in this case end up with the names from WinAnsiEncoding.

[1] https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
[2] Adobe Glyph List (AGL): https://github.com/adobe-type-tools/agl-aglfn/blob/master/aglfn.txt
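The lookup chain quoted from 9.6.6.4 can be sketched roughly as follows. This is only an illustration in Python, not poppler's code: the function `resolve_glyph`, the toy tables and the glyph ids are all invented, and real tables would come from the AGL and the font program.

```python
def resolve_glyph(code, encoding, agl, cmap_3_1, post=None):
    """Return a glyph id for a character code, or None if every step fails."""
    name = encoding.get(code)              # step 1: code -> glyph name
    unicode_value = agl.get(name)          # step 2: glyph name -> Unicode (AGL)
    if unicode_value in cmap_3_1:          # step 3: Unicode -> glyph via (3, 1)
        return cmap_3_1[unicode_value]
    if post is not None and name in post:  # fallback: look the name up in 'post'
        return post[name]
    return None                            # the reader may pick its own mapping

# Toy tables, invented for illustration only.
agl = {"space": 0x0020, "exclam": 0x0021}
cmap_3_1 = {0x0020: 3, 0x0021: 4}
encoding = {32: "<a\x80\xff8a", 33: "exclam"}

print(resolve_glyph(33, encoding, agl, cmap_3_1))  # 4
print(resolve_glyph(32, encoding, agl, cmap_3_1))  # None
```

For code 32 the broken name is not in the AGL and there is no 'post' table, so the function falls through to the "mapping of its choosing" case, which is where rejecting the bad name and using the WinAnsiEncoding name would come in.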
Created attachment 139605 [details] [review] drop invalid names from encoding table
Comment on attachment 139605 [details] [review] drop invalid names from encoding table

Should we really still be increasing "code"?
In this case, it does not matter, as all names are invalid. If only one name were invalid, I would assume a slight error on the implementer's side, i.e. a single invalid entry in the sequence, although I doubt this really happens. In any case, 'code' is (re)set when the next sequence starts. As this is dealing with broken files, there is no "correct" per se, but I would favor incrementing 'code'.
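A small Python sketch of what "still incrementing 'code'" means when a name is dropped. It is a simplified model of a /Differences walk, not the patch itself; `apply_differences` and `ascii_only` are hypothetical helpers, and the validity test is just "non-empty printable ASCII".

```python
def ascii_only(name: bytes) -> bool:
    """Assumed validity check: non-empty and printable ASCII only."""
    return bool(name) and all(0x21 <= b <= 0x7E for b in name)

def apply_differences(base, differences, is_valid):
    """Overlay a /Differences array onto a copy of `base` (code -> name)."""
    encoding = dict(base)
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item          # a number (re)sets the code for a new sequence
            continue
        if is_valid(item):
            encoding[code] = item
        code += 1                # advance even when the name was dropped
    return encoding

base = {32: b"space", 33: b"exclam", 34: b"quotedbl"}
differences = [32, b"<a\x80\xff8a", b"exclamsmall"]
print(apply_differences(base, differences, ascii_only))
# {32: b'space', 33: b'exclamsmall', 34: b'quotedbl'}
```

Without the unconditional `code += 1`, the valid second name would land on code 32 instead of 33, which is the "one bad entry in a sequence" case discussed above.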
This patch causes a regression when rendering the file I will attach with pdftoppm: the x in "10 x 10 array of tines" disappears.
Created attachment 139685 [details] file that regresses with the proposed patch
Relevant part of the stream:
---
(10)Tj
/T5 1 Tf
8.964 0 0 8.964 313.14 385.68 Tm
(\002)Tj
/F6 1 Tf
8.966 0 0 8.966 321.96 385.68 Tm
[(10)-338(array)-335(of)-338(tines,)
---

Definition of the font:
---
797 0 obj
<<
/Name /T5
/Type /Font
/Subtype /Type3
/Resources 796 0 R
/FontBBox [ 8 0 34 27 ]
/FontMatrix [ .01859 0 0 .01859 0 0 ]
/FirstChar 2
/LastChar 2
/Encoding 799 0 R
/CharProcs 798 0 R
/Widths [ 43 ]
>>
endobj
798 0 obj
<<
/2 800 0 R
>>
endobj
799 0 obj
<<
/Type /Encoding
/Differences [ 2 /2 ]
>>
endobj
---

According to the AGL, "/2" is not a valid name, i.e. the file is broken. Relaxing the check to allow digits in the first position should solve/work around the problem. Just comment out the first check from the patch.
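The effect of relaxing the first-character check can be illustrated with two patterns. These regular expressions are an assumption about what such checks might look like (letters, digits, period and underscore only), not a copy of the checks in the attached patch.

```python
import re

# Assumed strict rule: allowed characters only, and no leading digit.
STRICT = re.compile(rb"[A-Za-z_.][A-Za-z0-9_.]*")
# Relaxed rule: same character set, but a digit may come first.
RELAXED = re.compile(rb"[A-Za-z0-9_.]+")

for name in (b"2", b"exclam", b"<a\x80\xff8a"):
    print(name, bool(STRICT.fullmatch(name)), bool(RELAXED.fullmatch(name)))
# b'2' False True
# b'exclam' True True
# b'<a\x80\xff8a' False False
```

With the relaxed pattern the Type3 font's "/2" entry survives, while the names from the original report are still rejected.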
It doesn't matter if the file is broken or not, you can't say "your file is broken" when all the other renderers out there do it correctly as far as the user is concerned. Can you please attach the updated patch, it's much easier for me to make sure i'm not messing up something :)
Created attachment 139780 [details] [review] drop invalid names from encoding table
Patch causes a regression in page 4 of https://bugs.kde.org/attachment.cgi?id=11961 in pdftops, the new ps output doesn't have some of the "special" characters rendered correctly (at least when running gs over the converted ps).
The missing glyphs have names of the form "afiiNNNN;" (mind the trailing semicolon). Unlike in the original problematic file, the missing glyphs are not at codepoints where there already is a valid name, but at low positions where the default name is ".notdef". Fontforge deals with these broken encodings by keeping the name from the default encoding if it is valid, and replacing the name at any ".notdef" codepoint with ("uni%04X", code). Doing the same as fontforge makes the pdftops output for both PDFs well-formed and visually correct.
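A minimal sketch of the fontforge-style fallback described here, in Python. The helper `fallback_name` is hypothetical, and it assumes the intended format is a four-digit uppercase hex code ("uniXXXX") built from the character code.

```python
def fallback_name(code: int, base_name: str) -> str:
    """Pick a replacement for an invalid /Differences name at `code`."""
    if base_name != ".notdef":
        return base_name       # keep the valid name from the base encoding
    return "uni%04X" % code    # otherwise synthesize a uniXXXX name

print(fallback_name(32, "space"))   # space
print(fallback_name(2, ".notdef"))  # uni0002
```

Note that the synthesized name treats the character code as if it were a Unicode value, which is only a naming convenience for these broken files.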
Created attachment 139797 [details] [review] drop invalid names from encoding table
It still causes regressions in that file when run through pdftops, see https://i.imgur.com/9vZcSgM.png. The text above "Asst. Prof. Lamun Janhom, 1999" seems to be missing some glyph on top of the leftmost glyph.
Falling back to the name from the base encoding, instead of always overwriting it, causes problems for the "/space" name. The /Differences array contains an invalid name for codepoint 32, so "/space" is used there. It also contains a valid "/space" at e.g. 106. The second entry overwrites the name-to-glyph mapping of the first. Either we do a second pass removing any conflicts, or we never default to the name from the base encoding and always assign a cooked-up name.
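The collision can be reproduced with a toy model in Python. The glyph ids and the helper `build_name_to_glyph` are invented for illustration; this only models a name-keyed table, not poppler's actual data structures.

```python
def build_name_to_glyph(code_to_name, code_to_glyph):
    """Build a name-keyed glyph table; a repeated name overwrites the earlier one."""
    table = {}
    for code, name in code_to_name.items():
        table[name] = code_to_glyph[code]
    return table

code_to_glyph = {32: 3, 106: 77}  # invented glyph ids

# 32 got "space" from the base encoding, 106 has a valid "space" in /Differences.
with_fallback = {32: "space", 106: "space"}
print(build_name_to_glyph(with_fallback, code_to_glyph))
# {'space': 77}

# Always assigning a cooked-up name keeps both glyphs reachable.
with_synth = {32: "uni0020", 106: "space"}
print(build_name_to_glyph(with_synth, code_to_glyph))
# {'uni0020': 3, 'space': 77}
```

In the first case codepoint 32 ends up drawing glyph 77, which is the kind of wrong or missing glyph seen in the regression; the second case corresponds to the "never default to the base encoding" option.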
Created attachment 139821 [details] [review] drop invalid names from encoding table
gs -sDEVICE=ppm output is bit-identical now
Page 1 of https://bugs.freedesktop.org/attachment.cgi?id=14092 also regresses in its ps output.
Hm, another one with completely broken glyph names - ';#2323#2323#2323', with a varying number of repetitions of the #2323 pattern for different glyphs ... For reasons I don't understand yet, it replaces the names in the codepoint -> glyphname table, but not in the glyphname -> glyphnumber table.
The bad names are not only in the /Differences array, but also in /CharSet, so if it requires mangling both ...
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/135.