50750 – pdftops generates broken (according to ghostscript) postscript

Summary: pdftops generates broken (according to ghostscript) postscript

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	poppler-bugs
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-06-05 16:04 UTC by Andrew de Quincey
Modified:	2018-08-20 21:57 UTC
CC List:	1 user

See Also:

Attachments
The PDF file which renders OK (115.68 KB, application/x-gzip) 2012-06-05 16:04 UTC, Andrew de Quincey
The broken postscript file pdftops generated (206.83 KB, application/x-gzip) 2012-06-05 16:05 UTC, Andrew de Quincey
drop invalid names from encoding table (1.84 KB, patch) 2018-05-17 01:50 UTC, Stefan Brüns
file that regresses with the proposed patch (291.68 KB, application/pdf) 2018-05-22 22:52 UTC, Albert Astals Cid
drop invalid names from encoding table (1.81 KB, patch) 2018-05-26 00:20 UTC, Stefan Brüns
drop invalid names from encoding table (2.42 KB, patch) 2018-05-27 13:52 UTC, Stefan Brüns
drop invalid names from encoding table (2.36 KB, patch) 2018-05-29 01:34 UTC, Stefan Brüns

Description Andrew de Quincey 2012-06-05 16:04:34 UTC

Created attachment 62614
The PDF file which renders OK

Hi, the attached PDF document displays OK in ghostscript, but when I
run it through poppler's pdftops, ghostscript rejects the output with
the error at the end of this bug. I first spotted this because I
couldn't print it via KDE/CUPS.

I'm using poppler 0.20.0 and ghostscript 9.05 on Arch Linux.



GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Error: /syntaxerror in -file-
Operand stack:
  Encoding   --nostringval--   --nostringval--   32
Execution stack:
  %interp_exit   .runexec2   --nostringval--   --nostringval--
--nostringval--   2   %stopped_push   --nostringval--
--nostringval--   --nostringval--   false   1   %stopped_push   1894
1   3   %oparray_pop   1893   1   3   %oparray_pop   1877   1   3
%oparray_pop   1771   1   3   %oparray_pop   --nostringval--
%errorexec_pop   .runexec2   --nostringval--   --nostringval--
--nostringval--   2   %stopped_push
Dictionary stack:
  --dict:1154/1684(ro)(G)--   --dict:0/20(G)--   --dict:78/200(L)--
--dict:67/75(L)--   --dict:5/10(L)--
Current allocation mode is local
Current file position is 11503
GPL Ghostscript 9.05: Unrecoverable error, exit code 1

Comment 1 Andrew de Quincey 2012-06-05 16:05:03 UTC

Created attachment 62615
The broken postscript file pdftops generated

Comment 2 Stefan Brüns 2018-05-15 21:36:01 UTC

Still non-working with poppler 0.63.

File offset 11503 (0x2cef) reads:
hexdump -C  -s 11493 -n 32 test_orig.ps 
00002ce5  64 75 70 20 33 32 20 2f  3c 61 80 ff 38 61 20 70  |dup 32 /<a..8a p|
00002cf5  75 74 0a 64 75 70 20 33  33 20 2f 65 78 63 6c 61  |ut.dup 33 /excla|
----
%!PS-TrueTypeFont- 1
10 dict begin
/FontName /RVURUK+Verdana def
/FontType 42 def
/FontMatrix [1 0 0 1 0 0] def
/FontBBox [-102 -423 2963 2049] def
/PaintType 0 def
/Encoding 256 array
dup 0 /.notdef put
dup 1 /.notdef put
dup 2 /.notdef put
...
dup 30 /.notdef put
dup 31 /.notdef put
dup 32 /<a��8a put
dup 33 /exclam put
dup 34 /quotedbl put
------

This is coming from the /Differences array of the /Encoding dictionary in the original PDF:
<</Type/Encoding/Differences[32/#3Ca#80#FF8a 39/#3Ca#80#FF8a#80 46/#3Ca#80#F...
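
To make the connection between the PDF name and the bytes seen in the hexdump explicit, here is a minimal Python sketch that expands the `#XX` escapes of a PDF name. The helper `decode_pdf_name` is hypothetical and is not poppler code; it only illustrates the escape rule.

```python
import re

def decode_pdf_name(raw: str) -> bytes:
    """Expand #XX hex escapes in a PDF name token (leading slash already removed)."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),  # two hex digits -> one byte
        raw.encode("latin-1"),
    )

decoded = decode_pdf_name("#3Ca#80#FF8a")
print(decoded.hex(" "))  # 3c 61 80 ff 38 61
```

The result matches the bytes after "dup 32 /" in the hexdump above: the name starts with "<" and contains 0x80 and 0xff, which is what ghostscript trips over.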

Comment 3 Stefan Brüns 2018-05-16 22:28:43 UTC

Reading a bit more, I can state the following:

The PDF file is obviously broken. The /Differences array in the /Encoding dict contains invalid/broken entries:
- The names are invalid (invalid characters)
- Different code points have the same name

While the latter is not invalid (e.g. .notdef is commonly used for all codepoints without glyphs), it is obviously broken, as the codepoints refer to *different* glyphs.

There are two possible solutions I can see here:

1. The file is broken. Call it a day

2. Follow the standard PDF 32000-1:2008, 9.6.6.4 [1]:
---
 If a (3, 1) “cmap” subtable (Microsoft Unicode) is present:
 • A character code shall be first mapped to a glyph name using the table described above.
 • The glyph name shall then be mapped to a Unicode value by consulting the Adobe Glyph List (see the Bibliography).
 • Finally, the Unicode value shall be mapped to a glyph description according to the (3, 1) subtable.

[...]
In any of these cases, if the glyph name cannot be mapped as specified, the glyph name shall be looked up in the font program’s “post” table (if one is present) and the associated glyph description shall be used. 
[...]
If a character cannot be mapped in any of the ways described previously, a conforming reader may supply a mapping of its choosing. 
---
This will fail in the first step; it has no 'post' table, thus the third paragraph applies.

As we know names in the AGL are ASCII only, we can reject the bad names from the differences array, and in this case end up with the names from WinAnsiEncoding.
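
A rough Python sketch of that idea, for illustration only: the predicate `is_plausible_glyph_name` and the tiny `base` table are assumptions, not the check used in the attached patch.

```python
def is_plausible_glyph_name(name: bytes) -> bool:
    """Hypothetical check: non-empty, printable ASCII, no PostScript delimiters."""
    delimiters = b"()<>[]{}/%"
    return bool(name) and all(
        0x21 <= byte <= 0x7E and byte not in delimiters for byte in name
    )

# Two WinAnsiEncoding entries serve as the base encoding for this example.
base = {32: "space", 39: "quotesingle"}
# The first two /Differences entries of the broken file, already unescaped.
differences = {32: b"<a\x80\xff8a", 39: b"<a\x80\xff8a\x80"}

encoding = dict(base)
for code, name in differences.items():
    if is_plausible_glyph_name(name):  # bad names are dropped, base name stays
        encoding[code] = name.decode("ascii")

print(encoding)  # {32: 'space', 39: 'quotesingle'}
```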


[1] https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
[2] Adobe Glyph List (AGL): https://github.com/adobe-type-tools/agl-aglfn/blob/master/aglfn.txt

Comment 4 Stefan Brüns 2018-05-17 01:50:18 UTC

Created attachment 139605
drop invalid names from encoding table

Comment 5 Albert Astals Cid 2018-05-19 14:45:43 UTC

Comment on attachment 139605
drop invalid names from encoding table

Should we really still be increasing "code"?

Comment 6 Stefan Brüns 2018-05-19 15:51:11 UTC

In this case, it does not matter, as all names are invalid.

If only one name were invalid, I would assume a slight error on the implementer's side, i.e. a single invalid entry in the sequence. Although I doubt this really happens.

In any case, 'code' is (re)set when the next sequence starts.

As this is dealing with broken files, there is no "correct" per se, but I would favor incrementing 'code'.
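
The behaviour under discussion can be sketched in Python as follows. This is a simplified model of walking a /Differences array, not the patch itself; `apply_differences` and the `is_valid` callback are hypothetical names.

```python
def apply_differences(differences, base, is_valid):
    """Apply a /Differences-style list (ints start a run, strings are names)."""
    encoding = dict(base)
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item  # a number (re)sets the current code
        else:
            if is_valid(item):
                encoding[code] = item
            code += 1  # advance even when the name was rejected
    return encoding

result = apply_differences([65, "A", "\x80bad", "C"], {}, str.isascii)
print(result)  # {65: 'A', 67: 'C'}
```

With the increment kept, a single rejected entry leaves a hole at its own code and the following names stay at their intended positions.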

Comment 7 Albert Astals Cid 2018-05-22 22:51:43 UTC

This patch causes a regression while rendering the file I will attach with pdftoppm: the x in "10 x 10 array of tines" disappears.

Comment 8 Albert Astals Cid 2018-05-22 22:52:38 UTC

Created attachment 139685
file that regresses with the proposed patch

Comment 9 Stefan Brüns 2018-05-24 15:15:36 UTC

Relevant part of the stream:
---
(10)Tj
/T5 1 Tf
8.964 0 0 8.964 313.14 385.68 Tm
(\002)Tj
/F6 1 Tf
8.966 0 0 8.966 321.96 385.68 Tm
[(10)-338(array)-335(of)-338(tines,)
---

Definition of the font:
---
797 0 obj
<<
  /Name /T5
  /Type /Font
  /Subtype /Type3
  /Resources 796 0 R
  /FontBBox [ 8 0 34 27 ]
  /FontMatrix [ .01859 0 0 .01859 0 0 ]
  /FirstChar 2
  /LastChar 2
  /Encoding 799 0 R
  /CharProcs 798 0 R
  /Widths [ 43 ]
>>
endobj

798 0 obj
<<
  /2 800 0 R
>>
endobj

799 0 obj
<<
  /Type /Encoding
  /Differences [ 2 /2 ]
>>
endobj
---

According to the AGL, "/2" is not a valid name, i.e. the file is broken.

Relaxing the check to allow digits in the first position should solve or work around the problem. Just comment out the first check from the patch.
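
Why the "x" disappears can be illustrated with a small Python model of the Type 3 lookup (character code, then name from /Differences, then entry in /CharProcs). The functions and the `allow_leading_digit` flag are hypothetical stand-ins for the check in the patch.

```python
def is_valid(name: str, allow_leading_digit: bool) -> bool:
    """Hypothetical name check; the strict form rejects a leading digit."""
    if not name or not name.isascii():
        return False
    return allow_leading_digit or not name[0].isdigit()

def resolve_char_proc(code, differences, char_procs, allow_leading_digit):
    """Map code -> glyph name -> CharProcs entry, or None if the name is dropped."""
    name = differences.get(code)
    if name is None or not is_valid(name, allow_leading_digit):
        return None
    return char_procs.get(name)

differences = {2: "2"}         # /Differences [ 2 /2 ]
char_procs = {"2": "800 0 R"}  # /CharProcs << /2 800 0 R >>

print(resolve_char_proc(2, differences, char_procs, allow_leading_digit=False))  # None
print(resolve_char_proc(2, differences, char_procs, allow_leading_digit=True))   # 800 0 R
```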

Comment 10 Albert Astals Cid 2018-05-24 17:29:05 UTC

It doesn't matter if the file is broken or not, you can't say "your file is broken" when all the other renderers out there do it correctly as far as the user is concerned.

Can you please attach the updated patch? It's much easier for me to make sure I'm not messing something up :)

Comment 11 Stefan Brüns 2018-05-26 00:20:03 UTC

Created attachment 139780
drop invalid names from encoding table

Comment 12 Albert Astals Cid 2018-05-27 08:56:57 UTC

Patch causes a regression in page 4 of https://bugs.kde.org/attachment.cgi?id=11961 in pdftops, the new ps output doesn't have some of the "special" characters rendered correctly (at least when running gs over the converted ps).

Comment 13 Stefan Brüns 2018-05-27 13:49:49 UTC

The missing glyphs have names of the form "afiiNNNN;" (mind the trailing semicolon).

Contrary to the original problematic file, the missing glyphs are not in codepoints where there already is a valid name, but in low positions where the default name is ".notdef".

Fontforge deals with these broken encodings by keeping the name from the default encoding if it is valid, and replaces any ".notdef" codepoints by ("uni%04X", code).

Doing the same as fontforge makes the pdftops output for both PDFs well-formed and visually correct.
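
A minimal Python sketch of that fallback rule, assuming the intended format is "uni" followed by four uppercase hex digits (the AGL convention); `fallback_name` is a hypothetical helper, not fontforge or poppler code.

```python
def fallback_name(code: int, base_name: str) -> str:
    """Pick a replacement for an invalid /Differences name at this code."""
    if base_name != ".notdef":
        return base_name       # keep the valid name from the base encoding
    return "uni%04X" % code    # otherwise synthesize a name from the code

print(fallback_name(32, "space"))   # space
print(fallback_name(3, ".notdef"))  # uni0003
```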

Comment 14 Stefan Brüns 2018-05-27 13:52:51 UTC

Created attachment 139797
drop invalid names from encoding table

Comment 15 Albert Astals Cid 2018-05-28 07:01:34 UTC

It still causes regressions in that file when run through pdftops, see https://i.imgur.com/9vZcSgM.png

the text above "Asst. Prof. Lamun Janhom, 1999" seems to be missing some glyph on top of the leftmost glyph

Comment 16 Stefan Brüns 2018-05-28 12:57:41 UTC

Falling back to the name from the base encoding, instead of always overwriting it, causes problems for the "/space" name.

The /Differences contains an invalid name for codepoint 32, so "/space" is used. It also contains a valid "/space" at e.g. 106. The second name overwrites the name-to-glyph mapping.

Either we do a second pass removing any conflicts, or never default to the name from the base encoding and always assign a cooked up name.
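
The conflict can be modelled in a few lines of Python. This is a deliberately simplified picture of a name-to-glyph table; the glyph ids and the synthesized name "uni0020" are made up for the example.

```python
def name_to_glyph(encoding, glyph_of_code):
    """Build a name -> glyph table; a repeated name overwrites the earlier entry."""
    table = {}
    for code in sorted(encoding):
        table[encoding[code]] = glyph_of_code[code]
    return table

glyph_of_code = {32: 3, 106: 77}  # hypothetical glyph ids for the two codes

# Code 32 fell back to the base name "space"; code 106 is a valid "/space".
with_base_fallback = {32: "space", 106: "space"}
print(name_to_glyph(with_base_fallback, glyph_of_code))  # {'space': 77}

# Always assigning a made-up name to the invalid entry avoids the clash.
with_synthesized = {32: "uni0020", 106: "space"}
print(name_to_glyph(with_synthesized, glyph_of_code))  # {'uni0020': 3, 'space': 77}
```

In the first case glyph 3 is no longer reachable by name, so code 32 would render with the glyph meant for code 106.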

Comment 17 Stefan Brüns 2018-05-29 01:34:06 UTC

Created attachment 139821
drop invalid names from encoding table

Comment 18 Stefan Brüns 2018-05-29 01:35:11 UTC

gs -sDEVICE=ppm output is bit-identical now

Comment 19 Albert Astals Cid 2018-05-30 06:59:39 UTC

Page 1 of https://bugs.freedesktop.org/attachment.cgi?id=14092 also regresses in its ps output.

Comment 20 Stefan Brüns 2018-06-02 18:12:25 UTC

Hm, another one with completely broken glyph names - ';#2323#2323#2323', with a varying number of repetitions of the #2323 pattern for different glyphs ...

For reasons I don't understand yet, it replaces the names in the codepoint -> glyphname table, but not in the glyphname -> glyphnumber table.

Comment 21 Stefan Brüns 2018-06-02 18:26:26 UTC

The bad names are not only in the /Differences array, but also in /CharSet, so if it requires mangling both ...

Comment 22 GitLab Migration User 2018-08-20 21:57:58 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/135.
