Bug 13338

Summary: Making pdftops preserve PDF page labels
Product: poppler Reporter: Chung-chieh Shan <ccshan>
Component: generalAssignee: Brad Hards <bradh>
Status: RESOLVED FIXED QA Contact:
Severity: enhancement    
Priority: medium CC: bradh
Version: unspecified   
Hardware: All   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Patch to poppler-0.6.2 from Debian sources
Test input to pdftops
Desired output from pdftops
pdflatex document that generates test case
Test input to pdftops with Unicode strings
pdflatex document that generates test case with Unicode strings
Patch implementing page label support
Patch implementing page label support

Description Chung-chieh Shan 2007-11-21 14:56:00 UTC
Currently, pdftops discards the PDF page labels in the input document.  I attach a patch that makes it produce the PDF page labels as PostScript page labels.
Comment 1 Chung-chieh Shan 2007-11-21 14:58:17 UTC
Created attachment 12678 [details] [review]
Patch to poppler-0.6.2 from Debian sources
Comment 2 Albert Astals Cid 2007-11-22 11:16:43 UTC
Are you the coder of the patch?

I particulary dislike the 

    default:
      if (c < 0x20 || c > 0x7e) {
    case '(':
    case ')':
        writePSFmt("\\{0:03o}", c);
        j += 4;
        break;
      } else {
       writePSChar(c);
       ++j;
      }

construct, having a case in middle of the if makes my cry, besides, that's not what the former code does, this always outputs 
writePSFmt("\\{0:03o}", c);
when finding a ( while the original one only does on first character.

Also from what i understand

    // Omit parentheses if the string is comprised of one or more
    // alphanumeric characters.

is actually

    // Omit parentheses if the string is comprised of one or more
    // non alphanumeric characters.
Comment 3 Brad Hards 2007-11-29 01:02:08 UTC
As I understand it, the change we are looking for (in pseudocode form) is:
if we have a page label:
        write "%%Page pageLabel {1:[seqPage]}" to the file
else:
        write "%%Page {0:pageNum} {1:[seqPage]}" to the file


There has to be a cleaner patch than this. In any case, this part of the patch is wrong:
   if (mode == psModePS) {
-    writePSFmt("%%Page: {0:d} {1:d}\n", pageNum, seqPage);
+    writePS("%%Page: ");
+    if (pageLabel) {
+      writePSText(pageLabel);
+    } else {
+      writePSFmt("{0:d}", pageNum);
+    }
+    writePSFmt(" {0:d}\n", seqPage);
     writePS("%%BeginPageSetup\n");
   }
 
Since the writePSFmt(" {0:d}\n", seqPage); should be writePSFmt(" {1:d}\n", seqPage);
Comment 4 Brad Hards 2007-11-29 01:26:48 UTC
Also, this bug could use a file that demonstrates the problem.
Comment 5 Chung-chieh Shan 2007-11-29 18:24:52 UTC
I am the coder of the patch.

Attached please find a PDF file, the desired PS output, and the LaTeX code I used to make the PDF file.

I'm sorry that I am unfamiliar with how clean you want the patch and how much behavior of the original code you want to preserve.  Would it be acceptable to preserve the original behavior of PSOutputDev::writePSTextLine simply by not sharing any code between it and PSOutputDev::writePSText?

My limited understanding of GooString suggests that it is not the case that
> writePSFmt(" {0:d}\n", seqPage);
> should be
> writePSFmt(" {1:d}\n", seqPage);
because the number 0 is an index for which argument to format.
Comment 6 Chung-chieh Shan 2007-11-29 18:26:51 UTC
Created attachment 12853 [details]
Test input to pdftops
Comment 7 Chung-chieh Shan 2007-11-29 18:27:14 UTC
Created attachment 12854 [details]
Desired output from pdftops
Comment 8 Chung-chieh Shan 2007-11-29 18:27:44 UTC
Created attachment 12855 [details]
pdflatex document that generates test case
Comment 9 Chung-chieh Shan 2007-11-29 18:30:32 UTC
Let me note that there is a bug in the patch that I just discovered (thanks for asking for a test case :).  The bug is that "c = s->getChar(i) & 0xff;" should be "c = s->getChar(k) & 0xff;".
Comment 10 Brad Hards 2007-11-29 22:23:57 UTC
You may well be correct in the GooString - my apologies.

Given the test example, I see the difference in the PostScript. I'll take another look at the patch with this test case.

Is there an application that actually shows the page labels? 
Comment 11 Brad Hards 2007-11-29 23:27:02 UTC
Ah, gsview does. I think I've found a way to do this without changing anything in the poppler core (except PSOutputDev, of course).
Comment 12 Brad Hards 2007-11-30 16:33:18 UTC
Here is a simpler version:
diff --git a/poppler/PSOutputDev.cc b/poppler/PSOutputDev.cc
index 7252d99..d9cb2a0 100644
--- a/poppler/PSOutputDev.cc
+++ b/poppler/PSOutputDev.cc
@@ -1056,6 +1056,7 @@ void PSOutputDev::init(PSOutputFunc outputFuncA, void *outputStreamA,
   outputFunc = outputFuncA;
   outputStream = outputStreamA;
   fileType = fileTypeA;
+  m_catalog = catalog;
   xref = xrefA;
   level = globalParams->getPSLevel();
   mode = modeA;
@@ -3068,7 +3069,13 @@ void PSOutputDev::startPage(int pageNum, GfxState *state) {


   if (mode == psModePS) {
-    writePSFmt("%%Page: {0:d} {1:d}\n", pageNum, seqPage);
+    GooString pageLabel;
+    const GBool gotLabel = m_catalog->indexToLabel(pageNum -1, &pageLabel);
+    if (gotLabel) {
+      writePSFmt("%%Page: ({0:t}) {1:d}\n", &pageLabel, seqPage);
+    } else {
+      writePSFmt("%%Page: {0:d} {1:d}\n", pageNum, seqPage);
+    }
     writePS("%%BeginPageSetup\n");
   }

diff --git a/poppler/PSOutputDev.h b/poppler/PSOutputDev.h
index a245d85..e6ad814 100644
--- a/poppler/PSOutputDev.h
+++ b/poppler/PSOutputDev.h
@@ -333,6 +333,7 @@ private:
   void *overlayCbkData;

   XRef *xref;                  // the xref table for this PDF file
+  Catalog *m_catalog;           // the catalog for this PDF file

   Ref *fontIDs;                        // list of object IDs of all used fonts
   int fontIDLen;               // number of entries in fontIDs array


As I understand it, the functional differences between the original patch and this version are:
1. Incomplete handling of the label (always uses parentheses, no escaping)
2. I'm not removing the trailing null on a UCS2 encoded name

My reading of the Postscript Language manual (Section 3.2, page 29) and the DSC spec (page 52) leads me to understand that the label is always text, and literal text strings are required to be enclosed in parentheses. I take it there is some reason that you wanted to omit them, but I can't see why. Can you tell me what the intent was? 

Also, can you tell me more about the problem with the UCS2 names and trailing null? Is it caused by some particular package/tool? Test case?

Comment 13 Brad Hards 2007-11-30 16:34:17 UTC
Only indirectly related to this bug, but I can't get your generating file to work on Fedora 7.
I see:
[bradh@conferta test-latex]$ pdflatex -shell-escape pagelabel
This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4)
 \write18 enabled.
entering extended mode
(./pagelabel.tex
LaTeX2e <2003/12/01>
Babel <v3.8d> and hyphenation patterns for american, french, german, ngerman, b
ahasa, basque, bulgarian, catalan, croatian, czech, danish, dutch, esperanto, e
stonian, finnish, greek, icelandic, irish, italian, latin, magyar, norsk, polis
h, portuges, romanian, russian, serbian, slovak, slovene, spanish, swedish, tur
kish, ukrainian, nohyphenation, loaded.
(/usr/share/texmf/tex/latex/base/article.cls
Document Class: article 2004/02/16 v1.4f Standard LaTeX document class
(/usr/share/texmf/tex/latex/base/size10.clo))
(/usr/share/texmf/tex/latex/hyperref/hyperref.sty
(/usr/share/texmf/tex/latex/graphics/keyval.sty)
(/usr/share/texmf/tex/latex/hyperref/pd1enc.def)
(/usr/share/texmf/tex/latex/hyperref/hyperref.cfg)
Implicit mode ON; LaTeX internals redefined
(/usr/share/texmf/tex/latex/url/url.sty))
*hyperref using default driver hpdftex*
(/usr/share/texmf/tex/latex/hyperref/hpdftex.def
(/usr/share/texmf/tex/latex/psnfss/pifont.sty
(/usr/share/texmf/tex/latex/psnfss/upzd.fd)
(/usr/share/texmf/tex/latex/psnfss/upsy.fd))) (./pagelabel.aux)
(/usr/share/texmf/tex/latex/hyperref/nameref.sty) (./pagelabel.out)
(./pagelabel.out)
! Undefined control sequence.
l.4 \thispdfpagelabel
                     {Hello world!}
?

Any suggestions?
Comment 14 Chung-chieh Shan 2007-11-30 20:03:33 UTC
> literal text strings are required to be enclosed in parentheses.

I believe the parentheses are not required for "simple" text such as a
string of one or more digits.

> I take it there is some reason that you wanted to omit them, but I
> can't see why. Can you tell me what the intent was?

I wanted to use pspresent, a program that uses numeric page labels to
help the user navigate in a PostScript presentation.  Unfortunately, the
current version of pspresent (1.3) does not deal with the parentheses
correctly, so it would help me if pdftops omits the parentheses when the
page label is a string of one or more digits.  Of course, this problem
is really pspresent's fault, so it is less important than performing
escaping correctly.

> Also, can you tell me more about the problem with the UCS2 names and
> trailing null? Is it caused by some particular package/tool? Test
> case?

I believe the PDF file I already attached contains a UCS2 page label.
Of course, the PS output should not contain raw UCS2, so pdftops must
convert any UCS2 page label to ASCII before writing it to the PS file.
PageLabelInfo::indexToLabel (inside "if (label->hasUnicodeMarker())")
appends the trailing null, which also should not be written to the PS
file as is.

> Only indirectly related to this bug, but I can't get your generating
> file to work on Fedora 7.

\thispdfpagelabel requires hypertex version 6.71f.
Comment 15 Brad Hards 2007-11-30 20:44:34 UTC
Understood on the parentheses.

This is what I'm seeing from https://bugs.freedesktop.org/attachment.cgi?id=12853 ("Test input to pdftops"): 
22 0 obj <<
/Type /Catalog
/Pages 12 0 R
/Names 21 0 R
/PageMode/UseOutlines/PageLabels << /Nums [0 << /P (Hello world!) >> 1 << /S /D /St 2 >> ] >>
/OpenAction 5 0 R
>> endobj

I don't think that is UCS2. Any chance you can give me a test file? I have v6.74m, but I still can't make hyperref work.
Comment 16 Chung-chieh Shan 2007-11-30 22:45:18 UTC
Created attachment 12892 [details]
Test input to pdftops with Unicode strings
Comment 17 Chung-chieh Shan 2007-11-30 22:51:28 UTC
Created attachment 12893 [details]
pdflatex document that generates test case with Unicode strings

My apologies; I should have passed the [unicode] option to \usepackage{hyperref}.
Comment 18 Brad Hards 2007-12-01 00:13:06 UTC
Created attachment 12894 [details] [review]
Patch implementing page label support
Comment 19 Brad Hards 2007-12-01 00:16:43 UTC
Thanks for the test case.

I've just attached a patch that I'm considering applying. Let me know if you have any additional comments.

I am concerned about one case: what happens if there is a mismatched right parenthesis ")" in the label. Would it be possible to generate yet another test case that has "Hello) World!" as the label?
Comment 20 Chung-chieh Shan 2007-12-02 00:02:28 UTC
> I am concerned about one case: what happens if there is a mismatched
> right parenthesis ")" in the label. Would it be possible to generate
> yet another test case that has "Hello) World!" as the label?

The simplest escaping strategy is to escape both opening and closing
parentheses, even those not at the beginning of the string.

Also, if the page label is empty, then isNumeric would stay true yet
parentheses *are* required.

By the way, perhaps label2->append(GooString::format(...)) could be just
label2->appendfv(...).
Comment 21 Brad Hards 2007-12-02 00:42:23 UTC
Created attachment 12899 [details] [review]
Patch implementing page label support

Incorporates changes from Comment #20, except for the appendfv() change, which I'm not convinced about (and probably isn't important enough to worry about chasing fully).
Comment 22 Brad Hards 2007-12-02 00:42:54 UTC
How does that look?
Comment 23 Chung-chieh Shan 2007-12-03 09:48:25 UTC
It all looks good to me, though let me note that " || (j == 0 && c == '(')" is unnecessary.

Thank you!
Comment 24 Brad Hards 2007-12-04 00:15:35 UTC
Pushed to master, let me know if you still have problems.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.