Summary: | Making pdftops preserve PDF page labels | ||
---|---|---|---|
Product: | poppler | Reporter: | Chung-chieh Shan <ccshan> |
Component: | general | Assignee: | Brad Hards <bradh> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | enhancement | ||
Priority: | medium | CC: | bradh |
Version: | unspecified | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Patch to poppler-0.6.2 from Debian sources
Test input to pdftops Desired output from pdftops pdflatex document that generates test case Test input to pdftops with Unicode strings pdflatex document that generates test case with Unicode strings Patch implementing page label support Patch implementing page label support |
Description
Chung-chieh Shan
2007-11-21 14:56:00 UTC
Created attachment 12678 [details] [review] Patch to poppler-0.6.2 from Debian sources Are you the coder of the patch? I particulary dislike the default: if (c < 0x20 || c > 0x7e) { case '(': case ')': writePSFmt("\\{0:03o}", c); j += 4; break; } else { writePSChar(c); ++j; } construct, having a case in middle of the if makes my cry, besides, that's not what the former code does, this always outputs writePSFmt("\\{0:03o}", c); when finding a ( while the original one only does on first character. Also from what i understand // Omit parentheses if the string is comprised of one or more // alphanumeric characters. is actually // Omit parentheses if the string is comprised of one or more // non alphanumeric characters. As I understand it, the change we are looking for (in pseudocode form) is: if we have a page label: write "%%Page pageLabel {1:[seqPage]}" to the file else: write "%%Page {0:pageNum} {1:[seqPage]}" to the file There has to be a cleaner patch than this. In any case, this part of the patch is wrong: if (mode == psModePS) { - writePSFmt("%%Page: {0:d} {1:d}\n", pageNum, seqPage); + writePS("%%Page: "); + if (pageLabel) { + writePSText(pageLabel); + } else { + writePSFmt("{0:d}", pageNum); + } + writePSFmt(" {0:d}\n", seqPage); writePS("%%BeginPageSetup\n"); } Since the writePSFmt(" {0:d}\n", seqPage); should be writePSFmt(" {1:d}\n", seqPage); Also, this bug could use a file that demonstrates the problem. I am the coder of the patch.
Attached please find a PDF file, the desired PS output, and the LaTeX code I used to make the PDF file.
I'm sorry that I am unfamiliar with how clean you want the patch and how much behavior of the original code you want to preserve. Would it be acceptable to preserve the original behavior of PSOutputDev::writePSTextLine simply by not sharing any code between it and PSOutputDev::writePSText?
My limited understanding of GooString suggests that it is not the case that
> writePSFmt(" {0:d}\n", seqPage);
> should be
> writePSFmt(" {1:d}\n", seqPage);
because the number 0 is an index for which argument to format.
Created attachment 12853 [details]
Test input to pdftops
Created attachment 12854 [details]
Desired output from pdftops
Created attachment 12855 [details]
pdflatex document that generates test case
Let me note that there is a bug in the patch that I just discovered (thanks for asking for a test case :). The bug is that "c = s->getChar(i) & 0xff;" should be "c = s->getChar(k) & 0xff;". You may well be correct in the GooString - my apologies. Given the test example, I see the difference in the PostScript. I'll take another look at the patch with this test case. Is there an application that actually shows the page labels? Ah, gsview does. I think I've found a way to do this without changing anything in the poppler core (except PSOutputDev, of course). Here is a simpler version: diff --git a/poppler/PSOutputDev.cc b/poppler/PSOutputDev.cc index 7252d99..d9cb2a0 100644 --- a/poppler/PSOutputDev.cc +++ b/poppler/PSOutputDev.cc @@ -1056,6 +1056,7 @@ void PSOutputDev::init(PSOutputFunc outputFuncA, void *outputStreamA, outputFunc = outputFuncA; outputStream = outputStreamA; fileType = fileTypeA; + m_catalog = catalog; xref = xrefA; level = globalParams->getPSLevel(); mode = modeA; @@ -3068,7 +3069,13 @@ void PSOutputDev::startPage(int pageNum, GfxState *state) { if (mode == psModePS) { - writePSFmt("%%Page: {0:d} {1:d}\n", pageNum, seqPage); + GooString pageLabel; + const GBool gotLabel = m_catalog->indexToLabel(pageNum -1, &pageLabel); + if (gotLabel) { + writePSFmt("%%Page: ({0:t}) {1:d}\n", &pageLabel, seqPage); + } else { + writePSFmt("%%Page: {0:d} {1:d}\n", pageNum, seqPage); + } writePS("%%BeginPageSetup\n"); } diff --git a/poppler/PSOutputDev.h b/poppler/PSOutputDev.h index a245d85..e6ad814 100644 --- a/poppler/PSOutputDev.h +++ b/poppler/PSOutputDev.h @@ -333,6 +333,7 @@ private: void *overlayCbkData; XRef *xref; // the xref table for this PDF file + Catalog *m_catalog; // the catalog for this PDF file Ref *fontIDs; // list of object IDs of all used fonts int fontIDLen; // number of entries in fontIDs array As I understand it, the functional differences between the original patch and this version are: 1. Incomplete handling of the label (always uses parentheses, no escaping) 2. I'm not removing the trailing null on a UCS2 encoded name My reading of the Postscript Language manual (Section 3.2, page 29) and the DSC spec (page 52) leads me to understand that the label is always text, and literal text strings are required to be enclosed in parentheses. I take it there is some reason that you wanted to omit them, but I can't see why. Can you tell me what the intent was? Also, can you tell me more about the problem with the UCS2 names and trailing null? Is it caused by some particular package/tool? Test case? Only indirectly related to this bug, but I can't get your generating file to work on Fedora 7. I see: [bradh@conferta test-latex]$ pdflatex -shell-escape pagelabel This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) \write18 enabled. entering extended mode (./pagelabel.tex LaTeX2e <2003/12/01> Babel <v3.8d> and hyphenation patterns for american, french, german, ngerman, b ahasa, basque, bulgarian, catalan, croatian, czech, danish, dutch, esperanto, e stonian, finnish, greek, icelandic, irish, italian, latin, magyar, norsk, polis h, portuges, romanian, russian, serbian, slovak, slovene, spanish, swedish, tur kish, ukrainian, nohyphenation, loaded. (/usr/share/texmf/tex/latex/base/article.cls Document Class: article 2004/02/16 v1.4f Standard LaTeX document class (/usr/share/texmf/tex/latex/base/size10.clo)) (/usr/share/texmf/tex/latex/hyperref/hyperref.sty (/usr/share/texmf/tex/latex/graphics/keyval.sty) (/usr/share/texmf/tex/latex/hyperref/pd1enc.def) (/usr/share/texmf/tex/latex/hyperref/hyperref.cfg) Implicit mode ON; LaTeX internals redefined (/usr/share/texmf/tex/latex/url/url.sty)) *hyperref using default driver hpdftex* (/usr/share/texmf/tex/latex/hyperref/hpdftex.def (/usr/share/texmf/tex/latex/psnfss/pifont.sty (/usr/share/texmf/tex/latex/psnfss/upzd.fd) (/usr/share/texmf/tex/latex/psnfss/upsy.fd))) (./pagelabel.aux) (/usr/share/texmf/tex/latex/hyperref/nameref.sty) (./pagelabel.out) (./pagelabel.out) ! Undefined control sequence. l.4 \thispdfpagelabel {Hello world!} ? Any suggestions? > literal text strings are required to be enclosed in parentheses. I believe the parentheses are not required for "simple" text such as a string of one or more digits. > I take it there is some reason that you wanted to omit them, but I > can't see why. Can you tell me what the intent was? I wanted to use pspresent, a program that uses numeric page labels to help the user navigate in a PostScript presentation. Unfortunately, the current version of pspresent (1.3) does not deal with the parentheses correctly, so it would help me if pdftops omits the parentheses when the page label is a string of one or more digits. Of course, this problem is really pspresent's fault, so it is less important than performing escaping correctly. > Also, can you tell me more about the problem with the UCS2 names and > trailing null? Is it caused by some particular package/tool? Test > case? I believe the PDF file I already attached contains a UCS2 page label. Of course, the PS output should not contain raw UCS2, so pdftops must convert any UCS2 page label to ASCII before writing it to the PS file. PageLabelInfo::indexToLabel (inside "if (label->hasUnicodeMarker())") appends the trailing null, which also should not be written to the PS file as is. > Only indirectly related to this bug, but I can't get your generating > file to work on Fedora 7. \thispdfpagelabel requires hypertex version 6.71f. Understood on the parentheses. This is what I'm seeing from https://bugs.freedesktop.org/attachment.cgi?id=12853 ("Test input to pdftops"): 22 0 obj << /Type /Catalog /Pages 12 0 R /Names 21 0 R /PageMode/UseOutlines/PageLabels << /Nums [0 << /P (Hello world!) >> 1 << /S /D /St 2 >> ] >> /OpenAction 5 0 R >> endobj I don't think that is UCS2. Any chance you can give me a test file? I have v6.74m, but I still can't make hyperref work. Created attachment 12892 [details]
Test input to pdftops with Unicode strings
Created attachment 12893 [details]
pdflatex document that generates test case with Unicode strings
My apologies; I should have passed the [unicode] option to \usepackage{hyperref}.
Created attachment 12894 [details] [review] Patch implementing page label support Thanks for the test case. I've just attached a patch that I'm considering applying. Let me know if you have any additional comments. I am concerned about one case: what happens if there is a mismatched right parenthesis ")" in the label. Would it be possible to generate yet another test case that has "Hello) World!" as the label? > I am concerned about one case: what happens if there is a mismatched
> right parenthesis ")" in the label. Would it be possible to generate
> yet another test case that has "Hello) World!" as the label?
The simplest escaping strategy is to escape both opening and closing
parentheses, even those not at the beginning of the string.
Also, if the page label is empty, then isNumeric would stay true yet
parentheses *are* required.
By the way, perhaps label2->append(GooString::format(...)) could be just
label2->appendfv(...).
Created attachment 12899 [details] [review] Patch implementing page label support Incorporates changes from Comment #20, except for the appendfv() change, which I'm not convinced about (and probably isn't important enough to worry about chasing fully). How does that look? It all looks good to me, though let me note that " || (j == 0 && c == '(')" is unnecessary. Thank you! Pushed to master, let me know if you still have problems. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.