Bug 18460

Summary: pdftohtml puts garbage into <title> tags and "Document Outline"
Product: poppler Reporter: Ross Moore <ross>
Component: general Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: minor    
Priority: low CC: jwilk
Version: unspecified   
Hardware: Other   
OS: Mac OS X (All)   
URL: http://www.maths.mq.edu.au/~ross/poppler/ZhangPeng/readme.html
Attachments: Zhang Peng's PDF; it contains a Chinese font

Description Ross Moore 2008-11-09 16:38:38 UTC
Created attachment 20170
Zhang Peng's PDF; it contains a Chinese font

With the attached PDF (supplied by Zhang Peng for another purpose, see
      http://lists.freedesktop.org/archives/poppler/2008-November/004216.html ),
pdftohtml fails to set the <title> tags correctly: the output contains the bytes <FE><FF>, which are invalid in UTF-8.

Within the "Document Outline" section, both entries also start with the <FE><FF> bytes,
and the first entry is followed by more garbage.

This can be seen at the URL stated for this bug report:
     http://www.maths.mq.edu.au/~ross/poppler/ZhangPeng/readme.html
(You may need to set the browser encoding to UTF-8 manually.)
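
The bytes <FE><FF> are the byte order mark that the PDF specification uses to flag a text string as UTF-16BE, so one plausible reading of the symptom is that the title and outline strings reach the HTML without being decoded. The Python sketch below illustrates the decoding step under that assumption. It is not poppler's code and not the fix that was eventually applied; the function names `decode_pdf_text_string` and `html_title` are hypothetical, and the fallback branch uses Latin-1 only as a stand-in for PDFDocEncoding.

```python
import html


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the raw bytes of a PDF text string into a Python str.

    A text string that starts with the bytes FE FF is UTF-16BE; the
    two marker bytes are not part of the text and are dropped.
    """
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 is used here as a simplified stand-in for
    # PDFDocEncoding, which differs from it for some byte values.
    return raw.decode("latin-1")


def html_title(raw: bytes) -> str:
    """Build a <title> element from the raw title bytes (hypothetical helper)."""
    return "<title>" + html.escape(decode_pdf_text_string(raw)) + "</title>"
```

With this approach the title is decoded first and only then written out as UTF-8, so the marker bytes never appear in the HTML; a valid UTF-8 stream cannot contain the byte values FE or FF at all.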


Facts:
-----
The document contains Chinese characters, with the following font info:

<</Subtype/Type0
/DescendantFonts 33 0 R
/BaseFont/AdobeSongStd-Light
/Encoding/UniGB-UCS2-H
/Type/Font>>

There is no embedded CMap resource:

> pdffonts readme.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
AdobeSongStd-Light                   CID Type 0        no  no  no      32  0
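
For anyone who wants to check the `emb` column of such a listing from a script, the following Python sketch parses the table layout shown above. It assumes the output has exactly this shape (a header line, a rule of dashes, then one row per font) and derives the column boundaries from the runs of dashes, because the `type` and `object ID` values contain spaces. The function name `parse_pdffonts` is hypothetical and not part of poppler.

```python
import re


def parse_pdffonts(output: str) -> list[dict[str, str]]:
    """Parse a pdffonts-style table into one dict per font row."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    # The rule line of dashes marks where each column starts and ends.
    rule_idx = next(i for i, ln in enumerate(lines) if ln.startswith("---"))
    header, rule = lines[rule_idx - 1], lines[rule_idx]
    spans = [(m.start(), m.end()) for m in re.finditer(r"-+", rule)]
    names = [header[start:end].strip() for start, end in spans]
    rows = []
    for ln in lines[rule_idx + 1:]:
        # Slice each row at the same column positions as the header.
        rows.append({name: ln[start:end].strip()
                     for name, (start, end) in zip(names, spans)})
    return rows
```

Applied to the listing above, the single row would report `emb` as `no` for AdobeSongStd-Light.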


Observations:
-----------
    (see also  http://lists.freedesktop.org/archives/poppler/2008-November/004220.html)

   pdftotext worked fine for me,
       both with Poppler v0.8.2 and Poppler v0.10.0.

   However, there were problems with readme.pdf
   when using other software, e.g.:

       Adobe Reader v8.1.0 and v9.0.0
       both showed just blank pages;

       Adobe Acrobat Pro v8.1.2
       displayed the PDF just fine;

       Preview (MacOS X, v10.4.11)
       displayed the PDF just fine.


   pdftohtml translated the PDF to a 2-page HTML document with frames,
       *but* with the errors in the <title> tags and "Document Outline" described above.

Comment 1 Albert Astals Cid 2011-06-21 11:46:09 UTC
Will be fixed in poppler 0.17.2
