Bug 18460 - pdftohtml puts garbage into <title> tags and "Document Outline"
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: Other Mac OS X (All)
Importance: low minor
Assignee: poppler-bugs
URL: http://www.maths.mq.edu.au/~ross/popp...
 
Reported: 2008-11-09 16:38 UTC by Ross Moore
Modified: 2011-06-21 11:46 UTC (History)

Attachments
Zhang Peng's PDF; it contains a Chinese font (11.02 KB, application/octet-stream)
2008-11-09 16:38 UTC, Ross Moore

Description Ross Moore 2008-11-09 16:38:38 UTC
Created attachment 20170
Zhang Peng's PDF; it contains a Chinese font

With the attached PDF (supplied by Zhang Peng for another purpose, see
      http://lists.freedesktop.org/archives/poppler/2008-November/004216.html )
pdftohtml fails to set the <title> tags correctly: the title text begins with the bytes <FE><FF>, which are invalid in UTF-8.

Within the "Document Outline" section, both entries start with the same bytes, and the
first is followed by more garbage.
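
For context: FE FF is the UTF-16BE byte order mark, which the PDF specification uses to mark a Unicode text string. The symptom is therefore consistent with a UTF-16BE string being copied into the HTML without conversion, though this report does not confirm the cause. The sketch below shows the decoding step in Python. It is not poppler's code: the function name `decode_pdf_text_string` and the sample bytes are hypothetical, and Latin-1 is only an approximate stand-in for PDFDocEncoding.

```python
UTF16BE_BOM = b"\xfe\xff"


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string (e.g. a title or outline entry) to str."""
    if raw.startswith(UTF16BE_BOM):
        # Unicode text string: drop the 2-byte marker, decode the rest.
        return raw[len(UTF16BE_BOM):].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding for this sketch.
    return raw.decode("latin-1")


# Hypothetical title bytes: the marker followed by U+4E2D U+6587.
title = decode_pdf_text_string(b"\xfe\xff\x4e\x2d\x65\x87")

# Re-encoding the decoded string gives valid UTF-8 for the <title> tag.
html_bytes = ("<title>" + title + "</title>").encode("utf-8")
```

Once the string is decoded first and then encoded as UTF-8, the FE FF marker no longer appears in the output.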

This can be seen at the URL stated for this bug report:
     http://www.maths.mq.edu.au/~ross/poppler/ZhangPeng/readme.html
(You may need to set the encoding manually to UTF-8.)
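
One way to confirm the problem without relying on a browser's encoding menu is to scan the generated HTML for bytes that cannot be decoded as UTF-8. The following is a minimal Python sketch; the helper name `find_invalid_utf8` and the sample bytes are hypothetical and are not taken from the actual pdftohtml output.

```python
def find_invalid_utf8(data: bytes) -> list:
    """Return the offsets at which UTF-8 decoding of data fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start
            offsets.append(bad)
            pos = bad + 1  # skip the offending byte and keep scanning
    return offsets


# Hypothetical sample resembling the reported symptom.
sample = b"<title>\xfe\xffreadme</title>"
bad_offsets = find_invalid_utf8(sample)
```

For a real check, the same function could be applied to the bytes read from the generated HTML file opened in binary mode.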


Facts:
-----
The document contains Chinese characters, with the following font info:

<</Subtype/Type0
/DescendantFonts 33 0 R
/BaseFont/AdobeSongStd-Light
/Encoding/UniGB-UCS2-H
/Type/Font>>

There is no embedded CMap resource:

> pdffonts readme.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
AdobeSongStd-Light                   CID Type 0        no  no  no      32  0


Observations:
-----------
(See also http://lists.freedesktop.org/archives/poppler/2008-November/004220.html)

pdftotext worked fine for me, with both Poppler v0.8.2 and Poppler v0.10.0.

Other software gave mixed results with readme.pdf:

   Adobe Reader v8.1.0 and v9.0.0
       both showed just blank pages

   Adobe Acrobat Pro v8.1.2
       displayed the PDF just fine

   Preview (Mac OS X, v10.4.11)
       displayed the PDF just fine

pdftohtml translated the PDF to a 2-page HTML document with frames,
*but* there were some errors, as described above.

Comment 1 Albert Astals Cid 2011-06-21 11:46:09 UTC
Will be fixed in poppler 0.17.2

