Bug 37900 - pdftotext -htmlmeta and pdftohtml fail to decode U+2019
Status: RESOLVED FIXED
Alias: None
Product: poppler
Classification: Unclassified
Component: general
Version: unspecified
Hardware: All All
Importance: medium normal
Assignee: poppler-bugs
Keywords: patch
 
Reported: 2011-06-03 16:17 UTC by Steven Murdoch
Modified: 2011-06-20 15:26 UTC

Attachments
Fix pdftotext -htmlmeta to correctly output U+2019 in PDF metadata (1.53 KB, patch)
2011-06-03 16:17 UTC, Steven Murdoch
Test case demonstrating problem with U+2019 in title (28.53 KB, application/x-pdf)
2011-06-04 03:24 UTC, Steven Murdoch
Fix encoding of PDF document metadata in output of pdftohtml (3.54 KB, patch)
2011-06-06 17:05 UTC, Steven Murdoch

Description Steven Murdoch 2011-06-03 16:17:16 UTC
Created attachment 47501
Fix pdftotext -htmlmeta to correctly output U+2019 in PDF metadata

pdftotext -htmlmeta is supposed to parse the PDF metadata and output it as HTML metadata. It generally works, but fails when decoding U+2019 (right single quotation mark).

This happens because the PDF document encoding (PDFDocEncoding) assigns characters to some of the areas that ISO 8859-1 reserves, so U+2019 may be encoded in PDF documents as the single byte 0x90. pdfinfo decodes this correctly, so I have attached a patch that makes pdftotext use the same approach as pdfinfo. pdftohtml has the same problem, but I haven't attempted to fix it.
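
The mismatch described above can be reproduced outside poppler. The following is a minimal Python sketch, not poppler's code: `PDFDOC_OVERRIDES` and `decode_pdfdoc` are hypothetical names, and the table covers only four quotation-mark bytes of PDFDocEncoding rather than the full encoding. It compares a naive ISO 8859-1 decode of the byte 0x90 with a PDFDocEncoding-aware decode.

```python
# Partial, illustrative subset of PDFDocEncoding (assumption: only these four
# quotation-mark bytes are listed; the real table has many more entries).
PDFDOC_OVERRIDES = {
    0x8D: "\u201C",  # left double quotation mark
    0x8E: "\u201D",  # right double quotation mark
    0x8F: "\u2018",  # left single quotation mark
    0x90: "\u2019",  # right single quotation mark
}


def decode_pdfdoc(raw: bytes) -> str:
    """Decode bytes, using the override table where PDFDocEncoding differs."""
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


title = b"pdftotext\x90s"

# Treating the byte as ISO 8859-1 yields the control character U+0090.
naive = title.decode("latin-1")

# Looking the byte up in the PDFDocEncoding subset yields U+2019.
fixed = decode_pdfdoc(title)

print(naive.encode("utf-8").hex(" "))
print(fixed.encode("utf-8").hex(" "))
```

In the UTF-8 output, the naive decode ends in `c2 90 73` and the table-based decode ends in `e2 80 99 73`.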
Comment 1 Albert Astals Cid 2011-06-04 02:57:26 UTC
Please attach a pdf with such a problem.
Comment 2 Steven Murdoch 2011-06-04 03:24:37 UTC
Created attachment 47513
Test case demonstrating problem with U+2019 in title

Attached as requested (generated by Word 2007 + Acrobat 9, the same as the document that was actually causing the problem).

$ pdftotext -htmlmeta /tmp/u2019test.pdf - | xxd | less
...
00000a0: 6d6c 223e 0a3c 6865 6164 3e0a 3c74 6974  ml">.<head>.<tit
00000b0: 6c65 3e54 6573 7420 6f66 2070 6466 746f  le>Test of pdfto
00000c0: 7465 7874 c290 7320 636f 6e76 6572 7369  text..s conversi
00000d0: 6f6e 206f 6620 552b 3230 3139 2e3c 2f74  on of U+2019.</t
...

[0xc2 0x90 is the UTF-8 encoding of U+0090]

$ pdfinfo /tmp/u2019test.pdf | xxd | less

...
0000000: 5469 746c 653a 2020 2020 2020 2020 2020  Title:          
0000010: 5465 7374 206f 6620 7064 6674 6f74 6578  Test of pdftotex
0000020: 74e2 8099 7320 636f 6e76 6572 7369 6f6e  t...s conversion
0000030: 206f 6620 552b 3230 3139 2e0a 4175 7468   of U+2019..Auth
0000040: 6f72 3a20 2020 2020 2020 2020 736a 6d32  or:         sjm2
...

[0xe2 0x80 0x99 is the UTF-8 encoding of U+2019]
Comment 3 Albert Astals Cid 2011-06-04 12:24:58 UTC
I've committed your patch and it will be in poppler >= 0.17.1.

If you are interested in fixing pdftohtml, we'd like a patch for it.
Comment 4 Steven Murdoch 2011-06-06 17:05:31 UTC
Created attachment 47627
Fix encoding of PDF document metadata in output of pdftohtml

pdftohtml simply copies the PDF document title into the <title> HTML
tag, which fails when the title is UCS-2 encoded, or if it contains
characters which are in pdfDocEncoding (an ISO 8859-1 superset), but not
in ISO 8859-1.  This patch fixes the problem by decoding UCS-2 or
pdfDocEncoding into Unicode, then encoding this in the desired output
encoding.  HTML escaping wasn't being done either, so I have used the
existing function HtmlFont::HtmlFilter to perform both HTML escaping
and character set encoding. This static method had to be made public
to call it from pdftohtml. See bug #37900.
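
The decode, escape, and re-encode sequence described in this comment can be sketched in Python as follows. This is an illustration under stated assumptions, not the patch itself and not the behaviour of HtmlFont::HtmlFilter: `pdf_text_to_unicode` and `html_title` are hypothetical names, the PDFDocEncoding table is reduced to two entries, and a leading FE FF byte order mark is assumed to mark a UCS-2 (UTF-16BE) string.

```python
import html

# Assumption: tiny illustrative subset of PDFDocEncoding, not the full table.
PDFDOC_OVERRIDES = {0x8F: "\u2018", 0x90: "\u2019"}


def pdf_text_to_unicode(raw: bytes) -> str:
    """Decode a PDF text string into Unicode."""
    if raw.startswith(b"\xfe\xff"):
        # A big-endian byte order mark signals a 16-bit Unicode string.
        return raw[2:].decode("utf-16-be")
    # Otherwise treat it as PDFDocEncoding (Latin-1 plus the overrides).
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


def html_title(raw: bytes, out_encoding: str = "utf-8") -> bytes:
    """Build a <title> element: decode, HTML-escape, then encode."""
    text = html.escape(pdf_text_to_unicode(raw))
    # Characters the output encoding cannot represent become numeric
    # character references instead of raising an error.
    return f"<title>{text}</title>".encode(out_encoding, errors="xmlcharrefreplace")


print(html_title(b"A & B\x90s"))
print(html_title(b"\xfe\xff\x20\x19", "latin-1"))
```

Escaping after decoding but before encoding means markup characters in the title are neutralised regardless of how the title was stored in the PDF.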
Comment 5 Albert Astals Cid 2011-06-20 15:26:28 UTC
Fix committed. Your help is appreciated; keep the patches coming :-)

