Bug 69454 - pdftohtml should include charset encoding in head section of *s.html files
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: Other All
Importance: medium enhancement
Assignee: poppler-bugs
Reported: 2013-09-17 08:36 UTC by Samuel Thibault
Modified: 2018-08-21 10:58 UTC


Attachments
test file (309.92 KB, application/octet-stream)
2013-09-17 08:36 UTC, Samuel Thibault

Description Samuel Thibault 2013-09-17 08:36:54 UTC
Created attachment 85950
test file

Hello,

After converting a PDF file to HTML, all the UTF-8 characters such
as ● are displayed incorrectly in the web browser, because the HTML
file does not declare its character encoding. pdftohtml should add
this inside its <head>:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

Pino Toscano added on http://bugs.debian.org/722281 that “This is
added already in some occasions, but apparently not in frames when
doing the "complex HTML output".”

For instance, after converting http://brl.thefreecat.org/ghm13.pdf
(also attached here), ghm13s.html does not contain any encoding.

Samuel
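
One way to check for the symptom described above is to look for a charset declaration near the top of each generated file. The sketch below is not part of poppler: the function name `declares_charset` and the regular expression are illustrative assumptions. The 1024-byte window reflects the HTML specification's requirement that the encoding declaration be serialized within the first 1024 bytes of the document.

```python
import re

# Matches a <meta> tag carrying a charset, in either the short form
# (<meta charset="utf-8">) or the http-equiv form
# (<meta http-equiv="content-type" content="text/html;charset=utf-8" />).
CHARSET_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?[\w-]+', re.IGNORECASE)


def declares_charset(html_bytes: bytes) -> bool:
    """Return True if a <meta> charset declaration appears in the first 1024 bytes."""
    return CHARSET_RE.search(html_bytes[:1024]) is not None
```

Applied to the raw bytes of `ghm13.html` and `ghm13s.html`, a check like this would be expected to distinguish the file that carries the meta tag from the one that does not, as reported here.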
Comment 1 Albert Astals Cid 2013-09-25 18:29:31 UTC
A patch for this should be pretty trivial. Any takers?
Comment 2 Albert Astals Cid 2013-12-12 21:03:56 UTC
Running pdftohtml on that file gives me a nice

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

So can you please try with a newer poppler version or tell us if you use any special command line arguments?
Comment 3 Samuel Thibault 2013-12-12 23:42:34 UTC
I'm not using any options. ghm13.html does have a content-type meta, but ghm13s.html does not. I have tested with both poppler 0.22.5 and poppler 0.24.4.
Comment 4 Albert Astals Cid 2013-12-12 23:46:17 UTC
And ghm13s.html is not the file you're supposed to open; you are supposed to open ghm13.html. So I don't see what the problem is.
Comment 5 Samuel Thibault 2013-12-15 16:49:09 UTC
I don't see why one shouldn't be able to open ghm13s.html directly. In my precise use case, it is on the contrary exactly the file I want to use: I don't want a slide bar on the left, a background image, etc.

At any rate, an HTML file should be self-contained anyway; there is no reason why each file shouldn't declare its own encoding.
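
Until each generated file declares its own encoding, a post-processing step can add the declaration. The following is a minimal sketch of such a workaround, not a poppler feature: the helper name `add_charset_meta` is hypothetical, and it assumes the output really is UTF-8 encoded, as this report indicates.

```python
import re

# The declaration that pdftohtml already emits in the main file (see Comment 2).
META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>'

# Opening <head> tag, with optional attributes; does not match <header>.
HEAD_RE = re.compile(r'<head(\s[^>]*)?>', re.IGNORECASE)
# Any existing <meta ... charset=...> declaration.
CHARSET_RE = re.compile(r'<meta[^>]+charset\s*=', re.IGNORECASE)


def add_charset_meta(html: str) -> str:
    """Insert a UTF-8 charset <meta> right after <head> if none is present."""
    if CHARSET_RE.search(html):
        return html  # already declares an encoding; leave untouched
    match = HEAD_RE.search(html)
    if match is None:
        return html  # no <head> element found; leave untouched
    return html[:match.end()] + "\n" + META + html[match.end():]
```

To apply it, one would read a file such as `ghm13s.html` as UTF-8 text, pass its contents through the function, and write the result back. Calling it a second time changes nothing, because the inserted tag is then detected.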
Comment 6 GitLab Migration User 2018-08-21 10:58:55 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/457.