Created attachment 85950
After converting a PDF file to HTML, all non-ASCII UTF-8 characters such
as ● are displayed incorrectly in the web browser, because the HTML file does
not declare its character encoding. pdftohtml should add
this inside its <head>:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
Pino Toscano added on http://bugs.debian.org/722281 that “This is
added already in some occasions, but apparently not in frames when
doing the "complex HTML output".”
For instance, after converting http://brl.thefreecat.org/ghm13.pdf
(also attached here), ghm13s.html does not contain any encoding.
A patch for this should be pretty trivial. Any takers?
Running pdftohtml on that file gives me a nice
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
So can you please try with a newer poppler version or tell us if you use any special command line arguments?
I'm not using any options. ghm13.html does have a content-type meta, but ghm13s.html does not. I have tested with both poppler 0.22.5 and poppler 0.24.4.
And ghm13s.html is not the file you're supposed to open; you are supposed to open ghm13.html. So I don't see what the problem is.
I don't see why one shouldn't be able to open ghm13s.html directly. In the precise use case I have, it's on the contrary what I do want to use, I don't want a slide bar on the left and background image etc.
At any rate, an HTML file should be self-contained anyway; there is no reason why each file shouldn't declare its own encoding.
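Until every generated file declares its own encoding, one possible workaround is to post-process the output. The following is a minimal sketch in Python, not part of poppler: the function name `ensure_charset` is hypothetical, and it assumes the generated file really is UTF-8 and contains an opening <head> tag. It inserts the same meta element quoted earlier in this report, and only when no charset declaration is already present.

```python
import re

# The meta element that pdftohtml already emits in the top-level file.
META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>'


def ensure_charset(html: str) -> str:
    """Return html with a UTF-8 content-type meta added after <head>.

    The input is returned unchanged if it already declares a charset
    in a <meta> element, or if it has no opening <head> tag.
    """
    # Leave files alone that already declare a charset in a meta element.
    if re.search(r'<meta[^>]*charset\s*=', html, flags=re.IGNORECASE):
        return html
    # Insert the declaration right after the first opening <head> tag.
    # The pattern requires whitespace or '>' after "head", so <header>
    # is not matched by mistake.
    return re.sub(
        r'<head(?:\s[^>]*)?>',
        lambda m: m.group(0) + '\n' + META,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
```

To apply it to a file such as ghm13s.html, read the file with encoding="utf-8", pass the text through `ensure_charset`, and write the result back with the same encoding. Running it twice is harmless, since the second pass finds the existing declaration and changes nothing.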
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/457.