39385 – pdftohtml: add image and font extraction

Status:	RESOLVED MOVED

Product:	poppler
Classification:	Unclassified
Component:	utils
Version:	unspecified
Hardware:	Other All

Importance:	medium enhancement
Assignee:	poppler-bugs

Reported:	2011-07-19 14:02 UTC by Joshua Richardson
Modified:	2018-08-21 10:54 UTC
CC List:	2 users

Attachments
patches to resolve this enhancement (78.42 KB, application/x-gzip) 2011-07-19 14:02 UTC, Joshua Richardson
Only the patches we've done for this feature (64.52 KB, application/x-gzip) 2011-07-28 11:47 UTC, Joshua Richardson
More tweaks; greatly improve rotation / spacing (54.79 KB, application/x-gzip) 2011-08-16 16:34 UTC, Joshua Richardson
New patches, per Albert's request (126.78 KB, patch) 2011-08-25 16:34 UTC, Joshua Richardson

Description Joshua Richardson 2011-07-19 14:02:37 UTC

Created attachment 49314
patches to resolve this enhancement

Instead of generating one large background image per page, pdftohtml now generates smaller images, each just large enough to capture the graphical elements, and only on pages that actually contain graphical elements.

In addition, the images may optionally be generated at a higher resolution, so that they can be viewed enlarged while still appearing at the same size in the generated HTML.  We added a new "-dpi" switch for this.
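As a rough illustration of the "-dpi" idea (not code from the patches), the sketch below assumes the default PDF user-space unit of 1/72 inch. The helper name `img_tag` and its parameters are hypothetical; the point is only that the pixel size of the stored image scales with the requested dpi while the size declared in the HTML does not.

```python
def img_tag(src, width_pt, height_pt, dpi=72):
    """Hypothetical helper: size an extracted image for file and for HTML.

    width_pt / height_pt are the image's dimensions on the page in points
    (1/72 inch). Returns the pixel size to render at and an <img> tag.
    """
    # Pixels actually stored in the image file grow with the dpi setting.
    pixel_w = round(width_pt * dpi / 72)
    pixel_h = round(height_pt * dpi / 72)
    # The size declared in the HTML is independent of dpi, so the page
    # layout is unchanged; the extra pixels only matter when zooming.
    tag = '<img src="%s" width="%d" height="%d"/>' % (
        src, round(width_pt), round(height_pt))
    return (pixel_w, pixel_h), tag


pixels, tag = img_tag("page1-1.png", 144, 72, dpi=300)
```

With these inputs, a 2 x 1 inch graphic is rendered at 600 x 300 pixels but is still laid out as 144 x 72 in the HTML.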

Finally, there is a new option "-embedfonts" which makes the generated HTML use extracted fonts.  (For now, you'll have to use another utility like mu pdfextract to actually extract those fonts.)

These features were all built in parallel, so the easiest way to merge them back to the poppler public repository will probably be to apply all the patches together in order.  In the tarball, I've also included other patches from the public repository that landed while we were developing, in the hope that this makes it easier to figure out how to apply the patches in order without conflicts.  To avoid confusion: the only patches relevant to this enhancement bug are the ones authored by Joshua Richardson or Stephen Reichling.

Comment 1 Albert Astals Cid 2011-07-28 03:55:25 UTC

I'm confused by the patches attached here, since some of them are already on poppler master. Can you please send patches on top of master?

Comment 2 Joshua Richardson 2011-07-28 11:47:14 UTC

Created attachment 49682
Only the patches we've done for this feature

Let me know if this works!

Comment 3 Joshua Richardson 2011-08-16 16:34:20 UTC

Created attachment 50288
More tweaks; greatly improve rotation / spacing

In addition to greatly improving the rotation and spacing for pdftohtml's -embedfonts option, we now actually extract the fonts, so no 3rd-party utility is required to get font perfection in the HTML.

Comment 4 Albert Astals Cid 2011-08-18 12:02:59 UTC

Some comments:
 * Please attach only one set of consecutive patches, having two sets of patches that have "conflicting" order (e.g. there is a patch with number 10 in both tars) is difficult to follow
 * Please rebase the patches against master since the first set of patches include parts of the text rotation feature and other patches that are already committed to master
 * Please squash commits 0007-created-SplashOutputDevHtmlImages-class.patch, 0008-Fixed-spacing-line-length-issues.patch and 0009-Moved-SplashOutputDevHtmlImages-into-the-utils-direc.patch since there is no need for us to see you moved the code around
 * Please remove the .gitignore and README.contributors and similar changes from 0009, they have nothing to do with this feature, we might or might not want them but cramming everything into a single bugreport makes it almost impossible to review in a timely fashion
 * 0010-Add-background-color-to-main-div-so-it-doesn-t-rely-.patch Seems like a non related feature and i'd prefer it to be sent separately, and giving the user the option to set the background color (i can easily imagine situations where the background color is not white)
 * Please do not break the encoding of the files, e.g.
-// Copyright (C) 2010 Christian Feuersänger <cfeuersaenger@googlemail.com>
+// Copyright (C) 2010 Christian Feuersï¿œnger <cfeuersaenger@googlemail.com>
is not good
 * Please also squash unneeded commits like 0015-Modified-Splash.cc-to-store-coordinates-of-most-rece.patch that you seem to revert in 0019-Images-now-spliced-out-of-splashed-and-output-to-fil.patch; it is not fun to review a patch just to discover the code just disappears in the next commit
 * 0039-Speed-up-the-extraction-of-images-by-40.patch includes adding a debug option to GlobalParams, that is unwanted and has nothing to do with the definition of the patch "Subject: [PATCH 39/52] Speed up the extraction of images by 40%"
 * 0043-Clean-up-debug-output.-Remove-a-few-compiler-warning.patch contains a fetch-user-archives.pl that is unwanted and again has nothing to do with the definition of the patch "[PATCH 43/52] Clean up debug output.  Remove a few compiler warnings."

Please fix these issues and I'll have a look once you provide a new set of patches.

Comment 5 Joshua Richardson 2011-08-25 16:34:21 UTC

Created attachment 50579
New patches, per Albert's request

Some comments:
 * Please attach only one set of consecutive patches, having two sets of
patches that have "conflicting" order (e.g. there is a patch with number 10 in
both tars) is difficult to follow

>> Done

* Please rebase the patches against master since the first set of patches
include parts of the text rotation feature and other patches that are already
committed to master

>> Done.  When I run "git format-patch origin", I still get some patches that were in those earlier bugs.  I think it's because you "cleaned up" my earlier patches when committing, so it shouldn't be a problem, and I just removed them.

 * Please squash commits 0007-created-SplashOutputDevHtmlImages-class.patch,
0008-Fixed-spacing-line-length-issues.patch and
0009-Moved-SplashOutputDevHtmlImages-into-the-utils-direc.patch since there is
no need for us to see you moved the code around

>> Done.  Heads-up that automatic rebase worked flawlessly, with no merge conflicts.  When I did a subsequent interactive rebase, git got totally confused, and I spent an entire day resolving merge conflicts, and still arrived at a slightly different result, which I went back and manually corrected.  But, if I hadn't done the automatic rebase first, there's a good chance I would have never found out about those errors.  Do you have any pointers on how to make this smoother?

 * Please remove the .gitignore and README.contributors and similar changes
from 0009, they have nothing to do with this feature, we might or might not
want them but cramming everything into a single bugreport makes it almost
impossible to review in a timely fashion

>> Done.

 * 0010-Add-background-color-to-main-div-so-it-doesn-t-rely-.patch Seems like a
non related feature and i'd prefer it to be sent separately, and giving the
user the option to set the background color (i can easily imagine situations
where the background color is not white)

>> 0010 is integral to the image extraction feature, because without it, parts of the page without graphics on them will not have a background color set, and may default to gray on some browsers.  This change ensures that the appearance of the document is the same as before image-extraction was implemented, and that the document will appear as it does in any PDF reader, since AFAIK, they all assume a white background.  If a colored background is added to the PDF, it will still be extracted and appear correctly in the HTML.  Adding an option to override the background color in the HTML might be interesting for someone, but it holds no allure for me, and I don't know who would use it.  It would involve making changes to the guts of Splash (so that the background of areas that are painted will match the broader background.)

 * Please do not break the encoding of the files, e.g.
-// Copyright (C) 2010 Christian Feuersänger <cfeuersaenger@googlemail.com>
+// Copyright (C) 2010 Christian Feuersï¿œnger <cfeuersaenger@googlemail.com>
is not good

>> Yes, sorry.  Modern tools assume UTF-8 -- not sure which one broke it.  Unfortunately that character is encoded like the leading byte of a three-byte UTF-8 character.  We should probably have a UTF-8 convention.
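To illustrate the encoding problem described above, here is a minimal Python sketch. It assumes the header was originally stored as Latin-1 (ISO-8859-1), where "ä" is the single byte 0xE4; that assumption, and the helper names, are mine and not from the patches. The bit pattern of 0xE4 is 1110 0100, which is the form of a UTF-8 three-byte lead byte, so a tool that assumes UTF-8 cannot decode it.

```python
def is_three_byte_lead(byte: int) -> bool:
    """True if the byte has the form 1110xxxx, a UTF-8 three-byte lead."""
    return (byte & 0xF0) == 0xE0


def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that are assumed to be Latin-1 as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


# The surname as it would be stored in a Latin-1 file (assumed encoding).
latin1_name = "Feuers\u00e4nger".encode("latin-1")

# Decoding those bytes as UTF-8 fails at 0xE4: it announces two
# continuation bytes, but a plain ASCII "n" follows. A lenient tool
# substitutes U+FFFD, which is how the name ends up damaged.
damaged = latin1_name.decode("utf-8", errors="replace")

# Converting explicitly from the real source encoding keeps the character.
repaired = latin1_to_utf8(latin1_name).decode("utf-8")
```

Running a check like this over touched files before generating patches would be one way to enforce a UTF-8 convention.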

* Please also squash unneeded commits like
0015-Modified-Splash.cc-to-store-coordinates-of-most-rece.patch that you seem
to revert in 0019-Images-now-spliced-out-of-splashed-and-output-to-fil.patch it
is not fun to review a patch just to discover the code just disappears in the
next commit

>> Done.

 * 0039-Speed-up-the-extraction-of-images-by-40.patch includes adding a debug
option to GlobalParams, that is unwanted and has nothing to do with the
definition of the patch "Subject: [PATCH 39/52] Speed up the extraction of
images by 40%"

>> Done; I have removed it.  I hope you may reconsider, because I think that having a debug switch is a useful diagnostic tool, and it is common in complex systems.  It was useful to me in diagnosing the speedup.

 * 0043-Clean-up-debug-output.-Remove-a-few-compiler-warning.patch contains a
fetch-user-archives.pl that is unwanted and again has nothing to do with the
definition of the patch "[PATCH 43/52] Clean up debug output.  Remove a few
compiler warnings."

>> Removed it.

Please fix these issues and I'll have a look once you provide a new set of patches.

>> Thanks Albert!!!

Comment 6 suzuki toshiya 2011-08-28 20:06:03 UTC

Dear Josh,

I really apologize that I have been (and still am) unable to help
with your great effort to improve pdftohtml.

I've just checked the support code for the embedded font feature,
and I want to ask a question and give a few comments.

1) Basically, the embedded font extraction feature seems to be
designed to save an embedded font stream as a separate file.
If the embedded font lacks the essential tables that are required
for a self-standing font file (e.g. an embedded TrueType font
may lack the "cmap" table), is that out of scope? Sorry, I'm not
requesting that such fonts be supported,
I just want to ask.

2) I'm not so familiar with the coverage of the font formats supported
by HTML rendering systems, but I guess some font formats are
not widely supported by web browsers. I'm not against the idea
of extracting all possible fonts from the PDF (extracting everything is far
more intuitive than selective extraction), but users would expect some
warning to indicate that "font XXX is extracted but your HTML browser may
not be able to use it".

* Type3: In a Type3 font embedded in a PDF, any PDF graphics operation
can be used. Thus, rendering a Type3 font requires what amounts to another
PDF rendering system. Considering that most web browsers don't have
a built-in PDF renderer, Type3 fonts can't be used correctly.
In fact, the FreeType font rasterizer does not support PS or PDF Type3.

* CIDType0, CIDType2: Maybe you know that a CID-keyed font is designed
to be used with a CMap resource that translates character codes to CID
numbers (the glyph identifiers in a CID-keyed font), and the CMap may or may
not be embedded in the PDF document. Thus, it is possible to say a CID-keyed
font is not self-standing. Although there was once a patch for
FreeType2 to combine a CID-keyed font and a CMap into a self-standing
face object, it has not been adopted yet (not refused, but considered
"more TODO"). I'm afraid that most web browsers assume a simple
font loading mechanism like "give a font file pathname, and an index
to specify the face in a TTC (or font suitcase) if required, then get a
self-standing face object", so they cannot support CIDType0 or CIDType2.

* CFF: I think most FreeType-based applications don't distinguish
CFF from other PS Type1 fonts (PFA/PFB), but Microsoft Windows
supports only PFB; PFA and CFF are not supported (although OpenType
including CFF is supported!). I have not yet checked Mac OS X.
One of the problems is that Adobe Acrobat (on Microsoft Windows)
transforms PFB fonts to CFF fonts when it embeds PFB fonts into
a PDF, oops.

3) About GfxFont::getFileExtension(), some correction is recommended.
In particular, most font formats designed for the PostScript language lack
a definition of standard suffixes. I think...

* the extension for "fontType1" may be "pfa" or "pfb". Although I
don't have a good reference PDF generator that embeds PS Type1 as
PS Type1, checking the header is recommended to determine the appropriate
suffix. However, I'm not sure whether any software goes wrong when PFB
fonts are given the suffix PFA. FreeType does not care, and most
systems that care about PFA vs. PFB are presumably systems supporting
PFB only or PFA only.

* the extension for "fontType3" is not officially standardized by the spec author.

* the extension for "CIDType0" is not officially standardized by the spec author.

* the extension for "CIDType2" is not officially standardized by the spec author.

* the extension ".otf" must be used for fonts that include a "CFF" table,
so it should not be used for "fontTrueTypeOT" or "fontCIDType2OT", which
use "glyf" instead of "CFF ".

Comment 7 Joshua Richardson 2011-08-30 11:48:29 UTC

(In reply to comment #6)
> 1) Basically, the embedded font extraction feature seems to be
> designed to save an embedded font stream as a separate file.
> If the embedded font lacks the essential tables that are required
> for a self-standing font file (e.g. an embedded TrueType font
> may lack the "cmap" table), is that out of scope? Sorry, I'm not
> requesting that such fonts be supported,
> I just want to ask.
No, this is not out of scope for me.  If I have not already, I will soon submit new patches which export other font data, which can be used to create a free-standing font.

> 2) I'm not so familiar with the coverage of the font formats supported
> by HTML rendering systems, but I guess some font formats are
> not widely supported by web browsers. I'm not against the idea
> of extracting all possible fonts from the PDF (extracting everything is far
> more intuitive than selective extraction), but users would expect some
> warning to indicate that "font XXX is extracted but your HTML browser may
> not be able to use it".
I like the idea, but I think it's the user's responsibility to understand the options that he uses.  I don't want to punish users who understand what they're doing with an annoying warning message.  Once the fonts are extracted, they must be converted to formats that the various browsers can use in order to work on those browsers.  Thank you for reminding me to document this.  I will put it into the utils/README.pdftohtml document in a future patch.

> * Type3: In a Type3 font embedded in a PDF, any PDF graphics operation
> can be used. Thus, rendering a Type3 font requires what amounts to another
> PDF rendering system. Considering that most web browsers don't have
> a built-in PDF renderer, Type3 fonts can't be used correctly.
> In fact, the FreeType font rasterizer does not support PS or PDF Type3.
Yes, that is currently a limitation.  Luckily, Type 3 fonts are exceedingly rare in the domain I care about.  The best solution I can think of is to do any drawing operations with Type 3 fonts as an image, instead of extracting as text.  Then use "alt" text in the HTML for that image.

> * CIDType0, CIDType2: Maybe you know that a CID-keyed font is designed
> to be used with a CMap resource that translates character codes to CID
> numbers (the glyph identifiers in a CID-keyed font), and the CMap may or may
> not be embedded in the PDF document. Thus, it is possible to say a CID-keyed
> font is not self-standing. Although there was once a patch for
> FreeType2 to combine a CID-keyed font and a CMap into a self-standing
> face object, it has not been adopted yet (not refused, but considered
> "more TODO"). I'm afraid that most web browsers assume a simple
> font loading mechanism like "give a font file pathname, and an index
> to specify the face in a TTC (or font suitcase) if required, then get a
> self-standing face object", so they cannot support CIDType0 or CIDType2.
For my application, we are converting everything to Unicode (default pdftohtml encoding).  In later patches we have modified pdftohtml to ensure that everything has a valid and unique Unicode mapping.  Then, we use a FontForge script to ensure that the font contains that mapping when converting it to browser-compatible formats.

> * CFF: I think most FreeType-based applications don't distinguish
> CFF from other PS Type1 fonts (PFA/PFB), but Microsoft Windows
> supports only PFB; PFA and CFF are not supported (although OpenType
> including CFF is supported!). I have not yet checked Mac OS X.
> One of the problems is that Adobe Acrobat (on Microsoft Windows)
> transforms PFB fonts to CFF fonts when it embeds PFB fonts into
> a PDF, oops.
These patches assume that all fonts will be converted to the appropriate type for the given browsers: WOFF for good browsers, EOT for IE, etc.

> 3) About GfxFont::getFileExtension(), some correction is recommended.
> In particular, most font formats designed for the PostScript language lack
> a definition of standard suffixes. I think...
> 
> * the extension for "fontType1" may be "pfa" or "pfb". Although I
> don't have a good reference PDF generator that embeds PS Type1 as
> PS Type1, checking the header is recommended to determine the appropriate
> suffix. However, I'm not sure whether any software goes wrong when PFB
> fonts are given the suffix PFA. FreeType does not care, and most
> systems that care about PFA vs. PFB are presumably systems supporting
> PFB only or PFA only.
Thanks, I'll keep this in mind if we run into trouble.

> * the extension for "fontType3" is not officially standardized by the spec
> author.
>
> * the extension for "CIDType0" is not officially standardized by the spec
> author.
> 
> * the extension for "CIDType2" is not officially standardized by the spec
> author.
Good to know.  Hopefully these also don't create problems.

> * the extension ".otf" must be used for fonts that include a "CFF" table,
> so it should not be used for "fontTrueTypeOT" or "fontCIDType2OT", which
> use "glyf" instead of "CFF ".
Can you give me a reference here?  I thought that ".otf" stands for "Open Type Font" and that both "fontTrueTypeOT" and "fontCIDType2OT" are variants of the Open Type.

Thanks for your help!!!

Comment 8 suzuki toshiya 2011-08-30 19:40:39 UTC

>> * the extension ".otf" must be used for fonts that include a "CFF" table,
>> so it should not be used for "fontTrueTypeOT" or "fontCIDType2OT", which
>> use "glyf" instead of "CFF ".
>Can you give me a reference here?  I thought that ".otf" stands for "Open Type
>Font" and that both "fontTrueTypeOT" and "fontCIDType2OT" are variants of the
>Open Type.

Your understanding that the "fontXXXOT" types are variants of OpenType
is correct. But the suffix for OpenType is not always ".otf". According to the
OpenType recommendations (http://www.microsoft.com/typography/otspec/recom.htm),
* a font including "CFF" should use ".otf"
* a font including "glyf" and OpenType layout features may use ".otf" or ".ttf".
Some implementations have backward-compatibility issues with fonts that include
"glyf" and OpenType layout tables when the suffix is ".otf".
* a font including "glyf" and no OpenType layout features should use ".ttf".
I should also note that many PDF production systems with a subsetting feature
remove the OpenType layout tables when they embed fonts into a PDF.
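The suffix rules above could be sketched roughly as follows. This is not poppler code: the function names are hypothetical, and treating GSUB, GPOS, GDEF, BASE and JSTF as "the OpenType layout tables" is my reading of the recommendation. The directory parsing relies only on the standard sfnt layout (a 12-byte header with the table count at offset 4, followed by 16-byte table records that each start with a 4-byte tag).

```python
import struct

# Assumed set of tags that count as OpenType layout tables.
LAYOUT_TAGS = {"GSUB", "GPOS", "GDEF", "BASE", "JSTF"}


def sfnt_table_tags(data: bytes) -> set:
    """Collect the table tags listed in an sfnt table directory."""
    if len(data) < 12:
        raise ValueError("too short to be an sfnt font")
    (num_tables,) = struct.unpack(">H", data[4:6])
    tags = set()
    for i in range(num_tables):
        record = data[12 + 16 * i:12 + 16 * (i + 1)]
        if len(record) < 16:
            raise ValueError("truncated table directory")
        tags.add(record[:4].decode("latin-1"))
    return tags


def allowed_suffixes(tags: set) -> list:
    """Suffixes permitted by the rules quoted above, preferred one first."""
    if "CFF " in tags:
        return [".otf"]
    if "glyf" in tags:
        if tags & LAYOUT_TAGS:
            # Either is allowed; ".ttf" comes first because some
            # implementations reportedly mishandle "glyf" fonts named ".otf".
            return [".ttf", ".otf"]
        return [".ttf"]
    raise ValueError("neither 'CFF ' nor 'glyf' table present")
```

Since subsetting often strips the layout tables, a "glyf"-based font extracted from a PDF would usually end up with ".ttf" under this scheme.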

Comment 9 suzuki toshiya 2011-08-31 00:31:42 UTC

>> * the extension for "fontType1" may be "pfa" or "pfb". Although I
>> don't have a good reference PDF generator that embeds PS Type1 as
>> PS Type1, checking the header is recommended to determine the appropriate
>> suffix. However, I'm not sure whether any software goes wrong when PFB
>> fonts are given the suffix PFA. FreeType does not care, and most
>> systems that care about PFA vs. PFB are presumably systems supporting
>> PFB only or PFA only.
>Thanks, I'll keep this in mind if we run into trouble.

If you're interested, maybe I can contribute a small patch to check
whether the embedded font stream is PFA or PFB. According to Adobe TechNote
5040: "Supporting Downloadable PostScript Language Fonts" (1992),
section 3.3 "IBM PC Format", the leading bytes of PFB (binary format)
would be 0x80 0x01 ...; such bytes should not appear in PFA (ASCII format).
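A minimal sketch of such a check, written in Python for brevity rather than as a poppler patch: the function name is hypothetical, and falling back to ".pfa" for anything that does not start with 0x80 0x01 is an assumption of this sketch, not something the TechNote citation above establishes.

```python
def type1_suffix(stream: bytes) -> str:
    """Guess a file suffix for an embedded PS Type1 font stream.

    A stream starting with 0x80 0x01 is treated as PFB (binary format).
    Everything else is assumed to be PFA (ASCII format); this sketch does
    not validate that the data is a Type1 font at all.
    """
    if stream[:2] == b"\x80\x01":
        return ".pfb"
    return ".pfa"


pfb_like = b"\x80\x01" + b"\x00" * 4 + b"%!PS-AdobeFont-1.0"
pfa_like = b"%!PS-AdobeFont-1.0: Example 001.000\n"
suffixes = (type1_suffix(pfb_like), type1_suffix(pfa_like))
```

Only the first two bytes are inspected, so the check is cheap enough to run on every extracted Type1 stream.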

Comment 10 marcin.karpeta 2011-10-12 04:28:52 UTC

I've tried to apply this patch but failed. It doesn't apply to master, and no base version is given. Could you describe how to apply these patches? Is it enough to clone the master repo and apply the patches with git-am?

Thanks for your time,
Martin

Comment 11 Joshua Richardson 2012-01-10 14:15:25 UTC

Sorry Marcin, this patch was created off of master prior to version 0.18 -- not sure exactly what version it was against, but you should be able to guess it based on the date of submission of the patch.

Comment 12 GitLab Migration User 2018-08-21 10:54:25 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/422.