Bug 106444 - pdftocairo -pdf output breaks extracted text
Summary: pdftocairo -pdf output breaks extracted text
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: utils
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2018-05-08 20:58 UTC by nopbin+freedeskbugs
Modified: 2018-08-21 11:19 UTC
CC List: 0 users

Attachments
PDFs, original and outputs from Ubuntu and Mac. Extracted text original and Ubuntu optimized. (1020.41 KB, application/zip)
2018-05-08 20:58 UTC, nopbin+freedeskbugs

Description nopbin+freedeskbugs 2018-05-08 20:58:11 UTC
Created attachment 139431
PDFs, original and outputs from Ubuntu and Mac. Extracted text original and Ubuntu optimized.

On Ubuntu 16.04, processing certain PDFs with pdftocairo -pdf (both version 0.41.0 from the package and 0.64.0 built from source) produces a PDF whose extracted text appears as question mark symbols, which suggests a text encoding problem.  The rendered image output appears correct.

I first noticed the problem with the extracted text while programmatically processing the text layer rendered by pdf.js, and then confirmed the behavior in the output of pdftotext. The same thing happens when copying text from other PDF viewers.
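
For anyone trying to reproduce this, the check described above can be sketched roughly as follows. This is only a sketch under assumptions: the helper names are made up for illustration, and I am assuming the broken text shows up either as a literal "?" or as U+FFFD, depending on the extractor. It relies only on `pdftocairo -pdf <in> <out>` and `pdftotext <file> -` (text to stdout).

```python
import subprocess
from typing import Tuple

# Characters treated as "failed to decode" markers. Which one appears depends
# on the extractor, so both are counted here (an assumption, not a poppler rule).
SUSPECT_CHARS = {"?", "\ufffd"}


def suspect_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look undecodable."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch in SUSPECT_CHARS for ch in visible) / len(visible)


def extract_text(pdf_path: str) -> str:
    """Extract text with pdftotext, writing to stdout ('-')."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def roundtrip_ratios(original_pdf: str, converted_pdf: str) -> Tuple[float, float]:
    """Run pdftocairo -pdf, then compare extracted text before and after."""
    subprocess.run(["pdftocairo", "-pdf", original_pdf, converted_pdf], check=True)
    return (
        suspect_ratio(extract_text(original_pdf)),
        suspect_ratio(extract_text(converted_pdf)),
    )
```

Calling `roundtrip_ratios("original.pdf", "converted.pdf")` (placeholder file names) returns the ratio before and after conversion. A legitimate "?" in the document is counted too, so the second number is only meaningful relative to the first.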

Interestingly, when the same PDF is processed on a Mac with pdftocairo (0.64.0), the text extracted from the output PDF appears *correct*.  I am not sure whether it is relevant, but in the attached example I do observe some differences in the font encoding, as shown below.


pdffonts from original PDF:

    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    FFXDHY+ArialMT                       TrueType          MacRoman         yes yes no      10  0
    EESSLH+Helvetica                     TrueType          WinAnsi          yes yes yes      9  0



pdffonts after processing on Ubuntu:

    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    DFUWOB+ArialMT                       CID TrueType      Identity-H       yes yes yes      5  0


pdffonts after processing on Mac:

    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    DFUWOB+ArialMT                       TrueType          WinAnsi          yes yes yes      5  0
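
To make the difference between the three listings easier to compare mechanically, here is a small sketch that parses pdffonts output and reports type and encoding per font. The function names are hypothetical, and the row pattern assumes that font names and encodings contain no spaces while the type may (for example "CID TrueType"); that holds for the listings above but is not guaranteed in general.

```python
import re
from typing import Dict, List, Tuple

# One pdffonts data row: name, type (may contain spaces), encoding,
# three yes/no flags (emb, sub, uni), then object number and generation.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<type>.+?)\s+(?P<encoding>\S+)\s+"
    r"(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)\s+"
    r"(?P<obj>\d+)\s+(?P<gen>\d+)$"
)


def parse_pdffonts(output: str) -> List[Dict[str, str]]:
    """Parse pdffonts table text into one dict per font, skipping header lines."""
    fonts = []
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            fonts.append(match.groupdict())
    return fonts


def base_name(font_name: str) -> str:
    """Drop the subset tag, e.g. 'FFXDHY+ArialMT' -> 'ArialMT'."""
    return font_name.split("+", 1)[-1]


def encodings(output: str) -> Dict[str, Tuple[str, str]]:
    """Map base font name to (type, encoding)."""
    return {
        base_name(f["name"]): (f["type"], f["encoding"])
        for f in parse_pdffonts(output)
    }


# Data rows copied from the three listings in this report.
ORIGINAL = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
FFXDHY+ArialMT                       TrueType          MacRoman         yes yes no      10  0
EESSLH+Helvetica                     TrueType          WinAnsi          yes yes yes      9  0
"""
UBUNTU = "DFUWOB+ArialMT    CID TrueType    Identity-H    yes yes yes    5  0"
MAC = "DFUWOB+ArialMT    TrueType    WinAnsi    yes yes yes    5  0"

for label, text in (("original", ORIGINAL), ("ubuntu", UBUNTU), ("mac", MAC)):
    print(label, encodings(text))
```

For ArialMT this shows MacRoman in the original, Identity-H (as a CID TrueType font) after processing on Ubuntu, and WinAnsi after processing on Mac. Whether that difference is the cause of the broken text is not something this comparison can establish.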
Comment 1 GitLab Migration User 2018-08-21 11:19:10 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/617.

