Bug 26077

Summary: Mis-placed ActualText strings
Product: poppler Reporter: Khaled Hosny <dr.khaled.hosny>
Component: general Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium CC: ajohnson, dr.khaled.hosny
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
Attachments: PDF file used
Version of the file that extract correctly
Broken PDF file
XeLaTex document
XeTeX output
XeLaTeX output (as binary)

Description Khaled Hosny 2010-01-16 22:19:46 UTC
Created attachment 32670 [details]
PDF file used

The ActualText strings seem to be placed incorrectly. In the attached PDF file, the text extracted with pdftotext should be "This ⁂ is an asterism.", but what I get is:

This

is an asterism.

⁂
Comment 1 Khaled Hosny 2010-01-16 22:21:40 UTC
With Evince, the copied text is "This is an asterism."
Comment 2 Khaled Hosny 2010-01-16 23:07:48 UTC
Manually removing "q 1 0 0 1 110.42 -62.76 cm" and "Q" from around the "EMC" fixes this issue (ignoring the warning about a broken PDF file).
Comment 3 Khaled Hosny 2010-01-16 23:08:43 UTC
Created attachment 32671 [details]
Version of the file that extract correctly
Comment 4 Khaled Hosny 2010-01-16 23:12:56 UTC
Created attachment 32672 [details]
Broken PDF file
Comment 5 Albert Astals Cid 2010-01-19 14:42:56 UTC
Adrian, can you have a look?
Comment 6 Adrian Johnson 2010-01-23 05:58:23 UTC
It works fine using "pdftotext -raw" but without "-raw" it fails. The asterism is drawn using 3 asterisks. Adding the following debug statements:

--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -4518,8 +4518,10 @@ void ActualText::addChar(GfxState *state, double x, double y,
                         double dx, double dy,
                         CharCode c, int nBytes, Unicode *u, int uLen) {
   if (actualTextBMCLevel == 0) {
+      printf("addChar %f %f %f %f '%c'\n", x, y, dx, dy, c);
     text->addChar(state, x, y, dx, dy, c, nBytes, u, uLen);
   } else {
+      printf("actualText %f %f %f %f '%c'\n", x, y, dx, dy, c);
     // Inside ActualText span.


I get the output:

addChar 76.710000 -62.760000 7.195279 0.000000 'T'
addChar 83.905279 -62.760000 5.535443 0.000000 'h'
addChar 89.440721 -62.760000 2.767721 0.000000 'i'
addChar 92.208443 -62.760000 3.929407 0.000000 's'
actualText 102.450000 -61.920000 4.981500 0.000000 '*'
actualText 99.460000 -66.400000 4.981500 0.000000 '*'
actualText 105.437800 -66.400000 4.981500 0.000000 '*'
addChar 113.740000 -62.760000 2.767721 0.000000 'i'
addChar 116.507721 -62.760000 3.929407 0.000000 's'
addChar 123.754808 -62.760000 4.981500 0.000000 'a'
addChar 128.736308 -62.760000 5.535443 0.000000 'n'
addChar 137.589429 -62.760000 4.981500 0.000000 'a'
addChar 142.570929 -62.760000 3.929407 0.000000 's'
addChar 146.500337 -62.760000 3.874611 0.000000 't'
addChar 150.374947 -62.760000 4.427557 0.000000 'e'
addChar 154.802505 -62.760000 3.902507 0.000000 'r'
addChar 158.705012 -62.760000 2.767721 0.000000 'i'
addChar 161.472733 -62.760000 3.929407 0.000000 's'
addChar 165.402140 -62.760000 8.302168 0.000000 'm'
addChar 173.704308 -62.760000 2.767721 0.000000 '.'
addChar 231.130000 -630.640000 4.981500 0.000000 '1'

It looks like the y position of the asterisks makes TextOutputDev think the asterisks are not on the same line.
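
To make that suspicion concrete, here is a minimal sketch of a baseline-based line grouping. It is NOT poppler's actual layout code: the Glyph struct, countLines, sampleGlyphs and both tolerance values are hypothetical and chosen only for illustration. The y values are copied from the debug output above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One glyph as reported by the debug output: baseline y and the character.
struct Glyph {
    double y;
    char c;
};

// Hypothetical, simplified line grouping (not poppler's algorithm):
// a glyph joins the first line whose baseline is within `tol` of its own y,
// otherwise it starts a new line. Returns the number of lines found.
std::size_t countLines(const std::vector<Glyph> &glyphs, double tol) {
    std::vector<double> baselines;
    for (const Glyph &g : glyphs) {
        bool placed = false;
        for (double b : baselines) {
            if (std::fabs(g.y - b) <= tol) {
                placed = true;
                break;
            }
        }
        if (!placed) {
            baselines.push_back(g.y);
        }
    }
    return baselines.size();
}

// y values taken from the log: "This", the three asterisks, then "is".
std::vector<Glyph> sampleGlyphs() {
    return {
        {-62.76, 'T'}, {-62.76, 'h'}, {-62.76, 'i'}, {-62.76, 's'},
        {-61.92, '*'}, {-66.40, '*'}, {-66.40, '*'},
        {-62.76, 'i'}, {-62.76, 's'},
    };
}
```

With an assumed tight tolerance of 0.5, countLines(sampleGlyphs(), 0.5) splits the glyphs into three lines (the running text, the top asterisk, and the two lower asterisks). With an assumed loose tolerance of 4.0, everything lands on one line. Whatever threshold TextOutputDev really uses, the asterisks sit 0.84 above and 3.64 below the baseline of the surrounding text.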
Comment 7 Albert Astals Cid 2010-01-23 13:24:46 UTC
Any idea how to fix it?
Comment 8 Adrian Johnson 2010-01-23 15:47:32 UTC
(In reply to comment #7)
> Any idea how to fix it?
> 

No.
Comment 9 Steve White 2013-07-09 11:37:10 UTC
Created attachment 82224 [details]
XeLaTex document
Comment 10 Steve White 2013-07-09 11:38:06 UTC
Created attachment 82225 [details]
XeTeX output
Comment 11 Steve White 2013-07-09 11:43:53 UTC
Hi guys,

I ran across a similar bug, here using libpoppler 0.20.4.
Perhaps examples using plain Latin would be helpful.

The attachments are made for and with XeLaTeX,
using the 'accsupp' package.

Adobe Reader 9 renders the text properly.  When all the text
is selected and copied, it pastes as expected into a text editor.
(I see a glitch in the selection highlighting, though.  I don't know
what causes that.)

Evince 3.6.0 mangles it.  None of the text with ActualText entries
is displayed in place; instead it is offset far below where it should
be, as invisible "unknown character" symbols.  To see these,
use Select All.  Copying is similarly mangled.


It would be nice to see this fixed!
Comment 12 Steve White 2013-07-09 11:48:14 UTC
Created attachment 82228 [details]
XeLaTeX output (as binary)
Comment 13 GitLab Migration User 2018-08-21 10:54:54 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/426.
