Created attachment 32670 [details] PDF file used

The ActualText string seems to be placed incorrectly. In the attached PDF file, the text extracted with pdftotext should be "This ⁂ is an asterism." but what I get is:

This is an asterism. ⁂
With Evince, the copied text is "This is an asterism."
Manually removing "q 1 0 0 1 110.42 -62.76 cm" and "Q" from around the "EMC" fixes this issue (ignoring the warning about a broken PDF file).
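For context: in a PDF content stream, `q` saves the graphics state, `Q` restores it, and `cm` concatenates a matrix onto the current transformation matrix. Operands of the form `1 0 0 1 tx ty` describe a pure translation, so the removed pair temporarily shifts the origin by (110.42, -62.76). A minimal C++ sketch of that mapping follows; `Matrix`, `Point`, `apply`, and `kReportedCm` are illustrative names, not part of poppler.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// A PDF transformation matrix [a b c d e f], as written before the `cm` operator.
using Matrix = std::array<double, 6>;

struct Point {
    double x;
    double y;
};

// Map a point through the matrix: x' = a*x + c*y + e, y' = b*x + d*y + f.
Point apply(const Matrix &m, Point p) {
    return { m[0] * p.x + m[2] * p.y + m[4],
             m[1] * p.x + m[3] * p.y + m[5] };
}

// The operands from the report: "1 0 0 1 110.42 -62.76 cm".
const Matrix kReportedCm = { 1.0, 0.0, 0.0, 1.0, 110.42, -62.76 };
```

Calling `apply(kReportedCm, {0.0, 0.0})` yields the point (110.42, -62.76): anything drawn between `q ... cm` and `Q` is positioned relative to that shifted origin.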
Created attachment 32671 [details] Version of the file that extracts correctly
Created attachment 32672 [details] Broken PDF file
Adrian can you have a look?
It works fine using "pdftotext -raw" but without "-raw" it fails. The asterism is drawn using 3 asterisks.

Adding the following debug statements:

--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -4518,8 +4518,10 @@ void ActualText::addChar(GfxState *state, double x, double y,
                           double dx, double dy,
                           CharCode c, int nBytes, Unicode *u, int uLen) {
   if (actualTextBMCLevel == 0) {
+    printf("addChar %f %f %f %f '%c'\n", x, y, dx, dy, c);
     text->addChar(state, x, y, dx, dy, c, nBytes, u, uLen);
   } else {
+    printf("actualText %f %f %f %f '%c'\n", x, y, dx, dy, c);
     // Inside ActualText span.

I get the output:

addChar 76.710000 -62.760000 7.195279 0.000000 'T'
addChar 83.905279 -62.760000 5.535443 0.000000 'h'
addChar 89.440721 -62.760000 2.767721 0.000000 'i'
addChar 92.208443 -62.760000 3.929407 0.000000 's'
actualText 102.450000 -61.920000 4.981500 0.000000 '*'
actualText 99.460000 -66.400000 4.981500 0.000000 '*'
actualText 105.437800 -66.400000 4.981500 0.000000 '*'
addChar 113.740000 -62.760000 2.767721 0.000000 'i'
addChar 116.507721 -62.760000 3.929407 0.000000 's'
addChar 123.754808 -62.760000 4.981500 0.000000 'a'
addChar 128.736308 -62.760000 5.535443 0.000000 'n'
addChar 137.589429 -62.760000 4.981500 0.000000 'a'
addChar 142.570929 -62.760000 3.929407 0.000000 's'
addChar 146.500337 -62.760000 3.874611 0.000000 't'
addChar 150.374947 -62.760000 4.427557 0.000000 'e'
addChar 154.802505 -62.760000 3.902507 0.000000 'r'
addChar 158.705012 -62.760000 2.767721 0.000000 'i'
addChar 161.472733 -62.760000 3.929407 0.000000 's'
addChar 165.402140 -62.760000 8.302168 0.000000 'm'
addChar 173.704308 -62.760000 2.767721 0.000000 '.'
addChar 231.130000 -630.640000 4.981500 0.000000 '1'

It looks like the y position of the asterisks makes TextOutputDev think the asterisks are not on the same line as the surrounding text.
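To make the y-offset observation concrete, here is a simplified C++ sketch that applies a plain tolerance test to the logged coordinates. This is not poppler's actual line-building logic: `sameLine`, `minY`, `countOnLine`, and the tolerance values are assumptions chosen only to illustrate how glyphs whose y differs from the baseline (-62.76) can end up off the text line.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// One logged glyph origin (x, y), taken from the debug output in this comment.
struct Glyph {
    double x;
    double y;
};

// Hypothetical rule: two glyphs share a line when their y values differ
// by less than `tolerance`. Not poppler's actual line-building logic.
bool sameLine(double y1, double y2, double tolerance) {
    return std::fabs(y1 - y2) < tolerance;
}

// Lowest y among the glyphs of a span (assumes a non-empty span).
double minY(const std::vector<Glyph> &span) {
    double y = span.front().y;
    for (const Glyph &g : span) {
        y = std::min(y, g.y);
    }
    return y;
}

// Count the glyphs in `span` that the rule places on the line at `lineY`.
int countOnLine(const std::vector<Glyph> &span, double lineY, double tolerance) {
    int n = 0;
    for (const Glyph &g : span) {
        if (sameLine(g.y, lineY, tolerance)) {
            ++n;
        }
    }
    return n;
}

// Baseline of the surrounding text and the three asterisks from the log.
const double kBaselineY = -62.76;
const std::vector<Glyph> kAsterisks = {
    { 102.45, -61.92 }, { 99.46, -66.40 }, { 105.4378, -66.40 }
};
```

Under this rule, `countOnLine(kAsterisks, kBaselineY, 0.5)` is 0 and `countOnLine(kAsterisks, kBaselineY, 1.0)` is 1, since the asterisks sit 0.84 and 3.64 units away from the baseline. The lowest y of the span, `minY(kAsterisks)`, is -66.40, so a position derived from the span's extent would also miss the baseline under either tolerance.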
Any idea how to fix it?
(In reply to comment #7)
> Any idea how to fix it?

No.
Created attachment 82224 [details] XeLaTex document
Created attachment 82225 [details] XeTeX output
Hi guys,

I ran across a similar bug, here using libpoppler 0.20.4. Perhaps examples using plain Latin text would be helpful. The attachments were made for and with XeLaTeX, using the 'accsupp' package.

Adobe Reader 9 renders the text properly. When all the text is selected and copied, it pastes as expected into a text editor. (I do see a glitch in the selection highlighting, though; I don't know what causes that.)

Evince 3.6.0 mangles it. None of the text with ActualText entries is displayed where it should be; instead it appears far below its intended position, as invisible "unknown character" symbols. To see these, use Select All. Copying is similarly mangled.

It would be nice to see this fixed!
Created attachment 82228 [details] XeLaTeX output (as binary)
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/426.