Created attachment 22702
This patch fixes just 1 line, using coding that appears elsewhere within the Poppler code-base; e.g. in poppler/CMap.cc and poppler/Outline.cc.

Trying to extract text from PDFs constructed using pdfTeX and containing /ActualText or /Alt tags does not give the desired results.

1. /Alt tagging is not supported at all.
2. /ActualText tagging is recognised but no content is extracted.

Here are some examples from http://www.maths.mq.edu.au/~ross/poppler/Big5/

  Big5-actual.pdf (170kb) --- has /ActualText tagging
  Big5-actual.txt (97 bytes)
  Big5-alt.pdf (169kb) --- has /Alt tagging
  Big5-alt.txt (434 bytes)
  Big5-notags.pdf (157kb) --- no special tagging
  Big5-notags.txt (432 bytes)

The corresponding .txt files were obtained using pdftotext -raw with pdftotext/Poppler version as follows:

  [GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help
  pdftotext version 0.10.3
  Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
  Copyright 1996-2004 Glyph & Cog, LLC

It is clear just from the file-size of Big5-actual.txt that Poppler isn't extracting the /ActualText in this case.

Also, if you look at the contents of Big5-notags.txt you'll see the same kind of "multiple-striking" to get the bold effect. With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking is meant to be mapped to a single Unicode character. But Poppler has no support for /Alt tagging, which is why Big5-alt.txt is practically the same size as Big5-notags.txt.

With these three PDFs, Adobe Reader cannot extract the Chinese characters from Big5-notags.pdf, whereas it can do so from Big5-actual.pdf and Big5-alt.pdf due to the extra tagging. Apple's Preview and Poppler, on the other hand, can identify the characters (presumably from information in the fonts or their encoding arrays --- a CMap is not applicable). But both extract three copies when the multiple striking occurs, so they are not dealing with the /Alt or /ActualText tags.
Furthermore, Poppler gives nothing for the ideographs marked with /ActualText tagging.

Speculation: Poppler may not be extracting the information in the tagging strings since they contain octal character codes? For example, the tagging looks like this:

  /Span<</ActualText(\376\377\307\164)>> BDC
  ... Chinese/Korean ideograph ...
  EMC

whereas the coding in TextOutputDev.cc that handles this is:

  actualText = obj.getString();

and

  if (!actualText->hasUnicodeMarker()) {
    if (actualText->getLength() > 0) {
      //non-unicode string -- assume pdfDocEncoding and
      //try to convert to UTF16BE
      uniString = pdfDocEncodingToUTF16(actualText, &length);
    } else {
      length = 0;
    }
  } else {
    uniString = actualText->getCString();
    length = actualText->getLength();
  }

Shouldn't there be some use of GooString within this coding block, to properly handle those octal character codes?

There are some more similar examples, involving Korean fonts, at: http://www.maths.mq.edu.au/~ross/poppler/KS/
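For reference, here is a minimal sketch of what the octal codes in the /ActualText string above amount to as bytes. It assumes that the escapes are resolved to raw bytes before TextOutputDev sees the string, and that hasUnicodeMarker() tests for a leading FE FF pair. The names octalEscape, startsWithUtf16Bom, codeUnit and kActualText are hypothetical, made up for this illustration; they are not Poppler functions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: value of one PDF octal escape \ddd, given its
// three digit characters. Only the low 8 bits are kept.
constexpr unsigned char octalEscape(char d1, char d2, char d3) {
    return static_cast<unsigned char>(
        (((d1 - '0') << 6) | ((d2 - '0') << 3) | (d3 - '0')) & 0xff);
}

// Hypothetical stand-in for the check assumed to be behind
// hasUnicodeMarker(): the string starts with the bytes FE FF.
constexpr bool startsWithUtf16Bom(const unsigned char *s, std::size_t n) {
    return n >= 2 && s[0] == 0xfe && s[1] == 0xff;
}

// Combine two bytes of a big-endian UTF-16 pair into one code unit.
constexpr std::uint32_t codeUnit(unsigned char hi, unsigned char lo) {
    return (static_cast<std::uint32_t>(hi) << 8) | lo;
}

// The string from the example: (\376\377\307\164)
constexpr unsigned char kActualText[] = {
    octalEscape('3', '7', '6'),  // 0xFE
    octalEscape('3', '7', '7'),  // 0xFF
    octalEscape('3', '0', '7'),  // 0xC7
    octalEscape('1', '6', '4'),  // 0x74
};
```

Under those assumptions the string starts with the FE FF marker, so the else branch of the quoted code would be taken, and the remaining pair C7 74 is the single code unit U+C774.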
Comment on attachment 22702

This bug is due to improper extraction of the text in the /ActualText entry. Here is a better description of the effects observed.

I'm now creating PDFs with /ActualText strings for CJK ideographs. These strings are given in big-endian UTF-16 format. Using pdftotext to extract the text, what I find is that:

a) some, but not all, UTF-16 byte-pairs produce an extractable character.

b) Whenever the *first* byte of the pair is in the upper range 128--255 then the whole character is omitted.

For example, with the PDF string:

  (˛ˇt»»tt»)

the text extracted using Adobe Reader is 瓈존瓈 but Poppler produces 珈珈, which exhibits two errors.

Firstly, ... the portion '»t' has been extracted as '', the empty string, between the Chinese ideographs. In alternative representations, this is:

  (<FE><FF>t<C8><C8>tt<C8>)

producing

  <E7><8F><88><E7><8F><88>

where t<C8>, representing 't»', extracts to <E7><8F><88>, which is 珈.

Secondly, ...

c) There is an error in the translation of UTF-16 characters into UTF-8. For example, the above t<C8> should actually convert in UTF-8 to <E7><93><88>, which is 瓈, as done by Adobe and other software. The <E7><8F><88> is what correctly comes from s<C8>; the top-order byte is being mistranslated by -1.

Further comment.

d) Octal codes can be used, contrary to a question that I raised in bug report 20013. There my testing was with codes which produced 1st bytes within the upper range, so the difficulties were the same as in b) above.

e) The example PDF http://www.unicode.org/udhr/d/udhr_san.pdf used to test the /ActualText support involved only characters in the range Ux0A.. so that the problem (b) with higher range characters did not occur; and nor does (c) for this range.

Here's a patch that fixes the problem.
The new line of coding is based upon similar methods used in poppler/Outline.cc.

*** TextOutputDev-prev.cc Wed Feb 18 04:59:28 2009
--- TextOutputDev.cc Wed Feb 18 05:42:22 2009
*************** void TextOutputDev::endMarkedContent(Gfx
*** 4657,4663 ****
        length = length/2 - 1;
        uni = new Unicode[length];
        for (i = 0 ; i < length; i++)
!         uni[i] = (uniString[2 + i*2]<<8) + uniString[2 + i*2+1];
  
        text->addChar(state,
                      actualText_x, actualText_y,
--- 4657,4663 ----
        length = length/2 - 1;
        uni = new Unicode[length];
        for (i = 0 ; i < length; i++)
!         uni[i] = ((uniString[2 + i*2] & 0xff)<<8)|(uniString[3 + i*2] & 0xff);
  
        text->addChar(state,
                      actualText_x, actualText_y,
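To show why the "& 0xff" masks matter, here is a small sketch comparing the old and new expressions on the byte pairs from the example string. It assumes that uniString holds plain char values that are signed on the platform in question (modelled here with signed char), and that Poppler's Unicode type is a 32-bit unsigned integer. The names asByte, combineSigned, combineMasked and utf8Byte are hypothetical and exist only for this illustration. The old expression is written with *256 rather than <<8 so that the sketch does not shift a negative value.

```cpp
#include <cstdint>

// Assumed stand-in for Poppler's Unicode typedef.
using Unicode = std::uint32_t;

// Reinterpret a byte value 0..255 as a signed char, as a platform with
// signed plain char would when reading it from a char buffer.
constexpr signed char asByte(int v) {
    return static_cast<signed char>(v);
}

// Old expression: both bytes are sign-extended when promoted to int.
constexpr Unicode combineSigned(signed char hi, signed char lo) {
    return static_cast<Unicode>(int(hi) * 256 + int(lo));
}

// Patched expression: each byte is masked to 0..255 before combining.
constexpr Unicode combineMasked(signed char hi, signed char lo) {
    return static_cast<Unicode>(((hi & 0xff) << 8) | (lo & 0xff));
}

// One byte of the three-byte UTF-8 form, valid for U+0800..U+FFFF only.
constexpr unsigned utf8Byte(Unicode cp, int index) {
    return index == 0 ? (0xe0 | (cp >> 12))
         : index == 1 ? (0x80 | ((cp >> 6) & 0x3f))
                      : (0x80 | (cp & 0x3f));
}

// Pair t<C8> (0x74 0xC8): the low byte is negative when sign-extended,
// so the old expression gives 0x7400 - 0x38 = 0x73C8 instead of 0x74C8.
constexpr Unicode kOldLow  = combineSigned(asByte(0x74), asByte(0xC8));
constexpr Unicode kNewLow  = combineMasked(asByte(0x74), asByte(0xC8));

// Pair <C8>t (0xC8 0x74): the high byte is negative, so the old result
// wraps to a value far outside the Unicode range; the new one is 0xC874.
constexpr Unicode kOldHigh = combineSigned(asByte(0xC8), asByte(0x74));
constexpr Unicode kNewHigh = combineMasked(asByte(0xC8), asByte(0x74));
```

With these assumptions, kOldLow is 0x73C8, whose UTF-8 form is E7 8F 88 (珈), while kNewLow is 0x74C8, whose UTF-8 form is E7 93 88 (瓈). That matches observation (c). kOldHigh comes out above 0x10FFFF, which is consistent with the character being omitted in (b), though how such a value is discarded downstream is not shown here.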
The supplied patch has been tested with pdftotext and evince, viz.

On Tue, Feb 24, 2009 at 09:37:46AM +0100, The Thanh Han wrote:
> Hi Ross,
>
> On Tue, Feb 24, 2009 at 05:53:35AM +1100, Ross Moore wrote:
>> evince, okular and kpdf all use Poppler, right?
>> Were they rebuilt after applying my patch?
>
> oops I forgot this :(. Will do it now. But will take some
> time, there are way many build dependencies.

this turned out to be quite tough. I tried several ways and ended up installing a very recent linux system (ubuntu 9.04 alpha), and rebuild poppler on that system with your patch.

The good news is that it works; I can copy from evince and paste to gvim, and get the correct characters.
Fixed, thanks for the patch!