Summary: | pdftotext doesn't support /Alt nor /ActualText with octal content | ||
---|---|---|---|
Product: | poppler | Reporter: | Ross Moore <ross> |
Component: | general | Assignee: | poppler-bugs <poppler-bugs> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | All | ||
URL: | http://www.maths.mq.edu.au/~ross/poppler/Big5/ | ||
Attachments: | This patch fixes just 1 line, using coding that appears elsewhere within the Poppler code base, e.g. in poppler/CMap.cc and poppler/Outline.cc. |
Description
Ross Moore
2009-02-08 21:55:05 UTC
Comment on attachment 22702: This patch fixes just 1 line, using coding that appears elsewhere within the Poppler code base, e.g. in poppler/CMap.cc and poppler/Outline.cc.

This bug is due to improper extraction of the text in the /ActualText entry. Here is a better description of the effects observed.

I'm now creating PDFs with /ActualText strings for CJK ideographs. These strings are given in big-endian UTF-16 format. Using pdftotext to extract the text, I find that:

a) Some, but not all, UTF-16 byte pairs produce an extractable character.
b) Whenever the *first* byte of the pair is in the upper range 128--255, the whole character is omitted.

For example, with the PDF string

    (˛ˇt»»tt»)

the text extracted using Adobe Reader is 瓈존瓈, but Poppler produces 珈珈, which exhibits two errors.

Firstly, the portion '»t' has been extracted as '', the empty string, between the Chinese ideographs. In an alternative representation, the string is

    (<FE><FF>t<C8><C8>tt<C8>)

producing <E7><8F><88><E7><8F><88>, where t<C8> (representing 't»') extracts to <E7><8F><88>, which is 珈.

Secondly:

c) There is an error in the translation of UTF-16 characters into UTF-8. For example, the above t<C8> should actually convert in UTF-8 to <E7><93><88>, which is 瓈, as done by Adobe and other software. The <E7><8F><88> is what correctly comes from s<C8>; the high-order byte is being mistranslated by -1.

Further comments:

d) Octal codes can be used, contrary to a question that I raised in bug report 20013. There my testing was with codes which produced first bytes within the upper range, so the difficulties were the same as in b) above.
e) The example PDF http://www.unicode.org/udhr/d/udhr_san.pdf used to test the /ActualText support involved only characters in the range Ux0A.., so problem (b) with higher-range characters did not occur, and neither does (c) for this range.

Here's a patch that fixes the problem.
The new line of coding is based upon similar methods used in poppler/Outline.cc.

    *** TextOutputDev-prev.cc Wed Feb 18 04:59:28 2009
    --- TextOutputDev.cc Wed Feb 18 05:42:22 2009
    *************** void TextOutputDev::endMarkedContent(Gfx
    *** 4657,4663 ****
      length = length/2 - 1;
      uni = new Unicode[length];
      for (i = 0 ; i < length; i++)
    !   uni[i] = (uniString[2 + i*2]<<8) + uniString[2 + i*2+1];
      text->addChar(state,
                    actualText_x, actualText_y,
    --- 4657,4663 ----
      length = length/2 - 1;
      uni = new Unicode[length];
      for (i = 0 ; i < length; i++)
    !   uni[i] = ((uniString[2 + i*2] & 0xff)<<8)|(uniString[3 + i*2] & 0xff);
      text->addChar(state,
                    actualText_x, actualText_y,

The supplied patch has been tested with pdftotext and evince, viz.:

On Tue, Feb 24, 2009 at 09:37:46AM +0100, The Thanh Han wrote:
> Hi Ross,
>
> On Tue, Feb 24, 2009 at 05:53:35AM +1100, Ross Moore wrote:
>> evince, okular and kpdf all use Poppler, right?
>> Were they rebuilt after applying my patch?
>
> oops I forgot this :(. Will do it now. But will take some
> time, there are way many build dependencies.

this turned out to be quite tough. I tried several ways and ended up installing a very recent linux system (ubuntu 9.04 alpha), and rebuild poppler on that system with your patch. The good news is that it works; I can copy from evince and paste to gvim, and get the correct characters.

Fixed, thanks for the patch!
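For readers wondering why the `& 0xff` masks in the patched line matter, here is a minimal C++ sketch of the arithmetic. It assumes (the report does not state this) that the /ActualText bytes are read through a `char` that is signed on the platform, so that a byte such as 0xC8 is seen as -56. The `Unicode` alias and the functions `combineUnmasked`, `combineMasked` and `utf8ThreeByte` are hypothetical stand-ins for illustration, not Poppler functions.

```cpp
#include <cstdint>
#include <string>

// Stand-in for Poppler's Unicode type (assumption: an unsigned 32-bit integer).
using Unicode = std::uint32_t;

// Models the unpatched expression under the signed-char assumption.
// Converting a negative signed char to an unsigned type wraps modulo 2^32,
// which reproduces the effect of sign extension with well-defined arithmetic.
Unicode combineUnmasked(signed char hi, signed char lo) {
    return (static_cast<Unicode>(hi) << 8) + static_cast<Unicode>(lo);
}

// Mirrors the patched expression: each byte is masked to 0..255 first,
// so a sign-extended byte cannot disturb the other half of the code unit.
Unicode combineMasked(signed char hi, signed char lo) {
    return (static_cast<Unicode>(hi & 0xff) << 8) |
           static_cast<Unicode>(lo & 0xff);
}

// Encodes a code point in U+0800..U+FFFF as its three UTF-8 bytes.
// Other ranges are deliberately not handled in this sketch.
std::string utf8ThreeByte(Unicode cp) {
    std::string out;
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}
```

Under that assumption, the pair t<C8> (0x74, -56) combines to 0x73C8 without the masks, which encodes as <E7><8F><88> (珈), and to 0x74C8 with them, which encodes as <E7><93><88> (瓈). The pair <C8>t combines without the masks to a value beyond U+10FFFF, which would be consistent with the character being dropped, and to 0xC874 (존) with them.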