Bug 52406

Summary: Wrong text extracted from attached example: 2012 extracted as 2512
Product: poppler Reporter: Alon Levy <alevy>
Component: general Assignee: poppler-bugs <poppler-bugs>
Status: RESOLVED INVALID QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
Attachments: wrong text extraction example: first page bottom left, copy year 2012 -> 2512

Description Alon Levy 2012-07-23 16:55:47 UTC
Created attachment 64556
wrong text extraction example: first page bottom left, copy year 2012 -> 2512

See the first page of the attached PDF. It is in Hebrew (which is surely related to the bug), but you don't need to understand Hebrew: the Hebrew text is actually fine, and the problem is with anything numeric.

The bottom left of the first page has this text (typed by hand from the correctly *rendered* document in Evince):

2012
בפברואר
15

However, copying the text with the cursor and pasting produces the following text:
15 בפברואר 2512

The text is otherwise the same; only the digits of the year differ (2012 is extracted as 2512).
Comment 1 Albert Astals Cid 2012-07-23 17:19:53 UTC
The text is wrong in the document itself. Open it with Adobe Reader and you will see that the extracted year is not 2012 there either.
Comment 2 Alon Levy 2012-07-23 18:41:51 UTC
Isn't it possible that Acrobat is wrong as well? I've confirmed what you said (I hadn't tested with Adobe until now). Anyway, thanks for looking into this.

Alon
Comment 3 Albert Astals Cid 2012-07-23 19:47:31 UTC
It could, but I doubt it. Creating files whose text can be extracted correctly is not a given, and some PDF creators don't take enough care to make it work.
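
To illustrate how a page can render correctly yet extract wrongly, here is a minimal Python sketch. It assumes one plausible cause: the font's glyph outlines and its declared glyph-to-Unicode mapping disagree for one digit. The thread does not diagnose the attached file, so the glyph codes and both tables below are hypothetical and are not taken from attachment 64556.

```python
# Hypothetical illustration only: the glyph codes and both tables are
# invented for this sketch and do not come from the attached PDF.

# What each glyph code looks like when drawn (the font's outlines).
GLYPH_SHAPES = {1: "0", 2: "1", 3: "2", 4: "5"}

# What each glyph code is declared to mean for text extraction.
# Code 1 draws a "0" but is wrongly declared as "5".
TO_UNICODE = {1: "5", 2: "1", 3: "2", 4: "5"}


def render(codes):
    """Return what a viewer would show: the drawn shape of each glyph."""
    return "".join(GLYPH_SHAPES[c] for c in codes)


def extract(codes):
    """Return what copy and paste would yield: the declared characters."""
    return "".join(TO_UNICODE[c] for c in codes)


year_codes = [3, 1, 2, 3]  # drawn as 2, 0, 1, 2

print(render(year_codes))   # what is seen on screen
print(extract(year_codes))  # what is copied to the clipboard
```

Under these assumptions the viewer draws 2012 while extraction yields 2512, and any tool that trusts the declared mapping would agree on the wrong value, which would be consistent with Adobe Reader and poppler giving the same result.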
