Bug 52406

Summary:	Wrong text extracted from attached example: 2012 extracted as 2512
Product:	poppler	Reporter:	Alon Levy <alevy>
Component:	general	Assignee:	poppler-bugs <poppler-bugs>
Status:	RESOLVED INVALID	QA Contact:
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	Other
OS:	All
Whiteboard:
Attachments:	wrong text extraction example: first page bottom left, copy year 2012 -> 2512

Description Alon Levy 2012-07-23 16:55:47 UTC

Created attachment 64556
wrong text extraction example: first page bottom left, copy year 2012 -> 2512

See the first page of the attached PDF. While it is in Hebrew (which is surely related to the bug), you don't need to understand Hebrew: the Hebrew text is actually fine, and the problem is only with the numbers.

The bottom left of the first page has this text (typed by hand while looking at the correctly *rendered* document in Evince):

2012
בפברואר
15

However, copying the text with the cursor and pasting produces the following text:
15 בפברואר 2512

The Hebrew text is the same, but the digits of the year differ: 2012 is rendered, 2512 is extracted.

Comment 1 Albert Astals Cid 2012-07-23 17:19:53 UTC

The text is wrong in the document itself. Open it with Adobe Reader and you will see that the extracted text is also not 2012 there.

Comment 2 Alon Levy 2012-07-23 18:41:51 UTC

Isn't it possible that Acrobat is wrong as well? I've confirmed what you said (I hadn't tested Adobe Reader until now). Anyway, thanks for looking into this.

Alon

Comment 3 Albert Astals Cid 2012-07-23 19:47:31 UTC

It could be, but I doubt it. Creating files whose text can be extracted correctly is not a given, and some PDF creators don't take enough care to make it work.
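
To illustrate how a file can render correctly and still extract wrongly, here is a minimal toy model in Python. It is only a sketch under an assumption that has not been verified against the attached file: that the font's glyph-to-Unicode table is faulty for the digit 0. The glyph IDs and both tables are hypothetical and are not taken from the PDF or from poppler.

```python
# Toy model: a viewer draws glyph shapes, while text extraction uses a
# separate glyph-to-Unicode table. The two can disagree.
# All glyph IDs and table contents below are hypothetical.

# What each glyph looks like when drawn on the page.
GLYPH_SHAPES = {1: "0", 2: "1", 3: "2", 4: "5"}

# What the document claims each glyph means as text.
# Assumed fault: glyph 1 (drawn as "0") is mapped to "5".
TO_UNICODE = {1: "5", 2: "1", 3: "2", 4: "5"}


def render(glyph_ids):
    """Return what a reader sees on the page."""
    return "".join(GLYPH_SHAPES[g] for g in glyph_ids)


def extract(glyph_ids):
    """Return what copy and paste would produce."""
    return "".join(TO_UNICODE[g] for g in glyph_ids)


year = [3, 1, 2, 3]  # drawn as 2012
day = [2, 4]         # drawn as 15

print(render(year), extract(year))
print(render(day), extract(day))
```

In this model the day comes out the same both ways because it contains no 0, which matches what the report describes. Any viewer that trusts the same table would return the same wrong year.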
