28362 – wrong spaces in text

Status:	RESOLVED MOVED

Alias:	None

Product:	poppler
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	poppler-bugs

Reported:	2010-06-02 14:29 UTC by Pablo Rodríguez
Modified:	2018-08-20 21:58 UTC
CC List:	0 users

Attachments
wrong text extraction with letterspace (12.21 KB, application/pdf) 2010-06-11 08:11 UTC, Pablo Rodríguez
pdf test case (11.08 KB, application/pdf) 2018-06-02 19:53 UTC, Germán Poo-Caamaño

Description Pablo Rodríguez 2010-06-02 14:29:37 UTC

Compiling the XeLaTeX document below, I get the attached PDF file.

\documentclass[12pt]{article}
\usepackage{fontspec}
\setmainfont{Theano Didot}
\usepackage{polyglossia}
\setmainlanguage[variant=ancient]{greek}
\setotherlanguage{english}
\begin{document}
χαλεπὰ \addfontfeature{LetterSpace=12}τὰ καλά

χαλεπὰ τὰ καλά

χαλεπὰ \addfontfeature{LetterSpace=0}τὰ καλά

\selectlanguage{english}
Beauty \addfontfeature{LetterSpace=12}is difficult

Beauty is difficult

Beauty \addfontfeature{LetterSpace=0}is difficult

\end{document}

The issue with this PDF file is that text is extracted in the following way:

χαλεπὰ τ ὰ κ α λ ά
χαλεπὰ τὰ καλά
χ α λ ε π ὰ τὰ καλά
Beauty i s d i ffi c u l t
Beauty is difficult
B e a u t y is difficult

There should be no spaces between letters, only between words.
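
To illustrate why letterspaced text may come out with a space after every letter, here is a minimal sketch. It is not Poppler's code. It assumes a hypothetical extractor that inserts a space whenever the gap between two glyphs exceeds a fixed fraction of the font size. The Glyph class, the layout() helper, the glyph widths, the 0.1 ratio and the 1.44 pt tracking value are all made up for the example. The second function shows one possible alternative (comparing each gap with the median gap of the run), offered only as an idea.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in points
    width: float  # glyph width without extra letterspacing, in points


def layout(text, font_size=12.0, tracking=0.0):
    """Hypothetical layout: every letter is 0.5 em wide, a space is 0.3 em,
    and `tracking` points are added after every character."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += 0.3 * font_size + tracking   # spaces move the pen, emit nothing
        else:
            glyphs.append(Glyph(ch, x, 0.5 * font_size))
            x += 0.5 * font_size + tracking
    return glyphs


def gaps_of(glyphs):
    """Horizontal gap between each glyph and the next one."""
    return [cur.x - (prev.x + prev.width) for prev, cur in zip(glyphs, glyphs[1:])]


def join(glyphs, threshold):
    """Rebuild the text, inserting a space where a gap exceeds `threshold`."""
    if not glyphs:
        return ""
    out = [glyphs[0].char]
    for gap, cur in zip(gaps_of(glyphs), glyphs[1:]):
        if gap > threshold:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def join_fixed(glyphs, font_size, ratio=0.1):
    """Assumed heuristic: fixed threshold proportional to the font size."""
    return join(glyphs, ratio * font_size)


def join_adaptive(glyphs, font_size, ratio=0.1, factor=2.0):
    """Possible alternative: also require the gap to be well above the
    median gap of the run, so uniform letterspacing is not read as spaces."""
    gaps = gaps_of(glyphs)
    typical = median(gaps) if gaps else 0.0
    return join(glyphs, max(ratio * font_size, factor * typical))


normal = layout("is difficult")
spaced = layout("is difficult", tracking=1.44)
print(join_fixed(normal, 12.0))
print(join_fixed(spaced, 12.0))
print(join_adaptive(spaced, 12.0))
```

With these assumed numbers, the fixed threshold keeps the normal run intact but splits the letterspaced run into single letters, while the median-based variant keeps "is difficult" as two words.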

Could you fix this?

Thanks,


Pablo

Comment 1 Albert Astals Cid 2010-06-10 14:33:29 UTC

The PDF is not attached.

Comment 2 Pablo Rodríguez 2010-06-11 08:11:02 UTC

Created attachment 36216
wrong text extraction with letterspace

Sorry, I totally forgot that.

The file is now attached.

Thanks,


Pablo

Comment 3 Germán Poo-Caamaño 2018-06-02 19:53:50 UTC

Created attachment 139972
pdf test case

A simpler example (from https://gitlab.gnome.org/GNOME/evince/issues/111) is:

\documentclass{article}
\begin{document}
TODO
$TODO$
\end{document}

which pdftotext extracts as:


-------
TODO
T ODO

1


-------

A consequence is that searching for 'TODO' only finds the first line; there is no match for the second one.

Moreover, with Poppler-glib (cairo backend), the second TODO is rendered with a slight space between T and O. Acroread and Foxit render the text as expected and find both lines when searching for 'TODO'.
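
The search consequence can be checked with a small script. This is only a sketch: extract_text() assumes the pdftotext command from poppler-utils is on the PATH, and matching_lines() is a hypothetical helper written for this example, not part of Poppler. The sample string is the pdftotext output quoted in this comment.

```python
import subprocess


def extract_text(pdf_path):
    """Run `pdftotext <file> -` and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def matching_lines(text, needle, ignore_spaces=False):
    """Return the 1-based numbers of the lines that contain `needle`.

    With ignore_spaces=True, spaces inside each line are dropped first,
    which hides spurious spaces such as the one in 'T ODO'."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        haystack = line.replace(" ", "") if ignore_spaces else line
        if needle in haystack:
            hits.append(number)
    return hits


# pdftotext output quoted in this comment for the TODO / $TODO$ test case.
reported = "TODO\nT ODO\n\n1\n"

print(matching_lines(reported, "TODO"))                      # plain search
print(matching_lines(reported, "TODO", ignore_spaces=True))  # spaces ignored
```

On the quoted output the plain search reports only line 1, and the space-insensitive search reports lines 1 and 2. To run it against a real file, pass the result of extract_text("test.pdf") instead of the sample string (the file name is a placeholder).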

Comment 4 GitLab Migration User 2018-08-20 21:58:06 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed to further activity.

You can subscribe to and participate in the new bug on our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/137.
